When we create our Credit Risk assessment or Fraud prevention machine learning models at Zopa, we use a variety of algorithms. These include classification and regression trees, more complex algorithms like neural networks, and simpler linear models like Logistic Regression. Complex algorithms tend to be more accurate than simpler linear models. However, linear models can become very powerful when built with the right data.
In today’s blog I will talk about Logistic Regression, and the ways we turn it into a powerful analytical machine learning model at Zopa.
Why use logistic regression?
Many people choose to work with Logistic Regression models because they are very easy to interpret. It is extremely straightforward to understand and explain what we have learned from them, and they not only provide accurate predictions, but also insight into the problem at hand.
More excitingly, by implementing a trick or two, we can get Logistic Regression algorithms to make incredibly accurate predictions!
What are these tricks? Keep on reading and you will find out…
What is Logistic Regression?
Before we can understand what Logistic Regression is, we need to understand what a simple Linear Regression is. So let’s take one step back.
What is linear regression?
Linear Regression lives up to its name. It is a straightforward approach that tries to predict an outcome based on one or more characteristics, by assuming a linear relationship between the outcome and those characteristics.
To understand it better let’s look at Figure 1. In this example, the outcome is the number of sales of a product, and the characteristic is the number of TV advertisements shown in a day. We can see that the more often the advertisement is shown on TV, the higher the number of product sales. In addition, we can also see that the relationship between the two lies almost perfectly on a line. Therefore, we can say that there is a linear relationship between the number of sales and the number of advertisements shown per day.
Figure 1. Example illustrating the linear dependency of the outcome (number of sales of a product per day) on the number of advertisements shown on TV per day.
Linear Regression is very useful when trying to predict a continuous variable, such as the number of product sales, the price at which we can sell a house, or someone’s salary. These variables are described as “continuous” because they can take a large (sometimes infinite) number of numerical values. We can sell ones, tens or hundreds of product units after an advertisement is shown on TV, the house price can be anything from £200,000 to a million, and so on.
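To make this concrete, here is a minimal sketch of fitting a linear regression with scikit-learn; the advertisement and sales numbers below are made up purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: TV advertisements shown per day vs product sales.
ads = np.array([[1], [2], [3], [4], [5], [6]])
sales = np.array([12, 19, 31, 42, 48, 61])

# Fit a straight line: sales ~ intercept + slope * ads.
model = LinearRegression().fit(ads, sales)

# Predict sales for a day with 7 advertisements.
predicted = model.predict([[7]])[0]
print(model.coef_[0], predicted)
```

The fitted slope tells us how many extra sales each additional advertisement is associated with, which is exactly the linear relationship the model assumes.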
At Zopa, we often want to predict variables that, instead of taking a range of continuous values, take only two discrete values. For example: will this person be able to repay a loan? Is this application fraudulent? These questions take a yes/no answer; each has only two possible outcomes. We treat the answer mathematically as 1 for yes and 0 for no. This is where Logistic Regression comes in handy.
To understand the situation better, let’s look at Figure 2. In this scenario, we want to predict whether an application is fraudulent based on the number of previous applications the customer made to Zopa. The yellow dots indicate fraudulent loan applications and the green dots show non-fraudulent applications. On the x-axis, we plot the number of previous applications the customer has made.
Figure 2. Example illustrating Logistic Regression. We want to predict if a loan application is fraudulent based on the number of previous applications the customer has made to Zopa. The yellow and green dots indicate fraudulent and non-fraudulent applications. The dotted black line indicates the linear relationship assumed by the Logistic Regression model. The blue line indicates the outcome of the Logistic Function, or in other words, the probability of an application being fraudulent. If the probability is higher than 0.5 then the application is fraudulent, whereas if the probability is smaller than 0.5 then it is a genuine application.
Non-linear transformation: the logistic function
Logistic Regression also assumes a linear relationship between the characteristics, in this case the number of previous applications, and the outcome we want to predict, in this case fraud. This linear relationship is indicated with the dotted line in Figure 2.
Logistic Regression adds one additional step: it applies a non-linear transformation that converts the linear relationship, i.e., the dotted line, into a probability that can take values between 0 and 1. The non-linear transformation is called the logistic function and is shown in blue in Figure 2.
The result is now very easy to interpret: if the result of the logistic function is greater than 0.5, then the answer to our question is 1 or ‘Yes, it is a fraudulent application’. Otherwise, the answer is ‘No, it is a genuine application’.
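As a rough sketch (with entirely made-up numbers for previous applications and fraud labels), this is how the logistic function and the 0.5 threshold play out in scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: number of previous applications per customer,
# and whether the application was fraudulent (1) or genuine (0).
prev_apps = np.array([[0], [1], [1], [2], [3], [4], [5], [6], [7], [8]])
is_fraud = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(prev_apps, is_fraud)

# The model turns the fitted line into a probability between 0 and 1
# via the logistic function sigma(z) = 1 / (1 + exp(-z)).
prob = clf.predict_proba([[6]])[0, 1]

# predict() applies the 0.5 threshold for us.
print(prob > 0.5, clf.predict([[6]])[0])
```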
Making Logistic Regression even more powerful
Very often, there isn’t any obvious linear relationship between the variables and what we are trying to predict. In these cases, a Logistic Regression will make poor predictions. What can we do about this?
Trick one: finding hidden linear relations
In the scenario in Figure 3, we have a product to sell, a mortgage, which is typically bought by people between 30 and 50 years of age (red dots). Younger or older people (blue and green dots) are not interested in mortgages. If we look at the Logistic Regression result shown in blue, we can see that it does a pretty bad job at predicting whether we will be able to sell the mortgage based on the age of the customer.
Figure 3. Example illustrating Logistic Regression outcome on a non-linear association between the variable age and the outcome ‘sale of a mortgage’. We can see that mortgages are very popular among people of 30-50 years of age, but will not sell to older or younger people. The Logistic Regression does a poor job at making predictions of sale, because it can’t identify non-linear associations.
The Logistic Regression can tell us that customers over a certain age are likely to buy the mortgage, but is unable to identify customers within an age bracket who are likely to buy. This is because Logistic Regression assumes a linear relationship between the customer age and the probability of sale: that the older the customer, the more likely they are to buy. So how can we modify the variable age so that it becomes linear with the probability of mortgage sale?
Discretisation of the variable
There is a procedure called discretisation of the variable, where we divide the variable into discrete groups. Following the example in Figure 3, we could divide the variable into three groups. People younger than 30 belong to group 1, people between 30 and 50 belong to group 3, and people older than 50 belong to group 2. So now when we plot the probability of sale vs the ‘discretised’ age variable, we observe something like in Figure 4.
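A minimal sketch of this discretisation in Python (the ages are invented; the cut points of 30 and 50 and the group numbering follow the example above):

```python
import numpy as np

# Hypothetical customer ages.
ages = np.array([22, 35, 47, 55, 63, 41, 28])

# np.digitize returns 0 for age < 30, 1 for 30 <= age < 50, 2 for age >= 50.
bins = np.digitize(ages, [30, 50])

# Map the raw bins onto the group numbers used in the text:
# under 30 -> group 1, 30 to 50 -> group 3, over 50 -> group 2.
group = np.array([1, 3, 2])[bins]
print(group.tolist())  # [1, 3, 3, 2, 2, 3, 1]
```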
Figure 4. Example illustrating Logistic Regression outcome on a ‘discretised’, previously non-linear association, between the variable age and the outcome ‘sale of a mortgage’. Now the Logistic Regression becomes better at predicting the sale of a mortgage. It understands that if the customer belongs to group 3 then they are likely to buy, but not if they belong to groups 1 or 2.
Now the Logistic Regression can do a much better job at predicting whether a customer will buy the mortgage. If the customer belongs to groups 1 or 2, meaning the customer is younger than 30 (group 1) or older than 50 (group 2), then the probability of sale is small. However, if the customer is in group 3 (aged between 30 and 50) then they are very likely to buy the mortgage. In other words, the higher the group number, the more likely the customer is to buy the mortgage. Genius!
There is one more thing though: how do we assign the numbers to the groups? Real life situations are not as obvious as the example in Figure 3. How can we make this process automatic and independent of human observation?
Enter the classification trees
At Zopa we find the discrete groups using a classification tree. We build a single classification tree between the variable and the target (that’s age and sold mortgage in our example above). The tree identifies the ages that are most effective at separating the target (mortgages) into sold (1) and not sold (0). In other words, the tree identifies groups that behave similarly with respect to the target, in this case age groups that have similar mortgage-buying habits. Let’s look at Figure 5 to understand how it works.
Figure 5. Example illustrating a tree based method to automatically find the groups into which we can ‘discretise’ our variable prior to using it in Logistic Regression. The tree will identify those groups that are more similar in their behaviour towards buying a mortgage. For details see text.
In this scenario, we have 100 customers of all ages with the buying tendencies described above: customers between 30 and 50 will generally buy a mortgage, but customers outside this age range won’t. There are of course some young and old customers who will buy a mortgage, and a fraction of people between 30 and 50 who will not, as shown in the table:
People under 30 years: 2 will buy, 18 will not buy.
People between 30 and 50 years: 48 will buy, 12 will not buy.
People older than 50 years: 4 will buy, 16 will not buy.
We then proceed to build a classification tree. The tree will identify the first best division by asking the question ‘Is the customer older than 30?’ If no, then send the customer to the left, group 1; otherwise send them to the right and ask the next question: ‘Is the customer under 50?’ If no, then send the customer to group 2; otherwise include them in the rightmost group, 3.
Now the tree has identified the three major groups, but how will we assign numbers to those groups? Simple – we will assign to each group their probability of buying a mortgage.
For group 1 that is 2/20 = 0.1; for group 2 it is 4/20 = 0.2; and for group 3 it is 48/60 = 0.8.
After this exercise we will end up with a situation very similar to the one in Figure 3, but instead of having arbitrarily numbered the groups 1 to 3, we have now numbered them, based on the decision tree, as 0.1, 0.2 and 0.8. The linear relationship between the ‘discretised’ age and the probability of buying a mortgage is still captured: the higher the value of the group the customer belongs to, the more likely they are to buy the mortgage!
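The whole procedure can be sketched with scikit-learn’s decision tree. The data below is synthetic, constructed to mirror the counts in the worked example (2 of 20 under-30s buy, 48 of 60 in the 30 to 50 bracket buy, 4 of 20 over-50s buy); each customer’s age is then replaced by the plain within-group buy rate of the leaf the tree assigns them to.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic ages mirroring the three groups in the worked example.
ages = np.concatenate([
    np.linspace(18, 29, 20),    # 20 customers under 30
    np.linspace(30, 50, 60),    # 60 customers aged 30 to 50
    np.linspace(51, 70, 20),    # 20 customers over 50
]).reshape(-1, 1)
bought = np.concatenate([
    (np.arange(20) % 10 == 0).astype(int),  # 2 of 20 buy
    (np.arange(60) % 5 != 2).astype(int),   # 48 of 60 buy
    (np.arange(20) % 5 == 2).astype(int),   # 4 of 20 buy
])

# A tree limited to three leaves finds the two most informative age splits.
tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(ages, bought)

# Replace each age with the buy rate of the leaf (group) it falls into.
leaf = tree.apply(ages)
buy_rate = {l: bought[leaf == l].mean() for l in np.unique(leaf)}
discretised = np.array([buy_rate[l] for l in leaf])
print(sorted(float(v) for v in buy_rate.values()))  # the three group buy rates
```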
At Zopa, when we decide how we are going to modify (pre-process) our variables and which machine learning models we are going to build, we consider approaches used by winners of competitions like the Knowledge Discovery and Data Mining (KDD) Cup, Kaggle and others. We don’t just follow the textbook; instead we research and adopt what works best. In fact, we learnt the trick we have just described from a white paper published after a KDD competition.
Trick two: building several Logistic Regressions and asking them to vote
Imagine you want to make an important decision, for example, whether to sell your house. You would probably ask several estate agents for a sale estimate, rather than going with the first one you get, right?
We do exactly the same. Why trust one single Logistic Regression model when we can actually build several and ‘evaluate’ the predictions of all of them? This process of building several models and averaging their predictions is called ‘bagging’ (bootstrap aggregating) of the predictors. Bagging of predictors was described by Leo Breiman in 1996, and it can improve the performance of a model dramatically.
How does bagging work?
It’s very simple. We generate many different datasets by drawing random samples, with replacement, from the original one. Each dataset will be slightly different from the others, yet representative of the original scenario.
We then use each one of these datasets to build a Logistic Regression model. Because each Logistic Regression is built on a slightly different dataset, it ends up being different from the others. Finally, we average the output probabilities of all the Logistic Regression models to obtain the final probability. We observed that by bagging the predictors, we increased the performance of our predictions by at least 6%, which translates into an appreciable business value.
Logistic Regression models are popular because they are easy to build and easy to interpret. By implementing a few modifications on the variables we use, we can overcome the fact that, in real life, linear relationships between causes and effects are rare. In addition, by building multiple Logistic Regressions using variations of the original dataset, we can improve the performance of our Logistic Regression models considerably. We end up with an easy-to-interpret model and, more importantly, one that makes particularly accurate predictions.
Disclaimer: All figures are for illustrative purposes only. They do not contain real data, real insight, nor are they mathematically fit to represent the shown results.
(Image credit to pixabay.)