At Zopa, we do a lot of machine learning.
We make our machines learn to evaluate credit risk and fraud. We train them to produce fair pricing for our customers, and we train them to optimize the outcome of marketing campaigns. We do this because we love our customers, and we want to provide them with a fairer, simpler, more accurate and tailored service.
To make our machines learn, we have developed a standalone toolkit written in Python, which we’ve blogged about before: Predictor. But what machine learning methods have we implemented in Predictor? Do we use the all-time-classic logistic regression model? Do we use state-of-the-art sophisticated models at Zopa? Have we brought deep learning to Zopa? The answer to all of these questions is yes! And we will describe more of Zopa’s approach to machine learning in a series of blog posts that follow on from our first post on The Birth of Predictor.
Let’s take it one step at a time. Today, we’ll dive into the use of Classification Trees as machine learning models.
Let’s start from the beginning. What is a Classification Tree?
We can think of a Classification Tree as a sequence of yes/no questions that help us classify something as one thing or the other. In a famous example quoted by Leo Breiman, the creator of Random Forests, a series of yes/no questions like the ones depicted in the diagram is followed to predict whether a patient is at high risk of a heart attack. More precisely, when a patient is admitted to hospital, a number of variables are measured, and then, according to whether these variables answer yes or no to the questions, the patient is deemed to be at high or low risk of a heart attack. (1)
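To make the idea concrete, a tree of this kind can be written as a handful of nested yes/no checks. The sketch below is only an illustration: the function name, the three variables and the thresholds are assumptions loosely modelled on the heart attack example, not a reproduction of the diagram or a clinical rule.

```python
def heart_attack_risk(min_systolic_bp, age, sinus_tachycardia):
    """Walk a three-question tree and return 'high' or 'low' risk.

    The variables and thresholds are illustrative assumptions.
    """
    # Question 1: is the minimum systolic blood pressure above 91?
    if min_systolic_bp <= 91:
        return "high"
    # Question 2: is the patient older than 62.5?
    if age <= 62.5:
        return "low"
    # Question 3: is sinus tachycardia present?
    return "high" if sinus_tachycardia else "low"


print(heart_attack_risk(min_systolic_bp=85, age=50, sinus_tachycardia=False))
print(heart_attack_risk(min_systolic_bp=120, age=45, sinus_tachycardia=True))
print(heart_attack_risk(min_systolic_bp=120, age=70, sinus_tachycardia=True))
```

Each patient follows exactly one path from the first question to an answer, which is what makes a single tree so easy to read.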
Trees are simple, beautiful and very easy to interpret. They are also easy to follow so practitioners and junior analysts can make better assessments.
Limitations of Classification Trees
However, Classification Trees have a few limitations. For a start, a single tree is rarely very accurate. Trees also tend to become less accurate on new data as more yes/no questions are asked. This occurs because the more questions we ask about a dataset, the more the tree learns about that specific dataset. In other words, it is as if the tree ‘memorizes’ the answers for that dataset, and therefore it can predict it extremely well, even ‘perfectly’. But as soon as it is presented with an example outside of that dataset, it performs very poorly, because the answer is not in its memory, or as we like to say, the tree has not learnt to ‘generalize’. This phenomenon is called overfitting, and it is very important to prevent it when building machine learning models: a model that overfits loses much of its predictive power on data it has not seen.
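The memorizing effect can be reproduced with a toy experiment. The sketch below is an assumption-laden illustration rather than anything from Predictor: it grows a tiny tree in pure Python on made-up one-variable data in which 20% of the labels are deliberately flipped, then compares a tree allowed a single question with a tree allowed to keep asking until every training example is classified correctly.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels (0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def grow(rows, max_depth, depth=0):
    """Grow a tree on (x, label) rows; a leaf is just the majority label."""
    labels = [y for _, y in rows]
    majority = int(2 * sum(labels) >= len(labels))
    values = sorted({x for x, _ in rows})
    if depth == max_depth or len(set(labels)) == 1 or len(values) == 1:
        return majority
    # Try every question "is x <= t?" and keep the one with the purest groups.
    best_t, best_score = None, None
    for t in values[:-1]:
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        score = len(left) * gini(left) + len(right) * gini(right)
        if best_score is None or score < best_score:
            best_t, best_score = t, score
    return {
        "t": best_t,
        "yes": grow([r for r in rows if r[0] <= best_t], max_depth, depth + 1),
        "no": grow([r for r in rows if r[0] > best_t], max_depth, depth + 1),
    }


def predict(tree, x):
    """Follow the questions until a leaf (a plain 0/1 label) is reached."""
    while isinstance(tree, dict):
        tree = tree["yes"] if x <= tree["t"] else tree["no"]
    return tree


def accuracy(tree, rows):
    return sum(predict(tree, x) == y for x, y in rows) / len(rows)


def make_data(n, rng):
    """Label is 1 when x > 0.5, but 20% of labels are flipped (noise)."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:
            y = 1 - y
        rows.append((x, y))
    return rows


rng = random.Random(0)
train, test = make_data(200, rng), make_data(200, rng)
one_question = grow(train, max_depth=1)
unlimited = grow(train, max_depth=None)  # None: keep asking until leaves are pure

for name, tree in [("one question", one_question), ("unlimited", unlimited)]:
    print(f"{name}: train={accuracy(tree, train):.2f} test={accuracy(tree, test):.2f}")
```

The unlimited tree classifies its own training data perfectly because it has memorized the flipped labels too, and that is exactly why its accuracy drops on the fresh test set.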
Classification Trees in the big data era
In today’s era of big data, when we usually handle hundreds, if not thousands, of variables at a time, a single Classification Tree rarely produces the best prediction. Therefore, models that creatively combine multiple Classification Trees have been developed to make more accurate predictions and to use as much data as one can or wants. Among these methods, we find Random Forests and Gradient Boosted Trees.
Random Forests is a technique that grew out of work in the mid-1990s and was formalized by Leo Breiman in 2001. It consists of building multiple Classification Trees whose predictions are then combined, by averaging or voting, to get considerably more accurate predictions.
Imagine you want to make an important decision, for example buying a house. Would you ask only one estate agent? Or would you rather ask several and then reach your conclusion after weighing all the answers you got from the various agents?
Random Forests operate on a similar principle. We build hundreds or thousands of trees and then ask each one of them whether we should classify something as one thing or the other. There are of course a few rules that need to be followed in order to build trees that provide complementary information; we don’t want all the trees to be exactly the same.
Building a Random Forest
The procedure to build the different trees goes like this: from all the possible yes/no questions that we can answer with a dataset, only a subset is selected to build each tree. This means we don’t give all the information to each tree; rather, a limited portion of it is fed into each one of them. This way, we make sure that each tree asks yes/no questions that complement those of the other trees. The way we pass different information to each tree is basically random - hence the name Random Forests. Each tree is given a random subset of the yes/no questions to make its predictions. In addition, we don’t allow each tree to learn from the entire dataset. Rather, we allow it to see only a proportion of it. That is, we give each tree a random sub-selection of all our data to base its answers (i.e., predictions) on. And then, once we have built the hundreds or thousands of trees, we ask them to vote. Simple!
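The procedure above can be sketched in a few lines of pure Python. This is a minimal illustration under stated assumptions, not Predictor’s implementation: the data is made up, all function names are hypothetical, and each ‘tree’ is shrunk to a single yes/no question so that the random selection of variables and rows and the final vote stay easy to see. Production Random Forest libraries usually grow full trees and draw their rows with replacement.

```python
import random
from collections import Counter


def stump_predict(stump, x):
    """A one-question tree: 'is x[feature] > threshold?'."""
    feature, threshold, label_if_yes = stump
    return label_if_yes if x[feature] > threshold else 1 - label_if_yes


def fit_stump(rows, features):
    """Pick the single question, among the allowed features, with fewest errors."""
    best_stump, best_errors = None, None
    for feature in features:
        for threshold in sorted({x[feature] for x, _ in rows}):
            for label_if_yes in (0, 1):
                stump = (feature, threshold, label_if_yes)
                errors = sum(stump_predict(stump, x) != y for x, y in rows)
                if best_errors is None or errors < best_errors:
                    best_stump, best_errors = stump, errors
    return best_stump


def fit_forest(rows, n_trees, n_features, row_fraction, rng):
    """Each tree sees a random subset of features and a random subset of rows."""
    n_total_features = len(rows[0][0])
    n_rows = int(row_fraction * len(rows))
    forest = []
    for _ in range(n_trees):
        features = rng.sample(range(n_total_features), n_features)
        sample = rng.sample(rows, n_rows)
        forest.append(fit_stump(sample, features))
    return forest


def forest_predict(forest, x):
    """Ask every tree and return the majority vote."""
    votes = Counter(stump_predict(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


def make_data(n, rng):
    """Four made-up variables; the label is 1 when their sum exceeds 2."""
    rows = []
    for _ in range(n):
        x = [rng.random() for _ in range(4)]
        rows.append((x, int(sum(x) > 2)))
    return rows


rng = random.Random(0)
train, test = make_data(150, rng), make_data(200, rng)
forest = fit_forest(train, n_trees=25, n_features=2, row_fraction=0.7, rng=rng)
accuracy = sum(forest_predict(forest, x) == y for x, y in test) / len(test)
print(f"forest of {len(forest)} one-question trees, test accuracy: {accuracy:.2f}")
```

An odd number of trees is used so that the vote can never end in a tie.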
Gradient Boosted Trees
Gradient Boosted Trees differ from Random Forests in that, instead of building several independent trees and then averaging their answers, each tree learns from the ‘mistakes’ of the previous tree. That is, a first tree is built based on a series of yes/no questions as usual. Then we evaluate how that single tree classifies the things we want to classify. Going back to the heart attack example, this means we look at whether the tree classified a patient correctly as high or low risk. Some of the patients will be classified correctly, and some won’t.
The next tree places a heavier penalty on the cases that were misclassified, so that it pays more attention to them when choosing its own set of yes/no questions. And the procedure continues tree after tree.
There are a variety of boosted tree methods, for example AdaBoost or XGBoost, which differ, among other things, in the way they implement the penalty on the incorrectly classified data points. These methods also differ in the amount of time they need to formulate their set of yes/no questions. By producing trees that build on the ‘mistakes’ of the previous ones, boosted tree methods tend to produce more accurate predictions. And that is why we like them so much at Zopa.
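The ‘learn from the mistakes of the previous tree’ idea can be sketched with an AdaBoost-style reweighting loop. This is a toy under stated assumptions: the data is made up, the function names are hypothetical, and each tree is a single yes/no question. It illustrates the penalty on misclassified cases only; gradient boosting libraries such as XGBoost instead fit each new tree to the gradient of a loss function, which this sketch does not attempt.

```python
import math


def stump_predict(threshold, sign, x):
    """A one-question tree: answers `sign` if x > threshold, else `-sign`."""
    return sign if x > threshold else -sign


def fit_weighted_stump(xs, ys, weights):
    """Find the question with the smallest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(threshold, sign, x) != y)
            if best is None or error < best[0]:
                best = (error, threshold, sign)
    return best


def boost(xs, ys, n_rounds):
    """Build trees one after another, up-weighting the cases each one got wrong."""
    weights = [1 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, sign = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0)
        vote_weight = 0.5 * math.log((1 - error) / error)
        ensemble.append((vote_weight, threshold, sign))
        # Misclassified cases (y * prediction == -1) get heavier, others lighter.
        weights = [w * math.exp(-vote_weight * y * stump_predict(threshold, sign, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, x):
    """Weighted vote of all trees built so far."""
    score = sum(v * stump_predict(t, s, x) for v, t, s in ensemble)
    return 1 if score > 0 else -1


# Made-up data: the positive class sits in the middle of the range,
# so no single yes/no question can separate it.
xs = [i / 20 for i in range(20)]
ys = [1 if 6 <= i < 14 else -1 for i in range(20)]

for n_rounds in (1, 3):
    ensemble = boost(xs, ys, n_rounds)
    correct = sum(ensemble_predict(ensemble, x) == y for x, y in zip(xs, ys))
    print(f"{n_rounds} tree(s): {correct}/{len(xs)} classified correctly")
```

On this toy data a single question gets 14 of the 20 points right, while three reweighted questions voting together classify all 20 correctly.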
Machine learning at Zopa - Trees AND Forests
At Zopa, we use a combination of both Random Forests and Gradient Boosted Trees, in particular XGBoost, to build our prediction models for credit risk and fraud, for example. We have seen that these models tend to make the most accurate predictions for our customers. Depending on what we are trying to predict, we sometimes build a single model, for example XGBoost. At other times we build both XGBoost and Random Forests and combine the outcomes of both models to further improve the accuracy of our predictions, and therefore make better data-informed decisions. This procedure allows us to serve our customers with a fairer and more tailored service.
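This post does not say how the outcomes of the two models are combined, so the following is only one common possibility, shown as an assumption: a weighted average of the probabilities predicted by each model. The function name, the weight and the numbers are all made up for illustration.

```python
def blend_probabilities(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two models' predicted probabilities, case by case."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same cases")
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(probs_a, probs_b)]


# Hypothetical scores for three cases from two already-trained models.
xgboost_probs = [0.25, 0.75, 0.5]
forest_probs = [0.5, 0.25, 0.5]
blended = blend_probabilities(xgboost_probs, forest_probs)
print(blended)
```

With equal weights, each blended score sits halfway between the two models’ scores, so a case only receives an extreme score when both models agree.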
As a peek into the coming blog posts, I want to share that at Zopa we also build logistic regression models (with a trick) and are currently diving into Deep Learning. So, stay tuned for our future posts!
(1) Breiman L, Friedman J, Olshen R and Stone C. Classification and Regression Trees. 1984.
(Image credit to Anthony Gotter)