Machine Learning at Zopa
At Zopa, we love our customers. We also love data. What is the best way to combine these two loves? Machine learning! We do a lot of machine learning (ML) here at Zopa, and with a new series of blog posts, we will share a glimpse of our exciting journey.
Let’s start from the beginning. Zopa is a P2P lending company (the inventor of P2P, actually!) that matches investors who lend with retail borrowers. As a young start-up back in 2005, we predicted borrowers’ default probabilities using scores from simple, generic ML models supplied by the credit-risk agencies. To gain a competitive advantage, we progressively increased the sophistication of these models by making them more specific to Zopa’s loan types and borrower population. However, all this progress was happening off-site, contracted out to external parties.
A few years ago, we realised that for Zopa to remain a pioneer in P2P lending and fintech, it would have to develop its own infrastructure, know-how, and team of experts in ML. We wanted to use ML to transform the way important decisions were made across the company, starting with credit-risk assessment and extending to fraud prevention, marketing, and pricing.
At the start of this journey, we had to take some important decisions:
- Should we use open source or a proprietary ML solution (e.g., SAS, RapidMiner)?
- If open source, should we use a premade package (e.g., WEKA) or code something up?
- If we code something up, which language do we use? Python? R? Other?
- Or should we instead use some premade package wrapped in our own custom code?
- Finally, do we really need a centralised choice that everyone follows, or should we let each of our data scientists choose what works best for them?
We ended up creating a centralised solution for everyone, coded in Python and based on the usual suspects: Scikit-Learn, XGBoost, Keras, py-Earth, Pandas, and Matplotlib. It took the form of a stand-alone Python package that we (unimaginatively) named Predictor. Predictor gave us a streamlined, automated way to generate ML models, implementing every step of model creation: variable type identification, vetting, and preprocessing; hyper-parameter optimisation; and model training, combining, and assessment (we’ll break down all these steps in later posts!). Given an unprocessed dataset and a simple YAML configuration file, Predictor takes care of the rest.
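To give a flavour of what such a config-driven flow can look like, here is a minimal sketch built on Scikit-Learn and Pandas. It is not Predictor’s actual code or API: the `CONFIG` dictionary (a stand-in for the YAML file), the function names, the toy loan columns, and the choice of gradient boosting with a grid search are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the YAML configuration file.
CONFIG = {
    "target": "defaulted",
    "test_size": 0.25,
    "cv_folds": 3,
    "seed": 0,
    "param_grid": {"model__n_estimators": [20, 50], "model__max_depth": [2, 3]},
}


def make_toy_loans(n_rows=400, seed=0):
    """Generate a synthetic loan dataset; every column here is invented."""
    rng = np.random.default_rng(seed)
    income = rng.uniform(15_000, 60_000, n_rows)
    amount = rng.uniform(1_000, 15_000, n_rows)
    purpose = rng.choice(["car", "home", "debt"], n_rows)
    # Default risk rises with the amount-to-income ratio.
    logit = 6.0 * amount / income - 2.5 + 0.5 * (purpose == "debt")
    defaulted = (rng.random(n_rows) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    frame = pd.DataFrame(
        {"income": income, "amount": amount, "purpose": purpose, "defaulted": defaulted}
    )
    # Knock out a few incomes so the imputation step has something to do.
    frame.loc[rng.random(n_rows) < 0.05, "income"] = np.nan
    return frame


def identify_variable_types(features):
    """Split columns into numeric and categorical, based on their dtype."""
    numeric = [c for c in features.columns if pd.api.types.is_numeric_dtype(features[c])]
    categorical = [c for c in features.columns if c not in numeric]
    return numeric, categorical


def build_and_assess(data, config):
    """Preprocess, tune, train, and assess a model as directed by the config."""
    features = data.drop(columns=[config["target"]])
    target = data[config["target"]]
    numeric, categorical = identify_variable_types(features)

    preprocess = ColumnTransformer([
        ("num", SimpleImputer(strategy="median"), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("model", GradientBoostingClassifier(random_state=config["seed"])),
    ])

    x_train, x_test, y_train, y_test = train_test_split(
        features, target, test_size=config["test_size"],
        stratify=target, random_state=config["seed"],
    )

    # Hyper-parameter optimisation by cross-validated grid search.
    search = GridSearchCV(
        pipeline, config["param_grid"], cv=config["cv_folds"], scoring="roc_auc"
    )
    search.fit(x_train, y_train)

    # Assessment on data the search never saw.
    test_auc = roc_auc_score(y_test, search.predict_proba(x_test)[:, 1])
    return search.best_estimator_, search.best_params_, test_auc


if __name__ == "__main__":
    model, best_params, test_auc = build_and_assess(make_toy_loans(), CONFIG)
    print("best parameters:", best_params)
    print("held-out AUC:", round(test_auc, 3))
```

Keeping preprocessing and the model inside one pipeline object means the grid search re-fits the imputer and encoder on each training fold, so no information leaks from the validation folds into the tuning.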
We asked all our data scientists to use Predictor for their modelling needs. However, we also encouraged them to improve and extend it as they saw fit, provided they pushed their changes back to the main branch for all to use.
Predictor’s streamlined model development allowed us to rapidly evaluate the potential of various datasets and modelling ideas. Because every data scientist used it, model creation and evaluation became consistent and systematic, with little friction.
Because Predictor was written in Python, we could easily integrate it into our Production systems. As a result, the same code that data scientists used to create ML models was also used by Production systems to query those exact models. The whole setup gave us a considerable advantage over more traditional institutions, where porting models from data science to production carries a significant “translation” overhead.
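The idea of sharing one code path between modelling and serving can be sketched as follows. This is an illustration under our own assumptions, not a description of how Predictor persists or serves models: the function names, the toy data, and the use of `pickle` as the hand-over format are all hypothetical.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_toy_data(n_rows=200, seed=0):
    """Generate two synthetic features and a noisy binary label."""
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(n_rows, 2))
    noise = rng.normal(scale=0.5, size=n_rows)
    labels = (features[:, 0] + 0.5 * features[:, 1] + noise > 0).astype(int)
    return features, labels


def train_model(features, labels):
    """Data-science side: fit preprocessing and model as a single object."""
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(features, labels)
    return model


def score(model, application):
    """Shared scoring code, called unchanged offline and in production."""
    row = np.asarray(application, dtype=float).reshape(1, -1)
    return float(model.predict_proba(row)[0, 1])


if __name__ == "__main__":
    features, labels = make_toy_data()
    trained = train_model(features, labels)

    artefact = pickle.dumps(trained)   # what the data scientist hands over
    served = pickle.loads(artefact)    # what the production service loads

    application = [0.3, -1.2]
    print("offline score:   ", score(trained, application))
    print("production score:", score(served, application))
```

Because the scaler travels inside the serialised pipeline, the serving side applies exactly the preprocessing that was fitted during training and nothing has to be re-implemented. In practice, pickled artefacts should only be loaded from trusted sources and with matching library versions on both sides.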
Today, we use Predictor to run six ML models in Production, evaluating the credit risk and fraud potential of thousands of loan applications every day. We also run it offline to optimise our marketing campaigns and pricing strategy.
Where does the future take us? We are fascinated by the recent revolution in deep learning and are currently evaluating its usefulness in modelling time-series data. We are also intrigued by the potential of social data (e.g. LinkedIn) to provide a more ‘holistic’ picture of an individual when used alongside our existing ‘traditional’ data. We will investigate the potential of these new datasets whilst keeping in mind all the applicable UK and EU regulations regarding data protection and information security.
In the next blog posts, we will talk more about Predictor and its architecture, our ML models (productionised or not), and what we have learned on Zopa’s data-science journey. Stay tuned!