(Reposted from www.peadarcoyle.com)
I’ve been working with Machine Learning models both in academic and industrial settings for a few years now, and I recently watched the excellent Scalable ML from Mikio Braun, to learn more about Scala and Spark.
His video series talks about the practicalities of ‘big data’, and it made me think about what I wish I knew earlier about Machine Learning. I’ve boiled it down to three main things:
1. Getting models into production is a lot more than just microservices
2. Feature selection and feature extraction are really hard to learn from a book
3. The evaluation phase is really important
Let’s get into the details.
Getting models into production is a lot more than just microservices
I’ve been speaking about things like Data Products for a few years now, and there’s a lot more to this than just microservices. Models decay in production; there’s a whole deployment and evaluation phase, and in some cases a hand-off period between research and development work and production work. At Zopa, we invest heavily in our own tooling for production, and the improvements in throughput, auditability and the deployment experience are worth it. Read more about Predictor here.
Feature selection and feature extraction are really hard to learn
Feature selection and feature extraction are something I tried, and failed, to learn from a book. In reality, these skills are learned through Kaggle competitions and real-world projects: you only really internalise the various tricks and methods by implementing them yourself. Trust me, this eats up a large share of the data science workflow, and feature extraction and feature selection deserve their own blog post.
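To make the idea concrete, here is a minimal sketch of one of the simplest filter-style feature selection tricks: score each candidate feature by its absolute correlation with the target and keep the top k. The feature names and toy numbers below are invented for illustration; real projects would use library tooling (e.g. scikit-learn's feature selection module) and far more careful scoring.

```python
import math

def pearson(xs, ys):
    # Pearson correlation between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_top_k(features, target, k):
    # Filter-style selection: rank features by |correlation| with the
    # target, keep the k best. Cheap, but blind to feature interactions.
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical toy data: predicting a binary default label.
features = {
    "income": [30, 45, 60, 80, 20],
    "age":    [25, 40, 35, 50, 30],
    "noise":  [1, -2, 3, -1, 2],
}
defaulted = [0, 0, 1, 1, 0]
print(select_top_k(features, defaulted, k=2))  # → ['income', 'age']
```

Note the limitation, which is exactly the kind of thing a book glosses over: a univariate filter like this can discard features that are only useful in combination, which is why wrapper and embedded methods exist.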
The evaluation phase is really important
Unless you apply your models to test data, you’re not doing predictive analytics. Evaluation techniques such as cross-validation and evaluation metrics are invaluable, as is simply splitting your data into training data and test data. Life often doesn’t hand you a dataset with these things defined, so there is a lot of creativity and empathy involved in defining these two sets on a real-world dataset. There’s a good intro to this in the Scikit-Learn docs.
The explanations by Mikio Braun are worth a read. I love his diagrams – here’s a useful example, in case you’re not familiar with training sets and testing sets.
We don’t often discuss evaluation of models in papers, conferences or even when we talk about what techniques we use to solve problems. ‘We used SVM on that’ doesn’t really tell me anything. It doesn’t tell me your data sources, your feature selection, your evaluation methods, how you got the model into production, or how you used cross-validation or model-debugging. We need a lot more commentary about these ‘dirty’ aspects of machine learning. And I wish I knew that a lot earlier 🙂
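As a small example of the kind of evaluation detail worth reporting alongside ‘we used SVM’: precision and recall computed from a binary confusion matrix. The labels below are invented for illustration.

```python
def confusion_counts(y_true, y_pred):
    # Counts for a binary confusion matrix: TP, FP, FN, TN.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    # Precision: of everything flagged positive, how much was right?
    # Recall: of all true positives, how many did we catch?
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(precision_recall(y_true, y_pred))  # → (0.666..., 0.666...)
```

Reporting both numbers (plus how the test set was constructed) says far more about a model than the name of the algorithm that produced it.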
Ian Ozsvald has some great remarks in ‘Data Science Delivered’ – a good read for any professional (junior or senior) building machine learning models for a living. It’s also a useful resource for recruiters hiring data scientists or managers interacting with data science teams – if you’re looking for questions to ask your data science team, why not try: ‘How did you handle that dirty data?’
Peadar Coyle is a Senior Data Scientist at Zopa working on risk modelling. He’s an international speaker and an open-source contributor to PyMC3. In his spare time, he enjoys craft beer and tag rugby.