Data science opportunities at Zopa are diverse. While we have talked at length in the past about our machine learning algorithms and our main tools, another important class of challenges at Zopa concerns combinatorial optimisation, especially linear programming at scale. These often demand highly customised solutions, thus calling upon our data scientists’ ability to implement new ideas from scratch and write high-quality, maintainable code.
In this technical blog post, we will share with you how we leverage the PyData stack to design our algorithms, how we identify computational bottlenecks, and how we optimise an algorithm’s execution speed.
How we approach optimisation
A prominent example of a combinatorial problem is bin packing. For concreteness, consider the following example: given a fixed number of bins and a fixed number of balls, how would you allocate those balls between the bins to maximise their total value, subject to constraints on, say, each bin’s allowed volume and allowed weight? This problem is NP-hard: no polynomial-time algorithm is known that finds an optimal solution for all possible inputs.
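To make the problem concrete, here is a tiny, made-up instance solved by brute force. The data and the helper below are purely illustrative; exhaustive search over every possible assignment is exponential in the number of balls, which is exactly why it only works at toy scale:

```python
# Toy bin-packing instance solved by brute force (illustrative only).
from itertools import product

balls = [  # (value, volume, weight) - made-up numbers
    (10, 2, 3),
    (7, 3, 1),
    (5, 1, 2),
]
bins = [  # (max_volume, max_weight) - made-up capacities
    (4, 4),
    (3, 3),
]

def best_assignment(balls, bins):
    best_value, best = 0, None
    # Each ball goes to one of the bins, or stays unallocated (None).
    for choice in product([None] + list(range(len(bins))), repeat=len(balls)):
        vol = [0] * len(bins)
        wgt = [0] * len(bins)
        value = 0
        feasible = True
        for (v, volume, weight), b in zip(balls, choice):
            if b is None:
                continue
            vol[b] += volume
            wgt[b] += weight
            value += v
            if vol[b] > bins[b][0] or wgt[b] > bins[b][1]:
                feasible = False
                break
        if feasible and value > best_value:
            best_value, best = value, choice
    return best_value, best
```

With three balls and two bins there are only 27 assignments to check; at millions of balls this approach is hopeless, which motivates the heuristics below.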
At our scale, using off-the-shelf linear programming solvers, such as PuLP or Google Optimisation Tools, turned out to be a prohibitively slow approach that increases the chance of execution failures. As a result, we develop our own algorithms suited to our specific needs. Our goal is to make algorithms that are fast, reliable, and at the same time, as simple as possible. Most of the time, they are not standard textbook problems and require creative thinking driven by business knowledge.
First, optimise for research speed
How can we approach our bin packing example? First, we know that the number of balls is several orders of magnitude larger than the number of bins. Secondly, we wish to allow every bin to be considered for any ball that can fit in it. Therefore, it is much less costly to iterate through all balls and consider each bin for allocation than vice versa. Thirdly, each bin’s suitability for a ball can be calculated independently of the others, because suitability is measured only against that bin’s individual targets.
Consequently, we can design a greedy algorithm which finds an approximate solution in a single scan over all balls. The algorithm assigns each ball to the most appropriate bin by optimising for a series of heuristics which are combined into a single score. This approach is common for NP-hard optimisation problems.
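A minimal sketch of that single-scan idea, with an illustrative best-fit score (prefer the bin that would be left with the least spare capacity) standing in for the combined heuristics; the real scoring is business-specific and not shown here:

```python
# Greedy single pass over the balls: score every bin that still has room
# and take the best. Best-fit scoring here is an illustrative stand-in.

def greedy_allocate(balls, capacities):
    """balls: list of sizes; capacities: list of bin capacities.
    Returns a list mapping each ball to a bin index (or None)."""
    remaining = list(capacities)
    assignment = []
    for size in balls:
        # Score each candidate bin; here: minimise leftover space.
        candidates = [(remaining[i] - size, i)
                      for i in range(len(remaining)) if remaining[i] >= size]
        if candidates:
            _, best = min(candidates)
            remaining[best] -= size
            assignment.append(best)
        else:
            assignment.append(None)  # no bin can take this ball
    return assignment
```

Because each ball is considered exactly once, the cost grows linearly in the number of balls, at the price of an approximate rather than optimal answer.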
To implement our ideas, we use Python and its rich open-source ecosystem. We are big fans of the PyData stack at Zopa and find it very productive for prototyping our research ideas. Often, to get the data we would query our databases from a Python environment and store it in a pandas DataFrame, a 2D container for structured data. By performing computation on the DataFrame directly, we produce code that is extremely readable, intuitive and can be quickly iterated on.
Once we finish prototyping and are happy with the simulated outcomes of an algorithm, we estimate its time cost were we to deploy it on our platform. Our algorithms may need to cope with millions of iterations every day. Slow algorithms impede the research process, preventing fast trial-and-error iterations, and increase the strain on production systems. In our case, computation costs justify time investment into speeding up an optimisation algorithm.
Profile to find the 80/20
When performance is paramount, it always helps to profile the code first to identify where to focus optimisation efforts.
Python provides a convenient built-in profiling library, cProfile, that is very simple to use. However, it only times function calls, which may hide bottlenecks within highly modularised code that also relies on scientific computing packages like NumPy or pandas. Another option is the line_profiler package, which conveniently provides a ‘magic’ command in the IPython shell or a Jupyter notebook, both commonly used for prototyping. line_profiler is particularly useful as it can time every individual line of any function executed as part of the original function call.
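As a quick illustration, here is function-level profiling with the standard library’s cProfile, with the equivalent line_profiler magic shown in comments; the profiled function is a made-up stand-in:

```python
# Function-level profiling with the standard library's cProfile.
import cProfile
import io
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()  # per-function timings only, not per-line

# For line-by-line timings in a notebook, line_profiler's magic is handier:
#   %load_ext line_profiler
#   %lprun -f slow_sum slow_sum(100_000)
```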
Let’s consider a simple example. As part of the algorithm, we need to update the metrics of the bins as we add balls to them. Those metrics are stored in a DataFrame, so each update requires indexing the appropriate cell and assigning the new value. Typically, that is done using the .loc or .iloc methods, which accept a wide range of types – from callables to strings – for row indexes and column names. Using the pandas API, the updates require only a few lines of code. However, with line_profiler we can quickly see that they are more expensive than they may appear, especially since they need to be performed many times:
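An illustrative version of this kind of update; the DataFrame layout, column names and values below are invented for the example:

```python
# Hypothetical per-bin metrics, updated one cell at a time via .loc.
import pandas as pd

bin_metrics = pd.DataFrame(
    {"used_volume": [0.0, 0.0], "used_weight": [0.0, 0.0]},
    index=["bin_a", "bin_b"],
)

def add_ball(df, bin_label, volume, weight):
    # Label-based single-cell updates: readable, but each call goes
    # through pandas' full indexing machinery.
    df.loc[bin_label, "used_volume"] += volume
    df.loc[bin_label, "used_weight"] += weight

add_ball(bin_metrics, "bin_a", 2.0, 3.0)
```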
Pandas indexing is slow – in part, this is the price we pay for its flexibility, as a lot of data type checks take place behind the scenes. There are several possible improvements. First, we can notice that we are modifying one cell at a time and can therefore use the .at or .iat methods instead, which deliver 2-3x speedups and require only a straightforward modification to the code. Alternatively, we can forgo the DataFrame altogether and use a data structure with much faster indexing, such as a NumPy array (NumPy is another well-known and hugely influential package in the Python ecosystem) – which is what we did in the end.
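A rough way to compare the two single-cell accessors yourself; exact speedups vary by machine and pandas version, so this sketch (on a hypothetical toy frame) asserts correctness rather than a specific timing ratio:

```python
# Timing .loc vs .at for repeated single-cell access on a toy frame.
import timeit

import pandas as pd

df = pd.DataFrame({"x": range(1000)})

# .at is restricted to scalar access, skipping most of .loc's generality.
t_loc = timeit.timeit(lambda: df.loc[500, "x"], number=10_000)
t_at = timeit.timeit(lambda: df.at[500, "x"], number=10_000)

print(f".loc: {t_loc:.3f}s  .at: {t_at:.3f}s")  # .at is typically severalfold faster
```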
Slow pandas? Use NumPy instead
While pandas is famous for its ease and flexibility, it may not always be sufficiently fast and can become a computational bottleneck. In order to optimise code that relies on pandas, it is helpful to understand how it works under the hood. Many articles have been written on how pandas is implemented and therefore we will only give a brief overview.
At its core, the pandas DataFrame is an abstraction on top of NumPy arrays, each of which holds values of a single type. On top of these arrays, pandas provides labelled axes: column names and an index. An index is essentially a big, immutable mapping from labels to specific rows of the DataFrame. Columns of different types are backed by separate NumPy arrays, which is how pandas can hold heterogeneous data. When we store data as a DataFrame, it is loaded in memory with Axes (column names and index labels) and Blocks stored separately, where each Block is a NumPy array holding all the columns of one type.
Why is this important? Knowing how pandas manages the data under the hood clarifies why certain operations are computationally expensive and should be avoided. For example, appending to a DataFrame is expensive because every Block has to be copied and resized, and a new Index has to be created. More generally, manipulating data through the DataFrame incurs additional overhead, relative to directly modifying the underlying NumPy array, due to data type checking, maintaining indexes and inferring output shapes – much of which happens in Python-level code, rather than in the fast, low-level C that powers NumPy.
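For instance, the contrast between growing a DataFrame row by row and building it once can be sketched as follows (toy data, and only a handful of slow iterations, since it is the pattern that matters, not the timings):

```python
# Growing a DataFrame row by row copies every Block each time; the usual
# fix is to accumulate plain Python rows and construct the frame once.
import pandas as pd

rows = [{"ball": i, "bin": i % 3} for i in range(1000)]

# Slow pattern: repeated concatenation, each step copying the frame.
slow = pd.DataFrame([rows[0]])
for r in rows[1:5]:  # only a few iterations for illustration
    slow = pd.concat([slow, pd.DataFrame([r])], ignore_index=True)

# Fast pattern: one construction from the accumulated rows.
fast = pd.DataFrame(rows)
```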
Returning to our example above, accessing an element from the NumPy array turns out to be orders of magnitude faster:
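The earlier cell update, redone on a plain NumPy array, with hypothetical integer column positions standing in for column names:

```python
# Single-cell updates on a NumPy array: plain integer indexing, with no
# label lookup or dtype checks on the hot path.
import numpy as np

USED_VOLUME, USED_WEIGHT = 0, 1  # column positions replace column names
metrics = np.zeros((2, 2))       # rows: bins; columns: the two metrics

metrics[0, USED_VOLUME] += 2.0
metrics[0, USED_WEIGHT] += 3.0
```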
However, the modifications to the code are no longer trivial. A sizable portion of the code will need to be rewritten to accommodate a different structure and different indexing syntax. Moreover, NumPy arrays do not hold labelled data, so we can no longer access data by column names or index labels. This may not be an issue when we have a few columns whose positions are fixed, but hard-coding columns by position invites bugs that are easy to miss.
But we know that DataFrame Axes essentially provide a mapping from labels to specific rows and columns. Consequently, we can recreate this mapping using other Pythonic data structures such as a dictionary or a namedtuple. We can still take DataFrames as an input to the algorithm and create a map from column names to a corresponding integer position. We can also output a DataFrame by creating it only once at the end – this drastically reduces computation time. All the in-between heavy lifting will be done on NumPy arrays.
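One possible shape for this DataFrame-in, DataFrame-out pattern; the function and column names are invented for the sketch, and the “heavy lifting” is a trivial stand-in:

```python
# Map column names to positions once, do the work on a NumPy array, and
# build the output DataFrame a single time at the end.
import pandas as pd

def run_on_arrays(df):
    col = {name: i for i, name in enumerate(df.columns)}  # label -> position
    values = df.to_numpy(dtype=float, copy=True)          # one copy in

    # ... heavy per-row work on `values` (trivial stand-in below) ...
    values[:, col["used_volume"]] *= 2.0

    # One construction out: the DataFrame is created only once.
    return pd.DataFrame(values, columns=df.columns, index=df.index)

df = pd.DataFrame({"used_volume": [1.0, 2.0], "used_weight": [3.0, 4.0]})
out = run_on_arrays(df)
```

The explicit `copy=True` guards against `to_numpy` returning a view of the frame’s data, which would otherwise let the in-place arithmetic mutate the input DataFrame.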
At this stage, the code might be sufficiently performant for a large number of real-life business applications. However, when one needs to go the extra mile, further improvements can be obtained by compiling Python programs using tools such as Numba. This will be the topic of our next blog post.