Back to Zopa Blog

Optimisation algorithms at Zopa, Part II: speeding up Python with Numba

In the previous post, we discussed how to optimise an algorithm prototype in Pandas by using optimal methods or shipping computation to Numpy. In the second part, we will focus on just-in-time compilation of Python code.

In general, and especially when high performance is key, we should vectorize our data science code. Loops in Python are notoriously slow, because it is a dynamically typed language. Vectorization entails expressing computation as batch operations, rather than element-by-element in a for loop. In Python, vectorization enables us to ship computation to NumPy, which calls fast C or Fortran routines.

The case for Numba

Suppose we have implemented vectorized functions to calculate the metrics for every bin, and they are called as part of the scoring function. Here is what the LineProfiler would display:

Note: column and variable names are for illustrative purposes only

Clearly, the functions that calculate metrics of the bins take up a significant proportion of time, despite the code being vectorized NumPy. At this point, the performance may be sufficient to stop modifying the algorithm – which, as we already saw, may entail making it more verbose. However, if you need to go that extra mile, there are several options that allow us to compile Python into lower-level and faster code. [1] We have been very impressed with the recent progress of Numba, a dynamic, just-in-time compiler for Python sponsored by Anaconda (the company which releases the popular Anaconda distributions for scientific computing in Python and R).

The open source community has already demonstrated that Numba is able to outperform pure NumPy code. The biggest attraction of using Numba is how little modification to the codebase it requires. In contrast to Cython, another popular method of speeding up Python by compiling it into C code, optimizing code using Numba may be as trivial as adding a single decorator to your function.

In addition, not all code can be vectorized. An algorithm may contain an iterative procedure which depends on the results from the previous step. In Python, we end up with a for loop that is inherently slow. Luckily, the code is only dependant on NumPy with which Numba works well. Hence, we can decorate the entire routine and obtain speed improvements at little cost. It should be noted, however, that Numba will attempt to cast the array to a numeric type, therefore any strings such as user IDs would first need to be mapped to integers or floats for the compiler to work. That is certainly a downside to using Numba, rather than Cython: Numba only accepts a limited set of types. It is not difficult to reach the point where performance gains are outweighed by the verbosity of the extra code required to make it work. Nonetheless, Numba is perfectly suited for computation on NumPy numeric arrays and admits the namedtuple type, which consequently worked out well for our bin packing example.

Looking under the hood

The three main decorators provided by Numba are jit() (and its variants), vectorize() and guvectorize(). While they can all speed up Python and/or NumPy code, there are subtle and important differences.

The jit() decorator works in the following way: at runtime, Numba takes Python bytecode (which is compiled by CPython), translates it (incl., where appropriate, type inference) and subsequently compiles into low-level machine code using the LLVM compiler. By default, it will attempt to translate Python code to low-level, optimised instructions and if it fails at some point, it will revert to Python mode in which there are unlikely to be any performance gains. To prevent this from happening, we have to set the nopython argument in the jit decorator to True, or, equivalently, use the njit() decorator – in that case, if Numba fails to compile, it will throw an exception and halt execution.

Source: Intel

In contrast, the vectorize() and guvectorize() decorators return different objects. The former allows us to create a universal NumPy function (ufunc), which are functions that operate element-by-element on NumPy arrays and support such functionality as array broadcasting or type casting. The guvectorize() decorator returns a generalized universal function, which operates over an arbitrary number of array elements.

Importantly, because different decorators return different objects, we cannot mix them. This is particularly relevant for moderately sophisticated algorithms where compiled functions call each other. If we call the jitted function within a NumPy ufunc, Numba will throw an exception as it will not recognize the type of the jitted function. Which decorator should be selected? The jit() decorator provides the most flexibility and even allows to release the Global Interpreter Lock (GIL) for true parallelization. On the other hand, the jit() decorator does not offer the functionality of ufuncs such as array broadcasting and hence vectorized code may have to be rewritten with loop unrolling. For our application, we found that to combine metric calculation functions with the iterative procedure of processing each item it was optimal to use the jit() decorator. However, to minimize code changes and maximize readability it would have been preferable to apply the vectorize() decorator to the vectorized NumPy functions and keep the loop over items in Python.

Seeing the big picture

Let’s walk through a real-case example of how we applied Numba to optimise the code. Suppose we are interested in the weighted value of the bin. Here is a code snippet of how such a function might look in Python and NumPy:

As you can see, it is simply a weighted average of ball values. We can examine how much time this function takes by running another IPython magic command %timeit: 27.1μs is not bad at all. Now let’s see what we can achieve by explicitly writing the for loop and applying the jit() decorator:

So, Numba brings a 2.1x speedup for a small modification to the code. Because Numba involves compilation at runtime, the first function call will generally be much slower, but we can cache the compiled function and incur overhead costs only once. When we consider that the algorithm may feature dozens of such functions and the loop over hundreds of thousands of balls, we can gauge that the overall performance gain will be tangible.

The curious reader may ask, how is this possible? Computation on NumPy arrays already utilises highly efficient C routines. Roughly speaking, the reason is that behind the scenes, NumPy only optimises isolated operations, such as summation of two terms, and has to store temporary results in memory, whereas at runtime Numba compiles the entire function and is thus able to optimise all at once, avoiding overhead. In other words, unlike NumPy, a just-in-time compiler like Numba can see ‘the big picture’.

Finally, a note about type signatures and parallelization. As you can see, our decorator above specifies the data types into which Numba will attempt to coerce the inputs. If you do not provide type signatures, Numba will attempt to infer them at runtime. For computation dependant on floating point precision, it may be safer to specify the type signatures manually (several types can also be provided). Numba also supports parallelization with a simple argument to the decorator, although speed improvements will only occur for sufficiently large arrays. We found that for arrays of around 100 values, parallelization significantly slowed down the code due to setup overhead.

Learnings and prospects

Before concluding, we must raise an old, but pertinent adage about the evils of premature optimisation – rewriting the code to only perform computation on NumPy arrays and only use the data structures admitted by Numba will generally make the codebase harder to read and harder to maintain. If a project is at an early stage and the specification of deliverables may change multiple times, the time cost of working with more complex code may easily outweigh any time savings from performance gains.

Nonetheless, we were very impressed with the performance gains obtained from using only NumPy and Numba for computation. Relative to our initial prototypes, we observed 50x speedups – with some parts reaching as high as 100x speedup. The cost of optimisation was relatively low, with most of our algorithm remaining Pythonic, and we look forward to using Numba to speed up our machine learning code. It is still a relatively young library, with occasionally obscure exceptions and few resources in comparison to Cython, but it can deliver quick wins when the focus of the project shifts to performance.

Footnotes

[1] Our options at this point were largely PyPy, Cython or Numba. PyPy was ruled out since scientific computing packages in Python, such as NumPy, are not optimised to work with it. Cython would fare well, but would have required more modifications to the code.