In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))

from formats import load_style
load_style(plot_style = False)
Out[1]:
In [2]:
os.chdir(path)

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
# 3. to use cython
%load_ext watermark
%load_ext autoreload 
%autoreload 2
%load_ext cython

import numpy as np
from numba import jit, njit, prange

%watermark -a 'Ethen' -d -t -v -p numpy,cython,numba
Ethen 2017-08-25 10:26:10 

CPython 3.5.2
IPython 5.4.1

numpy 1.13.1
cython 0.25.2
numba 0.34.0

High Performance Python

Cython

Cython is a superset of the Python programming language, designed to give C-like performance with code that is mostly written in Python. In short, it aims to combine the simplicity of Python with the efficiency of C. If you would like some additional motivation to try it out, consider listening to this roughly 20 minute talk from PyCon. Youtube: Cython as a Game Changer for Efficiency

We can write Cython code directly in the notebook by loading the cython magic.

In [3]:
%%cython
def hello_snippet():
    """
    after loading the cython magic, we can
    run the cython code (this code isn't 
    different from normal python code)
    """
    print('hello cython')
In [4]:
hello_snippet()
hello cython

Or write it as a .pyx file and use setup.py to compile it:

helloworld.pyx

# cython hello world
def hello():
    print('Hello, World!')

setup.py

# compiling the .pyx module
from distutils.core import setup
from Cython.Build import cythonize

# key-value pairs that tell distutils the name
# of the application and which extensions it
# needs to build, for the cython modules, we
# use glob patterns, e.g. *.pyx or simply pass in
# the filename.pyx
setup(
    name = 'Hello',
    ext_modules = cythonize('*.pyx'),
)

# after that run 
# python setup.py build_ext --inplace
# in the command line, and we can import it like
# normal python modules
In [5]:
from helloworld import hello
hello()
Hello, World!

Static Typing

Cython extends the Python language with static type declarations, which increases speed because type checks no longer need to happen at runtime. We declare the types in Cython with the cdef keyword.

We'll write a simple program that increments j by 1 a thousand times and compare the speed difference after adding the type declaration.

In [6]:
%%cython

def example():
    """simply increment j by 1 for 1000 times"""
    # declare the integer type before using it
    cdef int i, j = 0
    for i in range(1000):
        j += 1
    
    return j
In [7]:
def example_py():
    j = 0
    for i in range(1000):
        j += 1
    
    return j
In [8]:
%timeit example()
The slowest run took 28.37 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 47.7 ns per loop
In [9]:
%timeit example_py()
10000 loops, best of 3: 49.4 µs per loop

Notice the runtime difference (look at the units: nanoseconds versus microseconds).

Functions

To declare a function that can be called both from Cython and from regular Python code, we use the cpdef keyword.

In [10]:
%%cython

cpdef int compute_sum(int a, int b):
    return a + b
In [11]:
compute_sum(5, 3)
Out[11]:
8

Notice that apart from declaring the function with the cpdef keyword, we also specify the return type to be an integer and the two input arguments to be integers.

There's still an overhead to calling functions, so if a function is small and sits inside a computationally expensive for loop, we can add the inline keyword to the function declaration. This replaces each function call with the function body itself, eliminating the overhead of calling the function many times.

In [12]:
%%cython

cpdef inline int compute_sum(int a, int b):
    return a + b

Numpy

Typed memoryviews allow even more efficient numpy manipulation since, once again, they do not incur the Python overhead.

In [13]:
%%cython

import numpy as np

# declare memoryviews by using : in the []
cdef double[:, :] b = np.zeros((3, 3), dtype = 'float64')
b[1] = 1

# it now prints memoryview instead of original numpy array
print(b[0])
<MemoryView of 'ndarray' object>
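
If we need the original numpy array back (e.g. to print its contents or feed it to numpy functions), we can wrap the memoryview with np.asarray. A minimal sketch of the same declaration with the conversion added:

%%cython

import numpy as np

cdef double[:, :] b = np.zeros((3, 3), dtype = 'float64')
b[1] = 1

# np.asarray wraps the memoryview's underlying buffer
# in a regular numpy array without copying the data
print(np.asarray(b))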

Pairwise Distance Example

We'll start by implementing a pure python version of the function that will give us a good benchmark for comparison with Cython alternatives below.

In [14]:
import numpy as np

def euclidean_distance(x1, x2):
    dist = np.sqrt(np.sum((x1 - x2) ** 2))
    return dist

def pairwise_python(X, metric = 'euclidean'):
    if metric == 'euclidean':
        dist_func = euclidean_distance
    else:
        raise ValueError("unrecognized metric")    

    n_samples = X.shape[0]
    D = np.zeros((n_samples, n_samples))
    
    # We could exploit symmetry to reduce the number of computations required,
    # i.e. distance D[i, j] = D[j, i]
    # by only looping over its upper triangle
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            dist = dist_func(X[i], X[j])
            D[i, j] = dist
            D[j, i] = dist

    return D
In [15]:
X = np.random.random((1000, 3))
%timeit pairwise_python(X)
1 loop, best of 3: 3.53 s per loop

We'll try re-writing this in Cython using typed memoryviews. The key with Cython is to avoid Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.

All of the commented .pyx code can be found in the GitHub folder. You can simply run python setup.py install to install pairwise1.pyx and pairwise2.pyx.
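
To give a feel for what such a .pyx file looks like, here is a rough sketch of a first Cython pass (an illustrative guess on my part, not necessarily the exact contents of pairwise1.pyx): the arrays are typed as memoryviews and the distance computation is moved into a cdef helper, but the rows X[i] and X[j] are still sliced out on every call.

# illustrative sketch of a first Cython pass, compiled the same
# way as the other .pyx files; not necessarily identical to pairwise1.pyx
import numpy as np
from libc.math cimport sqrt

cdef double euclidean_distance(double[:] x1, double[:] x2):
    cdef int k
    cdef double tmp, dist = 0.0
    for k in range(x1.shape[0]):
        tmp = x1[k] - x2[k]
        dist += tmp * tmp

    return sqrt(dist)

def pairwise_cython_rows(double[:, :] X):
    cdef int i, j
    cdef double dist
    cdef int n_samples = X.shape[0]
    cdef double[:, :] D = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            # X[i] and X[j] still create a new slice object on every call
            dist = euclidean_distance(X[i], X[j])
            D[i, j] = dist
            D[j, i] = dist

    return np.asarray(D)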

In [16]:
# pairwise1.pyx
from pairwise1 import pairwise1

# test optimized code on a larger matrix
X = np.random.random((5000, 3))
%timeit pairwise1(X)
1 loop, best of 3: 607 ms per loop

We can see the huge speedup over the pure Python version! It turns out, though, that we can do even better. If we look at the code, the slicing operations X[i] and X[j] generate a new array object each time they are called. So this time, we will index directly into the X array without creating a new array on each iteration.
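
As a hedged sketch of this direct-indexing idea (again, an illustration rather than the exact contents of pairwise2.pyx), the whole computation collapses into element-wise loops:

# illustrative sketch of the direct element-indexing version;
# not necessarily identical to pairwise2.pyx
import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython_elems(double[:, :] X):
    cdef int i, j, k
    cdef double tmp, d
    cdef int n_samples = X.shape[0]
    cdef int n_dim = X.shape[1]
    cdef double[:, :] D = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            d = 0.0
            for k in range(n_dim):
                # operate on single elements, no intermediate row slices
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp

            D[i, j] = sqrt(d)
            D[j, i] = D[i, j]

    return np.asarray(D)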

In [17]:
# pairwise2.pyx
from pairwise2 import pairwise2
%timeit pairwise2(X)
1 loop, best of 3: 231 ms per loop

We now try to utilize Cython's parallel functionality. For some reason, I couldn't compile the parallel version when following Cython's documentation on compiling code that uses OpenMP (a multithreading API); I will come back to this in the future. I had to take a different route and install it as if it were a package. You can simply run python setup_parallel.py install to install pairwise3.pyx.
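
The rough idea (a hedged sketch, not necessarily the exact contents of pairwise3.pyx) is to distribute the outer loop across threads with prange from cython.parallel, which requires releasing the GIL and building with OpenMP compile/link flags such as -fopenmp in the setup script:

# cython: boundscheck=False
# cython: wraparound=False
# illustrative parallel sketch; not necessarily identical to pairwise3.pyx,
# requires OpenMP flags (e.g. -fopenmp) when compiling and linking
import numpy as np
from cython.parallel import prange
from libc.math cimport sqrt

cdef double row_distance(double[:, :] X, int i, int j) nogil:
    cdef int k
    cdef double tmp, d = 0.0
    for k in range(X.shape[1]):
        tmp = X[i, k] - X[j, k]
        d += tmp * tmp

    return sqrt(d)

def pairwise_cython_parallel(double[:, :] X):
    cdef int i, j
    cdef int n_samples = X.shape[0]
    cdef double[:, :] D = np.zeros((n_samples, n_samples))
    # the outer loop is split across threads, everything inside runs without the GIL
    for i in prange(n_samples, nogil = True):
        for j in range(i + 1, n_samples):
            D[i, j] = row_distance(X, i, j)
            D[j, i] = D[i, j]

    return np.asarray(D)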

In [18]:
# pairwise3.pyx
from pairwise3 import pairwise3
%timeit pairwise3(X)
10 loops, best of 3: 88.9 ms per loop

We've touched upon an example of utilizing Cython to speed up our CPU-intensive numerical operations. Though, to get the full advantage out of Cython, it still helps to know some C/C++ programming (things like the void type, pointers and the standard library).

Numba

Numba is an LLVM-based compiler for Python code, which allows code written in Python to be converted to highly efficient compiled code at runtime. To use it, we simply add a @jit (just in time compilation) decorator to our function. We can add arguments to the decorator to specify the input types, but it is recommended not to and to simply let Numba decide when and how to optimize.
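
For example (a small hedged illustration with hypothetical function names, not from the original notebook), an explicit signature pins the types up front, while the recommended lazy form omits it and lets Numba infer the types on the first call:

from numba import jit, float64

# eager compilation: the signature is fixed up front (generally not recommended)
@jit(float64(float64, float64))
def add_eager(x, y):
    return x + y

# lazy compilation: numba infers the types on the first call (recommended)
@jit
def add_lazy(x, y):
    return x + y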

In [19]:
@jit
def pairwise_numba1(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.zeros((M, M), dtype = np.float64)
    for i in range(M):
        for j in range(i + 1, M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            
            dist = np.sqrt(d)
            D[i, j] = dist
            D[j, i] = dist
    
    return D
In [20]:
# a nice speedup from the raw python code given the 
# little amount of effort that we had to put in 
# (just adding the jit decorator)
%timeit pairwise_numba1(X)
1 loop, best of 3: 236 ms per loop

The @jit decorator tells Numba to compile the function. The argument types will be inferred by Numba when the function is called. If Numba can't infer the types, it will fall back to python objects; when this happens, we probably won't see any significant speedup. The numba documentation lists out which python and numpy features are supported.

A number of keyword-only arguments can be passed to the @jit decorator, e.g. nopython. Numba has two compilation modes: nopython mode and object mode. The former produces much faster code, but has limitations that can force Numba to fall back to the latter. To prevent Numba from falling back, and instead raise an error, pass nopython = True to the decorator, so it becomes @jit(nopython = True). Or we can be even lazier and simply use the @njit decorator.
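
As a quick illustrative sketch (hypothetical function names), the following two declarations are equivalent:

from numba import jit, njit

# forcing nopython mode explicitly
@jit(nopython = True)
def add_nopython(x, y):
    return x + y

# @njit is simply shorthand for @jit(nopython = True)
@njit
def add_shorthand(x, y):
    return x + y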

The latest version, 0.34.0 (released around mid July 2017), also allows us to write parallel code by specifying the parallel = True argument to the decorator and changing range to prange to perform explicit parallel loops. Note that we must ensure the loop does not have cross-iteration dependencies.

In [21]:
@njit(parallel = True)
def pairwise_numba2(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.zeros((M, M), dtype = np.float64)
    for i in prange(M):
        for j in range(i + 1, M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            
            dist = np.sqrt(d)
            D[i, j] = dist
            D[j, i] = dist
    
    return D
In [22]:
%timeit pairwise_numba2(X)
The slowest run took 5.27 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loop, best of 3: 104 ms per loop

Note that by adding the @njit decorator, we are marking the function for optimization by Numba's JIT (just in time) compiler, meaning the python code is compiled on the fly into optimized machine code the first time we invoke the function. In other words, we see an additional speed boost the next time we call the function, since we no longer pay the initial compilation overhead.

In [23]:
%timeit pairwise_numba2(X)
10 loops, best of 3: 105 ms per loop

Such a small change to the code yields a significant gain in speed! This insane functionality allows us to prototype algorithms with numpy without losing the speed of C++ code. If you've read up to this point, consider giving the numba project a star.

For more information, this is a pretty good Pydata talk that illustrates the potential of numba. Youtube: Numba: Flexible analytics written in Python.