# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(plot_style = False)
os.chdir(path)
# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
# 3. to use cython
%load_ext watermark
%load_ext autoreload
%autoreload 2
%load_ext cython
import numpy as np
from numba import jit, njit, prange
%watermark -a 'Ethen' -d -t -v -p numpy,cython,numba
Cython is a superset of the Python programming language, designed to give C-like performance with code that is mostly written in Python. In short, it aims to combine the simplicity of Python with the efficiency of C. If you'd like some additional motivation to try it out, consider listening to this 20-minute-ish talk from PyCon. Youtube: Cython as a Game Changer for Efficiency
We can write it in notebooks by loading the cython magic.
%%cython
def hello_snippet():
"""
after loading the cython magic, we can
run the cython code (this code isn't
different from normal python code)
"""
print('hello cython')
hello_snippet()
Or write it as a .pyx file and use setup.py to compile it:
helloworld.pyx
# cython hello world
def hello():
print('Hello, World!')
setup.py
# compiling the .pyx module
from distutils.core import setup
from Cython.Build import cythonize
# key-value pairs that tell distutils the name
# of the application and which extensions it
# needs to build. For the cython modules, we
# can use glob patterns, e.g. *.pyx, or simply pass in
# the filename.pyx
setup(
name = 'Hello',
ext_modules = cythonize('*.pyx'),
)
# after that run
# python setup.py build_ext --inplace
# in the command line, and we can import it like
# normal python modules
from helloworld import hello
hello()
Cython extends the Python language with static type declarations, which increases speed by removing the need for type checks at runtime. The way we do this in Cython is with the cdef keyword.
We'll write a simple program that increments j by 1 a thousand times and compare the speed difference when we add the type declarations.
%%cython
def example():
"""simply increment j by 1 for 1000 times"""
# declare the integer type before using it
cdef int i, j = 0
for i in range(1000):
j += 1
return j
def example_py():
j = 0
for i in range(1000):
j += 1
return j
%timeit example()
%timeit example_py()
Notice the runtime difference (look at the units)
To declare functions we use the cpdef keyword.
%%cython
cpdef int compute_sum(int a, int b):
return a + b
compute_sum(5, 3)
Notice that apart from declaring the function with the cpdef keyword, we also specify the return type to be an integer and the two input arguments to be integers.
There's still an overhead to calling functions, so if the function is small and sits inside a computationally expensive for loop, we can add the inline keyword to the function declaration. By doing this, the compiler replaces each function call with the function body itself, removing the overhead of calling the function many times.
%%cython
cpdef inline int compute_sum(int a, int b):
return a + b
Typed memoryviews allow even more efficient numpy manipulation since, again, they do not incur the Python overhead.
%%cython
import numpy as np
# declare memoryviews by using : in the []
cdef double[:, :] b = np.zeros((3, 3), dtype = 'float64')
b[1] = 1
# it now prints memoryview instead of original numpy array
print(b[0])
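If we need a regular numpy array back, e.g. for printing or for further numpy operations, we can wrap the memoryview with np.asarray, which creates a view over the same data without copying. A small sketch:
%%cython
import numpy as np

cdef double[:, :] c = np.zeros((3, 3), dtype = 'float64')
c[1] = 1

# np.asarray wraps the memoryview in a numpy array without copying the data
print(np.asarray(c[0]))
print(np.asarray(c))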
We'll start by implementing a pure python version of the function that will give us a good benchmark for comparison with Cython alternatives below.
import numpy as np
def euclidean_distance(x1, x2):
dist = np.sqrt(np.sum((x1 - x2) ** 2))
return dist
def pairwise_python(X, metric = 'euclidean'):
if metric == 'euclidean':
dist_func = euclidean_distance
else:
raise ValueError("unrecognized metric")
n_samples = X.shape[0]
D = np.zeros((n_samples, n_samples))
# We could exploit symmetry to reduce the number of computations required,
# i.e. distance D[i, j] = D[j, i]
# by only looping over its upper triangle
for i in range(n_samples):
for j in range(i + 1, n_samples):
dist = dist_func(X[i], X[j])
D[i, j] = dist
D[j, i] = dist
return D
X = np.random.random((1000, 3))
%timeit pairwise_python(X)
We'll try re-writing this in Cython using typed memoryviews. The key thing with Cython is to avoid using Python objects and function calls as much as possible, including vectorized operations on numpy arrays. This usually means writing out all of the loops by hand and operating on single array elements at a time.
All the commented .pyx code can be found in the github folder. You can simply run python setup.py install to install pairwise1.pyx and pairwise2.pyx.
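As a rough illustration of what pairwise1.pyx contains (a sketch reconstructed from the description above, not the actual file from the repository), the loops are written out by hand over typed memoryviews, but each pair of rows is still passed as a slice to a small distance helper:

# pairwise1.pyx (illustrative sketch, not the actual file)
import numpy as np
from libc.math cimport sqrt


cdef double euclidean_distance(double[:] x1, double[:] x2):
    cdef int k
    cdef double tmp, dist = 0.0
    for k in range(x1.shape[0]):
        tmp = x1[k] - x2[k]
        dist += tmp * tmp
    return sqrt(dist)


def pairwise1(double[:, :] X):
    cdef int i, j
    cdef int n_samples = X.shape[0]
    cdef double dist
    cdef double[:, :] D = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            # note: X[i] and X[j] create a new slice object on every pair
            dist = euclidean_distance(X[i], X[j])
            D[i, j] = dist
            D[j, i] = dist
    return np.asarray(D)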
# pairwise1.pyx
from pairwise1 import pairwise1
# test optimized code on a larger matrix
X = np.random.random((5000, 3))
%timeit pairwise1(X)
We can see the huge speedup over the pure python version! It turns out, though, that we can do even better. If we look at the code, the slicing operations X[i] and X[j] generate a new array each time they are called. So this time, we will index into the X array directly, one element at a time, without creating a new array for each pair.
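The sketch below illustrates the idea behind pairwise2.pyx (again an illustrative reconstruction, not the actual file): the distance computation is inlined and X is indexed one element at a time, so no intermediate arrays are created per pair.

# pairwise2.pyx (illustrative sketch, not the actual file)
import numpy as np
from libc.math cimport sqrt


def pairwise2(double[:, :] X):
    cdef int i, j, k
    cdef int n_samples = X.shape[0]
    cdef int n_dim = X.shape[1]
    cdef double tmp, dist
    cdef double[:, :] D = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            dist = 0.0
            for k in range(n_dim):
                # operate on single elements, no slice objects are created
                tmp = X[i, k] - X[j, k]
                dist += tmp * tmp
            dist = sqrt(dist)
            D[i, j] = dist
            D[j, i] = dist
    return np.asarray(D)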
# pairwise2.pyx
from pairwise2 import pairwise2
%timeit pairwise2(X)
We now try to utilize Cython's parallel functionality. For some reason, the parallel version couldn't be compiled by following Cython's documentation on compiling code that utilizes OpenMP (a multithreading API); will come back to this in the future. Had to take a different route and install it as if it were a package. You can simply run python setup_parallel.py install to install pairwise3.pyx.
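For reference, the sketches below show roughly what pairwise3.pyx and setup_parallel.py could look like (illustrative reconstructions, not the actual files): the outer loop uses cython.parallel.prange inside a nogil block, and the extension is built with OpenMP flags (assuming a gcc-style compiler, hence -fopenmp).

# pairwise3.pyx (illustrative sketch, not the actual file)
# cython: boundscheck=False, wraparound=False
import numpy as np
from cython.parallel import prange
from libc.math cimport sqrt


def pairwise3(double[:, :] X):
    cdef int i, j, k
    cdef int n_samples = X.shape[0]
    cdef int n_dim = X.shape[1]
    cdef double tmp, dist
    cdef double[:, :] D = np.zeros((n_samples, n_samples))
    # the outer loop is distributed across threads; everything inside
    # must avoid python objects so it can run without the GIL
    with nogil:
        for i in prange(n_samples, schedule = 'static'):
            for j in range(i + 1, n_samples):
                dist = 0.0
                for k in range(n_dim):
                    tmp = X[i, k] - X[j, k]
                    dist = dist + tmp * tmp
                dist = sqrt(dist)
                D[i, j] = dist
                D[j, i] = dist
    return np.asarray(D)

# setup_parallel.py (illustrative sketch): build the extension with OpenMP
# compile and link flags
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext = Extension(
    'pairwise3',
    sources = ['pairwise3.pyx'],
    extra_compile_args = ['-fopenmp'],
    extra_link_args = ['-fopenmp'])

setup(ext_modules = cythonize([ext]))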
# pairwise3.pyx
from pairwise3 import pairwise3
%timeit pairwise3(X)
We've touched upon an example of utilizing Cython to speed up our CPU-intensive numerical operations. Though, to get the full advantage out of Cython, it's still good to know some C/C++ programming (things like the void type, pointers and the standard library).
Numba is an LLVM-based JIT compiler for Python code, which allows code written in Python to be converted to highly efficient compiled code at runtime. To use it, we simply add a @jit (just-in-time compilation) decorator to our function. We can add arguments to the decorator to specify the input types, but it is recommended not to and to simply let Numba decide when and how to optimize.
@jit
def pairwise_numba1(X):
M = X.shape[0]
N = X.shape[1]
D = np.zeros((M, M), dtype = np.float64)
for i in range(M):
for j in range(i + 1, M):
d = 0.0
for k in range(N):
tmp = X[i, k] - X[j, k]
d += tmp * tmp
dist = np.sqrt(d)
D[i, j] = dist
D[j, i] = dist
return D
# a nice speedup from the raw python code given the
# little amount of effort that we had to put in
# (just adding the jit decorator)
%timeit pairwise_numba1(X)
The @jit decorator tells Numba to compile the function. The argument types will be inferred by Numba when the function is called. If Numba can't infer the types, it will fall back to python objects; when this happens, we probably won't see any significant speedup. The numba documentation lists out which python and numpy features are supported.
A number of keyword-only arguments can be passed to the @jit decorator, e.g. nopython. Numba has two compilation modes: nopython mode and object mode. The former produces much faster code, but has limitations that can force Numba to fall back to the latter. To prevent Numba from falling back, and instead raise an error, pass nopython = True to the decorator, so it becomes @jit(nopython = True). Or we can be even lazier and simply use the @njit decorator.
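As a quick toy illustration (the add functions below are made up just for this example), @njit behaves the same as @jit(nopython = True):
from numba import jit, njit  # already imported at the top of the notebook

# @njit is simply shorthand for @jit(nopython = True)
@jit(nopython = True)
def add_jit(a, b):
    return a + b


@njit
def add_njit(a, b):
    return a + b


print(add_jit(1, 2), add_njit(1, 2))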
The latest version, 0.34.0 (released around mid July 2017), also allows us to write parallel code by specifying the parallel = True argument to the decorator and changing range to prange to perform explicit parallel loops. Note that we must ensure the loop does not have cross-iteration dependencies.
@njit(parallel = True)
def pairwise_numba2(X):
M = X.shape[0]
N = X.shape[1]
D = np.zeros((M, M), dtype = np.float64)
for i in prange(M):
for j in range(i + 1, M):
d = 0.0
for k in range(N):
tmp = X[i, k] - X[j, k]
d += tmp * tmp
dist = np.sqrt(d)
D[i, j] = dist
D[j, i] = dist
return D
%timeit pairwise_numba2(X)
Note that when we add the @njit decorator, we are marking the function for optimization by Numba's JIT (just-in-time) compiler, meaning the python code is compiled on the fly into optimized machine code the first time we invoke the function. In other words, we see some additional speed boost the next time we call the function, since we no longer pay the initial compilation overhead.
%timeit pairwise_numba2(X)
A tiny change to the code gives a significant gain in speed!!! This insane functionality allows us to prototype algorithms with numpy without losing the speed of C++ code. If you've read up to this point, consider giving the numba project a star.
For more information, this is a pretty good PyData talk that illustrates the potential of numba. Youtube: Numba: Flexible analytics written in Python.