In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))

from formats import load_style
load_style(css_style = 'custom2.css', plot_style = False)
Out[1]:
In [2]:
os.chdir(path)

Data Science is Software

Developer life hacks for Data Scientists.

Section 1: Environment Reproducibility

watermark extension

Tell everyone when you ran the notebook and which versions of the packages you were using. Listing this dependency information at the top of a notebook is especially useful for nbviewer, blog posts and other media where you are not sharing the notebook as executable code.

In [3]:
# once it is installed, we'll just need this in future notebooks:
%load_ext watermark
%watermark -a "Ethen" -d -t -v -p numpy,pandas,seaborn,watermark,matplotlib
Ethen 2018-09-17 21:32:12 

CPython 3.6.4
IPython 6.4.0

numpy 1.14.1
pandas 0.23.0
seaborn 0.8.1
watermark 1.6.0
matplotlib 2.2.2

Here, we're only importing the watermark extension, but it's also a good idea to do all of our other imports in the first cell of the notebook.

Create A Separate Environment

Continuum's conda tool provides a way to create isolated environments. The conda env functionality lets us create an isolated environment on our machine, so that we can:

  • Start from "scratch" on each project
  • Choose Python 2 or 3 as appropriate

To create an empty environment:

  • conda create -n <name> python=3

Note: python=2 will create a Python 2 environment; python=3 will create a Python 3 environment.

To work in a particular virtual environment:

  • source activate <name>

To leave a virtual environment:

  • source deactivate

Note: on Windows, the commands are just activate and deactivate, no need to type source.

There are other Python tools for environment isolation, but none of them are perfect. If you're interested in other options, virtualenv and pyenv are the main alternatives. There are sometimes compatibility issues between the Anaconda Python distribution and these packages, so if you've got Anaconda on your machine you can use conda env to create and manage environments.

Create a new environment for every project you work on

The pip requirements.txt file

It's a convention in the Python ecosystem to track a project's dependencies in a file called requirements.txt. We recommend using this file to keep track of your "minimum reproducible environment" (MRE). An example of requirements.txt might look something like the following:

pandas>=0.19.2
matplotlib>=2.0.0

The format for a line in the requirements file is:

  • package_name : installs whatever the latest version on PyPI is
  • package_name==X.X.X : requires an exact match of version X.X.X
  • package_name>=X.X.X : requires at least version X.X.X

Now, contributors can create a new virtual environment (using conda or any other tool) and install your dependencies just by running:

pip install -r requirements.txt

Never again run `pip install [package]`. Instead, update `requirements.txt` and run `pip install -r requirements.txt`. For data science projects, favor `package>=0.0.0` over `package==0.0.0`; this prevents you from having many versions of large packages with complex dependencies (e.g. numpy, scipy, pandas) sitting around.

Usually the package version will adhere to semantic versioning. Let’s take 0.19.2 as an example and break down what each number represents.

  • (0.19.2) The first number in this chain is called the major version.
  • (0.19.2) The second number is called the minor version.
  • (0.19.2) The third number is called the patch version.

These versions are incremented when code changes are introduced. Depending on the nature of the change, a different number is incremented.

  • The major version (first number) is incremented when backwards incompatible changes are introduced, i.e. changes that break the old API. Usually, when a major version is released, a guide is published describing how to update from the old version to the new one
  • The minor version (second number) is incremented for backwards compatible changes. Functionality is added (or speed is improved) without breaking any existing functionality, at least in the public API that end-users rely on
  • The patch version (third number) is for backwards compatible bug fixes. Bug fixes are in contrast here with features (adding functionality). These patches go out when something is wrong with existing functionality or when improvements to existing functionality are implemented
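
As a quick illustration, we can parse and compare version strings programmatically. This is just a side note and assumes the third-party packaging library is installed:

from packaging.version import Version

version = Version('0.19.2')
# the release tuple holds the (major, minor, patch) numbers
print(version.release)  # (0, 19, 2)

# comparisons respect version ordering, which is what
# requirement specifiers such as pandas>=0.19.2 rely on
print(Version('0.19.2') < Version('0.20.0'))  # True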

Both the requirements.txt file and conda virtual environments are ways to isolate each project's environment and dependencies, so that we, or other people trying to reproduce our work, can save a lot of time recreating the environment.

Separation of configuration from codebase

There are some things you don't want to be openly reproducible: your private database url, your AWS credentials for downloading the data, your SSN, which you decided to use as a hash. These shouldn't live in source control, but may be essential for collaborators or others reproducing your work.

This is a situation where we can learn from software engineering best practices. The 12-factor app principles give a set of best practices for building web applications, and many of these principles are relevant for data science codebases as well.

Using a dependency manifest like requirements.txt satisfies II. Explicitly declare and isolate dependencies. Another important principle is III. Store config in the environment:

An app’s config is everything that is likely to vary between deploys (staging, production, developer environments, etc). Apps sometimes store config as constants in the code. This is a violation of twelve-factor, which requires strict separation of config from code. Config varies substantially across deploys, code does not. A litmus test for whether an app has all config correctly factored out of the code is whether the codebase could be made open source at any moment, without compromising any credentials.

The dotenv package allows you to easily store these variables in a file that is not in source control (as long as you keep the line .env in your .gitignore file!). You can then reference these variables as environment variables in your application with os.environ.get('VARIABLE_NAME').
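
For illustration, a .env file is just key=value pairs, one per line. The names and values below are made up:

# .env -- keep this file out of source control
FOO=bar
DATABASE_URL=postgres://username:password@localhost/mydatabase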

In [4]:
import os
from dotenv import load_dotenv
In [5]:
# load the .env file
load_dotenv('.env')
Out[5]:
True
In [6]:
# obtain the value of the variable
os.environ.get('FOO')

Note that Python's standard library also provides configparser for this kind of configuration.
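
For completeness, a minimal configparser sketch; settings.ini and its contents are hypothetical:

import configparser

# settings.ini might contain:
# [database]
# url = postgres://localhost/mydb

config = configparser.ConfigParser()
config.read('settings.ini')
db_url = config['database']['url']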

Section 2: Writing code for reusability

If the code prints out some output and we want the reader to see it within some context (e.g. presenting a data story), then a jupyter notebook is an ideal place for it to live. However, if we wish to use the same piece of code in multiple notebooks, then we should save it to a standalone .py file to avoid copying and pasting the same piece of code every single time. Finally, if the code is going to be used in multiple data analysis projects, then we should consider creating a package for it.

No more docs-guessing

Don't edit-run-repeat to try to remember the name of a function or argument. Jupyter provides great docs integration and easy ways to remember the arguments to a function.

To check the docs, we can simply add a question mark ? after the method, or press Shift Tab (both at the same time) inside the parentheses of the method and it will show the method's arguments. The Tab key can also be used for auto-completion of methods and their arguments.
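
For example, running the following in a cell displays the signature and docstring of read_csv without leaving the notebook (the ? syntax is IPython/Jupyter specific):

import pandas as pd

# pops up the documentation pane for the function
pd.read_csv?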

Consider the following example. To follow along, please download the dataset pumps_train_values.csv from the following link and move it to the ../data/raw file path, or change the pump_data_path below to wherever you'd like to store it.

In [7]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

pump_data_path = os.path.join('..', 'data', 'raw', 'pumps_train_values.csv')
df = pd.read_csv(pump_data_path)
df.head(1)
Out[7]:
id amount_tsh date_recorded funder gps_height installer longitude latitude wpt_name num_private ... payment_type water_quality quality_group quantity quantity_group source source_type source_class waterpoint_type waterpoint_type_group
0 69572 6000.0 2011-03-14 Roman 1390 Roman 34.938093 -9.856322 none 0 ... annually soft good enough enough spring spring groundwater communal standpipe communal standpipe

1 rows Ă— 40 columns

After reading in the data, we discover that it provides an id column which we wish to use as the index. But we forgot the parameter for doing so.

In [8]:
# we can do ?pd.read_csv or just check the 
# documentation online since it usually looks nicer ...
df = pd.read_csv(pump_data_path, index_col = 0)
df.head(1)
Out[8]:
amount_tsh date_recorded funder gps_height installer longitude latitude wpt_name num_private basin ... payment_type water_quality quality_group quantity quantity_group source source_type source_class waterpoint_type waterpoint_type_group
id
69572 6000.0 2011-03-14 Roman 1390 Roman 34.938093 -9.856322 none 0 Lake Nyasa ... annually soft good enough enough spring spring groundwater communal standpipe communal standpipe

1 rows Ă— 39 columns

No more copying-pasting

In [9]:
# 1. magic for inline plot
# 2. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

plt.rcParams['figure.figsize'] = 8, 6

# create a chart, and we might be tempted to
# paste the code for 'construction_year'
# paste the code for 'gps_height'
plot_data = df['amount_tsh']
sns.kdeplot(plot_data, bw = 1000)
plt.show()

After making this plot, we might want to do the same for other numeric variables. To do this we can copy the entire cell and modify the parameters. This might be ok in a draft, but after a while the notebook can become quite unmanageable.

When we realize we're starting to step on our own toes, that we're no longer effective and development has become clumsy, it is time to organize the notebook. Start over, copy the good code, and rewrite and generalize the bad.

Back to our original task of plotting the same graph for other numeric variables: instead of copying and pasting the cell multiple times, we should refactor a little so we don't repeat ourselves, i.e. create a function to do it. And for the function, write an appropriate docstring.

In [10]:
def kde_plot(dataframe, variable, upper = None, lower = None, bw = 0.1):
    """ 
    Plots a density plot for a variable with optional upper and
    lower bounds on the data (inclusive)
    
    Parameters
    ----------
    dataframe : DataFrame
    
    variable : str
        input column, must exist in the input dataframe.
        
    upper : int
        upper bound for the input column, i.e. data points
        exceeding this threshold will be excluded.
    
    lower : int
        lower bound for the input column, i.e. data points
        below this threshold will be excluded.
    
    bw : float, default 0.1
        bandwidth for density plot's line.
    
    References
    ----------
    Numpy style docstring
    - http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html#example-numpy
    """
    plot_data = dataframe[variable]
    
    if upper is not None:
        plot_data = plot_data[plot_data <= upper]
    
    if lower is not None:
        plot_data = plot_data[plot_data >= lower]

    sns.kdeplot(plot_data, bw = bw)
    plt.show()
In [11]:
kde_plot(df, variable = 'amount_tsh', bw = 1000, lower = 0)
kde_plot(df, variable = 'construction_year', bw = 1, lower = 1000, upper = 2016)
kde_plot(df, variable = 'gps_height', bw = 100)

No more copy-pasting between notebooks

Have a method that gets used in multiple notebooks? Refactor it into a separate .py file so it can live a happy life! Note: In order to import your local modules, you must do three things:

  • put the .py file in a separate folder.
  • add an empty __init__.py file to the folder so the folder can be recognized as a package.
  • add that folder to the Python path with sys.path.append.
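
For reference, a minimal sketch of what ../src/features/build_features.py could look like. The real implementation isn't shown in this notebook, so the body below is only a guess based on the comment in the next cell that the function drops rows containing zeros:

import pandas as pd

def remove_invalid_data(path):
    """Read the raw csv and drop rows that contain a zero in any column."""
    df = pd.read_csv(path, index_col=0)
    df = df[(df != 0).all(axis=1)]
    return df
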
In [12]:
# add local python functions
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join('..', 'src')
sys.path.append(src_dir)
In [13]:
# import my method from the source code,
# which drops rows with 0 in them
from features.build_features import remove_invalid_data

df = remove_invalid_data(pump_data_path)
df.shape
Out[13]:
(821, 39)

Python caches imported modules. Hence, after we import a method for the first time, that version keeps being used even if we change the source code afterwards. To overcome this "issue" we can use the autoreload extension, which tells the notebook to reload a module every time it changes.

In [14]:
# Load the "autoreload" extension
# it comes with jupyter notebook
%load_ext autoreload

# always reload all modules
%autoreload 2

# or we can reload modules marked with "%aimport"
# import my method from the source code

# %autoreload 1
# %aimport features.build_features

I'm too good! Now this code is useful to other projects!

Importing local code is great if you want to use it in multiple notebooks, but once you want to use the code in multiple projects or repositories, it gets complicated. This is when we get serious about isolation!

We can build a python package to solve that! In fact, there is a cookiecutter to create Python packages.

Once we create this package, we can install it in "editable" mode, which means that as we change the code, the changes will get picked up wherever the package is used. The process looks like:

# install cookiecutter first
pip install cookiecutter

cookiecutter https://github.com/wdm0006/cookiecutter-pipproject
cd package_name
pip install -e .

Now we can have a separate repository for this code and it can be used across projects without having to maintain code in multiple places.

Section 3: Don't let others break your toys

Include tests.

numpy.testing

Provides useful assertion methods for values that are numerically close and for numpy arrays.

In [15]:
# the randomly generated data from a normal distribution with a mean of 0
# should have a sample mean that's almost equal to 0, hence no error occurs
import numpy as np
data = np.random.normal(0.0, 1.0, 1000000)
np.testing.assert_almost_equal(np.mean(data), 0.0, decimal = 2)

Also check the docs for numpy.isclose and numpy.allclose. They are useful when making assertions about data where small probabilistic changes or machine precision may result in numbers that aren't exactly equal. Consider using them instead of == for any numbers where randomness may influence the results.
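
A quick illustration of why exact equality is fragile for floats:

import numpy as np

print(0.1 + 0.2 == 0.3)             # False, due to floating point representation
print(np.isclose(0.1 + 0.2, 0.3))   # True, compares within a tolerance

# allclose is the array version: every element must be close
print(np.allclose([1.0, 2.0], [1.0, 2.0 + 1e-9]))  # True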

engarde decorators

A library that lets you practice defensive programming, specifically with pandas DataFrame objects. It provides a set of decorators that check the return value of any function that returns a DataFrame and confirm that it conforms to a set of rules.

In [16]:
# pip install engarde
import engarde.decorators as ed


test_data = pd.DataFrame({'a': np.random.normal(0, 1, 100),
                          'b': np.random.normal(0, 1, 100)})
@ed.none_missing()
def process(dataframe):
    dataframe.loc[10, 'a'] = 1 # change the 1 to np.nan and the code assertion will break
    return dataframe

process(test_data).head()
Out[16]:
a b
0 1.238563 0.215904
1 0.498143 -1.178031
2 -0.146521 0.072514
3 -0.138808 1.070784
4 -0.218587 0.654166

engarde has an awesome set of decorators:

  • none_missing - no NaNs (great for machine learning--sklearn does not care for NaNs)
  • has_dtypes - make sure the dtypes are what you expect
  • verify - runs an arbitrary function on the dataframe
  • verify_all - makes sure every element returns true for a given function

More can be found in the docs.
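
For instance, the decorators can be stacked on top of the same function. A sketch along those lines is shown below; the argument formats for has_dtypes and verify_all follow my reading of the engarde docs, so treat the exact signatures as assumptions:

import numpy as np
import pandas as pd
import engarde.decorators as ed

test_df = pd.DataFrame({'a': np.random.normal(0, 1, 100),
                        'b': np.random.normal(0, 1, 100)})

@ed.none_missing()
@ed.has_dtypes({'a': np.float64, 'b': np.float64})
@ed.verify_all(lambda df: df.abs() < 100)  # every element must satisfy the check
def passthrough(dataframe):
    return dataframe

passthrough(test_df).head()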

Creating a test suite with pytest

We can create a test suite with pytest to start checking the functions we've written. To pytest, test_-prefixed functions or methods are test items. For more info, check the getting started guide.

The term "test fixtures" refers to known objects or mock data used to put other pieces of the system to the test. We want these to have the same, known state every time.

For those familiar with unittest, this might be data that you read in as part of the setUp method. pytest does things a bit differently; you define functions that return expected fixtures, and use a special decorator so that your tests automatically get passed the fixture data when you add the fixture function name as an argument.

We need to set up a way to get some data in here for testing. There are two basic choices — reading in the actual data or a known subset of it, or making up some smaller, fake data. You can choose whatever you think works best for your project.

Remove the failing test from above and copy the following into your testing file:

In [17]:
import pytest
import pandas as pd

@pytest.fixture()
def df():
    """read in the raw data file and return the dataframe"""
    pump_data_path = os.path.join('..', 'data', 'raw', 'pumps_train_values.csv')
    df = pd.read_csv(pump_data_path)
    return df


def test_df_fixture(df):
    assert df.shape == (59400, 40)

    useful_columns = ['amount_tsh', 'gps_height', 'longitude', 'latitude', 'region',
                      'population', 'construction_year', 'extraction_type_class',
                      'management_group', 'quality_group', 'source_type',
                      'waterpoint_type', 'status_group']
    
    for column in useful_columns:
        assert column in df.columns

We can then run py.test from the command line in the directory where the testing code resides.

Other Tips and Tricks

Version control: Use version control such as github! The exception is big data files, where you might turn to cloud storage such as s3 or a database instead. If you are using github, you might also be interested in the nbdime (diffing and merging of Jupyter Notebooks) project, which makes checking jupyter notebook changes so much easier.

Logging: Use logging to record the process instead of printing.
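
A minimal sketch with the standard library's logging module; the file name pipeline.log is arbitrary:

import logging

# configure the root logger once, near the top of the script or notebook
logging.basicConfig(
    filename='pipeline.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(name)s : %(message)s')

logger = logging.getLogger(__name__)
logger.info('starting feature engineering')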

Issue tracking: Keep track of the bugs. A minimal useful bug database should include:

  • The observed behavior and complete steps to reproduce the bug
  • The expected behavior
  • Who it is assigned to
  • Whether it has been fixed or not

Set up a clear workflow: Establish the workflow before diving into the project. This includes using a unified file structure for the project, e.g. cookiecutter-data-science.

Use joblib for caching output:

You trained your neural network for three days and now you're ready to build on top of it. But you forgot to plug your laptop into a power source and it ran out of battery. So you scream: Why didn't I pickle!? The answer is: because it is a pain in the back. Managing file names, checking if the file exists, saving, loading ... What to do instead? Use joblib.

from sklearn.externals.joblib import Memory

memory = Memory(cachedir='/tmp', verbose=0)

@memory.cache
def computation(p1, p2):
    ...

With three lines of code, we get caching of the output of any function. joblib tracks the parameters passed to a function, and if the function has been called with the same parameters before, it returns the value cached on disk.
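
A self-contained sketch using the standalone joblib package (sklearn.externals.joblib in the snippet above points at the same library bundled with older sklearn versions); the cache directory is arbitrary:

from joblib import Memory

memory = Memory('/tmp/joblib_cache', verbose=0)

@memory.cache
def expensive_computation(p1, p2):
    # stand-in for something that takes a long time to compute
    return p1 ** p2

expensive_computation(2, 10)  # computed and written to the cache directory
expensive_computation(2, 10)  # same arguments, so the result is loaded from disk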

Jupyter Notebooks Extension:

The Jupyter notebook extension project contains a collection of extensions that add functionality to the default jupyter notebook. Some useful ones that I enjoy using include:

  • Table of content
  • Code snippet
  • Code folding
  • Table beautifier
  • Auto equation numbering
  • etc.

Taking some time to configure them will most likely make working with notebooks even more pleasant.