# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(css_style = 'custom2.css', plot_style = False)
os.chdir(path)
Developer life hacks for data scientists.
# once it is installed, we'll just need this in future notebooks:
%load_ext watermark
%watermark -a "Ethen" -d -t -v -p numpy,pandas,seaborn,watermark,matplotlib
Here, we're only importing the watermark extension, but it's also a good idea to do all of our other imports in the first cell of the notebook.
Continuum's conda tool provides a way to create isolated environments. The conda env functionality lets you create an isolated environment on your machine, so that each project's dependencies stay separate and reproducible.
To create an empty environment:
conda create -n <name> python=3
Note: python=2
will create a Python 2 environment; python=3
will create a Python 3 environment.
To work in a particular virtual environment:
source activate <name>
To leave a virtual environment:
source deactivate
Note: on Windows, the commands are just activate
and deactivate
, no need to type source
.
There are other Python tools for environment isolation, but none of them are perfect. If you're interested in the other options, virtualenv
and pyenv
both provide environment isolation. There are sometimes compatibility issues between the Anaconda Python distribution and these packages, so if you've got Anaconda on your machine you can use conda env
to create and manage environments.
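Once an environment exists, we can install packages into it and list or remove environments. A few representative conda commands (the environment name is whatever you chose above):
conda install -n <name> numpy pandas
conda env list
conda env remove -n <name>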
It's a convention in the Python ecosystem to track a project's dependencies in a file called requirements.txt
. We recommend using this file to keep track of your MRE, "Minimum reproducible environment". An example of requirements.txt
might look something like the following:
pandas>=0.19.2
matplotlib>=2.0.0
The format for a line in the requirements file is:
Syntax | Result
---|---
package_name | the latest version available on PyPI
package_name==X.X.X | an exact match of version X.X.X
package_name>=X.X.X | at least version X.X.X
Now, contributors can create a new virtual environment (using conda or any other tool) and install your dependencies just by running:
pip install -r requirements.txt
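One common way to bootstrap this file is to snapshot whatever is currently installed in the active environment with pip freeze, then trim the output down to the direct dependencies by hand:
pip freeze > requirements.txt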
Usually the package version will adhere to semantic versioning. Let's take 0.19.2 as an example and break down what each number represents: 0 is the major version (incremented for backwards-incompatible changes), 19 is the minor version (incremented when functionality is added in a backwards-compatible manner) and 2 is the patch version (incremented for backwards-compatible bug fixes). These numbers are incremented when code changes are introduced; depending on the nature of the change, a different number is incremented.
Both the requirements.txt file and conda virtual environments are ways to isolate each project's environment and dependencies, so that we, or other people trying to reproduce our work, can save a lot of time recreating the environment.
There are some things you don't want to be openly reproducible: your private database url, your AWS credentials for downloading the data, your SSN, which you decided to use as a hash. These shouldn't live in source control, but may be essential for collaborators or others reproducing your work.
This is a situation where we can learn from some software engineering best practices. The 12-factor app principles give a set of best practices for building web applications. Many of these principles are relevant for data science codebases as well.
Using a dependency manifest like requirements.txt
satisfies II. Explicitly declare and isolate dependencies. Another important principle is III. Store config in the environment:
An app’s config is everything that is likely to vary between deploys (staging, production, developer environments, etc). Apps sometimes store config as constants in the code. This is a violation of twelve-factor, which requires strict separation of config from code. Config varies substantially across deploys, code does not. A litmus test for whether an app has all config correctly factored out of the code is whether the codebase could be made open source at any moment, without compromising any credentials.
The dotenv
package allows you to easily store these variables in a file that is not in source control (as long as you keep the line .env
in your .gitignore
file!). You can then reference these variables as environment variables in your application with os.environ.get('VARIABLE_NAME')
.
import os
from dotenv import load_dotenv
# load the .env file
load_dotenv('.env')
# obtain the value of the variable
os.environ.get('FOO')
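For the snippet above to return anything, the project root needs a .env file (kept out of source control); its contents are plain KEY=VALUE lines, for example (the values here are made up):
# .env -- never commit this file
FOO=some_secret_value
DATABASE_URL=postgres://user:password@localhost:5432/mydb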
Note that there's also configparser.
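configparser from the standard library covers a similar need with an INI-style file instead of environment variables. A minimal sketch, where the file name, section and key are made up for illustration:
import configparser

# settings.ini would contain, for example:
# [database]
# url = postgres://user:password@localhost:5432/mydb
config = configparser.ConfigParser()
config.read('settings.ini')
db_url = config['database']['url']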
If the code prints out some output and we want the reader to see it within some context (e.g. presenting a data story), then a Jupyter notebook is an ideal place for it to live. However, if we wish to use the same piece of code in multiple notebooks, then we should save it to a standalone .py file to prevent copying and pasting the same piece of code every single time. Finally, if the code is going to be used in multiple data analysis projects, then we should consider creating a package for it.
Don't edit-run-repeat to try to remember the name of a function or argument. Jupyter provides great docs integration and easy ways to remember the arguments to a function.
To check the docs, we can simply add a question mark ? after the method, or press Shift Tab (press both at the same time) inside the method's parentheses and it will print out the arguments to the method. The Tab key can also be used for auto-completion of methods and their arguments.
Consider the following example. To follow along, please download the dataset pumps_train_values.csv from the following link and move it to the ../data/raw directory, or change the pump_data_path below to wherever you'd like to store it.
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
pump_data_path = os.path.join('..', 'data', 'raw', 'pumps_train_values.csv')
df = pd.read_csv(pump_data_path)
df.head(1)
After reading in the data, we discovered that the data provides an id
column and we wish to change it to the index column. But we forgot the parameter to do so.
# we can do ?pd.read_csv or just check the
# documentation online since it usually looks nicer ...
df = pd.read_csv(pump_data_path, index_col = 0)
df.head(1)
# 1. magic for inline plot
# 2. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
plt.rcParams['figure.figsize'] = 8, 6
# create a chart; afterwards we might be tempted to copy and
# paste this cell and tweak it for 'construction_year'
# and again for 'gps_height'
plot_data = df['amount_tsh']
sns.kdeplot(plot_data, bw = 1000)
plt.show()
After making this plot, we might want to do the same for other numeric variables. To do this we can copy the entire cell and modify the parameters. This might be ok in a draft, but after a while the notebook can become quite unmanageable.
When we realize we're starting to step on our own toes, that we are no longer effective and development becomes clumsy, it is time to organize the notebook. Start over, copy the good code, and rewrite and generalize the bad.
Back to our original task of plotting the same graph for other numeric variables: instead of copying and pasting the cell multiple times, we should refactor a little so that we don't repeat ourselves, i.e. create a function to do it instead of copying and pasting. And for the function, write appropriate docstrings.
def kde_plot(dataframe, variable, upper = None, lower = None, bw = 0.1):
"""
Plots a density plot for a variable with optional upper and
lower bounds on the data (inclusive)
Parameters
----------
dataframe : DataFrame
variable : str
input column, must exist in the input dataframe.
upper : int
upper bound for the input column, i.e. data points
exceeding this threshold will be excluded.
lower : int
lower bound for the input column, i.e. data points
below this threshold will be excluded.
bw : float, default 0.1
bandwidth for density plot's line.
References
----------
Numpy style docstring
- http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html#example-numpy
"""
plot_data = dataframe[variable]
if upper is not None:
plot_data = plot_data[plot_data <= upper]
if lower is not None:
plot_data = plot_data[plot_data >= lower]
sns.kdeplot(plot_data, bw = bw)
plt.show()
kde_plot(df, variable = 'amount_tsh', bw = 1000, lower = 0)
kde_plot(df, variable = 'construction_year', bw = 1, lower = 1000, upper = 2016)
kde_plot(df, variable = 'gps_height', bw = 100)
Have a method that gets used in multiple notebooks? Refactor it into a separate .py
file so it can live a happy life! Note: In order to import your local modules, you must do three things:
- put the code in a .py file inside a folder in your project (here, the src directory)
- add an __init__.py file to the folder so the folder can be recognized as a package
- add that folder to the Python path with sys.path.append
# add local python functions
import sys
# add the 'src' directory as one where we can import modules
src_dir = os.path.join('..', 'src')
sys.path.append(src_dir)
# import my method from the source code,
# which drops rows with 0 in them
from features.build_features import remove_invalid_data
df = remove_invalid_data(pump_data_path)
df.shape
Python caches imported modules, so after importing the method for the first time the notebook will keep using that version, even if we change the source file afterwards. To overcome this "issue" we can use a jupyter notebook extension that reloads the method every time it changes.
# Load the "autoreload" extension
# it comes with jupyter notebook
%load_ext autoreload
# always reload all modules
%autoreload 2
# or we can reload modules marked with "%aimport"
# import my method from the source code
# %autoreload 1
# %aimport features.build_features
Importing local code is great if you want to use it in multiple notebooks, but once you want to use the code in multiple projects or repositories, it gets complicated. This is when we get serious about isolation!
We can build a python package to solve that! In fact, there is a cookiecutter to create Python packages.
Once we create this package, we can install it in "editable" mode, which means that as we change the code, the changes will get picked up wherever the package is used. The process looks like:
# install cookiecutter first
pip install cookiecutter
cookiecutter https://github.com/wdm0006/cookiecutter-pipproject
cd package_name
pip install -e .
Now we can have a separate repository for this code and it can be used across projects without having to maintain code in multiple places.
Include tests.
The numpy.testing module provides useful assertion methods for values that are numerically close and for numpy arrays.
# data randomly generated from a normal distribution with a mean of 0
# should have a sample mean that's almost equal to 0, hence no error occurs
import numpy as np
data = np.random.normal(0.0, 1.0, 1000000)
np.testing.assert_almost_equal(np.mean(data), 0.0, decimal = 2)
Also check the docs for numpy.isclose and numpy.allclose. They are useful when making assertions about data, especially where small probabilistic changes or machine precision may result in numbers that aren't exactly equal. Consider using them instead of == for numbers involved in anything where randomness or floating point arithmetic may influence the results.
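A minimal sketch of why this matters, comparing exact equality with numpy.allclose:
import numpy as np

a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0 + 1e-9])

print(a == b)             # [False False], exact comparison trips over floating point error
print(np.allclose(a, b))  # True, the values are equal within a small tolerance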
engarde is a library that lets you practice defensive programming -- specifically with pandas DataFrame objects. It provides a set of decorators that check the return value of any function that returns a DataFrame and confirm that it conforms to the rules.
# pip install engarde
import engarde.decorators as ed
test_data = pd.DataFrame({'a': np.random.normal(0, 1, 100),
'b': np.random.normal(0, 1, 100)})
@ed.none_missing()
def process(dataframe):
dataframe.loc[10, 'a'] = 1 # change the 1 to np.nan and the code assertion will break
return dataframe
process(test_data).head()
engarde has an awesome set of decorators:
- none_missing - no NaNs (great for machine learning -- sklearn does not care for NaNs)
- has_dtypes - make sure the dtypes are what you expect
- verify - runs an arbitrary function on the dataframe
- verify_all - makes sure every element returns true for a given function

More can be found in the docs.
We can create a test suite with pytest to start checking the functions we've written. To pytest, test_ prefixed test functions or methods are test items. For more info, check the getting started guide.
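As a minimal sketch (the file and function names here are just placeholders), we can start with a deliberately failing test in a file such as test_features.py, simply to confirm that pytest discovers and runs it:
# test_features.py (hypothetical file name)
def test_pytest_is_working():
    # a deliberately failing assertion, just to confirm
    # that pytest picks up and runs this test file
    assert 1 + 1 == 3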
The term "test fixtures" refers to known objects or mock data used to put other pieces of the system to the test. We want these to have the same, known state every time.
For those familiar with unittest
, this might be data that you read in as part of the setUp
method. pytest
does things a bit differently; you define functions that return expected fixtures, and use a special decorator so that your tests automatically get passed the fixture data when you add the fixture function name as an argument.
We need to set up a way to get some data in here for testing. There are two basic choices — reading in the actual data or a known subset of it, or making up some smaller, fake data. You can choose whatever you think works best for your project.
Remove the failing test from above and copy the following into your testing file:
import os
import pytest
import pandas as pd
@pytest.fixture()
def df():
"""read in the raw data file and return the dataframe"""
pump_data_path = os.path.join('..', 'data', 'raw', 'pumps_train_values.csv')
df = pd.read_csv(pump_data_path)
return df
def test_df_fixture(df):
assert df.shape == (59400, 40)
useful_columns = ['amount_tsh', 'gps_height', 'longitude', 'latitude', 'region',
'population', 'construction_year', 'extraction_type_class',
'management_group', 'quality_group', 'source_type',
'waterpoint_type', 'status_group']
for column in useful_columns:
assert column in df.columns
We can then run py.test from the command line, in the directory where the testing code resides.
Version control: Use version control such as GitHub! Except for big data files, where you might turn to a cloud database, S3, etc. If you are in fact using GitHub, you might also be interested in the nbdime (diffing and merging of Jupyter Notebooks) project. It makes reviewing jupyter notebook changes so much easier.
Logging: Use logging to record the process instead of printing.
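A minimal sketch of what that could look like (the log file name and messages here are arbitrary):
import logging

# write log records to a file and echo them to the console
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s %(name)s %(levelname)s: %(message)s',
    handlers=[logging.FileHandler('pipeline.log'), logging.StreamHandler()])
logger = logging.getLogger(__name__)

logger.info('finished reading the raw data')
logger.warning('dropped %d rows containing invalid values', 42)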
Issue tracking: Keep track of the bugs. A minimal useful bug database should include:
- complete steps to reproduce the bug
- the expected behavior and the observed (buggy) behavior
- who it is assigned to
- whether it has been fixed or not
Set up a clear workflow: Establish the workflow before diving into the project. This includes using a unified file structure for the project, e.g. cookiecutter-data-science.
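If we go with cookiecutter-data-science, starting a new project with that structure is a single command (assuming cookiecutter is installed, as shown earlier):
cookiecutter https://github.com/drivendata/cookiecutter-data-science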
Use joblib for caching output:
You trained your neural network for three days and now you are ready to build on top of it. But you forgot to plug your laptop into a power source and it ran out of battery. So you scream: why didn't I pickle!? The answer is: because it is a pain in the back. Managing file names, checking if the file exists, saving, loading ... What to do instead? Use joblib.
from sklearn.externals.joblib import Memory
memory = Memory(cachedir='/tmp', verbose=0)
@memory.cache
def computation(p1, p2):
...
With three lines of code, we get caching of the output of any function. Joblib tracks the parameters passed to a function, and if the function has been called with the same parameters before, it returns the return value cached on disk instead of recomputing it.
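As a minimal sketch of the pattern above, using the standalone joblib package (the function and cache directory here are made up for illustration):
import time
from joblib import Memory

memory = Memory('/tmp/joblib_cache', verbose=0)

@memory.cache
def slow_square(x):
    time.sleep(2)  # stand-in for an expensive computation
    return x ** 2

slow_square(3)  # takes ~2 seconds and writes the result to disk
slow_square(3)  # returns almost instantly from the on-disk cache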
Jupyter notebook extensions:
The Jupyter notebook extensions project contains a collection of extensions that add functionality to the default jupyter notebook; a number of them are quite useful.
Taking some time to configure them will most likely make working with notebooks even more pleasant.