# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))
from formats import load_style
load_style(css_style='custom2.css', plot_style=False)
os.chdir(path)
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,matplotlib
The gist behind time series analysis is that we are given some quantitative measures about the past and we wish to use this information to predict the future, enabling better planning, decision-making and so on. The main difference between a time series problem and a traditional prediction problem is that in traditional prediction problems, such as image classification, the data points are assumed to be independent of one another. Time series data points, on the other hand, have a temporal nature to them: the time dimension adds an explicit ordering to our data points that should be preserved, because it can provide additional/important information to the learning algorithm.
This is not to say machine learning methods like supervised learning can't be used for time series forecasting, but before we apply these supervised learning methods to our time series data, we need to do some preprocessing steps to make them applicable. There are 4 classes of time-series-based features that we can create out of our time series dataset.
We give some examples of how the lag features and window features are constructed in the following two sections.
Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem by using previous time steps as input variables and the next time step as the output variable. Let's make this concrete with an example. Imagine we have a time series as follows:
time, measure
1, 100
2, 110
3, 108
4, 115
5, 120
Given the original data above, we can re-frame into a format that's applicable for a supervised learning model:
X, y
?, 100
100, 110
110, 108
108, 115
115, 120
120, ?
The use of prior time steps to predict the next time step is called the sliding window method; it is also sometimes referred to as the lag method. The number of previous time steps to look at is called the window width or the size of the lag. In the example above, we are using a window size of 1.
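As a minimal sketch, assuming the toy series above is stored in a pandas Series, the reframing with a window size of 1 can be done with the shift function (covered in more detail later in this notebook):
# toy univariate series from the table above
series = pd.Series([100, 110, 108, 115, 120])
# X is the value at the previous time step, y is the current value;
# the first row of X is NaN (shown as ? above) since there is no prior observation
supervised = pd.DataFrame({'X': series.shift(1), 'y': series})
supervised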
This sliding window approach forms the basis for how we can turn any time series dataset into a supervised learning problem and it can also be used on a time series that has more than one value, or so-called multivariate time series. An example of this is shown below:
Let's assume we have the contrived multivariate time series dataset below with two observed features at each time step. And we are concerned with predicting measure2. Then we would go from:
time, measure1, measure2
1, 0.2, 88
2, 0.5, 89
3, 0.7, 87
4, 0.4, 88
5, 1.0, 90
to:
X1, X2, y
?, ?, 88
0.2, 88, 89
0.5, 89, 87
0.7, 87, 88
0.4, 88, 90
1.0, 90, ?
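A quick sketch of this multivariate reframing, assuming the table above is stored in a pandas DataFrame:
# contrived multivariate series from the table above
df = pd.DataFrame({'measure1': [0.2, 0.5, 0.7, 0.4, 1.0],
                   'measure2': [88, 89, 87, 88, 90]})
# both measures at the previous time step become the inputs,
# the current measure2 becomes the output we wish to predict
supervised = pd.DataFrame({'X1': df['measure1'].shift(1),
                           'X2': df['measure2'].shift(1),
                           'y': df['measure2']})
supervised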
One step beyond adding raw lagged values is to add aggregated/summary statistics of values at previous time steps. The most common summary statistic is the mean. Let's use a univariate time series as an example.
window = pd.Series([100, 110, 108, 115, 120])
window
Pandas provides a few variants such as rolling, expanding and exponentially weighted windows for calculating these types of window statistics, e.g. the rolling() function creates a new data structure with the window of values at each time step. Here, we create a rolling window of size 3 and calculate the mean for each window. As we can see, the first non-NaN value is in the third row, which is calculated as the mean of the previous 3 records, (100 + 110 + 108) / 3.
window.rolling(window=3).mean()
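The other two variants mentioned above work in a similar fashion. As a quick sketch of the expanding and exponentially weighted versions of the same statistic (the span=3 below is just an illustrative choice):
# expanding window: mean of all observations seen so far
window.expanding().mean()
# exponentially weighted window: more recent observations receive larger weights
window.ewm(span=3).mean()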
If we are using pandas, one useful function that can help transform time series data into a format that's applicable for a supervised learning problem is the shift() function. Given a DataFrame, the shift() function (some other libraries call it lag) can be used to create copies of columns that are pushed forward or backward.
Let's first look at an example of the shift function in action. We start off by defining a toy time series dataset as a sequence of 10 numbers then use the shift function to create the "lagged" time series.
df = pd.DataFrame({'t': [x for x in range(10)]})
# shift all the observations up by one time step
df['t+1'] = df['t'].shift(-1)
df
Running the code chunk above gives us two columns in the dataset. The first contains the original observations and the second contains the shifted observations. Note that the last row has to be discarded because of the NaN value (there's no later value to shift up).
From the output above, we can see that shifting the series forward one time step gives us a primitive supervised learning problem, where the first row shows the input value of 0.0 corresponding to the output value of 1.0 in the second column.
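Shifting in the other direction works the same way. As a small sketch, a positive shift creates a lagged (t-1) column, which is the form we will use for the lag features later on:
# shift all the observations down by one time step,
# i.e. the value at time t-1 now sits on the same row as the value at time t
df['t-1'] = df['t'].shift(1)
df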
Moving on to an actual dataset, we will use real mobile game data that records the number of ads watched per hour.
ads_col = 'Ads'
time_col = 'Time'
input_path = os.path.join('data', 'ads.csv')
ads = pd.read_csv(input_path, index_col=time_col, parse_dates=[time_col])
print('dimension: ', ads.shape)
ads.head()
from pandas.plotting import register_matplotlib_converters
# required datetime converter for matplotlib plotting
register_matplotlib_converters()
plt.figure(figsize=(15, 7))
plt.plot(ads[ads_col])
plt.title('Ads watched (hourly data)')
plt.grid(True)
plt.show()
Given this 1-dimensional time series data, we will need to do some feature engineering to make it applicable for a downstream supervised learning method, including lag features of the target variable, the hour of day, the day of week, and a flag indicating whether the day falls on a weekend.
label_col = 'y'
data = pd.DataFrame(ads[ads_col].copy())
data.columns = [label_col]
# add the lag of the target variable from 6 steps back up to 24
for i in range(6, 25):
    data['lag_{}'.format(i)] = data[label_col].shift(i)
data['hour'] = data.index.hour
data['weekday'] = data.index.weekday
data['is_weekend'] = data['weekday'].isin([5, 6]).astype(int)
data = data.dropna()
data.head()
To generate the window statistics, we'll generate the mean of the lagged features we've created.
lag_cols = [col for col in data.columns if 'lag' in col]
data['rolling_mean'] = data[lag_cols].mean(axis=1)
# extract out the features and labels into separate variables
y = data[label_col].values
data = data.drop(label_col, axis=1)
X = data.values
feature_names = data.columns
print('dimension: ', X.shape)
data.head()
We didn't rely on the .rolling() method to generate our rolling mean, since we can compute it by taking the mean of the lagged features we've already generated, but the code chunk below shows how we can generate the same statistic using the .rolling() method.
ads[ads_col].shift(6).rolling(window=19).mean().dropna()
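As a sanity check (a quick sketch, assuming the indices line up after dropping missing values), we can confirm that the two approaches produce the same numbers:
rolling_mean_check = ads[ads_col].shift(6).rolling(window=19).mean().dropna()
# compare against the rolling mean computed from the lag features,
# restricted to the rows that survived the dropna in our feature DataFrame
np.allclose(data['rolling_mean'].values,
            rolling_mean_check.loc[data.index].values)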
Now that we've prepared the data, the next couple of code chunks train a RandomForest model on the training set (based on a time series train/test split), evaluate the mean absolute percentage error on the test set, and look at the feature importance that comes with the RandomForest model.
def timeseries_train_test_split(X, y, test_size):
    """Perform train-test split with respect to time series structure."""
    test_index = int(len(X) * (1 - test_size))
    X_train = X[:test_index]
    X_test = X[test_index:]
    y_train = y[:test_index]
    y_test = y[test_index:]
    return X_train, X_test, y_train, y_test


def mean_absolute_percentage_error(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
def plot_model_results(X, y, test_size=0.3, plot_intervals=False, plot_anomalies=False):
    """
    - Plots modelled vs original values.
    - Prediction intervals (95% confidence interval).
    - Anomalies (points that reside outside the confidence interval).
    """
    X_train, X_test, y_train, y_test = timeseries_train_test_split(X, y, test_size)

    # we are using random forest here, feel free to swap this out
    # with your favorite regression model
    model = RandomForestRegressor(max_depth=6, n_estimators=50)
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)

    plt.figure(figsize=(15, 7))
    x = range(prediction.size)
    plt.plot(x, prediction, label='prediction', linewidth=2.0)
    plt.plot(x, y_test, label='actual', linewidth=2.0)

    if plot_intervals:
        timeseries_cv = TimeSeriesSplit(n_splits=5)
        cv = cross_val_score(model, X_train, y_train,
                             cv=timeseries_cv, scoring='neg_mean_absolute_error')
        mae = -1 * cv.mean()
        deviation = cv.std()

        # hard-coded to be 95% confidence interval
        scale = 1.96
        margin_error = mae + scale * deviation
        lower = prediction - margin_error
        upper = prediction + margin_error

        fill_alpha = 0.2
        fill_color = '#66C2D7'
        plt.fill_between(x, lower, upper, color=fill_color, alpha=fill_alpha, label='95% CI')

        if plot_anomalies:
            anomalies = np.array([np.nan] * len(y_test))
            anomalies[y_test < lower] = y_test[y_test < lower]
            anomalies[y_test > upper] = y_test[y_test > upper]
            plt.plot(anomalies, 'o', markersize=10, label='Anomalies')

    error = mean_absolute_percentage_error(y_test, prediction)
    plt.title('Mean absolute percentage error {0:.2f}%'.format(error))
    plt.legend(loc='best')
    plt.tight_layout()
    plt.grid(True)
    return model
model = plot_model_results(X, y, plot_intervals=True)
def vis_importance(estimator, feature_names, threshold=0.05, filtered_names=None):
    """
    Visualize the relative importance of predictors.

    Parameters
    ----------
    estimator : sklearn-like ensemble tree model
        A tree estimator that contains the attribute
        ``feature_importances_``.

    feature_names : str 1d array or list[str]
        Feature names that correspond to the
        feature importance.

    threshold : float, default 0.05
        Features that have importance scores lower than this
        threshold will not be presented in the plot, this assumes
        the feature importances sum up to 1.

    filtered_names : str 1d array or list[str], default None
        Feature names that we wish to exclude from the visualization
        regardless of whether they were in the top features or not.
    """
    if not hasattr(estimator, 'feature_importances_'):
        msg = '{} does not have the feature_importances_ attribute'
        raise ValueError(msg.format(estimator.__class__.__name__))

    imp = estimator.feature_importances_
    feature_names = np.array(feature_names)
    if filtered_names is not None:
        keep = ~np.in1d(feature_names, filtered_names, assume_unique=True)
        mask = np.logical_and(imp > threshold, keep)
    else:
        mask = imp > threshold

    importances = imp[mask]
    idx = np.argsort(importances)
    scores = importances[idx]
    names = feature_names[mask]
    names = names[idx]
    y_pos = np.arange(1, len(scores) + 1)

    if hasattr(estimator, 'estimators_'):
        # apart from the mean feature importance, for scikit-learn we can access
        # each individual tree's feature importance and compute the standard deviation
        tree_importances = np.array([tree.feature_importances_
                                     for tree in estimator.estimators_])
        importances_std = np.std(tree_importances[:, mask], axis=0)
        scores_std = importances_std[idx]
        plt.barh(y_pos, scores, align='center', xerr=scores_std)
    else:
        plt.barh(y_pos, scores, align='center')

    plt.yticks(y_pos, names)
    plt.xlabel('Importance')
    plt.title('Feature Importance Plot')
# change default style figure and font size
plt.rcParams['figure.figsize'] = 8, 6
plt.rcParams['font.size'] = 12
vis_importance(model, feature_names, threshold=0.01)
Looking at the feature importance plot, the most important feature is lag_24 (each lag is an hour for this dataset), which is not a surprise as yesterday's value is usually a good indicator of what the value is going to be today ...
Hopefully, after reading through this documentation, you have an understanding of how we can re-frame a time series problem into a supervised machine learning problem and perform model training and forecasting using supervised learning models.
Side Note:
In communicating a forecasting result, a useful summary can be something like:
"If nothing unexpected happens we expect to be within ±x %, but if assumptions a, b, or c perform differently than expected, we might be as much as ±y % off."