In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', 'notebook_format'))

from formats import load_style
load_style(plot_style = False)
Out[1]:
In [2]:
os.chdir(path)

import os
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from time import time
from collections import Counter
from keras.utils import to_categorical
from keras.utils.data_utils import get_file
from keras.models import Sequential, load_model
from keras.layers import Embedding, LSTM, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint

# 1. magic so that the notebook will reload external python modules
# 2. magic for inline plot
# 3. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
# 4. magic to print version
%load_ext autoreload
%autoreload 2
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
%load_ext watermark
%watermark -a 'Ethen' -d -t -v -p keras,numpy,matplotlib,tensorflow
Using TensorFlow backend.
Ethen 2018-05-07 20:18:55 

CPython 3.6.4
IPython 6.3.1

keras 2.1.5
numpy 1.14.3
matplotlib 2.2.2
tensorflow 1.7.0

Keras RNN (Recurrent Neural Network) - Language Model

Language Modeling (LM) is one of the foundational tasks in the realm of natural language processing (NLP). At a high level, the goal is to predict the (n + 1)-th token in a sequence given the n tokens preceding it. Well-trained language models are used in applications such as machine translation and speech recognition, or, to name a more concrete product, predictive keyboards such as SwiftKey.
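
As a rough formal sketch of the goal stated above (the notation $w_t$ for the $t$-th token and $T$ for the sequence length is introduced here for illustration and is not part of the original text), a language model factorizes the probability of a sequence with the chain rule, and a fixed-context model approximates each factor using only the $n$ preceding tokens:

$$
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-n}, \dots, w_{t-1})
$$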

A language model can operate at the word level, sub-word level or character level, each having its own set of benefits and challenges. In practice, word-level LMs tend to perform better than character-level LMs, but suffer from increased computational cost due to large vocabulary sizes. They also require more data preprocessing, such as dealing with infrequent words and out-of-vocabulary words. Character-level LMs, on the other hand, do not face these issues, as the vocabulary only consists of a limited set of characters. This, however, is not without drawbacks. Character-level LMs are more prone to vanishing gradient problems: given the sentence "I am happy", a word-level LM would treat it as 3 time steps (3 words/tokens), while a character-level LM would treat it as 8 time steps (8 characters, not counting spaces). Hence, as the number of words in a sentence increases, the number of time steps that a character-level LM needs to capture becomes substantially higher than that of a word-level LM. To sum up, the distinction between word-level LMs and character-level LMs means that achieving state-of-the-art results on the two tasks often requires different network architectures, and these architectures are usually not readily transferable.
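
The time-step comparison above can be checked with a few lines of plain Python. This is only a sketch: splitting on whitespace and skipping spaces are assumptions made to match the counts quoted in the text.

```python
sentence = "I am happy"

# word-level: one time step per whitespace-separated token
word_steps = sentence.split()

# character-level: one time step per character; spaces are skipped here
# to match the count used in the text (including them would give 10)
char_steps = [ch for ch in sentence if not ch.isspace()]

print(len(word_steps))  # 3
print(len(char_steps))  # 8
```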

Implementation

This documentation demonstrates the basic workflow of:

  • Preparing text for developing a word-level language model.
  • Training a neural network that contains an embedding and an LSTM layer, then using the learned model to generate new text with similar properties as the input text.
In [3]:
def elapsed(sec):
    """
    Converts elapsed time into a more human readable format.

    Examples
    --------
    from time import time

    start = time()
    # do something that's worth timing, like training a model
    elapse = time() - start
    elapsed(elapse)
    """
    if sec < 60:
        return str(sec) + ' seconds'
    elif sec < (60 * 60):
        return str(sec / 60) + ' minutes'
    else:
        return str(sec / (60 * 60)) + ' hours'
In [4]:
path = get_file('nietzsche.txt', origin = 'https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with open(path, encoding = 'utf-8') as f:
    raw_text = f.read()

print('corpus length:', len(raw_text))
print('example text:', raw_text[:150])
Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt
606208/600901 [==============================] - 1s 2us/step
corpus length: 600893
example text: PREFACE


SUPPOSING that Truth is a woman--what then? Is there not ground
for suspecting that all philosophers, in so far as they have been
dogmatists

As with all text analysis, there are many preprocessing steps that need to be done to make the corpus ready for downstream modeling. Here we'll stick to some really basic ones, as preprocessing is not the main focus. The steps include:

  • Splitting the text into words/tokens based on whitespace. From the first few words, we can see that some words are joined by "--", hence we'll replace that with a space first.
  • Removing punctuation marks, retaining only alphabetical words and converting them to lowercase.
In [5]:
# ideally, we would save the cleaned text, to prevent
# doing this step every single time
tokens = raw_text.replace('--', ' ').split()
cleaned_tokens = []
table = str.maketrans('', '', string.punctuation)
for word in tokens:
    word = word.translate(table)
    if word.isalpha():
        cleaned_tokens.append(word.lower())

print('sampled original text: ', tokens[:10])
print('sampled cleaned text: ', cleaned_tokens[:10])
sampled original text:  ['PREFACE', 'SUPPOSING', 'that', 'Truth', 'is', 'a', 'woman', 'what', 'then?', 'Is']
sampled cleaned text:  ['preface', 'supposing', 'that', 'truth', 'is', 'a', 'woman', 'what', 'then', 'is']

The next step is to map each distinct word to an integer, so we can convert the words into integers and feed them into our model later.

In [6]:
# build up vocabulary,
# rare words will also be considered out of vocabulary words,
# this will be represented by an unknown token
min_count = 2
unknown_token = '<unk>'
word2index = {unknown_token: 0}
index2word = [unknown_token]

filtered_words = 0
counter = Counter(cleaned_tokens)
for word, count in counter.items():
    if count >= min_count:
        index2word.append(word)
        word2index[word] = len(word2index)
    else:
        filtered_words += 1

num_classes = len(word2index)
print('vocabulary size: ', num_classes)
print('filtered words: ', filtered_words)
vocabulary size:  5090
filtered words:  5097
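
To make the effect of the unknown token concrete, here is a small self-contained sketch of the same idea on a toy sentence. The helper `build_vocab` and the `toy_` variables are hypothetical names introduced for illustration, so they do not overwrite the notebook's own `word2index` and `index2word`.

```python
from collections import Counter

def build_vocab(tokens, min_count=2, unknown_token='<unk>'):
    """Give every token seen at least min_count times an integer id; id 0 is the unknown token."""
    index2word = [unknown_token]
    word2index = {unknown_token: 0}
    for word, count in Counter(tokens).items():
        if count >= min_count:
            word2index[word] = len(index2word)
            index2word.append(word)
    return word2index, index2word

toy_tokens = 'the truth is the truth and the rest is noise'.split()
toy_word2index, toy_index2word = build_vocab(toy_tokens)

# words that occur only once fall back to index 0, i.e. '<unk>'
toy_encoded = [toy_word2index.get(word, 0) for word in toy_tokens]
toy_decoded = [toy_index2word[i] for i in toy_encoded]
print(toy_encoded)  # [1, 2, 3, 1, 2, 0, 1, 0, 3, 0]
print(toy_decoded)
```

Decoding does not recover the rare words: every word filtered out by `min_count` comes back as `<unk>`, which is the price paid for a smaller vocabulary.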

Recall that a language model's task is to take $n$ words and predict the $(n + 1)$-th word, hence a key design decision is how long the input sequence should be. There is no one-size-fits-all solution to this problem. Here, we split the text into sub-sequences with a fixed length of 40 words, moving the window forward 3 words at a time, and map each word to its index.
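
As a toy illustration of this windowing, the sketch below uses a hypothetical helper `make_windows` on ten single-letter "words" with a much smaller window than the one used in this notebook, so the result can be read at a glance.

```python
def make_windows(tokens, maxlen, step):
    """Return (input window, next token) pairs produced by a sliding window."""
    pairs = []
    for i in range(0, len(tokens) - maxlen, step):
        pairs.append((tokens[i:i + maxlen], tokens[i + maxlen]))
    return pairs

toy_letters = list('abcdefghij')  # 10 single-letter "words"
toy_pairs = make_windows(toy_letters, maxlen=4, step=3)
for window, target in toy_pairs:
    print(window, '->', target)
# ['a', 'b', 'c', 'd'] -> e
# ['d', 'e', 'f', 'g'] -> h
```

Because the step is smaller than the window length, consecutive windows overlap; a smaller step yields more training examples at the cost of more redundancy between them.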

In [7]:
# create semi-overlapping sequences of words with
# a fixed length specified by the maxlen parameter
step = 3
maxlen = 40
X = []
y = []
for i in range(0, len(cleaned_tokens) - maxlen, step):
    sentence = cleaned_tokens[i:i + maxlen]
    next_word = cleaned_tokens[i + maxlen]
    X.append([word2index.get(word, 0) for word in sentence])
    y.append(word2index.get(next_word, 0))

# keras expects the target to be in one-hot encoded format,
# ideally we would use a generator that performs this conversion
# only on the batch of data that is currently required by the model
# to be more memory-efficient
X = np.array(X)
Y = to_categorical(y, num_classes)
print('sequence dimension: ', X.shape)
print('target dimension: ', Y.shape)
print('example sequence:\n', X[0])
sequence dimension:  (33342, 40)
target dimension:  (33342, 5090)
example sequence:
 [ 1  2  3  4  5  6  7  8  9  5 10 11 12 13  0  3 14 15 16 17 18 19 20 21
 22 23 21 24 25 26 27  3 28 29 30 31 32  0 33 34]
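
The code comment above mentions that a generator could one-hot encode the targets one batch at a time. A minimal sketch of that idea follows; `batch_generator` and the toy arrays are hypothetical, and the sketch leaves out shuffling and a validation split.

```python
import numpy as np

def batch_generator(X, y, num_classes, batch_size):
    """Yield (inputs, one-hot targets) batches forever, encoding targets lazily."""
    y = np.asarray(y)
    n_samples = X.shape[0]
    while True:
        for start in range(0, n_samples, batch_size):
            batch_x = X[start:start + batch_size]
            batch_y = y[start:start + batch_size]
            # one-hot encode only the targets of the current batch
            one_hot = np.zeros((len(batch_y), num_classes), dtype=np.float32)
            one_hot[np.arange(len(batch_y)), batch_y] = 1.0
            yield batch_x, one_hot

toy_X = np.arange(12).reshape(6, 2)
toy_y = [0, 2, 1, 2, 0, 1]
toy_gen = batch_generator(toy_X, toy_y, num_classes=3, batch_size=4)
first_x, first_y = next(toy_gen)
print(first_x.shape, first_y.shape)  # (4, 2) (4, 3)
```

With the Keras version used here, such a generator would presumably be passed to `fit_generator` together with a `steps_per_epoch` value. Another option worth considering is the `sparse_categorical_crossentropy` loss, which accepts integer targets directly and avoids the one-hot matrix altogether.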
In [8]:
# define the network architecture: an embedding followed by LSTM
embedding_size = 50
lstm_size = 256
model1 = Sequential()
model1.add(Embedding(num_classes, embedding_size, input_length = maxlen))
model1.add(LSTM(lstm_size))
model1.add(Dense(num_classes, activation = 'softmax'))
model1.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
print(model1.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 40, 50)            254500    
_________________________________________________________________
lstm_1 (LSTM)                (None, 256)               314368    
_________________________________________________________________
dense_1 (Dense)              (None, 5090)              1308130   
=================================================================
Total params: 1,876,998
Trainable params: 1,876,998
Non-trainable params: 0
_________________________________________________________________
None
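
The parameter counts in the summary above can be reproduced by hand. The sketch below uses the standard formulas for each layer type; the helper function names are hypothetical.

```python
def embedding_params(vocab_size, embedding_size):
    # one embedding vector per word in the vocabulary
    return vocab_size * embedding_size

def lstm_params(input_size, hidden_size):
    # 4 sets of weights (input, forget, output gates and the cell candidate),
    # each with input weights, recurrent weights and a bias
    return 4 * ((input_size + hidden_size) * hidden_size + hidden_size)

def dense_params(input_size, output_size):
    # weight matrix plus one bias per output unit
    return input_size * output_size + output_size

vocab_size, embedding_size, lstm_size = 5090, 50, 256
counts = [
    embedding_params(vocab_size, embedding_size),
    lstm_params(embedding_size, lstm_size),
    dense_params(lstm_size, vocab_size),
]
print(counts, sum(counts))  # [254500, 314368, 1308130] 1876998
```

Most of the parameters sit in the final Dense layer, which is a direct consequence of the vocabulary size: every extra word adds 257 weights to the output layer.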
In [9]:
def build_model(model, address = None):
    """
    Fit the model if the model checkpoint does not exist or else
    load it from that address.
    """
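    # train when no checkpoint address is given or the checkpoint file is missing
    if address is None or not os.path.isfile(address):
        stop = EarlyStopping(monitor = 'val_loss', min_delta = 0,
                             patience = 5, verbose = 1, mode = 'auto')
        callbacks = [stop]
        if address is not None:
            # only checkpoint the best model when there is somewhere to save it
            save = ModelCheckpoint(address, monitor = 'val_loss',
                                   verbose = 0, save_best_only = True)
            callbacks.append(save)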

        start = time()
        history = model.fit(X, Y, batch_size = batch_size, 
                            epochs = epochs, verbose = 1,
                            validation_split = validation_split,
                            callbacks = callbacks)
        elapse = time() - start
        print('elapsed time: ', elapsed(elapse))
        model_info = {'history': history, 'elapse': elapse, 'model': model}
    else:
        model = load_model(address)
        model_info = {'model': model}

    return model_info
  

epochs = 40
batch_size = 32
validation_split = 0.2
address1 = 'lstm_weights1.hdf5'
print('model checkpoint address: ', address1)
model_info1 = build_model(model1, address1)
model checkpoint address:  lstm_weights1.hdf5
Train on 26673 samples, validate on 6669 samples
Epoch 1/40
26673/26673 [==============================] - 102s 4ms/step - loss: 6.3431 - val_loss: 6.2005
Epoch 2/40
26673/26673 [==============================] - 101s 4ms/step - loss: 5.8272 - val_loss: 6.1464
Epoch 3/40
26673/26673 [==============================] - 102s 4ms/step - loss: 5.5813 - val_loss: 6.2016
Epoch 4/40
26673/26673 [==============================] - 103s 4ms/step - loss: 5.3414 - val_loss: 6.2657
Epoch 5/40
26673/26673 [==============================] - 103s 4ms/step - loss: 5.0778 - val_loss: 6.3739
Epoch 6/40
26673/26673 [==============================] - 100s 4ms/step - loss: 4.7756 - val_loss: 6.5210
Epoch 7/40
26673/26673 [==============================] - 102s 4ms/step - loss: 4.4323 - val_loss: 6.6638
Epoch 00007: early stopping
elapsed time:  11.904992830753326 minutes
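
To help read the training log above, the following sketch replays the logged validation losses through a patience rule and converts the best validation loss into perplexity. The function `early_stopping_epoch` is a hypothetical re-implementation meant to mirror how patience-based stopping generally works, not the library's actual code.

```python
import math

# validation losses copied from the training log above
val_losses = [6.2005, 6.1464, 6.2016, 6.2657, 6.3739, 6.5210, 6.6638]

def early_stopping_epoch(losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), 1-indexed; stop_epoch is None if never triggered."""
    best, best_epoch, wait = float('inf'), None, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best - min_delta:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=5)
print(stop_epoch, best_epoch)  # 7 2

# cross entropy (in nats) to perplexity, compared against uniform guessing
perplexity = math.exp(val_losses[best_epoch - 1])  # roughly 467
uniform_loss = math.log(5090)                      # roughly 8.5
print(perplexity, uniform_loss)
```

The validation loss bottoms out at epoch 2 and rises afterwards while the training loss keeps falling, which is the usual signature of overfitting. Since the checkpoint callback was configured with save_best_only = True, the file on disk should hold the epoch 2 weights, whereas the model object in memory carries the weights from the final epoch.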

To test the trained model, one can compare the model's predicted words against the actual words in the dataset.

In [10]:
def check_prediction(model, num_predict):
    true_print_out = 'Actual words: '
    pred_print_out = 'Predicted words: '
    for i in range(num_predict):
        x = X[i]
        prediction = model.predict(x[np.newaxis, :], verbose = 0)
        index = np.argmax(prediction)
        true_print_out += index2word[y[i]] + ' '
        pred_print_out += index2word[index] + ' '

    print(true_print_out)
    print(pred_print_out)


num_predict = 10
model = model_info1['model']
check_prediction(model, num_predict)
Actual words: they paid to been unseemly <unk> certainly never to and 
Predicted words: the paid to to <unk> the are been to and 

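Despite not being a perfect match, we can see that there is still a rough correspondence between the predicted tokens and the actual ones. Keep in mind that these first 10 sequences belong to the training portion of the data (the validation split is taken from the end), so this check is not an estimate of held-out performance. Training a network that performs better at language modeling requires a much larger corpus, as well as more training and optimization. But, hopefully, this notebook has given us a basic understanding of the general process of building a language model.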
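
The same one-step prediction can be chained to generate new text: predict a distribution over the next word, sample from it, append the sampled word to the context and repeat. The sketch below shows this loop with temperature sampling, which is a common choice but an assumption here. `sample_index`, `generate` and `toy_predict` are hypothetical names, and `toy_predict` is only a stand-in for a trained network.

```python
import math
import random

def sample_index(probs, temperature=1.0, rng=random):
    """Sample an index from a probability vector after temperature rescaling."""
    # rescale log-probabilities; a lower temperature sharpens the distribution
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(logit - peak) for logit in logits]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

def generate(predict_fn, seed_indices, num_words, temperature=1.0, rng=random):
    """Repeatedly sample the next word index and slide the context window forward."""
    context = list(seed_indices)
    generated = []
    for _ in range(num_words):
        probs = predict_fn(context)
        index = sample_index(probs, temperature, rng)
        generated.append(index)
        # drop the oldest word so the context keeps a fixed length
        context = context[1:] + [index]
    return generated

def toy_predict(context, vocab_size=5):
    """Stand-in for a trained model: favors the index following the last one."""
    probs = [0.05] * vocab_size
    probs[(context[-1] + 1) % vocab_size] = 0.8
    return probs

print(generate(toy_predict, seed_indices=[0, 1, 2], num_words=8,
               temperature=0.5, rng=random.Random(0)))
```

With the model trained in this notebook, `predict_fn` could plausibly wrap `model.predict` on a single sequence of length `maxlen` (for example one seeded with `X[0]`), and the generated indices could be mapped back to words through `index2word`. That wiring is untested here.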

The following section lists out some ideas worth trying:

  • Sentence-wise model. When generating the sub-sequences for the language model, we could first perform sentence detection to split the documents into sentences, then pad each sentence to a fixed length (e.g. the length of the longest sentence).
  • Simplify vocabulary. Perform further text preprocessing such as removing stop words or stemming.
  • Hyperparameter tuning, e.g. the size of the embedding layer and the LSTM layer, adding dropout, etc. See if a different hyperparameter setting leads to a better model. Note that if we wish to build a stacked LSTM using keras, some changes to the code above are required, as elaborated below:

When stacking LSTM layers, rather than passing only the last hidden state to the next layer (as we did with the Dense layer), all the hidden states are used as input to the subsequent LSTM layer. In other words, the lower layer of a stacked LSTM produces an output for every time step, as opposed to 1 output for the whole sequence. The diagram depicts the pattern for what 2 layers would look like:

[Figure: 2-layer stacked LSTM]

The next couple of code chunks illustrate the difference. Suppose we have two input examples (batch size of 2), both having a fixed length of 3 time steps.

In [11]:
from keras.models import Model
from keras.layers import Input

# using keras' functional API
seq_len = 3
n_features = 1
hidden_size = 4
data = np.array([[0.1, 0.2, 0.3], [0.15, 0.45, 0.25]]).reshape(-1, seq_len, n_features)

inputs = Input(shape = (seq_len, n_features))
lstm = LSTM(hidden_size)(inputs)
model = Model(inputs = inputs, outputs = lstm)
prediction = model.predict(data)
print('dimension: ', prediction.shape)
prediction
dimension:  (2, 4)
Out[11]:
array([[ 0.06472953,  0.07302882, -0.0113507 , -0.05956742],
       [ 0.08739619,  0.09793343, -0.01274136, -0.07241688]],
      dtype=float32)

Looking at the output of the LSTM layer, we can see that it outputs a single (the last) hidden state for each input sequence. To build a stacked LSTM, we need access to the hidden state at every time step. This can be done by setting the return_sequences argument to True when defining our LSTM layer, as shown below:

In [12]:
inputs = Input(shape = (seq_len, n_features))
lstm = LSTM(hidden_size, return_sequences = True)(inputs)
model = Model(inputs = inputs, outputs = lstm)

# three-dimensional output, so apart from the batch size and
# lstm hidden layer's size there's also an additional dimension
# for the number of time steps
prediction = model.predict(data)
print('dimension: ', prediction.shape)
prediction
dimension:  (2, 3, 4)
Out[12]:
array([[[-0.01126682, -0.0064276 ,  0.00249478, -0.00055401],
        [-0.03018942, -0.01813748,  0.00614582, -0.00209547],
        [-0.05421135, -0.03431558,  0.01020968, -0.0047514 ]],

       [[-0.01700315, -0.00955516,  0.00374703, -0.00082847],
        [-0.06392946, -0.03527492,  0.01297658, -0.00386212],
        [-0.06742323, -0.04802553,  0.01229313, -0.00726138]]],
      dtype=float32)

When stacking LSTM layers, we should specify return_sequences = True so that the next LSTM layer has access to all the previous layer's hidden states.
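
The reason the next layer needs the full sequence can be seen in a framework-free toy. The sketch below uses a plain single-unit tanh recurrence rather than an actual LSTM, with arbitrary weights chosen for illustration; `simple_rnn` is a hypothetical function that mimics only the meaning of the return_sequences flag.

```python
import math

def simple_rnn(sequence, w_in=0.5, w_rec=0.8, return_sequences=False):
    """Toy single-unit recurrent layer: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h = 0.0
    states = []
    for x in sequence:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    # either every hidden state or only the last one
    return states if return_sequences else states[-1]

sequence = [0.1, 0.2, 0.3]
all_states = simple_rnn(sequence, return_sequences=True)
last_state = simple_rnn(sequence)

# a second recurrent layer needs one input per time step, so it consumes
# the full list of hidden states produced by the first layer
stacked_output = simple_rnn(all_states)
print(len(all_states), all_states[-1] == last_state)  # 3 True
```

Feeding `last_state` (a single number) into the second layer would leave it with nothing to iterate over, which is the toy analogue of the shape mismatch a stacked LSTM would run into without return_sequences = True.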

In [13]:
# two-layer LSTM example, this is not trained
model2 = Sequential()
model2.add(Embedding(num_classes, embedding_size, input_length = maxlen))
model2.add(LSTM(256, return_sequences = True))
model2.add(LSTM(256))
model2.add(Dense(num_classes, activation = 'softmax'))
model2.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
print(model2.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 40, 50)            254500    
_________________________________________________________________
lstm_4 (LSTM)                (None, 40, 256)           314368    
_________________________________________________________________
lstm_5 (LSTM)                (None, 256)               525312    
_________________________________________________________________
dense_2 (Dense)              (None, 5090)              1308130   
=================================================================
Total params: 2,402,310
Trainable params: 2,402,310
Non-trainable params: 0
_________________________________________________________________
None

Reference