# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(css_style='custom2.css', plot_style=False)
os.chdir(path)
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
import os
import torch
import evaluate
import datasets
import collections
import transformers
import numpy as np
import pandas as pd
import torch.nn as nn
import torch.nn.functional as F
from tqdm.auto import tqdm
from time import perf_counter
from torch.utils.data import DataLoader
from datasets import (
load_dataset,
disable_progress_bar
)
from transformers import (
pipeline,
Trainer,
TrainingArguments,
AutoTokenizer,
AutoModelForQuestionAnswering,
DataCollatorWithPadding,
EarlyStoppingCallback,
IntervalStrategy
)
device = "cuda" if torch.cuda.is_available() else "cpu"
cache_dir = None
%watermark -a 'Ethen' -d -u -iv
In this document, we'll go over how to train an extractive question answering model using a pre-trained language encoder model via Hugging Face's transformers library, covering the input preprocessing and output postprocessing details this task requires along the way.
There are many different forms of question answering, but the one we will be discussing today is termed open book extractive question answering. Open book means our model is allowed to retrieve relevant information from some context, similar to open book exams where students can refer to their books during the exam; in this setup, our model can look up information from external sources. Extractive means our model will extract the most relevant span of text, or snippet, from these contexts to answer an incoming question. Although span-based answers are more constrained than free-form answers, they come with the benefit of being easier to evaluate.
Similar to a lot of modern recommendation systems out there, there are three main components to these types of systems: a vector database that stores our data encoded in vector representations, a retrieval model that efficiently retrieves the top-N contexts, and lastly a reader model that identifies the answer span within those contexts. In this document, we'll be focusing on the reader model.
To piggyback on today's pre-trained language models for reader model fine-tuning, we need two inputs, a question and a context, as well as two labels identifying the answer's start and end positions within that context. The following diagram depicts this notion very nicely [4].
Slightly more formally, after feeding our input sequence through an encoder layer and obtaining the embedding vector $\mathbf{h}^{(i)}$ for every $i$-th token, we learn two additional weights, one for the start position, $\mathbf{W}_s$, and the other for the end position, $\mathbf{W}_e$. For each token, these two weights define the probability of it being the start and end position: $\text{softmax}(\mathbf{h}^{(i)}\mathbf{W}_s)$ and $\text{softmax}(\mathbf{h}^{(i)}\mathbf{W}_e)$.
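As a rough sketch of this idea (a simplified illustration rather than the exact internals of AutoModelForQuestionAnswering; the class name SpanPredictionHead and the shapes here are made up for demonstration), the two weights can be packed into a single linear layer that maps each token's hidden vector to a start logit and an end logit:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanPredictionHead(nn.Module):
    """Toy head that turns per-token encoder outputs into start/end distributions."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # one linear layer with 2 outputs plays the role of both W_s and W_e
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch_size, seq_len, hidden_size) from the encoder
        logits = self.qa_outputs(hidden_states)           # (batch_size, seq_len, 2)
        start_logits, end_logits = logits.unbind(dim=-1)  # each (batch_size, seq_len)
        # softmax over the sequence dimension: probability of each token
        # being the start / end of the answer span
        return F.softmax(start_logits, dim=-1), F.softmax(end_logits, dim=-1)

# usage with random "encoder outputs" standing in for h^(i)
head = SpanPredictionHead(hidden_size=768)
start_probs, end_probs = head(torch.randn(2, 384, 768))
print(start_probs.shape, end_probs.shape)  # torch.Size([2, 384]) torch.Size([2, 384])

During fine-tuning, the cross entropy losses of the predicted start and end positions against their labels are averaged, which is essentially what Hugging Face's question answering heads do under the hood.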
The dataset we'll be using is SQuAD (Stanford Question Answering Dataset). Each record contains a question, a context, and potentially an answer, where the answer to every question is a segment of text, a.k.a. a span, from the corresponding context. We can decide whether to experiment with SQuAD or SQuAD 2.0. SQuAD 2.0 is a superset of the original dataset that adds unanswerable questions. This makes it more challenging to do well on version 2.0: not only does the model need to identify correct answers, it also needs to determine when no answer is supported by a given context and abstain from spitting out unreliable guesses.
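For reference, an unanswerable SQuAD 2.0 style record simply carries empty answers lists; the record below is made up purely to illustrate the format, and the preprocessing function later checks for this empty answer_start list and labels such cases with the [CLS] index.

# a made-up record mimicking the SQuAD 2.0 format for an unanswerable question:
# "answers" holds empty lists, which downstream preprocessing detects via
# len(answer["answer_start"]) == 0
unanswerable_example = {
    "id": "hypothetical-0001",
    "title": "Example",
    "question": "What color is the number seven?",
    "context": "SQuAD 2.0 adds questions that cannot be answered from the provided context.",
    "answers": {"text": [], "answer_start": []},
}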
# experiment with different public model checkpoints
model_checkpoint = "distilbert-base-uncased"
task_name = "squad" # "squad_v2"
datasets = load_dataset(task_name, cache_dir=cache_dir)
datasets
Printing out a sample, hopefully the field names are all quite self explanatory. The one thing worth clarifying is that the answer_start field contains the starting character index of each answer inside the corresponding context.
datasets["train"][0]
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, cache_dir=cache_dir)
After passing raw text through a tokenizer, a single word can be split into multiple tokens. e.g. in the example below, @huggingface is split into three tokens: @, hugging, and ##face. This can cause some issues for our token level labels, as our original label was mapped to the single word @huggingface. To resolve this, we'll use the offsets mapping returned by the tokenizer, which gives us, for each token, a tuple indicating its start and end character position relative to the original text it was split from. For special tokens, the offset mapping's start and end positions are both set to 0.
word = "@huggingface"
tokenized = tokenizer(word, return_offsets_mapping=True)
tokenized
def convert_id_to_string(tokenizer, input_ids):
strings = []
for input_id in input_ids:
string = tokenizer.convert_ids_to_tokens(input_id)
strings.append(string)
return strings
def convert_offset_mapping_to_string(offset_mapping, word):
strings = []
for offset in offset_mapping:
start = offset[0]
end = offset[1]
if end != 0:
strings.append(word[start:end])
return strings
# excluding special tokens, the two should be identical
strings = convert_id_to_string(tokenizer, tokenized["input_ids"])
print("input ids' string: ", strings)
strings = convert_offset_mapping_to_string(tokenized["offset_mapping"], word)
print("offset mapping string: ", strings)
Another preprocessing detail specific to question answering is how to deal with long documents. In many other tasks, we typically truncate documents that are longer than our model's maximum sequence length, but here, removing parts of the context might mean losing the section of the document that contains our answer. To deal with this, we allow one (long) example in our dataset to produce several input features by turning on return_overflowing_tokens. Commonly referred to as chunks, each chunk's length will be shorter than the model's maximum length (a configurable hyper-parameter). Also, in case an answer lies right at the point where we split a long context, we allow some overlap between chunks/features, controlled by a hyper-parameter doc_stride, sometimes known as a sliding window.
examples = [
"We are going to split this sentence",
"This sentence is longer, we are also going to split it"
]
tokenized = tokenizer(
examples,
truncation=True,
return_overflowing_tokens=True,
max_length=6,
stride=2
)
print("number of examples: ", len(examples))
print("number of tokenized features: ", len(tokenized["input_ids"]))
tokenized
Our two input sentences/examples have been split into 8 tokenized features. From the overflow_to_sample_mapping field, we can see which original example each of these 8 features maps to.
# if we print out the batched input ids, we'll see each one
# of our sentences has been split into multiple chunks/features
for input_id, sample_mapping in zip(tokenized["input_ids"], tokenized["overflow_to_sample_mapping"]):
    chunk = tokenizer.decode(input_id)
    print("Chunk: ", chunk)
    print("Original input: ", examples[sample_mapping])
The last thing we'll mention is the sequence_ids attribute. When feeding pairs of inputs to a tokenizer, we can use it to distinguish the first and second portion of a given input. In question answering, this is helpful for identifying whether the predicted answer's start and end positions fall inside the context portion of a given document, instead of the question portion. If we look at a sample output, we'll notice that special tokens are mapped to None, whereas our context, which is passed as the second part of our paired input, receives a value of 1.
tokenized = tokenizer(
["question section"],
["context section"]
)
tokenized.sequence_ids(0)
Upon introducing these advanced tokenizer usages, the next few code cells showcase how to put them to use, and create a function for preprocessing our question answering dataset into a format that's suited for downstream modeling. Note:

- truncation="only_second" ensures that when a question plus context pair exceeds the maximum length, only the context (the second element of the pair) gets truncated, never the question.
- Whenever an answer cannot be located inside a feature's context (e.g. the example has no answer, or the answer was chopped off during truncation), we set both labels, start_position and end_position, to index 0 (special token [CLS]'s index).

# maximum length of a feature (question and context)
max_length = 384
# overlap between two adjacent chunks of the same context
doc_stride = 128
def prepare_qa_train(examples):
"""Prepare training data, input features plus label for question answering dataset."""
answers = examples["answers"]
examples["question"] = [question.strip() for question in examples["question"]]
# Tokenize our examples with truncation and padding, but keep overflows using a stride.
# This results in one example potentially generating several features when a context is
# long, each of those features having a context that overlaps a bit the previous
# feature's context to prevent chopping off answer span.
tokenized_examples = tokenizer(
examples["question"],
examples["context"],
max_length=max_length,
truncation="only_second",
return_overflowing_tokens=True,
return_offsets_mapping=True,
stride=doc_stride,
padding="max_length"
)
sample_mapping = tokenized_examples["overflow_to_sample_mapping"]
offset_mapping = tokenized_examples["offset_mapping"]
# We will label impossible answers with CLS token's index.
cls_index = 0
# start_positions and end_positions will be the labels for extractive question answering
tokenized_examples["start_positions"] = []
tokenized_examples["end_positions"] = []
for i, offset in enumerate(offset_mapping):
input_ids = tokenized_examples["input_ids"][i]
sample_index = sample_mapping[i]
answer = answers[sample_index]
# if no answers are given, set CLS index as answer
if len(answer["answer_start"]) == 0:
tokenized_examples["start_positions"].append(cls_index)
tokenized_examples["end_positions"].append(cls_index)
else:
start_char = answer["answer_start"][0]
end_char = start_char + len(answer["text"][0])
sequence_ids = tokenized_examples.sequence_ids(i)
# find the context's corresponding start and end token index
token_start_index = 0
while sequence_ids[token_start_index] != 1:
token_start_index += 1
token_end_index = len(input_ids) - 1
while sequence_ids[token_end_index] != 1:
token_end_index -= 1
# if answer is within the context offset, move the token_start_index and token_end_index
# to two ends of the answer else label it with cls index
offset_start_char = offset[token_start_index][0]
offset_end_char = offset[token_end_index][1]
if offset_start_char <= start_char and offset_end_char >= end_char:
while token_start_index < len(offset) and offset[token_start_index][0] <= start_char:
token_start_index += 1
start_position = token_start_index - 1
while offset[token_end_index][1] >= end_char:
token_end_index -= 1
end_position = token_end_index + 1
tokenized_examples["start_positions"].append(start_position)
tokenized_examples["end_positions"].append(end_position)
else:
tokenized_examples["start_positions"].append(cls_index)
tokenized_examples["end_positions"].append(cls_index)
return tokenized_examples
We test our preprocessing function on a couple of samples to ensure this somewhat complicated function works as expected, i.e. the start and end positions of a tokenized answer match the original un-tokenized version.
examples = datasets["train"][0:2]
answers = examples["answers"]
tokenized_examples = prepare_qa_train(examples)
start_positions = tokenized_examples["start_positions"]
end_positions = tokenized_examples["end_positions"]
for i, input_ids in enumerate(tokenized_examples["input_ids"]):
start = start_positions[i]
end = end_positions[i] + 1
string = tokenizer.decode(input_ids[start:end])
print("expected answer:", answers[i]["text"][0])
print("preprocessing answer:", string)
# prevents progress bar from flooding our document
disable_progress_bar()
tokenized_datasets = datasets.map(
prepare_qa_train,
batched=True,
remove_columns=datasets["train"].column_names,
num_proc=8
)
tokenized_datasets
Upon preparing our dataset, fine-tuning a question answering model on top of a pre-trained language model is similar to other tasks: we initialize an AutoModelForQuestionAnswering model and follow the standard fine-tuning process.
model_name = model_checkpoint.split("/")[-1]
fine_tuned_model_checkpoint = f"{model_name}-fine_tuned-{task_name}"
if os.path.isdir(fine_tuned_model_checkpoint):
do_train = False
model = AutoModelForQuestionAnswering.from_pretrained(fine_tuned_model_checkpoint, cache_dir=cache_dir)
else:
do_train = True
model = AutoModelForQuestionAnswering.from_pretrained(model_checkpoint, cache_dir=cache_dir)
os.environ['DISABLE_MLFLOW_INTEGRATION'] = 'TRUE'
args = TrainingArguments(
output_dir=fine_tuned_model_checkpoint,
learning_rate=0.0001,
per_device_train_batch_size=64,
per_device_eval_batch_size=64,
num_train_epochs=2,
weight_decay=0.01,
fp16=True,
    # we set it to evaluate/save per epoch to avoid flooding the console
evaluation_strategy=IntervalStrategy.EPOCH,
save_strategy=IntervalStrategy.EPOCH,
save_total_limit=2,
do_train=do_train
)
trainer = Trainer(
model,
args,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"],
tokenizer=tokenizer,
)
if trainer.args.do_train:
train_output = trainer.train()
# saving the model which allows us to leverage
# .from_pretrained(model_path)
trainer.save_model(fine_tuned_model_checkpoint)
Evaluating our model also requires a bit more work on the postprocessing front, hence we'll first use transformers' pipeline object to confirm the model we just trained is indeed learning, by checking whether its predicted answer resembles the ground truth answer.
example = datasets["validation"][0]
qa_pipeline = pipeline(
"question-answering",
model=fine_tuned_model_checkpoint,
tokenizer=fine_tuned_model_checkpoint
)
output = qa_pipeline({
"question": example["question"],
"context": example["context"]
})
answer_text = example["answers"]["text"][0]
print("output answer matches expected answer: ", output["answer"] == answer_text)
output
For evaluation, we'll preprocess our dataset in a slightly different manner:

- Instead of computing start/end position labels, we store each feature's example_id, so predictions made on features/chunks can be mapped back to the original example they came from.
- For offset mappings of tokens that are not part of the context, we set them to None, this is for efficiently detecting if our predicted answer span is within the context portion of the input sentence as opposed to the question portion.

def prepare_qa_test(examples):
examples["question"] = [question.strip() for question in examples["question"]]
# Tokenize our examples with truncation and padding, but keep overflows using a stride.
# This results in one example potentially generating several features when a context is
# long, each of those features having a context that overlaps a bit the previous
# feature's context to prevent chopping off answer span.
tokenized_examples = tokenizer(
examples["question"],
examples["context"],
max_length=max_length,
truncation="only_second",
return_overflowing_tokens=True,
return_offsets_mapping=True,
stride=doc_stride
)
sample_mapping = tokenized_examples["overflow_to_sample_mapping"]
tokenized_examples["example_id"] = []
for i in range(len(tokenized_examples["input_ids"])):
sequence_ids = tokenized_examples.sequence_ids(i)
sample_index = sample_mapping[i]
tokenized_examples["example_id"].append(examples["id"][sample_index])
        # for offset mappings that are not part of the context, set them to None
        # so it's easy to determine whether a token position is part of the context
offset_mapping = []
for k, offset in enumerate(tokenized_examples["offset_mapping"][i]):
if sequence_ids[k] != 1:
offset = None
offset_mapping.append(offset)
tokenized_examples["offset_mapping"][i] = offset_mapping
return tokenized_examples
validation_features = datasets["validation"].map(
prepare_qa_test,
batched=True,
remove_columns=datasets["validation"].column_names,
num_proc=8
)
validation_features
With our features in hand, we can generate predictions, which come as a pair of start and end logits.
raw_predictions = trainer.predict(validation_features)
raw_predictions.predictions
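To see why the postprocessing below is needed, here's a naive decode of a single feature (a sketch of our own, not part of the original pipeline): taking the plain argmax of the start and end logits can produce an end index before the start, a span that falls in the question or padding, and it never aggregates the multiple chunks belonging to one example, which is exactly what the function below takes care of.

# naive span decode for one feature: argmax of start/end logits.
# raw_predictions.predictions is a (start_logits, end_logits) pair,
# each with shape (num_features, padded_sequence_length)
start_logits, end_logits = raw_predictions.predictions
feature_index = 0
naive_start = int(np.argmax(start_logits[feature_index]))
naive_end = int(np.argmax(end_logits[feature_index]))
input_ids = validation_features[feature_index]["input_ids"]
# this string may or may not be a sensible answer, hence the careful postprocessing below
print(tokenizer.decode(input_ids[naive_start:naive_end + 1]))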
Having our original examples, preprocessed features, and generated predictions, we'll perform a final postprocessing step to produce a predicted answer for each example. This process mainly involves:

- Building a mapping from each example to all of the features/chunks it generated.
- For every feature, looking at the n_best_size highest scoring start and end logits, and skipping candidate spans that are invalid (outside the context, end before start, or longer than max_answer_length).
- Scoring each remaining span by the sum of its start and end logits, and picking the best valid answer across all of an example's features.
- For datasets with unanswerable questions, comparing the best answer's score against the null ([CLS]) score and predicting an empty string when the null score wins.
def postprocess_qa_predictions(
examples,
features,
raw_predictions,
n_best_size = 20,
max_answer_length = 30,
no_answer = False
):
print(f"Post-processing {len(examples)} example predictions split into {len(features)} features.")
all_start_logits, all_end_logits = raw_predictions
# build a dictionary that stores examples to features/chunks mapping
# key : example, value : list of features
example_id_to_index = {k: i for i, k in enumerate(examples["id"])}
features_per_example = collections.defaultdict(list)
for i, feature in enumerate(features):
features_per_example[example_id_to_index[feature["example_id"]]].append(i)
cls_index = 0
predictions = collections.OrderedDict()
# for each example, loop through all its features/chunks for finding the best one
for example_index, example in enumerate(tqdm(examples)):
feature_indices = features_per_example[example_index]
min_null_score = None
valid_answers = []
context = example["context"]
for feature_index in feature_indices:
# model prediction for this feature
start_logits = all_start_logits[feature_index]
end_logits = all_end_logits[feature_index]
offset_mapping = features[feature_index]["offset_mapping"]
# update minimum null prediction's score
feature_null_score = start_logits[cls_index] + end_logits[cls_index]
            if min_null_score is None or feature_null_score < min_null_score:
                min_null_score = feature_null_score
# loop through all possibilities for `n_best_size` start and end logits.
start_indexes = np.argsort(start_logits)[-1:-n_best_size - 1:-1].tolist()
end_indexes = np.argsort(end_logits)[-1:-n_best_size - 1:-1].tolist()
for start_index in start_indexes:
for end_index in end_indexes:
# Don't consider out-of-scope answers, either because indices
# are out of bounds or correspond to input_ids that
# are not part of the context section.
if (
start_index >= len(offset_mapping)
or end_index >= len(offset_mapping)
or offset_mapping[start_index] is None
or len(offset_mapping[start_index]) < 2
or offset_mapping[end_index] is None
or len(offset_mapping[end_index]) < 2
):
continue
# Don't consider answers with a length that is either < 0 or > max_answer_length.
if end_index < start_index or end_index - start_index + 1 > max_answer_length:
continue
start_char = offset_mapping[start_index][0]
end_char = offset_mapping[end_index][1]
valid_answers.append(
{
"text": context[start_char:end_char],
"score": start_logits[start_index] + end_logits[end_index]
}
)
if len(valid_answers) > 0:
best_answer = max(valid_answers, key=lambda x: x["score"])
else:
# In the very rare edge case we have not a single non-null prediction,
# we create a fake prediction to avoid failure.
best_answer = {"text": "", "score": 0.0}
example_id = example["id"]
if no_answer:
answer = best_answer["text"] if best_answer["score"] > min_null_score else ""
predictions[example_id] = answer
else:
predictions[example_id] = best_answer["text"]
return predictions
final_predictions = postprocess_qa_predictions(
datasets["validation"],
validation_features,
raw_predictions.predictions
)
print("output answer matches expected answer: ", final_predictions[example["id"]] == answer_text)
SQuAD primarily uses two metrics for model evaluation: exact match (EM), the percentage of predictions that match one of the ground truth answers exactly, and F1 score, which measures the average token-level overlap between the prediction and the ground truth answer.
For context, the screenshot below shows the performance reported by the original SQuAD 2.0 paper [5].
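As a rough sketch of what these metrics measure (these toy helpers, toy_exact_match and toy_f1, are our own and skip the normalization that the official SQuAD script applies, e.g. lowercasing and stripping punctuation and articles), exact match checks whether prediction and reference are identical strings, while F1 treats both as bags of tokens:

import collections

def toy_exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match exactly, else 0.0."""
    return float(prediction == reference)

def toy_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(toy_exact_match("Denver Broncos", "Denver Broncos"))  # 1.0
print(toy_f1("the Denver Broncos", "Denver Broncos"))       # 0.8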
squad_metric = evaluate.load(task_name, cache_dir=cache_dir)
formatted_predictions = [
{"id": example_id, "prediction_text": answer}
for example_id, answer in final_predictions.items()
]
references = [{"id": example["id"], "answers": example["answers"]} for example in datasets["validation"]]
squad_metric.compute(predictions=formatted_predictions, references=references)
That's a wrap for this document. We went through the nitty gritty details of how to preprocess our inputs and postprocess our outputs for fine-tuning an extractive question answering reader model on top of a pre-trained language model.