42.126. Getting started with NLP: a classification task#

One area where deep learning has dramatically improved in the last couple of years is natural language processing (NLP). Computers can now generate text, translate automatically from one language to another, analyze comments, label words in sentences, and much more.

Perhaps the most practically useful application of NLP is classification – that is, automatically classifying a document into some category. This can be used, for instance, for:

  • Sentiment analysis (e.g. are people saying positive or negative things about your product)

  • Author identification (what author most likely wrote some document)

  • Legal discovery (which documents are in scope for a trial)

  • Organizing documents by topic

  • Triaging inbound emails

  • …and much more!

Today, we are tasked with comparing two words or short phrases and scoring them on how similar they are, based on the patent class in which they were used. A score of 1 means the two inputs have identical meaning, and 0 means they have totally different meanings. For instance, abatement and eliminating process have a score of 0.5, meaning they’re somewhat similar, but not identical.

It turns out that this can be represented as a classification problem. How? By representing the question like this:

For the following text…: “TEXT1: abatement; TEXT2: eliminating process” …choose a category of meaning similarity: “Different; Similar; Identical”.

In this section we’ll see how to solve the Patent Phrase Matching problem by treating it as a classification task and representing it in a way very similar to that shown above.

42.126.1. Import and EDA#

import pandas as pd
import numpy as np
from datasets import Dataset,DatasetDict
from transformers import AutoModelForSequenceClassification, AutoTokenizer, TrainingArguments, Trainer
import warnings

warnings.filterwarnings("ignore")

First of all, let’s import the dataset.

df = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/deep-learning/nlp/phrase_matching_train.csv')
df.head()
id anchor target context score
0 37d61fd2272659b1 abatement abatement of pollution A47 0.50
1 7b9652b17b68b7a4 abatement act of abating A47 0.75
2 36d72442aefd8232 abatement active catalyst A47 0.25
3 5296b0c19e1ce60e abatement eliminating process A47 0.50
4 54c1e3b9184cb5b6 abatement forest region A47 0.00

As you can see, there are 5 columns, where anchor and target are a pair of phrases, context is the common context they appear in, and score is the similarity score of anchor and target.

df.describe(include='object')
id anchor target context
count 36473 36473 36473 36473
unique 36473 733 29340 106
top 37d61fd2272659b1 component composite coating composition H01
freq 1 152 24 2186

We can see that in the 36473 rows, there are 733 unique anchors, 106 contexts, and nearly 30000 targets. Some anchors are very common, with “component composite coating” for instance appearing 152 times.
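
The score column is numeric, so it isn’t included in the summary above. If you want to see which similarity scores actually occur and how often, a quick check (a small sketch, not part of the original notebook) is:

# Count how many rows have each similarity score
# (score was excluded from describe(include='object') above because it is numeric).
df.score.value_counts()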

Earlier, I suggested we could represent the input to the model as something like “TEXT1: abatement; TEXT2: eliminating process”. We’ll need to add the context to this too. In Pandas, we just use + to concatenate, like so:

df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
df.head(5)
id anchor target context score input
0 37d61fd2272659b1 abatement abatement of pollution A47 0.50 TEXT1: A47; TEXT2: abatement of pollution; ANC...
1 7b9652b17b68b7a4 abatement act of abating A47 0.75 TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2 36d72442aefd8232 abatement active catalyst A47 0.25 TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3 5296b0c19e1ce60e abatement eliminating process A47 0.50 TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4 54c1e3b9184cb5b6 abatement forest region A47 0.00 TEXT1: A47; TEXT2: forest region; ANC1: abatement

42.126.2. Tokenization#

Transformers uses a Dataset object for storing a dataset, of course! We can create one from our DataFrame like so:

ds = Dataset.from_pandas(df)
ds
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

But we can’t pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

  • Tokenization: Split each text up into words (or actually, as we’ll see, into tokens)

  • Numericalization: Convert each word (or token) into a number.

The details about how this is done actually depend on the particular model we use. So first we’ll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use this (replace “small” with “large” for a slower but more accurate model, once you’ve finished exploring):

model_nm = 'microsoft/deberta-v3-small'

AutoTokenizer will create a tokenizer appropriate for a given model:

tokz = AutoTokenizer.from_pretrained(model_nm)

Here’s an example of how the tokenizer splits a text into “tokens” (which are like words, but can be sub-word pieces, as you see below):

tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")
['▁G',
 "'",
 'day',
 '▁folks',
 ',',
 '▁I',
 "'",
 'm',
 '▁Jeremy',
 '▁from',
 '▁fast',
 '.',
 'ai',
 '!']

Uncommon words will be split into pieces, just like ornithorhynchus below. The start of a new word is represented by the ▁ character:

tokz.tokenize("A platypus is an ornithorhynchus anatinus.")
['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']

42.126.3. Numericalization#

After tokenization, we need to convert each token into a number, because the model only accepts numbers as input. But how? We need a large token dictionary (a vocabulary) that maps each token to a number:

vocab = tokz.get_vocab()

The above is the token dictionary (vocabulary) that comes with the deberta-v3-small model. You can print it out to inspect it.
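
For example, we can look up individual tokens in this dictionary. The snippet below is a small sketch using the tokenizer’s convert_tokens_to_ids method; the exact IDs depend on the model:

# The vocabulary maps each token string to an integer ID.
# Check its size, then look up two tokens we saw earlier (IDs are model-specific).
print(len(vocab))
print(tokz.convert_tokens_to_ids(['▁platypus', '▁of']))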

tokz("A platypus is an ornithorhynchus anatinus.")
{'input_ids': [1, 336, 114224, 269, 299, 289, 4840, 34765, 102530, 1867, 299, 2401, 26835, 260, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Using this token dictionary, the tokenizer converts the original token sequence into a numeric sequence. input_ids contains the token IDs we need, token_type_ids indicates which sentence each token belongs to (useful when the input is a pair of sentences), and attention_mask indicates which tokens the model should attend to (padding tokens get a 0 here).
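
As a sanity check, we can map those IDs back into tokens with convert_ids_to_tokens (a small sketch; note the special tokens the tokenizer adds at the start and end of the sequence):

# Convert the IDs from the example above back into tokens.
# The tokenizer automatically adds special tokens at both ends of the sequence.
ids = tokz("A platypus is an ornithorhynchus anatinus.")["input_ids"]
tokz.convert_ids_to_tokens(ids)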

Here’s a simple function which tokenizes our inputs:

def tok_func(x): return tokz(x["input"])

To run this quickly in parallel on every row in our dataset, use map:

tok_ds = ds.map(tok_func, batched=True)
tok_ds
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

This adds new items to our dataset: input_ids, token_type_ids, and attention_mask. For instance, here are the input and IDs for the first row of our data:

row = tok_ds[0]
row['input'], row['input_ids']
('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

Finally, we need to prepare our labels. Transformers always assumes that the labels column is named labels, but in our dataset it’s currently called score. Therefore, we need to rename it:

tok_ds = tok_ds.rename_columns({'score':'labels'})
tok_ds
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36473
})

Now that we’ve prepared our tokens and labels, we need to create our validation set.

42.126.4. Test and validation sets#

eval_df = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/deep-learning/nlp/phrase_matching_test.csv')
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)
eval_ds
Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 36
})

This is the test set. Possibly the most important idea in machine learning is that of having separate training, validation, and test data sets.

dds = tok_ds.train_test_split(0.25, seed=42)
dds
DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

This is the validation set. We use train_test_split to split off 25% of the training data to use for validation.

42.126.5. Training our model#

Before starting training, we need to set some hyperparameters for our model. Here’s a concise explanation:

  • Batch Size (bs): 128 examples processed in each iteration.

  • Epochs (epochs): The model will be trained through the entire dataset 4 times.

  • Learning Rate (lr): The step size for adjusting model weights during optimization is set to 8e-5.

  • TrainingArguments (args):

    • Warmup Ratio: 10% of training steps used for learning rate warm-up.

    • Learning Rate Scheduler: Cosine learning rate scheduler.

    • Mixed Precision (fp16): Training with mixed-precision for faster computation.

    • Evaluation Strategy: Model evaluation after each epoch.

    • Batch Sizes: 128 examples per training device, 256 for evaluation.

    • Number of Training Epochs: Training for 4 epochs.

    • Weight Decay: L2 regularization with a rate of 0.01.

    • Report To: No reports sent to external loggers during training (set to ‘none’).

bs = 128
epochs = 4
lr = 8e-5
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

Now we can initialize a pre-trained sequence classification model and set up a training environment using Hugging Face’s Trainer. The model is loaded with AutoModelForSequenceClassification.from_pretrained and configured with the training parameters in the Trainer object.

model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz)
Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
trainer.train();
[856/856 00:53, Epoch 4/4]
Epoch Training Loss Validation Loss
1 No log 0.026275
2 No log 0.021973
3 0.039600 0.022443
4 0.039600 0.023286
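
Before predicting on the test set, you might want to sanity-check the trained model on the validation split, for example by computing the Pearson correlation between its predictions and the true scores. This is a small sketch that wasn’t part of the run above:

# Predict on the validation split and compare against the true labels.
val_out = trainer.predict(dds['test'])
val_preds = np.clip(val_out.predictions.squeeze(), 0, 1)
val_labels = np.array(dds['test']['labels'])

# Pearson correlation coefficient between predictions and labels.
np.corrcoef(val_preds, val_labels)[0, 1]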

preds = trainer.predict(eval_ds).predictions.astype(float)
preds
array([[-1.50489807e-03],
       [ 4.90570068e-03],
       [-5.05447388e-04],
       [ 2.69412994e-04],
       [-1.44767761e-03],
       [ 4.85897064e-04],
       [-1.81484222e-03],
       [ 8.22067261e-04],
       [ 4.36019897e-03],
       [ 4.40216064e-03],
       [-6.16550446e-04],
       [-4.18424606e-05],
       [-1.20639801e-03],
       [ 3.18288803e-04],
       [-6.15119934e-04],
       [-8.05377960e-04],
       [-2.66265869e-03],
       [ 2.60114670e-04],
       [ 3.48281860e-03],
       [ 1.68323517e-03],
       [ 1.38378143e-03],
       [-2.48527527e-03],
       [ 7.53879547e-04],
       [ 8.55922699e-04],
       [-2.27355957e-03],
       [-2.88581848e-03],
       [ 3.29780579e-03],
       [ 9.42707062e-04],
       [ 4.26769257e-04],
       [-1.19447708e-04],
       [-2.77519226e-03],
       [ 5.27381897e-04],
       [-8.44001770e-04],
       [ 4.88281250e-04],
       [-2.11715698e-04],
       [-1.00421906e-03]])

Look out - some of our predictions are <0 or >1! Let’s fix those out-of-bounds predictions by clipping them to the valid range:

preds = np.clip(preds, 0, 1)
preds
array([[0.        ],
       [0.0049057 ],
       [0.        ],
       [0.00026941],
       [0.        ],
       [0.0004859 ],
       [0.        ],
       [0.00082207],
       [0.0043602 ],
       [0.00440216],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.00031829],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.00026011],
       [0.00348282],
       [0.00168324],
       [0.00138378],
       [0.        ],
       [0.00075388],
       [0.00085592],
       [0.        ],
       [0.        ],
       [0.00329781],
       [0.00094271],
       [0.00042677],
       [0.        ],
       [0.        ],
       [0.00052738],
       [0.        ],
       [0.00048828],
       [0.        ],
       [0.        ]])
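
Finally, you could pair each clipped prediction with its test-set id, for instance to build a submission file. The DataFrame name and layout below are just an illustration:

# Pair each test-set id with its clipped prediction.
submission = pd.DataFrame({
    'id': eval_df['id'],
    'score': preds.squeeze(),
})
submission.head()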

42.127. Acknowledgments#

Thanks to Jeremy Howard for creating Getting started with NLP for absolute beginners. It inspired the majority of the content in this chapter.