42.128. Beginner’s Guide to Text Pre-Processing#

Natural Language Processing (NLP) is a subdomain of Artificial Intelligence that deals with processing natural language data such as text and speech. We can also describe NLP as the “art of extracting information from text”. Recently there has been a lot of activity in this field, with impressive research coming out every day! The revolutionary breakthrough was the “Transformer”, which opened up avenues to build massive Deep Learning models that come very close to human performance on tasks like Summarization and Question Answering. Then came the GPTs and BERTs: massive models with billions of parameters, trained on huge datasets, which can be fine-tuned to a variety of NLP tasks and problem statements.

At the root of building a robust NLP model, Text Pre-processing plays a very important role. This might not be very evident in recent models like BERT and GPT, but it is one of the most elementary processes in Natural Language Processing. Most NLP researchers and enthusiasts will have done Text Pre-processing more often than not while solving problems in this domain. For a beginner, it is a fundamental concept to nail before setting sights on advanced problems. This brings us to a question - why Text Pre-processing?

42.128.1. Why Text Pre-processing?#

Text Pre-processing is important because language models are quite complex, largely due to grammar rules. Unnecessary data in a non-processed corpus only adds ambiguity, increases computation requirements, and can impact the accuracy of the model to a considerable extent.

Moreover, we have to transform the text into vectors/numbers that can be ingested by machines. This process is called Text Encoding (popularly also known as Text Representation), and there are many techniques for it: CountVectorizer, Tf-Idf Vectorizer, Bag of Words, Word2Vec, GloVe, etc. It comes after Text Pre-processing. We shall look into these techniques in the next article 🙂
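
As a tiny preview of what encoding looks like (the next article covers it properly), here is a minimal sketch using scikit-learn's CountVectorizer; the two sentences are purely illustrative and scikit-learn is assumed to be installed:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['the brown fox is quick', 'the blue dog is lazy']
vectorizer = CountVectorizer()
# Each sentence becomes a vector of word counts over a shared, alphabetically sorted vocabulary
print(vectorizer.fit_transform(corpus).toarray())
# [[0 1 0 1 1 0 1 1]
#  [1 0 1 0 1 1 0 1]]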

Coming back to Text Pre-processing, let us look into a few popular Text Pre-processing methods in NLP.

42.128.2. Downloading Packages#

We will use the most popular library for processing textual data - NLTK (Natural Language Toolkit). On top of installing and importing the base NLTK library, we have to download a few additional resources for our pre-processing techniques. The code is shown below:

!pip install nltk

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('wordnet')    # dictionary used by the lemmatizer
nltk.download('stopwords')  # stopword lists

Note: All code examples can be executed on Colab exactly as shown in the articles.

Once that is done, we can start with the different pre-processing activities, a few of which are covered below. At the end, we will bundle all of these pre-processing techniques into a single function, making them easy to use and to chain in sequence with other pre-processing steps.

42.128.3. Removing Accented Characters#

This will be our first pre-processing technique. It involves removing accented characters like é, â, etc. by replacing them with their plain ASCII equivalents, since the accents do not add any meaning for our purposes. We can use the unicodedata library to perform this replacement.

import unicodedata

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    
    return text

remove_accented_chars('résumé')
# Result - 'resume'

remove_accented_chars('café')
# Result - 'cafe'
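
To see what is happening under the hood, here is a small illustration (purely for understanding): NFKD normalization splits each accented character into a base letter plus a combining mark, and the ASCII encode/decode round-trip with errors='ignore' then drops the combining marks.

decomposed = unicodedata.normalize('NFKD', 'résumé')
# The accents are now separate combining characters following the base letters
print(len('résumé'), len(decomposed))
# 6 8
print(decomposed.encode('ascii', 'ignore').decode('utf-8'))
# resume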

42.128.4. Removing Special & Non-Alphanumeric Characters#

The next step is to take care of special symbols and non-alphanumeric characters like #, @, $, etc. (and, optionally, numbers). We can remove these characters easily using regular expressions.

import re

def remove_special_characters(text, remove_digits=True):
    # Pattern that keeps letters, digits and whitespace
    alnum_pattern = r'[^a-zA-Z0-9\s]'
    # Pattern that keeps letters and whitespace only (drops digits as well)
    alpha_pattern = r'[^a-zA-Z\s]'
    text = re.sub(alpha_pattern if remove_digits else alnum_pattern, '', text)
    return text

remove_special_characters('The brown fox is quick and the blue dog is lazy!')
# Result - The brown fox is quick and the blue dog is lazy

remove_special_characters('@ElonMusk is revolutionizing the Space industry, especially the aspect of Reusable rockets!!!')
# Result - ElonMusk is revolutionizing the Space industry especially the aspect of Reusable rockets

Note: Whether to remove numbers depends on the dataset and the problem statement; they may carry useful information (years, quantities, etc.). The remove_digits flag in the function above lets you choose, as shown below.
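
A quick illustration with a made-up sentence:

# Keep the digits when they carry meaning
remove_special_characters('Apollo 11 landed in 1969!', remove_digits=False)
# Result - 'Apollo 11 landed in 1969'

# Drop the digits (default behaviour)
remove_special_characters('Apollo 11 landed in 1969!')
# Result - 'Apollo  landed in '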

42.128.5. Converting to Lowercase#

This is an important and compulsory step in pre-processing text. If we consider the words “Banana” and “banana”, both convey the same meaning, but they are represented differently and are treated as unique words by the encoder (which converts text to vectors). To combat this, we can simply convert the entire corpus to lower case, making sure every word or token (in NLP jargon) is in the same form, which makes it easier to process and represent effectively.

We can achieve this by simply using the lower() method on the string, along with strip() to remove any leading and trailing whitespace.

def to_lower(text):
    return text.lower().strip()

to_lower('Hi there, How are you?')
# Result - hi there, how are you?
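
A tiny illustration of why this matters for the vocabulary size:

tokens = ['Banana', 'banana', 'BANANA']
# Without lower-casing, an encoder would treat these as three distinct tokens
print(len(set(tokens)))
# 3
print(len(set(token.lower() for token in tokens)))
# 1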

42.128.6. Removing Punctuation#

Punctuation adds weight to the corpus and plays an important role in conveying the semantics of a sentence. Even so, we can go ahead and remove it as one of the pre-processing techniques. Advanced encoding techniques like Word Embeddings (covered in a later post) can model the corpus without any punctuation.

import string

def remove_p(text):
    # Strip standard ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Strip typographic quotes and ellipses not covered by string.punctuation
    text = re.sub('[‘’“”…]', '', text)
    # Replace newlines with spaces so that words on adjacent lines do not merge
    text = re.sub('\n', ' ', text)

    return text

remove_p('We were, though we had rushed to get there, late for the film. "Thank you", I said')
# Result - 'We were though we had rushed to get there late for the film Thank you I said'

Note: Punctuation was not removed for the more advanced GPTs and BERTs, as those models are powerful enough to process and model sentences as they are, without this pre-processing.

42.128.7. Tokenization#

This is a small step which converts a sentence into tokens or words: if the input is a string (sentence), the output is the list of words/tokens in that sentence. A common definition of a token is “a sequence of characters which are grouped together as a useful semantic unit for analysis”. To put it simply, tokens are the smallest meaningful entities of a sentence. Here, we use NLTK’s word_tokenize() function. Tokenization is needed before applying the next steps - Stopword Removal, Stemming and Lemmatization.

import nltk

def tokenization(text):
    tokens = nltk.word_tokenize(text)
    return tokens

tokenization('She sells sea shells on the sea shore')
# Result - ['She','sells','sea','shells','on','the','sea','shore']

There is an alternative to nltk.word_tokenize: TensorFlow’s text_to_word_sequence. Note that, unlike word_tokenize, it also lower-cases the tokens and strips most punctuation by default.

from tensorflow.keras.preprocessing.text import text_to_word_sequence

text_to_word_sequence('She sells sea shells on the sea shore')
# Result - ['she', 'sells', 'sea', 'shells', 'on', 'the', 'sea', 'shore']
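
If the original casing should be preserved at this stage (for example, when lower-casing is handled as a separate step), text_to_word_sequence accepts a lower argument:

text_to_word_sequence('She sells sea shells on the sea shore', lower=False)
# Result - ['She', 'sells', 'sea', 'shells', 'on', 'the', 'sea', 'shore']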

42.128.8. Stopword Removal#

Stopwords are the most common words, like “I”, “am”, “there”, “where”, etc. They usually don’t help in certain NLP tasks and are best removed to save computation and time. Removing stopwords was common methodology in earlier pipelines; however, in the age of GPT and BERT, we don’t usually remove them.

from nltk.corpus import stopwords

STOPWORDS = stopwords.words('english')

def remove_stopwords(tokens):
    # NLTK's stopword list is all lower-case, so lower-case the tokens before this step
    filtered_tokens = [token for token in tokens if token not in STOPWORDS]
    return filtered_tokens

remove_stopwords(['the', 'brown', 'fox', 'is', 'quick', 'and', 'the', 'blue', 'dog', 'is', 'lazy'])
# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazy']

# We can print all the stopwords in NLTK's English list with print(stopwords.words('english'))

# Since STOPWORDS is a plain Python list, we can also customize it for our scenario
# using STOPWORDS.remove() and STOPWORDS.append(), as sketched below
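
A minimal sketch of such a customization (the words chosen here are purely illustrative):

custom_stopwords = stopwords.words('english')
custom_stopwords.remove('not')        # keep negations, e.g. for sentiment analysis
custom_stopwords.append('movie')      # treat a domain-specific word as a stopword

[token for token in ['the', 'movie', 'was', 'not', 'good'] if token not in custom_stopwords]
# Result - ['not', 'good']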

Note: You can try both approaches when creating the corpus, removing the stopwords and retaining them, and compare the end results for your task.

42.128.9. Stemming#

Stemming is the process of reducing a given token/word to its root form. For example, the words likely, likes, liked and liking are reduced to their root form, like. Stemming uses a crude heuristic that chops off the ends of words in the hope of correctly arriving at the root, so the words “trouble”, “troubled” and “troubles” might actually be converted to “troubl” instead of “trouble” because the endings were simply chopped off!

Stemming is an optional step, and the best way to find out whether it is effective is to experiment and compare the results before and after stemming. Two commonly used stemmers in NLTK are PorterStemmer and SnowballStemmer; the details are given here. In our examples, we will use PorterStemmer, with a brief SnowballStemmer comparison sketched after the examples below.

from nltk.stem import PorterStemmer

ps = PorterStemmer()
def stem(words):
    stemmed_tokens = [ps.stem(word) for word in words]
    return stemmed_tokens

stem(['brown', 'fox', 'quick', 'blue', 'dog', 'lazy'])
# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazi']

stem(['welcome', 'fairly', 'easily'])
# Result - ['welcom', 'fairli', 'easili']
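
For comparison, here is a quick sketch with SnowballStemmer; its outputs differ slightly from PorterStemmer on some words:

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer('english')
[snowball.stem(word) for word in ['welcome', 'fairly', 'easily']]
# Result - ['welcom', 'fair', 'easili']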

42.128.10. Lemmatization#

A widely used step after lower-casing and stopword removal is Lemmatization. It is similar to stemming but does not chop the ends of words; instead it maps each word to its actual root (lemma) based on a dictionary called WordNet. Find more details on WordNet here. Since it has to look words up in a dictionary, it is slightly slower than stemming. For example, the token “better” can be lemmatized to “good” (when the part of speech is supplied), which retains the semantic meaning after transformation. This is often not the case with stemming, where the stemmed word may not be semantically meaningful (“lazy” becomes “lazi” after stemming!). NLTK Lemmatizer details here.

from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
def lemmatize(words):
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in words]
    return lemmatized_tokens

lemmatize(['welcome', 'fairly', 'better', 'goose', 'geese'])
# Result - ['welcome', 'fairly', 'better', 'goose', 'goose']
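
By default, lemmatize() treats every word as a noun, which is why “better” is left unchanged above. Passing the part of speech lets WordNet resolve such forms:

lemmatizer.lemmatize('better', pos='a')
# Result - 'good'

lemmatizer.lemmatize('running', pos='v')
# Result - 'run'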

42.128.11. Putting it all together#

Now that we have defined functions for all the pre-processing steps, let us call them and observe the results. We can also chain the function calls in a specific order; this is often termed the pre-processing pipeline.

sentence = 'The brown fox is quick and the blue dog is lazy!'

# REMOVING ACCENTED CHARACTERS
remove_accented_chars(sentence)
# Result - The brown fox is quick and the blue dog is lazy!

# REMOVING SPECIAL CHARACTERS
remove_special_characters(sentence)
# Result - The brown fox is quick and the blue dog is lazy

# CONVERTING TO LOWER CASE
# Pipeline involving removal of special chars and then lower-casing
to_lower(remove_special_characters(sentence))
# Result - the brown fox is quick and the blue dog is lazy

# REMOVING PUNCTUATION
remove_p(to_lower(remove_special_characters(sentence)))
# Result - the brown fox is quick and the blue dog is lazy

# TOKENIZATION
text_tokens = tokenization(remove_p(to_lower(remove_special_characters(sentence))))
# Result - ['the', 'brown', 'fox', 'is', 'quick', 'and', 'the', 'blue', 'dog', 'is', 'lazy']

# REMOVAL OF STOPWORDS
filtered_tokens = remove_stopwords(text_tokens)
# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazy']

# STEMMING
stem(filtered_tokens)
# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazi']

# LEMMATIZATION
lemmatize(filtered_tokens)
# Result - ['brown', 'fox', 'quick', 'blue', 'dog', 'lazy']

# REFACTORING THE CORPUS
def refactor(words):
    return ' '.join(words)
refactor(lemmatize(filtered_tokens))
# Result - 'brown fox quick blue dog lazy'
# ONE PIPELINE FOR ALL STEPS
refactor(lemmatize(remove_stopwords(tokenization(remove_p(to_lower(remove_special_characters('The brown fox is quick and the blue dog is lazy!')))))))
# Result - 'brown fox quick blue dog lazy'
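
As promised at the start, the individual steps can also be bundled into a single function. Here is a minimal sketch (stemming is left out in favour of lemmatization, and the optional steps can be toggled):

def preprocess(text, remove_stop=True, lemmatize_tokens=True):
    # Chain the pre-processing functions defined above into one pipeline
    text = remove_accented_chars(text)
    text = remove_special_characters(text)
    text = to_lower(text)
    text = remove_p(text)
    tokens = tokenization(text)
    if remove_stop:
        tokens = remove_stopwords(tokens)
    if lemmatize_tokens:
        tokens = lemmatize(tokens)
    return refactor(tokens)

preprocess('The brown fox is quick and the blue dog is lazy!')
# Result - 'brown fox quick blue dog lazy'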

Feel free to experiment with the stemming, lemmatization and stopword-removal aspects of the pipeline. Given here is the code containing all the functions in a single Python file.

42.128.12. Conclusion#

We have covered a few of the most popular text pre-processing steps in NLP in this post. There are a few more advanced concepts like bi-gram and tri-gram filtering, correcting spelling mistakes, expanding abbreviations, etc. Feel free to explore these methods as well. One more thing to note, which has surfaced in recent years, is that “pre-processing can hamper the performance of Deep NLP models!”, as stated here. BERT and GPT also don’t employ rigorous pre-processing steps, which might prompt the thought: “Was learning these techniques a waste of time?” Definitely not! These techniques are building blocks of NLP and should be known by any beginner starting out in the field.

Try these techniques on your custom data and observe how Pre-processing techniques can help in building a very good text corpus which can later be employed for training Deep Learning Models for NLP tasks. In our next post, we will move to the next step of representing the corpus as a vector, commonly known as Text Encoding.

42.128.13. Acknowledgements#

Thanks to Pranav Raikote for creating NLP Tutorials – Part 1: Beginner’s Guide to Text Pre-Processing. It inspires the majority of the content in this chapter.