# Install the necessary dependencies

import os
import sys
!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython

30. Natural Language Processing

Natural Language Processing (NLP) is a core technology in artificial intelligence that bridges human communication and computer understanding. It is a multidisciplinary field that enables computers to interpret, analyze, and generate human language, so that people can interact with machines more naturally. Its applications are widespread, ranging from automated customer support to real-time language translation.

This section aims to provide newcomers with a comprehensive overview of NLP, its workings, applications, challenges, and future outlook.

30.1. What is Natural Language Processing?


Image: Natural Language Processing

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on the interaction between computers and humans through natural language. The objective is to program computers to process and analyze large amounts of natural language data.

NLP involves enabling machines to understand, interpret, and produce human language in a way that is both valuable and meaningful. OpenAI, known for developing advanced language models like ChatGPT, highlights the importance of NLP in creating intelligent systems that can understand, respond to, and generate text, making technology more user-friendly and accessible.

30.2. Components of NLP

Natural Language Processing is not a single, monolithic approach. It is composed of several components, each contributing to the overall understanding of language. The main components that NLP strives to understand are syntax, semantics, pragmatics, and discourse.

30.2.1. Syntax

  • Definition: Syntax pertains to the arrangement of words and phrases to create well-structured sentences in a language.

  • Example: Consider the sentence “The cat sat on the mat.” Syntax involves analyzing the grammatical structure of this sentence, ensuring that it adheres to the grammatical rules of English, such as subject-verb agreement and proper word order.

30.2.2. Semantics

  • Definition: Semantics is concerned with understanding the meaning of words and how they create meaning when combined in sentences.

  • Example: In the sentence “The panda eats shoots and leaves,” semantics helps distinguish whether the panda eats plants (shoots and leaves) or is involved in a violent act (shoots) and then departs (leaves), based on the meaning of the words and the context.


Image: Semantics in NLP

30.2.3. Pragmatics

  • Definition: Pragmatics deals with understanding language in various contexts, ensuring that the intended meaning is derived based on the situation, speaker’s intent, and shared knowledge.

  • Example: If someone says, “Can you pass the salt?” Pragmatics involves understanding that this is a request rather than a question about one’s ability to pass the salt, interpreting the speaker’s intent based on the dining context.

30.2.4. Discourse

  • Definition: Discourse focuses on the analysis and interpretation of language beyond the sentence level, considering how sentences relate to each other in texts and conversations.

  • Example: In a conversation where one person says, “I’m freezing,” and another responds, “I’ll close the window,” discourse involves understanding the coherence between the two statements, recognizing that the second statement is a response to the implied request in the first.

Understanding these components is crucial for anyone delving into NLP, as they form the backbone of how NLP models interpret and generate human language.

30.3. What is NLP Used For?

Natural Language Processing has found extensive applications across various industries, revolutionizing the way businesses operate and interact with users. Here are some of the key industry applications of NLP.

30.3.1. Healthcare

NLP assists in transcribing and organizing clinical notes, ensuring accurate and efficient documentation of patient information. For instance, a physician might dictate their notes, which NLP systems transcribe into text. Advanced NLP models can further categorize the information, identifying symptoms, diagnoses, and prescribed treatments, thereby streamlining the documentation process, minimizing manual data entry, and enhancing the accuracy of electronic health records.

30.3.2. Finance

Financial institutions leverage NLP to perform sentiment analysis on various text data like news articles, financial reports, and social media posts to gauge market sentiment regarding specific stocks or the market in general. Algorithms analyze the frequency of positive or negative words, and through machine learning models, predict potential impacts on stock prices or market movements, aiding traders and investors in making informed decisions.
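The word-frequency idea described above can be illustrated with a minimal lexicon-based sketch. The word lists and the `sentiment_score` helper are hypothetical and chosen only for illustration; production systems rely on much larger lexicons or on trained models.

```python
import re

# Hypothetical, tiny sentiment lexicons (illustrative only).
POSITIVE_WORDS = {"gain", "growth", "profit", "surge", "strong"}
NEGATIVE_WORDS = {"loss", "decline", "drop", "weak", "lawsuit"}


def sentiment_score(text):
    """Return (#positive - #negative) / #tokens, or 0.0 for empty text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    positive = sum(tok in POSITIVE_WORDS for tok in tokens)
    negative = sum(tok in NEGATIVE_WORDS for tok in tokens)
    return (positive - negative) / len(tokens)


headline = "Strong profit growth offsets a small drop in sales"
# Three positive words and one negative word out of nine tokens.
print(sentiment_score(headline))
```

A score above zero suggests a positive tone and a score below zero a negative one. Simple counts like this ignore negation and context, which is one reason machine learning models are used on top of them.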

30.3.3. Customer Service

NLP-powered chatbots have revolutionized customer support by providing instant, 24/7 responses to customer inquiries. These chatbots understand customer queries through text or voice, interpret the underlying intent, and provide accurate responses or solutions. For instance, a customer might inquire about their order status, and the chatbot, integrating with the order management system, retrieves and delivers the real-time status, enhancing customer experience and reducing support workload.

30.3.4. E-Commerce

NLP significantly enhances on-site search functionality in e-commerce platforms by understanding and interpreting user queries, even if they are phrased in a conversational manner or contain typos. For example, if a user searches for “blu jeens,” NLP algorithms correct the typos and understand the intent, providing relevant results for “blue jeans,” thereby ensuring that users find what they are looking for, even with imprecise queries.
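As a rough sketch of the typo-correction step, the example below maps each query word to its closest match in a known vocabulary using Python's standard `difflib` module. The vocabulary and the `correct_query` helper are assumptions made for illustration; a real search engine would build its vocabulary from the product catalog and use more sophisticated ranking.

```python
import difflib

# Hypothetical product vocabulary; a real system would derive it from the catalog.
VOCABULARY = ["blue", "jeans", "shirt", "shoes", "black", "jacket"]


def correct_query(query, vocabulary=VOCABULARY, cutoff=0.6):
    """Replace each word with its closest vocabulary entry, if one is similar enough."""
    corrected = []
    for word in query.lower().split():
        matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        # Keep the original word when nothing in the vocabulary is close enough.
        corrected.append(matches[0] if matches else word)
    return " ".join(corrected)


print(correct_query("blu jeens"))  # blue jeans
```

The `cutoff` parameter controls how similar a vocabulary word must be before it replaces the user's input. Raising it reduces wrong corrections at the cost of leaving more typos untouched.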

30.3.6. Everyday applications

Beyond industry-specific applications, NLP is ingrained in our daily lives, making technology more accessible and user-friendly. Here are some everyday applications of NLP:

  • Search engines. NLP is fundamental to the functioning of search engines, enabling them to understand user queries and provide relevant results.

  • Virtual assistants. Siri, Alexa, and Google Assistant are examples of virtual assistants that use NLP to understand and respond to user commands.

  • Translation services. Services like Google Translate employ NLP to provide real-time language translation, breaking down language barriers and fostering communication.

  • Email filtering. NLP is used in email services to filter out spam and categorize emails, helping users manage their inboxes more effectively.

  • Social media monitoring. NLP enables the analysis of social media content to gauge public opinion, track trends, and manage online reputation.

The applications of NLP are diverse and pervasive, impacting various industries and our daily interactions with technology. Understanding these applications provides a glimpse into the transformative potential of NLP in shaping the future of technology and human interaction.

30.4. Overcoming NLP challenges

Natural Language Processing, despite its advancements, faces several challenges due to the inherent complexities and nuances of human language. Here are some of the challenges in NLP:

  • Ambiguity. Human language is often ambiguous, with words having multiple meanings, making it challenging for NLP models to interpret the correct meaning in different contexts.

  • Context. Understanding the context in which words are used is crucial for accurate interpretation, and it remains a significant challenge for NLP.

  • Sarcasm and irony. Detecting sarcasm and irony is particularly challenging as it requires understanding the intended meaning, which may be opposite to the literal meaning.

  • Cultural nuances. Language is deeply intertwined with culture, and understanding cultural nuances and idioms is essential for effective NLP.

30.5. Code

Now that we have a preliminary understanding of natural language processing, let’s train a model on disaster tweets to make these ideas concrete.

First, let’s import the necessary libraries.

import numpy as np
import pandas as pd
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

Let’s now import the dataset which contains numerous tweet texts. Each tweet is labeled as either related to a real disaster or not. Our task is to utilize Natural Language Processing (NLP) techniques to process and analyze these tweet texts. We aim to build a model that can automatically identify whether a tweet is related to a disaster or not.

train_df = pd.read_csv("https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/deep-learning/nlp/disaster_tweets_train.csv")
test_df = pd.read_csv("https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/deep-learning/nlp/disaster_tweets_test.csv")

Let’s take a quick look at our data… first, an example of what is NOT a disaster tweet.

train_df[train_df["target"] == 0]["text"].values[1]
'I love fruits'

Then an example of what is a disaster tweet.

train_df[train_df["target"] == 1]["text"].values[1]
'Forest fire near La Ronge Sask. Canada'

30.5.1. Building vectors

The theory behind the model we’ll build in this section is pretty simple: the words contained in each tweet are a good indicator of whether they’re about a real disaster or not (this is not entirely correct, but it’s a great place to start).

We’ll use scikit-learn’s CountVectorizer to count the words in each tweet and turn them into data our machine learning model can process.

Note: a vector is, in this context, a set of numbers that a machine learning model can work with. We’ll look at one in just a second.

count_vectorizer = feature_extraction.text.CountVectorizer()

## let's get counts for the first 5 tweets in the data
example_train_vectors = count_vectorizer.fit_transform(train_df["text"][0:5])
train_df["text"][0]
'Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all'
## we use .todense() here because these vectors are "sparse" (only non-zero elements are kept to save space)
print(example_train_vectors[0].todense().shape)
print(example_train_vectors[0].todense())
(1, 54)
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]

The above tells us that:

  • There are 54 unique words (or “tokens”) in the first five tweets.

  • The first tweet contains only some of those unique tokens - all of the non-zero counts above are the tokens that exist in the first tweet.
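To make the idea of a count vector less abstract, here is a minimal standard-library sketch of the same bag-of-words technique. It assumes the tokenization rule that `CountVectorizer` uses by default (lowercase the text, then keep tokens of two or more word characters). The helper names and the third example text are hypothetical.

```python
import re
from collections import Counter

# Tokens are runs of two or more word characters, so "I" is dropped.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())


def build_vocabulary(docs):
    """Map each unique token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}


def vectorize(doc, vocabulary):
    """Return one count per vocabulary token, in column order."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocabulary]


docs = [
    "Forest fire near La Ronge Sask. Canada",
    "I love fruits",
    "Fire fire everywhere",  # made-up text to show a count above 1
]
vocabulary = build_vocabulary(docs)
print(list(vocabulary))
print(vectorize(docs[2], vocabulary))
```

Each position in the vector corresponds to one vocabulary token, which is why most entries are zero for any single tweet.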

Now let’s create vectors for all of our tweets.

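## fit the vocabulary on the training tweets and vectorize them
train_vectors = count_vectorizer.fit_transform(train_df["text"])
## use .transform() (not .fit_transform()) so the test vectors share the training vocabulary
test_vectors = count_vectorizer.transform(test_df["text"])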

30.5.2. Model

As mentioned above, we assume that the words contained in each tweet are a good indicator of whether it is about a real disaster. The presence of a particular word (or set of words) in a tweet might link directly to whether or not that tweet is about a real disaster.

What we’re assuming here is a linear connection. So let’s build a linear model and see!

## Our vectors are really big, so we want to push our model's weights toward 0 without completely discounting different words - ridge regression is a good way to do this.
clf = linear_model.RidgeClassifier()
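To illustrate what "linear" means here: a linear classifier gives each token a weight, adds up the weights of the tokens in a tweet (multiplied by their counts) plus a bias, and predicts the positive class when the total is above zero. The sketch below uses hand-picked, hypothetical weights purely for illustration; a trained `RidgeClassifier` learns its own weights from the data.

```python
# Hypothetical weights for a handful of tokens; not learned from data.
WEIGHTS = {
    "fire": 1.2,
    "earthquake": 1.5,
    "evacuation": 1.0,
    "love": -0.8,
    "fruits": -0.5,
}
BIAS = -0.3


def decision_score(token_counts):
    """Weighted sum of token counts plus bias; unknown tokens contribute 0."""
    return BIAS + sum(WEIGHTS.get(tok, 0.0) * n for tok, n in token_counts.items())


def predict(token_counts):
    """Return 1 (disaster) if the score is positive, else 0."""
    return 1 if decision_score(token_counts) > 0 else 0


print(predict({"forest": 1, "fire": 1, "near": 1}))  # 1
print(predict({"love": 1, "fruits": 1}))  # 0
```

Regularization, as in ridge regression, keeps these weights small so that no single rare word dominates the prediction.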

Let’s test our model and see how well it does on the training data. For this we’ll use cross-validation - where we train on a portion of the known data, then validate it with the rest. If we do this several times (with different portions) we can get a good idea for how a particular model or method performs.

Here, we are using F1 score as the performance evaluation metric for the model.
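For readers new to the metric, F1 is the harmonic mean of precision (the share of predicted positives that are correct) and recall (the share of actual positives that are found). The following is a small hand-rolled sketch on made-up labels. The `f1_score_binary` helper is hypothetical, and in practice scikit-learn's `sklearn.metrics.f1_score` computes the same quantity.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for binary labels, treating 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives means precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f1_score_binary(y_true, y_pred))
```

Because F1 penalizes both missed disasters and false alarms, it is more informative than plain accuracy when the two classes are not equally common.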

scores = model_selection.cross_val_score(clf, train_vectors, train_df["target"], cv=3, scoring="f1")
scores
array([0.59421842, 0.56498283, 0.64082434])

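The above scores aren’t terrible! The F1 scores across the three folds range from about 0.56 to 0.64 and average roughly 0.60, which suggests that word counts carry a useful but far from perfect signal. Let’s move forward, fit the model on all of the training data, and predict on the test set.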

clf.fit(train_vectors, train_df["target"])
RidgeClassifier()
sample_submission = pd.read_csv("https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/deep-learning/nlp/disaster_tweets_test.csv")
sample_submission["target"] = clf.predict(test_vectors)
sample_submission.head()
   id keyword location                                               text  target
0   0     NaN      NaN                 Just happened a terrible car crash       0
1   2     NaN      NaN  Heard about #earthquake is different cities, s...       1
2   3     NaN      NaN  there is a forest fire at spot pond, geese are...       1
3   9     NaN      NaN           Apocalypse lighting. #Spokane #wildfires       0
4  11     NaN      NaN      Typhoon Soudelor kills 28 in China and Taiwan       1

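The above shows the model’s predictions. Because we used only a simple linear model (a ridge classifier) on raw word counts, the results are far from ideal: the first tweet, about a car crash, is predicted as 0, for example. Still, for beginners it is a good opportunity to understand natural language processing techniques.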

30.6. Your turn! 🚀

You can practice your NLP skills by following the assignment “Getting Start NLP with classification task”.

30.7. Acknowledgments

Thanks to Matt Crabtree and Phil Culliton for creating the open-source course What is Natural Language Processing (NLP)? and NLP Getting Started Tutorial. They inspired the majority of the content in this chapter.