Demystifying NLP and NLTK: A Step-by-Step Guide for Beginners

In a time when every major industry, from healthcare and finance to e-commerce and manufacturing, depends on data science and artificial intelligence, understanding human language has become a crucial task. Natural Language Processing (NLP) remains at the cutting edge of this borderland between linguistics and computer science.

🕒 Estimated reading time: 14 minutes

Natural Language Processing (NLP) is a fascinating field at the intersection of computer science, artificial intelligence, and linguistics. It enables computers to understand, interpret, and generate human language. For anyone who wants to dive into NLP using Python, the Natural Language Toolkit (NLTK) is an excellent starting point. Here, we'll take you through the core topics of NLP using NLTK and finish with a practical sentiment analysis example.

All the examples are also explained in a corresponding Google Colab notebook 👨‍🔬, making your learning even more interactive.

Understanding Natural Language Processing (NLP)

NLP is a domain within artificial intelligence (AI) that focuses on the interaction between computers and human languages. It enables machines to read, decipher, understand, and produce human language in a valuable way. NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models.

NLP encompasses various methods and algorithms that enable a machine to understand 🤖 and interpret the nuances of human language 🗣️. The main goals include:

  • Enabling machines to understand human language.

  • Building systems that can translate 🌐, summarize 📝, and map relationships between different languages 💬.

  • Automating tasks such as text classification 📚, sentiment analysis 😊😢, and language generation.

Real-World Applications of NLP

  • Search Engine Optimization (SEO): Search engines use NLP algorithms to understand search queries and web content, thus providing the most relevant results.

  • Machine Translation: Tools like Google Translate employ NLP to translate text from one language to another with impressive accuracy.

  • Chatbots and Virtual Assistants: Siri, Alexa, and other AI-driven assistants leverage NLP to understand user requests and provide appropriate responses.

  • Sentiment Analysis: This is used on social media to gauge how people feel about different products, services, or events.

  • Spam Detection: Email providers use NLP to identify and filter out spam messages effectively.

  • Speech Recognition: Converts spoken words into text, a key feature in voice-activated systems.

  • Text Classification: Categorizes documents into specific categories, such as news articles or customer support tickets.

  • Named Entity Recognition (NER): Identifies and categorizes entities like names, dates, and locations within text.

Essential Terminologies in NLP

Before diving deeper, it's useful to familiarize yourself with some critical terms in NLP:

  • Tokenization: Breaking down text into smaller units, typically words or phrases.

  • Stemming and Lemmatization: Reducing words to their base or root form.

  • Named Entity Recognition (NER): Identifying and classifying key information (entities) in the text.

  • Corpus: A large and structured set of texts used for training and testing.
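
To make the idea of a corpus concrete, NLTK ships with several ready-to-use corpora once its data has been downloaded (the installation and download steps are covered in the next section). Below is a minimal sketch using the bundled Gutenberg corpus; it assumes the NLTK data is already available.

# A minimal sketch of working with a built-in NLTK corpus.
# Assumes NLTK is installed and its data has been downloaded (see the next section).
from nltk.corpus import gutenberg

# List the plain-text files bundled in the Gutenberg corpus
print(gutenberg.fileids())

# Access one text as a list of word tokens and inspect the first few words
emma_words = gutenberg.words('austen-emma.txt')
print(emma_words[:10])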

Introducing the Natural Language Toolkit (NLTK)

To navigate the intricate maze of Natural Language Processing (NLP), sophisticated tools are essential, and NLTK (Natural Language Toolkit) stands out as a significant asset in the Python programming ecosystem.

This open-source library simplifies various text processing tasks, offering capabilities for tokenization, stemming, lemmatization, and more. NLTK is equipped with diverse corpora and lexical resources, such as WordNet, and supports a range of text classification algorithms useful for applications like sentiment analysis and spam detection.

NLTK is a rich educational resource, providing extensive documentation and tutorials, making it an excellent entry point for beginners. The library’s flexibility accommodates both basic and sophisticated text processing needs, scaling with the user's expertise.

Getting Started with NLTK

To begin using NLTK, you'll need to install it. This can be done easily with pip:

# Install NLTK from the Python Package Index (PyPI)
pip install nltk

Once you've completed the installation, you'll need to download the required NLTK data. You can do this either through a Python script or from an interactive shell.

import nltk  # Importing the nltk library, which stands for Natural Language Toolkit

# Downloading the entire nltk dataset. This includes a lot of different resources such as corpora, tokenizers, trained models, etc.
# It's useful for various natural language processing (NLP) tasks.
nltk.download('all')
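
Downloading 'all' is convenient but fairly large. If you prefer a lighter setup, you can download only the resources the examples in this guide rely on. Note that exact resource names can vary slightly between NLTK versions, so fall back to 'all' if a specific download is missing.

import nltk

# Download only the resources used in the examples below
nltk.download('punkt')                       # tokenizers for sentences and words
nltk.download('stopwords')                   # the English stop word list
nltk.download('averaged_perceptron_tagger')  # the default POS tagger
nltk.download('maxent_ne_chunker')           # the named entity chunker
nltk.download('words')                       # word list required by the NE chunker
nltk.download('wordnet')                     # WordNet, used for lemmatization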

Tokenization

Tokenization is the process of breaking down a text into smaller units called tokens, which can be words, subwords, or characters. This is the first step in NLP and is crucial because it converts the input text into manageable chunks that can be processed further. Without tokenization, the text is just a string of characters, which isn't useful for most language processing tasks. It allows the system to understand and work on each word or token individually, making subsequent analysis possible.

# Importing necessary modules from the Natural Language Toolkit (nltk) package.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Defining a string variable 'text' that contains the text we want to process.
text = "NLTK is a top platform for creating Python software that processes human language data."

# The sent_tokenize function from nltk.tokenize splits the text into a list of sentences.
sentences = sent_tokenize(text)

# Printing the list of sentences obtained after tokenization.
print("Sentences:", sentences)

# The word_tokenize function from nltk.tokenize breaks down the text into a list of words.
words = word_tokenize(text)

# Printing the list of words obtained after tokenization.
print("Words:", words)

Running the code above will produce the following output.

Sentences: ['NLTK is a top platform for creating Python software that processes human language data.']
Words: ['NLTK', 'is', 'a', 'top', 'platform', 'for', 'creating', 'Python', 'software', 'that', 'processes', 'human', 'language', 'data', '.']

Removing Stopwords

Stopwords are common words in a language (such as "the", "is", "in", etc.) that carry very little useful information for certain tasks like text classification. Removing stopwords helps in reducing the dimensionality of the data and improves the efficiency of algorithms. It allows the system to focus on the more meaningful words that are likely to contribute more significant information to tasks like sentiment analysis or topic modeling.

from nltk.corpus import stopwords  # Import the stopwords module from the Natural Language Toolkit (nltk)

# Define stop words
stop_words = set(stopwords.words('english'))  # Create a set of English stop words using nltk's pre-defined list
# Stop words are common words like 'and', 'the', 'is', etc., that are usually filtered out in text processing

# A list comprehension that keeps only the words not found in the set of stop words
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)  # Print the list of filtered words

Running the code above (which reuses the words list from the tokenization example) will produce the following output.

Filtered Words: ['NLTK', 'top', 'platform', 'creating', 'Python', 'software', 'processes', 'human', 'language', 'data', '.']

Stemming

Stemming is the process of reducing words to their base or root form. For example, "running" becomes "run." This is important because different forms of a word should ideally be recognized as the same word in many NLP tasks. By reducing words to their stems, we can treat "runs", "running", and "runner" as instances of the same base concept, which improves the consistency of the data and enhances the performance of machine learning models.

# Import the PorterStemmer class from the nltk.stem module
from nltk.stem import PorterStemmer

# Create an instance of the PorterStemmer
ps = PorterStemmer()

# 'filtered_words' is the list produced in the stopword-removal step above,
# e.g. ['NLTK', 'top', 'platform', 'creating', 'Python', 'software', ...]

# Use list comprehension to apply the stemmer to each word in the filtered_words list
# stemming reduces words to their base or root form
stemmed_words = [ps.stem(word) for word in filtered_words]

# Print the list of stemmed words
print("Stemmed Words:", stemmed_words)

Running the code above will produce the following output (note that the Porter stemmer also lowercases the tokens).

Stemmed Words: ['nltk', 'top', 'platform', 'creat', 'python', 'softwar', 'process', 'human', 'languag', 'data', '.']
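
Lemmatization, mentioned in the terminology section, is the dictionary-based counterpart to stemming: instead of chopping off suffixes, it maps each word to a valid base form (lemma). Here's a minimal sketch using NLTK's WordNetLemmatizer; it assumes the WordNet data is available (included in the 'all' download).

# Import the WordNetLemmatizer class from the nltk.stem module
from nltk.stem import WordNetLemmatizer

# Create an instance of the lemmatizer (it relies on the WordNet data)
lemmatizer = WordNetLemmatizer()

# Lemmatize a few words; the optional 'pos' argument tells WordNet which
# part of speech to assume ('v' = verb, 'a' = adjective, default is noun)
print(lemmatizer.lemmatize("running", pos="v"))  # -> run
print(lemmatizer.lemmatize("geese"))             # -> goose
print(lemmatizer.lemmatize("better", pos="a"))   # -> good

Unlike the stemmer's output above (e.g. 'languag', 'softwar'), every lemma is a real dictionary word, which is often preferable when results are shown to users.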

Part-of-Speech (POS) Tagging

POS tagging involves labeling each word in a sentence with its corresponding part of speech (e.g., noun, verb, adjective). This is important because it provides syntactic information that can be crucial for understanding the structure and meaning of sentences. For example, knowing whether "book" is used as a noun or a verb can change how the sentence is parsed and understood. POS tagging is fundamental for tasks like parsing, named entity recognition, and sentiment analysis.

# Import the 'pos_tag' function for part-of-speech tagging and 'word_tokenize' for tokenizing text into words, both from the Natural Language Toolkit (nltk) library
from nltk import pos_tag
from nltk.tokenize import word_tokenize

# Define a string variable 'text' which contains the sentence we want to analyze
text = "NLTK is a leading platform for building Python programs."

# Tokenize the text into individual words using the 'word_tokenize' function
# This splits the sentence into a list of words and punctuation marks
tokens = word_tokenize(text)

# Apply part-of-speech tagging to the list of tokens using the 'pos_tag' function
# This function returns a list of tuples, where each tuple contains a token and its corresponding POS tag
tagged_tokens = pos_tag(tokens)

# Print the list of tuples with tokens and their POS tags to the console
print(tagged_tokens)

Running the code above will produce the following output.

[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('.', '.')]

Here's what each POS tag stands for:

  • NLTK: Proper noun, singular (NNP)

  • is: Verb, 3rd person singular present (VBZ)

  • a: Determiner (DT)

  • leading: Verb, gerund or present participle (VBG)

  • platform: Noun, singular or mass (NN)

  • for: Preposition or subordinating conjunction (IN)

  • building: Verb, gerund or present participle (VBG)

  • Python: Proper noun, singular (NNP)

  • programs: Noun, plural (NNS)

  • .: Sentence-final punctuation (.)
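
If you'd rather not memorize the tag set, NLTK can describe any Penn Treebank tag for you. The small sketch below uses nltk.help.upenn_tagset, which relies on the 'tagsets' resource (included in the 'all' download used earlier).

import nltk

# Print the definition and example words for a specific Penn Treebank tag
nltk.help.upenn_tagset('VBG')

# A regular expression shows every matching tag, e.g. all noun tags
nltk.help.upenn_tagset('NN.*')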

Named Entity Recognition (NER)

NER is the task of identifying and classifying proper nouns in text into predefined categories such as names of people, organizations, locations, etc. This is essential for extracting structured information from unstructured text data. For example, in a news article, identifying mentions of companies, dates, and places can help in understanding the context and specifics of the article. It’s vital for information retrieval, question answering systems, and knowledge representation.

# Importing the necessary functions from the nltk library.
from nltk import ne_chunk  # For Named Entity Recognition (NER)
from nltk.tokenize import word_tokenize  # For breaking the text into words
from nltk.tag import pos_tag  # For Part-of-Speech (POS) tagging

# The text on which we want to perform NER.
text = "Google is based in Mountain View, California."

# Tokenizing the text. This will split the text into individual words (tokens).
tokens = word_tokenize(text)
# Example Output: ['Google', 'is', 'based', 'in', 'Mountain', 'View', ',', 'California', '.']

# POS tagging the tokenized text. This will label each token with a part-of-speech tag.
tagged_tokens = pos_tag(tokens)
# Example Output: [('Google', 'NNP'), ('is', 'VBZ'), ('based', 'VBN'), ('in', 'IN'), ('Mountain', 'NNP'), ('View', 'NNP'), (',', ','), ('California', 'NNP'), ('.', '.')]

# Performing Named Entity Recognition (NER) using the tagged tokens.
# This step identifies named entities (like names of companies, locations, etc.) in the text.
named_entities = ne_chunk(tagged_tokens)
# Example Output: Tree('S', [Tree('GPE', [('Google', 'NNP')]), ('is', 'VBZ'), ('based', 'VBN'), ('in', 'IN'), Tree('GPE', [('Mountain', 'NNP'), ('View', 'NNP')]), (',', ','), Tree('GPE', [('California', 'NNP')]), ('.', '.')])

# Printing the named entities found in the text.
print(named_entities)

Running the code above will produce the following output.

(S
  (GPE Google/NNP)
  is/VBZ
  based/VBN
  in/IN
  (GPE Mountain/NNP View/NNP)
  ,/,
  (GPE California/NNP)
  ./.)
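
The result of ne_chunk is an nltk.Tree, so you can walk it to pull out just the recognized entities and their labels. A minimal sketch, reusing the named_entities tree built above:

# Walk the tree returned by ne_chunk and collect the labeled entity chunks.
# Subtrees carry an entity label (e.g. 'GPE', 'ORGANIZATION'); plain tuples are ordinary tokens.
for chunk in named_entities:
    if hasattr(chunk, 'label'):
        entity_name = " ".join(token for token, pos in chunk.leaves())
        print(chunk.label(), "->", entity_name)

With the sentence above, this prints the three GPE chunks: Google, Mountain View, and California.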

One-Hot Encoding

One-Hot Encoding is a technique to represent categorical data as binary vectors. In NLP, it is often used to represent words in a way that machines can understand. Each word in the vocabulary is represented as a vector with all zeros except for a single one at the index corresponding to that word. This is important because machine learning models require numerical input, and one-hot encoding provides a straightforward way to convert text into numerical form. However, it’s worth noting that one-hot encoding might not be feasible for very large vocabularies due to memory constraints, and techniques like word embeddings are often used instead.

There are also sophisticated techniques like Word2Vec or GloVe that can be used for word embeddings. Let's dive into a straightforward demonstration of how one-hot encoding can be implemented in Python:

# Define an example text string
text = "This is a simple example of one-hot encoding."

# Split the text into a list of words, using spaces as the delimiter
words = text.split()

# Create a set of unique words from the list of words to form the vocabulary.
# `set` automatically removes any duplicate words.
vocab = set(words)

# Initialize an empty dictionary to hold the one-hot encodings for each word
one_hot_encoding = {}

# Iterate over the vocabulary using enumerate to get both index (i) and word
for i, word in enumerate(vocab):
    # Create a one-hot encoding for the current word.
    # This is a list of 0s and 1s where only the position corresponding to the
    # word's index is 1, and all other positions are 0.
    # [1 if i == j else 0 for j in range(len(vocab))] is a list comprehension
    # that generates the one-hot vector.
    one_hot_encoding[word] = [1 if i == j else 0 for j in range(len(vocab))]

# Print out the one-hot encoding for each word in the vocabulary
# The format `f"{word}: {encoding}"` is an f-string for easier string formatting
for word, encoding in one_hot_encoding.items():
    print(f"{word}: {encoding}")

Running the code above will produce output similar to the following. Because Python sets are unordered, the exact order of the words (and therefore the position of the 1 in each vector) may vary from run to run.

example: [1, 0, 0, 0, 0, 0, 0, 0]
simple: [0, 1, 0, 0, 0, 0, 0, 0]
a: [0, 0, 1, 0, 0, 0, 0, 0]
is: [0, 0, 0, 1, 0, 0, 0, 0]
one-hot: [0, 0, 0, 0, 1, 0, 0, 0]
encoding.: [0, 0, 0, 0, 0, 1, 0, 0]
This: [0, 0, 0, 0, 0, 0, 1, 0]
of: [0, 0, 0, 0, 0, 0, 0, 1]

Explanation:

  • Splitting Text: The text is broken down into individual words.

  • Creating Vocabulary: Identify and list the unique words from the text.

  • One-Hot Encoding: Assign a unique one-hot encoding vector to each word in the vocabulary. This vector will have a length equal to the number of words in the vocabulary. It will contain a 1 at the position corresponding to the word and 0s elsewhere.

  • Displaying Results: Print each word along with its corresponding one-hot encoding vector.

Notes:

  • Limitations: One-hot encoding results in sparse vectors, which don't capture the semantic relationships between words.

  • Alternative Methods: Techniques like Word2Vec, GloVe, or pre-trained models like BERT offer dense vector representations that encapsulate the semantic meaning and context of words.
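
To give a flavor of what a dense representation looks like, here's a minimal, purely illustrative sketch that trains a tiny Word2Vec model with the gensim library. Note that gensim is a separate install (pip install gensim) and is not part of NLTK, and a toy corpus this small only demonstrates the API, not meaningful semantics.

from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences (real applications use far more text)
sentences = [
    ["nltk", "is", "a", "toolkit", "for", "natural", "language", "processing"],
    ["word", "embeddings", "capture", "semantic", "relationships", "between", "words"],
    ["one", "hot", "encoding", "produces", "sparse", "vectors"],
]

# Train a small Word2Vec model; vector_size controls the embedding dimensionality
model = Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1, seed=42)

# Each word is now a dense 50-dimensional vector instead of a sparse one-hot vector
print(model.wv["language"].shape)   # -> (50,)
print(model.wv["language"][:5])     # first five dimensions of the vector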

Project Work & Practice

Now that you've worked through the core NLP topics using NLTK, let's put it all together in a project. We'll create a basic sentiment analysis tool that classifies text as positive or negative, using the Hugging Face Transformers library and its pipeline function.

Hugging Face's Transformers library offers accessible pre-trained models for numerous natural language processing tasks, such as sentiment analysis. In this guide, we'll utilize the pipeline function to analyze the sentiment of some sample text.

Step 1: Install the Transformers library

To get started, you'll need to install the Transformers library:

# Install the Hugging Face Transformers library.
# A deep learning backend such as PyTorch is also required (e.g. pip install torch).
pip install transformers

Step 2: Data Collection

For simplicity, we'll use a small dataset consisting of positive and negative movie reviews.

Movie reviews dataset. 👍 (Thumbs Up - Positive) | 👎 (Thumbs Down - Negative)
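
The original movie review dataset isn't reproduced here, so as a stand-in you can build an equivalent DataFrame yourself. The sketch below assumes pandas is installed and uses the same ten reviews that appear in the output of Step 3; the column name 'Review' matches what the code in Step 3 expects.

import pandas as pd

# A small stand-in dataset: five positive and five negative movie reviews,
# stored in a DataFrame with a 'Review' column (the name used in Step 3).
reviews_df = pd.DataFrame({
    "Review": [
        "An absolutely captivating story with incredible performances.",
        "Terrible direction and a plot that made no sense.",
        "A visual masterpiece that had me enthralled from start to finish.",
        "The acting was subpar and the dialogue was cringe-worthy.",
        "One of the most thought-provoking films I have ever seen.",
        "A boring, uninspired movie that felt like a waste of time.",
        "Outstanding cinematography and a memorable soundtrack.",
        "A poorly executed film with lackluster performances from the cast.",
        "A deeply emotional experience that resonated with me on many levels.",
        "An overhyped movie that did not live up to expectations.",
    ]
})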

Step 3: Sentiment Analysis

Let me show you how to utilize the pipeline function to perform sentiment analysis:

from transformers import pipeline

# Initialize the sentiment analysis pipeline
# The 'pipeline' function from the transformers library creates a pipeline for a specific task.
# Here, 'sentiment-analysis' is the task we are interested in.
# The pipeline will automatically download and initialize appropriate pre-trained models.
sentiment_pipeline = pipeline('sentiment-analysis')

# Example text
# Assuming 'reviews_df' is a DataFrame that contains a column named 'Review'
# We convert this column to a list of text reviews.
text = reviews_df['Review'].tolist()

# Perform sentiment analysis
# We use the initialized sentiment_pipeline to analyze the sentiment of each review in the text list.
# The pipeline takes a list of texts and returns a list of sentiment analysis results.
results = sentiment_pipeline(text)

# Print the results
# We iterate over the results to display the sentiment and confidence score for each review.
for i, result in enumerate(results):
    # Print the original text from the text list
    print(f"Text: {text[i]}")

    # Print the sentiment label (e.g., "POSITIVE" or "NEGATIVE") and the confidence score rounded to 2 decimal places
    print(f"Sentiment: {result['label']}, Confidence: {result['score']:.2f}")
    print()  # Print a blank line for better readability

Running the code above will produce the following output.

Text: An absolutely captivating story with incredible performances.
Sentiment: POSITIVE, Confidence: 1.00

Text: Terrible direction and a plot that made no sense.
Sentiment: NEGATIVE, Confidence: 1.00

Text: A visual masterpiece that had me enthralled from start to finish.
Sentiment: POSITIVE, Confidence: 1.00

Text: The acting was subpar and the dialogue was cringe-worthy.
Sentiment: NEGATIVE, Confidence: 1.00

Text: One of the most thought-provoking films I have ever seen.
Sentiment: POSITIVE, Confidence: 1.00

Text: A boring, uninspired movie that felt like a waste of time.
Sentiment: NEGATIVE, Confidence: 1.00

Text: Outstanding cinematography and a memorable soundtrack.
Sentiment: POSITIVE, Confidence: 1.00

Text: A poorly executed film with lackluster performances from the cast.
Sentiment: NEGATIVE, Confidence: 1.00

Text: A deeply emotional experience that resonated with me on many levels.
Sentiment: POSITIVE, Confidence: 1.00

Text: An overhyped movie that did not live up to expectations.
Sentiment: NEGATIVE, Confidence: 1.00

Explanation:

  1. First, we initialize the sentiment analysis pipeline with the pipeline function from the transformers library.

  2. Next, we gather the review texts to analyze by converting the 'Review' column of reviews_df into a plain Python list.

  3. We then pass that list to the pipeline, which evaluates each text and returns its sentiment (positive 👍 or negative 👎) along with a confidence score.

  4. Finally, we print out the results, showing both the sentiment and the confidence score for each review.
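
As a quick practice exercise, you can also feed the same pipeline a single sentence of your own; the label and score come back in the same format.

# Reuse the sentiment_pipeline created above on a single custom sentence
custom = sentiment_pipeline("I really enjoyed learning NLP with NLTK and Transformers!")
print(custom)  # e.g. [{'label': 'POSITIVE', 'score': ...}]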

Conclusion

Diving into Natural Language Processing using NLTK can open many doors in text analytics, AI, and machine learning. Mastering NLTK involves gradual learning — starting with pre-built datasets and evolving into creating personalized models. With increased proficiency, one can develop advanced applications capable of interpreting and generating human language, bridging the gap between technology and the complexities of human communication.

By following this guide and the practical examples, you should now have a strong foundation in NLP using NLTK. The final project work helps you put all the learned concepts into practice, reinforcing your understanding and preparing you for more complex tasks.

🔔 Subscribe to the InfinitePy Newsletter for more resources and a step-by-step approach to learning Python, and stay up to date with the latest trends and practical tips.

InfinitePy Newsletter - Your source for Python learning and inspiration.