AI & ML Week-11
Text data derived from natural language is unstructured and noisy. Text preprocessing
involves transforming text into a clean and consistent format that can then be fed into a
model for further analysis and learning.
Text preprocessing is an important step for natural language processing (NLP)
tasks.
It transforms text into a more digestible form so that machine learning algorithms
can perform better, and depending on how well the data has been preprocessed; the
results are seen.
Text preprocessing improves the performance of an NLP system.
For tasks such as sentiment analysis, document categorization, and document retrieval
based on user queries, adding a text preprocessing layer improves accuracy.
After installing NLTK, enter the Python shell in your terminal by simply typing python
Type import nltk
Type nltk.download('all')
The above installation will take quite some time due to the massive number of tokenizers,
chunkers, other algorithms, and all of the corpora to be downloaded.
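A minimal sketch of the setup described above. Downloading 'all' is large, so the individual resources the later examples rely on are listed as an alternative:

import nltk

# Download every NLTK resource (tokenizers, chunkers, taggers, corpora, ...)
nltk.download('all')

# Alternative: download only the resources used in this week's examples
# nltk.download('punkt')                       # sentence/word tokenizers
# nltk.download('stopwords')                   # stop word lists
# nltk.download('words')                       # correct spellings for spell correction
# nltk.download('wordnet')                     # WordNet lemmatizer
# nltk.download('averaged_perceptron_tagger')  # POS tagger
# nltk.download('maxent_ne_chunker')           # named entity chunker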
Tokenization
A token in a text document refers to each “entity” that is a part of whatever was split up
based on rules. For example, each word is a token when a sentence is “tokenized” into
words. Each sentence can also be a token if you tokenize the sentences out of a
paragraph. So, basically, tokenizing involves splitting sentences and words from the body
of the text.
1. Sentence tokenization
The given document is divided or tokenized into sentences.
Syntax:
nltk.tokenize.sent_tokenize(text, language='english')
Parameters:
text (str) – text to split into sentences
language (str) – the model name in the Punkt corpus. By default it is English.
Example:
The following example reads a text file and tokenizes it into sentences.
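A minimal sketch, assuming the text is stored in a file named sample.txt (the file name is illustrative):

from nltk.tokenize import sent_tokenize

# Read the raw text from a file
with open('sample.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Split the text into sentences using the Punkt sentence tokenizer
sentences = sent_tokenize(text)
print(len(sentences))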
Output:
• Print the sentences to see how the method has divided the text into sentences.
Output:
2. Word tokenization
The given document is divided or tokenized into words.
Syntax:
nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)
returns a tokenized copy of text, using NLTK’s recommended word tokenizer (currently
an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the
specified language).
Parameters
text (str) – text to split into words
language (str) – the model name in the Punkt corpus
preserve_line (bool) – A flag to decide whether to sentence tokenize the text or not.
Example:
For the text file used in previous example, let us find the word tokens.
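A minimal sketch, continuing with the illustrative sample.txt file:

from nltk.tokenize import word_tokenize

with open('sample.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Split the text into word-level tokens
words = word_tokenize(text)
print(words[:20])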
Output:
The tokens generated will contain special characters such as commas, dots, parentheses,
apostrophes, etc. We may have to remove these special characters and stop words to
generate the vocabulary of our text document.
Example:
To visualize the frequency distribution, let us consider opinions given by customers about
a hotel.
• Read the “Opinion.txt” text file which contains opinions given by multiple customers
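A minimal sketch of building the frequency distribution from Opinion.txt:

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

with open('Opinion.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Count how often each token occurs
tokens = word_tokenize(text)
fdist = FreqDist(tokens)

print(fdist.most_common(10))
fdist.plot(30, cumulative=False)   # requires matplotlib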
Output:
As we can see, the frequency distribution contains frequencies of tokens like ‘the’, ’to’,
‘of’, which are considered stop words. It also contains special characters, and tokens
with different cases are treated as different tokens. To overcome these issues, we have to
perform text cleanup as follows:
Removing Stopwords
Stop words are a set of commonly used words in any language. For example, in
English, “the”, “is” and “and” would easily qualify as stop words.
In NLP and text mining applications, stop words are removed to eliminate unimportant
words, allowing applications to focus on the important words instead.
NLTK has an inbuilt list of stop words. We can also create our own stop word list
and use it based on our requirements. The following code removes the stop words
from the text.
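A minimal sketch of the cleanup described above: lower-casing, keeping only alphabetic tokens, and dropping NLTK's English stop words (the Opinion.txt file name carries over from the earlier example):

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

with open('Opinion.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()

stop_words = set(stopwords.words('english'))

# Keep only alphabetic tokens that are not stop words
clean_tokens = [t for t in word_tokenize(text)
                if t.isalpha() and t not in stop_words]

fdist = FreqDist(clean_tokens)
fdist.plot(30, cumulative=False)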
When we draw the frequency distribution histogram after the text cleaning, it would look like the below figure.
Example:
Output:
We can also generate a word cloud from the cleaned text and set a custom image as a mask to shape it as follows:
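A minimal sketch using the third-party wordcloud package; the mask image file name cloud_mask.png is purely illustrative (any black-and-white silhouette image works):

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

with open('Opinion.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# White areas of the mask are left empty; words are drawn inside the dark shape
mask = np.array(Image.open('cloud_mask.png'))

wc = WordCloud(background_color='white', mask=mask).generate(text)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()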
SPELL CORRECTION
Spelling correction is the process of correcting a word’s spelling, for example “lisr”
instead of “list”.
Spelling correction is important for many NLP applications like web search
engines, text summarization, sentiment analysis etc.
Most approaches use parallel data of noisy and correct word mappings from
different sources as training data for automatic spelling correction.
Here we are going to use Levenshtein distance or Edit Distance method for spelling
correction.
This method takes a list of misspelled words and gives the suggestion of the correct
word for each incorrect word.
It tries to find a word in the list of correct spellings that has the shortest distance
and the same initial letter as the misspelled word. It then returns the word which
matches the given criteria.
Edit Distance
Edit Distance or Levenshtein distance between two words is the minimum number
of single-character edits (insertions, deletions or substitutions) required to change
one word into the other.
Edit Distance measures the dissimilarity between two strings by finding the minimum
number of operations needed to transform one string into the other. The
transformations that can be performed are: inserting a character, deleting a character,
and substituting one character for another.
Step 2: Now, we download the ‘words’ resource (which contains correct spellings of
words) from the nltk downloader and import it through nltk.corpus and assign it to
correct_words.
Step 3: We define the list of incorrect_words for which we need the correct spellings.
Then, for each word in the incorrect_words list, we calculate the edit distance to every
correctly spelled word that has the same initial letter, sort the distances in ascending
order so the shortest distance is on top, and print the word with the smallest distance.
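A minimal sketch of the procedure in the steps above, using nltk.edit_distance and the 'words' corpus (the incorrect_words list is illustrative):

import nltk
from nltk.corpus import words
from nltk.metrics.distance import edit_distance

nltk.download('words')            # correct spellings (skip if 'all' was downloaded)
correct_words = words.words()

# Illustrative misspellings
incorrect_words = ['happpy', 'azmaing', 'intelliengt']

for word in incorrect_words:
    # Edit distance to every correct word sharing the same initial letter
    candidates = [(edit_distance(word, w), w)
                  for w in correct_words if w[0] == word[0]]
    # Sort ascending so the closest word comes first, then print the suggestion
    print(word, '->', sorted(candidates)[0][1])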
NORMALIZATION
Stemming
Stemming is the process of reducing words to their root form. It is a rule-based
process for removing inflectional forms from a given token. The output of this
process is called the stem.
For example, “retrieval”, “retrieved”, “retrieves” reduce to the stem “retrieve”.
Another example of stemming can be "likes", "liked", "likely", "liking" are reduced
to root word "like".
The objective of stemming is to reduce related words to the same stem even if the
stem is not a dictionary word.
Stemming is not always a good process for normalization, since it can sometimes produce
non-meaningful words which are not present in the dictionary.
Step 1: First of all, we read the text from a file and perform text cleaning such as
converting to lower case and removing punctuation and numeric data from the text.
Step 2: The text is then split into tokens or words and each word is fed to the Porter
Stemmer to get the stem word.
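A minimal sketch of the two steps above (the file name sample.txt is again illustrative):

import re
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Step 1: read the text and clean it (lower case, strip punctuation and digits)
with open('sample.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()
text = re.sub(r'[^a-z\s]', ' ', text)

# Step 2: tokenize and feed each word to the Porter stemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in word_tokenize(text)]
print(stems[:20])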
As we can see, some words are converted to a correct root word: ‘refers’ is reduced to
‘refer’ and ‘processing’ is reduced to ‘process’.
But most of the words are stemmed to a form which is not present in the dictionary and
does not carry any meaning.
Stemming a word or sentence may result in words that are not actual words.
Lemmatization
Lemmatization is a systematic process of removing the inflectional form of a token
and transforming it into a base word. It makes use of word structure, vocabulary, part
of speech tags, and grammar relations.
Unlike stemming, lemmatization reduces words to their base word, reducing the
inflected words properly and ensuring that the root word belongs to the language.
It is usually more sophisticated than stemming,
since stemmers work on an individual word without knowledge of the context. In
lemmatization, the root word is called the lemma.
For example, “am”, “are”, “is” will be converted to “be”. Similarly, ‘running’,
‘runs’, ‘ran’ will be replaced by ‘run’.
Just like for stemming, there are different lemmatizers. Here we are going to use
WordNet lemmatizer.
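A minimal sketch with NLTK's WordNet lemmatizer; passing a part-of-speech hint via the pos argument ('v' for verb) is what lets it map inflected verb forms to their lemma (the example words are illustrative):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS hint the default part of speech is noun, so 'running' is unchanged
print(lemmatizer.lemmatize('running'))           # running
# With pos='v' the inflected verb forms reduce to their lemma
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('ran', pos='v'))      # run
print(lemmatizer.lemmatize('is', pos='v'))       # be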
PARTS OF SPEECH TAGGING
POS tagging is the task of assigning each word in a sentence the part of speech that it
assumes in that sentence. The primary target of POS tagging is to identify the
grammatical group of a given word: whether it is a noun, pronoun, adjective, verb,
adverb, etc., based on the context.
Most used tags with examples:
Noun - Daniel, London, table, dog, teacher, pen, city, happiness, hope
Verb - go, speak, run, eat, play, live, walk, have, like, are, is
Adjective - big, happy, green, young, fun, crazy, three
Adverb - slowly, quietly, very, always, never, too, well, tomorrow
Preposition - at, on, in, from, with, near, between, about, under
Conjunction - and, or, but, because, so, yet, unless, since, if
Pronoun - I, you, we, they, he, she, it, me, us, them, him, her, this
Interjection - Ouch! Wow! Great! Help! Oh! Hey! Hi!
POS tagging is a supervised learning solution that uses features like the previous word,
the next word, whether the first letter is capitalized, etc. NLTK has a function to get POS
tags, and it works after the tokenization process.
Implementation using NLTK:
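A minimal sketch (the sentence is illustrative):

import nltk
from nltk.tokenize import word_tokenize

sentence = "NLTK is a leading platform for building Python programs."

# Tokenize first, then tag each token with its part of speech
tokens = word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ...]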
The following table provides the description of the various tags produced using the
pos_tag() method of NLTK.
NAMED ENTITY RECOGNITION
Named entity recognition (NER) is one of the most popular data preprocessing tasks. It
involves the identification of key information in the text and its classification into a set of
predefined categories.
NER is used in many fields in Natural Language Processing (NLP), and it can help
answer many real-world questions, such as:
Which companies were mentioned in the news article?
Were specified products mentioned in complaints or reviews?
Does the tweet contain the name of a person? Does the tweet contain this person’s location?
Some of the most important categories used in NER are:
Person
Organization
Place/Location
Implementation using NLTK
Step 1: Import the nltk package, and download all the necessary modules.
Step 2: In this step,
First, we return a sentence-tokenized copy of the text using
nltk.sent_tokenize(sentence) and iterate over it.
Next, we tokenize the sentence and find the parts of speech of each word; we’ll run
nltk.pos_tag(nltk.word_tokenize(sent)) individually to see the outputs:
Tokenizing words using nltk: tokenizes the sentences into the list of words.
POS tagging using nltk: identifies the part of speech of each word and returns an
array of tuples with the words and their parts of speech.
Chunking on POS: lastly, we perform a chunking operation that returns a nested
nltk.tree.Tree object, so that we can iterate or traverse the Tree object to get to the
named entities.
Finally, we get to the output if there is any entity label in the chunk:
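A minimal sketch of the steps above, using NLTK's pre-trained ne_chunk chunker (the sentence is illustrative):

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

text = "Sundar Pichai is the CEO of Google, which is headquartered in California."

for sent in nltk.sent_tokenize(text):
    # Tokenize, POS-tag, then chunk into a nested nltk.tree.Tree of named entities
    tree = ne_chunk(pos_tag(word_tokenize(sent)))
    for chunk in tree:
        # Named-entity subtrees carry a label such as PERSON, ORGANIZATION or GPE
        if hasattr(chunk, 'label'):
            entity = ' '.join(token for token, tag in chunk.leaves())
            print(chunk.label(), '->', entity)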
VECTORIZER
Vectorization, or word embedding, is jargon for a classic approach of converting input data
from its raw format (i.e. text) into vectors of real numbers, which is the format that
Machine Learning models support.
After text cleaning and normalization, the processed text is converted to feature vectors so
that we can feed it to machine learning applications.
In Machine Learning, vectorization is a step in feature extraction. The idea is to get some
distinct features out of the text for the model to train on, by converting text to numerical
vectors.
Terminologies
Document: A document is a single text data point. For Example, a review of a
particular product by the user.
Corpus: It is a collection of all the documents present in our dataset.
Feature: Every unique word in the corpus is considered as a feature.
For Example, let’s consider the 2 documents shown below:
D1: Dog hates a cat. It loves to go out and play.
D2: Cat loves to play with a ball.
Corpus = “Dog hates a cat. It loves to go out and play. Cat loves to play with a ball.”
Count Vectorizer
It is one of the simplest ways of doing text vectorization.
It creates a document term matrix, which is a set of dummy variables that indicates
if a particular word appears in the document.
Count vectorizer will fit and learn the word vocabulary and try to create a
document term matrix in which the individual cells denote the frequency of that
word in a particular document, which is also known as term frequency, and the
columns are dedicated to each word in the corpus.
Matrix Formulation
Consider a Corpus C containing D documents {d1,d2…..dD} from which we
extract N unique tokens.
Now, the dictionary consists of these N tokens, and the size of the Count Vector
matrix M formed is given by D X N.
Each row in the matrix M describes the frequency of tokens present in the
document Di.
Now, a column can also be understood as a word vector for the corresponding word in the
matrix M. For Example, for the above matrix formed, let’s see the word vectors
generated.
Vector for ‘smart’ is [2,1],
Vector for ‘Chirag’ is [0, 1], and so on.
Implementation:
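A minimal sketch with scikit-learn's CountVectorizer. The two documents are an assumed corpus chosen so that the resulting word vectors match the ones quoted above ('smart' → [2, 1] and 'Chirag' → [0, 1]):

from sklearn.feature_extraction.text import CountVectorizer

# Assumed illustrative corpus of D = 2 documents
corpus = [
    "He is a smart boy. She is also smart.",   # d1
    "Chirag is a smart person."                # d2
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)     # document-term matrix of size D x N

print(vectorizer.get_feature_names_out())  # the N unique tokens (features)
print(dtm.toarray())
# The column for 'smart' is [2, 1] and the column for 'chirag' is [0, 1]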
N-grams
Similar to the count vectorization technique, in the N-Gram method, a document
term matrix is generated and each cell represents the count.
The difference in the N-grams method is that the count represents the combination
of adjacent words of length n in the title. Count vectorization is N-Gram where
n=1.
For example, “I love this article” has four words, so n can range from 1 to 4:
if n=2, i.e. bigram, then the columns would be [“I love”, “love this”, “this article”]
if n=3, i.e. trigram, then the columns would be [“I love this”, “love this article”]
if n=4, i.e. four-gram, then the column would be [“I love this article”]
The n value is chosen based on performance.
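A minimal sketch: the same CountVectorizer builds the bigram columns listed above through its ngram_range parameter (the token_pattern is widened so the one-letter word "I" is kept):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love this article"]

# ngram_range=(2, 2) keeps only bigrams; (1, 1) is plain count vectorization
vectorizer = CountVectorizer(ngram_range=(2, 2),
                             lowercase=False,
                             token_pattern=r"(?u)\b\w+\b")
dtm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['I love' 'love this' 'this article']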
TF-IDF VECTORIZATION
All the methods discussed in the previous sessions were based on the Bag of
Words model which is simple and works well.
But the problem with that is that it treats all words equally. As a result, it cannot
distinguish very common words from rare words.
So, to solve this problem, TF-IDF comes into the picture. TF-IDF is made up of
two terms: Term Frequency and Inverse Document Frequency
Term Frequency
Term frequency denotes the frequency of a word in a document. For a specified
word, it is defined as the ratio of the number of times a word appears in a
document to the total number of words in the document.
Or, it is also defined in the following manner:
It is the number of times a word (x) occurs in a particular document (y) divided by
the total number of words in that document.
For Example, Consider the following sentence ‘Cat loves to play with ball’
For the above sentence, the term frequency value for word cat will be:
tf(‘cat’) = 1/6
This number will always be ≤ 1;
thus, we can judge how frequent a word is relative to all of the words in the
document.
Inverse Document Frequency
Document frequency (DF) tells us about the proportion of documents that contain a
certain word. IDF is the reciprocal of the document frequency.
The intuition behind using IDF is that the more common a word is across all documents,
the lesser its importance is for the current document. A logarithm is taken to dampen the
effect of (normalize) IDF in the final calculation. The final TF-IDF score comes out to
be:
TF-IDF = TF * IDF, where IDF = log(N / df), N is the total number of documents and df
is the number of documents containing the word.
Implementation
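A minimal sketch with scikit-learn's TfidfVectorizer, reusing the two-document corpus from the Count Vectorizer terminology example (scikit-learn applies a smoothed variant of the IDF formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Dog hates a cat. It loves to go out and play.",   # D1
    "Cat loves to play with a ball."                   # D2
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
# Words appearing in both documents ('cat', 'loves', 'play', 'to') get lower
# weights than words unique to a single document ('dog', 'ball', 'hates', ...)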
NLP PIPELINE
The set of ordered stages one should go through from a labeled dataset to creating a
classifier that can be applied to new samples is called the NLP pipeline. An NLP pipeline is
the set of steps followed to build end-to-end NLP software.
1. Data Acquisition
In the data acquisition step, we collect the data required for building our NLP software.
We can collect the data using any of the following methods:
Survey – We can conduct a survey to collect data and then manually label it
Public Dataset – If a public dataset is available for our problem statement
Web Scraping – Scraping data using Beautiful Soup or other libraries
2. Text Preprocessing
Once the data collection step is done, we cannot use this data as is for model
building. We have to do text preprocessing.
It helps to remove unhelpful parts of the data, or noise, by converting all characters
to lowercase and removing stop words, punctuation marks, and typos from the data.
After text preprocessing, the accuracy of the model increases.
Advanced preprocessing – In this step we also do POS tagging, named entity recognition, etc.
3. Feature Engineering
After text cleaning and normalization, the processed text is converted to feature
vectors so that we can feed it to machine learning applications.
Feature Engineering means converting text data to numerical data.
But why is it required to convert text data to numerical data? Because many
Machine Learning algorithms and almost all Deep Learning Architectures are not
capable of processing strings or plain text in their raw form.
This step is also called Feature extraction from text.
4. Model Building
In the modeling step, we try to create a model based on the cleaned data. Here also, we
can use multiple approaches to build the model based on the problem statement.
Approaches to building model –
Heuristic Approach
Machine Learning Approach
Deep Learning Approach
5. Model Evaluation
In the model evaluation step, we can use different metrics for evaluation, such as Accuracy,
Recall, Confusion Matrix, Perplexity, etc.
6. Deployment
In the deployment step, we have to deploy our model on the cloud for the users.
Deployment has three stages: deployment, monitoring, and retraining or model update.