AI & ML Week-11
Text data derived from natural language is unstructured and noisy. Text preprocessing
involves transforming text into a clean and consistent format that can then be fed into a
model for further analysis and learning.
Text preprocessing is an important step for natural language processing (NLP)
tasks.
It transforms text into a more digestible form so that machine learning algorithms
can perform better, and depending on how well the data has been preprocessed; the
results are seen.
Text preprocessing improves the performance of an NLP system.
For tasks such as sentiment analysis, document categorization, and document retrieval
based on user queries, adding a text preprocessing layer improves accuracy.
After installing NLTK, enter the Python shell in your terminal by simply typing python
Type import nltk
Type nltk.download('all')
The above installation will take quite some time due to the massive number of tokenizers,
chunkers, other algorithms, and all of the corpora to be downloaded.
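A minimal sketch of the setup described above. Downloading 'all' is large, so the individual resources the later examples rely on are listed as an alternative:

import nltk

# Download every NLTK resource (tokenizers, chunkers, taggers, corpora, ...)
nltk.download('all')

# Alternative: download only the resources used in this week's examples
# nltk.download('punkt')                       # sentence/word tokenizers
# nltk.download('stopwords')                   # stop word lists
# nltk.download('words')                       # correct spellings for spell correction
# nltk.download('wordnet')                     # WordNet lemmatizer
# nltk.download('averaged_perceptron_tagger')  # POS tagger
# nltk.download('maxent_ne_chunker')           # named entity chunker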
Tokenization
A token in a text document refers to each “entity” that is a part of whatever was split up
based on rules. For example, each word is a token when a sentence is “tokenized” into
words. Each sentence can also be a token if you tokenize the sentences out of a
paragraph. So, basically, tokenizing involves splitting sentences and words from the body
of the text.
1. Sentence tokenization
The given document is divided or tokenized into sentences.
Syntax:
nltk.tokenize.sent_tokenize(text, language='english')
Parameters:
text (str) – text to split into sentences
language (str) – the model name in the Punkt corpus. By default it is English.
Example:
The following example reads a text file and tokenizes it into sentences.
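A minimal sketch, assuming the text is stored in a file named sample.txt (the file name is illustrative):

from nltk.tokenize import sent_tokenize

# Read the raw text from a file
with open('sample.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Split the text into sentences using the Punkt sentence tokenizer
sentences = sent_tokenize(text)
print(len(sentences))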
Output:
• Print the sentences to see how the method has divided the text into sentences.
Output:
2. Word tokenization
The given document is divided or tokenized into words.
Syntax:
nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)
returns a tokenized copy of text, using NLTK’s recommended word tokenizer (currently
an improved TreebankWordTokenizer along with PunktSentenceTokenizer for the
specified language).
Parameters
text (str) – text to split into words
language (str) – the model name in the Punkt corpus
preserve_line (bool) – A flag to decide whether to sentence tokenize the text or not.
Example:
For the text file used in previous example, let us find the word tokens.
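A minimal sketch, continuing with the illustrative sample.txt file:

from nltk.tokenize import word_tokenize

with open('sample.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Split the text into word-level tokens
words = word_tokenize(text)
print(words[:20])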
Output:
The tokens generated will contain special characters such as commas, dots, parentheses,
apostrophes, etc. We may have to remove these special characters and stop words to
generate the vocabulary of our text document.
Example:
To visualize the frequency distribution, let us consider opinions given by customers about
a hotel.
• Read the “Opinion.txt” text file which contains opinions given by multiple customers
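A minimal sketch of building the frequency distribution from Opinion.txt:

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist

with open('Opinion.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Count how often each token occurs
tokens = word_tokenize(text)
fdist = FreqDist(tokens)

print(fdist.most_common(10))
fdist.plot(30, cumulative=False)   # requires matplotlib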
Output:
As we can see, the frequency distribution contains frequencies of tokens like ‘the’, ’to’,
‘of’, which are considered stop words. It also contains special characters, and tokens
with different cases are treated as different tokens. To overcome these issues, we have to
perform text cleanup as follows:
Removing Stopwords
Stop words are a set of commonly used words in any language. For example, in
English, “the”, “is” and “and” would easily qualify as stop words.
In NLP and text mining applications, stop words are removed to eliminate unimportant
words, allowing applications to focus on the important words instead.
NLTK has an inbuilt list of stop words. We can also create our own stop word list
and use it based on our requirements. The following code removes the stop words
from the text.
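A minimal sketch of the cleanup described above: lower-casing, keeping only alphabetic tokens, and dropping NLTK's English stop words (the Opinion.txt file name carries over from the earlier example):

from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize

with open('Opinion.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()

stop_words = set(stopwords.words('english'))

# Keep only alphabetic tokens that are not stop words
clean_tokens = [t for t in word_tokenize(text)
                if t.isalpha() and t not in stop_words]

fdist = FreqDist(clean_tokens)
fdist.plot(30, cumulative=False)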
When we draw the frequency distribution histogram after the text cleaning, it would look like the below figure.
Example:
Output:
We can also generate a word cloud from the cleaned text and set a custom image as a mask to shape it as follows:
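A minimal sketch using the third-party wordcloud package; the mask image file name cloud_mask.png is purely illustrative (any black-and-white silhouette image works):

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud

with open('Opinion.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# White areas of the mask are left empty; words are drawn inside the dark shape
mask = np.array(Image.open('cloud_mask.png'))

wc = WordCloud(background_color='white', mask=mask).generate(text)

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()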
SPELL CORRECTION
Spelling correction is the process of correcting a word’s spelling, for example “lisr”
instead of “list”.
Spelling correction is important for many NLP applications like web search
engines, text summarization, sentiment analysis etc.
Most approaches use parallel data of noisy and correct word mappings from
different sources as training data for automatic spelling correction.
Here we are going to use Levenshtein distance or Edit Distance method for spelling
correction.
This method takes a list of misspelled words and gives the suggestion of the correct
word for each incorrect word.
It tries to find a word in the list of correct spellings that has the shortest distance
and the same initial letter as the misspelled word. It then returns the word which
matches the given criteria.
Edit Distance
Edit Distance or Levenshtein distance between two words is the minimum number
of single-character edits (insertions, deletions or substitutions) required to change
one word into the other.
Edit Distance measures the dissimilarity between two strings by finding the minimum
number of operations needed to transform one string into the other. The
transformations that can be performed are: inserting a character, deleting a character,
and substituting one character for another.
Step 2: Now, we download the ‘words’ resource (which contains correct spellings of
words) from the nltk downloader and import it through nltk.corpus and assign it to
correct_words.
Step 3: We define the list of incorrect_words for which we need the correct spellings.
Then, for each word in the incorrect_words list, we calculate the edit distance to every
correctly spelled word that has the same initial letter, sort the distances in ascending
order so the shortest distance is on top, and print the word with the smallest distance.
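A minimal sketch of the procedure in the steps above, using nltk.edit_distance and the 'words' corpus (the incorrect_words list is illustrative):

import nltk
from nltk.corpus import words
from nltk.metrics.distance import edit_distance

nltk.download('words')            # correct spellings (skip if 'all' was downloaded)
correct_words = words.words()

# Illustrative misspellings
incorrect_words = ['happpy', 'azmaing', 'intelliengt']

for word in incorrect_words:
    # Edit distance to every correct word sharing the same initial letter
    candidates = [(edit_distance(word, w), w)
                  for w in correct_words if w[0] == word[0]]
    # Sort ascending so the closest word comes first, then print the suggestion
    print(word, '->', sorted(candidates)[0][1])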
NORMALIZATION
Stemming
Stemming is the process of reducing words to their root form. It is a rule-based
process for removing inflectional forms from a given token. The output of this
process is called the stem.
For example, “retrieval”, “retrieved”, “retrieves” reduce to the stem “retrieve”.
Another example of stemming can be "likes", "liked", "likely", "liking" are reduced
to root word "like".
The objective of stemming is to reduce related words to the same stem even if the
stem is not a dictionary word.
Stemming is not always a good process for normalization, since it can sometimes produce
non-meaningful words which are not present in the dictionary.
Step 1: First of all, we read the text from a file and perform text cleaning such as
converting to lower case and removing punctuation and numeric data from the text.
Step 2: The text is then split into tokens or words and each word is fed to the Porter
Stemmer to get the stem word.
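A minimal sketch of the two steps above (the file name sample.txt is again illustrative):

import re
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Step 1: read the text and clean it (lower case, strip punctuation and digits)
with open('sample.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()
text = re.sub(r'[^a-z\s]', ' ', text)

# Step 2: tokenize and feed each word to the Porter stemmer
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in word_tokenize(text)]
print(stems[:20])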
As we can see, some words are converted to a correct root word: ‘refers’ is reduced to
‘refer’ and ‘processing’ is reduced to ‘process’.
But most of the words are stemmed to a form which is not present in the dictionary and
does not carry any meaning.
Stemming a word or sentence may result in words that are not actual words.
Lemmatization
Lemmatization is a systematic process of removing the inflectional form of a token
and transforming it into a base word. It makes use of word structure, vocabulary, part
of speech tags, and grammar relations.
Unlike stemming, lemmatization reduces words to their base word, reducing the
inflected words properly and ensuring that the root word belongs to the language.
It is usually more sophisticated than stemming,
since stemmers work on an individual word without knowledge of the context. In
lemmatization, the root word is called the lemma.
For example, “am”, “are”, “is” will be converted to “be”. Similarly, ‘running’,
‘runs’, ‘ran’ will be replaced by ‘run’.
Just like for stemming, there are different lemmatizers. Here we are going to use
WordNet lemmatizer.
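A minimal sketch with NLTK's WordNet lemmatizer; passing a part-of-speech hint via the pos argument ('v' for verb) is what lets it map inflected verb forms to their lemma (the example words are illustrative):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a POS hint the default part of speech is noun, so 'running' is unchanged
print(lemmatizer.lemmatize('running'))           # running
# With pos='v' the inflected verb forms reduce to their lemma
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('ran', pos='v'))      # run
print(lemmatizer.lemmatize('is', pos='v'))       # be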
PARTS OF SPEECH TAGGING
POS tagging is the task of assigning each word in a sentence the part of speech that it
assumes in that sentence. The primary target of POS tagging is to identify the
grammatical group of a given word: whether it is a noun, pronoun, adjective, verb,
adverb, etc., based on the context.
Most used tags with examples:
Noun - Daniel, London, table, dog, teacher, pen, city, happiness, hope
Verb - go, speak, run, eat, play, live, walk, have, like, are, is
Adjective - big, happy, green, young, fun, crazy, three
Adverb - slowly, quietly, very, always, never, too, well, tomorrow
Preposition - at, on, in, from, with, near, between, about, under
Conjunction - and, or, but, because, so, yet, unless, since, if
Pronoun - I, you, we, they, he, she, it, me, us, them, him, her, this
Interjection - Ouch! Wow! Great! Help! Oh! Hey! Hi!
POS tagging is a supervised learning solution that uses features like the previous word,
the next word, whether the first letter is capitalized, etc. NLTK has a function to get POS
tags, and it works after the tokenization process.
Implementation using NLTK:
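A minimal sketch (the sentence is illustrative):

import nltk
from nltk.tokenize import word_tokenize

sentence = "NLTK is a leading platform for building Python programs."

# Tokenize first, then tag each token with its part of speech
tokens = word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ...]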
The following table provides the description of the various tags produced using the
pos_tag() method of NLTK.
NAMED ENTITY RECOGNITION
Named entity recognition (NER) is one of the most popular data preprocessing tasks. It
involves the identification of key information in the text and its classification into a set of
predefined categories.
NER is used in many fields in Natural Language Processing (NLP), and it can help
answer many real-world questions, such as:
Which companies were mentioned in the news article?
Were specified products mentioned in complaints or reviews?
Does the tweet contain the name of a person? Does the tweet contain this person’s location?
Some of the most important categories used in NER are:
Person
Organization
Place/Location
Implementation using NLTK
Step 1: Import the nltk package, and download all the necessary modules.
Step 2: In this step,
First, we return a sentence-tokenized copy of the text using
nltk.sent_tokenize(sentence) and iterate over it.
Next, we tokenize the sentence and find the parts of speech of each word; we’ll run
nltk.pos_tag(nltk.word_tokenize(sent)) individually to see the outputs:
Tokenizing words using nltk: tokenizes the sentences into the list of words.
POS tagging using nltk: identifies the part of speech of each word and returns an
array of tuples with the words and their parts of speech.
Chunking on POS: lastly, we perform a chunking operation that returns a nested
nltk.tree.Tree object, so that we can iterate or traverse the Tree object to get to the
named entities.
Finally, we get to the output if there is any entity label in the chunk:
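A minimal sketch of the steps above, using NLTK's pre-trained ne_chunk chunker (the sentence is illustrative):

import nltk
from nltk import word_tokenize, pos_tag, ne_chunk

text = "Sundar Pichai is the CEO of Google, which is headquartered in California."

for sent in nltk.sent_tokenize(text):
    # Tokenize, POS-tag, then chunk into a nested nltk.tree.Tree of named entities
    tree = ne_chunk(pos_tag(word_tokenize(sent)))
    for chunk in tree:
        # Named-entity subtrees carry a label such as PERSON, ORGANIZATION or GPE
        if hasattr(chunk, 'label'):
            entity = ' '.join(token for token, tag in chunk.leaves())
            print(chunk.label(), '->', entity)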
VECTORIZER
Vectorization, or word embedding, is jargon for a classic approach of converting input data
from its raw format (i.e. text) into vectors of real numbers, which is the format that
Machine Learning models support.
After text cleaning and normalization, the processed text is converted to feature vectors so
that we can feed it to machine learning applications.
In Machine Learning, vectorization is a step in feature extraction. The idea is to get some
distinct features out of the text for the model to train on, by converting text to numerical
vectors.
Terminologies
Document: A document is a single text data point. For Example, a review of a
particular product by the user.
Corpus: It is a collection of all the documents present in our dataset.
Feature: Every unique word in the corpus is considered as a feature.
For Example, let’s consider the 2 documents shown below:
D1: Dog hates a cat. It loves to go out and play.
D2: Cat loves to play with a ball.
Corpus = “Dog hates a cat. It loves to go out and play. Cat loves to play with a ball.”
Count Vectorizer
It is one of the simplest ways of doing text vectorization.
It creates a document term matrix, which is a set of dummy variables that indicates
if a particular word appears in the document.
Count vectorizer will fit and learn the word vocabulary and try to create a
document term matrix in which the individual cells denote the frequency of that
word in a particular document, which is also known as term frequency, and the
columns are dedicated to each word in the corpus.
Matrix Formulation
Consider a Corpus C containing D documents {d1,d2…..dD} from which we
extract N unique tokens.
Now, the dictionary consists of these N tokens, and the size of the Count Vector
matrix M formed is given by D X N.
Each row in the matrix M describes the frequency of tokens present in the
document Di.
Now, a column can also be understood as a word vector for the corresponding word in the
matrix M. For Example, for the above matrix formed, let’s see the word vectors
generated.
Vector for ‘smart’ is [2,1],
Vector for ‘Chirag’ is [0, 1], and so on.
Implementation:
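A minimal sketch with scikit-learn's CountVectorizer. The two documents are an assumed corpus chosen so that the resulting word vectors match the ones quoted above ('smart' → [2, 1] and 'Chirag' → [0, 1]):

from sklearn.feature_extraction.text import CountVectorizer

# Assumed illustrative corpus of D = 2 documents
corpus = [
    "He is a smart boy. She is also smart.",   # d1
    "Chirag is a smart person."                # d2
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)     # document-term matrix of size D x N

print(vectorizer.get_feature_names_out())  # the N unique tokens (features)
print(dtm.toarray())
# The column for 'smart' is [2, 1] and the column for 'chirag' is [0, 1]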
N-grams
Similar to the count vectorization technique, in the N-Gram method, a document
term matrix is generated and each cell represents the count.
The difference in the N-grams method is that the count represents the combination
of adjacent words of length n in the title. Count vectorization is N-Gram where
n=1.
For example, “I love this article” has four words, so n can range from 1 to 4:
if n=2, i.e. bigram, then the columns would be [“I love”, “love this”, “this article”]
if n=3, i.e. trigram, then the columns would be [“I love this”, “love this article”]
if n=4, i.e. four-gram, then the column would be [“I love this article”]
The n value is chosen based on performance.
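A minimal sketch: the same CountVectorizer builds the bigram columns listed above through its ngram_range parameter (the token_pattern is widened so the one-letter word "I" is kept):

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love this article"]

# ngram_range=(2, 2) keeps only bigrams; (1, 1) is plain count vectorization
vectorizer = CountVectorizer(ngram_range=(2, 2),
                             lowercase=False,
                             token_pattern=r"(?u)\b\w+\b")
dtm = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
# ['I love' 'love this' 'this article']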
TF-IDF VECTORIZATION
All the methods discussed in the previous sessions were based on the Bag of
Words model which is simple and works well.
But the problem with that is that it treats all words equally. As a result, it cannot
distinguish very common words from rare words.
So, to solve this problem, TF-IDF comes into the picture. TF-IDF is made up of
two terms: Term Frequency and Inverse Document Frequency
Term Frequency
Term frequency denotes the frequency of a word in a document. For a specified
word, it is defined as the ratio of the number of times a word appears in a
document to the total number of words in the document.
Or, it is also defined in the following manner:
It is the number of times a word (x) occurs in a particular document (y) divided by
the total number of words in that document.
For Example, Consider the following sentence ‘Cat loves to play with ball’
For the above sentence, the term frequency value for word cat will be:
tf(‘cat’) = 1/6
This number will always be ≤ 1;
thus, we can judge how frequent a word is relative to all of the words in the
document.
Inverse Document Frequency
Document frequency (DF) tells us about the proportion of documents that contain a
certain word. IDF is the reciprocal of the document frequency.
The intuition behind using IDF is that the more common a word is across all documents,
the lesser its importance is for the current document. A logarithm is taken to dampen the
effect of (normalize) IDF in the final calculation. The final TF-IDF score comes out to
be:
TF-IDF = TF * IDF, where IDF = log(N / df), N is the total number of documents and df
is the number of documents containing the word.
Implementation
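A minimal sketch with scikit-learn's TfidfVectorizer, reusing the two-document corpus from the Count Vectorizer terminology example (scikit-learn applies a smoothed variant of the IDF formula above):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Dog hates a cat. It loves to go out and play.",   # D1
    "Cat loves to play with a ball."                   # D2
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))
# Words appearing in both documents ('cat', 'loves', 'play', 'to') get lower
# weights than words unique to a single document ('dog', 'ball', 'hates', ...)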
NLP PIPELINE
The set of ordered stages one should go through from a labeled dataset to creating a
classifier that can be applied to new samples is called the NLP pipeline. An NLP pipeline is
the set of steps followed to build end-to-end NLP software.
1. Data Acquisition
In the data acquisition step, we collect the data required for building our NLP software.
We can collect the data using any of the following methods:
Survey – We can conduct a survey to collect data and then manually label it
Public Dataset – If a public dataset is available for our problem statement
Web Scraping – Scraping data using Beautiful Soup or other libraries
2. Text Preprocessing
Once the data collection step is done, we cannot use this data as is for model
building. We have to do text preprocessing.
It helps to remove unhelpful parts of the data, or noise, by converting all characters
to lowercase and removing stop words, punctuation marks, and typos from the data.
After text preprocessing, the accuracy of the model increases.
Advanced preprocessing – In this step we also do POS tagging, named entity recognition, etc.
3. Feature Engineering
After text cleaning and normalization, the processed text is converted to feature
vectors so that we can feed it to machine learning applications.
Feature Engineering means converting text data to numerical data.
But why is it required to convert text data to numerical data? Because many
Machine Learning algorithms and almost all Deep Learning Architectures are not
capable of processing strings or plain text in their raw form.
This step is also called Feature extraction from text.
4. Model Building
In the modeling step, we try to create a model based on the cleaned data. Here also, we
can use multiple approaches to build the model based on the problem statement.
Approaches to building model –
Heuristic Approach
Machine Learning Approach
Deep Learning Approach
5. Model Evaluation
In the model evaluation step, we can use different metrics for evaluation, such as Accuracy,
Recall, Confusion Matrix, Perplexity, etc.
6. Deployment
In the deployment step, we have to deploy our model on the cloud for the users.
Deployment has three stages: deployment, monitoring, and retraining or model update.