Chapter 15 - MINING MEANING FROM TEXT


Mining Meaning from Text
Contents
• Parsing Text For Analysis

• Computing Descriptive Statistics For Text

• Term Frequency – Inverse Document Frequency (TF-IDF)

• RAKE

• Word Cloud

• Assigning Sentiment To Text Samples


Section I
Parsing Text for Analysis

• The unit of analysis is determined first in any study, and larger sections of text are then broken
down into individual tokens.

• The standard unit of text analysis is the token: an atomic unit that is relevant to the current study.

• The text is parsed into data frames, which can then be further evaluated with Machine Learning or
statistical methods.

• Generally, a text requires significant preprocessing to bring it to a form that a model can digest.
Example:

• We demonstrate a common procedure for preparing text, using tweets from the dataset.

• The tidytext R library is used to handle the text, which makes up the larger part of the data.

• We then focus on the text of the tweets by selecting that column from the wider data set.

• The output includes non-English terms and emoji, but we are only interested in specific
word tokens here.

• We use the tidytext library's unnest_tokens function, which breaks each sentence down into its
constituent words.
• After that, the anti_join function removes stop words from the data frame.

• We can then look at the 10 most frequently occurring words in the tweets by using the count
function.

• Quantiles can be used to summarize the distribution of these word frequencies.

• Some words, such as hashtags, should be removed from the data frames to preserve the linguistic
standards of the social media landscape.

• The unnest_tokens function is used in such cases.
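
The preparation steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the tidytext code the slides refer to: the sample tweets, the tiny stop-word list, and the helper name tokenize are all assumptions made up for the example. Hashtags are kept as single tokens here, which is also an assumption; filter them out if the analysis calls for it.

```python
import re
from collections import Counter

# Hypothetical sample tweets and a tiny stop-word list, for illustration only.
tweets = [
    "Loving the new phone, the camera is great! #launch",
    "The camera on this phone is not great",
    "New phone day! Great camera, great screen",
]
stop_words = {"the", "is", "on", "this", "not", "a", "an", "and"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (hashtags stay whole)."""
    return re.findall(r"#?[a-z0-9']+", text.lower())

# One row per (tweet index, token), similar in spirit to a tidy data frame.
rows = [(i, tok) for i, text in enumerate(tweets) for tok in tokenize(text)]

# Drop stop words (the role the anti-join plays in the steps above).
cleaned = [(i, tok) for i, tok in rows if tok not in stop_words]

# Count the remaining tokens and keep the most frequent ones.
counts = Counter(tok for _, tok in cleaned)
top_words = counts.most_common(3)
print(top_words)
```

Keeping the tweet index next to each token makes it possible to aggregate back to the tweet level later, for example when scoring sentiment per tweet.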


Graphical Plot

• To help you grasp it better, let's plot the relevant words graphically.

• We can now plot the top 20 words that appear in the cleaned tweets.

• The figure gives a clearer representation of the common words used by users in tweets.
Section II
Computing Descriptive Statistics for Text

• Some terms are specific to a particular context and do not recur outside it. Such words skew
raw frequencies.

• For example, a brand may have posted numerous tweets during a marketing campaign, and the
campaign's statements may have dominated the debate surrounding the company's tweets.

• Term Frequency – Inverse Document Frequency (TF-IDF) and Rapid Automatic Keyword
Extraction (RAKE) are two approaches for assessing key terms within a corpus.

• These methods are used in exploratory analysis of a text corpus.
Term Frequency – Inverse Document Frequency (TF-IDF)
When calculating the TF-IDF score, we weight each term inversely by the number of documents it
occurs in, so that terms found in many documents receive far less weight than they otherwise would.

We must remember that although frequent terms are significant, they are much less relevant
if they appear in practically every document.


We should use TF-IDF scores to evaluate the relative worth of keywords that are crucial to the
task at hand.

In this approach, we first compile a list of essential terms, then use TF-IDF scores to verify the
relative importance of these terms.

When using TF-IDF in an exploratory sense, it is best to set a threshold for how many times a term
must appear before it is regarded as a keyword.
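
A small worked sketch may make the weighting concrete. It assumes one common variant of the score, tf = (count of the term in the document) / (tokens in the document) and idf = ln(number of documents / number of documents containing the term); other variants exist. The documents, the function name tf_idf, and the min_count parameter (the minimum-appearance threshold mentioned above) are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenised documents (for example, cleaned tweets).
docs = [
    ["sale", "shoes", "sale", "today"],
    ["shoes", "comfort", "today"],
    ["today", "weather", "sunny"],
]

def tf_idf(docs, min_count=1):
    """Return one {term: score} dict per document.

    tf  = count of the term in the document / number of tokens in the document
    idf = ln(number of documents / number of documents containing the term)
    Terms seen fewer than min_count times in the whole corpus are skipped.
    """
    n_docs = len(docs)
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
            if corpus_counts[term] >= min_count
        })
    return scores

scores = tf_idf(docs)
print(scores[0])
```

In this toy corpus "today" appears in every document, so its idf is ln(1) = 0 and its score is zero everywhere, while "sale" is frequent in one document only and scores highest there.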
RAKE
The Rapid Automatic Keyword Extraction (RAKE) algorithm is a second approach for extracting
keywords from a corpus of text.

It creates expressions by separating the input text at stop words and phrase delimiters.

The words in these expressions are given a weighted score that penalises frequent appearances
in the corpus.

Each expression is then assigned the sum of the scores of the words that make it up.
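
The procedure can be sketched as follows. This is a simplified illustration rather than a full RAKE implementation: the stop-word list is a small made-up one, the function names are hypothetical, and the word score used here (word degree divided by word frequency) is an assumed choice among the scoring options usually described for RAKE.

```python
import re
from collections import defaultdict

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "are", "at", "for", "from", "in", "is", "of", "the", "to", "with"}

def candidate_phrases(text):
    """Split text at phrase delimiters and stop words into candidate phrases."""
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]+", text.lower()):
        current = []
        for word in re.findall(r"[a-z0-9']+", fragment):
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Score each candidate phrase as the sum of its word scores (degree / frequency)."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrences in the phrase, counting the word itself
    word_score = {w: degree[w] / freq[w] for w in freq}
    return {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}

text = "Keyword extraction is useful. Rapid automatic keyword extraction is a method for text."
scores = rake_scores(text)
for phrase, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(phrase, score)
```

Dividing degree by frequency is what penalises words that appear often on their own, while words that occur inside longer phrases keep a higher score.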
Where is My Word Cloud?
To create a word cloud, we use the wordcloud library.

We suggest that while a word cloud is aesthetically pleasing, encoding word frequencies only in
the size of the words obscures the relative importance of the words that make up the narrative.

We also lose precise estimates of the number of times these terms appear.


In conclusion, while word clouds are likely to make effective first plots, the
TF-IDF and RAKE algorithms give more powerful methods for keyword
extraction.
Section III
Assigning Sentiment to Text Samples
• Determining the sentiment of a text from the words that make it up is another common
task in text analysis.

• A pre-defined vocabulary of terms (a lexicon) is used in this process to map the words in a text
to the attitudes and emotions commonly associated with them.

• For emotion assignment, the tidytext library offers three built-in lexicons.
Steps Involved

To begin, remove any stop words and punctuation from the tweets.

The cleansed words are then mapped to a lexicon with sentiments pre-assigned to each term.

Finally, the sentiment of the tweet is calculated as the average of the sentiments of the words that
make up the tweet.

This sentiment classification is based on word lexicons and does not take into account the context
in which a word is used.
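
The steps above can be sketched as follows. The lexicon here is a toy one with made-up scores, not one of the tidytext lexicons, and the names LEXICON, STOP_WORDS, and tweet_sentiment are hypothetical.

```python
import re

# Hypothetical toy lexicon mapping words to numeric sentiment scores.
# Real lexicons are far larger; these values are made up for illustration.
LEXICON = {"great": 3, "love": 3, "good": 2, "bad": -2, "terrible": -3, "slow": -1}
STOP_WORDS = {"the", "is", "a", "and", "but", "this", "i"}

def tweet_sentiment(text):
    """Average the lexicon scores of the words found in the lexicon.

    Returns None when no word of the tweet appears in the lexicon.
    """
    # Step 1: strip punctuation by keeping only word characters, then drop stop words.
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    # Step 2: map the cleaned words to their pre-assigned sentiment scores.
    scored = [LEXICON[w] for w in words if w in LEXICON]
    # Step 3: the tweet's sentiment is the average of its word sentiments.
    return sum(scored) / len(scored) if scored else None

print(tweet_sentiment("I love this phone, the camera is great!"))
print(tweet_sentiment("The battery is not good"))
```

The second tweet still receives a positive score because "not" carries no score in the lexicon, which illustrates the lack of context noted above.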
Conclusion

• We have presented several methods to extract meaning from a corpus of text.

• The key words within the corpus can be extracted using the TF-IDF scores or the RAKE algorithm.

• We also present a method to assign sentiment and emotion to individual tweets that can then be
further analyzed for patterns in the variation associated with these tweets.
Thank You
