Chapter 15 - MINING MEANING FROM TEXT
Contents
• Parsing Text for Analysis
• Computing Descriptive Statistics for Text
• TF-IDF
• RAKE
• Word Cloud
• Assigning Sentiment and Emotion
Parsing Text for Analysis
• The unit of analysis is determined at the outset of any study, and larger sections of text are then processed into individual tokens.
• The standard unit of text analysis is the token: an atomic unit of text that is relevant to the current study.
• The text is parsed into data frames, which can then be further evaluated with machine learning or statistical methods.
• Generally, text requires significant preprocessing to bring it to a form that a model can digest.
Example:
• We demonstrate a common text-preparation procedure on tweets from the dataset.
• The tidytext R library is used to handle the text, which makes up the bulk of the data.
• We then focus on the text of the tweets by selecting this column from the wider data set.
• The output includes non-English terms and emoji, but we are only interested in specific word tokens here.
• We use the tidytext library's unnest_tokens function, which breaks each sentence down into its constituent words.
• After that, the anti_join function removes stop words from the data frame.
• We can inspect the ten most frequently occurring words in the tweets using the count function.
• Quantiles can be used to examine the distribution of these popular terms.
• Some tokens, such as hashtags, are artifacts of the social media landscape rather than standard language, and should be removed from the data frame.
• To make the result easier to grasp, we plot the relevant words graphically.
• We can now plot the top 20 words that appear in the cleaned tweets; the resulting figure is a clearer representation of the common words users employ, as shown in the sketch after this list.
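A minimal sketch of this pipeline, assuming a data frame named tweets_df with a character column text (both names are hypothetical). The "tweets" tokenizer, available in older tidytext releases, preserves hashtag markers so they can be filtered explicitly; the default tokenizer strips them.

    library(dplyr)
    library(tidytext)
    library(ggplot2)

    word_counts <- tweets_df %>%
      select(text) %>%                                  # keep only the tweet text
      unnest_tokens(word, text, token = "tweets") %>%   # one token per row, "#" preserved
      anti_join(stop_words, by = "word") %>%            # remove stop words
      filter(!grepl("^#", word)) %>%                    # drop hashtag tokens
      count(word, sort = TRUE)                          # frequency of each word

    head(word_counts, 10)      # the ten most frequent words
    quantile(word_counts$n)    # distribution of the frequencies

    word_counts %>%            # plot the top 20 words in the cleaned tweets
      slice_max(n, n = 20) %>%
      ggplot(aes(reorder(word, n), n)) +
      geom_col() +
      coord_flip() +
      labs(x = NULL, y = "Frequency")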
Section II: Computing Descriptive Statistics for Text
• Some terms are specific to a given context and are unlikely to recur outside it; such words skew raw frequencies.
• For example, a brand may have posted numerous tweets during a marketing campaign, and the campaign's statements may have dominated the conversation around the company's tweets.
• Term Frequency-Inverse Document Frequency (TF-IDF) and Rapid Automatic Keyword Extraction (RAKE) are two approaches for assessing key terms within a corpus.
• Both methods are used in exploratory analysis of the text of a corpus.
Term Frequency – Inverse Document Frequency (TF-IDF)
When calculating the TF-IDF score, we inversely weight terms by how many documents they occur in, so that terms appearing across many documents receive far less weight than their raw frequency alone would give them.
We must remember that, though frequent terms are significant, they are far less relevant when they appear in most documents of the corpus.
In this approach, we first compile a list of candidate terms, then use TF-IDF scores to verify the relative importance of these terms.
When using TF-IDF in an exploratory sense, it is best to set a minimum number of occurrences a term must reach before it is regarded as a keyword. A sketch follows.
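A sketch using tidytext's bind_tf_idf, assuming a data frame tokens_df with one row per (doc_id, word) pair; the frame and column names are hypothetical, and the occurrence threshold is illustrative.

    library(dplyr)
    library(tidytext)

    # tf_idf = tf(term, doc) * log(n_docs / n_docs containing the term)
    keyword_scores <- tokens_df %>%
      count(doc_id, word) %>%               # term frequency per document
      bind_tf_idf(word, doc_id, n) %>%      # adds tf, idf, and tf_idf columns
      filter(n >= 5) %>%                    # minimum occurrences before a term counts
      arrange(desc(tf_idf))

    head(keyword_scores, 10)                # candidate key terms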
RAKE
The Rapid Automatic Keyword Extraction (RAKE) algorithm is a second approach for extracting
keywords from a corpus of text.
It creates expressions by separating the input text at stop words and phrase delimiters.
The words in these expressions are given a weighted score that penalizes frequent appearance across the corpus.
Each expression is then assigned the sum of the scores of the words that make it up.
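The algorithm is small enough to sketch directly from this description. The following is a from-scratch illustration, not a reference implementation, assuming a character vector docs of document texts (a hypothetical name); the udpipe package also ships a packaged keywords_rake() for annotated text.

    library(tidytext)   # for the stop_words data frame

    rake_keywords <- function(docs) {
      text <- tolower(paste(docs, collapse = " . "))
      # split the text at phrase delimiters (punctuation)
      chunks <- unlist(strsplit(text, "[[:punct:]]+"))
      phrases <- list()
      for (chunk in chunks) {
        words <- unlist(strsplit(trimws(chunk), "\\s+"))
        words <- words[nzchar(words)]
        if (length(words) == 0) next
        is_stop <- words %in% stop_words$word
        # runs of consecutive non-stop words become candidate expressions
        runs <- split(words[!is_stop], cumsum(is_stop)[!is_stop])
        phrases <- c(phrases, unname(runs))
      }
      all_words <- unlist(phrases)
      freq <- table(all_words)                          # word frequency
      deg  <- tapply(rep(lengths(phrases), lengths(phrases)),
                     all_words, sum)                    # word degree (co-occurrence)
      word_score <- deg[names(freq)] / as.numeric(freq) # degree/frequency penalizes
                                                        # frequent words
      score <- vapply(phrases, function(p) sum(word_score[p]), numeric(1))
      label <- vapply(phrases, paste, character(1), collapse = " ")
      sort(tapply(score, label, max), decreasing = TRUE) # phrase score = sum of word scores
    }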
Where is My Word Cloud?
To create a word cloud, we use the wordcloud library, as in the sketch below.
We note that while a word cloud is aesthetically pleasing, encoding word frequencies as font sizes loses the sense of the relative relevance of the words that make up the narrative.
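A sketch with the wordcloud package, reusing the word_counts data frame built in the parsing section (a hypothetical name):

    library(wordcloud)

    wordcloud(words = word_counts$word,
              freq  = word_counts$n,
              max.words    = 100,     # cap how many words are drawn
              random.order = FALSE)   # place the most frequent words at the center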
Assigning Sentiment and Emotion
• This process employs a pre-defined vocabulary of terms that maps a text's constituent words to the attitudes and emotions commonly associated with them.
• For sentiment and emotion assignment, the tidytext library offers three built-in lexicons (AFINN, Bing, and NRC).
Steps Involved
To begin, remove any stop words and punctuation from the tweets.
The cleaned words are then mapped to a lexicon with a sentiment pre-assigned to each term.
Finally, the sentiment of the tweet is calculated as the average of the sentiments of the words that make up the tweet, as in the sketch below.
Note that this sentiment classification is based on word lexicons alone and does not take into account the context in which a word was used.
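A sketch of these steps using the AFINN lexicon, whose entries carry numeric values that can be averaged (fetching it may require the textdata package). A data frame tokens_df with one row per (tweet_id, word) is the assumed, hypothetical input.

    library(dplyr)
    library(tidytext)

    afinn <- get_sentiments("afinn")          # columns: word, value

    tweet_sentiment <- tokens_df %>%
      inner_join(afinn, by = "word") %>%      # keep only words found in the lexicon
      group_by(tweet_id) %>%
      summarise(sentiment = mean(value))      # average word sentiment per tweet

    head(tweet_sentiment)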
Conclusion
• Keywords within the corpus can be extracted using TF-IDF scores or the RAKE algorithm.
• We also present a method to assign sentiment and emotion to individual tweets, which can then be further analyzed for patterns in how sentiment varies across tweets.
Thank You