
Topic-Modelling Reuters news headlines to find topic clusters for the past 7 years using LDA and t-SNE


MiteshPuthran/Reuters_2012-17_News_Headline_Analysis


Reuters 2012-17 News Headline Analysis

  • This project analyzes 6 million Reuters news headlines published from 2012 to 2017, downloaded from Kaggle. This dataset can be found here.

  • The main idea is topic modelling: finding out how news headlines relate to one another and can be grouped into subgroups, using LDA (Latent Dirichlet Allocation) and then applying t-SNE.

Data Pre-processing

  • The first step was to clean the data and convert it into a usable form. The raw data had 'headlines' and 'publishtime' columns, and the publish time was stored as the date and time digits run together in a single number. It had to be separated and converted into date-time format before it could be used for further analysis.
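The conversion described above could look roughly like the following sketch. The exact layout of the raw number is not stated here, so the `YYYYMMDDHHMM` format, the `parse_publish_time` helper, and the sample rows are all assumptions for illustration.

```python
from datetime import datetime

def parse_publish_time(raw):
    """Parse a digits-only timestamp into a datetime.

    Assumes the layout YYYYMMDDHHMM; adjust the format string if the
    real column is laid out differently.
    """
    return datetime.strptime(str(raw), "%Y%m%d%H%M")

# Hypothetical rows: (publish time, headline)
rows = [
    (201201010930, "Example headline one"),
    ("201712312359", "Example headline two"),
]

parsed = [(parse_publish_time(t), h) for t, h in rows]
for ts, headline in parsed:
    print(ts.isoformat(), headline)
```

Converting through `str` lets the same helper accept the value whether it was loaded as an integer or as text.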





Finding the most common headline length and the most used words in headlines (other than articles like "a" and "the"), and creating a word cloud

  • Headlines were converted to lowercase and cleaned using regular expressions.
  • The most used words were found by calculating the frequency of each word across all the headlines from 2012 to 2017.
  • To calculate the length of headlines, the words of every headline in the dataset were counted. It turned out that the most common headline length between 2012 and 2017 was 9 words.
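The three steps above can be sketched with the standard library alone. The regular expression, the `STOP_WORDS` set, and the sample headlines are illustrative assumptions, not the ones used in the analysis.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the author's list)
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "on", "at", "for", "and"}

def clean(headline):
    """Lowercase and replace anything other than letters, digits and whitespace with a space."""
    return re.sub(r"[^a-z0-9\s]", " ", headline.lower())

def tokenize(headline):
    """Split a cleaned headline into words."""
    return clean(headline).split()

# Hypothetical headlines
headlines = [
    "Fitch Affirms the Outlook of a National Bank",
    "BRIEF-Company announces quarter results",
    "China shares close at a new high",
]

# Word frequency across all headlines, ignoring stop words
word_counts = Counter(
    w for h in headlines for w in tokenize(h) if w not in STOP_WORDS
)

# Distribution of headline lengths, measured in words
length_counts = Counter(len(tokenize(h)) for h in headlines)

print(word_counts.most_common(5))
print(length_counts.most_common(1))
```

On the real dataset, `length_counts.most_common(1)` is the kind of query that would surface the most common headline length.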



Word Cloud of most used word pairs in the published headlines
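A word cloud of word pairs needs pair frequencies as input. A minimal sketch of counting adjacent word pairs follows; the `word_pairs` helper and the token lists are hypothetical.

```python
from collections import Counter

def word_pairs(tokens):
    """Return the adjacent word pairs (bigrams) in a list of tokens."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical tokenized headlines
token_lists = [
    ["china", "shares", "close", "higher"],
    ["china", "shares", "fall"],
    ["quarter", "results", "beat", "estimates"],
]

# Count each pair as a single "word1 word2" string
pair_counts = Counter(
    " ".join(pair) for tokens in token_lists for pair in word_pairs(tokens)
)

print(pair_counts.most_common(3))
```

A frequency mapping of this shape is what a word cloud library would typically take to size each pair.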



Classifying the news headlines into different categories using LDA and applying t-SNE

  • Before classifying the headlines, they have to be vectorized.
  • As there were 6 million headlines, it made no sense to run LDA on the full dataset just to check whether it was working as expected. So before applying LDA to the whole dataset, 10000 random samples were selected.
  • LDA was applied to the samples to discover 20 different topics.



  • After applying LDA, t-SNE was applied to find a representation of the input in a low-dimensional space such that points that are similar in the original space are also similar in the representation space.
  • The dimensions also had to be reduced to 2 in order to visualize the classifications and to stay within hardware and time limits.
  • The topic classifications of the sample data were then plotted.
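A minimal sketch of the reduction to 2 dimensions, assuming scikit-learn's `TSNE` is applied to the per-headline topic distributions. The random `doc_topic` matrix below is a stand-in for real LDA output, and the `perplexity` value is an arbitrary choice for this small example.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for LDA output: 60 headlines x 20 topic probabilities
doc_topic = rng.dirichlet(alpha=np.full(20, 0.1), size=60)

# Reduce to 2 dimensions; perplexity must be smaller than the sample count
tsne = TSNE(n_components=2, perplexity=15, init="random", random_state=0)
embedding = tsne.fit_transform(doc_topic)

# The most probable topic per headline can be used to color the points
dominant_topic = doc_topic.argmax(axis=1)

print(embedding.shape)
print(dominant_topic[:5])
```

The two columns of `embedding` serve as x and y coordinates for a scatter plot, with `dominant_topic` as the color of each point.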



  • The headlines have been clustered into 20 different topics as follows:

    Topic 0: present ceo
    Topic 1: update fund
    Topic 2: brief reg
    Topic 3: brief announces
    Topic 4: new says
    Topic 5: preview dividend
    Topic 6: rate reg
    Topic 7: research markets
    Topic 8: china shares
    Topic 9: brief profit
    Topic 10: new program
    Topic 11: business technology
    Topic 12: stocks india
    Topic 13: bank national
    Topic 14: launches deal
    Topic 15: quarter results
    Topic 16: fitch outlook
    Topic 17: reg asset
    Topic 18: new high
    Topic 19: update announces

  • After getting the topics from the samples, it was time to process the whole dataset.

  • The process takes a long time to execute. Once it was done, the headlines were classified into the following topics:



  • You can see the heatmap of the different clusters that were discovered with LDA and t-SNE application.
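One way to read "classified into topics" is that each headline is assigned its most probable topic. That reading is an assumption, and the `dominant_topic` helper and the small distributions below are hypothetical.

```python
from collections import Counter

def dominant_topic(topic_probs):
    """Return the index of the largest probability in one headline's topic distribution."""
    return max(range(len(topic_probs)), key=topic_probs.__getitem__)

# Hypothetical per-headline topic distributions over 3 topics
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.6, 0.3, 0.1],
]

labels = [dominant_topic(p) for p in doc_topic]
topic_sizes = Counter(labels)  # number of headlines assigned to each topic

print(labels)
print(topic_sizes)
```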


