This is the repository for a capstone project advised by NYU Langone. Our main goal is to use NLP approaches to extract features related to DIET and PHYSICAL ACTIVITY from the real-time Twitter stream, and to understand their temporal and spatial variation across the US.
✅ Completed
- Data Preprocessing:
  - Converted the raw tweet data from JSON to CSV and extracted important features, including tweet length, emojis, hashtags, etc.
  - Generated the food list from USDA and the physical-activity list from harvard.edu
  - Labeled tweets as food/activity via keyword search
- Statistical Analysis of the tweets:
  - the most-used languages on Twitter inside the U.S.
  - the average tweet length for each state
  - frequency analysis of hashtags and emojis
  - from the keyword-search results, the percentage of tweets mentioning food or activity for each state
- Baseline Modeling:
  - Topic modeling: tried NMF and LDA models; tuned different combinations of LDA hyperparameters.
  - Random Forest: used the LDA transformation to extract each tweet's topic probability distribution as features, then built machine-learning classifiers on top of them.
- Optimization:
  - Tried different LDA hyperparameters to extract higher-quality features to feed into the classifiers.
  - Tuned the Random Forest hyperparameters to better identify food and activity tweets.
- Results:
  - Subsampled to generate confidence intervals for the percentage of tweets mentioning food or activity in each state.
  - Used embedding distances and false positives to evaluate model performance.
  - Generated graphs demonstrating the seasonal and state-wise differences in food/activity-related tweets.
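The baseline pipeline above (LDA topic distributions fed into a Random Forest) can be sketched roughly as follows. The tweets, labels, and hyperparameter values here are toy placeholders, not the project's actual data or settings:

```python
# Rough sketch of the baseline: LDA topic distributions as features
# for a Random Forest classifier. All data below is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

tweets = [
    "just ate a great burger and fries downtown",
    "morning run along the river felt amazing",
    "pizza night with friends again",
    "hit the gym for an hour of lifting",
]
labels = [1, 0, 1, 0]  # 1 = mentions food, 0 = mentions activity (toy labels)

# Bag-of-words counts -> per-tweet topic probability distributions
counts = CountVectorizer(stop_words="english").fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)  # each row sums to 1

# Classifier trained on the topic distributions
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(topic_features, labels)
train_acc = clf.score(topic_features, labels)
```

On real data the LDA features come from millions of tokenized tweets, and the labels from the keyword search described above.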
/notebooks/: the directory for code demos
- Activity_Classifier_Random_Forest_newest.ipynb: uses a random forest to predict whether a tweet mentions an activity
- Activity_List_General.ipynb: uses web scraping to obtain a list of activities
- Activity_List_Specific.ipynb: builds a more specific list of activities
- Food_Nutrient_Report.ipynb: shows how to use the USDA API to access the food nutrient database and defines a food health-scale indicator
- NMF.ipynb: NMF model based on a small sample
- NMF_Model_2015Data.ipynb: runs the NMF model on 5 million tokenized tweets from 2015 to assess its performance
- RF_Baseline_for_Food.ipynb: uses a random forest to predict whether a tweet mentions food
- Raw_Data_Process_new_version.ipynb: shows how the data is processed from JSON format to CSV format
- USDA_foodlist_Basic.ipynb: some data visualization of food nutrients at different category levels
- key_word_match.ipynb: a demo of keyword search (Aho–Corasick algorithm) and some analysis of the results
- statics_analysis.ipynb: statistical analysis of tweet counts and lengths, languages, emojis, and hashtags for different states
- tokenization + LDA.ipynb: tokenizes the text data and runs the LDA model under different scenarios
- Run_Statistic.ipynb: applies the Kolmogorov-Smirnov test to each LDA model and visualizes the results
- usa_heatmap.ipynb: generates a USA heatmap showing seasonal and state-wise differences
- day_of_week_change.ipynb: generates plots showing the day-of-week change in the percentage of tweets mentioning food/activity for each state
- resample_confidence_interval.ipynb: subsampling to generate the confidence interval of the percentage of tweets mentioning food/activity for each state
- glove_distance_calc.ipynb: calculates word-embedding distances between tweets and the keyword list using GloVe
- Random_Forest_Embedding_Distance.ipynb: evaluates the random forest's performance using embedding distances
- key_word_search vs random_forest.ipynb: compares the results of the keyword search and the random forest model
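For reference, the Aho–Corasick keyword matching used for labeling tweets can be sketched as a minimal pure-Python automaton. An optimized library would normally be used instead, and the keyword lists here are made up:

```python
# Minimal pure-Python sketch of Aho-Corasick keyword matching:
# a trie over the keywords plus BFS-computed failure links, so a
# tweet is scanned once regardless of how many keywords there are.
from collections import deque

def build_automaton(keywords):
    """Build a trie with failure links over the keyword list."""
    children = [{}]      # one child map per trie node
    fail = [0]           # failure links
    out = [set()]        # keywords recognized at each node
    for kw in keywords:
        node = 0
        for ch in kw:
            if ch not in children[node]:
                children.append({})
                fail.append(0)
                out.append(set())
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        out[node].add(kw)
    queue = deque(children[0].values())  # root's children keep fail = 0
    while queue:
        node = queue.popleft()
        for ch, child in children[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in children[f]:
                f = fail[f]
            fail[child] = children[f].get(ch, 0)
            out[child] |= out[fail[child]]  # inherit suffix matches
    return children, fail, out

def find_keywords(text, children, fail, out):
    """Return the set of keywords occurring anywhere in text."""
    node, hits = 0, set()
    for ch in text:
        while node and ch not in children[node]:
            node = fail[node]
        node = children[node].get(ch, 0)
        hits |= out[node]
    return hits

matches = find_keywords("morning yoga then pizza",
                        *build_automaton(["pizza", "yoga", "burger"]))
# matches == {"pizza", "yoga"}
```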
/python script/: the code run on HPC
- Run_LDA.py: runs and tunes LDA
- filter_en_text.py: removes all tweets that are not in English
- key_word_process.py: reads a CSV file of tweets and uses the Aho-Corasick algorithm to detect whether each tweet mentions food or activity keywords
- process_raw_data.py: converts the original Twitter data from JSON format to CSV format for further analysis
- token_by_spark.py: uses Spark for tokenization, stemming, and lemmatization
- token_by_whole_file.py: reads the entire CSV file (requires a lot of memory) and then does tokenization, stemming, and lemmatization
- token_row_by_row.py: reads a CSV file row by row and does tokenization, stemming, and lemmatization per row (requires less memory)
- ks_test.py: runs the Kolmogorov-Smirnov test and returns empirical p-values for food and activity separately for each LDA model
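The shape of the two-sample KS comparison in ks_test.py can be sketched as below. The samples are synthetic stand-ins, and scipy's analytic p-value is used here for brevity, whereas the real script derives its samples from the LDA models and reports empirical p-values:

```python
# Sketch of a two-sample Kolmogorov-Smirnov test: do two groups of
# per-tweet statistics come from the same distribution?
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
food_scores = rng.beta(2, 5, size=1000)      # stand-in for food-tweet statistics
activity_scores = rng.beta(5, 2, size=1000)  # stand-in for activity-tweet statistics

stat, p_value = ks_2samp(food_scores, activity_scores)
# A large KS statistic / tiny p-value indicates the distributions differ.
```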
/figures/: some visualization results related to the progress.
- Team Name: Burger King
- Team Members: Qintai Liu, Zhiming Guo, Xiaoxue Ma, Jin Han