This is the repository for a capstone project advised by NYU Langone. Our main goal is to use NLP approaches to extract features related to DIET and PHYSICAL ACTIVITY from the real-time Twitter stream, and to understand their temporal and spatial variation across the US.
✅ Completed
- Data Preprocessing:
  - Converted the raw tweet data from JSON to CSV and extracted important features, including tweet length, emojis, hashtags, etc.
  - Generated the food list from USDA and the physical-activity list from harvard.edu
  - Labeled tweets as food/activity via keyword search
- Statistical Analysis of the tweets:
  - the most-used languages on Twitter inside the U.S.
  - the average tweet length for each state
  - frequency analysis of hashtags and emojis
  - from the keyword-search results, the percentage of tweets mentioning food or activity for each state
- Baseline Modeling:
  - Topic modeling: tried NMF and LDA models; tuned different combinations of LDA hyperparameters.
  - Random Forest: used the LDA transformation to extract each tweet's topic probability distribution as features, then built machine-learning classifiers on top of them.
- Optimization:
  - Tried different LDA hyperparameters to extract higher-quality features to feed into the classifiers.
  - Tuned the Random Forest hyperparameters to better identify food and activity tweets.
- Results:
  - Subsampled to generate confidence intervals for the percentage of tweets mentioning food or activity in each state.
  - Used embedding distances and false positives to evaluate model performance.
  - Generated graphs demonstrating the seasonal and state-wise differences in food/activity-related tweets.
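The baseline pipeline above (LDA topic distributions fed into a Random Forest) can be sketched roughly as follows. The tweets, labels, and hyperparameter values here are toy placeholders, not the project's actual data or settings:

```python
# Rough sketch of the baseline: LDA topic distributions as features
# for a Random Forest classifier. All data below is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier

tweets = [
    "just ate a great burger and fries downtown",
    "morning run along the river felt amazing",
    "pizza night with friends again",
    "hit the gym for an hour of lifting",
]
labels = [1, 0, 1, 0]  # 1 = mentions food, 0 = mentions activity (toy labels)

# Bag-of-words counts -> per-tweet topic probability distributions
counts = CountVectorizer(stop_words="english").fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)  # each row sums to 1

# Classifier trained on the topic distributions
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(topic_features, labels)
train_acc = clf.score(topic_features, labels)
```

On real data the LDA features come from millions of tokenized tweets, and the labels from the keyword search described above.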
/notebooks/: the directory for code demos
- Activity_Classifier_Random_Forest_newest.ipynb: uses a random forest to predict whether a tweet mentions an activity
- Activity_List_General.ipynb: uses web scraping to obtain a list of activities
- Activity_List_Specific.ipynb: builds a more specific list of activities
- Food_Nutrient_Report.ipynb: shows how to use the USDA API to access the food nutrient database and defines a food health-scale indicator
- NMF.ipynb: NMF model based on a small sample
- NMF_Model_2015Data.ipynb: runs the NMF model on 5 million tokenized tweets from 2015 to assess its performance
- RF_Baseline_for_Food.ipynb: uses a random forest to predict whether a tweet mentions food
- Raw_Data_Process_new_version.ipynb: shows how the data is processed from JSON format to CSV format
- USDA_foodlist_Basic.ipynb: some data visualization of food nutrients at different category levels
- key_word_match.ipynb: a demo of keyword search (Aho–Corasick algorithm) and some analysis of the results
- statics_analysis.ipynb: statistical analysis of tweet counts and lengths, languages, emojis, and hashtags for different states
- tokenization + LDA.ipynb: tokenizes the text data and runs the LDA model under different scenarios
- Run_Statistic.ipynb: applies the Kolmogorov-Smirnov test to each LDA model and visualizes the results
- usa_heatmap.ipynb: generates a USA heatmap showing seasonal and state-wise differences
- day_of_week_change.ipynb: generates plots showing the day-of-week change in the percentage of tweets mentioning food/activity for each state
- resample_confidence_interval.ipynb: subsampling to generate the confidence interval of the percentage of tweets mentioning food/activity for each state
- glove_distance_calc.ipynb: calculates word-embedding distances between tweets and the keyword list using GloVe
- Random_Forest_Embedding_Distance.ipynb: evaluates the random forest's performance using embedding distances
- key_word_search vs random_forest.ipynb: compares the results of the keyword search and the random forest model
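For reference, the Aho–Corasick keyword matching used for labeling tweets can be sketched as a minimal pure-Python automaton. An optimized library would normally be used instead, and the keyword lists here are made up:

```python
# Minimal pure-Python sketch of Aho-Corasick keyword matching:
# a trie over the keywords plus BFS-computed failure links, so a
# tweet is scanned once regardless of how many keywords there are.
from collections import deque

def build_automaton(keywords):
    """Build a trie with failure links over the keyword list."""
    children = [{}]      # one child map per trie node
    fail = [0]           # failure links
    out = [set()]        # keywords recognized at each node
    for kw in keywords:
        node = 0
        for ch in kw:
            if ch not in children[node]:
                children.append({})
                fail.append(0)
                out.append(set())
                children[node][ch] = len(children) - 1
            node = children[node][ch]
        out[node].add(kw)
    queue = deque(children[0].values())  # root's children keep fail = 0
    while queue:
        node = queue.popleft()
        for ch, child in children[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in children[f]:
                f = fail[f]
            fail[child] = children[f].get(ch, 0)
            out[child] |= out[fail[child]]  # inherit suffix matches
    return children, fail, out

def find_keywords(text, children, fail, out):
    """Return the set of keywords occurring anywhere in text."""
    node, hits = 0, set()
    for ch in text:
        while node and ch not in children[node]:
            node = fail[node]
        node = children[node].get(ch, 0)
        hits |= out[node]
    return hits

matches = find_keywords("morning yoga then pizza",
                        *build_automaton(["pizza", "yoga", "burger"]))
# matches == {"pizza", "yoga"}
```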
/python script/: the code run on HPC
- Run_LDA.py: runs and tunes LDA
- filter_en_text.py: removes all tweets that are not in English
- key_word_process.py: reads a CSV file of tweets and uses the Aho-Corasick algorithm to detect whether each tweet mentions food or activity keywords
- process_raw_data.py: converts the original Twitter data from JSON format to CSV format for further analysis
- token_by_spark.py: uses Spark for tokenization, stemming, and lemmatization
- token_by_whole_file.py: reads the entire CSV file (requires a lot of memory) and then does tokenization, stemming, and lemmatization
- token_row_by_row.py: reads a CSV file row by row and does tokenization, stemming, and lemmatization per row (requires less memory)
- ks_test.py: runs the Kolmogorov-Smirnov test and returns empirical p-values for food and activity separately for each LDA model
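The shape of the two-sample KS comparison in ks_test.py can be sketched as below. The samples are synthetic stand-ins, and scipy's analytic p-value is used here for brevity, whereas the real script derives its samples from the LDA models and reports empirical p-values:

```python
# Sketch of a two-sample Kolmogorov-Smirnov test: do two groups of
# per-tweet statistics come from the same distribution?
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
food_scores = rng.beta(2, 5, size=1000)      # stand-in for food-tweet statistics
activity_scores = rng.beta(5, 2, size=1000)  # stand-in for activity-tweet statistics

stat, p_value = ks_2samp(food_scores, activity_scores)
# A large KS statistic / tiny p-value indicates the distributions differ.
```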
/figures/: some visualization results related to the progress.
- Team Name: Burger King
- Team Members: Qintai Liu, Zhiming Guo, Xiaoxue Ma, Jin Han