This is the public code repository for the following article:
```
@inproceedings{joseph_exploring_2016,
  title = {Exploring patterns of identity usage in tweets: a new problem, solution and case study},
  shorttitle = {Exploring patterns of identity usage in tweets},
  url = {http://dl.acm.org/citation.cfm?id=2883027},
  urldate = {2016-06-22},
  booktitle = {Proceedings of the 25th {International} {Conference} on {World} {Wide} {Web}},
  publisher = {International World Wide Web Conferences Steering Committee},
  author = {Joseph, Kenneth and Wei, Wei and Carley, Kathleen M.},
  year = {2016},
  pages = {401--412}
}
```
The repository contains the code and data necessary to reproduce the results in the paper. If you use this code, please cite the article!
Also note that the tools described here are now part of the `twitter_dm` library, which may be easier to use.
To run the code here, you will need:

- Apache Spark & pyspark (for the case study)
- The `twitter_dm` library, which you can find at https://github.com/kennyjoseph/twitter_dm and then install on your machine
Run `install.sh` from this directory to pull down the GloVe and Brown Cluster data and to install a few Python libraries. Note that it will install all NLTK data files and pip install numpy, scikit-learn, and pandas.
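If you would rather set up the Python dependencies by hand, the following is a rough equivalent of the steps described above; it is an approximation of what `install.sh` does, not a transcript of the script itself:

```python
# Rough, hand-rolled equivalent of the Python-side setup that install.sh
# performs (an approximation, not the script itself). First, from a shell:
#
#   pip install numpy scikit-learn pandas nltk
#
# Then fetch the NLTK data files; install.sh installs all of them.
import nltk

nltk.download("all")
```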
The `python` directory contains almost all of the code written for the model, save for the generation of the Penn Treebank POS tags and the rule-based features. The files are ordered numerically based on the order in which they were run to generate results (e.g., `1_*.py` was run first, `2_*.py` second, etc.). The `r` directory is an R project (for RStudio) that contains the single R file we used to generate the plots in the article. Finally, the `java` folder holds a Gradle project with the rule-based code used to generate the bootstrapped dictionary; we also used this code to generate the PTB POS tags and the rule-based model features.
While you're welcome to explore the full code base and re-run everything, we have provided processed data that lets you simply replicate the reported results:
- `10_run_baselines.py` will replicate results from the baseline models
- `11_run_param_tuning.py` will rerun the parameter tuning
- `12_run_final_model_on_pub_data.py` will run the final model on the validation set
- `13_run_model_on_case_study_data.py` will run the model on the case study data. This code can be modified to run the model on a directory of gzipped (or uncompressed) .json files, where each file contains the tweet stream for a particular user (see the sketch after this list)
- `14_case_study_analysis.ipynb` will replicate all results in our case study. Note that you need Spark to do so, and that it will take a while on a small machine; we ran it on a machine with 64 virtual cores and 512 GB of RAM
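For reference, below is a minimal sketch of the per-user input layout that `13_run_model_on_case_study_data.py` can be pointed at. It assumes one JSON tweet per line in each (optionally gzipped) .json file; the function and directory names are illustrative only and are not part of the repository:

```python
# Sketch of reading a directory of per-user tweet files, one .json or .json.gz
# file per user, with one JSON-encoded tweet per line (an assumption -- adapt
# to however your files are actually structured).
import gzip
import json
import os

def iter_user_tweets(tweet_dir):
    """Yield (filename, tweet_dict) pairs for every tweet in every user file."""
    for fname in os.listdir(tweet_dir):
        if not (fname.endswith(".json") or fname.endswith(".json.gz")):
            continue
        path = os.path.join(tweet_dir, fname)
        opener = gzip.open if fname.endswith(".gz") else open
        with opener(path, "rt") as f:
            for line in f:
                line = line.strip()
                if line:
                    yield fname, json.loads(line)

# Hypothetical usage:
# for user_file, tweet in iter_user_tweets("case_study_tweets/"):
#     print(user_file, tweet.get("id_str"))
```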
The one thing you will not be able to explore is the raw JSON for any of the tweets we used, as Twitter's terms of service prohibit us from releasing this data. The `pickled_random_for_labeling` directory is therefore not included.