update readme
JingqingZ committed Mar 10, 2019
1 parent 128abcb commit ff56a7b
Showing 3 changed files with 50 additions and 55 deletions.
19 changes: 13 additions & 6 deletions README.md
@@ -51,19 +51,26 @@ In order to run the code, please check the following issues.
- [ConceptNet v5.6.0](https://github.com/commonsense/conceptnet5/wiki/Downloads)
- [DBpedia ontology dataset](https://github.com/zhangxiangxiao/Crepe)
- [20 Newsgroups original 19997 docs](http://qwone.com/~jason/20Newsgroups/)
- [x] Check [config.py](src_reject/config.py) and update the locations of data files accordingly. The [config.py](src_reject/config.py) also defines the locations of intermediate files.
- [x] Check [config.py] and update the locations of data files accordingly. The [config.py] also defines the locations of intermediate files.
- [x] The intermediate files already provided in this repo
- [classLabelsDBpedia.csv](data/zhang15/dbpedia_csv/classLabelsDBpedia.csv): A summary of classes in DBpedia and linked nodes in ConceptNet.
- [classLabels20news.csv](data/20-newsgroups/clean/classLabels20news.csv): A summary of classes in 20news and linked nodes in ConceptNet.
- Selection of seen/unseen classes in DBpedia with unseen rate [0.25](data/zhang15/dbpedia_csv/dbpedia_random_group_0.25.txt) and [0.5](data/zhang15/dbpedia_csv/dbpedia_random_group_0.5.txt).
- Selection of seen/unseen classes in 20news with unseen rate [0.25](data/20-newsgroups/clean/20news_random_group_0.25.txt) and [0.5](data/20-newsgroups/clean/20news_random_group_0.5.txt).
- Note: seen/unseen classes are randomly selected 10 times. You may randomly generate another 10 groups of seen/unseen classes.
- Random selection of seen/unseen classes in DBpedia with unseen rate [0.25](data/zhang15/dbpedia_csv/dbpedia_random_group_0.25.txt) and [0.5](data/zhang15/dbpedia_csv/dbpedia_random_group_0.5.txt).
- Random selection of seen/unseen classes in 20news with unseen rate [0.25](data/20-newsgroups/clean/20news_random_group_0.25.txt) and [0.5](data/20-newsgroups/clean/20news_random_group_0.5.txt).
- Note: seen/unseen classes were randomly selected 10 times. You may randomly generate another 10 groups of seen/unseen classes.
- [x] The intermediate files need to be manually generated
- run `combine_zhang15_dbpedia_train_test()` in [playground.py](src_reject/playground.py): the generated `full.csv` is used to create vocabulary for DBpedia.
- run `combine_20news_train_test()` in [playground.py](src_reject/playground.py): the generated `full.csv` is used to create vocabulary for 20news.
- Appropriate preprocessing is recommended. For example, the vocabulary is limited to the 20K most frequent words and all numbers are excluded.
- Run `combine_zhang15_dbpedia_train_test()` in [playground.py]:
- The generated `full.csv` is used to create vocabulary for DBpedia later.
- Run `doing_sth_on_20_news()` in [playground.py]:
- This function automatically collects the 20news data and randomly splits it into a training set `train.csv` (70%) and a testing set `test.csv` (30%).
- In addition, `full.csv` is also generated and used later to create the vocabulary for 20news.
- Note that the variable `home_dir` in this function should point to the home directory of the uncompressed 20news data, which contains a collection of folders named after class labels.
- [x] Other intermediate files should be generated automatically when they are needed.
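The collect-and-split step described above can be sketched as follows. This is a minimal illustration under assumptions of mine (the `collect_and_split` helper, the `(label, text)` column layout, and the fixed seed are all illustrative), not the actual `doing_sth_on_20_news()` implementation in [playground.py]:

```python
import csv
import random
from pathlib import Path

def collect_and_split(home_dir, train_csv, test_csv, full_csv,
                      test_ratio=0.3, seed=42):
    """Walk class-named folders under home_dir and write a train/test
    split plus a combined full.csv of (label, text) rows."""
    rows = []
    for class_dir in sorted(Path(home_dir).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files next to the class folders
        for doc in sorted(class_dir.iterdir()):
            rows.append((class_dir.name, doc.read_text(errors="ignore")))
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_test = int(len(rows) * test_ratio)
    for path, subset in ((test_csv, rows[:n_test]),
                         (train_csv, rows[n_test:]),
                         (full_csv, rows)):
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(subset)
```

Shuffling before slicing keeps class labels roughly balanced across the two splits, while `full.csv` retains every document for vocabulary building.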

[TensorLayer]: https://github.com/tensorlayer/tensorlayer
[config.py]: src_reject/config.py
[playground.py]: src_reject/playground.py
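The vocabulary preprocessing mentioned above (keep the 20K most frequent words, exclude all numbers) could be sketched like this. It is an illustrative reading of `full.csv`, not the repository's actual code; in particular, the assumption that the document text sits in the last CSV column is mine:

```python
import csv
import re
from collections import Counter

def build_vocab(full_csv, max_size=20000):
    """Count word frequencies over all documents in full.csv and keep
    the max_size most frequent tokens, excluding pure numbers."""
    counter = Counter()
    with open(full_csv, newline="") as f:
        for row in csv.reader(f):
            text = row[-1]  # assume the last column holds the document text
            # letters-only tokens, so digits and numbers are dropped entirely
            counter.update(re.findall(r"[a-z']+", text.lower()))
    # token -> integer id, most frequent first
    return {tok: i for i, (tok, _) in enumerate(counter.most_common(max_size))}
```

Capping the vocabulary at the most frequent tokens is a standard way to bound the embedding matrix; everything outside the cap is typically mapped to a single unknown-word id.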

### How to train / test Phase 1

4 changes: 3 additions & 1 deletion src_reject/config.py
@@ -112,6 +112,7 @@

conceptnet_path = "../wordEmbeddings/conceptnet-assertions-en-5.6.0.csv"

# TODO by Peter: how to get these rejector files
if dataset == "dbpedia" and unseen_rate == 0.25:
rejector_file = "./dbpedia_unseen0.25_augmented12000.pickle"
elif dataset == "dbpedia" and unseen_rate == 0.5:
@@ -170,6 +171,7 @@
zhang15_dbpedia_train_path = zhang15_dbpedia_dir + "train.csv"
zhang15_dbpedia_train_processed_path = zhang15_dbpedia_dir + "processed_train_text.pkl"

# TODO by Peter: how to get augmented data
zhang15_dbpedia_train_aug_path = zhang15_dbpedia_dir + "train_augmented_aggregated.csv"
zhang15_dbpedia_train_aug_processed_path = zhang15_dbpedia_dir + "processed_train_aug_text.pkl"

@@ -188,9 +190,9 @@
# zhang15_dbpedia_kg_vector_train_processed_path = zhang15_dbpedia_dir + "kg_vector_train_processed.pkl"
# zhang15_dbpedia_kg_vector_test_processed_path = zhang15_dbpedia_dir + "kg_vector_test_processed.pkl"

# TODO: change filename of kg_vector
# zhang15_dbpedia_kg_vector_dir = zhang15_dbpedia_dir + "KG_VECTOR_3/"
# zhang15_dbpedia_kg_vector_prefix = "KG_VECTORS_3_"
# TODO by Peter, how to get KG_Vector files
zhang15_dbpedia_kg_vector_dir = zhang15_dbpedia_dir + "KG_VECTOR_CLUSTER_3GROUP/"
# zhang15_dbpedia_kg_vector_dir = zhang15_dbpedia_dir + "KG_VECTOR_CLUSTER_ALLGROUP/"
zhang15_dbpedia_kg_vector_prefix = "VECTORS_CLUSTER_3_"