Merge pull request #1 from JingqingZ/piyawat

Piyawat
JingqingZ · Mar 23, 2019 · fd0d8a7
2 parents 956d935 + a745756
commit fd0d8a7

Showing 9 changed files with 2,296 additions and 51 deletions.
diff --git a/README.md b/README.md
@@ -48,7 +48,8 @@ In order to run the code, please check the following issues.
     - Numpy 1.14.5
     - Pandas 0.21.0
     - NLTK 3.2.5
-- [x] Download original dataset
+    - tqdm 2.2.3
+- [x] Download original datasets
     - [GloVe.6B.200d](https://nlp.stanford.edu/projects/glove/)
     - [ConceptNet v5.6.0](https://github.com/commonsense/conceptnet5/wiki/Downloads)
     - [DBpedia ontology dataset](https://github.com/zhangxiangxiao/Crepe)
@@ -74,9 +75,69 @@ In order to run the code, please check the following issues.
 [config.py]: src_reject/config.py
 [playground.py]: src_reject/playground.py
 
+### How to perform data augmentation
+
+An example:
+```bash
+python3 topic_translation.py \
+        --data dbpedia \
+        --nott 100
+```
+
+The arguments of the command represent
+* `data`: Dataset, either `dbpedia` or `20news`.
+* `nott`: Number of original texts to translate into every class other than their own. If `nott` is omitted, all texts in the training dataset are translated.
+
+The location of the result file is specified by `config.zhang15_dbpedia_train_augmented_aggregated_path` or `config.news20_train_augmented_aggregated_path`, depending on the dataset.
+
+
+### How to perform feature augmentation / create v_{w,c}
+
+An example:
+```bash
+python3 kg_vector_generation.py --data dbpedia 
+```
+The argument of the command represents
+* `data`: Dataset, either `dbpedia` or `20news`.
+
+The locations of the result files are specified by `config.zhang15_dbpedia_kg_vector_dir` or `config.news20_kg_vector_dir`, depending on the dataset.
+
 ### How to train / test Phase 1
 
-Pending
+- Without data augmentation: an example
+```bash
+python3 train_reject.py \
+        --data dbpedia \
+        --unseen 0.5 \
+        --model vw \
+        --nepoch 3 \
+        --rgidx 1 \
+        --train 1
+```
+
+- With data augmentation: an example
+```bash
+python3 train_reject_augmented.py \
+        --data dbpedia \
+        --unseen 0.5 \
+        --model vw \
+        --nepoch 3 \
+        --rgidx 1 \
+        --naug 100 \
+        --train 1
+```
+
+The arguments of the command represent
+* `data`: Dataset, either `dbpedia` or `20news`.
+* `unseen`: Rate of unseen classes, either `0.25` or `0.5`.
+* `model`: The model to be trained. This argument can only be
+    * `vw`: the inputs are embeddings of words (from text)
+* `nepoch`: The number of epochs for training
+* `train`: In Phase 1, this argument does not affect the program. The program will run training and testing together.
+* `rgidx`: Optional. Index of the random group to start from, by default `1`; e.g. `5` starts training from the 5th random group. Use it to resume after the program has been interrupted.
+* `naug`: The number of augmented samples per unseen class
+
+The location of the result file (pickle) is specified by `config.rejector_file`. The pickle file holds a list of 10 sublists, one per iteration. Each sublist contains the prediction for every test case (1 = predicted as seen, 0 = predicted as unseen).
 
 ### How to train / test the traditional classifier in Phase 2
 
@@ -96,7 +157,7 @@ The arguments of the command represent
 * `model`: The model to be trained. This argument can only be
     * `vw`: the inputs are embeddings of words (from text)
 * `sepoch`: The number of times to repeat training within each epoch. The ratio of positive/negative samples and the learning rate stay constant within an epoch no matter how many times it is repeated.
-* `train`: In Phase 1, this argument does not affect the program. The program will run training and testing together.
+* `train`: For the traditional classifier, this argument does not affect the program. The program will run training and testing together.
 * `rgidx`: Optional. Index of the random group to start from, by default `1`; e.g. `5` starts training from the 5th random group. Use it to resume after the program has been interrupted.
 * `gpu`: Optional, GPU occupation percentage, by default `1.0`, which means full occupation of available GPUs.
 * `baseepoch`: Optional, you may want to specify which epoch to test.
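
Note on the Phase 1 result file: the README text added above describes the pickle at `config.rejector_file` as a list of sublists of 0/1 predictions. A minimal sketch of reading such a file follows, assuming exactly that layout. The helper name `seen_rate_per_iteration` and the toy data (2 iterations instead of 10) are illustrative assumptions, not part of the repository.

```python
import os
import pickle
import tempfile


def seen_rate_per_iteration(path):
    """Return, per iteration, the fraction of test cases predicted as seen (1)."""
    with open(path, "rb") as f:
        results = pickle.load(f)  # assumed layout: list of sublists of 0/1 values
    return [sum(preds) / len(preds) if preds else 0.0 for preds in results]


# Toy stand-in for the real rejector file; values are made up for illustration.
toy_results = [[1, 0, 1, 1], [0, 0, 1, 1]]

with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_rejector.pickle")
    with open(toy_path, "wb") as f:
        pickle.dump(toy_results, f)
    rates = seen_rate_per_iteration(toy_path)

print(rates)  # [0.75, 0.5]
```

With the real file, `toy_path` would be replaced by the path that `config.rejector_file` resolves to for the chosen `--data`, `--unseen`, and `--naug` values.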

diff --git a/data/20-newsgroups/clean/classLabels20news.csv b/data/20-newsgroups/clean/classLabels20news.csv
@@ -1,21 +1,21 @@
-ClassCode,ClassLabel,ConceptNet,Count,ClassDescription,Hierarchy
-1,alt.atheism,atheism,799,the belief or theory that God does not exist,alt
-2,comp.graphics,graphics,973,pictures produced by computers,computer
-3,comp.os.ms-windows.misc,operating system,985,the software that tells the parts of a computer how to work together and what to do,computer;os;ms;windows
-4,comp.sys.ibm.pc.hardware,ibm,982,ibm personal computer equipments,computer;system;pc;hardware
-5,comp.sys.mac.hardware,mac,961,mac computer equipment,computer;system;hardware
-6,comp.windows.x,windows,980,windows x,computer;x;
-7,misc.forsale,sale,972,the process of selling goods or services for money,
-8,rec.autos,auto,990,relating to cars,recreation
-9,rec.motorcycles,motorcycle,994,a road vehicle that has two wheels and an engine and looks like a large heavy bicycle,recreation
-10,rec.sport.baseball,baseball,994,a game played by two teams of nine players who get points by hitting a ball with a bat and then running around four bases,recreation;sport
-11,rec.sport.hockey,hockey,999,a game played on grass by two teams of 11 players who try to score goals by hitting a ball with a curved stick called a hockey stick,recreation;sport
-12,sci.crypt,crypt,991,the use of codes to put information on a website into a form that can only be read by users with permission,science
-13,sci.electronics,electronics,981,using electricity and extremely small electrical parts such as microchips and transistors,science
-14,sci.med,medical,990,relating to medicine and the treatment of injuries and diseases,science
-15,sci.space,space,987,the whole of the universe outside the Earth’s atmosphere,science
-16,soc.religion.christian,christian,997,the religion based on the teachings of Jesus Christ. Its followers worship in a church.,social
-17,talk.politics.guns,gun,910,"a weapon that shoots bullets, for example a pistol or a rifle. You load a gun with ammunition and pull the trigger to use it",talk;politics
-18,talk.politics.mideast,mideast,940,"the region of the world that consists of the countries east of the Mediterranean Sea and west of India. It includes Egypt, Jordan, Israel, Lebanon, Syria, Turkey, Iran, and Iraq.",talk;politics
-19,talk.politics.misc,politics,775,the ideas and activities involved in getting power in a country or over a particular area of the world,talk
-20,talk.religion.misc,religion,628,the belief in the existence of a god or gods,talk
+ClassCode,ClassLabel,ConceptNet,Count,ClassDescription,Hierarchy,ClassWord
+1,alt.atheism,atheism,799,the belief or theory that God does not exist,alt,atheism
+2,comp.graphics,graphics,973,pictures produced by computers,computer,graphics
+3,comp.os.ms-windows.misc,operating system,985,the software that tells the parts of a computer how to work together and what to do,computer;os;ms;windows,os
+4,comp.sys.ibm.pc.hardware,ibm,982,ibm personal computer equipments,computer;system;pc;hardware,ibm
+5,comp.sys.mac.hardware,mac,961,mac computer equipment,computer;system;hardware,mac
+6,comp.windows.x,windows,980,windows x,computer;x;,windows
+7,misc.forsale,sale,972,the process of selling goods or services for money,,sale
+8,rec.autos,auto,990,relating to cars,recreation,auto
+9,rec.motorcycles,motorcycle,994,a road vehicle that has two wheels and an engine and looks like a large heavy bicycle,recreation,motorcycle
+10,rec.sport.baseball,baseball,994,a game played by two teams of nine players who get points by hitting a ball with a bat and then running around four bases,recreation;sport,baseball
+11,rec.sport.hockey,hockey,999,a game played on grass by two teams of 11 players who try to score goals by hitting a ball with a curved stick called a hockey stick,recreation;sport,hockey
+12,sci.crypt,crypt,991,the use of codes to put information on a website into a form that can only be read by users with permission,science,cryptography
+13,sci.electronics,electronics,981,using electricity and extremely small electrical parts such as microchips and transistors,science,electronics
+14,sci.med,medical,990,relating to medicine and the treatment of injuries and diseases,science,medical
+15,sci.space,space,987,the whole of the universe outside the Earth’s atmosphere,science,space
+16,soc.religion.christian,christian,997,the religion based on the teachings of Jesus Christ. Its followers worship in a church.,social,christian
+17,talk.politics.guns,gun,910,"a weapon that shoots bullets, for example a pistol or a rifle. You load a gun with ammunition and pull the trigger to use it",talk;politics,gun
+18,talk.politics.mideast,mideast,940,"the region of the world that consists of the countries east of the Mediterranean Sea and west of India. It includes Egypt, Jordan, Israel, Lebanon, Syria, Turkey, Iran, and Iraq.",talk;politics,mideast
+19,talk.politics.misc,politics,775,the ideas and activities involved in getting power in a country or over a particular area of the world,talk,politics
+20,talk.religion.misc,religion,628,the belief in the existence of a god or gods,talk,religion
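
Note on the new `ClassWord` column: some `ClassDescription` values contain commas and are quoted, so the file is best read with a CSV parser rather than split on commas. The sketch below uses three rows copied from the new version of the file and the standard `csv` module. How the repository itself loads this file is not shown in this diff, so the variable names here are assumptions for illustration only.

```python
import csv
import io

# Three rows copied from the updated classLabels20news.csv shown above.
sample = '''ClassCode,ClassLabel,ConceptNet,Count,ClassDescription,Hierarchy,ClassWord
3,comp.os.ms-windows.misc,operating system,985,the software that tells the parts of a computer how to work together and what to do,computer;os;ms;windows,os
12,sci.crypt,crypt,991,the use of codes to put information on a website into a form that can only be read by users with permission,science,cryptography
17,talk.politics.guns,gun,910,"a weapon that shoots bullets, for example a pistol or a rifle. You load a gun with ammunition and pull the trigger to use it",talk;politics,gun
'''

rows = list(csv.DictReader(io.StringIO(sample)))

# Map each class label to its ClassWord.
class_words = {row["ClassLabel"]: row["ClassWord"] for row in rows}

# Labels whose ClassWord differs from the existing ConceptNet column.
differs = [row["ClassLabel"] for row in rows if row["ClassWord"] != row["ConceptNet"]]

# Hierarchy is a semicolon-separated list; drop empty entries such as a trailing ';'.
hierarchy = {row["ClassLabel"]: [h for h in row["Hierarchy"].split(";") if h] for row in rows}

print(class_words["sci.crypt"])  # cryptography
print(differs)                   # ['comp.os.ms-windows.misc', 'sci.crypt']
```

In these sample rows, `ClassWord` matches `ConceptNet` for `talk.politics.guns` but not for the other two classes.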
diff --git a/src_reject/config.py b/src_reject/config.py
@@ -2,19 +2,22 @@
 import argparse
 
 parser = argparse.ArgumentParser(description='configurations')
-parser.add_argument("--data",  type=str, required=True, help="dataset: dbpedia or 20news")
-parser.add_argument("--unseen", type=float, required=True, help="unseen rate: 0.25 0.5 0.75")
-# parser.add_argument("--aug", type=int, required=True, help="augmentation: 0 4000 8000 12000 16000 20000")
-parser.add_argument("--model", type=str, required=True, help="model: vwvcvkg vwvc vwvkg vcvkg kgonly cnnfc rnnfc")
+parser.add_argument("--data",  type=str, required=False, help="dataset: dbpedia or 20news")
+parser.add_argument("--unseen", type=float, required=False, help="unseen rate: 0.25 0.5 0.75")
+# parser.add_argument("--aug", type=int, required=False, help="augmentation: 0 4000 8000 12000 16000 20000")
+parser.add_argument("--model", type=str, required=False, help="model: vwvcvkg vwvc vwvkg vcvkg kgonly cnnfc rnnfc")
 parser.add_argument("--ns", type=int, default=2, required=False, help="negative samples: integer, the ratio of positive and negative samples, the higher the more negative samples")
 parser.add_argument("--ni", type=int, default=2, required=False, help="negative increase: integer, the speed of increasing negative samples during training per epoch")
-parser.add_argument("--sepoch", type=int, required=True, help="small epoch: integer, repeat training of each epoch for several times so that the ratio of posi/negative, learning rate both keep the same")
+parser.add_argument("--sepoch", type=int, required=False, help="small epoch: integer, repeat training of each epoch for several times so that the ratio of posi/negative, learning rate both keep the same")
+parser.add_argument("--nepoch", type=int, default = 5, required=False, help="number of epochs for training")
 parser.add_argument("--rgidx", type=int, default=1, required=False, help="random group starting index: e.g. if 5, the training will start from the 5th random group, by default 1")
-parser.add_argument("--train", type=int, required=True, help="train or not")
+parser.add_argument("--train", type=int, required=False, help="train or not")
 parser.add_argument("--gpu", type=float, default=1.0, required=False, help="gpu occupation percentage")
 parser.add_argument("--baseepoch", type=int, required=False, help="base epoch for testing")
 parser.add_argument("--fulltest", type=int, required=False, help="full test or not")
 parser.add_argument("--threshold", type=float, required=False, help="threshold for seen")
+parser.add_argument("--nott", type=int, required=False, help="no. of original texts to be translated")
+parser.add_argument("--naug", type=int, default = 0, required=False, help="no. of augmented data per unseen class")
 args = parser.parse_args()
 print(args)
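
Note on `required=False`: with this change, options that have no `default` (for example `--data`, `--unseen`, `--nott`) come back as `None` when omitted, so each script presumably has to check for the options it needs. The sketch below reproduces a subset of the arguments above to show that behaviour. The `missing_options` helper is hypothetical and not part of this commit, and the sketch passes an explicit argument list instead of reading `sys.argv`.

```python
import argparse

parser = argparse.ArgumentParser(description="configurations")
parser.add_argument("--data", type=str, required=False)
parser.add_argument("--unseen", type=float, required=False)
parser.add_argument("--nepoch", type=int, default=5, required=False)
parser.add_argument("--naug", type=int, default=0, required=False)
parser.add_argument("--nott", type=int, required=False)


def missing_options(args, names):
    """Hypothetical helper: return the names in `names` that were not supplied."""
    return [name for name in names if getattr(args, name) is None]


# Equivalent to running a script with only `--data dbpedia`.
args = parser.parse_args(["--data", "dbpedia"])

print(args.nepoch, args.naug, args.nott)           # 5 0 None
print(missing_options(args, ["data", "unseen"]))  # ['unseen']
```

A training script could call such a check early and exit with a clear message, which would restore the fail-fast behaviour that `required=True` provided before this change.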
 
@@ -110,21 +113,15 @@
 
 word_embed_file_path = "../data/glove/glove.6B.200d.txt"
 word_embed_gensim_file_path = '../data/glove/glove.6B.200d.gensim.txt'
-conceptnet_path = "../wordEmbeddings/conceptnet-assertions-en-5.6.0.csv"
+conceptnet_path = "../data/conceptnet-assertions-en-5.6.0.csv"
 POS_OF_WORD_path = "../data/POS_OF_WORD.pickle"
 WORD_TOPIC_TRANSLATION_path = "../data/WORD_TOPIC_TRANSLATION.pickle"
 
-# TODO by Peter: how to get these rejector files
-if dataset == "dbpedia" and unseen_rate == 0.25:
-    rejector_file = "./dbpedia_unseen0.25_augmented12000.pickle"
-elif dataset == "dbpedia" and unseen_rate == 0.5:
-    rejector_file = "./dbpedia_unseen0.50_augmented8000.pickle"
-elif dataset == "20news" and unseen_rate == 0.25:
-    rejector_file = "./20news_unseen0.25_augmented4000.pickle"
-elif dataset == "20news" and unseen_rate == 0.5:
-    rejector_file = "./20news_unseen0.50_augmented3000.pickle"
+if dataset in ["dbpedia", "20news"] and unseen_rate in [0.25, 0.5, 0.75]:
+	rejector_file = "../results/%s_unseen%.2f_augmented%d.pickle" % (dataset, unseen_rate, args.naug)
 else:
     rejector_file = None
+
 
 
 ##################################
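
Note on the new `rejector_file` rule: the hard-coded augmentation counts (12000, 8000, 4000, 3000) are replaced by the value of `--naug`, which defaults to `0`. The sketch below wraps the same condition and format string in a function so the resulting paths can be inspected. The function name `rejector_path` is an assumption for illustration; in `config.py` the logic runs at module level.

```python
def rejector_path(dataset, unseen_rate, naug):
    """Mirror the rejector_file rule from the hunk above as a pure function."""
    if dataset in ["dbpedia", "20news"] and unseen_rate in [0.25, 0.5, 0.75]:
        return "../results/%s_unseen%.2f_augmented%d.pickle" % (dataset, unseen_rate, naug)
    return None


print(rejector_path("dbpedia", 0.5, 100))  # ../results/dbpedia_unseen0.50_augmented100.pickle
print(rejector_path("20news", 0.25, 0))    # ../results/20news_unseen0.25_augmented0.pickle
print(rejector_path("dbpedia", 0.3, 100))  # None
```

Because `--naug` is part of the file name, Phase 1 and any later step that reads the rejector file need to be run with the same `--naug` value to resolve to the same path.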
@@ -169,13 +166,14 @@
 zhang15_dbpedia_dir = zhang15_dir + "dbpedia_csv/"
 
 zhang15_dbpedia_full_data_path = zhang15_dbpedia_dir + "full.csv"
+zhang15_dbpedia_full_augmented_path = zhang15_dbpedia_dir + "full_augmented.csv"
 
 zhang15_dbpedia_train_path = zhang15_dbpedia_dir + "train.csv"
 zhang15_dbpedia_train_processed_path = zhang15_dbpedia_dir + "processed_train_text.pkl"
 
-# TODO by Peter: how to get augmented data
-zhang15_dbpedia_train_aug_path = zhang15_dbpedia_dir + "train_augmented_aggregated.csv"
-zhang15_dbpedia_train_aug_processed_path = zhang15_dbpedia_dir + "processed_train_aug_text.pkl"
+zhang15_dbpedia_train_augmented_path = zhang15_dbpedia_dir + "train_augmented.csv"
+zhang15_dbpedia_train_augmented_aggregated_path = zhang15_dbpedia_dir + "train_augmented_aggregated.csv"
+zhang15_dbpedia_train_augmented_processed_path = zhang15_dbpedia_dir + "processed_train_augmented_text.pkl"
 
 zhang15_dbpedia_test_path = zhang15_dbpedia_dir + "test.csv"
 zhang15_dbpedia_test_processed_path = zhang15_dbpedia_dir + "processed_test_text.pkl"
@@ -194,9 +192,8 @@
 
 # zhang15_dbpedia_kg_vector_dir = zhang15_dbpedia_dir + "KG_VECTOR_3/"
 # zhang15_dbpedia_kg_vector_prefix = "KG_VECTORS_3_"
-# TODO by Peter, how to get KG_Vector files
+zhang15_dbpedia_kg_vector_node_data_path = zhang15_dbpedia_dir + 'NODES_DATA.pickle'
 zhang15_dbpedia_kg_vector_dir = zhang15_dbpedia_dir + "KG_VECTOR_CLUSTER_3GROUP/"
-# zhang15_dbpedia_kg_vector_dir = zhang15_dbpedia_dir + "KG_VECTOR_CLUSTER_ALLGROUP/"
 zhang15_dbpedia_kg_vector_prefix = "VECTORS_CLUSTER_3_"
 
 zhang15_dbpedia_word_embed_matrix_path = zhang15_dbpedia_dir + "word_embed_matrix.npz"
@@ -285,15 +282,14 @@
 news20_test_path = news20_dir + "test.csv"
 news20_test_processed_path = news20_dir + "processed_test_text.pkl"
 
-# TODO by Peter, how to get augmented data
-news20_train_aug_path = news20_dir + "train_augmented.csv"
-news20_train_aug_processed_path = news20_dir + "processed_train_aug_text.pkl"
+news20_train_augmented_path = news20_dir + "train_augmented.csv"
+news20_train_augmented_aggregated_path = news20_dir + "train_augmented_aggregated.csv"
+news20_train_augmented_processed_path = news20_dir + "processed_train_augmented_text.pkl"
+
 
 news20_vocab_path = news20_dir + "vocab.txt"
 
-# TODO by Peter, how to get kg vectors
-# news20_kg_vector_dir = news20_dir + "KG_VECTOR_3_Lem/"
-# news20_kg_vector_prefix = "lemmatised_KG_VECTORS_3_"
+news20_kg_vector_node_data_path = news20_dir + 'NODES_DATA.pickle'
 news20_kg_vector_dir = news20_dir + "KG_VECTOR_CLUSTER_3GROUP/"
 news20_kg_vector_prefix = "VECTORS_CLUSTER_3_"