Merge pull request #1 from JingqingZ/piyawat
Piyawat
JingqingZ authored Mar 23, 2019
2 parents 956d935 + a745756 commit fd0d8a7
Showing 9 changed files with 2,296 additions and 51 deletions.
67 changes: 64 additions & 3 deletions README.md
@@ -48,7 +48,8 @@ In order to run the code, please check the following issues.
- Numpy 1.14.5
- Pandas 0.21.0
- NLTK 3.2.5
- [x] Download original dataset
- tqdm 2.2.3
- [x] Download original datasets
- [GloVe.6B.200d](https://nlp.stanford.edu/projects/glove/)
- [ConceptNet v5.6.0](https://github.com/commonsense/conceptnet5/wiki/Downloads)
- [DBpedia ontology dataset](https://github.com/zhangxiangxiao/Crepe)
@@ -74,9 +75,69 @@ In order to run the code, please check the following issues.
[config.py]: src_reject/config.py
[playground.py]: src_reject/playground.py

### How to perform data augmentation

An example:
```bash
python3 topic_translation.py \
--data dbpedia \
--nott 100
```

The arguments of the command are:
* `data`: Dataset, either `dbpedia` or `20news`.
* `nott`: Number of original texts to translate into every class except their original class. If `nott` is not given, all texts in the training dataset are translated.

The location of the result file is specified by `config.{zhang15_dbpedia, news20}_train_augmented_aggregated_path`.


### How to perform feature augmentation / create v_{w,c}

An example:
```bash
python3 kg_vector_generation.py --data dbpedia
```
The argument of the command is:
* `data`: Dataset, either `dbpedia` or `20news`.

The locations of the result files are specified by `config.{zhang15_dbpedia, news20}_kg_vector_dir`.

### How to train / test Phase 1

Pending
- Without data augmentation: an example
```bash
python3 train_reject.py \
--data dbpedia \
--unseen 0.5 \
--model vw \
--nepoch 3 \
--rgidx 1 \
--train 1
```

- With data augmentation: an example
```bash
python3 train_reject_augmented.py \
--data dbpedia \
--unseen 0.5 \
--model vw \
--nepoch 3 \
--rgidx 1 \
--naug 100 \
--train 1
```

The arguments of the command are:
* `data`: Dataset, either `dbpedia` or `20news`.
* `unseen`: Rate of unseen classes, either `0.25` or `0.5`.
* `model`: The model to be trained. This argument can only be
    * `vw`: the inputs are word embeddings (from the text)
* `nepoch`: The number of training epochs.
* `train`: In Phase 1, this argument does not affect the program; training and testing always run together.
* `rgidx`: Optional. Random group starting index, `1` by default; e.g. if `5`, training starts from the 5th random group. Use it to resume after the program is accidentally interrupted.
* `naug`: The number of augmented examples per unseen class.

The location of the result file (pickle) is specified by `config.rejector_file`. The pickle file contains a list of 10 sublists, one per iteration. Each sublist holds the prediction for each test case (1 = predicted as seen, 0 = predicted as unseen).
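
The pickle layout described above can be inspected with a few lines of Python. This is only a minimal sketch: the helper names `load_rejector_predictions` and `seen_rate_per_iteration` are hypothetical and not part of the repository, and the toy list stands in for the real file, which is expected to hold 10 sublists.

```python
import pickle


def load_rejector_predictions(path):
    """Load the rejector output: a list with one sublist of 0/1 predictions per iteration."""
    with open(path, "rb") as f:
        return pickle.load(f)


def seen_rate_per_iteration(predictions):
    """Return, for each iteration, the fraction of test cases predicted as seen (label 1)."""
    return [sum(run) / len(run) if run else 0.0 for run in predictions]


# Toy stand-in for the real file (assumption): 2 iterations instead of 10.
toy_predictions = [[1, 0, 1, 1], [0, 0, 1, 1]]
print(seen_rate_per_iteration(toy_predictions))
```

For a real run, pass the path stored in `config.rejector_file` to `load_rejector_predictions` and feed the result to `seen_rate_per_iteration`.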

### How to train / test the traditional classifier in Phase 2

@@ -96,7 +157,7 @@ The arguments of the command represent
* `model`: The model to be trained. This argument can only be
    * `vw`: the inputs are word embeddings (from the text)
* `sepoch`: Repeat the training of each epoch several times. The ratio of positive/negative samples and the learning rate stay constant within one epoch no matter how many times the epoch is repeated.
* `train`: In Phase 1, this argument does not affect the program. The program will run training and testing together.
* `train`: For the traditional classifier, this argument does not affect the program. The program will run training and testing together.
* `rgidx`: Optional. Random group starting index, `1` by default; e.g. if `5`, training starts from the 5th random group. Use it to resume after the program is accidentally interrupted.
* `gpu`: Optional, GPU occupation percentage, by default `1.0`, which means full occupation of available GPUs.
* `baseepoch`: Optional, you may want to specify which epoch to test.
42 changes: 21 additions & 21 deletions data/20-newsgroups/clean/classLabels20news.csv
@@ -1,21 +1,21 @@
ClassCode,ClassLabel,ConceptNet,Count,ClassDescription,Hierarchy
1,alt.atheism,atheism,799,the belief or theory that God does not exist,alt
2,comp.graphics,graphics,973,pictures produced by computers,computer
3,comp.os.ms-windows.misc,operating system,985,the software that tells the parts of a computer how to work together and what to do,computer;os;ms;windows
4,comp.sys.ibm.pc.hardware,ibm,982,ibm personal computer equipments,computer;system;pc;hardware
5,comp.sys.mac.hardware,mac,961,mac computer equipment,computer;system;hardware
6,comp.windows.x,windows,980,windows x,computer;x;
7,misc.forsale,sale,972,the process of selling goods or services for money,
8,rec.autos,auto,990,relating to cars,recreation
9,rec.motorcycles,motorcycle,994,a road vehicle that has two wheels and an engine and looks like a large heavy bicycle,recreation
10,rec.sport.baseball,baseball,994,a game played by two teams of nine players who get points by hitting a ball with a bat and then running around four bases,recreation;sport
11,rec.sport.hockey,hockey,999,a game played on grass by two teams of 11 players who try to score goals by hitting a ball with a curved stick called a hockey stick,recreation;sport
12,sci.crypt,crypt,991,the use of codes to put information on a website into a form that can only be read by users with permission,science
13,sci.electronics,electronics,981,using electricity and extremely small electrical parts such as microchips and transistors,science
14,sci.med,medical,990,relating to medicine and the treatment of injuries and diseases,science
15,sci.space,space,987,the whole of the universe outside the Earth’s atmosphere,science
16,soc.religion.christian,christian,997,the religion based on the teachings of Jesus Christ. Its followers worship in a church.,social
17,talk.politics.guns,gun,910,"a weapon that shoots bullets, for example a pistol or a rifle. You load a gun with ammunition and pull the trigger to use it",talk;politics
18,talk.politics.mideast,mideast,940,"the region of the world that consists of the countries east of the Mediterranean Sea and west of India. It includes Egypt, Jordan, Israel, Lebanon, Syria, Turkey, Iran, and Iraq.",talk;politics
19,talk.politics.misc,politics,775,the ideas and activities involved in getting power in a country or over a particular area of the world,talk
20,talk.religion.misc,religion,628,the belief in the existence of a god or gods,talk
ClassCode,ClassLabel,ConceptNet,Count,ClassDescription,Hierarchy,ClassWord
1,alt.atheism,atheism,799,the belief or theory that God does not exist,alt,atheism
2,comp.graphics,graphics,973,pictures produced by computers,computer,graphics
3,comp.os.ms-windows.misc,operating system,985,the software that tells the parts of a computer how to work together and what to do,computer;os;ms;windows,os
4,comp.sys.ibm.pc.hardware,ibm,982,ibm personal computer equipments,computer;system;pc;hardware,ibm
5,comp.sys.mac.hardware,mac,961,mac computer equipment,computer;system;hardware,mac
6,comp.windows.x,windows,980,windows x,computer;x;,windows
7,misc.forsale,sale,972,the process of selling goods or services for money,,sale
8,rec.autos,auto,990,relating to cars,recreation,auto
9,rec.motorcycles,motorcycle,994,a road vehicle that has two wheels and an engine and looks like a large heavy bicycle,recreation,motorcycle
10,rec.sport.baseball,baseball,994,a game played by two teams of nine players who get points by hitting a ball with a bat and then running around four bases,recreation;sport,baseball
11,rec.sport.hockey,hockey,999,a game played on grass by two teams of 11 players who try to score goals by hitting a ball with a curved stick called a hockey stick,recreation;sport,hockey
12,sci.crypt,crypt,991,the use of codes to put information on a website into a form that can only be read by users with permission,science,cryptography
13,sci.electronics,electronics,981,using electricity and extremely small electrical parts such as microchips and transistors,science,electronics
14,sci.med,medical,990,relating to medicine and the treatment of injuries and diseases,science,medical
15,sci.space,space,987,the whole of the universe outside the Earth’s atmosphere,science,space
16,soc.religion.christian,christian,997,the religion based on the teachings of Jesus Christ. Its followers worship in a church.,social,christian
17,talk.politics.guns,gun,910,"a weapon that shoots bullets, for example a pistol or a rifle. You load a gun with ammunition and pull the trigger to use it",talk;politics,gun
18,talk.politics.mideast,mideast,940,"the region of the world that consists of the countries east of the Mediterranean Sea and west of India. It includes Egypt, Jordan, Israel, Lebanon, Syria, Turkey, Iran, and Iraq.",talk;politics,mideast
19,talk.politics.misc,politics,775,the ideas and activities involved in getting power in a country or over a particular area of the world,talk,politics
20,talk.religion.misc,religion,628,the belief in the existence of a god or gods,talk,religion
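
The updated class table, with its new `ClassWord` column, can be read with the standard `csv` module. The sketch below is illustrative only: `load_class_table` is a hypothetical helper, not a function from the repository, and the sample embeds three rows copied from the new version of the file. It drops empty `Hierarchy` entries, since some rows end with a trailing `;` or leave the field blank.

```python
import csv
import io

# Three rows copied from the new version of classLabels20news.csv.
SAMPLE = """ClassCode,ClassLabel,ConceptNet,Count,ClassDescription,Hierarchy,ClassWord
6,comp.windows.x,windows,980,windows x,computer;x;,windows
7,misc.forsale,sale,972,the process of selling goods or services for money,,sale
12,sci.crypt,crypt,991,the use of codes to put information on a website into a form that can only be read by users with permission,science,cryptography
"""


def load_class_table(fileobj):
    """Map each ClassLabel to its code, ConceptNet term, ClassWord and hierarchy list."""
    table = {}
    for row in csv.DictReader(fileobj):
        table[row["ClassLabel"]] = {
            "code": int(row["ClassCode"]),
            "concept": row["ConceptNet"],
            "class_word": row["ClassWord"],
            # Drop empty pieces left by a trailing ';' or a blank field.
            "hierarchy": [h for h in row["Hierarchy"].split(";") if h],
        }
    return table


table = load_class_table(io.StringIO(SAMPLE))
print(table["sci.crypt"]["class_word"])
```

To read the real file, open `data/20-newsgroups/clean/classLabels20news.csv` with `newline=""` and `encoding="utf-8"` and pass the file object instead of the `StringIO` sample.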
50 changes: 23 additions & 27 deletions src_reject/config.py
@@ -2,19 +2,22 @@
import argparse

parser = argparse.ArgumentParser(description='configurations')
parser.add_argument("--data", type=str, required=True, help="dataset: dbpedia or 20news")
parser.add_argument("--unseen", type=float, required=True, help="unseen rate: 0.25 0.5 0.75")
# parser.add_argument("--aug", type=int, required=True, help="augmentation: 0 4000 8000 12000 16000 20000")
parser.add_argument("--model", type=str, required=True, help="model: vwvcvkg vwvc vwvkg vcvkg kgonly cnnfc rnnfc")
parser.add_argument("--data", type=str, required=False, help="dataset: dbpedia or 20news")
parser.add_argument("--unseen", type=float, required=False, help="unseen rate: 0.25 0.5 0.75")
# parser.add_argument("--aug", type=int, required=False, help="augmentation: 0 4000 8000 12000 16000 20000")
parser.add_argument("--model", type=str, required=False, help="model: vwvcvkg vwvc vwvkg vcvkg kgonly cnnfc rnnfc")
parser.add_argument("--ns", type=int, default=2, required=False, help="negative samples: integer, the ratio of positive and negative samples, the higher the more negative samples")
parser.add_argument("--ni", type=int, default=2, required=False, help="negative increase: integer, the speed of increasing negative samples during training per epoch")
parser.add_argument("--sepoch", type=int, required=True, help="small epoch: integer, repeat training of each epoch for several times so that the ratio of posi/negative, learning rate both keep the same")
parser.add_argument("--sepoch", type=int, required=False, help="small epoch: integer, repeat training of each epoch for several times so that the ratio of posi/negative, learning rate both keep the same")
parser.add_argument("--nepoch", type=int, default = 5, required=False, help="number of epochs for training")
parser.add_argument("--rgidx", type=int, default=1, required=False, help="random group starting index: e.g. if 5, the training will start from the 5th random group, by default 1")
parser.add_argument("--train", type=int, required=True, help="train or not")
parser.add_argument("--train", type=int, required=False, help="train or not")
parser.add_argument("--gpu", type=float, default=1.0, required=False, help="gpu occupation percentage")
parser.add_argument("--baseepoch", type=int, required=False, help="base epoch for testing")
parser.add_argument("--fulltest", type=int, required=False, help="full test or not")
parser.add_argument("--threshold", type=float, required=False, help="threshold for seen")
parser.add_argument("--nott", type=int, required=False, help="no. of original texts to be translated")
parser.add_argument("--naug", type=int, default = 0, required=False, help="no. of augmented data per unseen class")
args = parser.parse_args()
print(args)
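
One consequence of switching these arguments to `required=False` is that any omitted argument without a `default` parses to `None`. The sketch below rebuilds a reduced parser with four of the arguments from this hunk to show that behaviour; `build_parser` is a hypothetical wrapper, and the argument list is passed explicitly so the example does not read `sys.argv`.

```python
import argparse


def build_parser():
    """Reduced copy of the parser in config.py (only four of its arguments)."""
    parser = argparse.ArgumentParser(description="configurations")
    parser.add_argument("--data", type=str, required=False, help="dataset: dbpedia or 20news")
    parser.add_argument("--unseen", type=float, required=False, help="unseen rate: 0.25 0.5 0.75")
    parser.add_argument("--nepoch", type=int, default=5, required=False, help="number of epochs for training")
    parser.add_argument("--naug", type=int, default=0, required=False, help="no. of augmented data per unseen class")
    return parser


# Only --data is supplied, as in the kg_vector_generation.py example.
args = build_parser().parse_args(["--data", "dbpedia"])
print(args.data, args.unseen, args.nepoch, args.naug)
```

Scripts that need `--unseen` or `--model` may therefore want to check for `None` themselves, since argparse no longer rejects a missing value.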

@@ -110,21 +113,15 @@

word_embed_file_path = "../data/glove/glove.6B.200d.txt"
word_embed_gensim_file_path = '../data/glove/glove.6B.200d.gensim.txt'
conceptnet_path = "../wordEmbeddings/conceptnet-assertions-en-5.6.0.csv"
conceptnet_path = "../data/conceptnet-assertions-en-5.6.0.csv"
POS_OF_WORD_path = "../data/POS_OF_WORD.pickle"
WORD_TOPIC_TRANSLATION_path = "../data/WORD_TOPIC_TRANSLATION.pickle"

# TODO by Peter: how to get these rejector files
if dataset == "dbpedia" and unseen_rate == 0.25:
    rejector_file = "./dbpedia_unseen0.25_augmented12000.pickle"
elif dataset == "dbpedia" and unseen_rate == 0.5:
    rejector_file = "./dbpedia_unseen0.50_augmented8000.pickle"
elif dataset == "20news" and unseen_rate == 0.25:
    rejector_file = "./20news_unseen0.25_augmented4000.pickle"
elif dataset == "20news" and unseen_rate == 0.5:
    rejector_file = "./20news_unseen0.50_augmented3000.pickle"
if dataset in ["dbpedia", "20news"] and unseen_rate in [0.25, 0.5, 0.75]:
    rejector_file = "../results/%s_unseen%.2f_augmented%d.pickle" % (dataset, unseen_rate, args.naug)
else:
    rejector_file = None
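
The new rule for `rejector_file` can be tried in isolation. The sketch below wraps the same condition and format string in a hypothetical function, `rejector_path`, which does not exist in `config.py`; the inputs are example values taken from the README commands.

```python
def rejector_path(dataset, unseen_rate, naug):
    """Mirror the new config.py rule: build the pickle path, or return None for unsupported settings."""
    if dataset in ["dbpedia", "20news"] and unseen_rate in [0.25, 0.5, 0.75]:
        return "../results/%s_unseen%.2f_augmented%d.pickle" % (dataset, unseen_rate, naug)
    return None


print(rejector_path("dbpedia", 0.5, 100))
print(rejector_path("dbpedia", None, 0))
```

The second call shows the fallback: when `--unseen` is omitted and parses to `None`, no rejector path is produced.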



##################################
@@ -169,13 +166,14 @@
zhang15_dbpedia_dir = zhang15_dir + "dbpedia_csv/"

zhang15_dbpedia_full_data_path = zhang15_dbpedia_dir + "full.csv"
zhang15_dbpedia_full_augmented_path = zhang15_dbpedia_dir + "full_augmented.csv"

zhang15_dbpedia_train_path = zhang15_dbpedia_dir + "train.csv"
zhang15_dbpedia_train_processed_path = zhang15_dbpedia_dir + "processed_train_text.pkl"

# TODO by Peter: how to get augmented data
zhang15_dbpedia_train_aug_path = zhang15_dbpedia_dir + "train_augmented_aggregated.csv"
zhang15_dbpedia_train_aug_processed_path = zhang15_dbpedia_dir + "processed_train_aug_text.pkl"
zhang15_dbpedia_train_augmented_path = zhang15_dbpedia_dir + "train_augmented.csv"
zhang15_dbpedia_train_augmented_aggregated_path = zhang15_dbpedia_dir + "train_augmented_aggregated.csv"
zhang15_dbpedia_train_augmented_processed_path = zhang15_dbpedia_dir + "processed_train_augmented_text.pkl"

zhang15_dbpedia_test_path = zhang15_dbpedia_dir + "test.csv"
zhang15_dbpedia_test_processed_path = zhang15_dbpedia_dir + "processed_test_text.pkl"
@@ -194,9 +192,8 @@

# zhang15_dbpedia_kg_vector_dir = zhang15_dbpedia_dir + "KG_VECTOR_3/"
# zhang15_dbpedia_kg_vector_prefix = "KG_VECTORS_3_"
# TODO by Peter, how to get KG_Vector files
zhang15_dbpedia_kg_vector_node_data_path = zhang15_dbpedia_dir + 'NODES_DATA.pickle'
zhang15_dbpedia_kg_vector_dir = zhang15_dbpedia_dir + "KG_VECTOR_CLUSTER_3GROUP/"
# zhang15_dbpedia_kg_vector_dir = zhang15_dbpedia_dir + "KG_VECTOR_CLUSTER_ALLGROUP/"
zhang15_dbpedia_kg_vector_prefix = "VECTORS_CLUSTER_3_"

zhang15_dbpedia_word_embed_matrix_path = zhang15_dbpedia_dir + "word_embed_matrix.npz"
@@ -285,15 +282,14 @@
news20_test_path = news20_dir + "test.csv"
news20_test_processed_path = news20_dir + "processed_test_text.pkl"

# TODO by Peter, how to get augmented data
news20_train_aug_path = news20_dir + "train_augmented.csv"
news20_train_aug_processed_path = news20_dir + "processed_train_aug_text.pkl"
news20_train_augmented_path = news20_dir + "train_augmented.csv"
news20_train_augmented_aggregated_path = news20_dir + "train_augmented_aggregated.csv"
news20_train_augmented_processed_path = news20_dir + "processed_train_augmented_text.pkl"


news20_vocab_path = news20_dir + "vocab.txt"

# TODO by Peter, how to get kg vectors
# news20_kg_vector_dir = news20_dir + "KG_VECTOR_3_Lem/"
# news20_kg_vector_prefix = "lemmatised_KG_VECTORS_3_"
news20_kg_vector_node_data_path = news20_dir + 'NODES_DATA.pickle'
news20_kg_vector_dir = news20_dir + "KG_VECTOR_CLUSTER_3GROUP/"
news20_kg_vector_prefix = "VECTORS_CLUSTER_3_"
