[SIGIR 2023] This is the code for our paper "Weakly-Supervised Scientific Document Classification via Retrieval-Augmented Multi-Stage Training".
The weakly labeled datasets, including the unlabeled training data, test data, label names, and expanded label names, are provided here: dataset. The dataset statistics are summarized as follows:
Dataset | MeSH | arXiv-CS | arXiv-Math |
---|---|---|---|
Type | Topic | Topic | Topic |
# of Train | 16.3k | 75.7k | 62.5k |
# of Test | 3.5k | 5.1k | 6.3k |
# OOV Label Names | 4 (36%) | 5 (25%) | 3 (19%) |
We have uploaded the dense retrieval model after pretraining at this link, following the code in this link and this link. Note that we use the embedding of the [CLS] token as the sequence embedding.
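For reference, the sequence embedding is simply the hidden state of the [CLS] (first) token. A minimal sketch with Hugging Face Transformers, where the checkpoint path `dr_checkpoint/` is a placeholder for the downloaded dense retrieval model:

```python
# Minimal sketch: encode a document and take the [CLS] token embedding.
# "dr_checkpoint/" is a placeholder for the pretrained dense retrieval checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dr_checkpoint/")
model = AutoModel.from_pretrained("dr_checkpoint/").eval()

text = "An example scientific abstract to embed."
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The hidden state at position 0 ([CLS]) serves as the sequence embedding.
cls_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, hidden_size)
```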
- PyTorch 1.8
- Python 3.7
- Transformers 4.6.0
- tqdm
- nltk
- MulticoreTSNE
- `gen_embedding.py`: the main code to produce sequence embeddings. We have provided the embeddings generated with our checkpoint at this link.
- `dense_retrieval.py`: the main code to run the retrieval module.
- `keyword_expansion.py`: the keyword expansion module with local and global context.
- `main.py` and `trainer.py`: the main code for classifier training.
```bash
python dense_retrieval.py --dataset=wos --model=${DR_model} --gpuid=2 --round=0 --type=train --target=wos --N=$N --prompt_id=0
```
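Conceptually, this step ranks the unlabeled documents by their similarity to each label-name query embedding and keeps the top-N documents per class as pseudo-labeled training data. A minimal sketch of that idea (array names and the cosine-similarity scoring are illustrative, not the exact interface of `dense_retrieval.py`):

```python
# Minimal sketch: top-N retrieval by cosine similarity.
# doc_embs (num_docs x d) and label_embs (num_classes x d) are assumed to be
# [CLS] embeddings produced as above; n is the number of documents kept per class.
import numpy as np

def retrieve_top_n(doc_embs: np.ndarray, label_embs: np.ndarray, n: int) -> np.ndarray:
    # L2-normalize so that the dot product equals cosine similarity.
    doc_embs = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    scores = label_embs @ doc_embs.T               # (num_classes, num_docs)
    top_n = np.argsort(-scores, axis=1)[:, :n]     # indices of the n closest docs per class
    return top_n  # row c holds the pseudo-labeled documents for class c
```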
```bash
bash commands/run_classifier.sh
```
Parameter Setting:
- `task`: the downstream dataset
- `round`: the retrieval round (set it to `0` at this stage)
- `semi_method`: the type of learning (fine-tuning or self-training); set it to `ft` at this stage
- `lr`: the learning rate
- `batch_size`: the batch size for training
- `gpu`: the GPU to allocate to speed up training
In this step (stage-II in the paper), local and global information is used to select keywords that augment the label names for each class; a conceptual sketch follows the loop below.
```bash
ROUNDS=6
TARGET=mesh
DR_model=arxiv_ckpt
loc=1
glo=1
for ((i=0;i<${ROUNDS};++i)); do
    echo "Dataset ${TARGET}, DR Model: ${DR_model}, Round${i}, local ${loc}, global ${glo}"
    python keyword_expansion.py --dr_model=${DR_model} --topN=$N --round=$i --target=${TARGET} --loc=${loc} --glo=${glo}
    python dense_retrieval.py --dataset=${TARGET} --model=${DR_model} --gpuid=0 --round=$(($i+1)) --type=train --N=$N --prompt_id=0 --loc=${loc} --glo=${glo}
done
```
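At a high level, each round scores candidate keywords for a class using the documents retrieved for that class (local context) together with embedding similarity to the class name (global context), and adds the top-scoring words to the label name for the next retrieval round. One plausible way to combine the two signals, shown purely as an illustration (the scoring below is an assumption, not the exact formulation in `keyword_expansion.py`):

```python
# Illustrative sketch: rank candidate keywords for one class by combining
# a local score (frequency in the class's retrieved documents) and a
# global score (embedding similarity to the class-name embedding).
from collections import Counter
import numpy as np

def expand_label(retrieved_docs, class_emb, word_embs, vocab,
                 top_k=5, use_local=True, use_global=True):
    # Local context: how often each vocabulary word occurs in the retrieved documents.
    counts = Counter(w for doc in retrieved_docs for w in doc.lower().split())
    local = np.array([counts[w] for w in vocab], dtype=float)
    local = local / (local.max() + 1e-12)

    # Global context: cosine similarity between word embeddings and the class-name embedding.
    w = word_embs / np.linalg.norm(word_embs, axis=1, keepdims=True)
    c = class_emb / np.linalg.norm(class_emb)
    global_sim = w @ c

    score = np.zeros(len(vocab))
    if use_local:     # corresponds to the `loc` flag
        score += local
    if use_global:    # corresponds to the `glo` flag
        score += global_sim
    return [vocab[i] for i in np.argsort(-score)[:top_k]]
```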
In this step, you first fine-tune the classifiers on the documents retrieved with the expanded label names from step 3. After that, you can use self-training to further refine the classifiers on the full unlabeled corpus.
```bash
bash commands/run_classifier.sh
```
Parameter Setting:
- `task`: the downstream dataset
- `round`: the retrieval round (set it to `${the_number_of_dense_retrieval_rounds}` at this stage)
- `semi_method`: the type of learning; set it to `ft` for fine-tuning and `st` for self-training. Normally you first fine-tune the model, then use the fine-tuned model as the starting point for self-training.
- `lr`: the learning rate
- `batch_size`: the batch size for training
- `gpu`: the GPU to allocate to speed up training
- For self-training, you need to set `load_prev` to `1` and set `prev_ckpt` to the checkpoint obtained from the previous fine-tuning step.
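A common form of self-training uses the fine-tuned classifier to pseudo-label the remaining unlabeled documents and keeps only confident predictions as extra training data for the next refinement step. A minimal sketch of that loop (the confidence threshold and variable names are assumptions for illustration, not necessarily the exact procedure in `trainer.py`):

```python
# Minimal sketch: one self-training pass that keeps only confident predictions.
# `model` is the fine-tuned classifier (e.g., loaded via `prev_ckpt`) and
# `unlabeled_loader` yields batches of tokenized unlabeled documents.
import torch
import torch.nn.functional as F

def pseudo_label(model, unlabeled_loader, threshold=0.9):
    model.eval()
    kept_inputs, kept_labels = [], []
    with torch.no_grad():
        for batch in unlabeled_loader:
            logits = model(**batch).logits
            probs = F.softmax(logits, dim=-1)
            conf, preds = probs.max(dim=-1)
            keep = conf >= threshold               # keep only confident predictions
            kept_inputs.append({k: v[keep] for k, v in batch.items()})
            kept_labels.append(preds[keep])
    return kept_inputs, kept_labels  # used as additional training data in the next round
```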
If you have any questions, feel free to reach out to ran.xu at emory.edu. Please describe the problem in detail so we can help you better and more quickly!
If you find this repository valuable for your research, we kindly request that you cite the following paper. We appreciate your consideration.
@inproceedings{xu2023weakly,
title={Weakly-Supervised Scientific Document Classification via Retrieval-Augmented Multi-Stage Training},
author={Xu, Ran and Yu, Yue and Ho, Joyce C and Yang, Carl},
booktitle={Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval},
year={2023}
}