DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation

DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders. Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Alexandre Muzio, Saksham Singhal, Hany Hassan Awadalla, Xia Song, Furu Wei. CoRR abs/2106.13736.

mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs. Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Xian-Ling Mao, Heyan Huang, and Furu Wei. In EMNLP 2021.


Pretrained Models

  • DeltaLM-base: #enc-dec=12-6; #hidden=768; #head=12; #FFN=3072 (#parameters: 360M)
  • DeltaLM-large: #enc-dec=24-12; #hidden=1024; #head=16; #FFN=4096 (#parameters: 830M)
  • Vocabulary and Sentencepiece-model
  • DeltaLM can be fine-tuned to support language generation and translation tasks in 100+ languages
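
The fine-tuning steps below assume the downloaded model weights, Sentencepiece model, and vocabulary sit in a single checkpoint directory. A minimal sketch of that assumed layout, using the file names that appear in the commands later in this README:

.
+-- /path/to/checkpoint/
|   +-- model.pt
|   +-- spm.model
|   +-- dict.txt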

Cross-lingual Abstractive Summarization - Wikilingua

We evaluate DeltaLM on the Wikilingua cross-lingual abstractive summarization benchmark. The reported scores are averaged across languages.

Model    #Params  ROUGE-1  ROUGE-2  ROUGE-L
mBART    610M     34.5     12.9     28.7
mT5      300M     27.5     8.8      22.8
mT5      580M     31.8     11.5     26.0
DeltaLM  360M     35.3     13.4     28.7

Setup

git submodule update --init deltalm/fairseq
cd deltalm/
pip install --editable fairseq/
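
To confirm that the editable install is picked up, one quick sanity check (a hedged suggestion, not part of the original setup) is:

python -c "import fairseq; print(fairseq.__version__)"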

Fine-tuning

  1. Organize the raw data in the following structure:
.
+-- /path/to/data/
|   +-- train.src
|   +-- train.tgt
|   +-- valid.src
|   +-- valid.tgt

Examples (IWSLT14 German to English):

bash examples/prepare_iwslt14.sh /tmp/iwslt14
  2. Tokenize the data using Sentencepiece:
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < train.src > train.spm.src
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < train.tgt > train.spm.tgt
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < valid.src > valid.spm.src
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < valid.tgt > valid.spm.tgt
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < test.src > test.spm.src
spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece < test.tgt > test.spm.tgt
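
The six commands above differ only in the data split and the language side; the same tokenization can be written as one loop (an equivalent sketch, assuming the raw files from step 1, plus a test split, are in the current directory):

for split in train valid test; do
    for side in src tgt; do
        spm_encode --model=/path/to/checkpoint/spm.model --output_format=piece \
            < $split.$side > $split.spm.$side
    done
done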

Examples (IWSLT14 German to English):

bash examples/binary_iwslt14.sh \
     /tmp/iwslt14/iwslt14.tokenized.de-en \
     /tmp/iwslt14/iwslt14.spm \
     /path/to/checkpoint/spm.model
  3. Binarize the data:
data_bin=/path/to/data-bin/
python preprocess.py  \
    --trainpref train.spm \
    --validpref valid.spm \
    --testpref test.spm \
    --source-lang src --target-lang tgt \
    --destdir $data_bin \
    --srcdict /path/to/checkpoint/dict.txt \
    --tgtdict /path/to/checkpoint/dict.txt \
    --workers 40

Examples (IWSLT14 German to English):

bash examples/binary_iwslt14.sh \
     /tmp/iwslt14/iwslt14.spm \
     /tmp/iwslt14/iwslt14.bin \
     /path/to/checkpoint/dict.txt
  4. Fine-tuning:
PRETRAINED_MODEL=/path/to/checkpoint/model.pt
python train.py $data_bin \
    --save-dir $save_dir \
    --arch deltalm_base \
    --pretrained-deltalm-checkpoint $PRETRAINED_MODEL \
    --share-all-embeddings \
    --max-source-positions 512 --max-target-positions 512 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr-scheduler inverse_sqrt \
    --lr $lr \
    --warmup-init-lr 1e-07 \
    --stop-min-lr 1e-09 \
    --warmup-updates 4000 \
    --max-update 400000 \
    --max-epoch 100 \
    --max-tokens $batch_size \
    --update-freq 1 \
    --seed 1 \
    --log-format simple \
    --skip-invalid-size-inputs-valid-test

Note:

  • For the large checkpoint, set --arch deltalm_large.
  • Adjust --max-tokens and --update-freq to fit your experimental environment. The recommended total batch size is 4096 * 128 tokens per step (see the sketch after this list).
  • Use --fp16 for more efficient training on devices with Tensor Cores.
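
The recommended total of 4096 * 128 tokens per step is the product of --max-tokens, --update-freq, and the number of GPUs, so the two flags should be chosen from the available GPU count. A hedged sketch (the GPU counts are illustrative assumptions, not from this README):

# 8 GPUs:  4096 tokens * update-freq 16 * 8 GPUs  = 4096 * 128 tokens per step
--max-tokens 4096 --update-freq 16
# 32 GPUs: 4096 tokens * update-freq 4  * 32 GPUs = 4096 * 128 tokens per step
--max-tokens 4096 --update-freq 4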

Examples (IWSLT14 German to English):

bash examples/train_iwslt14.sh \
     /tmp/iwslt14/iwslt14.bin \
     /tmp/iwslt14/checkpoints \
     /path/to/checkpoint/model.pt
  5. Evaluation:
python generate.py $data_bin \
    --path $save_dir/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe=sentencepiece
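
generate.py prints hypotheses on lines prefixed with H- and references on lines prefixed with T-. One common way to turn that output into a corpus-level BLEU score, sketched here with sacrebleu (the file names are placeholders and this post-processing is not part of this repository; it assumes the generate.py output above was redirected to gen.out):

grep ^H gen.out | LC_ALL=C sort -V | cut -f3- > gen.hyp
grep ^T gen.out | LC_ALL=C sort -V | cut -f2- > gen.ref
sacrebleu gen.ref -i gen.hyp -m bleu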

Examples (IWSLT14 German to English):

bash examples/evaluate_iwslt14.sh \
     /tmp/iwslt14/iwslt14.bin \
     /tmp/iwslt14/checkpoints

Citation

If you find this repository useful, please consider citing our work:

@article{deltalm,
      title={{DeltaLM}: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders}, 
      author={Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Alexandre Muzio and Saksham Singhal and Hany Hassan Awadalla and Xia Song and Furu Wei},
      year={2021},
      eprint={2106.13736},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgement

This repository is built using the Fairseq repository.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using DeltaLM models, please submit a GitHub issue.

For other communications related to DeltaLM, please contact Shuming Ma ([email protected]) or Furu Wei ([email protected]).