# Attention-based sequence to sequence learning
## Dependencies

- TensorFlow 1.2+ for Python 3
- YAML and Matplotlib modules for Python 3: `sudo apt-get install python3-yaml python3-matplotlib`
- A recent NVIDIA GPU
## How to use

Train a model (CONFIG is a YAML configuration file, such as config/default.yaml):

    ./seq2seq.sh CONFIG --train -v
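For concreteness, here is a minimal sketch of what such a configuration file could contain. The key names below are hypothetical, chosen only to illustrate the idea; refer to config/default.yaml for the actual schema and defaults.

```yaml
# Hypothetical configuration sketch -- the key names are illustrative,
# not the toolkit's actual schema (see config/default.yaml)
label: baseline          # experiment name
data_dir: data/WMT14     # where the pre-processed corpus lives
model_dir: models/WMT14  # where checkpoints are written
cell_size: 512           # RNN state size
embedding_size: 512
batch_size: 64
steps_per_checkpoint: 1000
```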
Translate text using an existing model:

    ./seq2seq.sh CONFIG --decode FILE_TO_TRANSLATE --output OUTPUT_FILE

or, for interactive decoding:

    ./seq2seq.sh CONFIG --decode
## Example model (WMT14)

This is the same model and dataset as Bahdanau et al. (2015).

    config/WMT14/download.sh    # download WMT14 data into raw_data/WMT14
    config/WMT14/prepare.sh     # pre-process the data and copy the files to data/WMT14
    ./seq2seq.sh config/WMT14/baseline.yaml --train -v    # train a baseline model on this data
You should get BLEU scores similar to these (our model was trained on a single Titan X for about 4 days):
| Dev | Test | +beam | Steps | Time |
|---|---|---|---|---|
| 25.04 | 28.64 | 29.22 | 240k | 60h |
| 25.25 | 28.67 | 29.28 | 330k | 80h |
Download this model here. To use it, extract the archive into the seq2seq/models folder and run:

    ./seq2seq.sh models/WMT14/config.yaml --decode -v
## Example model (IWSLT14)

This is the same dataset as Ranzato et al. (2015).

    config/IWSLT14/prepare.sh
    ./seq2seq.sh config/IWSLT14/baseline.yaml --train -v
| Dev | Test | +beam | Steps |
|---|---|---|---|
| 28.32 | 25.33 | 26.74 | 44k |
The model is available for download here.
## Audio pre-processing

If you want to use the toolkit for Automatic Speech Recognition (ASR) or Automatic Speech Translation (AST), you'll need to pre-process your audio files accordingly. This README details how it can be done. You'll need to install the Yaafe library and use scripts/speech/extract-audio-features.py to extract MFCCs from a set of wav files, as sketched below.
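For illustration only, an extraction run could look something like the following. The flag names are assumptions, not documented options of this script, so check its own help text first:

```bash
# The script's interface isn't documented here -- inspect its help first
python3 scripts/speech/extract-audio-features.py --help

# Hypothetical invocation: compute MFCCs for a set of wav files and
# write them to a feature file (the --output flag is illustrative only)
python3 scripts/speech/extract-audio-features.py wav/*.wav --output data/train.feats
```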
## Features

- YAML configuration files
- Beam-search decoder
- Ensemble decoding
- Multiple encoders
- Hierarchical encoder
- Bidirectional encoder
- Local attention model
- Convolutional attention model
- Detailed logging
- Periodic BLEU evaluation
- Periodic checkpoints
- Multi-task training: train on several tasks at once (e.g. French->English and German->English MT)
- Subword training and decoding
- Binary input features instead of text
- Pre-processing script: we provide a fully-featured Python script for data pre-processing (vocabulary creation, lowercasing, tokenizing, splitting, etc.)
- Dynamic RNNs: we use symbolic loops instead of statically unrolled RNNs. This means that we don't need to manually configure bucket sizes, and that model creation is much faster (see the sketch after this list)
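To illustrate that last point, here is a minimal TensorFlow 1.x sketch (not code from this project) of a dynamic RNN: tf.nn.dynamic_rnn runs the cell inside a symbolic while-loop, so the graph does not grow with sequence length and no bucketing is required. The tensor shapes and cell size are arbitrary choices for the example.

```python
import tensorflow as tf

# [batch, time, features] with an unknown time dimension:
# any sequence length works at run time, so no buckets are needed
embedding_size = 256  # arbitrary value for this example
inputs = tf.placeholder(tf.float32, [None, None, embedding_size])
lengths = tf.placeholder(tf.int32, [None])  # true length of each sequence

cell = tf.nn.rnn_cell.LSTMCell(num_units=512)

# dynamic_rnn builds a symbolic tf.while_loop instead of adding one copy
# of the cell per time step to the graph, which keeps graph construction
# fast; positions past `lengths` are treated as padding
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=lengths, dtype=tf.float32)
```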
## Credits

- This project is based on TensorFlow's reference implementation
- We include some of the pre-processing scripts from Moses
- The scripts for subword units come from github.com/rsennrich/subword-nmt