GigaSpeech is a large English speech recognition corpus collected from audiobooks, podcasts, and YouTube. GigaSpeech contains 5 subsets of different sizes for different usages. We use the largest subset, XL, which contains 10,000 hours of audio, as the training data for ASR.
In this recipe, we introduce how to pre-process the GigaSpeech corpus and how to train/evaluate an ASR model using neurst. We also give an example of how to use our benchmark model for prediction.
- apt
  - libsndfile1
  - ffmpeg
- pip
  - TensorFlow >=2.4.1
  - soundfile
  - python_speech_features
  - pyyaml
  - sentencepiece
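
On a Debian-based system, these requirements can be installed with, for example (installation commands are assumptions; package names follow the list above):

```bash
sudo apt-get install -y libsndfile1 ffmpeg
pip3 install "tensorflow>=2.4.1" soundfile python_speech_features pyyaml sentencepiece
```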
The final performance of ASR on GigaSpeech is
See RESULTS for the comparison with counterparts.
- ASR (dmodel=1024, WER)
| Model | Dev | Test |
|---|---|---|
| speech_transformer_l | 11.89 | 11.60 |
The benchmark models:
| Task | Language | Models | Hypothesis | Reference |
|---|---|---|---|---|
| ASR | EN | speech_transformer_l | Dev / Test | Dev / Test |
An example of how to use our benchmark models for inference:
```bash
# create dev and test set
./03-create_devtest_set.sh /path_to_data/

# download and untar models
wget https://lf3-nlp-opensource.bytetos.com/obj/nlp-opensource/neurst/speech_to_text/gigaspeech/ckpt.tgz
tar -zxvf ckpt.tgz

# show WER on test and dev
python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr/asr_prediction_args.yml \
    --model_dir giga_xl

# save hypothesis to file
python3 -m neurst.cli.run_exp \
    --config_paths /path_to_data/asr/asr_prediction_args.yml \
    --model_dir giga_xl \
    --output_file "{'dev': 'dev.hypo.txt', 'test': 'test.hypo.txt'}"
```
First, we download the original opus files into the directory /path_to_data/, and we have

```
/path_to_data/
├── audio
│   ├── audiobook
│   │   ├── P0001
│   │   │   ├── AUD0000000001.opus
│   │   │   ├── AUD0000000002.opus
│   │   │   ├── ..................
│   │   ├── ...
│   ├── podcast
│   │   ├── P0000
│   │   ├── ...
│   └── youtube
│       ├── P0000
│       ├── ...
└── GigaSpeech.json
```
These files take about 450GB. Please make sure your disk has enough space.
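
To spot-check the download, a single opus file can be decoded to a mono WAV with ffmpeg (the file name follows the layout above; the 16 kHz sample rate here is an assumption, not necessarily what the pipeline uses internally):

```bash
# Decode one opus file to 16 kHz mono WAV for quick inspection.
ffmpeg -i /path_to_data/audio/audiobook/P0001/AUD0000000001.opus -ac 1 -ar 16000 sample.wav
```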
The ASR corpus contains raw source audio files and English text files. Here we compute audio features (log-mel filterbank coefficients) and map the transcript tokens to IDs.
This step is extremely time consuming because of the size of the original corpus. Also, the resulting TFRecords take about 1.1TB. Please make sure your disk has enough space.
We can extract audio features and transcripts with

```bash
$ ./examples/speech_transformer/gigaspeech/02-create_training_set.sh /path_to_data subset
```
If you want to train ASR with punctuation preserved, you can add the --keep-punctuation option (see the example below). By default, we remove all punctuation during training, validation, and testing.
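
For example, a hypothetical invocation that preserves punctuation (the flag placement is an assumption):

```bash
$ ./examples/speech_transformer/gigaspeech/02-create_training_set.sh /path_to_data subset --keep-punctuation
```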
By default, it extracts 80-channel log-mel filterbank coefficients, with windows of 25ms and steps of 10ms, using the lightweight Python package python_speech_features.
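
For intuition, here is a minimal sketch of such a feature computation with python_speech_features; the actual extractor inside neurst may differ in details such as normalization:

```python
import soundfile as sf
from python_speech_features import logfbank

# Read a mono waveform (e.g. one decoded from an opus file as shown above).
signal, sample_rate = sf.read("sample.wav")

# 80-channel log-mel filterbank features: 25 ms windows, 10 ms steps.
features = logfbank(signal, samplerate=sample_rate,
                    winlen=0.025, winstep=0.01, nfilt=80)
print(features.shape)  # (num_frames, 80)
```

Then we have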
```
/path_to_data/
├── spm.model
├── spm.vocab
└── asr
    ├── asr_data_prep.yml
    ├── asr_training_args.yml
    └── train
        ├── train.tfrecords-00000-of-01664
        ├── ......
        └── train.tfrecords-01663-of-01664
```
where the directory /path_to_data/asr/train (/path_to_data/asr/devtest) contains the extracted audio features and the corresponding transcriptions in TFRecord format for training (and evaluation).

Furthermore, to examine the elements in the TFRecord files, we can simply run the command line tool view_tfrecord:
```bash
$ python3 -m neurst.cli.view_tfrecord /path_to_data/asr/train/

features {
  feature {
    key: "audio"
    value {
      float_list {
        ......
        value: -0.12725864350795746
        value: -0.30450382828712463
        value: -0.3461953103542328
        value: -0.3424985408782959
        value: -0.5286926031112671
      }
    }
  }
  feature {
    key: "audio_length"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "transcript"
    value {
      int64_list {
        value: 5
        value: 72
        value: 36
        value: 8
        value: 4
        value: 6720
        value: 54
        value: 182
        value: 9
        value: 7560
        value: 45
        value: 476
        value: 10002
      }
    }
  }
}

elements: {
    "audio": float32
    "audio_length": int64
    "transcript": int64
}
```
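
For reference, here is a hypothetical snippet that parses this schema directly with TensorFlow (the feature names follow the view_tfrecord output above; neurst has its own dataset pipeline):

```python
import tensorflow as tf

# Feature spec matching the printed schema: variable-length float audio
# features, a scalar frame count, and variable-length int token IDs.
feature_spec = {
    "audio": tf.io.VarLenFeature(tf.float32),
    "audio_length": tf.io.FixedLenFeature([], tf.int64),
    "transcript": tf.io.VarLenFeature(tf.int64),
}

files = tf.io.gfile.glob("/path_to_data/asr/train/train.tfrecords-*")
for raw in tf.data.TFRecordDataset(files).take(1):
    example = tf.io.parse_single_example(raw, feature_spec)
    audio = tf.sparse.to_dense(example["audio"])
    transcript = tf.sparse.to_dense(example["transcript"])
    print(audio.shape, example["audio_length"].numpy(), transcript.numpy())
```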
This step is similar to step 2. The only difference is that we do not convert transcript tokens to IDs; instead, we keep the raw text for the dev and test sets in order to compute WER.
We can extract audio features and raw text with

```bash
$ ./examples/speech_transformer/gigaspeech/03-create_devtest_set.sh /path_to_data
```
Then we have

```
/path_to_data/
├── spm.model
├── spm.vocab
└── asr
    ├── asr_data_prep.yml
    ├── asr_prediction_args.yml
    ├── asr_training_args.yml
    ├── asr_validation_args.yml
    ├── train
    │   ├── train.tfrecords-00000-of-01664
    │   ├── ......
    │   └── train.tfrecords-01663-of-01664
    └── devtest
        ├── DEV.tfrecords-00000-of-00001
        └── TEST.tfrecords-00000-of-00001
```
To examine the elements in the dev and test TFRecord files, we can simply run the command line tool view_tfrecord:

```bash
$ python3 -m neurst.cli.view_tfrecord /path_to_data/asr/devtest/DEV.tfrecords-00000-of-00001

features {
  feature {
    key: "audio"
    value {
      float_list {
        ......
        value: -0.5336110591888428
        value: -0.6020540595054626
        value: -0.5788766145706177
        value: -0.6893889307975769
      }
    }
  }
  feature {
    key: "src_lang"
    value {
      bytes_list {
        value: "en"
      }
    }
  }
  feature {
    key: "transcript"
    value {
      bytes_list {
        value: "in this week\'s episode, we drive the new ford ranger. we answer questions about e v\'s and driving in cold weather."
      }
    }
  }
  feature {
    key: "uuid"
    value {
      bytes_list {
        value: "724c4c23-a096-4a57-9aa9-be4a1e9af2ce"
      }
    }
  }
}

elements: {
    "transcript": bytes (str)
    "uuid": bytes (str)
    "audio": float32
    "src_lang": bytes (str)
}
```
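
Note that, unlike the training shards, "transcript" here is raw text rather than token IDs. A hypothetical reader for this schema (again, feature names follow the output above, and neurst has its own dataset pipeline):

```python
import tensorflow as tf

# The dev/test schema keeps the raw transcript and metadata as bytes features.
feature_spec = {
    "audio": tf.io.VarLenFeature(tf.float32),
    "transcript": tf.io.FixedLenFeature([], tf.string),
    "uuid": tf.io.FixedLenFeature([], tf.string),
    "src_lang": tf.io.FixedLenFeature([], tf.string),
}

path = "/path_to_data/asr/devtest/DEV.tfrecords-00000-of-00001"
for raw in tf.data.TFRecordDataset(path).take(1):
    ex = tf.io.parse_single_example(raw, feature_spec)
    print(ex["uuid"].numpy().decode(), "->", ex["transcript"].numpy().decode())
```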
The training and evaluation procedures are the same as those of AugmentedLibrispeech. Specifically, if you are training on the XL subset, please use the speech_transformer_l model. Also, please use the default arguments in asr_training_args.yml if possible, especially the batch_size: if you train a large model with a small batch_size, e.g. speech_transformer_l with batch_size=20,000, the training procedure can easily fail to converge.