About • Installation • How To Use • Final results • Credits • License
## About

This repository contains an end-to-end pipeline for solving the ASR task with PyTorch. The implemented model is Deep Speech 2.
See the task assignment here.
See the wandb report with all experiments.
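For orientation, below is a minimal sketch of the Deep Speech 2 architecture family: a convolutional frontend over spectrograms, a stack of bidirectional recurrent layers, and a linear CTC head. This is an illustration only; layer counts and sizes are placeholders, and the implementation in this repository may differ in details.

```python
import torch
from torch import nn


class DeepSpeech2Sketch(nn.Module):
    """Schematic Deep Speech 2: conv frontend -> BiGRU stack -> CTC logits."""

    def __init__(self, n_feats: int = 128, n_tokens: int = 28, hidden: int = 512):
        super().__init__()
        # 2D convolutions over (freq, time) downsample the input spectrogram
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 41), stride=(2, 2), padding=(5, 20)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),  # clipped ReLU, as in the DS2 paper
            nn.Conv2d(32, 32, kernel_size=(11, 21), stride=(1, 2), padding=(5, 10)),
            nn.BatchNorm2d(32),
            nn.Hardtanh(0, 20, inplace=True),
        )
        conv_out = 32 * (n_feats // 2)  # channels * downsampled freq bins
        self.rnn = nn.GRU(conv_out, hidden, num_layers=5,
                          bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, n_tokens)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, n_feats, time)
        x = self.conv(spectrogram.unsqueeze(1))         # (B, C, F', T')
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (B, T', C*F')
        x, _ = self.rnn(x)
        # per-frame token logits; apply log_softmax before CTC loss / decoding
        return self.head(x)                             # (B, T', n_tokens)
```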
## Installation

Follow these steps to install the project:
- (Optional) Create and activate a new environment using `conda`:

  ```bash
  # create env
  conda create -n ASR python=3.10

  # activate env
  conda activate ASR
  ```
- Install all required packages:

  ```bash
  pip install -r requirements.txt
  ```
- Download the model checkpoint, vocab, and language model:

  ```bash
  python download_weights.py
  ```
## How To Use

- If you only want to decode audio to text, your directory with audio should have the following format:

  ```
  NameOfTheDirectoryWithUtterances
  └── audio
      ├── UtteranceID1.wav # may be flac or mp3
      ├── UtteranceID2.wav
      .
      .
      .
      └── UtteranceIDn.wav
  ```

  Run the following command:

  ```bash
  python inference.py datasets=inference_custom inferencer.save_path=SAVE_PATH datasets.test.audio_dir=TEST_DATA/audio
  ```

  where `SAVE_PATH` is a path to save the predicted text and `TEST_DATA` is the directory with audio.

- If you have ground-truth text and want to evaluate the model, make sure that the directory with audio and ground-truth text has the following format:
  ```
  NameOfTheDirectoryWithUtterances
  ├── audio
  │   ├── UtteranceID1.wav # may be flac or mp3
  │   ├── UtteranceID2.wav
  │   .
  │   .
  │   .
  │   └── UtteranceIDn.wav
  └── transcriptions
      ├── UtteranceID1.txt
      ├── UtteranceID2.txt
      .
      .
      .
      └── UtteranceIDn.txt
  ```

  Then run the following command:

  ```bash
  python inference.py datasets=inference_custom inferencer.save_path=SAVE_PATH datasets.test.audio_dir=TEST_DATA/audio datasets.test.transcription_dir=TEST_DATA/transcriptions
  ```
- If you only have predicted and ground-truth texts and only want to evaluate the model, make sure that the directory with them has the following format:

  ```
  NameOfTheDirectoryWithUtterances
  ├── ID1.json
  .
  .
  .
  └── IDn.json
  ```

  where each `IDi.json` contains the predicted and ground-truth text, e.g. `ID1.json`:

  ```json
  {"pred_text": "ye are newcomers", "text": "YE ARE NEWCOMERS"}
  ```

  Then run the following command:
  ```bash
  python calculate_wer_cer.py --dir_path=DIR
  ```

  where `DIR` is the directory described above (a sketch of the WER/CER definitions follows this list).
- Finally, if you want to reproduce the results from here, run the following command:

  ```bash
  python inference.py dataloader.batch_size=500 inferencer.save_path=SAVE_PATH datasets.test.part="test-other"
  ```

  Feel free to choose which metrics you want to evaluate (see this config).
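As a reference for what is being measured: WER (word error rate) and CER (character error rate) are edit (Levenshtein) distances normalized by the reference length, computed over words and characters respectively. A minimal per-utterance sketch of these metrics (the actual script may differ, e.g. in text normalization and in how it aggregates over a directory):

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences (rolling-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,         # deletion
                dp[j - 1] + 1,     # insertion
                prev + (r != h),   # substitution (or match if tokens equal)
            )
    return dp[-1]


def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)


def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)


print(wer("ye are newcomers", "ye are newcomer"))  # 0.33: 1 error / 3 words
```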
Model training consists of 3 stages. To reproduce the results, train the model using the following commands:
- Train 47 epochs without augmentations:

  ```bash
  python train.py writer.run_name="part1" dataloader.batch_size=230 transforms=example_only_instance trainer.early_stop=47
  ```
- Train 103 epochs with augmentations (a sketch of typical spectrogram augmentations follows this list):

  ```bash
  python train.py writer.run_name="part2" dataloader.batch_size=230 trainer.resume_from=part1/model_best.pth datasets.val.part=test-other
  ```
- Train 15 epochs from a new optimizer state:

  ```bash
  python train.py -cn=part3 writer.run_name="part3" dataloader.batch_size=230 datasets.val.part=test-other
  ```
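The exact set of augmentations used in stage 2 is defined by the transforms configs. As a purely hypothetical illustration of what spectrogram augmentation typically looks like for ASR, here is SpecAugment-style masking with torchaudio (the parameters are placeholders, not this repository's values):

```python
import torch
import torchaudio.transforms as T

# SpecAugment-style masking on a log-mel spectrogram
augment = torch.nn.Sequential(
    T.FrequencyMasking(freq_mask_param=27),  # zero out up to 27 mel bins
    T.TimeMasking(time_mask_param=100),      # zero out up to 100 time frames
)

spec = torch.rand(1, 128, 400)  # (batch, n_mels, time)
augmented = augment(spec)
```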
It takes around 57 hours to train the model from scratch on an A100 GPU.
## Final results

These results were obtained using beam search and a language model:

| Part       | WER   | CER  |
|------------|-------|------|
| test-other | 16.96 | 9.43 |
| test-clean | 6.34  | 2.58 |
You can see that using a language model yields a very significant quality boost:

| Part       | WER (w/o LM) | CER (w/o LM) |
|------------|--------------|--------------|
| test-other | 25.35        | 9.73         |
Finally, beam search also contributes to the quality improvement:

| Part       | WER (w/o beam search) | CER (w/o beam search) |
|------------|-----------------------|-----------------------|
| test-other | 25.80                 | 9.91                  |
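For intuition about these numbers: greedy decoding simply takes the argmax token at each frame, while beam search keeps several partial hypotheses and, with shallow fusion, rescores them with an external n-gram language model (LM weight `alpha`, word-insertion bonus `beta`). Below is a hypothetical sketch using `pyctcdecode` with a KenLM model; this repository's decoder may be implemented differently, and all paths, labels, and weights here are placeholders:

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# hypothetical character vocabulary; "" is the CTC blank, and the order
# must match the columns of the acoustic model's output
labels = [""] + list(" abcdefghijklmnopqrstuvwxyz'")

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm.arpa",  # placeholder path to an n-gram LM
    alpha=0.5,                   # LM weight for shallow fusion
    beta=1.0,                    # word insertion bonus
)

# stand-in for the acoustic model's output: (time, vocab) log-probabilities
log_probs = np.log(np.full((200, len(labels)), 1.0 / len(labels), dtype=np.float32))
text = decoder.decode(log_probs, beam_width=100)
```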
## Credits

This repository is based on a PyTorch Project Template.