
Automatic Speech Recognition (ASR)

About • Installation • How To Use • Final results • Credits • License

About

This repository contains an end-to-end pipeline for solving the ASR task with PyTorch. The implemented model is Deep Speech 2.

See the task assignment here.

See the wandb report with all experiments.
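
For orientation, here is a minimal PyTorch sketch of the Deep Speech 2 architecture: a 2D-convolutional frontend over the spectrogram, a stack of bidirectional GRUs, and a linear layer producing per-frame token logits for CTC. Layer sizes and kernel shapes below are illustrative assumptions, not the exact configuration used in this repository.

    import torch.nn as nn

    class DeepSpeech2Sketch(nn.Module):
        """Illustrative Deep Speech 2: conv frontend -> BiGRU stack -> CTC head."""

        def __init__(self, n_feats=128, n_tokens=28, hidden=512, n_gru=5):
            super().__init__()
            # 2D convolutions over (freq, time); strides reduce resolution
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), padding=(20, 5)),
                nn.BatchNorm2d(32),
                nn.Hardtanh(0, 20, inplace=True),
                nn.Conv2d(32, 32, kernel_size=(21, 11), stride=(2, 1), padding=(10, 5)),
                nn.BatchNorm2d(32),
                nn.Hardtanh(0, 20, inplace=True),
            )
            rnn_input = 32 * (n_feats // 4)  # channels * reduced frequency bins
            self.rnn = nn.GRU(rnn_input, hidden, num_layers=n_gru,
                              bidirectional=True, batch_first=True)
            self.fc = nn.Linear(2 * hidden, n_tokens)  # per-frame token logits

        def forward(self, spectrogram):  # (batch, n_feats, time)
            x = self.conv(spectrogram.unsqueeze(1))  # (batch, 32, freq', time')
            b, c, f, t = x.shape
            x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # (batch, time', features)
            x, _ = self.rnn(x)
            return self.fc(x)  # logits; apply log_softmax before nn.CTCLoss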

Installation

Follow these steps to install the project:

  1. (Optional) Create and activate a new environment using conda.

    # create env
    conda create -n ASR python=3.10
    
    # activate env
    conda activate ASR
  2. Install all required packages.

    pip install -r requirements.txt
  3. Download the model checkpoint, vocabulary, and language model.

    python download_weights.py
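
To sanity-check the setup, you can optionally verify that PyTorch imports and sees the GPU:

    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"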

How To Use

Inference

  1. If you only want to decode audio to text, your directory with audio should have the following format:
    NameOfTheDirectoryWithUtterances
    └── audio
         ├── UtteranceID1.wav # may be flac or mp3
         ├── UtteranceID2.wav
         .
         .
         .
         └── UtteranceIDn.wav
    
    Run the following command:
    python inference.py datasets=inference_custom inferencer.save_path=SAVE_PATH datasets.test.audio_dir=TEST_DATA/audio
    where SAVE_PATH is the path where the predicted text will be saved and TEST_DATA is the directory with audio.
  2. If you have ground-truth text and want to evaluate the model, make sure that the directory with audio and ground-truth text has the following format:
    NameOfTheDirectoryWithUtterances
    ├── audio
    │   ├── UtteranceID1.wav # may be flac or mp3
    │   ├── UtteranceID2.wav
    │   .
    │   .
    │   .
    │   └── UtteranceIDn.wav
    └── transcriptions
        ├── UtteranceID1.txt
        ├── UtteranceID2.txt
        .
        .
        .
        └── UtteranceIDn.txt
    
    Then run the following command:
    python inference.py datasets=inference_custom inferencer.save_path=SAVE_PATH datasets.test.audio_dir=TEST_DATA/audio datasets.test.transcription_dir=TEST_DATA/transcriptions
  3. If you only have predicted and ground-truth texts and only want to evaluate the model, make sure that the directory containing them has the following format:
    NameOfTheDirectoryWithUtterances
     ├── ID1.json
     .
     .
     .
     └── IDn.json
    
    where each JSON file contains the predicted and ground-truth text, e.g.:
    ID1.json = {"pred_text": "ye are newcomers", "text": "YE ARE NEWCOMERS"}
    
    Then run the following command (a minimal sketch of this computation is shown after this list):
    python calculate_wer_cer.py --dir_path=DIR
  4. Finally, if you want to reproduce the results reported below, run the following command:
    python inference.py dataloader.batch_size=500 inferencer.save_path=SAVE_PATH datasets.test.part="test-other"
    Feel free to choose which metrics to evaluate (see this config).
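
For reference, the step-3 evaluation boils down to averaging edit distances over the JSON files. Below is a minimal, self-contained Python sketch of that computation, reporting percentages to match the tables below; the actual calculate_wer_cer.py may be implemented differently.

    import json
    from pathlib import Path

    def levenshtein(ref, hyp):
        """Edit distance between two sequences (of words or characters)."""
        prev = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            cur = [i]
            for j, h in enumerate(hyp, 1):
                cur.append(min(prev[j] + 1,              # deletion
                               cur[j - 1] + 1,           # insertion
                               prev[j - 1] + (r != h)))  # substitution
            prev = cur
        return prev[-1]

    def evaluate(dir_path):
        """Average WER/CER over {"pred_text", "text"} JSON files in a directory."""
        wers, cers = [], []
        for path in Path(dir_path).glob("*.json"):
            item = json.loads(path.read_text())
            # normalize case, since references may be upper-cased
            ref, hyp = item["text"].lower(), item["pred_text"].lower()
            wers.append(levenshtein(ref.split(), hyp.split()) / max(len(ref.split()), 1))
            cers.append(levenshtein(ref, hyp) / max(len(ref), 1))
        print(f"WER: {100 * sum(wers) / len(wers):.2f}, "
              f"CER: {100 * sum(cers) / len(cers):.2f}")

    evaluate("DIR")  # same directory as --dir_path above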

Training

Model training consists of 3 stages. To reproduce the results, train the model using the following commands:

  1. Train 47 epochs without augmentations

    python train.py writer.run_name="part1" dataloader.batch_size=230 transforms=example_only_instance trainer.early_stop=47
  2. Train 103 epochs with augmentations

    python train.py writer.run_name="part2" dataloader.batch_size=230 trainer.resume_from=part1/model_best.pth datasets.val.part=test-other
  3. Train 15 epochs starting from a new optimizer state

    python train.py -cn=part3 writer.run_name="part3" dataloader.batch_size=230 datasets.val.part=test-other
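
The three stages can also be chained in one shell script; this is just the commands above, run back to back:

    #!/usr/bin/env bash
    set -e  # stop if any stage fails

    python train.py writer.run_name="part1" dataloader.batch_size=230 transforms=example_only_instance trainer.early_stop=47
    python train.py writer.run_name="part2" dataloader.batch_size=230 trainer.resume_from=part1/model_best.pth datasets.val.part=test-other
    python train.py -cn=part3 writer.run_name="part3" dataloader.batch_size=230 datasets.val.part=test-other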

It takes around 57 hours to train the model from scratch on an A100 GPU.

Final results

These results were obtained using beam search and a language model:

                WER      CER
test-other     16.96     9.43
test-clean      6.34     2.58

You can see that using the language model yields a significant quality boost:

             WER (w/o LM)    CER (w/o LM)
test-other      25.35            9.73

Finally, beam search also contributes to quality improvement:

             WER (w/o beam search)    CER (w/o beam search)
test-other          25.80                    9.91
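
For reference, both metrics follow the standard edit-distance definition: the minimum number of substitutions S, deletions D, and insertions I needed to turn the prediction into the reference, normalized by the reference length N (in words for WER, in characters for CER):

    WER = (S + D + I) / N_words
    CER = (S + D + I) / N_chars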

Credits

This repository is based on a PyTorch Project Template.

License
