Myrtle Deep Speech

A PyTorch implementation of DeepSpeech and DeepSpeech2.

This repository is intended as an evolving baseline for other implementations to compare their training performance against.

Current roadmap:

  1. Pre-trained weights for both networks and full performance statistics.
  2. Mixed-precision training.

Running

Build the Docker image:

make build

Run the Docker container (here using nvidia-docker), making sure to publish the JupyterLab session's port to the host:

sudo docker run --runtime=nvidia --shm-size 512M -p 9999:9999 deepspeech

The JupyterLab session can be accessed via localhost:9999.

The deepspeech Python package is available in the running Docker container and can be used either through the command-line interface:

deepspeech --help

or as a Python package:

import deepspeech
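
A quick sanity check from inside the container, assuming only that the package is importable (deepspeech.__file__ is a standard Python module attribute, not part of this package's API):

import deepspeech
# Print the location of the installed package inside the container.
print(deepspeech.__file__)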

Examples

deepspeech --help will print the configurable parameters (batch size, learning rate, log location, number of epochs, ...); the defaults aim to be reasonably sensible.

Training

A Deep Speech training run can be started by the following command, adding flags as necessary:

deepspeech ds1

By default the experimental data and logs are output to /tmp/experiments/year_month_date-hour_minute_second_microsecond.

Inference

A Deep Speech evaluation run can be started by the following command, adding flags as necessary:

deepspeech ds1 \
           --state_dict_path $MODEL_PATH \
           --log_file \
           --decoder greedy \
           --train_subsets \
           --dev_log wer \
           --dev_subsets dev-clean \
           --dev_batch_size 1

Note that passing --log_file without an argument causes the WER results to be written to stderr.

Dataset

The package contains code to download and use the LibriSpeech ASR corpus.

WER

The word error rate (WER) is computed using the formula that is widely used in many open-source speech-to-text systems (Kaldi, PaddlePaddle, Mozilla DeepSpeech). In pseudocode, where N is the number of validation or test samples:

sum_edits = sum([edit_distance(target, predict)   # word-level edit distance
                 for target, predict in zip(targets, predictions)])
sum_lens = sum([len(target) for target in targets])   # target lengths in words
WER = (1.0/N) * (sum_edits / sum_lens)
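
For reference, here is a self-contained Python sketch of this computation. The edit_distance below is a standard word-level Levenshtein distance; the function names and structure are illustrative only and are not the package's internal API:

def edit_distance(target, prediction):
    """Word-level Levenshtein distance between two transcripts."""
    t, p = target.split(), prediction.split()
    # dist[j] holds the distance between the first i target words and the
    # first j prediction words; it is updated one target word at a time.
    dist = list(range(len(p) + 1))
    for i in range(1, len(t) + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, len(p) + 1):
            substitution = prev + (0 if t[i - 1] == p[j - 1] else 1)
            prev, dist[j] = dist[j], min(dist[j] + 1,      # deletion
                                         dist[j - 1] + 1,  # insertion
                                         substitution)
    return dist[-1]

def wer(targets, predictions):
    """Corpus-level WER as defined above; N is the number of samples."""
    N = len(targets)
    sum_edits = sum(edit_distance(target, predict)
                    for target, predict in zip(targets, predictions))
    sum_lens = sum(len(target.split()) for target in targets)
    return (1.0 / N) * (sum_edits / sum_lens)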

This reduces the impact on the WER of errors in short sentences. Toy example:

Target                          Prediction                     Edit Distance   Label Length
lectures                        lectured                       1               1
i'm afraid he said              i am afraid he said            2               4
nice to see you mister meeking  nice to see your mister makin  2               6

The mean WER of each sample considered individually is:

>>> (1.0/3) * ((1.0/1) + (2.0/4) + (2.0/6))
0.611111111111111

Compared to the pseudocode version given above:

>>> (1.0/3) * ((1.0 + 2 + 2) / (1.0 + 4 + 6))
0.1515151515151515
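
Running the illustrative wer sketch from above on the toy example reproduces the corpus-level figure (rounded here for readability):

>>> targets = ["lectures", "i'm afraid he said", "nice to see you mister meeking"]
>>> predictions = ["lectured", "i am afraid he said", "nice to see your mister makin"]
>>> round(wer(targets, predictions), 4)
0.1515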

Maintainer

Please contact sam at myrtle dot ai.