This is the official benchmark platform accompanying the paper CL-MASR: A Continual Learning Benchmark for Multilingual ASR.
It includes scripts to train Whisper and WavLM-based ASR systems on a subset of 20 languages selected from Common Voice 13 in a continual learning fashion using a handful of methods including rehearsal-based, architecture-based, and regularization-based approaches.
The goal is to continually learn new languages while limiting forgetting of the previously learned ones. An ideal method should achieve both positive forward transfer (i.e. improve performance on new tasks by leveraging shared knowledge from previous tasks) and positive backward transfer (i.e. improve performance on previous tasks by leveraging shared knowledge from new tasks).
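To make the transfer notions concrete, the snippet below computes backward and forward transfer from a matrix of per-language scores, following the standard continual-learning definitions (the benchmark's exact metrics are WER-based and may differ in sign conventions); all numbers and the `ref` baseline here are hypothetical:

```python
# R[i][j] = performance on language j after learning languages 0..i
# (higher is better; hypothetical numbers for a 3-language sequence)
R = [
    [80.0, None, None],
    [75.0, 82.0, None],
    [70.0, 78.0, 85.0],
]
T = len(R)

# Backward transfer: how learning later languages changed earlier ones
# (negative values indicate forgetting)
bwt = sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)

# Forward transfer: performance right after learning each language,
# relative to a per-language reference (hypothetical joint-training scores)
ref = [82.0, 83.0, 85.0]
fwt = sum(R[j][j] - ref[j] for j in range(T)) / T

print(bwt, fwt)  # -7.0 -1.0
```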
The following algorithms have been implemented so far:
- Rehearsal-based
  - Experience Replay (ER)
  - Averaged Gradient Episodic Memory (A-GEM)
  - Dark Experience Replay (DER) (task-incremental variant)
- Architecture-based
  - Progressive Neural Networks (PNN)
  - Piggyback (PB)
  - Learning to Prompt (L2P) (task-aware variant)
- Regularization-based
  - Elastic Weight Consolidation (EWC) (online variant)
  - Learning without Forgetting (LwF) (online variant)
  - Memory Aware Synapses (MAS)
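As a concrete illustration of the rehearsal-based family, the sketch below shows the core idea of experience replay: keep a fixed-size buffer of past utterances, filled via reservoir sampling, and mix a few of them into every batch of the new language. This is a schematic sketch, not the benchmark's implementation; `ReplayBuffer` and the placeholder examples are hypothetical.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer using reservoir sampling over a stream of examples."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # Keep each seen example with probability capacity / seen
            idx = self.rng.randrange(self.seen)
            if idx < self.capacity:
                self.data[idx] = example

    def sample(self, k):
        return self.rng.sample(self.data, min(k, len(self.data)))

# Usage: interleave replayed examples with each batch of the new language
buffer = ReplayBuffer(capacity=100)
for step in range(1000):
    new_example = ("utterance", step)  # placeholder for (audio, transcript)
    batch = [new_example] + buffer.sample(3)
    # ... train on `batch` ...
    buffer.add(new_example)
```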
## ⚡ Dataset [download]
The dataset used for the CL-MASR benchmark is extracted from Common Voice 13 (see reference paper). Each of the 20 languages in the dataset includes approximately 10 hours of training material, with an additional 1 hour for validation and 1 hour for testing.
Download the dataset from here and extract it to a data folder of your choice (CL-MASR by default).
To set up the benchmark, clone the benchmark repository and install SpeechBrain:

```shell
git clone https://github.com/speechbrain/benchmarks.git
cd benchmarks
git submodule update --init --recursive
cd speechbrain
pip install -r requirements.txt
pip install -e .
```

Navigate to `<path-to-repository>/benchmarks/CL_MASR/<model>`, open a terminal and run:
```shell
python train_<cl-method>.py hparams/train_<cl-method>.yaml --data_folder <path-to-data-folder>
```

NOTE: in order to reproduce the experiments with WavLM large, you need to download the checkpoint pretrained on the base languages from here.
NOTE: to profile the model (optional), install ptflops and torchinfo as additional dependencies.
NOTE: multi-GPU training is currently not supported.
Navigate to `<path-to-repository>/benchmarks/CL_MASR`, open a terminal and run:
```shell
python analyze_logs.py <path-to-folder-containing-model-logs>
```

This command will recursively retrieve and analyze all log files named according to the format `<cl-method>_base=<comma-separated-base-locales>_new=<comma-separated-new-locales>.txt` (this is the default naming convention followed in all the training scripts).
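For reference, the naming convention can be matched with a small regex. This is a hypothetical helper written against the format stated above (`parse_log_name` and the locale codes are illustrative, not part of the benchmark):

```python
import re

# Matches <cl-method>_base=<comma-separated-base-locales>_new=<comma-separated-new-locales>.txt
LOG_PATTERN = re.compile(
    r"^(?P<method>.+?)_base=(?P<base>[^_]+)_new=(?P<new>[^.]+)\.txt$"
)

def parse_log_name(filename):
    """Return (cl_method, base_locales, new_locales), or None if no match."""
    match = LOG_PATTERN.match(filename)
    if match is None:
        return None
    return (
        match["method"],
        match["base"].split(","),
        match["new"].split(","),
    )

print(parse_log_name("er_base=en,zh-CN,de_new=rw,eo,kab.txt"))
# ('er', ['en', 'zh-CN', 'de'], ['rw', 'eo', 'kab'])
```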
You can find the resulting performance metric summaries and plots in <path-to-folder-containing-model-logs>.
See the help (`python analyze_logs.py -h`) for advanced configuration options.
NOTE: make sure to specify the `--im_refs` and `--fwt_refs` arguments corresponding to the given model (they default to Whisper large-v2).
NOTE: to plot the results (optional), install matplotlib and/or plotly as additional dependencies.
| Release | Hyperparameters | Average AWER | Average BWT | Average IM | Average FWT | Logs | GPUs |
|---|---|---|---|---|---|---|---|
| 07-06-23 | whisper/hparams/train_ft.yaml | 98.50 | -84.58 | -4.16 | -0.83 | Link | 1xV100 32GB |
| 07-06-23 | whisper/hparams/train_er.yaml | 50.83 | -13.20 | -0.81 | -4.17 | Link | 1xV100 32GB |
| 07-06-23 | whisper/hparams/train_agem.yaml | 81.08 | -55.85 | 0.20 | -5.19 | Link | 1xV100 32GB |
| 01-10-23 | whisper/hparams/train_der.yaml | 67.84 | -41.28 | -4.29 | - | Not available | 1xV100 32GB |
| 07-06-23 | whisper/hparams/train_pnn.yaml | 44.12 | 0.00 | 3.18 | -8.16 | Link | 1xV100 32GB |
| 07-06-23 | whisper/hparams/train_pb.yaml | 43.95 | 0.00 | 3.51 | -8.50 | Link | 1xV100 32GB |
| 01-10-23 | whisper/hparams/train_l2p.yaml | 114.65 | 0.00 | 110.50 | - | Not available | 1xV100 32GB |
| 07-06-23 | whisper/hparams/train_ewc.yaml | 98.04 | -68.32 | 2.87 | -7.85 | Link | 1xV100 32GB |
| 07-06-23 | whisper/hparams/train_lwf.yaml | 95.76 | -77.50 | 0.00 | -4.98 | Link | 1xV100 32GB |
| 01-10-23 | whisper/hparams/train_mas.yaml | 68.08 | -0.58 | 38.62 | - | Not available | 1xV100 32GB |
| 07-06-23 | wavlm/hparams/train_ft.yaml | 91.61 | -54.67 | -10.19 | -0.21 | Link | 1xV100 32GB |
| 07-06-23 | wavlm/hparams/train_er.yaml | 60.79 | -8.96 | -7.62 | -2.77 | Link | 1xV100 32GB |
| 07-06-23 | wavlm/hparams/train_agem.yaml | 72.54 | 13.59 | 35.29 | -45.69 | Link | 1xV100 32GB |
| 01-10-23 | wavlm/hparams/train_der.yaml | 71.22 | -16.64 | -3.21 | - | Not available | 1xV100 32GB |
| 07-06-23 | wavlm/hparams/train_pnn.yaml | 66.07 | 0.00 | 12.95 | -23.34 | Link | 1xV100 32GB |
| 07-06-23 | wavlm/hparams/train_pb.yaml | 61.87 | 0.00 | 2.75 | -13.15 | Link | 1xV100 32GB |
| 01-10-23 | wavlm/hparams/train_l2p.yaml | 92.72 | 0.00 | 52.11 | - | Not available | 1xV100 32GB |
| 07-06-23 | wavlm/hparams/train_ewc.yaml | 86.98 | -39.54 | -4.26 | -6.13 | Link | 1xV100 32GB |
| 07-06-23 | wavlm/hparams/train_lwf.yaml | 87.17 | -26.03 | 10.42 | -20.82 | Link | 1xV100 32GB |
| 01-10-23 | wavlm/hparams/train_mas.yaml | 83.06 | -1.37 | 33.22 | - | Not available | 1xV100 32GB |
Raw experiment logs are available here. We do not include the checkpoints due to storage limits (each experiment with Whisper large-v2 generates ~125 GB of checkpoint data).
Analyses generated via analyze_logs.py are available here.
All the experiments were run on 5 CentOS Linux machines, each with an Intel(R) Xeon(R) Silver 4216 Cascade Lake CPU (32 cores @ 2.10 GHz), 64 GB of RAM, and an NVIDIA Tesla V100 SXM2 with 32 GB of memory (CUDA Toolkit 11.4). With this hardware configuration, approximately 10 days are necessary to complete all the experiments.
If you use the CL-MASR benchmark, please cite:
```bibtex
@article{dellalibera2024clmasr,
    author = {{Della Libera}, Luca and Mousavi, Pooneh and Zaiem, Salah and Subakan, Cem and Ravanelli, Mirco},
    journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    title = {{CL-MASR}: A Continual Learning Benchmark for Multilingual {ASR}},
    year = {2024},
    volume = {32},
    pages = {4931--4944},
    doi = {10.1109/TASLP.2024.3487410}
}
```

If you use SpeechBrain, please cite the reference paper:
```bibtex
@article{ravanelli2024open,
    author = {Mirco Ravanelli and Titouan Parcollet and Adel Moumen and Sylvain de Langen and Cem Subakan and Peter Plantinga and Yingzhi Wang and Pooneh Mousavi and Luca {Della Libera} and Artem Ploujnikov and Francesco Paissan and Davide Borra and Salah Zaiem and Zeyu Zhao and Shucong Zhang and Georgios Karakasidis and Sung-Lin Yeh and Pierre Champion and Aku Rouhe and Rudolf Braun and Florian Mai and Juan Zuluaga-Gomez and Seyed Mahed Mousavi and Andreas Nautsch and Ha Nguyen and Xuechen Liu and Sangeet Sagar and Jarod Duret and Salima Mdhaffar and Ga{{\"e}}lle Laperri{{\`e}}re and Mickael Rouvier and Renato De Mori and Yannick Est{{\`e}}ve},
    title = {Open-Source Conversational {AI} with {SpeechBrain} 1.0},
    journal = {Journal of Machine Learning Research},
    year = {2024},
    volume = {25},
    number = {333},
    pages = {1--11},
    url = {http://jmlr.org/papers/v25/24-0991.html}
}
```

```bibtex
@article{ravanelli2021speechbrain,
    author = {Mirco Ravanelli and Titouan Parcollet and Peter Plantinga and Aku Rouhe and Samuele Cornell and Loren Lugosch and Cem Subakan and Nauman Dawalatabad and Abdelwahab Heba and Jianyuan Zhong and Ju-Chieh Chou and Sung-Lin Yeh and Szu-Wei Fu and Chien-Feng Liao and Elena Rastorgueva and François Grondin and William Aris and Hwidong Na and Yan Gao and Renato De Mori and Yoshua Bengio},
    title = {{SpeechBrain}: A General-Purpose Speech Toolkit},
    journal = {arXiv preprint arXiv:2106.04624},
    year = {2021},
    url = {https://arxiv.org/abs/2106.04624},
}
```