This is the code repository for the neural speech codec presented in the EMNLP 2024 paper ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers [paper].
- Our neural speech codec ESC, at only 30MB, efficiently compresses 16 kHz speech at bitrates of 1.5, 3, 4.5, 6, 7.5, and 9 kbps while maintaining reconstruction quality comparable to Descript's audio codec (DAC).
- We provide pretrained model checkpoints [download] for different ESC variants and DAC models, as well as a demo webpage [link] including multilingual speech samples.
conda create -n esc python=3.8
conda activate esc
pip install -r requirements.txt
python -m scripts.compress --input /path/to/input.wav --save_path /path/to/output --model_path /path/to/model --num_streams 6 --device cpu
This will create `.pth` (code) and `.wav` (reconstructed audio) files under the specified `save_path`. Our codec supports `num_streams` from 1 to 6, corresponding to bitrates from 1.5 to 9.0 kbps. For programmatic usage, you can compress audio tensors using `torchaudio` as follows:
import torchaudio, torch
from esc import ESC
model = ESC(**config)  # config: dict of model hyperparameters matching the checkpoint
model.load_state_dict(torch.load("model.pth", map_location="cpu"))
x, _ = torchaudio.load("input.wav")
# Enc. (@ num_streams*1.5 kbps)
codes, f_shape = model.encode(x, num_streams=6)
# Dec.
recon_x = model.decode(codes, f_shape)
For more details, see the `example.ipynb` notebook.
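The relationship between `num_streams` and bitrate described above can be sketched in a few lines. This is a minimal illustration, assuming 1.5 kbps per stream (as the `num_streams*1.5 kbps` comment indicates) and, for the compression ratio, 16-bit PCM input at 16 kHz. The helper names `bitrate_kbps`, `streams_for_budget`, and `compression_ratio` are hypothetical and are not part of the `esc` package.

```python
# Hypothetical helpers; not part of the esc package.
KBPS_PER_STREAM = 1.5  # each stream adds 1.5 kbps, per the text above
MAX_STREAMS = 6        # num_streams ranges from 1 to 6


def bitrate_kbps(num_streams: int) -> float:
    """Bitrate in kbps for a given number of streams."""
    if not 1 <= num_streams <= MAX_STREAMS:
        raise ValueError(f"num_streams must be in 1..{MAX_STREAMS}")
    return num_streams * KBPS_PER_STREAM


def streams_for_budget(max_kbps: float) -> int:
    """Largest num_streams whose bitrate does not exceed max_kbps."""
    n = int(max_kbps // KBPS_PER_STREAM)
    if n < 1:
        raise ValueError("budget is below the lowest supported bitrate")
    return min(n, MAX_STREAMS)


def compression_ratio(num_streams: int, sample_rate: int = 16000,
                      pcm_bits: int = 16) -> float:
    """Ratio of raw PCM bitrate to codec bitrate (assumes 16-bit mono PCM)."""
    raw_kbps = sample_rate * pcm_bits / 1000
    return raw_kbps / bitrate_kbps(num_streams)


for n in range(1, MAX_STREAMS + 1):
    print(n, bitrate_kbps(n), round(compression_ratio(n), 1))
```

For example, `streams_for_budget(5.0)` selects 3 streams (4.5 kbps), which can then be passed as `num_streams` to `model.encode`.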
We provide development training and evaluation datasets on Hugging Face. For custom training, set `train_data_path` in `exp.yaml` to the parent directory containing `.wav` audio segments. Run the following to start training:
WANDB_API_KEY=your_API_key accelerate launch main.py --exp_name esc9kbps --config_path ./configs/9kbps_esc_base.yaml --wandb_project efficient-speech-codec --lr 1.0e-4 --num_epochs 80 --num_pretraining_epochs 15 --num_devices 4 --dropout_rate 0.75 --save_path /path/to/output --seed 53
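Before launching a custom training run, it can help to confirm that the directory given as `train_data_path` actually contains usable `.wav` segments. The sketch below is one way to do that with the Python standard library only. It assumes PCM-encoded WAV files and a 16 kHz target rate, and it searches subdirectories recursively, which may or may not match how the training loader scans the folder. The function name `summarize_wavs` is hypothetical.

```python
import wave
from pathlib import Path


def summarize_wavs(parent_dir, expected_rate=16000):
    """Count .wav files under parent_dir and flag unexpected sample rates.

    Assumes PCM WAV files (the only kind the stdlib wave module reads).
    """
    paths = sorted(Path(parent_dir).rglob("*.wav"))
    total_seconds = 0.0
    mismatched = []
    for path in paths:
        with wave.open(str(path), "rb") as f:
            rate = f.getframerate()
            total_seconds += f.getnframes() / rate
            if rate != expected_rate:
                mismatched.append(path)
    return {
        "files": len(paths),
        "hours": total_seconds / 3600,
        "mismatched": mismatched,
    }
```

Calling `summarize_wavs("/path/to/train_data")` returns the file count, the total duration in hours, and the list of files whose sample rate differs from 16 kHz.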
We use the `accelerate` library to handle distributed training and the `wandb` library for monitoring. To enable adversarial training with the same discriminator as DAC, include the `--adv_training` flag.
Training a base ESC model on 4 RTX 4090 GPUs takes ~12 hours for 250k steps on 3-second speech clips with a batch size of 36. Detailed experiment configurations can be found in the `configs/` folder. For the complete experiments presented in the paper, refer to `scripts_all.sh`.
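To put the training figures above in perspective, the following back-of-the-envelope calculation derives throughput from them. It assumes the batch size of 36 is the total across all GPUs; if it is per device, the clip and audio totals scale by the number of devices. The variable names are illustrative only.

```python
# Figures quoted in the text above.
steps = 250_000
batch_size = 36      # ASSUMPTION: total batch size across all GPUs
clip_seconds = 3
wall_hours = 12

steps_per_second = steps / (wall_hours * 3600)
clips_seen = steps * batch_size
audio_hours_seen = clips_seen * clip_seconds / 3600
realtime_factor = audio_hours_seen / wall_hours

print(f"{steps_per_second:.2f} steps/s")               # about 5.79
print(f"{clips_seen:,} clips seen")                    # 9,000,000
print(f"{audio_hours_seen:,.0f} hours of audio seen")  # 7,500
print(f"{realtime_factor:.0f}x real time")             # 625
```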
CUDA_VISIBLE_DEVICES=0 python -m scripts.test --eval_folder_path path/to/data --batch_size 12 --model_path /path/to/model --device cuda
This will run codec evaluation across all available bandwidths on the specified test set folder. We report four metrics: `PESQ`, `Mel-Distance`, `SI-SDR`, and `Bitrate-Utilization-Rate`. Evaluation statistics will be saved under `model_path` by default.
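As a reference for one of the metrics listed above, here is a minimal sketch of SI-SDR (scale-invariant signal-to-distortion ratio) following its standard definition: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the residual. This is not the evaluation code used by `scripts.test`, which may differ in details such as mean removal or batching. The function name `si_sdr` and the `eps` stabilizer are choices made for this example.

```python
import math


def si_sdr(reference, estimate, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length sample sequences."""
    if len(reference) != len(estimate):
        raise ValueError("reference and estimate must have the same length")
    # Optimal scaling of the reference that best explains the estimate.
    dot = sum(r * e for r, e in zip(reference, estimate))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)
    # Split the estimate into a target part and a residual (distortion) part.
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10 * math.log10((target_energy + eps) / (residual_energy + eps))


ref = [1.0, -1.0, 1.0, -1.0]
est = [1.1, -0.9, 1.1, -0.9]
print(round(si_sdr(ref, est), 2))
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is what distinguishes SI-SDR from plain SDR.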
You can download the pretrained model checkpoints below:
Codec | Checkpoint | #Param. |
---|---|---|
ESC-Base | Download | 8.39M |
ESC-Base(adv) | Download | 8.39M |
ESC-Large | Download | 15.58M |
DAC-Tiny(adv) | Download | 8.17M |
DAC-Tiny | Download | 8.17M |
DAC-Base(adv) | Download | 74.31M |
We provide a comprehensive performance comparison of ESC with Descript's audio codec (DAC) at different model sizes, with and without adversarial training.
If you find our work useful or relevant to your research, please kindly cite our paper:
@article{gu2024esc,
title={ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers},
author={Gu, Yuzhe and Diao, Enmao},
journal={arXiv preprint arXiv:2404.19441},
year={2024}
}