This repository provides the official PyTorch implementation of Wavehax, an alias-free neural vocoder that combines 2D convolutions with harmonic priors for high-fidelity and robust complex spectrogram estimation.
To set up the environment, run:
$ cd wavehax
$ pip install -e .
This installs Wavehax and its dependencies in editable mode.
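To verify the setup, you can check that the package imports cleanly (assuming the package is exposed as wavehax, matching the source directory name):
$ python -c "import wavehax"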
- egs: This directory contains project-specific examples and configurations.
- egs/jvs: An example project using the Japanese Versatile Speech (JVS) Corpus, with speaker- and style-wise fundamental frequency (F0) ranges available at JVS Corpus F0 Range.
- wavehax: The main source code for Wavehax.
This repository uses Hydra for managing hyperparameters. Hydra provides an easy way to dynamically create a hierarchical configuration by composition and override it through config files and the command line.
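For example, any value defined in the YAML files under wavehax/bin/config can be overridden directly on the command line. The sketch below reuses keys from the training command shown later in this README; whether a config group has a default (and can be omitted) depends on the YAML files:
# Override the maximum number of training steps and the output directory.
$ wavehax-train generator=wavehax discriminator=univnet train=wavehax data=jvs train.train_max_steps=100000 out_dir=exp/debug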
Prepare your dataset by creating .scp files that define the path to each audio file (e.g., egs/jvs/data/scp/train_no_dev.scp).
During the preprocessing step, list files for the extracted features are generated automatically (e.g., egs/jvs/data/list/train_no_dev.list).
Ensure that separate .scp and .list files are available for the training, validation, and evaluation datasets (see the sketch below).
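As a sketch, each .scp file is a plain-text list with one audio path per line; the paths below are placeholders:
/path/to/corpus/speaker1/utt001.wav
/path/to/corpus/speaker1/utt002.wav
/path/to/corpus/speaker2/utt001.wav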
To extract acoustic features and prepare statistics:
# Move to the project directory.
$ cd egs/jvs
# Extract acoustic features such as F0 and mel-spectrograms. To customize hyperparameters, edit wavehax/bin/config/extract_features.yaml, or override them from the command line.
$ wavehax-extract-features audio_scp=data/scp/all.scp
# Compute statistics of the training data. You can adjust hyperparameters in wavehax/bin/config/compute_statistics.yaml.
$ wavehax-compute-statistics filepath_list=data/scp/train_no_dev.list save_path=data/stats/train_no_dev.joblib
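If you want to inspect the computed statistics, the saved file can be loaded with joblib. This is a minimal sketch; the exact structure of the stored object depends on the implementation:
import joblib

# Load the statistics saved by wavehax-compute-statistics.
stats = joblib.load("data/stats/train_no_dev.joblib")

# Inspect what was stored (e.g., a scaler object or a dict of per-feature statistics).
print(type(stats))
print(stats)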
To train the vocoder model:
# Start training. You can adjust hyperparameters in wavehax/bin/config/train.yaml. In the paper, the model was trained for 1000K steps to match the other models, but Wavehax achieves comparable quality with fewer training steps.
$ wavehax-train generator=wavehax discriminator=univnet train=wavehax train.train_max_steps=500000 data=jvs out_dir=exp/wavehax
To generate speech waveforms using the trained model:
# Perform inference using the trained model. You can adjust hyperparameters in wavehax/bin/config/decode.yaml.
$ wavehax-decode generator=wavehax data=jvs out_dir=exp/wavehax ckpt_steps=500000
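To sanity-check a generated waveform, you can load it with the soundfile library. This is a hedged sketch; the actual file layout under exp/wavehax depends on the decoding configuration:
import soundfile as sf

# Path to one generated waveform (hypothetical; check exp/wavehax for the actual layout).
audio, sample_rate = sf.read("exp/wavehax/wav/sample_generated.wav")
print(audio.shape, sample_rate)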
You can monitor the training process using TensorBoard:
$ tensorboard --logdir exp
We plan to release models trained on several datasets.