NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

This is an implementation of Microsoft's NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality in PyTorch.

Contributions and pull requests are highly appreciated!

23.02.09: Demo samples (using the first 1800 epochs) are out. (link)

Overview

NaturalSpeech is a VAE-based model that employs several techniques to improve the prior and simplify the posterior. It differs from VITS in several ways, including:

Phoneme pre-training: NaturalSpeech uses a phoneme encoder pre-trained on a large text corpus through masked language modeling on phoneme sequences.
Differentiable durator: The posterior operates at the frame level, while the prior operates at the phoneme level. NaturalSpeech uses a differentiable durator to bridge the length difference, expanding phoneme-level features to the frame level in a soft and flexible way.
Bidirectional prior/posterior: NaturalSpeech reduces the complexity of the posterior and enhances the prior through a normalizing flow, which maps in both directions and is trained with forward and backward losses.
Memory-based VAE: The prior is further enhanced through a memory bank using Q-K-V attention.
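
To make the memory-based VAE item more concrete, here is a minimal, dependency-free sketch of single-head Q-K-V attention over a memory bank. It is an illustration only: the name `memory_attention`, the toy dimensions, and the omission of learned projections and multiple heads are assumptions, not the code in this repository.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def memory_attention(queries, memory):
    """Attend from each latent frame (query) to every memory slot.

    queries: list of T vectors of size d (e.g. latent frames)
    memory:  list of M vectors of size d, used here as both keys and values
             (a real model would apply learned projections first)
    Returns T vectors, each a convex combination of the memory slots.
    """
    dim = len(memory[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between the query and each memory slot.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        outputs.append([
            sum(w * slot[j] for w, slot in zip(weights, memory))
            for j in range(dim)
        ])
    return outputs

# Toy example: 3 memory slots, 2 latent frames, dimension 2.
memory_bank = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[4.0, 0.0], [0.0, 4.0]]
attended = memory_attention(latents, memory_bank)
```

Each output frame is a weighted mix of memory slots, so the decoder input is restricted to what the bank can express.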

Notes

This implementation does not include pre-training of phonemes using a large-scale text corpus from the news-crawl dataset.
The multiplier for each loss can be adjusted in the configuration file. Using losses without a multiplier may not lead to convergence.
The tuning stage for the last 2k epochs has been omitted.
Due to the high VRAM usage of the soft-dtw loss, there is an option to use a non-softdtw loss for memory efficiency.
For the soft-dtw loss, the warp factor has been set to 134.4 (0.07 * 192) to match the non-softdtw loss, instead of 0.07.
To train the duration predictor in the warm-up stage, duration labels are required. The paper suggests using any tool to provide the duration label. In this implementation, a pre-trained VITS model was used.
To further improve memory efficiency during training, randomly sliced sequences are fed to the decoder, as in the VITS model.
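
The random slicing mentioned in the last note can be sketched as follows. This is a simplified illustration on plain Python lists, not the tensor code used in training; `random_segment`, `segment_frames`, and `hop_length` are hypothetical names chosen for the example.

```python
import random

def random_segment(latent_frames, waveform, segment_frames, hop_length, rng=random):
    """Pick one random window of latent frames and the matching audio samples."""
    max_start = max(len(latent_frames) - segment_frames, 0)
    start = rng.randint(0, max_start)  # inclusive on both ends
    end = start + segment_frames
    latent_slice = latent_frames[start:end]
    # Each latent frame corresponds to hop_length waveform samples.
    wave_slice = waveform[start * hop_length:end * hop_length]
    return latent_slice, wave_slice, start

frames = list(range(10))      # 10 latent frames
audio = list(range(10 * 4))   # 4 samples per frame (hop_length=4)
z_slice, y_slice, start = random_segment(
    frames, audio, segment_frames=3, hop_length=4, rng=random.Random(0)
)
```

Only the sliced window passes through the decoder, so the decoder's memory cost depends on the segment length rather than on the full utterance length.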

How to train

# python >= 3.6
pip install -r requirements.txt

clone this repository
download The LJ Speech Dataset: link

create symbolic link to ljspeech dataset:

ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1
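
Before training, it can help to confirm that the link resolves. The sketch below assumes the filelists follow the VITS convention of one `audio_path|text` entry per line, with audio paths starting with `DUMMY1/`; `missing_audio` is a hypothetical helper and not part of this repository.

```python
import os

def missing_audio(lines, exists=os.path.exists, separator="|"):
    """Return the audio paths from filelist lines that do not exist on disk.

    os.path.exists follows symbolic links, so a broken DUMMY1 link makes
    every entry show up as missing.
    """
    missing = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        audio_path = line.split(separator)[0]
        if not exists(audio_path):
            missing.append(audio_path)
    return missing

# Typical use (run from the repository root):
# with open("filelists/ljs_audio_text_train_filelist.txt", encoding="utf-8") as f:
#     print(missing_audio(f)[:5])
```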

text preprocessing (optional, if you are using custom dataset):

apt-get install espeak

python preprocess_texts.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt
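
As a rough illustration of what `--text_index 1` selects, the sketch below treats each filelist line as `|`-separated columns and rewrites only the transcript column. The `|` separator is assumed from the VITS filelist convention, `clean_filelist_lines` is a hypothetical name, and `str.lower` is a stand-in for the real espeak-based cleaner.

```python
def clean_filelist_lines(lines, text_index=1, cleaner=str.lower, separator="|"):
    """Replace the transcript column of each filelist line with its cleaned form."""
    cleaned = []
    for line in lines:
        fields = line.rstrip("\n").split(separator)
        fields[text_index] = cleaner(fields[text_index])
        cleaned.append(separator.join(fields))
    return cleaned

example = ["DUMMY1/LJ001-0001.wav|Printing, in the only sense"]
result = clean_filelist_lines(example)
```

For a custom dataset whose transcript sits in a different column, only `text_index` would need to change.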

duration preprocessing (obtain duration labels using pretrained VITS):

If you want to skip this section, use durations/durations.tar.bz2 and overwrite the durations folder.
1. git clone https://github.com/jaywalnut310/vits.git; cd vits
2. create symbolic link to ljspeech dataset
```
ln -s /path/to/LJSpeech-1.1/wavs/ DUMMY1
```
3. download the pretrained VITS model as described in the official VITS GitHub repository: github link / pretrained models
4. setup monotonic alignment search (for VITS inference):
```
cd monotonic_align; mkdir monotonic_align; python setup.py build_ext --inplace; cd ..
```
5. copy duration preprocessing script to VITS repo: cp /path/to/naturalspeech/preprocess_durations.py .
6. run the duration preprocessing script:
```
python3 preprocess_durations.py --weights_path ./pretrained_ljs.pth --filelists filelists/ljs_audio_text_train_filelist.txt.cleaned filelists/ljs_audio_text_val_filelist.txt.cleaned filelists/ljs_audio_text_test_filelist.txt.cleaned
```
7. once the duration labels are created, copy the labels to the naturalspeech repo: cp -r durations/ /path/to/naturalspeech
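
Conceptually, duration labels fall out of a hard monotonic alignment between phonemes and spectrogram frames: each phoneme's duration is the number of frames assigned to it. The sketch below shows that reduction on a toy alignment; `durations_from_alignment` is a hypothetical helper and does not reproduce the contents of `preprocess_durations.py`.

```python
def durations_from_alignment(alignment):
    """Sum a hard alignment matrix over frames to get per-phoneme durations.

    alignment[p][t] is 1 if frame t is assigned to phoneme p, else 0.
    """
    frame_count = len(alignment[0])
    for t in range(frame_count):
        assigned = sum(row[t] for row in alignment)
        if assigned != 1:
            raise ValueError(f"frame {t} is assigned to {assigned} phonemes")
    return [sum(row) for row in alignment]

# 3 phonemes aligned to 6 frames.
toy_alignment = [
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
]
durations = durations_from_alignment(toy_alignment)  # [2, 1, 3]
```

The durations always sum to the number of frames, which is a quick sanity check on generated labels.
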
train (warmup)
```
python3 train.py -c configs/ljs.json -m [run_name] --warmup
```
Note that ljs.json is for low-resource training, which runs for 1500 epochs and does not use the soft-dtw loss. If you want to reproduce the steps stated in the paper, use ljs_reproduce.json, which runs for 15000 epochs and uses the soft-dtw loss.
initialize and attach memory bank after warmup:
```
  python3 attach_memory_bank.py -c configs/ljs.json --weights_path logs/[run_name]/G_xxx.pth
```
If you run out of memory, you can specify the --num_samples argument to use only a subset of samples.
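
The `G_xxx.pth` placeholder stands for a generator checkpoint saved during warmup. Assuming, as in VITS, that `xxx` is a step number, a small helper can pick the most recent one from a directory listing. Both the naming pattern and the name `latest_generator_checkpoint` are assumptions made for this sketch.

```python
import re

def latest_generator_checkpoint(filenames):
    """Return the G_<step>.pth name with the largest step, or None if absent."""
    best_name, best_step = None, -1
    for name in filenames:
        match = re.fullmatch(r"G_(\d+)\.pth", name)
        if match and int(match.group(1)) > best_step:
            best_name, best_step = name, int(match.group(1))
    return best_name

# Typical use:
# import os
# print(latest_generator_checkpoint(os.listdir("logs/[run_name]")))
```

Comparing the step as an integer avoids the lexicographic trap where `G_900.pth` would sort after `G_2000.pth`.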

train (resume)

  python3 train.py -c configs/ljs.json -m [run_name]

You can use tensorboard to monitor the training.

tensorboard --logdir /path/to/naturalspeech/logs

During each evaluation phase, a selection of samples from the test set is evaluated and saved in the logs/[run_name]/eval directory.

References

VITS implementation by @jaywalnut310 for the normalizing flows, phoneme encoder, and HiFi-GAN decoder
Parallel Tacotron 2 implementation by @keonlee9420 for the learnable upsampling layer
soft-dtw implementation by @Maghoumi for the sdtw loss

heatz123/naturalspeech
