FastPitch: Parallel Text-to-speech with Pitch Prediction

Abstract

We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference, and generates speech that could be further controlled with predicted contours. FastPitch can thus change the perceived emotional state of the speaker or put emphasis on certain lexical units. We find that uniformly increasing or decreasing the pitch with FastPitch generates speech that resembles the voluntary modulation of voice. Conditioning on frequency contours improves the quality of synthesized speech, making it comparable to state-of-the-art. It does not introduce an overhead, and FastPitch retains the favorable, fully-parallel Transformer architecture of FastSpeech with a similar speed of mel-scale spectrogram synthesis, orders of magnitude faster than real-time.

source code (published May 18th, 2020)

article (published June 15th, 2020)

Interactive Pitch Manipulation

FastPitch learns to model the voice according to the pitch contour. The predicted contour may be adjusted, automatically or manually, as shown in the video below. The interface is used to adjust the predicted pitch vector. A single FastPitch model is used, with no additional post-processing.

Audio Samples

The samples come from our development subset of the LJSpeech-1.1 dataset. In all cases we use the same WaveGlow vocoder. A single FastPitch model has been used. We present examples of automatic pitch transformations applied during synthesis.
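The automatic transformations named in the sample labels below (amplified, inverted, flattened, shifted by -50 Hz or +50 Hz) can be illustrated with a minimal sketch. It is not the FastPitch implementation: the function name `transform_pitch`, the representation of the contour as a list of per-frame values in Hz, the convention that 0.0 marks an unvoiced frame, and the choice to invert and amplify around the utterance mean are all assumptions made for illustration.

```python
from statistics import fmean

MODES = ("shift", "flatten", "invert", "amplify")


def transform_pitch(pitch, mode, amount=0.0):
    """Apply one transformation to a pitch contour (hypothetical helper).

    pitch  -- per-frame pitch values in Hz; 0.0 marks an unvoiced frame
              (an assumption of this sketch)
    mode   -- one of "shift", "flatten", "invert", "amplify"
    amount -- offset in Hz for "shift", scale factor for "amplify"
    """
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")

    voiced = [p for p in pitch if p > 0.0]
    mean = fmean(voiced) if voiced else 0.0

    def apply(p):
        if mode == "shift":
            return p + amount                 # uniform offset, e.g. +50 Hz
        if mode == "flatten":
            return mean                       # monotone: every voiced frame at the mean
        if mode == "invert":
            return 2.0 * mean - p             # mirror the contour around the mean
        return mean + amount * (p - mean)     # amplify: scale deviations from the mean

    # Unvoiced frames are left untouched so voicing decisions do not change.
    return [apply(p) if p > 0.0 else 0.0 for p in pitch]


contour = [0.0, 200.0, 220.0, 180.0, 0.0]
raised = transform_pitch(contour, "shift", 50.0)
flat = transform_pitch(contour, "flatten")
inverted = transform_pitch(contour, "invert")
amplified = transform_pitch(contour, "amplify", 2.0)
```

In this sketch the transformed contour would replace the predicted one before spectrogram synthesis, which matches the page's statement that no additional post-processing is applied. Large negative shifts can push values to zero or below, so a real pipeline would likely need to clamp or rescale them.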

The criteria in effect prior to November twenty-two, nineteen sixty-three, for determining whether to accept material for the PRS general files

Ground truth | Ground truth mel + WaveGlow | Tacotron2
FastPitch | FastPitch (amplified) | FastPitch (inverted)
FastPitch (flattened) | FastPitch (-50 Hz) | FastPitch (+50 Hz)

Chapter seven. Lee Harvey Oswald: Background and Possible Motives, Part one.

Ground truth | Ground truth mel + WaveGlow | Tacotron2
FastPitch | FastPitch (amplified) | FastPitch (inverted)
FastPitch (flattened) | FastPitch (-50 Hz) | FastPitch (+50 Hz)

which he kept concealed in a hiding-place with a trap-door just under his bed.

Ground truth | Ground truth mel + WaveGlow | Tacotron2
FastPitch | FastPitch (amplified) | FastPitch (inverted)
FastPitch (flattened) | FastPitch (-50 Hz) | FastPitch (+50 Hz)

At the first the boxes were impounded, opened, and found to contain many of O'Connor's effects.

Ground truth | Ground truth mel + WaveGlow | Tacotron2
FastPitch | FastPitch (amplified) | FastPitch (inverted)
FastPitch (flattened) | FastPitch (-50 Hz) | FastPitch (+50 Hz)

He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity.

Ground truth | Ground truth mel + WaveGlow | Tacotron2
FastPitch | FastPitch (amplified) | FastPitch (inverted)
FastPitch (flattened) | FastPitch (-50 Hz) | FastPitch (+50 Hz)

Audio Samples: Multi-speaker

We present samples synthesized with a multi-speaker FastPitch, trained on a dataset with three speakers (57 hours of data). The output of the model is conditioned on the speaker embedding.
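The "50% / 50%" samples below suggest blending two speakers. One plausible reading, sketched here, is a linear interpolation of the two speaker embeddings before they condition the model. The page does not state how the mix is computed, so the interpolation itself, the function name `mix_speaker_embeddings`, and the plain-list embedding representation are assumptions for illustration only.

```python
def mix_speaker_embeddings(emb_a, emb_b, weight_a=0.5):
    """Linearly interpolate two speaker embeddings (hypothetical helper).

    weight_a is the share of speaker A; speaker B receives 1 - weight_a.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimensionality")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(emb_a, emb_b)]


# Toy 2-dimensional embeddings standing in for two learned speaker vectors.
speaker_a = [1.0, 0.0]
speaker_b = [0.0, 1.0]
half_and_half = mix_speaker_embeddings(speaker_a, speaker_b)  # 50% A / 50% B
```

Under this assumption, setting `weight_a` to 1.0 or 0.0 recovers either original speaker, and intermediate values move between the two voices.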

The criteria in effect prior to November twenty-two, nineteen sixty-three, for determining whether to accept material for the PRS general files

FastPitch LJS | FastPitch Sally | FastPitch Helen
FastPitch 50% LJS / 50% Sally | FastPitch 50% Sally / 50% Helen

At the first the boxes were impounded, opened, and found to contain many of O'Connor's effects.

FastPitch LJS | FastPitch Sally | FastPitch Helen
FastPitch 50% LJS / 50% Sally | FastPitch 50% Sally / 50% Helen

He was in consequence put out of the protection of their internal law, end quote. Their code was a subject of some curiosity.

FastPitch LJS | FastPitch Sally | FastPitch Helen
FastPitch 50% LJS / 50% Sally | FastPitch 50% Sally / 50% Helen