A survey of work that uses the STFT of the generated waveform as a loss function in speech waveform generation tasks
- Parallel WaveGAN
- NSF
- HiFi-GAN
- MultiBand-MelGAN
- StyleMelGAN
My impression is that every SoTA GAN-based vocoder adopts it.
model | loss name | reference | loss | intent |
---|---|---|---|---|
PWG [1] | multi-resolution STFT auxiliary loss | spec | Lsc & Lmag | stability and speed [1] |
HiFi-GAN [1] | Mel-Spectrogram Loss | mel-spec | L1 | stability [1] & perceptual quality [2] |
MB-MelGAN [1] | multi-resolution STFT loss | spec | Lsc & Lmag | speed [1] |
StyleMelGAN [1] | multi-scale spectral reconstruction loss | spec | Lsc & Lmag | prevent adversarial artifacts [1] |
spectral convergence; Lsc
log STFT magnitude; Lmag
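For reference, the two terms can be written out as below. This follows the definitions in the PWG paper as I understand them. Here $x$ is the reference waveform, $\hat{x}$ the generated one, $|\mathrm{STFT}(\cdot)|$ the magnitude spectrogram, $N$ the number of elements in it, and $M$ the number of STFT resolutions.

$$
L_{sc}(x, \hat{x}) = \frac{\left\| \, |\mathrm{STFT}(x)| - |\mathrm{STFT}(\hat{x})| \, \right\|_F}{\left\| \, |\mathrm{STFT}(x)| \, \right\|_F}
$$

$$
L_{mag}(x, \hat{x}) = \frac{1}{N} \left\| \, \log|\mathrm{STFT}(x)| - \log|\mathrm{STFT}(\hat{x})| \, \right\|_1
$$

$$
L_{aux} = \frac{1}{M} \sum_{m=1}^{M} \left( L_{sc}^{(m)}(x, \hat{x}) + L_{mag}^{(m)}(x, \hat{x}) \right)
$$

Lsc is normalized by the reference magnitude, so it is dominated by high-energy spectral peaks. Lmag works in the log domain, so it weights low-energy regions more evenly.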
“Not only amplitude spectra but also phase spectra obtained from generated speech waveforms are used to calculate the proposed loss.”
from "STFT Spectral Loss for Training a Neural Speech Waveform Model"
Adversarial Loss + Reconstruction Loss is a standard GAN technique [1].
multi-resolution STFT loss: STFTs with different n_fft are combined into the overall loss
It contributes to stabilizing and speeding up GAN training [2].
Whether it contributes to final accuracy is unclear (STFT is admittedly a fairly coarse loss)
c.f. MB-MelGAN ablation study
c.f. HiFi-GAN ablation study (MOS is only 3.25, so I think training more or less failed)
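To make "combining STFTs with different n_fft" concrete, here is a minimal NumPy sketch of a multi-resolution STFT loss built from Lsc and Lmag. It is not the implementation from any of the papers. The function names, the `(n_fft, hop)` pairs, the `eps` floor, and the choice of window length equal to `n_fft` are all assumptions made for illustration. It is also not differentiable, so real training would need the same computation written in an autodiff framework.

```python
import numpy as np


def stft_magnitude(x, n_fft, hop, eps=1e-7):
    """Magnitude spectrogram with a Hann window, shape (frames, n_fft // 2 + 1).

    Assumption: the window length equals n_fft and no padding is applied.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Floor the magnitude so that the log in Lmag stays finite.
    return np.maximum(np.abs(np.fft.rfft(frames, axis=-1)), eps)


def spectral_convergence(mag_ref, mag_gen):
    """Lsc: Frobenius norm of the difference, normalized by the reference."""
    return np.linalg.norm(mag_ref - mag_gen) / np.linalg.norm(mag_ref)


def log_stft_magnitude(mag_ref, mag_gen):
    """Lmag: mean absolute difference of the log magnitudes."""
    return np.mean(np.abs(np.log(mag_ref) - np.log(mag_gen)))


def multi_resolution_stft_loss(
    x_ref, x_gen, resolutions=((512, 128), (1024, 256), (2048, 512))
):
    """Average of (Lsc + Lmag) over several (n_fft, hop) settings.

    The default resolutions are illustrative values, not taken from a paper.
    """
    losses = []
    for n_fft, hop in resolutions:
        mag_ref = stft_magnitude(x_ref, n_fft, hop)
        mag_gen = stft_magnitude(x_gen, n_fft, hop)
        losses.append(
            spectral_convergence(mag_ref, mag_gen)
            + log_stft_magnitude(mag_ref, mag_gen)
        )
    return float(np.mean(losses))


# Toy usage: a 220 Hz sine as the reference and a slightly noisy copy
# standing in for the generator output.
rng = np.random.default_rng(0)
x_ref = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
x_gen = x_ref + 0.01 * rng.standard_normal(16000)
loss_same = multi_resolution_stft_loss(x_ref, x_ref)  # identical inputs give 0.0
loss_noisy = multi_resolution_stft_loss(x_ref, x_gen)  # positive for a mismatch
```

Each resolution trades time resolution against frequency resolution differently, so averaging over several of them keeps the generator from fitting a single fixed STFT setting.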
- “we propose a multi-resolution STFT auxiliary loss.” from the PWG paper
- “Referring to previous work (Isola et al., 2017), applying a reconstruction loss to GAN model helps to generate realistic results” from the HiFi-GAN paper
- “In addition to the GAN objective, we add a mel-spectrogram loss to improve the training efficiency of the generator and the fidelity of the generated audio” from the HiFi-GAN paper
- “The mel-spectrogram loss helps the generator to … stabilizes the adversarial training process from the early stages.” from the HiFi-GAN paper
- “we adopt the multi-resolution STFT loss” from the MB-MelGAN paper
- “To improve the stability and efficiency of the adversarial training process” from the PWG paper
- “the convergence process extremely slow. To solve this problem, we adopt” from the MB-MelGAN paper
- “regularized by a multi-scale spectral reconstruction loss.” from the StyleMelGAN paper
- “to prevent the emergence of adversarial artifacts.” from the StyleMelGAN paper
- “also be expected to have the effect of focusing more on improving the perceptual quality” from the HiFi-GAN paper
- “we observed that the quality improves more stably when the loss is applied.” from the HiFi-GAN paper