This article is the day-24 entry of the ゆゆ式 (Yuyushiki) Advent Calendar 2017 - Adventar.
Introduction
Last time I (partially) succeeded in generating an unlimited supply of images of Yui. Once you have images, the next thing you want is a voice.
So I implemented [1710.08969] Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention in chainer and trained it on Yui's voice.
Repository
This is the repository.
Results
To be honest, I could not generate a voice of very high quality.
The main cause appears to be a lack of training data. I will leave the detailed discussion and explanation for later and just post the results first.
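「ゆゆ式アドベントカレンダー」 ("Yuyushiki Advent Calendar")
If someone told you that Yui is saying "Yuyushiki Advent Calendar" here, you might agree that she could be. It is at about that level.
「まんがタイムきらら」 ("Manga Time Kirara")
The beginning is almost intelligible, but my impression is that it still has a long way to go.
「あらゆる現実をすべて自分のほうへねじ曲げたのだ」 ("He twisted all of reality toward himself")
In principle the model can generate a long sentence in one go, but when it has not been trained properly the output becomes unintelligible like this.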
ããã ã®ããã¼ããããã人ã ãã
This one is an actual line of Yui's that is included in the training data. It is considerably more accurate than the three samples above, which shows that the model has overfit.
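「ゆずこ」 ("Yuzuko")
「縁」 ("Yukari")
These are words that appear many times in the series (obviously). As you would expect, the accuracy here is high.
This is still far from practical use, but for a first project I will call it good enough that I got to the point where Yui sounds like she is more or less saying something.
If I manage to improve the quality substantially, I would like to present the results somewhere.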
Technical details
From here on I will write about the technical details.
The intended audience is someone who has played with deep learning a little.
If none of it makes sense, please skip to the bottom.
Background
End-to-end speech synthesis with deep learning has advanced greatly over the past year or so.
Google's Tacotron and Baidu's DeepVoice are the well-known examples. Tacotron2 was announced recently and attracted a lot of attention.
These models are based on RNNs, which are slow to train, so they are the kind of model that only shows its real strength when a large company throws its GPUs at it.
In response, several CNN-based models that can be trained quickly have been proposed over the last few months.
DeepVoice3 is the representative example, and the model I implemented this time, [1710.08969] Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention, is another one. (The model does not seem to have a name, so I will call it ETTTS after the initials of the paper title.)
When you hear "CNN-based speech synthesis" you may think of WaveNet. WaveNet convolves the raw waveform to generate the next waveform samples, whereas DeepVoice3 and ETTTS convolve a mel spectrogram (an acoustic feature of the audio) to generate the next mel spectrogram frames. DeepVoice3 and ETTTS then reconstruct the raw waveform from the final mel spectrogram by some other method.
For example, a mel spectrogram generated by DeepVoice3 or ETTTS can be fed into WaveNet as conditioning to generate the raw waveform, so DeepVoice3 and ETTTS sit at a slightly different layer from WaveNet.
I have digressed a little, but that is how deep speech synthesis models that train quickly came to be proposed. This article is my attempt to actually try one.
I chose ETTTS over DeepVoice3 simply because the model is simpler and easier to understand, and the accuracy did not look much different. I would like to try DeepVoice3 another time.
Model
ETTTS consists of two main networks.
The first is Text2Mel, which predicts the next mel spectrogram frame from the whole text and the mel spectrogram so far. This is the main part.
The second is the Spectrogram Super-resolution Network (SSRN), which reconstructs the magnitude spectrogram (the absolute value of the short-time Fourier transform) from the mel spectrogram. There are various ways to synthesize a raw waveform from a mel spectrogram. This paper uses its own model up to the reconstruction of the magnitude spectrogram, and then uses a deterministic algorithm (Griffin&Lim) to recover the raw waveform from it.
To make training easier, the mel spectrogram handled by Text2Mel keeps only every 4th frame along the time axis (1/4 of the length), and SSRN restores a magnitude spectrogram of the original length. This is where the name "Super-resolution" comes from.
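As a rough illustration of the 1/4 time subsampling described above, here is a minimal NumPy sketch. The function name `subsample_mel` and the 80 x 400 array are hypothetical and only serve to show the shape change; this is not the repository's preprocessing code.

```python
import numpy as np

def subsample_mel(mel, rate=4):
    """Keep every `rate`-th frame along the time axis.

    mel: array of shape (F, T), F mel bins by T frames.
    """
    return mel[:, ::rate]

# Hypothetical mel spectrogram: 80 mel bins, 400 frames.
mel = np.random.rand(80, 400)
coarse = subsample_mel(mel)
print(mel.shape, "->", coarse.shape)
```

Text2Mel would then work on the coarse array, and SSRN would be responsible for getting back to the original time resolution.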
Text2Mel
First, the main network, Text2Mel. It is further divided into four parts.
TextEnc convolves the text and outputs a feature matrix $V$ (Value, of shape d x textlength) and a matrix $K$ (Key, of shape d x textlength) that expresses how much attention each character should receive. It has 15 layers.
AudioEnc applies causal convolutions to the mel spectrogram (a convolution in which the computation at a given time step uses only values at or before that time step) and outputs a feature matrix $Q$ (of shape d x audiolength). It has 13 layers.
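To make "causal convolution" concrete, here is a minimal NumPy sketch, assuming a plain 1D convolution with left-only zero padding. The function name, the weight layout `(C_out, C_in, k)`, and the loop-based implementation are my own choices for illustration; the actual chainer implementation will differ.

```python
import numpy as np

def causal_conv1d(x, w, dilation=1):
    """Causal 1D convolution.

    x: input of shape (C_in, T)
    w: weights of shape (C_out, C_in, k)
    Returns an array of shape (C_out, T) where column t depends
    only on x[:, :t + 1].
    """
    c_out, c_in, k = w.shape
    T = x.shape[1]
    pad = (k - 1) * dilation
    # Pad on the left only, so no future frame is ever visible.
    xp = np.pad(x, ((0, 0), (pad, 0)))
    y = np.zeros((c_out, T))
    for j in range(k):
        # Tap j reads the frame (k - 1 - j) * dilation steps in the past.
        shift = j * dilation
        y += w[:, :, j] @ xp[:, shift:shift + T]
    return y
```

A quick way to check causality is to change the input from some frame onward and confirm that the outputs before that frame do not move.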
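Next, the attention matrix $A$ (of shape textlength x audiolength) is computed as $A = \mathrm{softmax}(K^\top Q / \sqrt{d})$ (the product is a matrix product, and the softmax is taken over the text axis). It expresses which characters to attend to at each time step of the audio. This is a familiar concept in fields such as speech synthesis and machine translation.
$A_{nt}$ means that at the $t$-th time step of the audio, the model attends to the $n$-th character of the text with strength $A_{nt}$.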
Multiplying the text feature matrix $V$ by $A$ (as a matrix product) gives $R = VA$. This amounts to weighting the features of each character of the text by the amount of attention it receives.
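The attention step can be sketched in a few lines of NumPy. Here `K` and `V` stand for the two TextEnc outputs (d x textlength) and `Q` for the AudioEnc output (d x audiolength). The scaling by the square root of d and the softmax over the text axis follow my reading of the paper, so treat this as a sketch under those assumptions rather than the reference implementation.

```python
import numpy as np

def attention(K, V, Q):
    """Dot-product attention between text and audio features.

    K, V: arrays of shape (d, N), N = text length
    Q:    array of shape (d, T), T = audio length
    Returns A of shape (N, T) and R of shape (d, T).
    """
    d = K.shape[0]
    scores = K.T @ Q / np.sqrt(d)                  # (N, T)
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A = A / A.sum(axis=0, keepdims=True)           # softmax over the text axis
    R = V @ A                                      # attention-weighted text features
    return A, R
```

Each column of `A` sums to 1, so column t is a distribution over characters for audio frame t.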
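AudioDec applies causal convolutions to $R$ and $Q$ and outputs $Y$ (of shape F x audiolength), the predicted mel spectrogram one step ahead. It has 11 layers.
The convolutions in these networks use increasingly large dilation to widen the receptive field, and are built as Highway layers so that gradients propagate well. For the finer details of the network architecture, please refer to the original paper.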
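To give a feel for why dilation widens the receptive field so quickly, and what a Highway-style gate does, here is a small sketch. The dilation schedule (1, 3, 9, 27 with kernel size 3) is a hypothetical example, and `highway` shows only the gating formula, not a full convolutional layer.

```python
import numpy as np

def receptive_field(layers):
    """Receptive field (in frames) of stacked stride-1 1D convolutions.

    layers: iterable of (kernel_size, dilation) pairs.
    """
    return 1 + sum((k - 1) * d for k, d in layers)

def highway(x, h1, h2):
    """Highway-style gating: a sigmoid gate computed from h1 mixes the
    new features h2 with the unchanged input x."""
    gate = 1.0 / (1.0 + np.exp(-h1))
    return gate * h2 + (1.0 - gate) * x

# Hypothetical stack: four layers, kernel size 3, dilations 1, 3, 9, 27.
print(receptive_field([(3, 1), (3, 3), (3, 9), (3, 27)]))  # 81
```

With the gate close to 0 the layer passes its input straight through, which is what keeps gradients flowing in a deep stack.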
SSRN
SSRN convolves the mel spectrogram (of shape F x audiolength) and outputs the magnitude spectrogram (of shape F' x (4 audiolength)). To bring the mel spectrogram, which was shortened to 1/4, back to its original length, two deconvolutions are inserted along the way. It has 16 layers in total.
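The way two deconvolutions restore the 4x length can be illustrated with a toy transposed convolution. This sketch assumes kernel size 2 and stride 2, so each input frame produces exactly two output frames. The kernel size, weight layout, and channel counts are assumptions for illustration and are not taken from the repository.

```python
import numpy as np

def deconv1d_stride2(x, w):
    """Transposed 1D convolution with kernel size 2 and stride 2.

    x: input of shape (C_in, T)
    w: weights of shape (C_in, C_out, 2)
    Returns an array of shape (C_out, 2 * T).
    """
    c_in, T = x.shape
    c_out = w.shape[1]
    y = np.zeros((c_out, 2 * T))
    y[:, 0::2] = w[:, :, 0].T @ x  # even output frames
    y[:, 1::2] = w[:, :, 1].T @ x  # odd output frames
    return y

rng = np.random.default_rng(0)
mel = rng.normal(size=(80, 100))    # hypothetical F x audiolength input
w1 = rng.normal(size=(80, 64, 2))
w2 = rng.normal(size=(64, 64, 2))
up = deconv1d_stride2(deconv1d_stride2(mel, w1), w2)
print(up.shape)  # (64, 400)
```

Applying the 2x upsampling twice gives the factor of 4 in the time axis.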
Training
Text2Mel and SSRN are trained independently of each other.
When Text[0: textlength] and Audio[0: audiolength] are fed into Text2Mel and it outputs Y[0: audiolength], the loss function is the sum of the MAE and the cross-entropy (binary divergence) between Audio[1: audiolength+1] and Y[0: audiolength], plus the Guided Attention Loss described next.
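The one-step shift between input and target, and the spectrogram part of the loss, can be sketched as follows. I assume the mel values are normalized to the range [0, 1] and use a plain element-wise binary cross-entropy for the "binary divergence" term. The function names are hypothetical.

```python
import numpy as np

def make_input_and_target(audio):
    """audio: mel spectrogram of shape (F, T + 1).
    The network sees frames 0..T-1 and must predict frames 1..T."""
    return audio[:, :-1], audio[:, 1:]

def spectrogram_loss(Y, target, eps=1e-7):
    """MAE plus binary cross-entropy between prediction Y and target,
    both assumed to lie in [0, 1]."""
    mae = np.mean(np.abs(Y - target))
    Yc = np.clip(Y, eps, 1.0 - eps)  # avoid log(0)
    bce = np.mean(-target * np.log(Yc) - (1.0 - target) * np.log(1.0 - Yc))
    return mae + bce
```

The Guided Attention Loss is then added on top of this value to form the full Text2Mel objective.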
In machine translation, the shape the attention will take is not obvious. For example, for "I like yuyushiki". -> 「私はゆゆ式が好きです。」 the words are attended to in the order (1, 3, 2). In speech synthesis, on the other hand, the attention is monotonic (as time advances in the audio, the character to be read also advances), and in particular it is almost linear.
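To exploit this domain knowledge, given the attention matrix $A$ and a penalty matrix $W$ (of shape textlength x audiolength), the mean of $A \odot W$ (an elementwise product) is added to the loss function as the Guided Attention Loss.
$W$ is a matrix that is close to 0 near the diagonal and grows larger the farther you move from the diagonal. In other words, a large penalty is imposed when the attention $A$ has strong values far from the diagonal (that is, when the reading order departs from linear). This has the effect of making the attention converge faster during training.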
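A minimal NumPy sketch of the Guided Attention Loss follows. The text above does not spell out the exact form of the penalty matrix. The Gaussian-shaped formula and the width `g = 0.2` below follow my understanding of the paper, so please treat them as assumptions.

```python
import numpy as np

def guided_attention_weights(N, T, g=0.2):
    """Penalty matrix W of shape (N, T): about 0 near the diagonal
    n/N = t/T and approaching 1 far away from it."""
    n = np.arange(N)[:, None] / N
    t = np.arange(T)[None, :] / T
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g * g))

def guided_attention_loss(A, g=0.2):
    """Mean of the elementwise product of attention A (N, T) and W."""
    W = guided_attention_weights(A.shape[0], A.shape[1], g=g)
    return np.mean(A * W)
```

A perfectly diagonal attention gives a loss of 0, while an attention that reads the text backwards is penalized heavily.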
As for SSRN, it is simply trained to minimize the sum of the MAE and the cross-entropy. SSRN can be trained without labels as long as you have audio data, and it is also relatively easy to get it to a practical level of accuracy.
Experiments
First, as in the paper, I trained Text2Mel and SSRN on The LJ Speech Dataset, a public-domain English speech corpus. It contains 24 hours of English speech from a single female speaker together with the transcripts. With this much data you can train a decent speech synthesis model. And true to its claim of fast training, each of the two models finished training in about one day on a GTX 1080Ti.
"No Event Good Life"
This is clearly intelligible.
"All your base are belong to us"
The model also properly speaks a sentence that is not in the dataset and is, frankly, questionable as English. It is amazing that one day of training on a consumer GPU gets you this far!
[Figure: attention matrix $A$ for this sentence] The attention is properly monotonic.
[Figure: loss curve of Text2Mel trained on LJSpeech]
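Next comes the real target, the Yui dataset. Of course no such dataset exists, so I prepared it myself.
First, I threw together a rough annotation tool in JS.
For the annotation scheme, I perhaps should have used symbols that are tightly tied to pronunciation, such as IPA. But I was thinking of fine-tuning the model trained on LJSpeech with Yui's data, so I went with romaji. Things like geminate consonants are awkward in romaji, so it might have been better to avoid it. I also considered English-like spellings such as 「ゆゆ式」 -> "youyoushekey", but dropped the idea because it is laborious to annotate that way and hard to reuse later.
Then came the actual annotation, which was quite hard work. For copyright reasons I could not crowdsource it, so I quietly did it myself and ran out of steam after somehow annotating up to episode 6. (I did most of it by hand. In hindsight I probably should have let a suitable model produce a rough first pass automatically and then cleaned it up by hand.)
At the point where I had finished episode 6, the total length of the speech data was 12 minutes. Two minutes per episode was shorter than I had expected.
Still, I had half of the data (6 of the 12 episodes), so I figured that was enough to get a sense of whether this would work, and moved on.
For SSRN, I found that the model trained on LJSpeech could reproduce the audio almost losslessly through the conversion wav -> mel -> (SSRN) -> STFT -> wav, so I decided to reuse that model.
Text2Mel, of course, cannot be reused as-is, so I moved on to training it.
When I trained it for a while, it overfit right away, as you would expect with so little training data. To avoid overfitting as much as possible, I tuned hyperparameters such as the dropout rate.
For the final run I trained it on a GTX 1080Ti. The results are posted near the top of this article.
[Figure: loss curve of Text2Mel trained on the Yui dataset]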
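Finally, I trained the model on JSUT (Japanese speech corpus of Saruwatari Lab, University of Tokyo) - Shinnosuke Takamichi (高道 慎之介). This corpus contains about 10 hours of Japanese speech from a single female speaker together with the transcripts.
「風船みたいな頭の子だな。」 ("That kid has a head like a balloon.")
「やっぱりもう少し考えて分かる」 ("On second thought, I get it after thinking a bit more")
The intonation is slightly off, but the utterances are almost perfect. I will say it again: it is amazing that one day of training on a single GPU gets you this far!
[Figure: loss curve of Text2Mel trained on JSUT]
Since the model trained correctly on a Japanese dataset as well, the reason the Yui experiment did not go well does seem to lie in the quality and quantity of the data.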
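Discussion
- It is beyond doubt that there is not enough training data. At a minimum I need to annotate all 12 episodes plus every other usable source of audio such as the OVA, but even then a practical result looks hard to reach. One option might be to also use her other roles, for example Yui Funami (船見結衣, YuruYuri) or Karin Shijimi (志々美かりん, Pretty Rhythm). Those series have 36 and 51 episodes respectively, so the amount of data could be increased considerably.
- Another option is to use Yui's voice as the training data for one speaker in a model that can be trained on many speakers, as in DeepVoice2 and 3. DeepVoice3 was successfully trained on VCTK, a dataset with 108 speakers and 44 hours in total, and on LibriSpeech, a dataset with 2484 speakers and 820 hours in total, and it seems to get by with little data per speaker. So if a multi-speaker Japanese speech corpus can be put together, this approach looks promising. As far as I know, 声優統計コーパス (the Voice Actress Statistics corpus) and JSUT are about the only Japanese speech corpora available, so it may be difficult at present, but I think it is worth trying. If you know of any other promising data, I would be happy to hear about it.
- If I become extremely rich someday, I will commission the creation of a voice-actor corpus. So please, God, give me money.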
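Summary
I learned that a reasonable quality of speech synthesis is possible as long as enough data is collected, and that even with the small amount of data I have now the model can produce a bare minimum of speech, so I will call that a success!
I intend to keep at it without giving up until the day I build a manga2anime model (put manga in, get anime out).
Thank you for reading to the end! Merry Christmas! Happy New Year!!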
References
[3] [1703.10135] Tacotron: Towards End-to-End Speech Synthesis
[4] [1712.05884] Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
[5] [1702.07825] Deep Voice: Real-time Neural Text-to-Speech
[6] [1705.08947] Deep Voice 2: Multi-Speaker Neural Text-to-Speech
[7] [1710.07654] Deep Voice 3: 2000-Speaker Neural Text-to-Speech
[8] WaveNet: A Generative Model for Raw Audio | DeepMind
[9] Pythonで音声信号処理 - 人工知能に関する断創録
[11] 日本声優統計学会
[12] JSUT (Japanese speech corpus of Saruwatari Lab, University of Tokyo) - Shinnosuke Takamichi (高道 慎之介)