Science
NOLA / Nonzero Overlap Add: the overlapped windows are nonzero over the whole range, so no information is lost (scipy).
COLA / Constant OverLap Add: the overlapped windows are flat in the central part (scipy).
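The two conditions can be checked numerically by summing shifted copies of a window. The following is a minimal NumPy-only sketch, not scipy's own implementation; the helper `overlap_add_profile`, the periodic Hann window, and the sizes `n_fft = 8`, `hop = 4` are illustrative assumptions. As far as I understand scipy's definition, NOLA is stated on the squared window, so the sketch follows that.

```python
import numpy as np

def overlap_add_profile(window, hop, n_frames=16):
    """Sum `n_frames` copies of `window`, each shifted by `hop` samples."""
    n = len(window)
    total = np.zeros(hop * (n_frames - 1) + n)
    for i in range(n_frames):
        total[i * hop:i * hop + n] += window
    return total

n_fft, hop = 8, 4  # hypothetical sizes: 50% overlap
# Periodic Hann window (assumption: picked because it is COLA at 50% overlap).
hann = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)

# COLA: skip the ramp-up/ramp-down at both ends and inspect the central part.
center = overlap_add_profile(hann, hop)[n_fft:-n_fft]
is_cola = bool(np.allclose(center, center[0]))

# NOLA: the overlap-added squared window must never reach zero.
center_sq = overlap_add_profile(hann ** 2, hop)[n_fft:-n_fft]
is_nola = bool(np.all(center_sq > 1e-10))
```

Only the central part is inspected because the first and last frames have no neighbour to overlap with, which matches the "flat in the central part" wording above.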
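How should the human voice be modeled, and what does a given model reproduce and what does it discard? A fairly basic viewpoint here is that the human voice tends to be regarded as monophonic. However, there are examples where it also sounds polyphonic. As an example in which two different melodies can be heard, this video can be cited…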
Once notes on my own blog come together, I move them to Wikipedia. Acoustic features: wikipedia/音声合成#音響特徴量
Moved to Wikipedia/畳み込みニューラルネットワーク#受容野 (the receptive field section of the convolutional neural network article).
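The LPCNet family is a class of models originating from LPCNet, which combines linear prediction with NN-based residual prediction. Its ancestors are linear predictive coding and WaveRNN. Its selling points are speed, which comes from model efficiency, and an open-source C implementation. Interpreting the components: Excitation/Residual - merely a nonlinear complement. The classical source-fi…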
A survey of work on speech waveform generation that uses the STFT of the generated waveform in the loss function: Parallel WaveGAN, NSF, HiFi-GAN, MultiBand-MelGAN, StyleMelGAN. My impression is that every SoTA GAN-based vocoder adopts it. (Table columns: model, loss name, reference, loss intent) PWG[1] mu…
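A differentiable method for sampling from a probability distribution. Add noise to a probability vector and take the argmax to obtain an index, which can be turned directly into a one-hot vector. => With a well-designed noise, this samples exactly according to the distribution (Gumbel-Max Trick). We want to sample, but we also want to differentiate. a…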
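The noise-then-argmax step can be sketched in NumPy as follows. This is a minimal illustration of the Gumbel-Max Trick only; the function name `gumbel_max_sample` and the example probabilities are assumptions. The argmax itself is not differentiable, so this sketch covers the sampling half and not the "we also want to differentiate" half.

```python
import numpy as np

def gumbel_max_sample(probs, rng):
    """Draw one index from `probs` by adding Gumbel noise to log-probabilities."""
    uniform = rng.uniform(size=probs.shape)
    gumbel = -np.log(-np.log(uniform))           # standard Gumbel(0, 1) noise
    index = int(np.argmax(np.log(probs) + gumbel))
    one_hot = np.eye(len(probs))[index]          # index -> one-hot vector
    return index, one_hot

rng = np.random.default_rng(0)
probs = np.array([0.1, 0.6, 0.3])  # hypothetical distribution

# Empirical check: sample frequencies should approach `probs`.
counts = np.zeros(len(probs))
for _ in range(20000):
    index, _ = gumbel_max_sample(probs, rng)
    counts[index] += 1
empirical = counts / counts.sum()
```

The key point is that the noise is added to the log-probabilities rather than to the probabilities themselves; that choice is what makes the argmax follow the original distribution.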
VCC2020 T10 model[1] (top score). An ASR-based rec-syn system that achieves MOS 4.0 and similarity 3.6. Models: ASR: SI-ASR (same as N10?). Conversion model: an Encoder-Decoder model (cf. S2S). Encoder: LSTM -> 2x time-compressing concat[2] -> LSTM. Decoder: AR with attention-…
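Non-local Neural Networks (2018). The "feeling" of the module: "Give me everything, but only the things I want." FC: takes in every element regardless. Conv: takes in only a hard-coded local neighborhood. RNN: directly takes in only hidden_{t-1}. => Dynamically, based on the current value, take in only the desired elements from the full length…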
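A paper that makes FastSpeech double as a pitch estimator. As with duration, a PitchPredictor is trained per phoneme. The predicted scalar is converted to the same feature dimension as the latent and then, surprisingly, simply summed! Because segFC projects it into the feature dimension, this becomes learnable, and around there, in the pitch dimension too…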
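Fast and skillful (whether it is cheap is debatable): FastSpeech. Overview: a Transformer performs sequence conversion on the phoneme sequence, the result is dynamically upsampled, and a Transformer converts that sequence into a mel-spec. That is all. Dynamic upsampling is performed by the LengthRegulator, and the per-phoneme scale factor is dynamically … by the DurationPredictor…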
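The dynamic upsampling step can be pictured as repeating each phoneme's feature vector by its duration. The sketch below is a simplified NumPy illustration and not the FastSpeech implementation; `length_regulate` is a hypothetical helper, and the integer `durations` are hand-written stand-ins for what a duration predictor would output.

```python
import numpy as np

def length_regulate(phoneme_features, durations):
    """Repeat row i of `phoneme_features` `durations[i]` times along the time axis."""
    return np.repeat(phoneme_features, durations, axis=0)

# Three phonemes with 2-dimensional features (made-up values).
phoneme_features = np.array([[1.0, 10.0],
                             [2.0, 20.0],
                             [3.0, 30.0]])
durations = np.array([2, 1, 3])  # assumed per-phoneme frame counts

frames = length_regulate(phoneme_features, durations)  # shape (6, 2)
```

The output length is the sum of the durations, so the sequence length is decided per utterance rather than fixed in advance.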
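We want the model to learn so that an intermediate representation takes specific values. A: pray that the model's inductive bias makes it learn that way naturally. B: split the model and train the parts separately. C: put a loss on that intermediate representation. D: put a loss on it and additionally pass the ground-truth data to the next layer (in the style of teacher forcing). …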
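Claim: "For TTS, rather than conditioning WaveNet directly on complex features, 'a good char2spec model + a spec2wave WaveNet' works better." Overview: an attention Seq-to-Seq model generates a mel spectrogram from a character sequence, and WaveNet generates the waveform. An LSTM encoder swallows the sentence whole, and its final output is taken as z…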
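Starting from MelGAN, the model and loss are optimized, and then the final output is given multiple channels, each predicting one subband. Commonly known as MB-MelGAN. Model: MelGAN-based, that is, ConvT1d-based. By introducing ResBlocks and enlarging the receptive field with dilated convolutions, the full-band model itself is first…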
Libri-light is a corpus generated from LibriVox[1], which makes it a relative of LibriSpeech[2]. Unlabelled Speech Training Set: unlab-60k, unlab-6k, unlab-600. Dev and Test Set (totally same as LibriSpeech[3]): dev-clean: 5.4 hours, dev-other: 5.3 hours, test-clean: …
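Neural vocoders are now commonplace and come in many varieties suited to different uses. Their origin is WaveNet. WaveNet itself is no longer used, but its fundamental ideas have spread to the point of being taken for granted, and its modules are used everywhere. This post looks back on WaveNet, now something of a neoclassic. Summar…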
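Generative models: learning the entire sample distribution. A generative model is a rather luxurious model. It tries to model everything, including the variety of very rare samples. In practical use of generative models, ignoring the fine details of the distribution often gives better results. => Generative models and the "temperature" parameter:…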
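The Gated Activation Unit is a kind of activation function/unit. $\mathrm{output} = \tanh(W_{\mathrm{filter}} * \mathrm{input}) \odot \sigma(W_{\mathrm{gate}} * \mathrm{input})$. It can be viewed as taking the output of the nonlinear transform tanh(conv(input)) and gating it with the 0~1 values produced by sigmoid(conv'(input)). In Gated PixelRNN…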
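The formula above can be tried out on a 1-D signal with NumPy. This is a minimal sketch under stated assumptions: the helper names, the kernel values, and the use of `np.convolve` with `mode="same"` are illustrative choices, not taken from any particular WaveNet or PixelRNN implementation.

```python
import numpy as np

def conv1d_same(x, kernel):
    """1-D convolution with output length equal to len(x).

    Note: np.convolve flips the kernel (true convolution), whereas most
    NN libraries compute cross-correlation; for a sketch the difference
    only amounts to a reversed kernel.
    """
    return np.convolve(x, kernel, mode="same")

def gated_activation(x, w_filter, w_gate):
    """tanh(W_filter * x) multiplied elementwise by sigmoid(W_gate * x)."""
    filtered = np.tanh(conv1d_same(x, w_filter))              # values in (-1, 1)
    gate = 1.0 / (1.0 + np.exp(-conv1d_same(x, w_gate)))      # values in (0, 1)
    return filtered * gate

x = np.array([0.0, 1.0, -2.0, 3.0, 0.5])
w_filter = np.array([0.5, -1.0, 0.25])  # hypothetical filter weights
w_gate = np.array([1.0, 0.0, -1.0])     # hypothetical gate weights
out = gated_activation(x, w_filter, w_gate)
```

Because the gate lies strictly between 0 and 1, every output element has a smaller magnitude than the corresponding tanh output, which is the "gating" reading described above.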
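List of characteristics: quality; latency; stream latency: time at which a sample is output minus time at which the sample was received; realtime factor (RTF): processing time / signal length; performance: resource requirements/usage (CPU/GPU/memory). Tasks and characteristics; requirements. Audio material verification: text and auxiliary inputs …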
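The RTF definition above (processing time divided by signal length) is simple enough to write down directly. The sketch below is an illustration only; the function names and the 24 kHz sample rate are assumptions, and the dummy `process` callable stands in for a real model.

```python
import time

def realtime_factor(processing_seconds, signal_seconds):
    """RTF = processing time / signal length; below 1.0 is faster than real time."""
    return processing_seconds / signal_seconds

def measure_rtf(process, signal, sample_rate):
    """Time one call of `process(signal)` and convert it to an RTF."""
    start = time.perf_counter()
    process(signal)
    elapsed = time.perf_counter() - start
    return realtime_factor(elapsed, len(signal) / sample_rate)

# Example: 0.5 s of processing for 2.0 s of audio.
example_rtf = realtime_factor(0.5, 2.0)

# Timing a dummy processor on one second of silence (hypothetical 24 kHz rate).
measured_rtf = measure_rtf(lambda s: sum(s), [0.0] * 24000, sample_rate=24000)
```

RTF describes throughput and says nothing about stream latency: a model can have a low RTF and still need the whole input before emitting its first sample.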
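A division of roles between "how large is the error" and "how the error is reflected in the weights". When you want only a specific layer to go untrained (e.g. Encoder-FixNet-Decoder), stopping backpropagation at FixNet is wrong, because the error then never reaches the Encoder. Compute backward as usual, and let the optimizer E…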
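Prediction and Reconstruction. prediction: estimating the target's observed value without the corresponding features. reconstruction: generating the target's observed value from the corresponding features. In prediction, the observed value is estimated from the context. Picture it as one card lying face down while all the others are face up…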
Assigning sys.float_info.min (a very small value) somehow yields 0., and log dies. The cause was that I had been careless about data types. Reproduction code: float_array = np.array([1., 2.,], dtype=np.float32) tiny = sys.float_info.min print(float_array) # [1. 2.] float_ar…
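The reproduction code in the excerpt is cut off, so the following is a reconstruction of the likely failure rather than the original snippet. It assumes the problem arises when the float64 minimum is stored into a float32 array; the variable names after `float_array` are illustrative.

```python
import sys
import numpy as np

tiny64 = sys.float_info.min  # smallest positive normal float64, about 2.2e-308

float_array = np.array([1.0, 2.0], dtype=np.float32)
with np.errstate(under="ignore", divide="ignore"):
    # Storing a float64 value into a float32 array casts it; 2.2e-308 is far
    # below the float32 range, so it underflows to exactly 0.0.
    float_array[0] = tiny64
    underflowed = float(float_array[0])

    # A dtype-aware floor: the smallest positive normal float32 (about 1.2e-38).
    tiny32 = np.finfo(np.float32).tiny
    float_array[1] = tiny32

    logs = np.log(float_array)  # first entry is -inf, second is finite
```

Taking the floor from `np.finfo` of the array's own dtype keeps the value representable, which is the data-type care the excerpt refers to.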
Background: librosa's STFT is very convenient. If you want to do a primitive FFT, though, you end up with scipy.fft.rfft. Do the two behave the same? Behavior: with default settings they behave differently. The following behaves the same: librosa.stft(x_full, n_fft=n_fft, window="boxcar", ce…
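A way to represent an n-bit integer as a combination of integers. Something like upper bits / lower bits. Representation: 6bit == $2^6$ == 0~63. Split this into the upper 3 bits and the lower 3 bits and represent it as that pair. (3bit, 3bit) == ($2^3$, $2^3$) == (0~7, 0~7). Conversion: upper_decimal = value_decimal // 2^(nbit/…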
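The upper/lower split can be sketched with integer division and modulo. The conversion formula in the excerpt is truncated, so this sketch assumes an even split into two halves of `n_bits // 2` bits each, which matches the 6-bit into (3-bit, 3-bit) example; `split_bits` and `merge_bits` are hypothetical names.

```python
def split_bits(value, n_bits=6):
    """Split an n-bit integer into (upper half, lower half); assumes even n_bits."""
    half = n_bits // 2
    upper = value // 2 ** half   # same as value >> half
    lower = value % 2 ** half    # same as value & (2 ** half - 1)
    return upper, lower

def merge_bits(upper, lower, n_bits=6):
    """Inverse of split_bits."""
    return upper * 2 ** (n_bits // 2) + lower

pair_max = split_bits(63)  # (7, 7): both halves at their 3-bit maximum
pair_mid = split_bits(42)  # 42 = 0b101010 -> (0b101, 0b010) = (5, 2)
```

Each half ranges over 0~7, so two 8-way choices cover the same 64 values as one 64-way choice.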
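Energy-Based Model: a model that assigns to random variables an Energy corresponding to "instability". It is used by obtaining probabilities from the energies or by computing energy ratios between random variables. When treated as probabilities, the probability density function is taken to be the Boltzmann distribution. The Boltzmann distribution is a potential function…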
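Going from energies to probabilities can be illustrated for a small discrete set of states. The sketch assumes the Boltzmann form with probability proportional to exp(-E), with the temperature fixed to 1 and a finite number of states; `boltzmann_probs` and the energy values are illustrative.

```python
import numpy as np

def boltzmann_probs(energies):
    """p_i = exp(-E_i) / sum_j exp(-E_j), with temperature fixed to 1 (assumption)."""
    logits = -np.asarray(energies, dtype=np.float64)
    logits = logits - logits.max()   # shift for numerical stability; ratios unchanged
    weights = np.exp(logits)
    return weights / weights.sum()

energies = np.array([0.0, 1.0, 2.0])  # lower energy = more stable = more probable
probs = boltzmann_probs(energies)

# The probability ratio of two states depends only on their energy difference,
# so it can be computed without the normalizing constant.
ratio = probs[0] / probs[1]
expected_ratio = np.exp(energies[1] - energies[0])
```

This is why comparing two states is cheap even when the normalizing sum over all states is intractable.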
CycleGAN + linear spectrogram + WaveRNN Vocoder => similarity MOS 4.5, naturalness MOS in the high 3s. [An article for … readers] Overview: Masaya Tanaka, Takashi Nose, Aoi Kanagaki, Ryohei Shimizu, and Akira Ito (2020) Scyclone: High-Quality and Parallel-Da…
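Voice conversion (声質変換, read "koe-shitsu henkan" or "seishitsu henkan"[1]) means changing only the texture of a voice without changing the meaning it carries. More precisely, it is "a process that, given input speech, intentionally transforms other desired information while preserving the utterance content"[2]. In English it is called "Voice Conversion" or "Voice Tran…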
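Acoustic features are expected to be indicators that capture the characteristics of a sound. Speech recognition, synthesis, and conversion based on known acoustic features are now being overtaken by methods that use data-driven features or that are E2E. Are acoustic features really features that capture the essence?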
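花嶋かりな hanashima-lab.wixsite.com Moved to Waseda in FY2017. Originally at RIKEN CDB? …井 Laboratory: neurochemistry, maybe? Watanabe Laboratory (Cognitive Science - Watanabe Laboratory). 大須 (Osu) Laboratory: Waseda University Faculty of Human Sciences, Osu Laboratory. ATR. Faculty of Education and Integrated Arts and Sciences: 花嶋…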
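Realizing true VR/full dive requires "reading motor intention". Once that is achieved, one can live in a virtual world without holding a controller in the real world. This article introduces how far cutting-edge science has made "reading movement (intention)" possible. Contents: United States …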