paper
Proposes a new speech corpus, SRC4VC (Smartphone-Recorded Corpus for Voice Conversion). SRC4VC contains moderately degraded speech recorded in everyday environments, and it is expected to be used for developing VC that is robust to speech degradation (degradation robustness) …
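A paper by the Fairseq team that releases and evaluates implementations of various speech synthesis models [1]. It performs speech synthesis with character/phoneme/Unit-to-Mel models plus a vocoder, and evaluates the results objectively with the set of metrics Fairseq is known for. Models: models implemented by Fairseq S2. Text-to-Mel: Tacotron 2, Transformer TTS, Fa…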
Proposed model: mel-spec input (pitch-less), multiband LPCNet [1]. Demo: Chinese-language demo at wavecoder.github.io. ConditioningNetwork: uses the mel-spec directly as input [2, 3, 4] (no pitch [5], 80 dim [6]). Mel2LPcoeff: the LP coefficients are computed from the mel-spec [7]. In each band, the mel-spec's …
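Made LPCNet more efficient (2.5x or more). Background: the bottleneck is already known. LPCNet is fast enough for real-time inference on a mobile CPU, and scaling it up also gives good quality. However, the quality achievable under the speed constraint still has room for improvement [1], so further efficiency gains are needed …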
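Uses CPC for pre-training multilingual ASR and achieves performance equal to or better than existing supervised models. Background: what to do when data is scarce? => pre-training on large data from a nearby domain & transfer learning. For ASR, if something phoneme-like can be learned in pre-training, it can probably be shared across languages => CPC. Method: CPC's …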
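Multiband-WaveRNN is a model that, under the hypothesis that "WaveRNN has spare representational capacity", makes a WaveRNN of unchanged size predict N subbands simultaneously [1]. It does in fact succeed at N-band prediction with no difference in MOS. Because the operating frequency can be reduced to 1/N, RTF improves substantially. Background / model: Wa…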
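The 1/N operating-frequency claim in the Multiband-WaveRNN summary can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the paper: the function names, the 24 kHz sample rate, and the per-step cost are hypothetical. It also assumes that the cost of one network step stays the same when the model predicts N subband samples per step.

```python
def network_steps_per_second(sample_rate_hz: int, n_bands: int) -> int:
    """Autoregressive steps needed per second of audio.

    With N subbands predicted at once, each step emits N samples,
    so the network runs at sample_rate / N.
    """
    if sample_rate_hz % n_bands != 0:
        raise ValueError("sample rate must be divisible by the band count")
    return sample_rate_hz // n_bands


def real_time_factor(step_cost_sec: float, sample_rate_hz: int, n_bands: int) -> float:
    """RTF = compute time / audio duration, for one second of audio.

    Assumes (hypothetically) a fixed cost per network step.
    """
    return step_cost_sec * network_steps_per_second(sample_rate_hz, n_bands)


# Hypothetical numbers for illustration only.
fullband_steps = network_steps_per_second(24000, 1)   # 24000 steps per second
multiband_steps = network_steps_per_second(24000, 4)  # 6000 steps per second
```

Under the fixed per-step cost assumption, RTF scales down by the same factor N as the step count.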
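An attempt to see whether speech synthesis/conversion/compression is possible with a neural vocoder driven by neural acoustic features (content, fo, speaker). Representation learning and decoder training are fully separated (the representation models are pretrained -> fixed). The decoder is trained on the outputs of the fixed models. The content representation models are CPC, HuBER…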
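LPCNet: combines a linear-prediction vocoder with a WaveRNN that predicts the excitation/residual [1]. It reaches higher accuracy with fewer parameters than full neural vocoders. Through various optimizations such as sparsification, training with noise, and a tailored fully connected layer, it achieves real-time synthesis even on a not-so-powerful CPU. speech synthesis…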
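The split between linear prediction and excitation in the LPCNet summary can be illustrated with plain linear predictive coding. This is a minimal sketch of the textbook relation s[t] = sum_k a[k] * s[t-1-k] + e[t], not LPCNet's implementation. The function names and the example coefficients are hypothetical, and the residual is computed from a known signal here, whereas LPCNet uses a WaveRNN to predict the excitation.

```python
from typing import List, Sequence


def _predict(history: Sequence[float], lp_coeffs: Sequence[float], t: int) -> float:
    """Linear prediction of sample t from the samples before it."""
    prediction = 0.0
    for k, a_k in enumerate(lp_coeffs):
        idx = t - 1 - k
        if idx >= 0:  # samples before the start are treated as zero
            prediction += a_k * history[idx]
    return prediction


def lpc_residual(samples: Sequence[float], lp_coeffs: Sequence[float]) -> List[float]:
    """Excitation e[t] = s[t] - prediction: the part the neural net must model."""
    return [s_t - _predict(samples, lp_coeffs, t) for t, s_t in enumerate(samples)]


def lpc_synthesize(excitation: Sequence[float], lp_coeffs: Sequence[float]) -> List[float]:
    """Rebuild samples one by one as s[t] = prediction + e[t]."""
    samples: List[float] = []
    for t, e_t in enumerate(excitation):
        samples.append(_predict(samples, lp_coeffs, t) + e_t)
    return samples


# Hypothetical toy signal and 2nd-order coefficients, for illustration only.
signal = [0.0, 0.5, 0.9, 0.7, 0.1, -0.4]
coeffs = [1.2, -0.5]
residual = lpc_residual(signal, coeffs)
reconstructed = lpc_synthesize(residual, coeffs)
```

Because the linear predictor already explains much of the waveform, the network only has to model the residual, which is one common explanation for why this design needs fewer parameters.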