We made LPCNet more efficient (x2.5~).
Background - the bottlenecks are known; resign yourself
LPCNet is already fast enough for real-time inference on a mobile CPU. Scaling it up would no doubt improve quality, but there is still room to improve quality under the speed constraint[^1]. Further efficiency gains are called for.
In the original LPCNet, the SamplingRateNet accounts for 98.2% of the computation time, and within it GRUa/GRUb/DualFC take 47.8%/26.9%/21.3%[^2].
Sparsification is applied only to GRUa, so there is room for improvement.
Moreover, the weight transfers that the SamplingRateNet triggers at every step saturate the L2 cache bandwidth, which is a bottleneck in itself[^3]; worse, the weights often do not even fit in the L2 cache, which presumably compounds the bottleneck[^4]. Hence techniques that shrink the weights are called for.
The second bottleneck is the activation functions: with NA = 384, about 2000 activation evaluations occur per step[^5]. Cheaper activation functions would mitigate this.
LPCNet can still get much faster. Resign yourself; it shall be made efficient[^6].
Proposed method
- Model improvements
  - Hierarchical Probability Distribution
  - Increasing second GRU capacity
- Computation improvements
Hierarchical Probability Distribution
Principle
A distribution-factorization technique inspired by the DualSoftmax of WaveRNN and the Bit bunching of Bunched LPCNet. It is formulated as

P(s) = Π_k B(Lk|L<k)
Here Lk is the basic building block: a bit that can take either value 0/1 once the higher bits L<k are fixed.
It is formalized as a per-bit Bernoulli distribution conditioned on the higher bits, B(Lk|L<k).
On top of this, the discrete value st is viewed as having a hierarchical bit structure.
That is, 0 ≤ st ≤ 2^Q - 1 (Q bits) can be seen as a Q-level bit tree (e.g., 5 with Q=3 is 1_0_1).
Looking at level k alone, a bit is simply 0 or 1, but its probability changes depending on whether the levels above it are 0 or 1; in other words, it is conditioned on the higher bits.
That is, the basic building block B(Lk|L<k) emerges from the hierarchical structure.
The probability of st is the joint probability over all its bits, so viewing it as a bit tree lets us factorize the probability distribution. That is, it reduces to

P(s) = B(L1|-) * B(L2|L1) * ... * B(LQ|L1, L2, ..., LQ-1) = Π_k B(Lk|L<k)
In summary, the Hierarchical Probability Distribution is a technique that models a probability distribution by viewing a discrete value as a bit tree and factorizing the joint probability into per-bit Bernoulli distributions B(Lk|L<k) conditioned on the higher bits.
For example, with Q = 3 bits, the probability of st = 5 is obtained as

P(st=5=1_0_1) = B(L1=1|-) * B(L2=0|L1=1) * B(L3=1|L1=1,L2=0)
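As a toy illustration of this factorization (a minimal sketch; the per-node probabilities below are made-up stand-ins for what the network would output at each tree node):

```python
def joint_prob(s, node_probs, Q=3):
    """P(s) as a product of per-bit Bernoulli factors B(Lk|L<k).

    node_probs maps a tuple of already-fixed higher bits to
    P(Lk=1 | that prefix); the values are hypothetical, standing in
    for the network's output at each node of the bit tree.
    """
    bits = [(s >> (Q - 1 - k)) & 1 for k in range(Q)]  # MSB first: 5 -> [1, 0, 1]
    p, prefix = 1.0, ()
    for b in bits:
        p1 = node_probs[prefix]          # P(Lk=1 | L<k = prefix)
        p *= p1 if b else 1.0 - p1       # pick the Bernoulli factor for this bit
        prefix += (b,)
    return p

# Hypothetical per-node probabilities for the Q=3 tree nodes we touch.
node_probs = {(): 0.6, (1,): 0.3, (1, 0): 0.8}
print(joint_prob(5, node_probs))  # B(1|-)*B(0|1)*B(1|1,0) = 0.6*0.7*0.8 = 0.336
```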
Advantages
Why bother with such a tedious (but straightforward) formulation? Because it slashes the amount of computation.
To sample from a Q-bit probability distribution the usual way, you must first compute 2^Q energies (e^x), then, as in softmax, take their sum (the partition function) and divide every element by it, and only then sample from the resulting distribution.
With the hierarchical structure, things are simple: feed the (scalar) input for L1 through a sigmoid to get B(L1) and sample it; then feed the (scalar) energy for L2 given L1=0/1 through a sigmoid; and so on.
In other words, the cost of computing the distribution collapses from 2^Q + α to Q.
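A sketch of the two sampling procedures side by side (assuming NumPy; `logits` and `node_logit` are hypothetical stand-ins for the network outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 8

def sample_softmax(logits):                    # plain sampling over 2^Q values
    e = np.exp(logits - logits.max())          # 2^Q energies e^x
    p = e / e.sum()                            # partition function over 2^Q terms
    return rng.choice(len(p), p=p)

def sample_hierarchical(node_logit):           # hierarchical sampling: Q sigmoids
    s = 0
    for _ in range(Q):
        p1 = 1.0 / (1.0 + np.exp(-node_logit(s)))  # B(Lk=1 | bits so far)
        s = (s << 1) | int(rng.random() < p1)      # descend the bit tree
    return s

print(sample_softmax(rng.standard_normal(2**Q)))
print(sample_hierarchical(lambda prefix: 0.01 * prefix))  # toy per-node logit
```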
And that is only the beginning; the biggest computation savings are in the FC part.
The DualFC (two single-layer FCs in parallel) costs 2 * NB * 2^Q, because it has to produce a 2^Q-dimensional vector of energies.
Under hierarchical sampling, however, each element of that vector is the value for B(Lk|L<k=b<k), i.e., for one specific pattern b<k of the higher bits.
So once the higher levels have been sampled, any element whose condition does not match (L<k != b<k) is never used and need not be computed at all[^7].
The FC cost therefore plummets from 2 * NB * 2^Q to 2 * NB * Q[^8]. For Q=8 that is 256 -> 8, down to 1/32, and of course the memory traffic also drops to 1/32.
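A sketch of that partial evaluation (the weight layout is my assumption: one row of W per tree node, stored in heap order, so only the Q rows on the sampled path are ever multiplied):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, NB = 8, 32
W = rng.standard_normal((2**Q, NB))    # rows 1..2^Q-1 hold the tree nodes (heap order)
bias = rng.standard_normal(2**Q)
h = rng.standard_normal(NB)            # GRUb output for this step

def sample_bits(W, bias, h):
    node = 1                                       # heap index of the root
    for _ in range(Q):
        logit = W[node] @ h + bias[node]           # compute ONLY this row: NB mults
        p1 = 1.0 / (1.0 + np.exp(-logit))          # B(Lk=1 | path so far)
        node = 2 * node + int(rng.random() < p1)   # go to left (0) or right (1) child
    return node - 2**Q                             # leaf index = the sample st

print(sample_bits(W, bias, h))  # touched Q=8 rows instead of 2^Q=256: 1/32 the work
```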
Temperature
The original LPCNet applies a temperature-like bias by tweaking the energies based on pitch.
With the hierarchy, the joint probabilities can no longer be biased in one sweep[^9] (trying to would mean giving up the benefit of partial evaluation), so instead each P(Lk|L<k) is biased by cutting it off at a threshold[^10].
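What biasing each branching decision could look like in practice (a sketch; the paper states the intent, making very low probability events impossible, but not this exact thresholding rule):

```python
def bias_branch(p1, threshold=0.15):
    """Clamp a branch probability B(Lk=1|L<k) so that a branch whose
    probability falls below the threshold becomes impossible.

    The hard-threshold form and the value 0.15 are illustrative assumptions.
    """
    if p1 < threshold:
        return 0.0        # the '1' branch becomes impossible
    if p1 > 1.0 - threshold:
        return 1.0        # the '0' branch becomes impossible
    return p1
```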
GRU capacity
“For small models, the complexity shifts away from the main GRU. For large models, the activation functions start taking an increasing fraction of the complexity, again suggesting that we can increase the density at little cost.” (from the paper)
Hierarchical sampling made the FC computation so small that the GRUb output size (NB) hardly matters anymore (though presumably some care is needed so that the unused FC weights are not pulled into L2).
GRU_A is no longer the bottleneck, so its sparsity is dialed down.
GRU_B has its size increased while sparsity is introduced.
As a result, the effective number of weights goes up.
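A quick parameter-count sketch of why the effective weight count can rise while staying within budget (the sizes and densities below are hypothetical, not the paper's exact settings):

```python
def gru_effective_weights(input_dim, units, density=1.0):
    # 3 gates; LPCNet-style pruning sparsifies the recurrent kernel only.
    recurrent = 3 * units * units * density
    dense_part = 3 * units * input_dim + 3 * units   # input kernel + bias
    return int(recurrent + dense_part)

# Hypothetical: raise GRU_A density a little, grow GRU_B under sparsity.
base = gru_effective_weights(128, 384, 0.10) + gru_effective_weights(384, 16)
prop = gru_effective_weights(128, 384, 0.15) + gru_effective_weights(384, 64, 0.25)
print(base, prop)  # the proposed variant ends up with more effective weights
```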
Demo
Methods
- task: Speaker-independent, Language-independent speech synthesis
- models
- B192/B384/B640 (Baseline model, h_GRUa = 192/384/640)
- P192/P384/P640 (Proposed model, h_GRUa = 192/384/640)
- Data
- Train: 205 hours of 16-kHz speech from a combination of TTS datasets [19, 20, 21, 22, 23, 24, 25, 26, 27] including more than 900 speakers in 34 languages and dialects
- To make the data more consistent, we ensure that all training samples have a negative polarity. This is done by estimating the skew of the residual, in a way similar to [28] (see the sketch after this list).
- Val
- PTDB-TUG (en, 10 male, 10 female)
- NTT: NTT Multi-Lingual Speech Database for Telephonometry (en-US, en, 8 male, 8 female, 12 samples per speaker)
- Evaluation
- Speed
- Quality
- measure: naturalness MOS
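A minimal sketch of the negative-polarity normalization mentioned under Data (the skew measure here is a crude proxy, not the exact residual-skew estimator of [28]):

```python
import numpy as np

def force_negative_polarity(x):
    """Flip the waveform when its skew is positive, so every training
    sample ends up with negative polarity. Sketch only: the third moment
    of the raw signal stands in for the residual-skew estimate of [28].
    """
    skew = np.mean(x.astype(np.float64) ** 3)
    return -x if skew > 0 else x
```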
Minor differences
FrameRateNetwork residual connection removed
The Res of the FrameRateNetwork's Res[Conv]-FC that existed in the original LPCNet has quietly vanished from Fig. 1 (the main text does not mention it).
The residual connection has also been removed in official LPCNet@master (see: tarepan/LPCNet - /training_tf2/lpcnet.py).
Conditioning input destinations made explicit
In the original LPCNet's Fig. 1, the conditioning f is drawn as if it feeds only GRUa.
In reality, in official LPCNet @0ddcda0, which the paper uses, f is also concatenated into GRUb's input.
Fig. 1 of this paper reflects that properly: f branches and feeds both GRUa and GRUb.
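In code terms the distinction looks roughly like this (a Keras sketch modeled on training_tf2/lpcnet.py; shapes and sizes are illustrative):

```python
from tensorflow.keras.layers import Input, Concatenate, GRU

# f is assumed to be already upsampled (repeated) to the sample rate.
cpcm  = Input(shape=(None, 3))     # sample-rate inputs, e.g. past sample, prediction, excitation
cfeat = Input(shape=(None, 128))   # conditioning f

gru_a = GRU(384, return_sequences=True)(Concatenate()([cpcm, cfeat]))   # f into GRUa
gru_b = GRU(16,  return_sequences=True)(Concatenate()([gru_a, cfeat]))  # f into GRUb too
```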
Filter augmentation
To be robust to recording environments, the spectrum is augmented with a random second-order filter[^11]; the formula is Eq. (7) of Valin (2018)[^12].
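A sketch of that augmentation (assuming SciPy; the coefficient range is my assumption, chosen to keep the filter stable, not the paper's exact values):

```python
import numpy as np
from scipy.signal import lfilter

def spectral_augment(x, rng, r=0.375):
    """Filter x through a random second-order (biquad) filter to vary the
    spectral tilt, in the spirit of Eq. (7) of Valin (2018).
    |coeffs| <= r < 0.5 keeps the denominator comfortably stable.
    """
    num = np.r_[1.0, rng.uniform(-r, r, 2)]   # 1 + b1 z^-1 + b2 z^-2
    den = np.r_[1.0, rng.uniform(-r, r, 2)]   # 1 + a1 z^-1 + a2 z^-2
    return lfilter(num, den, x)

x_aug = spectral_augment(np.random.default_rng(0).standard_normal(16000),
                         np.random.default_rng(1))
```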
Original Paper
@misc{2202.11169,
  Author = {Jean-Marc Valin and Umut Isik and Paris Smaragdis and Arvindh Krishnaswamy},
  Title = {Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet},
  Year = {2022},
  Eprint = {arXiv:2202.11169},
}
[^1]: “there is still an inherent tradeoff between synthesis quality and complexity.” from original paper
[^2]: Fig. 2 of Kanagawa & Ijima (2020). Lightweight LPCNet-based Neural Vocoder with Tensor Decomposition.
[^3]: “According to our analysis, the main performance bottleneck is the L2 cache bandwidth required for the matrix-vector products.” from Valin & Skoglund (2019). A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet.
[^4]: “This is compounded by the fact that these weights often do not fit in the L2 cache of CPUs.” from original paper
[^5]: “A secondary bottleneck includes about 2000 activation function evaluations per sample (for NA = 384).” from original paper
[^6]: “In this work, we improve on LPCNet with the goal of making it even more efficient in terms of quality/complexity tradeoff.” from original paper
[^7]: “Even though we still have 255 outputs in the last layer, we only need to sequentially compute 8 of them when sampling” from the paper
[^8]: “compute 8 of them when sampling, making the sampling O(log Q) instead of O(Q).” from the paper
[^9]: “With hierarchical sampling, we cannot directly manipulate individual sample probabilities.” from the paper
[^10]: “each branching decision is biased to render very low probability events impossible” from the paper
[^11]: “To ensure robustness against unseen recording environments, we apply random spectral augmentation filtering using a second-order filter” from the paper
[^12]: “as described in Eq. (7) of [15]” … [15] J.-M. Valin, “A hybrid DSP/deep learning approach to realtime full-band speech enhancement,” from the paper