We made LPCNet more efficient (x2.5~).
Background - the bottlenecks are known; resign yourself
LPCNet is already fast enough for real-time inference on a mobile CPU. Scaling it up would no doubt improve quality, but there is still room to improve quality under the speed constraint[^1]. Further efficiency gains are called for.
In the original LPCNet, the SamplingRateNet accounts for 98.2% of the computation time, and within it GRUa/GRUb/DualFC take 47.8%/26.9%/21.3%[^2].
Sparsification is applied only to GRUa, so there is room for improvement.
Moreover, the weight transfers that the SamplingRateNet triggers at every step saturate the L2 cache bandwidth, which is a bottleneck in itself[^3]; worse, the weights often do not even fit in the L2 cache, which presumably compounds the bottleneck[^4]. Hence techniques that shrink the weights are called for.
The second bottleneck is the activation functions: with NA = 384, about 2000 activation evaluations occur per step[^5]. Cheaper activation functions would mitigate this.
LPCNet can still get much faster. Resign yourself; it shall be made efficient[^6].
Proposed method
- Model improvements
  - Hierarchical Probability Distribution
  - Increasing second GRU capacity
- Computation improvements
Hierarchical Probability Distribution
Principle
A distribution-factorization technique inspired by the DualSoftmax of WaveRNN and the Bit bunching of Bunched LPCNet. It is formulated as

P(s) = Π_k B(Lk|L<k)
Here Lk is the basic building block: a bit that can take either value 0/1 once the higher bits L<k are fixed.
It is formalized as a per-bit Bernoulli distribution conditioned on the higher bits, B(Lk|L<k).
On top of this, the discrete value st is viewed as having a hierarchical bit structure.
That is, 0 ≤ st ≤ 2^Q - 1 (Q bits) can be seen as a Q-level bit tree (e.g., 5 with Q=3 is 1_0_1).
Looking at level k alone, a bit is simply 0 or 1, but its probability changes depending on whether the levels above it are 0 or 1; in other words, it is conditioned on the higher bits.
That is, the basic building block B(Lk|L<k) emerges from the hierarchical structure.
The probability of st is the joint probability over all its bits, so viewing it as a bit tree lets us factorize the probability distribution. That is, it reduces to

P(s) = B(L1|-) * B(L2|L1) * ... * B(LQ|L1, L2, ..., LQ-1) = Π_k B(Lk|L<k)
In summary, the Hierarchical Probability Distribution is a technique that models a probability distribution by viewing a discrete value as a bit tree and factorizing the joint probability into per-bit Bernoulli distributions B(Lk|L<k) conditioned on the higher bits.
For example, with Q = 3 bits, the probability of st = 5 is obtained as

P(st=5=1_0_1) = B(L1=1|-) * B(L2=0|L1=1) * B(L3=1|L1=1,L2=0)
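As a toy illustration of this factorization (a minimal sketch; the per-node probabilities below are made-up stand-ins for what the network would output at each tree node):

```python
def joint_prob(s, node_probs, Q=3):
    """P(s) as a product of per-bit Bernoulli factors B(Lk|L<k).

    node_probs maps a tuple of already-fixed higher bits to
    P(Lk=1 | that prefix); the values are hypothetical, standing in
    for the network's output at each node of the bit tree.
    """
    bits = [(s >> (Q - 1 - k)) & 1 for k in range(Q)]  # MSB first: 5 -> [1, 0, 1]
    p, prefix = 1.0, ()
    for b in bits:
        p1 = node_probs[prefix]          # P(Lk=1 | L<k = prefix)
        p *= p1 if b else 1.0 - p1       # pick the Bernoulli factor for this bit
        prefix += (b,)
    return p

# Hypothetical per-node probabilities for the Q=3 tree nodes we touch.
node_probs = {(): 0.6, (1,): 0.3, (1, 0): 0.8}
print(joint_prob(5, node_probs))  # B(1|-)*B(0|1)*B(1|1,0) = 0.6*0.7*0.8 = 0.336
```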
Advantages
Why bother with such a tedious (but straightforward) formulation? Because it slashes the amount of computation.
To sample from a Q-bit probability distribution the usual way, you must first compute 2^Q energies (e^x), then, as in softmax, take their sum (the partition function) and divide every element by it, and only then sample from the resulting distribution.
With the hierarchical structure, things are simple: feed the (scalar) input for L1 through a sigmoid to get B(L1) and sample it; then feed the (scalar) energy for L2 given L1=0/1 through a sigmoid; and so on.
In other words, the cost of computing the distribution collapses from 2^Q + α to Q.
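A sketch of the two sampling procedures side by side (assuming NumPy; `logits` and `node_logit` are hypothetical stand-ins for the network outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
Q = 8

def sample_softmax(logits):                    # plain sampling over 2^Q values
    e = np.exp(logits - logits.max())          # 2^Q energies e^x
    p = e / e.sum()                            # partition function over 2^Q terms
    return rng.choice(len(p), p=p)

def sample_hierarchical(node_logit):           # hierarchical sampling: Q sigmoids
    s = 0
    for _ in range(Q):
        p1 = 1.0 / (1.0 + np.exp(-node_logit(s)))  # B(Lk=1 | bits so far)
        s = (s << 1) | int(rng.random() < p1)      # descend the bit tree
    return s

print(sample_softmax(rng.standard_normal(2**Q)))
print(sample_hierarchical(lambda prefix: 0.01 * prefix))  # toy per-node logit
```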
And that is only the beginning; the biggest computation savings are in the FC part.
The DualFC (two single-layer FCs in parallel) costs 2 * NB * 2^Q, because it has to produce a 2^Q-dimensional vector of energies.
Under hierarchical sampling, however, each element of that vector is the value for B(Lk|L<k=b<k), i.e., for one specific pattern b<k of the higher bits.
So once the higher levels have been sampled, any element whose condition does not match (L<k != b<k) is never used and need not be computed at all[^7].
The FC cost therefore plummets from 2 * NB * 2^Q to 2 * NB * Q[^8]. For Q=8 that is 256 -> 8, down to 1/32, and of course the memory traffic also drops to 1/32.
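A sketch of that partial evaluation (the weight layout is my assumption: one row of W per tree node, stored in heap order, so only the Q rows on the sampled path are ever multiplied):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, NB = 8, 32
W = rng.standard_normal((2**Q, NB))    # rows 1..2^Q-1 hold the tree nodes (heap order)
bias = rng.standard_normal(2**Q)
h = rng.standard_normal(NB)            # GRUb output for this step

def sample_bits(W, bias, h):
    node = 1                                       # heap index of the root
    for _ in range(Q):
        logit = W[node] @ h + bias[node]           # compute ONLY this row: NB mults
        p1 = 1.0 / (1.0 + np.exp(-logit))          # B(Lk=1 | path so far)
        node = 2 * node + int(rng.random() < p1)   # go to left (0) or right (1) child
    return node - 2**Q                             # leaf index = the sample st

print(sample_bits(W, bias, h))  # touched Q=8 rows instead of 2^Q=256: 1/32 the work
```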
Temperature
The original LPCNet applies a temperature-like bias by tweaking the energies based on pitch.
With the hierarchy, the joint probabilities can no longer be biased in one sweep[^9] (trying to would mean giving up the benefit of partial evaluation), so instead each P(Lk|L<k) is biased by cutting it off at a threshold[^10].
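What biasing each branching decision could look like in practice (a sketch; the paper states the intent, making very low probability events impossible, but not this exact thresholding rule):

```python
def bias_branch(p1, threshold=0.15):
    """Clamp a branch probability B(Lk=1|L<k) so that a branch whose
    probability falls below the threshold becomes impossible.

    The hard-threshold form and the value 0.15 are illustrative assumptions.
    """
    if p1 < threshold:
        return 0.0        # the '1' branch becomes impossible
    if p1 > 1.0 - threshold:
        return 1.0        # the '0' branch becomes impossible
    return p1
```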
GRU capacity
“For small models, the complexity shifts away from the main GRU. For large models, the activation functions start taking an increasing fraction of the complexity, again suggesting that we can increase the density at little cost.” (from the paper)
Hierarchical sampling made the FC computation so small that the GRUb output size (NB) hardly matters anymore (though presumably some care is needed so that the unused FC weights are not pulled into L2).
GRU_A is no longer the bottleneck, so its sparsity is dialed down.
GRU_B has its size increased while sparsity is introduced.
As a result, the effective number of weights goes up.
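A quick parameter-count sketch of why the effective weight count can rise while staying within budget (the sizes and densities below are hypothetical, not the paper's exact settings):

```python
def gru_effective_weights(input_dim, units, density=1.0):
    # 3 gates; LPCNet-style pruning sparsifies the recurrent kernel only.
    recurrent = 3 * units * units * density
    dense_part = 3 * units * input_dim + 3 * units   # input kernel + bias
    return int(recurrent + dense_part)

# Hypothetical: raise GRU_A density a little, grow GRU_B under sparsity.
base = gru_effective_weights(128, 384, 0.10) + gru_effective_weights(384, 16)
prop = gru_effective_weights(128, 384, 0.15) + gru_effective_weights(384, 64, 0.25)
print(base, prop)  # the proposed variant ends up with more effective weights
```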
Demo
Methods
- task: Speaker-independent, Language-independent speech synthesis
- models
- B192/B384/B640 (Baseline model, h_GRUa = 192/384/640)
- P192/P384/P640 (Proposed model, h_GRUa = 192/384/640)
- Data
- Train: 205 hours of 16-kHz speech from a combination of TTS datasets [19, 20, 21, 22, 23, 24, 25, 26, 27] including more than 900 speakers in 34 languages and dialects
- To make the data more consistent, we ensure that all training samples have a negative polarity. This is done by estimating the skew of the residual, in a way similar to [28] (see the sketch after this list).
- Val
- PTDB-TUG (en, 10 male, 10 female)
- NTT: NTT Multi-Lingual Speech Database for Telephonometry (en-US, en, 8 male, 8 female, 12 samples per speaker)
- Evaluation
- Speed
- Quality
- measure: naturalness MOS
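A minimal sketch of the negative-polarity normalization mentioned under Data (the skew measure here is a crude proxy, not the exact residual-skew estimator of [28]):

```python
import numpy as np

def force_negative_polarity(x):
    """Flip the waveform when its skew is positive, so every training
    sample ends up with negative polarity. Sketch only: the third moment
    of the raw signal stands in for the residual-skew estimate of [28].
    """
    skew = np.mean(x.astype(np.float64) ** 3)
    return -x if skew > 0 else x
```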
Minor differences
FrameRateNetwork residual connection removed
The Res of the FrameRateNetwork's Res[Conv]-FC that existed in the original LPCNet has quietly vanished from Fig. 1 (the main text does not mention it).
The residual connection has also been removed in official LPCNet@master (see: tarepan/LPCNet - /training_tf2/lpcnet.py).
Conditioning input destinations made explicit
In the original LPCNet's Fig. 1, the conditioning f is drawn as if it feeds only GRUa.
In reality, in official LPCNet @0ddcda0, which the paper uses, f is also concatenated into GRUb's input.
Fig. 1 of this paper reflects that properly: f branches and feeds both GRUa and GRUb.
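In code terms the distinction looks roughly like this (a Keras sketch modeled on training_tf2/lpcnet.py; shapes and sizes are illustrative):

```python
from tensorflow.keras.layers import Input, Concatenate, GRU

# f is assumed to be already upsampled (repeated) to the sample rate.
cpcm  = Input(shape=(None, 3))     # sample-rate inputs, e.g. past sample, prediction, excitation
cfeat = Input(shape=(None, 128))   # conditioning f

gru_a = GRU(384, return_sequences=True)(Concatenate()([cpcm, cfeat]))   # f into GRUa
gru_b = GRU(16,  return_sequences=True)(Concatenate()([gru_a, cfeat]))  # f into GRUb too
```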
Filter augmentation
To be robust to recording environments, the spectrum is augmented with a random second-order filter[^11]; the formula is Eq. (7) of Valin (2018)[^12].
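A sketch of that augmentation (assuming SciPy; the coefficient range is my assumption, chosen to keep the filter stable, not the paper's exact values):

```python
import numpy as np
from scipy.signal import lfilter

def spectral_augment(x, rng, r=0.375):
    """Filter x through a random second-order (biquad) filter to vary the
    spectral tilt, in the spirit of Eq. (7) of Valin (2018).
    |coeffs| <= r < 0.5 keeps the denominator comfortably stable.
    """
    num = np.r_[1.0, rng.uniform(-r, r, 2)]   # 1 + b1 z^-1 + b2 z^-2
    den = np.r_[1.0, rng.uniform(-r, r, 2)]   # 1 + a1 z^-1 + a2 z^-2
    return lfilter(num, den, x)

x_aug = spectral_augment(np.random.default_rng(0).standard_normal(16000),
                         np.random.default_rng(1))
```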
Original Paper
@misc{2202.11169,
  Author = {Jean-Marc Valin and Umut Isik and Paris Smaragdis and Arvindh Krishnaswamy},
  Title = {Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet},
  Year = {2022},
  Eprint = {arXiv:2202.11169},
}
[^1]: “there is still an inherent tradeoff between synthesis quality and complexity.” from original paper
[^2]: Fig. 2 of Kanagawa & Ijima (2020). Lightweight LPCNet-based Neural Vocoder with Tensor Decomposition.
[^3]: “According to our analysis, the main performance bottleneck is the L2 cache bandwidth required for the matrix-vector products.” from Valin & Skoglund (2019). A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet.
[^4]: “This is compounded by the fact that these weights often do not fit in the L2 cache of CPUs.” from original paper
[^5]: “A secondary bottleneck includes about 2000 activation function evaluations per sample (for NA = 384).” from original paper
[^6]: “In this work, we improve on LPCNet with the goal of making it even more efficient in terms of quality/complexity tradeoff.” from original paper
[^7]: “Even though we still have 255 outputs in the last layer, we only need to sequentially compute 8 of them when sampling” from the paper
[^8]: “compute 8 of them when sampling, making the sampling O(log Q) instead of O(Q).” from the paper
[^9]: “With hierarchical sampling, we cannot directly manipulate individual sample probabilities.” from the paper
[^10]: “each branching decision is biased to render very low probability events impossible” from the paper
[^11]: “To ensure robustness against unseen recording environments, we apply random spectral augmentation filtering using a second-order filter” from the paper
[^12]: “as described in Eq. (7) of [15]” … [15] J.-M. Valin, “A hybrid DSP/deep learning approach to realtime full-band speech enhancement,” from the paper