Paper | Audio Samples | Colab Demo
This is the official codebase for *Coding Speech through Vocal Tract Kinematics*.
```bash
git clone https://github.com/cheoljun95/Speech-Articulatory-Coding.git
cd Speech-Articulatory-Coding
pip install -e .
```
```python
from sparc import load_model

coder = load_model("en", device="cpu")     # for using CPU
coder = load_model("en", device="cuda:0")  # for using GPU
```
For the pitch tracker, we found PENN to be fast at inference. You can enable it with `use_penn=True`; the default is torchcrepe.

```python
coder = load_model("en", device="cpu", use_penn=True)  # use PENN as the pitch tracker
```
The following model checkpoints are offered. You can replace `en` in `load_model` with another model name (`multi` or `en+`), as shown after the table below.
| Model | Language | Training Dataset |
|---|---|---|
| `en` | English | LibriTTS-R |
| `multi` | Multilingual | LibriTTS-R, Multilingual LibriSpeech, AISHELL, JVS, KSS |
| `en+` | English | LibriTTS-R, LibriTTS, EXPRESSO |
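For example, to load the multilingual checkpoint instead of the English one:

```python
coder = load_model("multi", device="cuda:0")  # multilingual checkpoint
```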
```python
code = coder.encode(WAV_FILE)                      # Single inference
codes = coder.encode([WAV_FILE1, WAV_FILE2, ...])  # Batched processing
```
The articulatory code outputs have the following format.
```python
# All features are sampled at 50 Hz except the speaker encoding
{"ema": (L, 12) array,  # 'TDX','TDY','TBX','TBY','TTX','TTY','LIX','LIY','ULX','ULY','LLX','LLY'
 "loudness": (L, 1) array,
 "pitch": (L, 1) array,
 "periodicity": (L, 1) array,  # auxiliary output of the pitch tracker
 "pitch_stats": (pitch mean, pitch std),
 "spk_emb": (spk_emb_dim,) array,  # all shared models use spk_emb_dim=64
 "ft_len": length of features,  # useful for batched processing with padding
}
```
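The entries can be inspected like ordinary arrays. A minimal sketch, assuming the values are NumPy arrays as listed above, that `ft_len` counts the valid frames (per the comment above), and that `sample.wav` is a hypothetical input file:

```python
code = coder.encode("sample.wav")  # hypothetical input file

# 12 EMA channels at 50 Hz: six articulator coordinates, X and Y each
print(code["ema"].shape)      # (L, 12)
print(code["pitch"].shape)    # (L, 1)
print(code["spk_emb"].shape)  # (64,) for the shared models

# With padded batched outputs, ft_len marks the number of valid frames
ema_valid = code["ema"][: code["ft_len"]]
```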
```python
wav = coder.decode(**code)  # resynthesize a waveform from the articulatory code
sr = coder.sr               # output sample rate of the decoder
```
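To write the resynthesized waveform to disk, any standard audio I/O library works; a minimal sketch using soundfile (an assumed extra dependency, not part of this package):

```python
import soundfile as sf

sf.write("resynth.wav", wav, coder.sr)  # save at the coder's sample rate
```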
```python
wav = coder.convert(SOURCE_WAV_FILE, TARGET_WAV_FILE)  # voice conversion
sr = coder.sr
```
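A usage sketch for conversion, assuming hypothetical file names and that `convert` returns a waveform at `coder.sr` like `decode`:

```python
import soundfile as sf

converted = coder.convert("source_utt.wav", "target_spk.wav")  # hypothetical files
sf.write("converted.wav", converted, coder.sr)
```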
Please check `notebooks/demo.ipynb` for a demonstration of these functions.
TODO:
- Add training code.
- Add PyPI installation.