eSpeak NG roadmap
This page describes the text tokenization and phonemization problems of eSpeak NG in a way that they could be solved in small iterations, through gradual changes and improvements.
To generate speech from text, the text has to pass through two translation steps, followed by sound generation:
- Tokenization of the text. This is a text-to-text translation which converts input text with numbers, abbreviations, etc. into the text as it would actually be read by a human. E.g. 100$ is translated to hundred dollars, June 1 to First of June, Ernst & Young to ernst and young, and so on.
- Phonemization. This is a text-to-phoneme translation, which converts the tokenized text into a written representation of the chain of sounds. E.g. hundred dollars becomes hˈʌndɹɪd dˈɒləz (in IPA notation), or h'VndrI2d d'0l3z in the X-SAMPA-like notation used in eSpeak NG.
  - Tone settings for tonal languages are set by specific phonemes, which change prosody data (tone) on the fly, thus making tonal languages completely different from other languages (completely different phonemes are used for tonal and non-tonal languages).
- Generation of sound, by handling the stream of phonemes and additional speech characteristics (prosody such as pitch, volume, etc., which is calculated mostly from punctuation such as commas and full stops).

The current implementation of eSpeak NG performs text tokenization and text-to-phoneme translation at once, by advancing a sliding pointer inside the original text and doing tokenization and text-to-phoneme translation just by looking around (forward or backward from) the current letter.
eSpeak NG was built as a simple engine for the English language only and was then extended to other languages and scripts. However, it still lacks handy features which would make text-to-speech translation much easier for syllabic, hieroglyphic and tonal languages. Because of that bottom-up extension, several accidental complexities have appeared, which could be simplified by reorganizing the entire translation approach.
The main problems in the current implementation are the following:
- Text processing is not fully aware of Unicode letters. Most characters are handled just as streams of bytes, which is fine until you need to know by how many bytes to advance the pointer once a letter has been processed. For that, custom-made functions such as utf8_in are used, which internally use a fixed-length (4-byte, effectively UTF-32) representation of characters and then convert back to the number of bytes (in most cases, stored as UTF-8). A minimal sketch of such decoding is shown after this list.
- Tokenization is done only at word boundaries and only around (after or before) the current sliding pointer. Therefore it is not possible to do tokenization in the scope of a sentence (a clause in eSpeak NG terms), or in an even bigger scope. Because of that, it is not possible to tokenize by context, e.g. for ordinal numbers, or to translate numbers with different endings for gender (for synthetic languages which use inflections).
- Checking for a part or the end of a sentence is mostly oriented towards alphabetic languages, where full stop and comma are used and word boundaries are marked by spaces. This is not well suited for other writing systems: syllabic, abjad or abugida (syllable oriented) scripts, or hieroglyphic languages, where word boundaries are not shown or sentence boundaries are marked with other symbols.
- The .replace rule works only on the current character and is limited to 4 bytes in total (2 ASCII letters to 2 ASCII letters, or 1 diacritized letter to 1 diacritized letter, or a combination of them).
- Prosody settings flags can be checked and/or set only for a predefined set of words in the list around the current pointer.
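To illustrate the first point, the kind of variable-length, code-point-oriented decoding that utf8_in has to perform could look roughly like the following minimal sketch (the function name utf8_decode and its exact behaviour are assumptions for illustration, not the existing eSpeak NG API; continuation bytes are not validated here):

#include <stdint.h>

/* Decode one UTF-8 sequence into a code point and return the number of
 * bytes consumed, so the caller knows how far to advance its pointer.
 * Illustrative sketch only; invalid continuation bytes are not checked. */
static int utf8_decode(const unsigned char *s, uint32_t *out)
{
    if (s[0] < 0x80) {                      /* 1 byte: ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 2-byte sequence */
        *out = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 3-byte sequence */
        *out = ((uint32_t)(s[0] & 0x0F) << 12) | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 4-byte sequence */
        *out = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
             | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
        return 4;
    }
    *out = 0xFFFD;                          /* invalid lead byte: replacement character */
    return 1;
}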
The implementation should be fully Unicode-aware (in practice, UTF-8), using ucd-tools. Translation of emoji should be generated from CLDR data.
Tokenization, phonemization and handling of prosody data should be implemented as three different steps. It should be possible to provide input to and get output from each of the steps, for debugging purposes and for integration with other text-to-speech tools.
Tokenization should be done (at least) in the scope of a sentence, and it should support translation from a pattern-matched string to another string. It should be possible to mark a word matching some pattern with a custom flag, e.g. to mark German nouns.
For prosody data, it should be possible to set general language settings (language/accent, voice, speed, etc.) before handling of a sentence is started. Other settings could then be added or changed for a particular token. It should be possible to set tone for any regular language in the tokenization or phonemization step by adding tone letters and SSML tags. For example, Latvian is generally not considered a tonal language (because tone does not carry a difference in meaning), but in reality long vowels are spoken in a lower tone than short vowels, and the language also has broken, rising and falling tone intonations.
There should be one superset of all phonemes, so that with one voice (at least as a general approximation) phonemization of any language is possible.
It should be possible to generate phoneme data by exporting it from a standard version of Praat.
Translating using a different language should be set in a configuration file and should allow using an accent of a language (e.g. for Cyrillic text in Latvian, switch to _^_ru-LV instead of just _^_ru). This switch should be written in the prosody data for the entire token (otherwise the switch is inconsistent between normal words and abbreviations with numbers in them).
Handling of SSML tags should write them directly into the proper structures of the token data; they should not be mixed with the content.
As tokenization for languages like Arabic is very tricky and heavily depends on context, there should be a simple setting (a command line parameter) to tell eSpeak NG that the text is already tokenized and/or phonemized. (Something similar is currently done for input in square brackets, like espeak-ng -ven "[[h@loU]], [[w'3:ld]]!", where phonemes are written inside double square brackets, but prosody data still has to be written outside them.)
There should be an easy way to provide a map (dynamically by parameters or in a configuration file) to translate input phonemes into the phonemes actually used by eSpeak NG (similarly to what is already done for the MBROLA integration).
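A minimal sketch of how such a map could be represented internally (the type and function names and the table contents below are illustrative assumptions, not the existing MBROLA mapping code):

#include <string.h>

/* Hypothetical phoneme-name map: translate a phoneme symbol coming from an
 * external phonemizer into the symbol used by eSpeak NG. Table contents are
 * made-up examples. */
typedef struct {
    const char *from;   /* symbol as provided in the input */
    const char *to;     /* symbol as used by eSpeak NG */
} PhonemeMapEntry;

static const PhonemeMapEntry phoneme_map[] = {
    { "AA", "A:" },
    { "IY", "i:" },
    { "NG", "N"  },
    { NULL, NULL }      /* end-of-table marker */
};

static const char *map_phoneme(const char *symbol)
{
    for (const PhonemeMapEntry *e = phoneme_map; e->from != NULL; e++) {
        if (strcmp(e->from, symbol) == 0)
            return e->to;
    }
    return symbol;      /* no mapping found: pass the symbol through unchanged */
}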
All (or at least most) of the settings in the tr_languages.c file should be set from a text configuration file. Configuration files should allow an include statement and overriding only some of the settings taken from the included file (as is currently done in the C code).
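A hypothetical sketch of what such a configuration file could look like (the file names, the include statement and all attribute names below are assumptions used for illustration, not existing eSpeak NG syntax):

// hypothetical lv.conf -- all names below are illustrative only
include baltic-defaults.conf     // take common settings from the included file

// override only selected settings from the included file
stress-rule      first-syllable
ordinal-marker   "."
decimal-comma    yes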
There should be an easy way to pass input text to an external tokenizer (e.g. Mishkal for Arabic) and to get the tokenized (and/or phonemized) text back before passing it to speech synthesis.
It should be possible to add custom flags to a token in a particular language, which could be used for phonemization decisions (e.g. gender/inflection of the following/previous word).
Other specific problems noted in the issue tracker, which should be solved along the way:
- Language analysis improvements
- Improve the Arabic support
- Restructure ReadClause and TranslateClause to better handle Japanese, emoji and other things
- Other issues marked with a language tag
A sliding pointer in a buffer is fast, but has many limitations for tokenization and phonemization. A more feature-rich solution, which may cause some performance and memory problems, would be to split tokenization, phonemization and speech synthesis into three separate, consecutive steps (possibly working in three different threads if necessary).
In the tokenization step, characters are read up to the end-of-sentence marker, which should be parametrized in the configuration file (if set, it overrides the default settings). Then the current sentence is read from start to end and at each particular point the tokenization rules are applied. This is similar to the current model, with the only difference that:
- Tokenization is still a text-to-text translation, but the matching scope is the entire sentence.
- New replacement rules (e.g. a new .substitute rule in the _rules file) can be used for strings of arbitrary length. Replacement rules should allow patterns (e.g. regular expressions, probably using an external C library); a hypothetical sketch of such a rule is shown after this list.
- At the end, tokenization is rewound, and those parts of the input text which were not consumed by any tokenization rule are simply copied without changes into the proper places in the list of tokens.
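For illustration, a .substitute section in a _rules file might look like the sketch below (this is a proposal sketch only; neither the rule name nor the pattern syntax exists in eSpeak NG today):

// hypothetical .substitute section in en_rules -- proposed, not implemented
.substitute
    ([0-9]+)\$          \1 dollars         // 100$ -> 100 dollars
    Ernst & Young       ernst and young    // plain string replacement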
If the current position matches some tokenization rule (e.g. a number is found), the text of the matching pattern is translated according to the tokenization rules and written into a new token element with the following structure (this is just a proof of concept; it may be more complicated if really needed, see e.g. ProcessSsmlTag):
typedef enum TokenType {
    phoneme,
    word,
    clause,      // several words as part of a sentence
    sentence,
    paragraph
} TokenType;

typedef struct Prosody {
    // TODO: should reuse as much as possible from the existing implementation;
    // need to develop the borderline between additional information for the
    // written presentation and for the phonetic presentation
} Prosody;

struct Token {
    struct Token* prev;  // link to previous token
    struct Token* next;  // link to next token
    char* start;         // start of the input text represented by this token
    char* end;           // end of the input text represented by this token
    TokenType type;      // what part of a sentence this token is
    int flags;           // bitmask of flags to mark question, exclamation etc.
    Prosody* prosody;    // language, voice, pitch and volume envelopes, filled by the phonemization step
    char* text;          // tokenized text in written form
    char* phonemes;      // tokenized text in phonetic form
};
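As a usage illustration of the structures above, the tokenization step could append tokens to a doubly linked list roughly as follows (a minimal sketch; the helper name append_token and its behaviour are assumptions, not existing code):

#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: create a token for the input span [start, end),
 * store its tokenized written form and link it at the end of the list.
 * Returns the new tail of the list. */
static struct Token *append_token(struct Token *tail, TokenType type,
                                  char *start, char *end, const char *text)
{
    struct Token *t = calloc(1, sizeof(*t));
    if (t == NULL)
        return tail;             /* out of memory: keep the old tail */

    t->type = type;
    t->start = start;
    t->end = end;
    t->text = strdup(text);      /* tokenized written form, e.g. "hundred dollars" */
    t->prev = tail;
    if (tail != NULL)
        tail->next = t;
    return t;
}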
If the text is already tokenized (e.g. an XML file with SSML is passed), then the XML elements are written directly into the token structures.
In general, phonemization works mostly as it does now: it replaces written text with a representation of phonemes. The differences are the following:
- What was previously "on-the-fly" tokenization, like TranslateNumber() and similar functions, is performed earlier, in the separate tokenization step.
- During phonemization, instead of reading the original text from a buffer, the phonemizer traverses the list of tokens. As the current phonemization rules are mostly aware only of word boundaries, in most cases it will read the value of the text element and, together with additional prosody data, flags etc., will write phonemes into the phonemes element. Prosody data may already be set for some tokens, e.g. by SSML.
- Phonemization can set prosody settings (in most cases tone settings) for a particular token according to the phonemization rules. These prosody rules for tone phonemes should be configurable in the configuration file. (E.g. for tonal languages the low/high tone usually differs by a fifth, but in Latvian the low tone for long vowels is a prime or a second, and a third for the broken tone.)
The phonemes element is filled in directly in the token elements if already phonemized text is provided as input (e.g. [[aba]] from standard input or a parameter).
Speech synthesis traverses the list of tokens and generates speech from the phonemes element, considering the prosody settings of each token.
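A minimal sketch of that final traversal (the functions apply_prosody and synthesize_phonemes are illustrative placeholders for hooks into the existing synthesis code, not existing eSpeak NG functions):

/* Hypothetical hooks into the existing synthesis code. */
void apply_prosody(const Prosody *prosody);
void synthesize_phonemes(const char *phonemes, int flags);

/* Walk the token list produced by the tokenization and phonemization steps
 * and hand each token's phonemes to the synthesizer together with its
 * prosody settings. Illustrative sketch only. */
static void synthesize_tokens(const struct Token *head)
{
    for (const struct Token *t = head; t != NULL; t = t->next) {
        if (t->phonemes == NULL)
            continue;                      /* nothing to speak for this token */
        if (t->prosody != NULL)
            apply_prosody(t->prosody);     /* language, voice, pitch, volume */
        synthesize_phonemes(t->phonemes, t->flags);
    }
}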
- Add a command line argument/parameter for espeak-ng (-t for test or -e for experimental) which enables the experimental features (and flow of data) for the new translation from text to speech.
- Develop a data model in which tokenization, phonemization and prosody settings can be easily split apart, e.g. what to put in the token type (what is still text oriented) and what in the prosody data (which is sound oriented). Consider how extensible vs. simple the data model should be (e.g. fixed-length char arrays vs. pointers to variable-length arrays of chars).
- Review the current implementation and find places where parts of the new implementation can be implemented and plugged into the existing implementation (e.g. the ReadClause, WordToString etc. methods). Probably, to prepare it for replacement, some of the existing method calls will need to be changed.
- Decide how much of the current tokenization/phonemization rules should be preserved in their current form (e.g. if they can be transformed with a script, it is not such a serious problem).
- Decide whether it is necessary to care about speed for the multi-step tokenization-phonemization-synthesis approach (e.g. whether it is a real problem to allow regular expressions in tokenization rules).
- Find cases where more tests need to be developed before starting to rewrite (e.g. one tricky case would be an XML file where SSML changes the person and language during some kind of dialogue between two persons in two languages).
- Prepare a TODO list which can be implemented gradually, function by function, in parallel to the existing implementation or as a separate project, and then plugged back into the proper places of the existing solution.