gnuspeech

What is gnuspeech?

gnuspeech makes it easy to produce high quality computer speech output, design new language databases, and create controlled speech stimuli for psychophysical experiments. gnuspeechsa is a cross-platform module of gnuspeech that allows command-line or application-based speech output. The software has been released as two tarballs that are available in the project Downloads area of https://savannah.gnu.org/projects/gnuspeech. Those wishing to contribute to the project will find the OS X (gnuspeech) and CMake (gnuspeechsa) sources in the Git repository on that same page.

The gnuspeech suite still lacks some of the database editing components (see the Overview diagram below) but is otherwise complete and working, allowing articulatory speech synthesis of English, with control of intonation and tempo, and the ability to view the parameter tracks and intonation contours generated. The intonation contours may be edited in various ways, as described in the Monet manual. Monet provides interactive access to the synthesis controls. TRAcT provides interactive access to the underlying tube resonance model that converts the parameters into sound by emulating the human vocal tract.

The suite of programs uses a true articulatory model of the vocal tract and incorporates models of English rhythm and intonation based on extensive research that sets a new standard for synthetic speech.

The original NeXT computer implementation is complete, and is available from the NeXT branch of the SVN repository linked above. The port to GNU/Linux under GNUStep, also in the SVN repository under the appropriate branch, provides English text-to-speech capability, but parts of the database creation tools are still in the process of being ported.

Credits for research and implementation of the gnuspeech system appear in the section “Thanks to those who have helped” below. Some of the features of gnuspeech, and the tools that are part of the software suite, include:

A Tube Resonance Model (TRM) for the human vocal tract (also known as a transmission-line analog, or a waveguide model) that truly represents the physical properties of the tract, including the energy balance between the nasal and oral cavities as well as the radiation impedance at lips and nose.
A TRM Control Model, based on formant sensitivity analysis, that provides a simple but accurate method of low-level articulatory control. By using the Distinctive Region Model (DRM), only eight slowly varying tube section radii need be specified. The glottal (vocal fold) waveform and various suitably “coloured” random noise signals may be injected at appropriate places to provide voicing, aspiration, frication and noise bursts.
Databases which specify: the characteristics of the articulatory postures (which loosely correspond to phonemes); rules for combinations of postures; and information about voicing, frication and aspiration. These are the data required to produce specific spoken languages from an augmented phonetic input. Currently, only the database for the English language exists, though French vowel postures are also included.
A text-to-augmented-phonetics conversion module (the Parser) to convert arbitrary text, preferably incorporating normal punctuation, into the form required for applying the synthesis methods.
Models of English rhythm and intonation, based on extensive research, that are automatically applied.
“Monet”—a database creation and editing system, with a carefully designed graphical user interface (GUI) that allows the databases containing the necessary phonetic data and dynamic rules to be set up and modified in order that the computer can “speak” arbitrary languages.
A 70,000+ word English Pronouncing Dictionary with rules for derivatives such as plurals and adverbs, and including 6000 given names. The dictionary also provides part-of-speech information to facilitate later addition of grammatical parsing that can further improve the excellent pronunciation, rhythm and intonation.
Sub-dictionaries that allow different user- or application-specific pronunciations to be substituted for the default pronunciations coming from the main dictionary (not yet ported).
Letter-to-sound rules to deal with words that are not in the dictionaries.
A parser to organise the input and deal with dates, numbers, abbreviations, etc.
Tools for managing the dictionary and carrying out analysis of speech.
“Synthesizer”—a GUI-based application to allow experimentation with a stand-alone TRM. All parameters, both static and dynamic, may be varied and the output can be monitored and analysed. It is an important component in the research needed to create the databases for target languages.
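
As a rough illustration of how a tube model of the kind listed above can be driven from a small set of section radii, the sketch below computes the reflection coefficient at each junction between adjacent tube sections, the standard quantity in waveguide (Kelly-Lochbaum style) vocal tract models. It is not taken from the gnuspeech sources: the function names, the sign convention and the radius values are all assumptions made for illustration only.

```python
import math


def section_areas(radii_cm):
    """Cross-sectional area (cm^2) of each cylindrical tube section."""
    return [math.pi * r * r for r in radii_cm]


def reflection_coefficients(radii_cm):
    """Reflection coefficient at each junction between adjacent sections.

    Assumed convention (hypothetical, not gnuspeech's): for a wave
    travelling from a section of area a1 into one of area a2,
    k = (a1 - a2) / (a1 + a2). Other texts use the opposite sign.
    """
    areas = section_areas(radii_cm)
    return [(a1 - a2) / (a1 + a2) for a1, a2 in zip(areas, areas[1:])]


# Eight illustrative radii (cm); these values are made up, not database postures.
example_radii = [0.8, 0.5, 0.9, 1.2, 1.4, 1.0, 0.7, 0.9]
example_coefficients = reflection_coefficients(example_radii)
```

Eight sections give seven junctions, and a uniform tube gives a coefficient of zero at every junction, meaning no energy is reflected inside the tube. A full model would also need the glottal source, the nasal branch and the radiation impedance at the lips and nose, none of which is sketched here.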

Overview of the main Articulatory Speech Synthesis System

Why is it called gnuspeech?

It is a play on words. This is a new (g-nu) “event-based” approach to speech synthesis from text, that uses an accurate articulatory model rather than a formant-based approximation. It is also a GNU project, aimed at providing high quality text-to-speech output for GNU/Linux, Mac OS X, and other platforms. In addition, it provides comprehensive tools for psychophysical and linguistic experiments as well as for creating the databases for arbitrary languages.

What is the goal of the gnuspeech project?

The goal of the project is to create the best speech synthesis software on the planet.

Releases

The first official release has now been made, as of October 14th 2015. Additional material is available for GNUStep, Mac OS X and NeXT (NeXTSTEP 3.0), for anonymous download from the project SVN repository (https://savannah.gnu.org/projects/gnuspeech). All provide text-to-speech capability. For GNUStep and OS X the database creation and inspection tools (such as TRAcT) can be used as intended, but work remains to be done to complete the database creation components of Monet that are needed for psychophysical/linguistic experiments, and for setting up new languages. The most recent SVN repository material has now been migrated to a Git repository on the savannah site, while the older material remains in the SVN repository. These repositories also provide the source for project members who continue to work on development. New members are welcome.

Development & “Coming Soon”

It would be a good idea for those interested in the work to join the mailing list, to provide some feedback, ask questions, work on the project, and so on.

Helpers and users can join the project mailing list by visiting the subscription page (https://lists.gnu.org/mailman/listinfo/gnuspeech-contact), and can then send mail to the group. Offers of help receive special attention! :-)

A brief technical history of gnuspeech, incorporating the current state

The full project implementation history is described in a separate page (https://www.gnu.org/software/gnuspeech/project-history.html) to avoid overloading this one.

In summary, much of the core software has been ported to the Mac under OS X, and to GNU/Linux under GNUStep. All current sources and builds are in the Git repository, though older material, including the GNU/Linux/GNUStep and NeXT implementations, is only in the SVN repository. Speech may be produced from input text. The development facilities for managing and creating new language databases, or for modifying the existing English database for text-to-speech, are incomplete, but mainly require only the file-writing components. Monet provides the tools needed for psychophysical and linguistic experiments. TRAcT provides direct access to the tube model.

Obtaining gnuspeech

gnuspeech is currently fully available as a NeXTSTEP 3.x version in the SVN repository, along with the GNU/Linux/GNUStep version, which is incomplete though functional. Passwords for the original NeXT version (user and developer) are available in the “private” file in the NeXT branch. Tarballs for the initial release versions of gnuspeech and gnuspeechsa are available from the Downloads area of the savannah project page (https://savannah.gnu.org/projects/gnuspeech).

The original NeXT User and Developer Kits are complete, but do not run under OS X or under GNUStep on GNU/Linux. They also suffer from the limitations of a slow machine, so that shorter TRM lengths (< ~15 cm) cannot be used in real time, though the software synthesis option allows this restriction to be avoided. To activate the NeXT kits, select any password from the very large selection provided in the file “nextstep / trunk / priv / SerialNumbers”, such as “bb976d4a” for User 26 or “ebe22748” for Dev 15. In fact, you can use these passwords. But you need a NeXT computer, of course; try Black Hole, Inc. (https://www.blackholeinc.com) if you'd like one.

Getting Help with gnuspeech

Developers should contact the authors/developers through the GNU project facilities (https://savannah.gnu.org/projects/gnuspeech). To join the project mailing list, you can go directly to the subscription page (https://lists.gnu.org/mailman/listinfo/gnuspeech-contact). Papers and manuals are available on-line (see below).

Manuals and papers

A number of papers and manuals relevant to gnuspeech exist:

The manuals for Monet and TRAcT are included in the gnuspeech tarball for downloading the first release.
A paper presented at the American Voice I/O Society conference in 1995 provides a reasonably detailed explanation of the theory underlying the tube resonance model.

A heavily cross-referenced “conceptionary” is available to provide access to some of the background terms and research in the relevant scientific fields.

A guide to the pronunciation notation used in the text-to-speech work, showing the relationship between standard forms (IPA, Webster's) and the ASCII-friendly form used in gnuspeech, with examples of actual pronunciations. This is also included in the new manual.

The Tube Resonance Model: a write-up of the waveguide model of the acoustic tubes that form the underlying model of the human vocal apparatus.

Additional material, including sound files and a manual for the original NeXT version of Monet, is also available on Professor Hill's university web site.

Papers related to the research that has led to gnuspeech are also collected on Professor Hill's university web site. These include the development of the “event-based” approach to speech synthesis, which is also applicable to speech recognition.

Some examples of the papers by other researchers that helped us in developing gnuspeech include:

Carré, R. and Mrayati, M. (1992) Distinctive regions in acoustic tubes. Speech production modelling. J. Acoustique 5, 141-151
Fant, G. & Pauli, S. (1974) Spatial characteristics of vocal tract resonance models. Proc. Stockholm Speech Communication Seminar, KTH, Stockholm, Sweden.
Smith, J.O. (1992) Physical modelling using digital waveguides. Computer Music Journal, 16 (4) 74-91
Cook, P.R. (1989) Synthesis of the singing voice using a physically parameterised model of the human vocal tract. International Computer Music Conference, Columbus, Ohio.
Liberman, A.M., Ingemann, F., Lisker, L., Delattre, P. & Cooper, F.S. (1959) Minimal rules for synthesising speech. J. Acoust. Soc. Amer. 31 (11), 1490-1499, Nov
’t Hart, J. & Cohen, A. (1973). Intonation by rule: a perceptual quest. Journal of Phonetics, 1 (4), 309-327.
Wells, J.C. (1963) A study of the formants of the pure vowels of British English, Progress report for July, University College, London.

but there are far too many to list them all. Further papers may be found in the citations incorporated in the relevant papers noted above and/or listed on David Hill's university web site (https://pages.cpsc.ucalgary.ca/~hill).

Further information?

See the section on Manuals and papers.

How to help with gnuspeech

To contact the maintainers of gnuspeech, report a bug, contribute fixes or improvements, join the development team, or join the gnuspeech mailing list, please visit the gnuspeech project page (https://savannah.gnu.org/projects/gnuspeech) and use the facilities provided. The mailing list can be accessed under the section “Communication Tools”. To help with the project work you can also contact Professor David Hill (hilld-at-ucalgary-dot-ca) directly.

Thanks to those who have helped

The research that provides the foundation of the system was carried out in research departments in France, Sweden, Poland, and Canada and is ongoing. The original system was commercialised by a now-liquidated University of Calgary spin-off company, Trillium Sound Research Inc. All the software has subsequently been donated by its creators to the Free Software Foundation, forming the basis of the GNU Project gnuspeech. It is freely available under a General Public Licence, as described herein.

Many people have contributed to the work, either directly on the project, or indirectly through relevant research. The latter appear in the citations to the papers referenced above. Of particular note are Perry Cook & Julius Smith (Center for Computer Research in Music and Acoustics) for the waveguide model and the DSP Music Kit), René Carré (at the D