Chinese speech synthesis

From Wikipedia, the free encyclopedia

Chinese speech synthesis is the application of speech synthesis to the Chinese language (usually Standard Mandarin). It poses additional difficulties because of Chinese characters (which frequently have different pronunciations in different contexts), the complex prosody that is essential to conveying the meaning of words, the more frequent occurrence of unexpected or unusual combinations of syllables, and sometimes the difficulty of obtaining agreement among native speakers about which pronunciation is 'correct'.
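The point about context-dependent pronunciations can be illustrated with a small sketch. It assumes a toy lexicon and a greedy longest-match lookup; the dictionaries, the function name `to_pinyin` and the numbered-pinyin notation are illustrative assumptions and are not taken from any of the synthesisers described in this article.

```python
# Toy word lexicon (an assumption for illustration): readings of whole words.
# The character 行 is read "hang2" in 银行 (bank) but "xing2" in 行走 (to walk);
# 乐 is read "yue4" in 音乐 (music) but "le4" in 快乐 (happy).
WORD_READINGS = {
    "银行": ["yin2", "hang2"],
    "行走": ["xing2", "zou3"],
    "音乐": ["yin1", "yue4"],
    "快乐": ["kuai4", "le4"],
}

# Fallback reading for a character that is not covered by a known word.
CHAR_READINGS = {
    "行": "xing2", "银": "yin2", "走": "zou3", "乐": "le4",
    "音": "yin1", "快": "kuai4", "我": "wo3", "去": "qu4",
}


def to_pinyin(text, max_word_len=4):
    """Convert text to numbered pinyin, preferring the longest known word."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate word first, down to two characters.
        for length in range(min(max_word_len, len(text) - i), 1, -1):
            word = text[i:i + length]
            if word in WORD_READINGS:
                result.extend(WORD_READINGS[word])
                i += length
                break
        else:
            # No multi-character word matched: use the default reading.
            result.append(CHAR_READINGS.get(text[i], "?"))
            i += 1
    return result


print(to_pinyin("我去银行"))
print(to_pinyin("行走"))
```

A real system would need a far larger lexicon and some handling of segmentation ambiguity; the sketch only shows why a per-character lookup is not sufficient.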

Approaches taken

iflytek (formerly Ifly Info Tech) published a W3C paper [1] in which they adapted Speech Synthesis Markup Language to produce a dialect called Chinese Speech Synthesis Markup Language (CSSML), which can include additional markup to clarify the pronunciation of characters and to add some prosody information. Their synthesiser takes a "corpus-based" approach, which means it can sound very natural in most cases but can falter on awkward or unusual phrases that cannot be matched against the corpus. iflytek does not disclose the amount of data involved, but it can be gauged from the commercial products that license their technology; for example, Bider's SpeechPlus[2] is a 1.3 Gigabyte download, 1.2 Gigabytes of which is highly compressed data for a single Chinese voice. iflytek's synthesiser can also synthesise mixed Chinese and English text with the same voice (e.g. Chinese sentences containing some English words); they claim their English synthesis to be "average".

A corpus-based approach is also taken by Tsinghua University's SinoSonic[3], with the Harbin voice data taking 800 Megabytes. As of 2006, the download link for SinoSonic had not yet been activated.

A less complex approach is taken by cjkware.com's KeyTip Putonghua Reader[4], which contains 120 Megabytes of sound recordings (GSM-compressed to 40 Megabytes in the evaluation version), comprising 10,000 multi-syllable dictionary words plus single-syllable recordings in 6 different prosodies (4 tones, neutral tone, and an extra third-tone recording for use at the end of a phrase). These recordings can be concatenated in any desired combination, but the joins sound forced (as is usual for simple concatenation-based speech synthesis) and this can severely affect prosody; the synthesiser is also inflexible in terms of speed and expression. However, because this synthesiser does not rely on a corpus, there is no noticeable degradation in performance when it is given more unusual or awkward phrases.
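Simple concatenation of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not KeyTip's actual algorithm: the greedy longest-match rule, the unit labels, the tiny `DICTIONARY_WORDS` set and the function name `select_units` are all hypothetical. Syllables are written in numbered pinyin, with 5 standing for the neutral tone.

```python
# Stand-in for the recorded multi-syllable dictionary words (hypothetical entries).
DICTIONARY_WORDS = {
    ("ni3", "hao3"),
    ("zhong1", "guo2"),
}
MAX_WORD_SYLLABLES = 4  # assumed upper bound on the length of a recorded word


def select_units(phrase):
    """Choose recordings for a phrase given as a list of numbered-pinyin syllables.

    Whole-word recordings are preferred (longest match first, an assumption);
    otherwise a single-syllable recording is used, with the extra third-tone
    variant chosen when a third-tone syllable ends the phrase.
    """
    units = []
    i = 0
    n = len(phrase)
    while i < n:
        matched = False
        for length in range(min(MAX_WORD_SYLLABLES, n - i), 1, -1):
            word = tuple(phrase[i:i + length])
            if word in DICTIONARY_WORDS:
                units.append("word:" + "_".join(word))
                i += length
                matched = True
                break
        if not matched:
            syllable = phrase[i]
            if syllable.endswith("3") and i == n - 1:
                # Phrase-final third tone has its own recording.
                units.append("syl:" + syllable + "_final")
            else:
                units.append("syl:" + syllable)
            i += 1
    return units


print(select_units(["wo3", "ai4", "zhong1", "guo2"]))
print(select_units(["ni3", "hao3", "wo3"]))
```

Because every syllable has a fallback recording, any input yields some output, which is consistent with the observation that unusual phrases do not degrade this kind of synthesiser; the cost is that each boundary between selected units is a potential audible join.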

Concatenation with a larger amount of recorded data (about 500 Megabytes) is apparently used by NeoSpeech's SAPI 5 voices "Lily" and "Wang"[5].

Non-Windows systems

The above-mentioned Chinese speech synthesis systems are available only for Windows, although iflytek also have a demonstration that can be used from their website. However, the spaced-interval repetition language-practice program Gradint[6] includes code and instructions for using KeyTip and SpeechPlus data on other operating systems, either by reading the data directly or by using the WINE emulator.

It is not yet possible to run SAPI 5-based speech synthesisers on the WINE emulator.

Mac OS had Chinese speech synthesisers available up to version 9; they were removed in Mac OS X but are scheduled to return in version 10.5, according to Apple's website.

Notable approaches not yet taken

As of 2006, it appears that there have been no projects to synthesise Chinese by simulating the human vocal tract, as GNU Speech[7] is doing for English. Chinese is also notably missing from the extensively multilingual MBROLA project.