Speech and Vision Lab

Approximate matching of a syllable and use of global syllable set for Text-to-Speech in Indian Languages
Research Area: Speech Analysis
Year: 2009
Type of Publication: Master's thesis
Keywords: Speech synthesis, unit size, syllable, approximate matching, global syllable set
Authors: Veera Raghavendra Elluru  
A text-to-speech system converts given text into its corresponding spoken form. A widely used approach to building text-to-speech systems is based on the concatenation of speech segments and is often referred to as concatenative synthesis. This method uses prerecorded speech units, which preserve the co-articulation and prosody of the spoken language. The quality of the synthetic speech is thus a direct function of the available units, making the choice of unit size an important issue. For good quality synthesis, all the units of the language should be present. In the context of Indian languages, syllable units are found to be a much better choice than units such as phone, diphone, and half-phone, and are widely used to build syllable-based synthesizers. However, an important issue not addressed in earlier work on syllable-based synthesizers for Indian languages is the coverage of all possible syllables. Syllable coverage in an Indian language is a non-trivial issue, and it is difficult to build a speech database that provides good coverage of all syllables. Hence, syllable-based synthesizers built for Indian languages in earlier work use a back-off strategy, falling back to diphones or phones to synthesize an utterance when a particular syllable is not found in the speech database. The question we ask in this work is whether a syllable-based synthesizer can be built without using any lower-level units such as triphones or diphones as back-off units while still addressing the issue of syllable coverage. In this context, we have investigated two approaches: 1) approximate matching of a syllable and 2) a global syllable set. Approximate matching of a syllable deals with finding the nearest syllable by either substitution or deletion of one of its phones. The hypothesis is that human perception may not register a significant difference if an approximately matched syllable is used during synthesis of an utterance.
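As an illustration of the approximate matching idea described above, the following is a minimal sketch, not the method used in the thesis. It assumes a syllable is represented as a tuple of phone labels and the speech database as a set of such tuples. The function name `approximate_match`, the phone labels, the preference for substitution over deletion, and the alphabetical tie-break are all hypothetical choices made for this example; a real system would more likely rank candidates by perceptual or acoustic similarity.

```python
from typing import Optional, Set, Tuple

# Hypothetical representation: a syllable is a tuple of phone labels.
Syllable = Tuple[str, ...]


def approximate_match(target: Syllable, inventory: Set[Syllable]) -> Optional[Syllable]:
    """Return the target if present, else a syllable one phone away from it.

    "One phone away" means one substitution or one deletion applied to the
    target's phones. Returns None when no such syllable is in the inventory.
    """
    # Exact match: no approximation needed.
    if target in inventory:
        return target

    # Substitution: same length, differing in exactly one position.
    # Sorting gives a deterministic (but arbitrary) tie-break.
    substitutions = sorted(
        s for s in inventory
        if len(s) == len(target)
        and sum(a != b for a, b in zip(s, target)) == 1
    )
    if substitutions:
        return substitutions[0]

    # Deletion: drop one phone from the target and look the result up.
    for i in range(len(target)):
        candidate = target[:i] + target[i + 1:]
        if candidate and candidate in inventory:
            return candidate

    return None


# Toy inventory with made-up phone labels.
inventory = {("k", "a"), ("p", "a"), ("t", "a", "m")}
exact = approximate_match(("k", "a"), inventory)
substituted = approximate_match(("g", "a"), inventory)
deleted = approximate_match(("s", "t", "a", "m"), inventory)
missing = approximate_match(("s", "t", "r", "i"), inventory)
```

In this toy inventory, `("g", "a")` is resolved by substituting its first phone, `("s", "t", "a", "m")` by deleting its first phone, and `("s", "t", "r", "i")` has no match within one edit, so the function returns `None`.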
The idea of a global syllable set deals with merging syllable-level units from different Indian languages to create a larger syllable database. However, such a database has to deal with the multiple voice identities associated with different speakers. To address this issue, we propose a cross-lingual voice conversion technique based on artificial neural networks. The usefulness of approximate matching of syllables and of a global syllable set is presented in this thesis, with supporting experiments and results. The contributions of this work are 1) experimental evidence that approximate matching of syllables can be used in syllable-based text-to-speech systems in Indian languages, 2) the use of a global syllable set for building text-to-speech systems in Indian languages, 3) the use of a cross-lingual voice conversion technique, and 4) a method for pruning large unit selection databases so that text-to-speech synthesis can be deployed in practical applications.