Research Area: | Speech Analysis | Year: | 2009 | ||||
Type of Publication: | Mastersthesis | Keywords: | Speech synthesis, unit size, syllable, approximate matching, global syllable set | ||||
Authors: | Veera Raghavendra Elluru | ||||||
Abstract: | |||||||
A text-to-speech system converts the given text into corresponding spoken form. A
widely used approach to building text-to-speech systems is based on concatenation of speech
segments and is often referred to as the concatenative synthesis technique. This method uses
prerecorded speech units, which preserve co-articulation and prosody of the spoken language.
The quality of the synthetic speech is thus a direct function of the available units,
making the choice of unit size an important issue. For good quality synthesis, all the
units of the language should be present. In the context of Indian languages, syllable units
are found to be a much better choice than units like phone, diphone, and half-phone and are
widely used to build syllable based synthesizers. However, an important issue not addressed
in the earlier works on syllable based synthesizers for Indian languages is the coverage of
all possible syllables. The coverage of syllables in an Indian language is a non-trivial issue
and it is difficult to build a speech database that provides a good coverage of all syllables.
Hence syllable based synthesizers built for Indian languages in earlier work use a back-off
strategy that falls back to diphone or phone units to synthesize an utterance when a particular
syllable is not found in the speech database.
The question we would like to ask in this work is whether a syllable based synthesizer
could be built without using any lower level units such as triphone or diphone as back-off
units but still address the issue of coverage of syllables. In this context, we have investigated
two approaches, namely 1) approximate matching of a syllable and 2) a global syllable
set. Approximate matching of a syllable deals with finding the nearest syllable either by substitution
or by deletion of one of its phones. The hypothesis is that the perceptual mechanism
of human beings may not notice a significant difference if we use an approximately matched
syllable during synthesis of an utterance. The idea of the global syllable set is to merge
syllable level units from different Indian languages to create a larger syllable database.
However, such a database has to deal with multiple voice identities associated with different
speakers. To address this issue we propose a cross-lingual voice conversion technique based
on artificial neural networks. This thesis presents experiments and results supporting the
usefulness of approximate matching of syllables and of the global syllable set.
The contributions of this work are 1) experimental evidence that approximate matching
of syllables can be used in syllable based text-to-speech systems in Indian languages, 2)
use of a global syllable set for building text-to-speech systems in Indian languages, 3) use of
a cross-lingual voice conversion technique and 4) a method for pruning large unit selection
databases to be able to deploy text-to-speech synthesis in practical applications. |
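The approximate matching idea described in the abstract (finding a stored syllable that differs from the requested one by substitution or deletion of a single phone) can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `approximate_matches`, the representation of a syllable as a tuple of phone labels, and the sample inventory are all assumptions made for illustration, and the sketch does not rank candidates by perceptual closeness.

```python
from typing import Iterable, List, Tuple

# Assumption: a syllable is represented as a tuple of phone labels.
Syllable = Tuple[str, ...]


def approximate_matches(target: Syllable, inventory: Iterable[Syllable]) -> List[Syllable]:
    """Return inventory syllables that equal the target, or that can be
    reached from it by substituting or deleting exactly one phone."""
    units = set(inventory)
    if target in units:
        return [target]  # exact match, no approximation needed

    matches = []
    for candidate in sorted(units):
        if len(candidate) == len(target):
            # Substitution: exactly one position differs.
            if sum(a != b for a, b in zip(candidate, target)) == 1:
                matches.append(candidate)
        elif len(candidate) == len(target) - 1:
            # Deletion: candidate is the target with one phone removed.
            if any(target[:i] + target[i + 1:] == candidate
                   for i in range(len(target))):
                matches.append(candidate)
    return matches


# Hypothetical inventory and phone labels, for illustration only.
inventory = [("k", "a"), ("k", "a", "m"), ("t", "a", "n"), ("p", "i")]
print(approximate_matches(("k", "a", "n"), inventory))
# [('k', 'a'), ('k', 'a', 'm'), ('t', 'a', 'n')]
```

A real system would additionally need a rule for choosing among several candidates, for example by preferring substitutions between acoustically similar phones; the abstract does not specify that rule, so it is left out here.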