Research Area: Speech Analysis
Year: 2010
Type of Publication: Master's thesis
Keywords: Voice Conversion, Artificial Neural Networks, Spectral Mapping, Error Correction Network, Cross-Lingual Voice Conversion
Authors: Srinivas Desai
Abstract:
Voice conversion is the process of transforming an utterance of a source speaker so that it is perceived as if spoken by a specified target speaker. Applications of voice conversion include secure transmission, speech-to-speech translation, and generating voices for virtual characters/avatars. The process involves transforming acoustic cues pertaining to the identity of a speaker, such as the spectral parameters characterizing the vocal tract, the fundamental frequency, and prosody. Spectral parameters representing the vocal tract shape are known to contribute most to speaker identity, and hence there have been efforts to find a better spectral mapping between the source and the target speaker. In this dissertation, we propose an Artificial Neural Network (ANN) based spectral mapping and compare its performance against the state-of-the-art Gaussian Mixture Model (GMM) based mapping. We show that the ANN-based voice conversion system performs better than the GMM-based system.
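The core of such a spectral mapping is a network trained to map aligned source spectral frames to target spectral frames. The following is only an illustrative sketch, not the thesis's actual network or features: a one-hidden-layer network trained by backpropagation on synthetic "aligned" frames, where the toy dimensions, the synthetic warp standing in for a real target speaker, and all hyperparameters are assumptions for the example.

```python
import math
import random

random.seed(0)

DIM = 4      # toy spectral feature dimension (real systems use ~25 MCEPs)
HIDDEN = 8
LR = 0.05

# Synthetic "aligned" source/target frames: the target is a fixed nonlinear
# warp of the source, standing in for DTW-aligned parallel training data.
def warp(x):
    return [math.tanh(2.0 * v) * 0.8 + 0.1 for v in x]

frames_src = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(200)]
frames_tgt = [warp(x) for x in frames_src]

# One-hidden-layer MLP: y = W2 . tanh(W1 . x + b1) + b2
W1 = [[random.gauss(0, 0.5) for _ in range(DIM)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[random.gauss(0, 0.5) for _ in range(HIDDEN)] for _ in range(DIM)]
b2 = [0.0] * DIM

def forward(x):
    h = [math.tanh(sum(W1[j][i] * x[i] for i in range(DIM)) + b1[j])
         for j in range(HIDDEN)]
    y = [sum(W2[k][j] * h[j] for j in range(HIDDEN)) + b2[k]
         for k in range(DIM)]
    return h, y

def mse():
    total = 0.0
    for x, t in zip(frames_src, frames_tgt):
        _, y = forward(x)
        total += sum((yi - ti) ** 2 for yi, ti in zip(y, t))
    return total / len(frames_src)

init_mse = mse()

for epoch in range(200):
    for x, t in zip(frames_src, frames_tgt):
        h, y = forward(x)
        err = [yi - ti for yi, ti in zip(y, t)]  # dL/dy for squared error
        # Hidden-layer gradients, computed before W2 is updated.
        g = [sum(err[k] * W2[k][j] for k in range(DIM)) * (1.0 - h[j] ** 2)
             for j in range(HIDDEN)]
        # Output-layer update.
        for k in range(DIM):
            for j in range(HIDDEN):
                W2[k][j] -= LR * err[k] * h[j]
            b2[k] -= LR * err[k]
        # Hidden-layer update (backprop through tanh).
        for j in range(HIDDEN):
            for i in range(DIM):
                W1[j][i] -= LR * g[j] * x[i]
            b1[j] -= LR * g[j]

final_mse = mse()
print(f"MSE: {init_mse:.4f} -> {final_mse:.4f}")
```

Once trained, the same `forward` pass converts any source frame to a target-speaker frame; a real system would do this per analysis frame and then resynthesize speech from the converted parameters.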
A typical requirement for a voice conversion system is that both the source and target speakers record the same set of utterances, referred to as parallel data. A mapping function obtained on such parallel data can then be used to transform spectral characteristics from the source speaker to the target speaker. If either speaker changes, a new transformation function has to be estimated, which requires a fresh collection of parallel data. However, it is not always feasible to find parallel utterances for training. The complexity of building training data increases further if the source speaker and the target speaker speak different languages, as in cross-lingual voice conversion.
To circumvent the need for parallel data and to reduce the complexity of building training data for a cross-lingual voice conversion system, we propose an algorithm which captures speaker-specific characteristics of the target speaker, so that no training data from the source speaker is needed. Such an algorithm needs to be trained only on the target speaker's data, and hence any arbitrary source speaker can be transformed to the specified target speaker. We show that the proposed algorithm can be used in both intra-lingual and cross-lingual voice conversion. Subjective and objective evaluations reveal that speech transformed using the proposed approach is intelligible and possesses the characteristics of the target speaker.
A set of transformed utterances corresponding to the results discussed in this work is available for listening at http://ravi.iiit.ac.in/?speech/uploads/taslp09_srinivas/
Digital version