Speech and Vision Lab

Spectral mapping using artificial neural networks for intra-lingual and cross-lingual voice conversion
Research Area: Speech Analysis Year: 2010
Type of Publication: Master's thesis Keywords: Voice Conversion, Artificial Neural Networks, Spectral Mapping, Error Correction Network, Cross-Lingual Voice Conversion
Authors: Srinivas Desai  
Voice conversion is the process of transforming an utterance of a source speaker so that it is perceived as if spoken by a specified target speaker. Applications of voice conversion include secure transmission, speech-to-speech translation, and generating voices for virtual characters and avatars. The process involves transforming the acoustic cues that carry a speaker's identity, such as the spectral parameters characterizing the vocal tract, the fundamental frequency, and prosody. Spectral parameters representing the vocal tract shape are known to contribute more to speaker identity, and hence there have been efforts to find a better spectral mapping between the source and the target speaker. In this dissertation, we propose an Artificial Neural Network (ANN) based spectral mapping and compare its performance against the state-of-the-art Gaussian Mixture Model (GMM) based mapping. We show that the ANN based voice conversion system performs better than the GMM based voice conversion system.
A typical requirement for a voice conversion system is that both the source and target speakers record the same set of utterances, referred to as parallel data. A mapping function obtained from such parallel data can be used to transform spectral characteristics from the source speaker to the target speaker. If either speaker changes, a new transformation function has to be estimated, which requires collecting parallel data again. However, it is not always feasible to find parallel utterances for training. The complexity of building training data increases if the source speaker and the target speaker speak different languages, as is the case in cross-lingual voice conversion.
To circumvent the need for parallel data and to reduce the complexity of building training data for a cross-lingual voice conversion system, we propose an algorithm that captures the speaker-specific characteristics of the target speaker, so that no training data from the source speaker is needed. Such an algorithm needs to be trained only on target speaker data, and hence any arbitrary source speaker can be transformed to the specified target speaker. We show that the proposed algorithm can be used in both intra-lingual and cross-lingual voice conversion. Subjective and objective evaluations reveal that the transformed speech produced by the proposed approach is intelligible and possesses the characteristics of the target speaker.
A set of transformed utterances corresponding to the results discussed in this work is available for listening at http://ravi.iiit.ac.in/ ?speech/uploads/taslp09_srinivas/
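As a rough illustration of the parallel-data spectral mapping described in the abstract, the sketch below trains a small feed-forward network to map source-speaker spectral frames to target-speaker frames. It is not the architecture used in the thesis: the network size, learning rate, feature dimension, and the names `train_spectral_mapper` and `convert` are all assumptions made for this example, and synthetic vectors stand in for time-aligned spectral features from real parallel recordings.

```python
import numpy as np


def train_spectral_mapper(src, tgt, hidden=32, epochs=500, lr=0.1, seed=0):
    """Fit a one-hidden-layer network (tanh hidden units, linear output)
    that maps source frames to target frames by minimising mean squared error.

    src, tgt: arrays of shape (n_frames, n_features), assumed time-aligned.
    Returns the learned parameters and the per-epoch training loss.
    """
    rng = np.random.default_rng(seed)
    d_in, d_out = src.shape[1], tgt.shape[1]
    W1 = rng.normal(0.0, 0.1, (d_in, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d_out))
    b2 = np.zeros(d_out)
    losses = []
    for _ in range(epochs):
        # Forward pass.
        h = np.tanh(src @ W1 + b1)
        pred = h @ W2 + b2
        err = pred - tgt
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradients of the mean squared error.
        g_pred = 2.0 * err / err.size
        g_W2 = h.T @ g_pred
        g_b2 = g_pred.sum(axis=0)
        g_z = (g_pred @ W2.T) * (1.0 - h ** 2)  # derivative of tanh
        g_W1 = src.T @ g_z
        g_b1 = g_z.sum(axis=0)
        # Plain gradient-descent update.
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2
    return (W1, b1, W2, b2), losses


def convert(params, src):
    """Apply the trained mapping to source frames."""
    W1, b1, W2, b2 = params
    return np.tanh(src @ W1 + b1) @ W2 + b2


if __name__ == "__main__":
    # Synthetic stand-in for parallel data (assumption): the "target" frames
    # are a fixed linear warp of the "source" frames plus an offset.
    rng = np.random.default_rng(1)
    source_frames = rng.normal(size=(500, 8))
    warp = rng.normal(0.0, 0.5, (8, 8))
    target_frames = source_frames @ warp + 0.5
    params, losses = train_spectral_mapper(source_frames, target_frames)
    print("initial loss:", losses[0], "final loss:", losses[-1])
```

In a real system the rows of `src` and `tgt` would be spectral feature vectors extracted from the same utterances spoken by both speakers and aligned in time; the approach proposed for the cross-lingual case, which needs only target speaker data, is not covered by this sketch.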