Research Area: Speech Synthesis
Year: 2011
Type of Publication: Article
Keywords: audio books, forced alignment, large speech files, text to speech
Authors: Kishore S. Prahallad, A. W. Black
Abstract:
One of the issues in using audio books for building a synthetic
voice is the segmentation of large speech files. The use of the Viterbi algorithm
to obtain phone boundaries on large audio files fails primarily
because of its huge memory requirements. Earlier work has attempted to
resolve this problem by using a large-vocabulary speech recognition system
with a restricted dictionary and language model. In this paper, we propose
suitable modifications to the Viterbi algorithm and demonstrate their
usefulness for segmenting large speech files in audio books. The utterances
obtained from large speech files in audio books are used to build
synthetic voices. We show that synthetic voices built from audio books in
the public domain have Mel-cepstral distortion scores in the range of 4–7,
which is similar to that of voices built from studio-quality recordings such as CMU
ARCTIC.
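
The abstract reports Mel-cepstral distortion (MCD) scores without defining the measure. As a rough illustration, the sketch below computes MCD in its commonly used form, (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over frames. It is not taken from the paper: the function name `mel_cepstral_distortion` is hypothetical, the frames are assumed to be already time-aligned, and skipping the 0th (energy) coefficient is a common convention assumed here rather than something the abstract states.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over time-aligned frames of mel-cepstral coefficients.

    Each frame is a sequence of coefficients. Index 0 (energy) is skipped,
    which is an assumption of this sketch, not a detail from the abstract.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need an equal, non-zero number of aligned frames")

    scale = 10.0 / math.log(10.0)  # converts the distance to decibels
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance over coefficients 1..D of this frame
        squared = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


# Toy usage: one natural and one synthesized frame with three coefficients each
natural = [[0.0, 1.0, 0.0]]
synthesized = [[0.0, 0.0, 0.0]]
mcd = mel_cepstral_distortion(natural, synthesized)
```

Lower values indicate that the synthesized cepstra lie closer to the natural ones; identical inputs give 0 dB.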