Research Area: Uncategorized
Year: 2000
Type of Publication: Article
Keywords: modulation spectrum; temporal processing; speaker verification; channel variability; data-driven filter design
Authors: Narendranath Malayath, Hynek Hermansky, Sachin Kajarekar, B. Yegnanarayana
Note: http://www.sciencedirect.com/science/article/B6WDJ-45F541V-5/2/832489fd051b5071e3f52108ad311ebe
Abstract:
Malayath, Narendranath, Hermansky, Hynek, Kajarekar, Sachin, and Yegnanarayana, B., Data-Driven Temporal Filters and Alternatives to GMM in Speaker Verification, Digital Signal Processing 10 (2000), 55-74.
This paper discusses the research directions pursued jointly at the Anthropic Signal Processing Group of the Oregon Graduate Institute and at the Speech and Vision Laboratory of the Indian Institute of Technology Madras. Current methods for speaker verification are based on modeling speaker characteristics using Gaussian mixture models (GMM). The performance of these systems degrades significantly if the target speakers use a telephone handset different from the one used during training. Conventional methods for channel normalization include utterance-based mean subtraction (MS) and RelAtive SpecTrAl (RASTA) filtering. In this paper we introduce a novel method for designing filters capable of normalizing the variability introduced by different telephone handsets. The design of the filter is based on the estimated second-order statistics of handset variability. This filter is applied to the logarithmic energy outputs of Mel-spaced filter banks. We also demonstrate the effectiveness of the proposed channel-normalizing filter in improving speaker verification performance in mismatched conditions. GMM-based systems often use thousands of mixture components and hence require a large number of parameters to characterize each target speaker. To address this issue we propose an alternative to GMM for modeling speaker characteristics. The alternative is based on speaker-specific mapping, and it relies on a speaker-independent representation of speech.
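The utterance-based mean subtraction (MS) baseline mentioned in the abstract can be sketched briefly: a fixed handset/channel acts as a constant additive offset in the log-energy domain, so subtracting the per-channel mean over the utterance removes it. The sketch below is a minimal illustration of that idea, not the paper's implementation; the log Mel energy matrix is synthetic.

```python
import numpy as np

# Synthetic stand-in for log Mel filter-bank energies:
# rows = frames, columns = Mel channels.
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(200, 20))

# A fixed telephone handset acts (approximately) as an additive
# offset in the log-energy domain, constant over the utterance.
channel_offset = rng.normal(size=(1, 20))
observed = log_mel + channel_offset

# Utterance-based mean subtraction removes any constant additive
# component per channel, and with it the handset offset.
normalized = observed - observed.mean(axis=0, keepdims=True)

# Mean-subtracting the clean features gives the same result:
# the normalized features no longer depend on the channel.
reference = log_mel - log_mel.mean(axis=0, keepdims=True)
assert np.allclose(normalized, reference)
```

RASTA filtering and the paper's data-driven filters generalize this: instead of removing only the constant (DC) component of each log-energy trajectory, they band-pass filter the trajectories over time, with the data-driven design choosing the filter from second-order statistics of handset variability.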
Digital version