Speech and Vision Lab

Application of single frequency filtering for speech and speaker-specific tasks
Research Area: Uncategorized Year: 2021
Type of Publication: PhD Thesis
Authors: Vishala Pannala  
Speech produced in practical environments is affected by several sources of degradations, which include noises, reflections, reverberation and sounds from other sources such as background speech, vehicles, etc. These degradations reduce the performance of several speech-based applications, such as speech and speaker recognition systems. It is difficult to characterize these degradations, as they are not known in advance and are also time varying. Hence, it is necessary to explore and extract speech-specific and speaker-specific information in the collected speech signal to improve the performance of speech-based systems. The objective of the studies proposed in this thesis is to explore methods based on single frequency filtering (SFF) analysis of speech signals and artificial neural network (ANN) models for some speech-specific and speaker-specific tasks. The SFF analysis gives flexibility in the representation of speech information at different spectral and temporal resolutions. The ANN models are used for capturing the implicit features specific to a given application. The SFF analysis can be used to provide a representation of the excitation source and the vocal tract system characteristics of the dynamic speech production mechanism. The ANN models help in extracting the discriminative and descriptive features as needed for a given task. The speech-specific tasks explored in this thesis include speech/nonspeech detection from degraded speech and discrimination of natural and synthetic speech. The speaker-specific tasks examined in this thesis include speaker separation in two-speaker conversation data, mimicked voice detection and speaker verification.

The speech/nonspeech detection task is addressed using approaches based on signal processing and on ANN models. The signal processing approach exploits speech-specific characteristics in the degraded signal. The results of speech/nonspeech detection are given for speech data from the TIMIT corpus, corrupted by synthetically adding noises of different types at different levels. The signal processing approach is also used for detecting the speech regions when the speech is corrupted by unknown sources of degradations in a practical information access situation. The performance of speech/nonspeech discrimination using ANN models is examined for synthetically added degradations to speech signals, mainly to study the effectiveness of different types of SFF-based representations of the signal. These models have been successfully used for the voice activity detection (VAD) task on the naturalistic Apollo corpus. The effectiveness of SFF-based representations and the discriminative characteristics of ANN models are examined for several speech-specific tasks such as discrimination of close vs distant speech, live vs recorded speech, and natural vs synthetic speech.

Speaker-specific tasks such as speaker separation, mimicked voice detection and speaker verification are explored using discriminative and descriptive ANN models. The speaker separation task is successfully demonstrated on several cases of two-speaker conversation data. The proposed approach for the speaker verification task is demonstrated only for a limited number of speakers, as the speaker verification task should not be dependent on the number of speakers. Overall, this study examined the effectiveness of different SFF-based representations for some speech-specific and speaker-specific tasks using ANN models. It also highlighted deficiencies in representation for some other tasks.