|Type of Publication:||Mastersthesis|
|Authors:||Sree Harsha Yella|
|University:||International Institute of Information Technology|
Automatic speech summarization is the task of generating a concise summary of a speech signal using a digital computer. Existing speech summarization systems rely on automatic speech recognition (ASR) transcripts and gold-standard human summaries to generate summaries of speech signals. These approaches have several limitations: ASR errors make summaries less usable by humans; ASR systems are not available for all languages, especially low-resource languages, and building one takes considerable resources and effort; and gold-standard human summaries are not available for all speech signals, while building them is a tedious and time-consuming task. In this work, we propose two techniques for summarization: 1) exploiting the anchor speaker role in a broadcast news (BN) show to construct summaries, and 2) a generalized ranking of speech segments based on the prominence values of the syllables in them. Analysis of manual summaries of news shows revealed that anchor speaker segments are mostly picked in manual summaries. It is therefore desirable for automatic summaries to exhibit this characteristic. We propose two techniques to perform anchor speaker tracking, one based on an auto-associative neural network model and the other on the Bayesian information criterion. Audio summaries of a desired length are generated by concatenating anchor speaker segments based on their positional features. These summaries are evaluated with ROUGE, an automatic text summarization evaluation package, after transcribing the audio summary into text. The ROUGE-N metric measures the N-gram overlap between human reference summaries and the automatic summary. The F-measure scores of the proposed system for the ROUGE-1 and ROUGE-2 metrics are 0.561 and 0.392, respectively.
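To make the ROUGE-N F-measure concrete, the following is a minimal sketch of the N-gram overlap computation. It is not the ROUGE package itself: it assumes a single reference summary, lowercased whitespace tokens, and no stemming or stop-word handling, and the function names `ngrams` and `rouge_n` are illustrative only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, f_measure) of the n-gram overlap.

    Simplified assumption: one reference, whitespace tokenization,
    no stemming, unlike the full ROUGE package.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so repeated n-grams are not credited more often than they occur.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    system = "the cat sat on the mat"
    human = "the cat lay on the mat"
    print(rouge_n(system, human, n=1))
    print(rouge_n(system, human, n=2))
```

In this toy pair, five of the six unigrams and three of the five bigrams match, so the ROUGE-1 and ROUGE-2 F-measures are 5/6 and 3/5.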
These ROUGE scores show that the system is capable of generating summaries that are as good as those of a supervised speech summarization system trained using gold-standard human summaries, which achieved 0.553 and 0.382 for the ROUGE-1 and ROUGE-2 metrics, respectively. We also performed a task-based evaluation in which humans were asked to listen to the summary and answer questions regarding the contents of a news show. The percentage of questions answered by the humans was 71% for the proposed system, which is better than the 60.2% of the supervised speech summarization system. The coherence of the summaries was also evaluated by asking the users to rate the summaries on a scale of 1 to 5, where 1 corresponds to very bad and 5 corresponds to very good. The mean opinion scores (MOS) of these ratings for the proposed method and the supervised speech summarization system are 4.05 and 3.2, respectively. The task-based evaluation of these summaries by humans showed that they prefer the summaries generated by the proposed techniques over the summaries generated by standard speech summarization methods.
In the other part of the work, a technique is proposed to rank segments in a speech signal using prosodic features that indicate importance. When humans convey a message through speech, they draw listeners’ attention to the information-bearing parts of speech through variations in pitch, amplitude, duration and stress. Speakers make some words prominent and reduce other words. The proposed method computes syllable-level prominence values as a function of syllable nucleus duration, sub-band energy (300-2200 Hz), and pitch variation, and these values are used to obtain a segment-level score, which is used for ranking the segment for summarization. It is shown that this type of scoring captures the prosodic information relevant to summarization in an unsupervised framework.
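The prominence-based ranking can be illustrated with a short sketch. The abstract does not give the exact prominence function or how syllable values are pooled into a segment score, so everything specific here is an assumption: the `Syllable` fields, min-max normalization, the product of the three cues as the prominence value, and the mean over syllables as the segment score.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Syllable:
    nucleus_duration: float  # seconds
    subband_energy: float    # energy in the 300-2200 Hz band
    pitch_variation: float   # e.g. F0 range over the nucleus, in Hz


def min_max(values):
    """Scale values to [0, 1]; empty or constant inputs map to zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_segments(segments):
    """Rank segments (lists of Syllable) by mean syllable prominence.

    Returns (order, scores): segment indices from most to least
    prominent, and the per-segment scores.
    """
    syllables = [s for seg in segments for s in seg]
    dur = min_max([s.nucleus_duration for s in syllables])
    eng = min_max([s.subband_energy for s in syllables])
    pit = min_max([s.pitch_variation for s in syllables])
    # Assumed prominence function: product of the normalized cues.
    prominence = [d * e * p for d, e, p in zip(dur, eng, pit)]

    scores, start = [], 0
    for seg in segments:
        seg_vals = prominence[start:start + len(seg)]
        # Assumed pooling: mean prominence of the segment's syllables.
        scores.append(mean(seg_vals) if seg_vals else 0.0)
        start += len(seg)
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

Because the ranking needs only acoustic measurements, a sketch like this requires neither transcripts nor reference summaries, which is what makes the approach unsupervised.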
We have also proposed a method to combine lexical and positional features with the prominence-based scoring when text transcripts of speech signals are available. The proposed prominence-based scoring captures information complementary to the lexical features derived from text transcripts of speech signals. The combination of these features performs better than the individual features. The proposed method was evaluated on two types of speech data: read-style news speech and spontaneous telephone conversations. The proposed system based on prominence scoring achieved ROUGE-1 and ROUGE-2 F-measure scores of 0.508 and 0.341 on read-style news speech, and 0.666 and 0.464 on spontaneous conversations, respectively. In read-style speech, the basic unit of extraction was obtained by pause-based segmentation, which does not give semantically meaningful segments, whereas in spontaneous telephone conversations we considered speaker turns, which are semantically meaningful units, as the basic unit of extraction.
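One simple way to combine prominence-based, lexical and positional scores per extraction unit is a weighted sum of normalized scores. The abstract does not state how the thesis combines the features, so this is only a hypothetical sketch: the function `combine_scores`, the feature names, and the weights are all assumptions made for illustration.

```python
def min_max(values):
    """Scale values to [0, 1]; empty or constant inputs map to zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine_scores(feature_scores, weights):
    """Weighted sum of min-max normalized per-segment feature scores.

    feature_scores: dict mapping a feature name to a list of scores,
                    one per segment (all lists the same length).
    weights:        dict mapping the same feature names to weights.
    Returns the combined score of each segment.
    """
    lengths = {len(scores) for scores in feature_scores.values()}
    if len(lengths) > 1:
        raise ValueError("all features need one score per segment")
    n_segments = lengths.pop() if lengths else 0
    combined = [0.0] * n_segments
    for name, scores in feature_scores.items():
        for i, value in enumerate(min_max(scores)):
            combined[i] += weights[name] * value
    return combined


if __name__ == "__main__":
    # Hypothetical scores for three segments (not data from the thesis).
    features = {
        "prominence": [0.2, 0.8, 0.5],
        "lexical": [0.9, 0.1, 0.5],
        "positional": [1.0, 0.5, 0.0],
    }
    weights = {"prominence": 1 / 3, "lexical": 1 / 3, "positional": 1 / 3}
    combined = combine_scores(features, weights)
    ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    print(combined, ranking)
```

Normalizing each feature before summing keeps one feature's scale from dominating the others; in practice the weights would have to be tuned, for example on held-out data.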