Research Area: Uncategorized
Year: 2010
Type of Publication: Master's thesis
Authors: Sree Harsha Yella
Abstract:
Automatic speech summarization is the task of generating a concise summary of a speech signal using a digital computer. Existing speech summarization systems rely on automatic speech recognition (ASR) transcripts and gold-standard human summaries to generate summaries of speech signals. These approaches have limitations: ASR errors make the summaries less usable by humans; ASR systems are not available for all languages, especially low-resource languages, and building one takes considerable resources and effort; and gold-standard human summaries are not available for all speech signals, while producing them is a tedious and time-consuming task. In this work, we propose two techniques for summarization:
1) Exploiting the anchor speaker role in a broadcast news (BN) show to construct summaries, and
2) A generalized ranking of speech segments based on the prominence values of the syllables they contain.
By analyzing manual summaries of news shows, we found that anchor speaker segments are the ones most often picked for manual summaries, so it is desirable for automatic summaries to exhibit this characteristic. We propose two techniques for anchor speaker tracking, one based on an auto-associative neural network model and one on the Bayesian information criterion (BIC).
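A common way the BIC is applied in speaker tracking is the ΔBIC score, which decides whether two stretches of audio are better modeled by a single Gaussian or by two, i.e. whether they come from the same speaker. The sketch below is a generic illustration over acoustic feature vectors (e.g. MFCCs); it is not necessarily the exact formulation used in this work, and the penalty weight lam is an assumed tunable parameter.

    import numpy as np

    def delta_bic(x, y, lam=1.0):
        """Generic Delta-BIC: compare one Gaussian fit to the pooled data against
        separate Gaussians for x and y. A positive score suggests the two
        stretches come from different speakers."""
        z = np.vstack([x, y])
        n, d = z.shape

        def logdet_cov(a):
            # log-determinant of the sample covariance of the feature vectors in a
            return np.linalg.slogdet(np.cov(a, rowvar=False))[1]

        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
        return 0.5 * (n * logdet_cov(z)
                      - len(x) * logdet_cov(x)
                      - len(y) * logdet_cov(y)) - lam * penalty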
Audio summaries of a desired length are then generated by concatenating anchor speaker segments selected according to their positional features.
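A minimal sketch of this generation step, assuming each detected anchor segment carries a duration and a position index within the show (the segment representation and the position-based selection order are illustrative assumptions, not the exact procedure of this work):

    from dataclasses import dataclass

    @dataclass
    class Segment:
        start: float      # start time within the show, in seconds
        duration: float   # segment duration, in seconds
        position: int     # ordinal position of the segment within the show

    def build_summary(anchor_segments, target_length):
        """Greedily concatenate anchor segments, favoring earlier positions,
        until the desired summary length is reached."""
        summary, total = [], 0.0
        for seg in sorted(anchor_segments, key=lambda s: s.position):
            if total + seg.duration > target_length:
                break
            summary.append(seg)
            total += seg.duration
        return summary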
These summaries are evaluated by transcribing the audio summary into text and scoring it with ROUGE, an automatic text summarization evaluation package; the ROUGE-N metric measures the N-gram overlap between human reference summaries and the automatic summary.
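For reference, the ROUGE-N f-measure reported below amounts to the following N-gram overlap computation, sketched here for a single reference summary and simple whitespace tokenization (the ROUGE package itself handles multiple references and further options):

    from collections import Counter

    def ngram_counts(tokens, n):
        """Multiset of n-grams in a token sequence."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def rouge_n_f(candidate, reference, n=1):
        """ROUGE-N f-measure of a candidate summary against one reference."""
        cand = ngram_counts(candidate.split(), n)
        ref = ngram_counts(reference.split(), n)
        if not cand or not ref:
            return 0.0
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        recall = overlap / sum(ref.values())
        precision = overlap / sum(cand.values())
        if recall + precision == 0:
            return 0.0
        return 2 * recall * precision / (recall + precision)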
The f-measure scores of the proposed system for the ROUGE-1 and ROUGE-2 metrics are 0.561 and 0.392 respectively. These scores show that the system generates summaries as good as those of a supervised speech summarization system trained on gold-standard human summaries, which achieved 0.553 and 0.382 on ROUGE-1 and ROUGE-2 respectively. We also performed a task-based evaluation in which humans were asked to listen to a summary and answer questions about the contents of the news show. The percentage of questions answered was 71% for the proposed system, better than the 60.2% obtained with the supervised speech summarization system. The coherence of the summaries was also evaluated by asking users to rate them on a scale of 1-5, where 1 corresponds to very bad and 5 to very good. The mean opinion scores (MOS) of these ratings are 4.05 for the proposed method and 3.2 for the supervised speech summarization system. This task-based evaluation showed that humans prefer the summaries generated by the proposed techniques over those generated by standard speech summarization methods.
In the other part of the work, we propose a technique to rank segments of a speech signal using prosodic features that indicate importance. When humans convey a message through speech, they draw listeners' attention to the information-bearing parts through variations in pitch, amplitude, duration and stress, making some words prominent and reducing others. The proposed method computes syllable-level prominence values as a function of syllable nucleus duration, sub-band energy (300-2200 Hz) and pitch variation; these values are combined into a segment-level score that is used to rank segments for summarization. We show that this kind of scoring captures the prosodic information relevant to summarization in an unsupervised framework.
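A minimal sketch of this ranking step is given below. It assumes the three prosodic cues have already been extracted and normalized per syllable, and it uses an unweighted average both across cues and across syllables; that combination is an illustrative choice, not necessarily the one used in this work.

    import numpy as np

    def syllable_prominence(nucleus_duration, subband_energy, pitch_variation):
        """Combine normalized prosodic cues into one prominence value
        (equal weighting is assumed for illustration)."""
        return (nucleus_duration + subband_energy + pitch_variation) / 3.0

    def segment_score(syllables):
        """Aggregate the syllable prominence values of a segment into one score."""
        return float(np.mean([syllable_prominence(*cues) for cues in syllables]))

    def rank_segments(segments):
        """Order candidate segments by prominence score, most prominent first."""
        return sorted(segments, key=segment_score, reverse=True)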
We also propose a method to combine lexical and positional features with the prominence-based scoring when text transcripts of the speech signals are available. The prominence-based scoring captures information complementary to the lexical features derived from the transcripts, and the combination of these features performs better than the individual features.
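One simple way to realize such a combination is a linear interpolation of normalized per-segment scores, sketched below; the weights and the particular lexical and positional scores are illustrative assumptions, not values taken from this work.

    def combined_score(prominence, lexical, positional,
                       w_prom=0.4, w_lex=0.4, w_pos=0.2):
        """Interpolate normalized segment-level scores; weights are illustrative."""
        return w_prom * prominence + w_lex * lexical + w_pos * positional

Segments can then be ranked by combined_score just as with the prominence score alone.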
The proposed method was evaluated on two types of speech data: read-style news speech and spontaneous telephone conversations. The system based on prominence scoring achieved ROUGE-1 and ROUGE-2 f-measure scores of 0.508 and 0.341 on read-style news speech, and 0.666 and 0.464 on spontaneous conversations. For read-style speech the basic unit of extraction was obtained by pause-based segmentation, which does not give semantically meaningful segments, whereas for spontaneous telephone conversations we used speaker turns, which are semantically meaningful units, as the basic unit of extraction.