Qian Hu, Fred J. Goodman, Stanley M. Boykin, Randall K. Fish, Warren R. Greiff, Stephen R. Jones, Stephen R. Moore
The availability of large volumes of multimedia data presents many challenges to content retrieval. Sophisticated modern systems must efficiently process, index, and retrieve terabytes of multimedia data, determining what is relevant based on the user's query criteria and the system's domain-specific knowledge. This paper reports our approach to information extraction from cross-lingual multimedia data by automatically detecting, indexing, and retrieving multiple attributes from the audio track. The time-stamped attributes that the Audio Hot Spotting system automatically extracts from multimedia include speech transcripts and keyword indices, phonemes, speaker identity (if possible), spoken language ID, and automatically identified non-lexical audio cues. The non-lexical audio cues include both non-speech attributes and background noise. Non-speech attributes include speech rate and vocal effort (e.g., shouting and whispering), which are indicative of the speaker's emotional state, especially when combined with adjacent keywords. Detected background noise (such as laughter and applause) is suggestive of audience response to the speaker. In this paper, we describe how the Audio Hot Spotting prototype system detects these multiple attributes and how the system uses them to discover information, locate passages of interest within a large multimedia and cross-lingual data collection, and refine query results.
Submitted: Sep 8, 2008