Vector-based Representation and Clustering of Audio Using Onomatopoeia Words

Shiva Sundaram, Shrikanth Narayanan

We present results on organizing audio data based on descriptions that use onomatopoeia words. Onomatopoeia words imitate sounds; they directly describe and represent different types of sound sources through their perceived properties. For instance, the word "pop" aptly describes the sound of opening a champagne bottle. We first establish this audio-to-word relationship by manually tagging a variety of audio clips from a sound effects library with onomatopoeia words. Using principal component analysis (PCA) and a newly proposed distance metric for word-level clustering, we then cluster the audio data representing the clips. Because of the distance metric and the audio-to-word relationship, the clips within each resulting cluster have similar acoustic properties. We found that, as language-level units, onomatopoeic descriptions are able to represent perceived properties of audio signals. We believe that this form of description can be useful in relating higher-level descriptions of events in a scene by providing an intermediate perceptual understanding of the acoustic event.
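
The tag, project, and cluster steps summarized in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the clip names and tags are invented, the binary clip-by-word matrix is only one possible vector representation, and plain Euclidean distance with single-linkage merging stands in for the paper's proposed distance metric, which the abstract does not define.

```python
import numpy as np

# Hypothetical clip names and tags, invented for illustration only.
clip_tags = {
    "cork": ["pop", "snap"],
    "balloon": ["pop", "bang"],
    "firecracker": ["bang", "snap"],
    "bee": ["buzz", "hum"],
    "razor": ["buzz", "whir"],
    "fan": ["hum", "whir"],
}

def tag_matrix(clip_tags):
    """Build a binary clip-by-word matrix (1.0 where a clip carries a tag)."""
    clips = sorted(clip_tags)
    vocab = sorted({w for tags in clip_tags.values() for w in tags})
    col = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(clips), len(vocab)))
    for i, clip in enumerate(clips):
        for w in clip_tags[clip]:
            X[i, col[w]] = 1.0
    return clips, vocab, X

def pca_project(X, n_components):
    """Project mean-centered rows onto the top principal axes (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Assumption: Euclidean distance is a stand-in here, not the
    distance metric proposed in the paper.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

clips, vocab, X = tag_matrix(clip_tags)
Z = pca_project(X, n_components=2)
groups = [sorted(clips[i] for i in c) for c in single_linkage(Z, n_clusters=2)]
print(groups)
```

In this toy data, clips that share tags end up close together after projection, so the impulsive sounds and the sustained sounds separate into two groups. A real tag set would be larger and noisier, and the choice of distance metric would matter far more than it does here.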

Subjects: 1.10 Information Retrieval; 6.2 Multimedia

This page is copyrighted by AAAI. All rights reserved.