Challenges in the Fusion of Video and Audio for Robust Speech Recognition

Jer-Sen Chen and Oscar N. Garcia

As speech recognizers become more robust, they are widely accepted as an essential component of human-computer interaction. State-of-the-art speaker-independent speech recognizers achieve word recognition error rates below 10%. To achieve even higher and more robust recognition performance, multi-modal speech recognition techniques that combine video and audio information can be used. Speech reading, the video portion of a bimodal speech recognizer, introduces not only the additional computational cost of video processing but also challenges in the design of the integrated audio-video recognizer.

This page is copyrighted by AAAI. All rights reserved.