Lipper: Speaker Independent Speech Synthesis Using Multi-View Lipreading
Lipreading is the process of understanding and interpreting speech by observing a speaker’s lip movements. In the past, most of the work in lipreading has been limited to classifying silent videos to a fixed number of text classes. However, this limits the applications of the lipreading since human language cannot be bound to a fixed set of words or languages. The aim of this work is to reconstruct intelligible acoustic speech signals from silent videos from various poses of a person which Lipper has never seen before. Lipper, therefore is a vocabulary and language agnostic, speaker independent and a near real-time model that deals with a variety of poses of a speaker. The model leverages silent video feeds from multiple cameras recording a subject to generate intelligent speech of a speaker. It uses a deep learning based STCNN+BiGRU architecture to achieve this goal. We evaluate speech reconstruction for speaker independent scenarios and demonstrate the speech output by overlaying the audios reconstructed by Lipper on the corresponding videos.