Lipper: Speaker Independent Speech Synthesis Using Multi-View Lipreading

Authors

  • Khwaja Mohd. Salik, Netaji Subhas Institute of Technology
  • Swati Aggarwal, Netaji Subhas Institute of Technology
  • Yaman Kumar, Indian Institute of Technology Delhi
  • Rajiv Ratn Shah, Indian Institute of Technology Delhi
  • Rohit Jain, Netaji Subhas Institute of Technology
  • Roger Zimmermann, National University of Singapore

DOI:

https://doi.org/10.1609/aaai.v33i01.330110023

Abstract

Lipreading is the process of understanding and interpreting speech by observing a speaker’s lip movements. In the past, most work in lipreading has been limited to classifying silent videos into a fixed number of text classes. However, this limits the applications of lipreading, since human language cannot be bound to a fixed set of words or languages. The aim of this work is to reconstruct intelligible acoustic speech signals from silent videos, captured from various poses, of a person whom Lipper has never seen before. Lipper, therefore, is a vocabulary- and language-agnostic, speaker-independent, near real-time model that handles a variety of speaker poses. The model leverages silent video feeds from multiple cameras recording a subject to generate intelligible speech. It uses a deep-learning-based STCNN+BiGRU architecture to achieve this goal. We evaluate speech reconstruction in speaker-independent scenarios and demonstrate the speech output by overlaying the audio reconstructed by Lipper on the corresponding videos.
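To make the STCNN+BiGRU pipeline concrete, the sketch below shows one plausible shape of such a model: 3D (spatiotemporal) convolutions over stacked multi-view lip frames, followed by a bidirectional GRU that regresses per-frame acoustic features. The abstract does not specify layer sizes, view counts, input resolution, or the acoustic representation, so all of those choices here (two views, 64x64 grayscale crops, 128-dimensional audio features, hidden sizes) are illustrative assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class LipperSketch(nn.Module):
        """Illustrative STCNN+BiGRU model: maps a multi-view sequence of
        silent lip-region frames to a sequence of acoustic feature vectors
        that a vocoder could turn into speech. Layer sizes are assumptions."""

        def __init__(self, n_views=2, audio_dim=128, hidden=256):
            super().__init__()
            # Spatiotemporal CNN: 3D convolutions over (time, height, width),
            # with the camera views stacked along the channel axis.
            self.stcnn = nn.Sequential(
                nn.Conv3d(n_views, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space, keep time
                nn.Conv3d(32, 64, kernel_size=(3, 3, 3), padding=1),
                nn.ReLU(),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),
            )
            # Bidirectional GRU over the flattened per-frame CNN features.
            self.bigru = nn.GRU(input_size=64 * 16 * 16, hidden_size=hidden,
                                num_layers=2, batch_first=True, bidirectional=True)
            # Per-frame regression head to acoustic features.
            self.head = nn.Linear(2 * hidden, audio_dim)

        def forward(self, frames):
            # frames: (batch, views, time, 64, 64) grayscale lip crops
            x = self.stcnn(frames)                      # (B, 64, T, 16, 16)
            b, c, t, h, w = x.shape
            x = x.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
            x, _ = self.bigru(x)                        # (B, T, 2 * hidden)
            return self.head(x)                         # (B, T, audio_dim)

    model = LipperSketch()
    video = torch.randn(1, 2, 25, 64, 64)  # 1 clip, 2 views, 25 frames of 64x64
    audio_feats = model(video)
    print(audio_feats.shape)  # torch.Size([1, 25, 128])

The bidirectional recurrence lets each predicted audio frame condition on both past and future lip movements, which is useful because co-articulation spreads acoustic cues across neighboring video frames.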

Published

2019-07-17

How to Cite

Salik, K. M., Aggarwal, S., Kumar, Y., Shah, R. R., Jain, R., & Zimmermann, R. (2019). Lipper: Speaker Independent Speech Synthesis Using Multi-View Lipreading. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01), 10023-10024. https://doi.org/10.1609/aaai.v33i01.330110023

Issue

Vol. 33 No. 01 (2019)

Section

Student Abstract Track