Understanding Pictograph with Facial Features: End-to-End Sentence-Level Lip Reading of Chinese

Xiaobing Zhang; Haigang Gong; Xili Dai; Fan Yang; Nianbo Liu; Ming Liu

doi:10.1609/aaai.v33i01.33019211

Authors

Xiaobing Zhang, University of Electronic Science and Technology of China
Haigang Gong, University of Electronic Science and Technology of China
Xili Dai, University of Electronic Science and Technology of China
Fan Yang, University of Electronic Science and Technology of China
Nianbo Liu, University of Electronic Science and Technology of China
Ming Liu, University of Electronic Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v33i01.33019211

Abstract

With the breakthrough of deep learning, lip reading technologies are progressing extraordinarily rapidly. It is well known that Chinese is the most widely spoken language in the world. Unlike alphabetic languages, it involves more than 1,000 pronunciations (Pinyin) and nearly 90,000 pictographic characters (Hanzi), which makes lip reading of Chinese very challenging. In this paper, we implement visual-only Chinese lip reading of unconstrained sentences in a two-step end-to-end architecture (LipCH-Net), in which two deep neural network models perform Picture-to-Pinyin recognition (mouth motion pictures to pronunciations) and Pinyin-to-Hanzi recognition (pronunciations to text) respectively, and are then jointly optimized to improve overall performance. In addition, two modules in the Pinyin-to-Hanzi model are pre-trained separately on large auxiliary data before sequence-to-sequence training, to make the best of long sequence matches and avoid ambiguity. We collect six months of daily news broadcasts from the China Central Television (CCTV) website and semi-automatically label them into a 20.95 GB dataset with 20,495 natural Chinese sentences. When trained on the CCTV dataset, LipCH-Net outperforms all state-of-the-art lip reading frameworks. According to the results, our scheme not only accelerates training and reduces overfitting, but also overcomes the syntactic ambiguity of Chinese, providing a baseline for future related work.
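
The two-step structure described in the abstract can be illustrated with a toy sketch. This is not the LipCH-Net implementation: the paper uses two deep neural networks, whereas the sketch below swaps in a stubbed first stage and a dictionary-based greedy longest-match decoder for the second stage, purely to show why matching longer Pinyin sequences helps resolve homophone ambiguity. The names `LEXICON`, `pinyin_to_hanzi`, `lip_read`, and `picture_to_pinyin` are hypothetical, and the lexicon entries are illustrative only.

```python
from typing import Callable, List, Sequence

# Toy lexicon (an assumption for illustration): Pinyin syllable tuples -> Hanzi.
# A single syllable such as "shi" is ambiguous on its own; the longer entry
# ("shi", "jie") pins it down to a specific word.
LEXICON = {
    ("shi",): "是",
    ("jie",): "节",
    ("shi", "jie"): "世界",
    ("ni",): "你",
    ("hao",): "好",
    ("ni", "hao"): "你好",
}


def pinyin_to_hanzi(pinyin: Sequence[str]) -> str:
    """Stage 2 stand-in: greedy longest-match lookup instead of a neural model."""
    max_len = max(len(key) for key in LEXICON)
    out: List[str] = []
    i = 0
    while i < len(pinyin):
        # Try the longest candidate span first, then fall back to shorter ones.
        for n in range(min(max_len, len(pinyin) - i), 0, -1):
            key = tuple(pinyin[i:i + n])
            if key in LEXICON:
                out.append(LEXICON[key])
                i += n
                break
        else:
            # No lexicon entry covers this syllable; mark it as unknown.
            out.append("?")
            i += 1
    return "".join(out)


def lip_read(frames: Sequence[object],
             picture_to_pinyin: Callable[[Sequence[object]], List[str]]) -> str:
    """Compose the two stages: frames -> Pinyin -> Hanzi."""
    return pinyin_to_hanzi(picture_to_pinyin(frames))


# Stage 1 stub (hypothetical): a real system would predict Pinyin from
# mouth-region video frames; here the output is hard-coded.
def fake_picture_to_pinyin(frames: Sequence[object]) -> List[str]:
    return ["ni", "hao", "shi", "jie"]


print(lip_read([], fake_picture_to_pinyin))
```

In this toy, decoding `["shi"]` alone yields 是, while the same syllable followed by `"jie"` is absorbed into 世界, which mirrors in a very simplified way the abstract's point about using long sequence matches to avoid ambiguity.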
