Publication: DeepSR a Deep Learning Neural Network for Speech Reading
Abstract
Speech reading, or lip reading, is the understanding of spoken language by watching the speaker. It is a natural way to aid comprehension when hearing becomes difficult: the observer attentively watches the speaker’s mouth movements as they form each word. Like sign language, speech reading is a skill that supports comprehension, and it is interesting yet difficult to learn. This thesis describes the development of a deep learning model that translates mouth movements into words. The model stacks a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network to learn from sequences of mouth images. The model is trained on a dataset drawn from the GRID corpus, which contains video recordings of 34 speakers, each of whom speaks 1,000 English sentences. The trained model predicts words from mouth video frames with a word accuracy of 52.25%, far exceeding the human accuracy of 14.47%.
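
To make the reported metric concrete, the following is a minimal sketch of one common way to compute word accuracy: the fraction of word positions at which the predicted word matches the reference word. The abstract does not state the exact definition used in the thesis, so the position-by-position matching here is an assumption, as are the function name `word_accuracy` and the example sentences, which are made up in the style of GRID commands and are not taken from the thesis.

```python
from typing import Sequence


def word_accuracy(references: Sequence[str], predictions: Sequence[str]) -> float:
    """Fraction of word positions predicted correctly, over all sentences.

    Assumption: each prediction is aligned word-for-word with its reference,
    which suits fixed-length sentences such as those in GRID.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")

    correct = 0
    total = 0
    for ref, pred in zip(references, predictions):
        ref_words = ref.lower().split()
        pred_words = pred.lower().split()
        total += len(ref_words)
        # Compare position by position; words missing from the prediction
        # count as errors because the total uses the reference length.
        correct += sum(r == p for r, p in zip(ref_words, pred_words))

    return correct / total if total else 0.0


# Hypothetical GRID-style sentences, for illustration only.
refs = ["bin blue at f two now", "place red by g nine again"]
preds = ["bin blue at s two now", "place green by g nine soon"]
score = word_accuracy(refs, preds)
print(f"word accuracy: {score:.2%}")  # 9 of 12 words match -> 75.00%
```

Under this definition, a word accuracy of 52.25% would mean that slightly more than half of the words in the evaluated sentences were predicted correctly. A definition based on edit distance (one minus the word error rate) would give different numbers when predictions differ in length from the references.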