Publication: DeepSR a Deep Learning Neural Network for Speech Reading
Date
2020-03-03
Authors
Published Version
Citation
Winoto, Basuki. 2018. DeepSR a Deep Learning Neural Network for Speech Reading. Master's thesis, Harvard Extension School.
Abstract
Speech reading, or lip reading, is the understanding of spoken language by watching the speaker. It is a natural way to aid comprehension when hearing becomes difficult: one attentively observes the speaker’s mouth movements as they form each word. Like sign language, speech reading is a skill that helps with comprehension. It is an interesting skill, yet difficult to learn.
This thesis describes the development of a deep learning model that translates mouth movement into words. The model stacks a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network to learn from sequences of mouth images.
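To make the stacking concrete, the following is a minimal PyTorch sketch of a CNN followed by an LSTM. It is not the thesis's architecture: the class name `CnnLstmSketch`, the layer sizes, the grayscale input, the default of 51 output classes, and the choice to classify one word per clip are all assumptions made for illustration.

```python
import torch
from torch import nn


class CnnLstmSketch(nn.Module):
    """Hypothetical CNN + LSTM stack; every size here is illustrative."""

    def __init__(self, num_words=51, feat_dim=64, hidden=128):
        super().__init__()
        # The CNN turns each mouth image into one feature vector.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        # The LSTM reads the per-frame feature vectors in temporal order.
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_words)

    def forward(self, frames):
        # frames: (batch, time, 1, height, width), grayscale mouth crops (assumed)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1))      # (batch * time, feat_dim)
        out, _ = self.lstm(feats.reshape(b, t, -1)) # (batch, time, hidden)
        return self.head(out[:, -1])                # word logits from the last step


model = CnnLstmSketch()
logits = model(torch.zeros(2, 5, 1, 32, 64))  # 2 clips of 5 frames, 32x64 pixels
print(logits.shape)
```

Running the CNN on all frames at once and reshaping the result into a sequence keeps the spatial and temporal stages separate, which is the essence of the stacked design.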
The model is trained with a dataset from the GRID corpus. This corpus contains video recordings of 34 speakers, each of whom speaks 1000 sentences in English. The trained model is capable of predicting words from the mouth video frames with a word accuracy of 52.25%, far exceeding the human accuracy of 14.47%.
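As an illustration of how a word accuracy figure can be computed, the sketch below counts the fraction of reference words that the prediction matches at the same position. This definition is an assumption, since the abstract does not state the exact metric used in the thesis. The function name `word_accuracy` and the sample sentences are hypothetical; the sentences only follow the six-word GRID sentence pattern.

```python
def word_accuracy(references, predictions):
    """Fraction of reference words matched at the same position (assumed metric)."""
    correct = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.split()
        hyp_words = hyp.split()
        total += len(ref_words)
        # zip stops at the shorter sentence, so missing words count as errors.
        correct += sum(r == h for r, h in zip(ref_words, hyp_words))
    return correct / total if total else 0.0


# Hypothetical sentences in the GRID style; these are not thesis data.
refs = ["bin blue at f two now", "place red by g nine soon"]
preds = ["bin blue at s two now", "place green by g nine again"]
print(word_accuracy(refs, preds))  # 9 of 12 words match -> 0.75
```

Position-wise matching suits fixed-length sentences such as GRID's; for variable-length output, an edit-distance-based word error rate would be the more usual choice.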
Keywords
Deep Learning, Speech Reading, Neural Network
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material (LAA), as set forth at Terms of Service