DeepSR a Deep Learning Neural Network for Speech Reading
Citation: Winoto, Basuki. 2018. DeepSR a Deep Learning Neural Network for Speech Reading. Master's thesis, Harvard Extension School.
Abstract: Speech reading, or lip reading, is the understanding of spoken language by watching the speaker. It is a natural way to improve comprehension when hearing becomes difficult: one attentively observes the speaker’s mouth movements as they form each word. Alongside sign language, speech reading is a skill that aids comprehension. It is an interesting skill, yet difficult to learn.
This thesis describes the development of a deep learning model that translates mouth movements into words. The model stacks a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network to learn from sequences of mouth images.
The model is trained on data from the GRID corpus, which contains video recordings of 34 speakers, each speaking 1000 sentences in English. The trained model predicts words from mouth video frames with a word accuracy of 52.25%, far exceeding the human accuracy of 14.47%.
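As a rough illustration of what a word accuracy figure measures, the sketch below computes the fraction of correctly predicted words over a set of sentences. It assumes position-by-position matching against fixed-length reference sentences; the thesis may define the metric differently. The function name `word_accuracy` and the example sentences are hypothetical and are not taken from the thesis or its data.

```python
def word_accuracy(references, predictions):
    """Fraction of reference word positions that the prediction gets right.

    Assumption: words are compared position by position, which suits
    fixed-length sentences. This is an illustrative definition only.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")

    correct = 0
    total = 0
    for ref, hyp in zip(references, predictions):
        ref_words = ref.lower().split()
        hyp_words = hyp.lower().split()
        total += len(ref_words)
        # zip stops at the shorter sentence, so missing words count as errors
        correct += sum(r == h for r, h in zip(ref_words, hyp_words))

    return correct / total if total else 0.0


# Hypothetical reference and predicted sentences, made up for this example
references = ["bin blue at f two now", "place red by g nine again"]
predictions = ["bin blue at s two now", "place green by g five again"]

accuracy = word_accuracy(references, predictions)
print(f"word accuracy: {accuracy:.2%}")  # prints "word accuracy: 75.00%"
```

Here 9 of the 12 reference words are predicted correctly, giving 75%. Under this definition, the reported 52.25% would mean that slightly more than half of the words in the evaluated sentences were predicted correctly.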
Citable link to this page: https://nrs.harvard.edu/URN-3:HUL.INSTREPOS:37364544