LipNet is a deep neural network for audio-visual speech recognition (AVSR). It was created by University of Oxford researchers Yannis Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. The technique, described in a paper published in November 2016,[1] decodes text from the movements of a speaker's mouth. Traditional visual speech recognition approaches split the problem into two stages: designing or learning visual features, and prediction. LipNet was the first end-to-end sentence-level lipreading model to learn spatiotemporal visual features and a sequence model simultaneously.[2] Audio-visual speech recognition has a range of practical applications, including improved hearing aids, support for the recovery and wellbeing of critically ill patients,[3] and speech recognition in noisy environments,[4] implemented, for example, in Nvidia's autonomous vehicles.[5]
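The LipNet paper reports training the model with the connectionist temporal classification (CTC) loss, under which the network emits one symbol distribution per video frame and the text is recovered by collapsing that sequence. The following is a minimal illustrative sketch of greedy CTC decoding, not the authors' implementation: the function name `ctc_greedy_decode`, the `_` blank symbol, the four-symbol alphabet, and the per-frame scores are all assumptions chosen for the example.

```python
from itertools import groupby

BLANK = "_"  # assumed symbol for the CTC blank label (illustrative only)


def ctc_greedy_decode(frame_scores, alphabet):
    """Turn per-frame symbol scores into text with greedy CTC decoding."""
    # Step 1: keep the highest-scoring symbol for each video frame.
    best = [
        alphabet[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # Step 2: merge runs of the same symbol into a single symbol.
    collapsed = [symbol for symbol, _ in groupby(best)]
    # Step 3: drop the blank label, which only separates symbols.
    return "".join(symbol for symbol in collapsed if symbol != BLANK)


# Hypothetical scores for six frames over a toy alphabet.
alphabet = [BLANK, "b", "i", "n"]
frame_scores = [
    [0.10, 0.70, 0.10, 0.10],  # "b"
    [0.10, 0.60, 0.20, 0.10],  # "b" again (same mouth shape held)
    [0.80, 0.10, 0.05, 0.05],  # blank
    [0.10, 0.10, 0.70, 0.10],  # "i"
    [0.10, 0.10, 0.20, 0.60],  # "n"
    [0.10, 0.10, 0.10, 0.70],  # "n" again
]
decoded = ctc_greedy_decode(frame_scores, alphabet)
print(decoded)  # prints "bin"
```

Because repeated frames are merged before blanks are removed, a genuinely doubled letter survives only when a blank frame separates the two occurrences. A full system would typically replace the greedy step with beam search, possibly combined with a language model.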
References
1. Assael, Yannis M.; Shillingford, Brendan; Whiteson, Shimon; de Freitas, Nando (December 16, 2016). "LipNet: End-to-End Sentence-level Lipreading". arXiv:1611.01599 [cs.LG].
2. "AI that lip-reads 'better than humans'". BBC News. November 8, 2016.
3. "Home Elementor". Liopa.
4. Vincent, James (November 7, 2016). "Can deep learning help solve lip reading?". The Verge.
5. Quach, Katyanna. "Revealed: How Nvidia's 'backseat driver' AI learned to read lips". The Register.