Researchers from Google's AI division DeepMind and the University of Oxford have used artificial intelligence to create the most accurate lip-reading software ever. Using thousands of hours of TV footage from the BBC, scientists trained a neural network to annotate video footage with 46.8 percent accuracy. That might not seem that impressive at first, especially compared to AI accuracy rates when transcribing audio, but tested on the same footage, a professional human lip-reader was only able to get the right word 12.4 percent of the time.
The research follows similar work published by a separate group at the University of Oxford earlier this month. Using related techniques, these scientists were able to create a lip-reading program called LipNet that achieved 93.4 percent accuracy in tests, compared to 52.3 percent human accuracy. However, LipNet was only tested on specially recorded footage of volunteers speaking formulaic sentences. By comparison, DeepMind's software, known as "Watch, Listen, Attend, and Spell," was tested on far more challenging footage, transcribing natural, unscripted conversations from BBC politics shows.
More than 5,000 hours of footage from TV shows including Newsnight, Question Time, and the World Today were used to train DeepMind's "Watch, Listen, Attend, and Spell" program. The videos included 118,000 different sentences and some 17,500 unique words, compared to LipNet's test database, whose videos contained just 51 unique words.
DeepMind's researchers suggest that the program could have a host of applications, including helping hearing-impaired people understand conversations. It could also be used to annotate silent films, or allow you to control digital assistants like Siri or Alexa by just mouthing words to a camera (handy if you're using the program in public).
But when most people learn that an AI program has learned how to lip-read, their first thought is how it might be used for surveillance. Researchers say that there's still a big difference between transcribing brightly lit, high-resolution TV footage and grainy CCTV video with a low frame rate, but you can't ignore the fact that artificial intelligence seems to be closing this gap.