With less than 20% of the world’s population speaking English as their first or second language, Google is working to make video voice dubbing more efficient with two deep learning technologies built on TensorFlow: cross-lingual voice transfer and lip reanimation.

The first technology keeps the voice similar to that of the original speaker, while the second adjusts the speaker’s lip movements in the video to better match the audio generated in the target language.

Google performs cross-lingual voice transfer by creating synthetic voices in the target language that best fit the speaker’s original voice. The technology works by first pre-training a multilingual text-to-speech (TTS) model based on the cross-language voice transfer approach. Google then fine-tuned the model parameters by retraining on a fixed mixing ratio of the adaptation data and the original multilingual data.
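
A minimal sketch of that fixed-ratio data mixing is shown below, using TensorFlow's `tf.data.Dataset.sample_from_datasets`. The datasets and the small stand-in model are illustrative assumptions, not Google's actual corpus or TTS architecture:

```python
import tensorflow as tf

# Random stand-ins for real (text features, mel-spectrogram) training pairs.
def dummy_pairs(n):
    return tf.data.Dataset.from_tensor_slices(
        (tf.random.uniform((n, 64)), tf.random.uniform((n, 80)))
    )

adaptation_ds = dummy_pairs(100)     # target speaker's recordings
multilingual_ds = dummy_pairs(1000)  # original multilingual TTS corpus

# Fixed mixing ratio: e.g. 30% adaptation data, 70% original data, so the
# model adapts to the new voice without forgetting its multilingual coverage.
mixed_ds = tf.data.Dataset.sample_from_datasets(
    [adaptation_ds.repeat(), multilingual_ds.repeat()],
    weights=[0.3, 0.7],
).batch(16).take(50)

# Stand-in for the pre-trained multilingual TTS model being fine-tuned.
tts_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(80),
])
tts_model.compile(optimizer=tf.keras.optimizers.Adam(1e-5), loss="mse")
tts_model.fit(mixed_ds, epochs=1)
```

Keeping the mixing ratio fixed (rather than fine-tuning on adaptation data alone) is what guards against the model drifting away from its original multilingual capabilities.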

For lip reanimation, Google trained a multistage model that maps audio to lip shapes and the appearance of the speaker. Using the speaker’s original videos for training, the team isolated the speaker’s face and represented it in a space that decouples 3D geometry, head pose, texture, and lighting. A GAN-based approach then blends the synthesized textures with the original video, before a super-resolution network refines the result.
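
To make the multistage idea concrete, here is a sketch of the first stage only: a small network mapping a window of mel-spectrogram frames to lip-shape parameters. The dimensions, layer choices, and the name `audio_to_lips` are assumptions for illustration, not Google's published architecture:

```python
import tensorflow as tf

N_MEL, WINDOW, N_LIP_PARAMS = 80, 16, 32  # assumed sizes

# Stage 1: audio window -> lip-shape coefficients (e.g. mouth-region
# parameters in the decoupled 3D face representation).
audio_to_lips = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, N_MEL)),
    tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv1D(128, 3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(N_LIP_PARAMS),
])
audio_to_lips.compile(optimizer="adam", loss="mse")

# Later stages (not shown) would render textures from the predicted lip
# shapes, blend them into the original frame with a GAN, and sharpen the
# output with a super-resolution network.
```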

Google’s new deep learning system for video dubbing combines natural language processing, speech recognition, and audio-video analysis to create more natural-sounding and accurate voice dubs.
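
At a high level, those components could be orchestrated roughly as in the sketch below. Every function here is a hypothetical stub standing in for one of the pieces the article names; none of them is an actual Google API:

```python
def extract_audio(video_path: str):
    ...  # audio-video analysis: pull the source audio track

def transcribe(audio) -> str:
    ...  # speech recognition: source-language transcript

def translate(text: str, target_language: str) -> str:
    ...  # natural language processing: translate the script

def synthesize_speech(text: str, voice_reference):
    ...  # cross-lingual voice transfer TTS in the target language

def reanimate_lips(video_path: str, dubbed_audio):
    ...  # lip reanimation: re-render mouth movements to match the dub

def dub_video(video_path: str, target_language: str):
    source_audio = extract_audio(video_path)
    transcript = transcribe(source_audio)
    translated = translate(transcript, target_language)
    dubbed_audio = synthesize_speech(translated, voice_reference=source_audio)
    return reanimate_lips(video_path, dubbed_audio)
```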

“We strongly believe that dubbing is a creative process. With these techniques, we strive to make a broader range of content available and enjoyable in a variety of other languages,” Google wrote in a blog post.