Google today announced Cloud Text-to-Speech, which brings its text-to-speech synthesis to the Google Cloud Platform. According to the company, developers have been asking for ways to add text-to-speech to their applications.
The text-to-speech technology is powered by machine learning and comes with 32 voices across 12 languages and variants. It also includes a collection of high-fidelity voices generated by WaveNet, a neural network trained on speech samples. The WaveNet voices are described as more natural-sounding and listener-preferred, and they can be built from fewer audio samples and in less time than voices from other TTS technologies.
WaveNet is a generative model for raw audio developed by DeepMind. According to the company, it produces more natural-sounding speech that listeners prefer over the output of other text-to-speech technologies.
“In tests, people gave the new US English WaveNet voices an average mean-opinion-score (MOS) of 4.1 on a scale of 1-5 — over 20% better than for standard voices and reducing the gap with human speech by over 70%,” Dan Aharon, Google Cloud AI product manager, wrote in the announcement. “As WaveNet voices also require less recorded audio input to produce high quality models, we expect to continue to improve both the variety as well as quality of the WaveNet voices available to Cloud customers in the coming months.”
While WaveNet was first released in 2016, an updated version runs on Google’s Cloud TPU machine learning infrastructure.
“The new, improved WaveNet model generates raw waveforms 1,000 times faster than the original model, and can generate one second of speech in just 50 milliseconds,” Aharon wrote. “In fact, the model is not just quicker, but also higher-fidelity, capable of creating waveforms with 24,000 samples a second. We’ve also increased the resolution of each sample from 8 bits to 16 bits, producing higher quality audio for a more human sound.”
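To put the quoted figures in perspective, the arithmetic can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: the function names are illustrative, and the single mono channel is an assumption that the announcement does not state.

```python
# Figures quoted in the announcement.
AUDIO_MS = 1_000          # one second of speech
GENERATION_MS = 50        # time the updated model needs to generate it
SAMPLE_RATE_HZ = 24_000   # samples per second
OLD_BITS, NEW_BITS = 8, 16

# Assumption (not stated in the article): a single mono channel.
CHANNELS = 1


def real_time_factor(audio_ms: int, generation_ms: int) -> float:
    """Seconds of audio produced per second of generation time."""
    return audio_ms / generation_ms


def raw_bitrate_bps(sample_rate_hz: int, bits_per_sample: int,
                    channels: int = CHANNELS) -> int:
    """Bitrate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def amplitude_levels(bits_per_sample: int) -> int:
    """Number of distinct values a single sample can take."""
    return 2 ** bits_per_sample


if __name__ == "__main__":
    print("Real-time factor:", real_time_factor(AUDIO_MS, GENERATION_MS))
    print("Raw bitrate (bits/s):", raw_bitrate_bps(SAMPLE_RATE_HZ, NEW_BITS))
    print("Levels at 8 bits:", amplitude_levels(OLD_BITS))
    print("Levels at 16 bits:", amplitude_levels(NEW_BITS))
```

Under those assumptions, one second of speech in 50 milliseconds works out to 20 times faster than real time, 24,000 16-bit samples a second amount to 384,000 bits (48,000 bytes) of raw audio per second, and the move from 8 to 16 bits raises the number of possible values per sample from 256 to 65,536.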
But Aharon added that Cloud Text-to-Speech offers plenty of capability even without the WaveNet voices.
“Cloud Text-to-Speech correctly pronounces complex text such as names, dates, times and addresses for authentic sounding speech right out of the gate,” he wrote. “Cloud Text-to-Speech also allows you to customize pitch, speaking rate, and volume gain, and supports a variety of audio formats, including MP3 and WAV.”
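For a rough idea of how those settings might be passed to the service, the sketch below assembles a JSON request body in Python without sending it. The helper `build_synthesize_request` is hypothetical. The endpoint, the field names, the value ranges and the example voice name are assumptions based on the service's public REST reference rather than details from the announcement, so they should be checked against the current documentation.

```python
import json

# Assumed endpoint, names and limits (not from the announcement).
# Verify them against the current API reference before relying on them.
ENDPOINT = "https://texttospeech.googleapis.com/v1/text:synthesize"
AUDIO_ENCODINGS = {"MP3", "LINEAR16"}  # LINEAR16 is the uncompressed, WAV-style option
SPEAKING_RATE_RANGE = (0.25, 4.0)      # 1.0 is normal speed
PITCH_RANGE = (-20.0, 20.0)            # semitones, 0.0 is unchanged
VOLUME_GAIN_DB_RANGE = (-96.0, 16.0)   # decibels, 0.0 is unchanged


def _check(name, value, bounds):
    """Return value if it lies within bounds, otherwise raise ValueError."""
    low, high = bounds
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return value


def build_synthesize_request(text, language_code="en-US", voice_name=None,
                             audio_encoding="MP3", speaking_rate=1.0,
                             pitch=0.0, volume_gain_db=0.0):
    """Hypothetical helper: build the JSON body for a synthesis request."""
    if audio_encoding not in AUDIO_ENCODINGS:
        raise ValueError(f"unsupported audio encoding: {audio_encoding}")
    voice = {"languageCode": language_code}
    if voice_name:
        voice["name"] = voice_name
    return {
        "input": {"text": text},
        "voice": voice,
        "audioConfig": {
            "audioEncoding": audio_encoding,
            "speakingRate": _check("speaking_rate", speaking_rate, SPEAKING_RATE_RANGE),
            "pitch": _check("pitch", pitch, PITCH_RANGE),
            "volumeGainDb": _check("volume_gain_db", volume_gain_db, VOLUME_GAIN_DB_RANGE),
        },
    }


if __name__ == "__main__":
    body = build_synthesize_request(
        "Your appointment is on March 3 at 4:30 p.m.",
        voice_name="en-US-Wavenet-A",  # illustrative voice name
        speaking_rate=0.9,
        pitch=-2.0,
    )
    print(json.dumps(body, indent=2))
```

An authenticated POST of that body to the assumed endpoint would be expected to return the synthesized audio. Validating the ranges locally, as the sketch does, simply surfaces an out-of-range pitch or speaking rate before any network call is made.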
Convincing TTS has numerous applications, and Aharon highlighted a few: response systems for call centers, IoT devices that talk back, and the conversion of text media to spoken audio.
Cloud Text-to-Speech is now available in public beta.