The team at Mozilla recently announced the release of the latest Common Voice dataset. Common Voice is an initiative created to help teach machines how real people speak, and this newest dataset reaches a major milestone: more than 20,000 hours of open-source speech data that anyone, anywhere can use.
With this release, the dataset has nearly doubled in size in the past year. The release also adds the new languages Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona, and Cantonese, as well as more speech data from female speakers.
Common Voice also has cross-sector backing from entities such as the Gates Foundation, GIZ, NVIDIA, and the UK FCDO.
According to Mozilla, this is the world’s largest multilingual, open-source dataset, and researchers, academics, and developers around the world use it to train voice-enabled technology and make it more inclusive and accessible.
Highlights from the latest dataset include:
- 27 languages now offer at least 100 hours of speech data
- Nine languages now have at least 500 hours of speech data
- Nine languages now have at least 45% of their gender tags as female
- The Catalan community’s Project AINA fueled major growth
- The highest community participation in decision making, thanks to the Common Voice language Rep Cohort
“We are so glad to see new languages and increased representation in our latest dataset release. Our contributors have made this possible — from voice donations, to initiating their language in our project, to opening new opportunities for people to build voice technology tools that can support every language spoken across the world,” said Hillary Juma, Common Voice community manager.