Breakthrough Achieved: Global Language Dataset Expands To 133 Languages

The Common Voice team has launched the 20th release of its multilingual speech dataset, a milestone that adds four new languages: Aragonese, IsiNdebele, Southern Sotho, and Tupuri. The update brings the total number of languages in the dataset to 133, thanks to the work of language activists, translators, and contributors.

The dataset now includes contributions made through December 6th, 2024, adding 566 new hours of speech and 515 newly validated hours. This brings the total to 33,150 hours of speech data, of which 22,108 hours have been validated through quality checks crowdsourced from the community.
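As a quick sanity check on the figures quoted above, the short Python sketch below derives the validated share and the totals implied for the previous release. It assumes the new hours were simply added to the earlier totals, which the announcement suggests but does not spell out; the function and variable names are illustrative only and are not part of any Common Voice tooling.

```python
# Figures quoted in the announcement, in hours of speech.
TOTAL_HOURS = 33_150
VALIDATED_HOURS = 22_108
NEW_HOURS = 566
NEW_VALIDATED_HOURS = 515


def release_summary():
    """Derive secondary figures from the quoted totals.

    Assumption: the new hours are a plain addition to the previous
    release, so the earlier totals can be recovered by subtraction.
    """
    return {
        "validated_share": VALIDATED_HOURS / TOTAL_HOURS,
        "unvalidated_hours": TOTAL_HOURS - VALIDATED_HOURS,
        "implied_previous_total": TOTAL_HOURS - NEW_HOURS,
        "implied_previous_validated": VALIDATED_HOURS - NEW_VALIDATED_HOURS,
    }


summary = release_summary()
print(f"Validated share: {summary['validated_share']:.1%}")
print(f"Hours awaiting validation: {summary['unvalidated_hours']:,}")
print(f"Implied previous total: {summary['implied_previous_total']:,}")
print(f"Implied previous validated: {summary['implied_previous_validated']:,}")
```

On these numbers, roughly two thirds of the recorded hours have been validated, and about 11,000 hours are still waiting for community review.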

The Common Voice dataset has become a cornerstone of linguistic research and community-driven projects, and its open, collaborative nature lets users both build on the data and contribute to it. The team invites contributors, dataset users, and language activists to join the conversation in its new Discord community or to email the team at [email protected].

With this latest update, the team is confident that the dataset will remain a vital resource for researchers and developers working at the intersection of language, culture, and technology. The four new languages underscore the importance of linguistic diversity in the digital age, where access to varied linguistic resources can support understanding, communication, and cooperation across cultures.

By providing an open and inclusive platform for language data, the Common Voice team is helping to build a global community in which diverse voices are valued and celebrated. The expanded language coverage also makes the dataset an increasingly valuable resource for language preservation and promotion.
