NVIDIA has released an open-source multilingual speech dataset and AI models designed to support speech recognition and translation across 25 European languages, including underrepresented languages like Croatian, Estonian and Maltese. The initiative addresses the challenge that only a tiny fraction of the world's approximately 7,000 languages are supported by AI language models.
The release includes three key components developed in collaboration with researchers from Carnegie Mellon University and Fondazione Bruno Kessler. Granary, a massive open-source corpus, contains around one million hours of audio, including nearly 650,000 hours for speech recognition and over 350,000 hours for speech translation. The dataset covers nearly all of the European Union's 24 official languages, plus Russian and Ukrainian.
NVIDIA Canary-1b-v2, a billion-parameter model trained on Granary, provides high-quality transcription of European languages and translation between English and the two dozen other supported languages. The model tops Hugging Face's leaderboard for open multilingual speech recognition accuracy while expanding the Canary family's supported languages from four to 25. It offers transcription and translation quality comparable to that of models three times its size while running inference up to ten times faster.
The streamlined NVIDIA Parakeet-tdt-0.6b-v3, featuring 600 million parameters, is designed for real-time and large-volume transcription applications. It achieves the highest throughput among multilingual models on the Hugging Face leaderboard and can transcribe audio segments of up to 24 minutes in a single inference pass, with automatic language detection.
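A 24-minute single-pass window implies that longer recordings must be split before inference. The sketch below illustrates that chunking arithmetic only; the 24-minute figure comes from the announcement, while the sample rate, the non-overlapping boundaries and the helper itself are assumptions for illustration, not part of Parakeet's API.

```python
# Split long audio into segments that fit a single-pass inference window.
# The 24-minute limit is from NVIDIA's announcement; the 16 kHz sample rate
# and non-overlapping boundaries are illustrative assumptions.

MAX_SEGMENT_SECONDS = 24 * 60  # single-pass window reported for Parakeet-tdt-0.6b-v3
SAMPLE_RATE = 16_000           # common ASR sample rate; an assumption, not a spec

def chunk_boundaries(total_samples: int,
                     max_seconds: int = MAX_SEGMENT_SECONDS,
                     sample_rate: int = SAMPLE_RATE) -> list[tuple[int, int]]:
    """Return (start, end) sample offsets for non-overlapping segments."""
    max_samples = max_seconds * sample_rate
    return [(start, min(start + max_samples, total_samples))
            for start in range(0, total_samples, max_samples)]

# A one-hour recording needs three passes at a 24-minute window.
one_hour = 60 * 60 * SAMPLE_RATE
print(len(chunk_boundaries(one_hour)))  # → 3
```

In a real pipeline the boundaries would typically overlap slightly to avoid cutting words, but the segment count works out the same way.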
The development utilised NVIDIA's NeMo Speech Data Processor toolkit, whose processing pipeline converts unlabelled audio into structured, high-quality training data without resource-intensive human annotation. The team demonstrated that Granary requires approximately half as much training data as other popular datasets to achieve target accuracy levels for automatic speech recognition and automatic speech translation.
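The general pattern behind such annotation-free pipelines is pseudo-labelling: an existing model transcribes the raw audio, and only segments that pass quality heuristics are kept. A minimal sketch of that filtering step follows; the `Segment` structure, the thresholds and the heuristics are assumptions for illustration and do not reflect the NeMo Speech Data Processor's actual interfaces.

```python
# Illustrative pseudo-labelling filter in the spirit of the pipeline described
# above: machine-generated transcripts survive only if they pass quality
# heuristics, so no human annotation is required. All field names and
# thresholds here are assumptions, not NeMo SDP's API.
from dataclasses import dataclass

@dataclass
class Segment:
    audio_path: str
    text: str          # transcript produced by an existing ASR model
    confidence: float  # model's average token confidence, 0..1
    duration_s: float

def keep(seg: Segment,
         min_confidence: float = 0.9,
         min_chars_per_second: float = 2.0,
         max_chars_per_second: float = 30.0) -> bool:
    """Heuristic filter: drop low-confidence or implausibly dense/sparse text."""
    if seg.confidence < min_confidence or seg.duration_s <= 0:
        return False
    rate = len(seg.text) / seg.duration_s
    return min_chars_per_second <= rate <= max_chars_per_second

raw = [
    Segment("a.wav", "dobar dan, kako ste?", 0.96, 2.1),
    Segment("b.wav", "???", 0.41, 3.0),  # low confidence: dropped
    Segment("c.wav", "x", 0.95, 8.0),    # too sparse, likely silence: dropped
]
clean = [s for s in raw if keep(s)]
print([s.audio_path for s in clean])  # → ['a.wav']
```

Filtering out weak pseudo-labels rather than fixing them by hand is what lets a corpus of this scale be assembled, and stricter filters help explain why less Granary data is needed per unit of accuracy.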
The tools enable developers to scale AI applications supporting global users through multilingual chatbots, customer service voice agents and near-real-time translation services. Both models provide accurate punctuation, capitalisation and word-level timestamps in their outputs, supporting production-scale enterprise implementations.
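Word-level timestamps are what make outputs directly usable for captioning: adjacent words can be grouped into caption lines with start and end times. A minimal sketch, assuming a hypothetical `(word, start, end)` tuple format rather than either model's actual output schema:

```python
# Group word-level timestamps into caption lines of bounded length.
# The (word, start_s, end_s) tuple format is an assumption for illustration;
# it is not the actual output schema of Canary or Parakeet.

def to_captions(words: list[tuple[str, float, float]],
                max_chars: int = 40) -> list[tuple[float, float, str]]:
    """Return (start, end, text) caption lines no longer than max_chars."""
    captions, line, start, prev_end = [], [], None, 0.0
    for word, w_start, w_end in words:
        if line and len(" ".join(line + [word])) > max_chars:
            captions.append((start, prev_end, " ".join(line)))
            line, start = [], None
        if start is None:
            start = w_start
        line.append(word)
        prev_end = w_end
    if line:
        captions.append((start, prev_end, " ".join(line)))
    return captions

words = [("Hello,", 0.0, 0.4), ("welcome", 0.5, 0.9), ("to", 1.0, 1.1),
         ("the", 1.2, 1.3), ("multilingual", 1.4, 2.0), ("demo.", 2.1, 2.5)]
for start, end, text in to_captions(words, max_chars=20):
    print(f"[{start:.1f}-{end:.1f}] {text}")
# → [0.0-1.1] Hello, welcome to
#   [1.2-2.0] the multilingual
#   [2.1-2.5] demo.
```

Because punctuation and capitalisation arrive already in the transcript, caption text needs no post-processing beyond this grouping.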
Organisations can leverage these open-source resources to develop more inclusive speech technologies that better reflect European linguistic diversity while reducing training data requirements. The availability of high-accuracy and high-throughput models under permissive licensing enables enterprises to customise solutions for specific applications without the computational overhead of larger models. The methodology can also be adapted to additional languages and use cases, accelerating enterprise speech AI innovation globally.