arxiv:2505.13404

Granary: Speech Recognition and Translation Dataset in 25 European Languages

Published on May 19

Authors:

Nithin Rao Koluguri,

Nikolay Karpov,

Yifan Peng,

Sara Papi

Abstract

Granary significantly improves speech recognition and translation performance for low-resource European languages using a novel data collection and enhancement pipeline.

AI-generated summary

Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on the processed data by comparing their performance with that of models trained on previously curated datasets, for both high- and low-resource languages. Our findings show that these models achieve similar performance using approximately 50% less data. The dataset will be made available at https://hf.co/datasets/nvidia/Granary
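The abstract names hallucination filtering as one stage of the pseudo-labeling pipeline but does not describe how it works. As a rough illustration of what such a filter might check, the sketch below drops pseudo-labeled segments whose text repeats the same word sequence many times or implies an implausible speaking rate. The function names, the segment fields (`text`, `duration`), and both thresholds are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def max_ngram_repeats(text: str, n: int = 3) -> int:
    """Return how often the most frequent word n-gram occurs in text."""
    words = text.lower().split()
    if len(words) < n:
        return 0
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return max(counts.values())


def looks_hallucinated(
    text: str,
    duration_s: float,
    max_repeats: int = 3,          # illustrative threshold, not from the paper
    max_words_per_s: float = 6.0,  # illustrative threshold, not from the paper
) -> bool:
    """Heuristically flag a pseudo-label as a likely hallucination."""
    words = text.split()
    if not words or duration_s <= 0:
        return True
    # More words than anyone could plausibly speak in this much audio.
    if len(words) / duration_s > max_words_per_s:
        return True
    # The same trigram looping over and over is a common failure pattern.
    return max_ngram_repeats(text) > max_repeats


def filter_segments(segments: list[dict]) -> list[dict]:
    """Keep only segments whose pseudo-label passes the heuristics."""
    return [
        s for s in segments
        if not looks_hallucinated(s["text"], s["duration"])
    ]


segments = [
    {"text": "the weather in lisbon is mild today", "duration": 3.0},
    {"text": "thank you " * 6, "duration": 5.0},
    {"text": "", "duration": 2.0},
]
kept = filter_segments(segments)
```

With these inputs, only the first segment survives: the second repeats one trigram five times and the third is empty. A production filter would likely rely on stronger signals, such as model confidence or agreement between inference passes.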

