
Automatic Speech Recognition for Low-Resourced Middle Eastern Languages

Abstract

Despite significant advancements in language and speech technologies, many languages in the Middle East remain underserved, leading to a technological disparity that negatively impacts these languages. This paper presents a pioneering effort to address this issue by focusing on speech technologies for low-resourced languages in the Middle East. We introduce a community-driven, volunteer-based initiative to collect audio recordings for six languages spoken by an estimated population of 30 million speakers. Through this initiative, we collect over 40 hours of speech data, with 75% of utterances based on multilingual parallel corpora. In our experiments, we demonstrate the impact of data collection and model fine-tuning on the performance of speech technologies for these languages. This research serves as a crucial step towards preserving and promoting linguistic diversity in the Middle East while ensuring equal access to speech technologies for all language communities.

Overview

This repository contains the code for collecting speech data and fine-tuning Whisper models for six low-resourced Middle Eastern languages: Gilaki, Laki Kurdish, Hawrami, Mazandarani, Southern Kurdish, and Zazaki.

Key Features

  • 🎙️ Community-driven data collection via Telegram bot
  • 🎯 40+ hours of speech data across 6 languages
  • 🌐 Multilingual parallel corpora with English translations
  • 🚀 Fine-tuned Whisper models for ASR
  • 💬 Applications: ASR, speech-to-speech translation, and more

Data Collection

Telegram Bot (tg_bot/)

The Telegram bot facilitates community-driven audio data collection with quality control mechanisms.

1. Data Preparation (prepare_data.py)

Processes multilingual parallel corpora and creates language-specific CSV files:

cd tg_bot
python prepare_data.py

This script:

  • Reads common_sentences.csv containing parallel sentences
  • Sorts sentences by translation availability
  • Creates individual CSV files for each language in languages/
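A minimal sketch of this step, assuming pandas; the column names and the exact sorting criterion are illustrative, not taken from the repository:

import pandas as pd

# Illustrative sketch: split the parallel corpus into per-language CSVs.
# Column names ("english", one column per language) are assumptions.
df = pd.read_csv("common_sentences.csv")

# "Translation availability" interpreted here as the number of
# non-empty translations per row; most-translated sentences first.
df["n_translations"] = df.notna().sum(axis=1)
df = df.sort_values("n_translations", ascending=False)

for lang in ["gilaki", "laki", "hawrami", "mazandarani", "southern_kurdish", "zazaki"]:
    if lang in df.columns:
        df[df[lang].notna()][[lang, "english"]].to_csv(f"languages/{lang}.csv", index=False)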

2. Bot Deployment (bot.py)

The main bot implementation, with the following features:

  • User onboarding: Gender and language selection
  • Recording workflow: Present sentences, record audio, review & confirm
  • Quality control: Automatic duration checks, manual review options
  • Progress tracking: Skip problematic texts, track recorded sentences

To run the bot:

# Set up environment variables
export TOKEN_ID="your_telegram_bot_token"
export SEND_TO_CHANNEL="true"  # Optional
export CHANNEL_ID="your_channel_id"  # Optional

python bot.py
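For orientation, a minimal sketch of the recording workflow, assuming python-telegram-bot v20+ (the library choice, handler names, and file paths are assumptions; the actual bot additionally handles onboarding, review and confirmation, and skipping):

import os
from telegram.ext import ApplicationBuilder, CommandHandler, MessageHandler, filters

TOKEN = os.environ["TOKEN_ID"]
os.makedirs("recordings", exist_ok=True)

async def start(update, context):
    # Onboarding (gender and language selection) would happen here.
    await update.message.reply_text("Welcome! Send /record to get a sentence.")

async def handle_voice(update, context):
    voice = update.message.voice
    # Automatic duration check as a simple quality-control gate.
    if voice.duration < 1:
        await update.message.reply_text("Recording too short; please try again.")
        return
    audio = await voice.get_file()
    await audio.download_to_drive(f"recordings/{voice.file_unique_id}.ogg")
    await update.message.reply_text("Saved. Here is your next sentence...")

app = ApplicationBuilder().token(TOKEN).build()
app.add_handler(CommandHandler("start", start))
app.add_handler(MessageHandler(filters.VOICE, handle_voice))
app.run_polling()

Note that Telegram delivers voice messages as OGG/Opus, so a conversion step would be needed somewhere in the pipeline to produce the MP3 files described under Dataset Structure.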

3. Data Analysis (report.py)

Generates comprehensive statistics about the collected data:

python report.py

Provides:

  • Per-user contribution statistics
  • Language-wise data distribution
  • Gender demographics
  • Content and duration analytics
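A sketch of how such statistics might be computed with pandas; the metadata file name and column names are assumptions:

import pandas as pd

# Illustrative: aggregate collected recordings by speaker and language.
df = pd.read_csv("metadata.csv")

# Per-user contributions and total recorded duration (seconds).
print(df.groupby("user_id")["duration"].agg(["count", "sum"]))

# Language-wise distribution and gender demographics.
print(df.groupby("language").size())
print(df.groupby(["language", "gender"]).size())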

Dataset Structure

The collected data includes:

  • Audio files: MP3 format recordings
  • Metadata: CSV files with sentence text, English translations, speaker info
  • English TTS: Kokoro-generated speech for all English sentences
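Since the data is hosted on the Hugging Face Hub as razhan/DOLMA-speech (see Model Fine-tuning below), it can be loaded with the datasets library; the config and column names here are assumptions for illustration:

from datasets import load_dataset

ds = load_dataset("razhan/DOLMA-speech", "hawrami", split="train")  # config name assumed
sample = ds[0]
print(sample["audio"]["sampling_rate"], sample["sentence"])  # column names assumed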

Model Fine-tuning

Whisper Fine-tuning (finetune_whisper.py)

Fine-tune OpenAI's Whisper models for both monolingual and multilingual ASR.

Configuration

Edit train.sh to customize training parameters:

python finetune_whisper.py \
    --model_name_or_path="openai/whisper-base" \
    --language="persian" \
    --num_train_epochs="3" \
    --output_dir="./whisper-base-me" \
    --per_device_train_batch_size="32" \
    --learning_rate="1e-5" \
    --do_train \
    --do_eval

Training Modes

  1. Monolingual: Train separate models for each language
  2. Multilingual: Train a single model for all languages

The script automatically:

  • Loads data from razhan/DOLMA-speech dataset
  • Processes audio and text pairs
  • Handles both transcription and translation tasks
  • Applies text normalization and preprocessing
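A condensed sketch of what the loading and preprocessing steps above typically look like in the standard transformers Whisper recipe; the dataset config and text column name are assumptions:

from datasets import Audio, load_dataset
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-base", language="persian", task="transcribe"
)

ds = load_dataset("razhan/DOLMA-speech", "hawrami")  # config name assumed
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper expects 16 kHz

def prepare(batch):
    audio = batch["audio"]
    # Log-Mel input features for the encoder, token ids as decoder labels.
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids  # column assumed
    return batch

ds = ds.map(prepare, remove_columns=ds["train"].column_names)

Training then proceeds with Seq2SeqTrainer and a data collator that pads input features and label ids separately, as in the standard Whisper fine-tuning recipe.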

Results

Our experiments show:

  • Monolingual models achieve a 29-31% reduction in WER
  • Best performance: Hawrami (37.9% WER)
  • Significant improvements for Arabic-script languages

Citation

@inproceedings{hameed2025asr,
  title={Automatic Speech Recognition for Low-Resourced Middle Eastern Languages},
  author={Hameed, Razhan and Ahmadi, Sina and Hadi, Hanah and Sennrich, Rico},
  booktitle={Interspeech 2025},
  year={2025}
}

Acknowledgments

This work was supported by the Swiss National Science Foundation (MUTAMUR project) and Stanford SILICON. Special thanks to all community volunteers who contributed recordings.