Speech Language Models for Under-Represented Languages: Insights from Wolof
This research details our approach to training a speech language model for Wolof.
The Challenge: When Languages Are Primarily Oral
Most African languages, including Wolof, are primarily oral. This presents unique challenges for AI development: limited text resources, non-standardized orthography, and speech patterns that include frequent code-switching between languages. Traditional approaches that rely on massive text corpora simply don't work.
Our approach started with a fundamental question: How do we build effective speech AI when the language itself resists written standardization?
High-Quality Speech Data
We collected 1.4 TB of raw Wolof speech from public web sources, prioritizing natural, spontaneous conversations over read speech. Through a rigorous filtering pipeline including source separation, diarization, voice activity detection, and quality scoring, we distilled this into 860 hours of high-quality audio.
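The final filtering stage can be pictured as a simple keep/drop decision over candidate segments. The sketch below is illustrative, not the actual pipeline code: the thresholds, the `quality` score, and the `Segment` fields are hypothetical stand-ins for the outputs of the upstream source-separation, diarization, and VAD steps.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    path: str       # audio file the segment comes from
    start: float    # segment start time in seconds
    end: float      # segment end time in seconds
    quality: float  # hypothetical quality score in [0, 1]

def keep(seg, min_dur=2.0, max_dur=30.0, min_quality=0.7):
    """Return True if the segment passes the (hypothetical) length and quality filters."""
    dur = seg.end - seg.start
    return min_dur <= dur <= max_dur and seg.quality >= min_quality

candidates = [
    Segment("a.wav", 0.0, 12.5, 0.91),   # kept
    Segment("a.wav", 12.5, 13.1, 0.95),  # dropped: too short
    Segment("b.wav", 0.0, 25.0, 0.40),   # dropped: low quality score
]
clean = [s for s in candidates if keep(s)]
print(len(clean))  # → 1
```

Applied at scale, this kind of aggressive filtering is what reduces 1.4 TB of raw crawled audio to a much smaller set of clean training hours.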
This Wolof-centric dataset became the foundation for continued pretraining of HuBERT, Meta's self-supervised speech model. Starting from a checkpoint trained on 960 hours of English (LibriSpeech), we continued pretraining on our Wolof data for just 33 epochs, far fewer than the hundreds of epochs needed to train from scratch.
Key Insight 1: Continued Pretraining Over Multilingual Models
Our continued-pretraining approach outperformed both the base HuBERT model (39.48 → 35.65 WER) and an African-centric model trained from scratch on 65,000 hours of multilingual data (41.11 WER). For a given target language, a modest amount of high-quality, language-specific data beats massive multilingual datasets, while requiring significantly less compute.
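For readers unfamiliar with the metric behind these numbers, word error rate (WER) is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the reference length (lower is better). A minimal pure-Python implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("salaam aleekum", "salam aleekum"))  # → 0.5 (1 substitution / 2 words)
```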
From Speech Recognition to Speech Understanding
With a strong speech encoder in place, we integrated it into a Wolof LLM (fine-tuned from Qwen2.5 3B) to create the first Speech Language Model for Wolof. Using a late-fusion architecture, we concatenate features from all HuBERT layers with text embeddings from the LLM, training only a lightweight alignment layer while keeping both the speech encoder and LLM frozen.
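The fusion step itself is conceptually simple. The toy sketch below illustrates the idea with pure Python and made-up dimensions (the real layer count and embedding sizes differ): per-layer HuBERT features for a frame are concatenated, and a single trainable linear "alignment" projection maps them into the LLM's embedding space, while the encoder and LLM weights stay frozen.

```python
import random

random.seed(0)
# Hypothetical sizes for illustration only (not the paper's dimensions).
L, D_SPEECH, D_LLM = 13, 8, 16

# One speech frame: one D_SPEECH-dim feature vector per HuBERT layer.
frame_layers = [[random.gauss(0, 1) for _ in range(D_SPEECH)] for _ in range(L)]

# Concatenate features from all layers -> vector of length L * D_SPEECH.
fused = [x for layer in frame_layers for x in layer]

# The only trainable parameters: the alignment projection into LLM space.
W = [[random.gauss(0, 0.01) for _ in range(len(fused))] for _ in range(D_LLM)]
aligned = [sum(w * x for w, x in zip(row, fused)) for row in W]

print(len(fused), len(aligned))  # → 104 16
```

In practice the projected speech features are interleaved with the LLM's text embeddings, so gradients only flow into the small alignment layer.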
Our Results
The Speech LLM delivers compelling improvements:
- 18.4% relative WER reduction over the HuBERT encoder alone (35.65 → 29.09 WER)
- Speech-to-text translation capabilities unlocked (33.08 ChrF score)
- All improvements achieved while the LLM remained frozen, suggesting the gains come from leveraging the linguistic knowledge already encoded in the language model.
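The 18.4% figure above is the relative WER reduction implied by the two scores, which is easy to verify:

```python
baseline, fused = 35.65, 29.09          # WER of HuBERT alone vs. the Speech LLM
relative_gain = (baseline - fused) / baseline
print(f"{relative_gain:.1%}")           # → 18.4%
```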
Key Insight 2: LLMs Provide More Than Text Generation
By integrating our speech encoder with a Wolof LLM, we not only improved transcription accuracy but also enabled the model to perform speech translation, a capability not present in the speech encoder alone. This shows that LLMs trained on text can effectively augment speech understanding, even for underrepresented languages.
What We Learned About Chain-of-Thought
We experimented with multi-step Chain-of-Thought (CoT) reasoning, training the model to phonemize before transcribing, or to translate before transcribing. Interestingly, these intermediate steps didn't improve performance with our 3B-parameter model, likely because the model entered repetition loops. We expect larger models to benefit more from multi-step reasoning.
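The failure mode is easy to detect mechanically. The hypothetical check below (not from the paper) flags the kind of repetition loop a small model can fall into during long multi-step decoding: some n-gram repeated back-to-back.

```python
def has_repetition_loop(tokens, n=2, repeats=3):
    """True if some n-gram occurs `repeats` times consecutively."""
    for i in range(len(tokens) - n * repeats + 1):
        gram = tokens[i:i + n]
        if all(tokens[i + k * n:i + (k + 1) * n] == gram for k in range(repeats)):
            return True
    return False

good = "ñu ngi jàng wolof".split()
looping = ("ñu ngi " * 5).split()   # degenerate output stuck in a 2-gram cycle
print(has_repetition_loop(good), has_repetition_loop(looping))  # → False True
```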
Key Insight 3: Model Size Matters for Complex Reasoning
While our Speech LLM successfully performs direct transcription and translation, multi-step CoT requires larger model capacity. Smaller models may struggle with the sequential reasoning needed for phonemization or reformulation steps.
Our Learnings in Summary
This work demonstrates that building sophisticated speech AI for underrepresented languages is not only possible but practical:
- Data over compute: 860 hours of high-quality, language-specific data outperforms 65,000 hours of multilingual data
- Continued pretraining is efficient: Leveraging existing models reduces training from hundreds to dozens of epochs
- LLM integration unlocks capabilities: Combining speech encoders with text LLMs enables new tasks like translation without additional speech-to-speech training data
For the millions who speak Wolof and other oral languages, this research opens the door to voice-based interfaces, automated transcription, and cross-language communication.
Open Source & What's Next
Looking ahead, we plan to scale our model size to unlock the complex reasoning needed for multi-step Chain-of-Thought, the reasoning chains that smaller models struggled to sustain.
Most importantly, we aim to push beyond simple transcription and translation toward true speech understanding, enabling the model to follow instructions and answer questions directly from audio. We will also validate this efficient recipe by expanding to other underrepresented languages, starting with Bambara.
True to our commitment to digital independence, we're releasing all models and code openly.
Read the full paper: arXiv:2509.15362
Try our models: All checkpoints and code available on our GitHub and HuggingFace
This research was conducted in collaboration with LORIA and CNRS. Special thanks to Christophe Cerisara and Irina Illina for their invaluable contributions.