Soynade's Open Source Month
From February 19 to March 17, we are running Soynade's Open Source Month: four weeks dedicated to sharing models, datasets, and research artifacts with the community.
Each week, we will release cutting-edge AI models, high-quality datasets, and technical resources that anyone can study, use, improve, and build upon.
Our goal is ambitious: to actively stimulate and strengthen the African open-source ecosystem. We believe frontier technology, rigorous research, and high-value datasets should not be locked away; they should circulate and be audited, extended, and collectively improved.
By opening our work, we aim to contribute to a collaborative ecosystem where innovation is shared and accessible, and where African researchers, engineers, and builders do not just consume technology but shape it.
Release 1 - Oolel-Translator
Oolel-Translator is a batch inference pipeline for generating synthetic translation data at scale. Point it at a Hugging Face dataset, pick your model, and it produces parallel text - thousands, if not millions, of rows.
- Runs on vLLM with a PyTorch fallback
- Reads JSONL, JSON, or CSV inputs
- Pulls datasets from the Hub and pushes outputs back automatically
- Supports multi-GPU sharding and configurable memory utilization
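To make the input handling and sharding ideas above concrete, here is a minimal standard-library sketch. It is not Oolel-Translator's actual code: the function names `load_rows` and `shard`, the extension-based format detection, and the round-robin split are all assumptions chosen for illustration, and the real pipeline may organize this differently.

```python
import csv
import json
from pathlib import Path


def load_rows(path):
    """Load a list of dict rows from a .jsonl, .json, or .csv file.

    Hypothetical helper: the format is guessed from the file extension.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".jsonl":
        # One JSON object per line; blank lines are skipped.
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Assumption: a JSON input is a top-level array of row objects.
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of rows")
        return data
    if suffix == ".csv":
        # The header row supplies the dict keys; all values are strings.
        with path.open(encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f))
    raise ValueError(f"unsupported input format: {suffix}")


def shard(rows, num_shards, shard_index):
    """Return the rows assigned to one worker using a round-robin split."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index out of range")
    return rows[shard_index::num_shards]
```

With a split like this, each GPU worker would process only `shard(rows, num_gpus, worker_id)`, and the per-worker outputs could be concatenated afterwards. A round-robin split keeps shard sizes within one row of each other, which helps when rows have similar lengths.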
Use it to build parallel data for the language pair of your choice and fine-tune models on the output, or to expand an existing dataset.
Get the code: github.com/soynade-research/oolel-translator
Release 1 - Datasets
We built these with Oolel-Translator and are releasing them alongside it.
AfVoices-Translated (Bambara–English)
The African Next Voices (ANV) Bambara ASR dataset has ~159 hours of human-corrected audio (260k samples). What it did not have was English translations.
We translated the full corrected subset using Oolel-Translator. Acoustic tags ([um], [noise], [pause]) are preserved, so the translations stay aligned with the audio. If you are training speech translation, Bambara NMT, or bilingual ASR, this is ready to use.
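If you want to check tag preservation on your own subset, a small sketch like the one below may help. It assumes tags are short bracketed tokens such as [um], [noise], and [pause]; the names `extract_tags` and `tags_preserved` are hypothetical, and this is not the validation procedure used to build the dataset.

```python
import re
from collections import Counter

# Assumption: an acoustic tag is any non-empty bracketed token, e.g. [um].
TAG_PATTERN = re.compile(r"\[[^\[\]]+\]")


def extract_tags(text):
    """Return all bracketed tags in the order they appear in the text."""
    return TAG_PATTERN.findall(text)


def tags_preserved(source, translation):
    """Check that both strings contain the same tags the same number of times.

    Tags are compared as a multiset rather than a sequence, because word
    order can legitimately change between source and translation.
    """
    return Counter(extract_tags(source)) == Counter(extract_tags(translation))
```

Rows where `tags_preserved` returns False are candidates for manual review rather than automatic rejection, since a mismatch only signals that a tag was dropped, duplicated, or altered.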
Dataset available on Hugging Face: AfVoices-Translated.
Original audio data collected and corrected by RobotsMali.
FineWeb-Wolof-50k
50,000 rows from FineWeb2-HQ, translated into Wolof using our Oolel-v0.1 model. It contains educational and informative web text spanning a broad range of topics.
Dataset available on Hugging Face: FineWeb-Wolof-50k.
Source data from EPFML's FineWeb2-HQ.
More releases are coming every week through March 17. Follow us on GitHub and Hugging Face.