The world's largest African language corpus.
Lughatna is a community data engine that turns native speakers into infrastructure contributors. Every recording, annotation, and validation directly trains the next generation of low-resource language models.
From your voice to production dataset in 4 steps
Record
Contributors record voice samples via the Lughatna mobile app or USSD gateway. Each recording captures speaker metadata: region, age range, gender, dialect tag.
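As a rough illustration, the metadata attached to each recording could be modeled as a small record type. This is a minimal sketch: the class names, field names, and sample values (including the `wo-dakar` tag) are hypothetical, and only the four metadata categories come from the description above.

```python
from dataclasses import dataclass, asdict


# Hypothetical field names; the source lists only region, age range,
# gender, and dialect tag as the captured speaker metadata.
@dataclass(frozen=True)
class SpeakerMetadata:
    region: str
    age_range: str
    gender: str
    dialect_tag: str


@dataclass(frozen=True)
class Recording:
    audio_path: str
    language: str
    speaker: SpeakerMetadata


def to_record(rec: Recording) -> dict:
    """Flatten a recording (including nested metadata) into a plain dict."""
    return asdict(rec)


# Illustrative values only, not real corpus data.
sample = Recording(
    audio_path="clip_0001.wav",
    language="Wolof",
    speaker=SpeakerMetadata(
        region="Dakar",
        age_range="25-34",
        gender="female",
        dialect_tag="wo-dakar",
    ),
)
```

Keeping the metadata in its own frozen type means a recording cannot be built without all four fields, and the values cannot be changed silently after capture.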
Submit
Recordings upload with explicit consent timestamps attached. Speakers retain deletion rights. All data is encrypted in transit and at rest.
Expert Review
At least two native-speaker reviewers check every sample. Inter-annotator agreement is tracked per language batch, and dialect ambiguities are flagged for a specialist panel.
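The source does not say which agreement statistic is used. One common choice for two reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below assumes each reviewer assigns one label per sample; the function name and the `accept`/`reject` labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers who labeled the same samples in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both reviewers must label the same non-empty batch")
    n = len(labels_a)
    # Fraction of samples on which the two reviewers agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each reviewer's own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both reviewers used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


reviewer_1 = ["accept", "accept", "reject", "accept"]
reviewer_2 = ["accept", "reject", "reject", "accept"]
kappa = cohens_kappa(reviewer_1, reviewer_2)  # 0.5 for this batch
```

In this batch the reviewers agree on 3 of 4 samples (0.75), chance agreement is 0.5, and kappa is (0.75 - 0.5) / (1 - 0.5) = 0.5.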
Dataset
Validated samples enter the corpus pipeline, where they are force-aligned, morpheme-tokenized, and tagged for dialect and register. They are then released under CC-BY-SA or used for model training.
Built for communities, not linguists
Offline-capable recording
Works on 2G USSD for feature phones. No smartphone required. Samples queue locally and sync when connectivity is available.
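The queue-then-sync behavior can be sketched as a first-in, first-out buffer that only drains while a connection is available. This is an assumption-laden illustration, not the platform's client code: `OfflineQueue` and the `upload` callback are hypothetical, and a real client would persist the queue to local storage rather than keep it in memory.

```python
from collections import deque


class OfflineQueue:
    """Holds samples locally and uploads them in order once online."""

    def __init__(self, upload):
        # `upload` is a hypothetical callable that returns True on success.
        self._pending = deque()
        self._upload = upload

    def enqueue(self, sample_id):
        self._pending.append(sample_id)

    def sync(self, online):
        """Upload queued samples while online; return how many were sent."""
        sent = 0
        while online and self._pending:
            # Peek first so a failed upload leaves the sample queued.
            if not self._upload(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._pending)
```

A sample is removed only after its upload succeeds, so a dropped connection mid-sync leaves the remaining samples queued for the next attempt.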
Consent-first architecture
Every speaker controls their data. The platform enforces deletion requests at the database layer — not just the UI.
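One way to enforce deletion at the database layer is a foreign-key constraint with `ON DELETE CASCADE`, so that removing a speaker row removes every dependent recording row regardless of which client issued the request. The sketch below uses SQLite from Python's standard library; the table and column names are hypothetical, and the source does not describe the platform's actual schema or database.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a cascading speaker -> recordings link."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this is enabled per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE speakers (
            id INTEGER PRIMARY KEY,
            consent_at TEXT NOT NULL  -- explicit consent timestamp
        );
        CREATE TABLE recordings (
            id INTEGER PRIMARY KEY,
            speaker_id INTEGER NOT NULL
                REFERENCES speakers(id) ON DELETE CASCADE,
            audio_path TEXT NOT NULL
        );
        """
    )
    return conn


def delete_speaker(conn, speaker_id):
    """Honor a deletion request; the cascade removes the speaker's recordings."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM speakers WHERE id = ?", (speaker_id,))
```

Because the cascade lives in the schema, no application code path can delete a speaker and leave that speaker's recordings behind. Audio files stored outside the database would still need their own cleanup step.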
Dialect-aware annotation
120+ dialect tags across 38 countries. Annotators flag code-switching, tone variance, and register shift — all modeled separately.
Open dataset releases
All community-validated corpora are released under CC-BY-SA. Researchers worldwide can reproduce our benchmarks from the same underlying data.
Your language deserves to be in this dataset.
Whether you speak Tigrinya, Wolof, Nuer, or Zarma — Lughatna wants your voice. Every contributed hour advances the model for your entire language community.