[ Lughatna / Community Data Engine ]

The world's largest African language corpus.

Lughatna is a community data engine that turns native speakers into infrastructure contributors. Every recording, annotation, and validation directly trains the next generation of low-resource language models.

Contributor Journey

From your voice to a production dataset in four steps

01

Record

Contributors record voice samples via the Lughatna mobile app or the USSD gateway. Each recording captures speaker metadata: region, age range, gender, and dialect tag.
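As a rough illustration of the metadata attached to each recording, here is a minimal sketch. The class name, field names, age brackets, and the `source` values are all hypothetical; they are assumptions for illustration, not the actual Lughatna schema.

```python
from dataclasses import dataclass, asdict

# Hypothetical age brackets; the real buckets are not specified in this page.
AGE_RANGES = {"18-24", "25-34", "35-49", "50+"}
# Hypothetical capture channels, mirroring the two entry points named above.
SOURCES = {"app", "ussd"}


@dataclass(frozen=True)
class RecordingMetadata:
    """Speaker metadata stored alongside one voice sample (illustrative only)."""
    speaker_id: str
    region: str
    age_range: str
    gender: str
    dialect_tag: str
    source: str

    def __post_init__(self):
        # Reject values outside the known vocabularies before anything is queued.
        if self.age_range not in AGE_RANGES:
            raise ValueError(f"unknown age range: {self.age_range!r}")
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")


meta = RecordingMetadata(
    speaker_id="spk-001",
    region="Dakar",
    age_range="25-34",
    gender="female",
    dialect_tag="wo-dakar",  # made-up tag format
    source="app",
)
record = asdict(meta)  # plain dict, ready to serialize next to the audio file
```

Validating at construction time means a malformed record fails on the contributor's device rather than later in review.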

02

Submit

Recordings upload with explicit consent timestamps attached. Speakers retain deletion rights. All data is encrypted in transit and at rest.

03

Expert Review

At least two native-speaker reviewers check every sample. Inter-annotator agreement is tracked per language batch. Dialect ambiguities are flagged for a specialist panel.
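Inter-annotator agreement can be measured in several ways, and this page does not say which metric Lughatna uses. The sketch below assumes Cohen's kappa for a pair of reviewers who each assign one label per sample; the function name and the `accept`/`reject` labels are illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two reviewers labeling the same samples in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: share of samples where both reviewers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label independently,
    # given each reviewer's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both reviewers used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


reviewer_1 = ["accept", "accept", "reject", "accept"]
reviewer_2 = ["accept", "reject", "reject", "accept"]
kappa = cohens_kappa(reviewer_1, reviewer_2)  # 0.5 for this batch
```

With more than two reviewers per sample, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be the usual substitute.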

04

Dataset

Validated samples enter the corpus pipeline, where they are force-aligned, morpheme-tokenized, and tagged for dialect and register. They are then released under CC-BY-SA or used for model training.

Platform Capabilities

Built for communities, not linguists

Offline-capable recording

Works on 2G USSD for feature phones. No smartphone required. Samples queue locally and sync when connectivity is available.
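The queue-then-sync behavior can be sketched as a small durable queue. This is a minimal illustration under stated assumptions, not the platform's implementation: the `UploadQueue` class, the `pending` table, and the `upload` callback are hypothetical names, and SQLite stands in for whatever local storage the client actually uses.

```python
import sqlite3


class UploadQueue:
    """Durable FIFO queue of sample paths waiting for connectivity (illustrative)."""

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to survive restarts.
        self.db = sqlite3.connect(path)
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS pending ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, "
                "sample_path TEXT NOT NULL)"
            )

    def enqueue(self, sample_path):
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT INTO pending (sample_path) VALUES (?)", (sample_path,)
            )

    def pending(self):
        rows = self.db.execute("SELECT sample_path FROM pending ORDER BY id")
        return [row[0] for row in rows]

    def sync(self, upload):
        """Upload in FIFO order; stop at the first network failure.

        `upload` is a caller-supplied function that raises OSError
        (for example ConnectionError) when the network is unavailable.
        Returns the number of samples sent.
        """
        sent = 0
        rows = self.db.execute(
            "SELECT id, sample_path FROM pending ORDER BY id"
        ).fetchall()
        for row_id, sample_path in rows:
            try:
                upload(sample_path)
            except OSError:
                break  # keep this and later samples queued for the next attempt
            with self.db:
                # Remove a sample only after its upload has succeeded.
                self.db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            sent += 1
        return sent
```

Deleting a row only after its upload succeeds means a dropped connection can cause a retry but never a lost sample; the server side would then need to tolerate duplicate uploads.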

Consent-first architecture

Every speaker controls their data. The platform enforces deletion requests at the database layer — not just the UI.
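One common way to enforce deletion below the UI is a foreign-key cascade, so that removing a speaker row removes every dependent recording row in the same transaction. The sketch below assumes SQLite and a hypothetical two-table schema (`speakers`, `recordings`); the actual Lughatna database and schema are not described on this page.

```python
import sqlite3


def open_db():
    db = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless enabled per connection.
    db.execute("PRAGMA foreign_keys = ON")
    db.executescript(
        """
        CREATE TABLE speakers (
            id TEXT PRIMARY KEY,
            consent_at TEXT NOT NULL          -- consent timestamp, ISO 8601
        );
        CREATE TABLE recordings (
            id INTEGER PRIMARY KEY,
            speaker_id TEXT NOT NULL
                REFERENCES speakers(id) ON DELETE CASCADE,
            path TEXT NOT NULL
        );
        """
    )
    return db


def delete_speaker(db, speaker_id):
    """Honor a deletion request; the cascade removes the speaker's recordings too."""
    with db:  # single transaction: either everything is deleted or nothing is
        db.execute("DELETE FROM speakers WHERE id = ?", (speaker_id,))


db = open_db()
with db:
    db.execute("INSERT INTO speakers VALUES ('s1', '2024-01-01T00:00:00Z')")
    db.execute("INSERT INTO speakers VALUES ('s2', '2024-01-02T00:00:00Z')")
    db.execute("INSERT INTO recordings (speaker_id, path) VALUES ('s1', 'a.wav')")
    db.execute("INSERT INTO recordings (speaker_id, path) VALUES ('s1', 'b.wav')")
    db.execute("INSERT INTO recordings (speaker_id, path) VALUES ('s2', 'c.wav')")

delete_speaker(db, "s1")
remaining = db.execute("SELECT path FROM recordings").fetchall()
```

Because the constraint lives in the schema, no application code path can delete a speaker and leave their recordings behind. Audio files held outside the database would still need their own cleanup step.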

Dialect-aware annotation

120+ dialect tags across 38 countries. Annotators flag code-switching, tone variance, and register shift, and each is modeled separately.

Open dataset releases

All community-validated corpora are released under CC-BY-SA. Researchers worldwide can reproduce our benchmarks from the same underlying data.

Your language deserves to be in this dataset.

Whether you speak Tigrinya, Wolof, Nuer, or Zarma — Lughatna wants your voice. Every contributed hour advances the model for your entire language community.