Improving ASR Accuracy for African Languages
Tone-aware CTC decoding, phone-level cross-lingual transfer learning, and community-panel evaluation methodology for low-resource speech recognition in African language contexts.
This paper presents novel techniques for enhancing Automatic Speech Recognition (ASR) accuracy in low-resource African languages. We address the unique challenges of tonal languages and dialectal variation, proposing a hybrid approach combining cross-lingual transfer learning with specialized acoustic modeling for language-specific phonetic features.
Our work introduces tone-aware CTC decoding — a modification to standard Connectionist Temporal Classification that explicitly models tone sequences as a first-class acoustic feature. Combined with phone-level transfer from language family groups and a community-validated evaluation methodology, the approach achieves significant word error rate reductions across a 12-language evaluation benchmark. All model weights and evaluation corpora are released open access.
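The decoding change can be illustrated with a minimal sketch: in standard CTC, greedy decoding collapses consecutive repeated labels and removes blanks; in a tone-aware variant, each label is a (phone, tone) pair, so a tone change on the same phone starts a new output unit. This sketch is illustrative only, with hypothetical symbols, not the released implementation.

```python
# Minimal sketch: greedy CTC collapse over joint (phone, tone) labels.
# Standard CTC collapse drops consecutive repeats and blanks; here each
# label is a (phone, tone) pair, so tone is decoded jointly with the phone
# and a tone change alone is enough to emit a new unit.

BLANK = ("<blank>", None)  # hypothetical blank symbol

def greedy_ctc_collapse(frame_labels):
    """Collapse a per-frame sequence of (phone, tone) labels."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

frames = [
    ("<blank>", None), ("b", "H"), ("b", "H"), ("<blank>", None),
    ("a", "H"), ("a", "H"), ("a", "L"),  # same phone, tone change -> new unit
    ("<blank>", None), ("ba", "L"),
]
print(greedy_ctc_collapse(frames))
# [('b', 'H'), ('a', 'H'), ('a', 'L'), ('ba', 'L')]
```

Note that a plain phone-level collapse would merge the High and Low realizations of `a` into one unit; treating tone as part of the label keeps them distinct.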
Measured outcomes
Relative WER reduction for languages with fewer than 100 hours of training data using cross-lingual transfer from related language families.
Reduction in word error rate compared to standard CTC decoding when using our tone-aware phone-level acoustic model.
Mean improvement across the full 12-language evaluation benchmark using the full proposed pipeline versus baseline models.
Highest single-language improvement, observed for Swahili, which benefits from the largest native-speaker-validated corpus in the study.
Multi-stage approach
Native-speaker corpus collection
Speech data collected from native speakers across multiple regions using the Lughatna mobile platform. Audio validated by community review panels before inclusion in training sets.
Tone-aware acoustic modeling
Specialized acoustic models augmented with tone sequence labels extracted using a phonological rule-based tagger. Tone labels treated as auxiliary CTC targets during training.
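To make the tagging step concrete, here is a simplified stand-in for a rule-based tone tagger that reads tone from vowel diacritics (acute = High, grave = Low, unmarked = Mid), as in Yoruba orthography. The paper's tagger applies phonological rules; this diacritic lookup is only an illustrative assumption.

```python
import unicodedata

# Hypothetical simplified tone tagger: maps vowel diacritics to tone labels
# (acute -> H, grave -> L, unmarked vowel -> M). The resulting tone sequence
# would serve as the auxiliary CTC target for an utterance.

ACUTE, GRAVE = "\u0301", "\u0300"
VOWELS = set("aeiou")

def tone_sequence(word):
    tones = []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch in VOWELS:
            tones.append("M")              # default Mid until a mark follows
        elif ch == ACUTE and tones:
            tones[-1] = "H"
        elif ch == GRAVE and tones:
            tones[-1] = "L"
    return tones

print(tone_sequence("Yorùbá"))  # ['M', 'L', 'H']
```

NFD normalization separates each accented vowel into a base character plus a combining mark, so the mark can retroactively relabel the preceding vowel's tone.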
Cross-lingual transfer learning
Foundation models are trained on high-resource languages within the same language family (e.g., Bantu, Semitic), then fine-tuned on low-resource target languages using phone-level alignment.
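One way to realize phone-level alignment, sketched under stated assumptions, is to initialize each target-language output unit from the closest phone in the high-resource source inventory. The inventory and substitution table below are hypothetical examples, not the mappings used in the study.

```python
# Sketch of phone-level alignment for cross-lingual transfer (assumed
# mechanism): shared phones reuse the source unit directly; missing phones
# fall back to a hand-written nearest neighbor, or to a freshly initialized
# unit when no reasonable neighbor exists.

SOURCE_INVENTORY = {"p", "b", "t", "d", "k", "g", "m", "n", "s", "a", "i", "u"}

# Hypothetical nearest-phone substitutions for target phones absent from
# the source language (e.g., labial-velar stops mapped to plain labials).
NEAREST = {"gb": "b", "kp": "p", "ɗ": "d", "ɓ": "b"}

def align_phone(target_phone):
    """Map a target-language phone onto the source inventory."""
    if target_phone in SOURCE_INVENTORY:
        return target_phone                     # shared phone: reuse directly
    return NEAREST.get(target_phone, "<new>")   # nearest phone, or new unit

print([align_phone(p) for p in ["a", "gb", "ɗ", "ʘ"]])
# ['a', 'b', 'd', '<new>']
```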
Data augmentation pipeline
Speed perturbation, SpecAugment, and dialect simulation augmentation applied during training. Dialect simulation uses phoneme substitution rules derived from field recordings.
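The dialect-simulation step can be sketched as probabilistic phoneme substitution over a phone sequence. The rules below are hypothetical placeholders, not the field-derived rules from the paper.

```python
import random

# Sketch of dialect-simulation augmentation: each rule rewrites (or deletes)
# a phone with some probability, producing plausible dialectal variants of a
# training utterance. Rules here are illustrative, e.g. l/r alternation and
# h-dropping.

RULES = {          # phone -> (dialect variant, substitution probability)
    "l": ("r", 0.5),
    "h": ("",  0.3),   # empty variant means deletion (h-dropping)
}

def simulate_dialect(phones, rng):
    out = []
    for p in phones:
        variant, prob = RULES.get(p, (p, 0.0))
        if rng.random() < prob:
            if variant:            # skip appending when the rule deletes
                out.append(variant)
        else:
            out.append(p)
    return out

print(simulate_dialect(list("halo"), random.Random(0)))
# ['h', 'a', 'r', 'o']
```

Passing an explicit seeded generator keeps augmentation reproducible across training runs while still varying per utterance when different seeds are used.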
Community-panel evaluation
Final evaluation uses a native-speaker panel protocol rather than automated metrics alone: three native speakers per language rate each transcription on a 5-point naturalness scale.
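Aggregating the panel scores might look like the sketch below: report the mean naturalness per language alongside a simple agreement signal (mean pairwise absolute difference between raters). The aggregation choice is an assumption for illustration; the paper's protocol may differ.

```python
from itertools import combinations

# Sketch of panel-score aggregation: each utterance gets three 1-5 ratings.
# We summarize with the overall mean rating and the average pairwise
# disagreement between raters (0 = perfect agreement).

def panel_summary(ratings):
    """ratings: list of per-utterance [r1, r2, r3] panel scores (1-5)."""
    scores = [r for utt in ratings for r in utt]
    diffs = [abs(a - b) for utt in ratings for a, b in combinations(utt, 2)]
    mean_score = sum(scores) / len(scores)
    disagreement = sum(diffs) / len(diffs)
    return round(mean_score, 2), round(disagreement, 2)

ratings = [[5, 4, 5], [3, 3, 4], [4, 4, 4]]
print(panel_summary(ratings))  # (4.0, 0.44)
```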
WER reduction by language
Research team
Interested in this research area?
We partner with academic institutions and research organizations on ASR and related topics. Data, compute, and co-authorship available.