Speech Recognition · Peer-reviewed · Open access · 2023 · 14 pp.

Improving ASR Accuracy for African Languages

Tone-aware CTC decoding, phone-level cross-lingual transfer learning, and community-panel evaluation methodology for low-resource speech recognition in African language contexts.

Abstract

This paper presents novel techniques for enhancing Automatic Speech Recognition (ASR) accuracy in low-resource African languages. We address the unique challenges of tonal languages and dialectal variation, proposing a hybrid approach combining cross-lingual transfer learning with specialized acoustic modeling for language-specific phonetic features.

Our work introduces tone-aware CTC decoding, a modification to standard Connectionist Temporal Classification that explicitly models tone sequences as a first-class acoustic feature. Combined with phone-level transfer from language family groups and a community-validated evaluation methodology, the approach achieves a 22% mean relative word error rate reduction across a 12-language evaluation benchmark. All model weights and evaluation corpora are released open access.
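To make the decoding idea concrete, here is a minimal sketch of CTC greedy collapse over tone-bearing units. It assumes, purely for illustration, that the label vocabulary pairs each phone with a tone (e.g. "ba_H"); the paper's actual label design and decoder may differ.

```python
# Illustrative sketch: when labels are phone+tone pairs, the standard
# CTC collapse rule (merge repeated labels, drop blanks) recovers a
# tone-annotated transcription directly. The "ba_H"/"ba_L" label
# scheme below is an assumption, not the paper's implementation.
BLANK = "<b>"

def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame best-path label sequence CTC-style."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

frames = ["<b>", "ba_L", "ba_L", "<b>", "ba_H", "ba_H", "<b>"]
print(ctc_greedy_collapse(frames))  # ['ba_L', 'ba_H']
```

Because tone is part of the label, two frames that share a phone but differ in tone are never merged, which is what lets tone act as a first-class acoustic feature during decoding.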

Key Findings

Measured outcomes

- 25% · ASR improvement via transfer learning: relative WER reduction for languages with fewer than 100 hours of training data, using cross-lingual transfer from related language families.

- 18% · Tone-aware CTC decoding gain: reduction in word error rate compared to standard CTC decoding when using the tone-aware phone-level acoustic model.

- 22% · Average WER reduction across the evaluation set: mean improvement across the full 12-language benchmark using the full proposed pipeline versus baseline models.

- 35% · Best-case WER reduction (Swahili): the highest single-language improvement, benefiting from the largest native-speaker-validated corpus in the study.

Methodology

Multi-stage approach

1. Native-speaker corpus collection. Speech data collected from native speakers across multiple regions using the Lughatna mobile platform; audio validated by community review panels before inclusion in training sets.

2. Tone-aware acoustic modeling. Specialized acoustic models augmented with tone sequence labels extracted by a phonological rule-based tagger; tone labels are treated as auxiliary CTC targets during training.
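A rule-based tone tagger of the kind described above can be sketched as follows. The diacritic conventions used here (acute accent marks high tone, grave marks low, unmarked vowels are mid) are the standard Yoruba orthographic convention, but the tagger itself is an illustrative assumption, not the paper's phonological rule set.

```python
import unicodedata

# Illustrative rule-based tone tagger for accent-marked orthographies
# (Yoruba convention: acute = high, grave = low, unmarked = mid).
VOWELS = set("aeiou") | {"\u0254", "\u025b"}  # a e i o u ɔ ɛ

def tag_tones(word: str) -> list:
    """Return one tone label (H/M/L) per vowel in `word`."""
    tones = []
    # NFD decomposition separates base letters from combining accents.
    decomposed = unicodedata.normalize("NFD", word)
    i = 0
    while i < len(decomposed):
        ch = decomposed[i]
        if ch.lower() in VOWELS:
            tone = "M"  # default: unmarked vowel carries mid tone
            # scan the combining marks attached to this vowel
            j = i + 1
            while j < len(decomposed) and unicodedata.combining(decomposed[j]):
                if decomposed[j] == "\u0301":    # combining acute -> high
                    tone = "H"
                elif decomposed[j] == "\u0300":  # combining grave -> low
                    tone = "L"
                j += 1
            tones.append(tone)
            i = j
        else:
            i += 1
    return tones

print(tag_tones("bàbá"))  # ['L', 'H']
```

The resulting H/M/L sequences would then serve as the auxiliary CTC targets mentioned in the step above.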

3. Cross-lingual transfer learning. Foundation models trained on high-resource languages within the same language family (Bantu, Semitic) are then fine-tuned on low-resource target languages using phone-level alignment.
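One common way to realize phone-level transfer is to initialize the target-language output layer from the donor model's rows for corresponding phones. The phone inventories, mapping table, and helper below are invented for illustration; the paper does not specify this exact mechanism.

```python
import numpy as np

# Hypothetical sketch of phone-level transfer initialisation: each
# target-language phone borrows the output-layer row of its closest
# donor-language phone; unmapped phones start near zero.
donor_phones = ["a", "e", "i", "o", "u", "b", "d", "g", "k", "s"]
rng = np.random.default_rng(0)
# Stand-in for a trained donor model's softmax weight matrix.
donor_weights = rng.normal(size=(len(donor_phones), 128))

# Target phone -> donor phone correspondence (e.g. derived from a
# shared articulatory feature table); entries here are illustrative.
phone_map = {"a": "a", "ɛ": "e", "ɔ": "o", "ɓ": "b", "kp": "k"}

def init_target_weights(target_phones, donor_phones, donor_weights, phone_map):
    dim = donor_weights.shape[1]
    out = np.zeros((len(target_phones), dim))
    donor_index = {p: i for i, p in enumerate(donor_phones)}
    for row, phone in enumerate(target_phones):
        if phone in phone_map:
            # copy the mapped donor phone's weights
            out[row] = donor_weights[donor_index[phone_map[phone]]]
        else:
            # no correspondence: small random init, trained from scratch
            out[row] = np.random.default_rng(row).normal(scale=0.01, size=dim)
    return out

target_phones = ["a", "ɛ", "ɔ", "ɓ", "kp", "ŋm"]
W = init_target_weights(target_phones, donor_phones, donor_weights, phone_map)
```

After this initialization, fine-tuning on the low-resource language starts from acoustically sensible output weights rather than random ones.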

4. Data augmentation pipeline. Speed perturbation, SpecAugment, and dialect simulation applied during training; dialect simulation uses phoneme substitution rules derived from field recordings.
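Of the augmentations named above, speed perturbation is the simplest to sketch: resample the waveform so it plays a given factor faster or slower. Production pipelines typically use sox or torchaudio; plain linear interpolation is enough to show the idea and is an assumption here, not the paper's implementation.

```python
import numpy as np

def speed_perturb(signal: np.ndarray, factor: float) -> np.ndarray:
    """Resample `signal` so it plays `factor` times faster (minimal sketch)."""
    n_out = int(round(len(signal) / factor))
    old_t = np.arange(len(signal))
    new_t = np.linspace(0, len(signal) - 1, n_out)
    # linear interpolation at the stretched time points
    return np.interp(new_t, old_t, signal)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of A440 at 16 kHz
fast = speed_perturb(x, 1.1)  # ~0.91 s, pitch shifted up
slow = speed_perturb(x, 0.9)  # ~1.11 s, pitch shifted down
```

Typical recipes train on the original audio plus copies perturbed by factors such as 0.9 and 1.1, which both multiplies the effective data and varies speaker rate.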

5. Community-panel evaluation. Final evaluation uses a native-speaker panel protocol rather than automated metrics alone: three native speakers per language rate transcription quality on a 5-point naturalness scale.
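Aggregating the 3-rater, 5-point scores might look like the sketch below. The field names and the disagreement threshold are illustrative assumptions; the paper does not describe its aggregation in this detail.

```python
from statistics import mean

def aggregate_panel(scores_by_utterance):
    """Summarise 3-rater naturalness scores.

    scores_by_utterance: {utt_id: [r1, r2, r3]} with each rating in 1..5.
    Returns per-utterance mean plus a flag for large rater disagreement
    (threshold of 2 points is an illustrative choice).
    """
    report = {}
    for utt, ratings in scores_by_utterance.items():
        assert len(ratings) == 3 and all(1 <= r <= 5 for r in ratings)
        report[utt] = {
            "mean": round(mean(ratings), 2),
            "disputed": max(ratings) - min(ratings) > 2,
        }
    return report

panel = {"utt_001": [4, 5, 4], "utt_002": [2, 5, 3]}
print(aggregate_panel(panel))
```

Flagging disputed utterances lets a panel protocol send low-agreement items back for re-rating instead of silently averaging over disagreement.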

Benchmark Results

WER reduction by language

~/benchmark/wer-eval-12-languages.json
Language   Baseline WER   Proposed WER   Δ WER (relative)
Swahili    41.2%          26.7%          −35.1%
Yoruba     58.4%          43.1%          −26.2%
Amharic    52.1%          40.5%          −22.3%
Hausa      47.8%          37.6%          −21.3%
Igbo       61.3%          49.2%          −19.7%
Zulu       55.0%          44.9%          −18.4%
Somali     63.7%          52.8%          −17.1%
Wolof      71.2%          60.4%          −15.2%
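The Δ WER column reports relative, not absolute, reduction, which is easy to misread. A one-line helper makes the definition explicit (the function name is mine, not from the paper):

```python
def relative_wer_reduction(baseline: float, proposed: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - proposed) / baseline."""
    return 100.0 * (baseline - proposed) / baseline

# Yoruba row from the table above: 58.4% -> 43.1%
print(f"{relative_wer_reduction(58.4, 43.1):.1f}%")  # 26.2%
```

So a 15.3-point absolute drop from a 58.4% baseline reads as a 26.2% relative reduction, matching the table's Δ WER entry.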
Authors

Research team

Kimathi, A. (Lead Researcher, LocaleNLP)
Osei, K. (Chief Research Officer, LocaleNLP)
Wanjiku, N. (Acoustic Modeling, LocaleNLP)
Mensah, K. (External Collaborator, Univ. of Ghana)
Collaboration

Interested in this research area?

We partner with academic institutions and research organizations on ASR and related topics; data, compute, and co-authorship arrangements are available.