Improving ASR Accuracy for African Languages
Tone-aware CTC decoding, phone-level cross-lingual transfer learning, and community-panel evaluation methodology for low-resource speech recognition in African language contexts.
This paper presents novel techniques for enhancing Automatic Speech Recognition (ASR) accuracy in low-resource African languages. We address the unique challenges of tonal languages and dialectal variation, proposing a hybrid approach combining cross-lingual transfer learning with specialized acoustic modeling for language-specific phonetic features.
Our work introduces tone-aware CTC decoding — a modification to standard Connectionist Temporal Classification that explicitly models tone sequences as a first-class acoustic feature. Combined with phone-level transfer from language family groups and a community-validated evaluation methodology, the approach achieves significant word error rate reductions across a 12-language evaluation benchmark. All model weights and evaluation corpora are released open access.
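The decoding change can be illustrated with a minimal sketch: in standard CTC, greedy decoding collapses consecutive repeated labels and removes blanks; in a tone-aware variant, each label is a (phone, tone) pair, so a tone change on the same phone starts a new output unit. This sketch is illustrative only, with hypothetical symbols, not the released implementation.

```python
# Minimal sketch: greedy CTC collapse over joint (phone, tone) labels.
# Standard CTC collapse drops consecutive repeats and blanks; here each
# label is a (phone, tone) pair, so tone is decoded jointly with the phone
# and a tone change alone is enough to emit a new unit.

BLANK = ("<blank>", None)  # hypothetical blank symbol

def greedy_ctc_collapse(frame_labels):
    """Collapse a per-frame sequence of (phone, tone) labels."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

frames = [
    ("<blank>", None), ("b", "H"), ("b", "H"), ("<blank>", None),
    ("a", "H"), ("a", "H"), ("a", "L"),  # same phone, tone change -> new unit
    ("<blank>", None), ("ba", "L"),
]
print(greedy_ctc_collapse(frames))
# [('b', 'H'), ('a', 'H'), ('a', 'L'), ('ba', 'L')]
```

Note that a plain phone-level collapse would merge the High and Low realizations of `a` into one unit; treating tone as part of the label keeps them distinct.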
Measured outcomes
Relative WER reduction for languages with fewer than 100 hours of training data using cross-lingual transfer from related language families.
Reduction in word error rate compared to standard CTC decoding when using our tone-aware phone-level acoustic model.
Mean improvement across the full 12-language evaluation benchmark using the full proposed pipeline versus baseline models.
Highest single-language improvement, observed for Swahili, which benefits from the largest native-speaker-validated corpus in the study.
Multi-stage approach
Native-speaker corpus collection
Speech data collected from native speakers across multiple regions using the Lughatna mobile platform. Audio validated by community review panels before inclusion in training sets.
Tone-aware acoustic modeling
Specialized acoustic models augmented with tone sequence labels extracted using a phonological rule-based tagger. Tone labels treated as auxiliary CTC targets during training.
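To make the tagging step concrete, here is a simplified stand-in for a rule-based tone tagger that reads tone from vowel diacritics (acute = High, grave = Low, unmarked = Mid), as in Yoruba orthography. The paper's tagger applies phonological rules; this diacritic lookup is only an illustrative assumption.

```python
import unicodedata

# Hypothetical simplified tone tagger: maps vowel diacritics to tone labels
# (acute -> H, grave -> L, unmarked vowel -> M). The resulting tone sequence
# would serve as the auxiliary CTC target for an utterance.

ACUTE, GRAVE = "\u0301", "\u0300"
VOWELS = set("aeiou")

def tone_sequence(word):
    tones = []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch in VOWELS:
            tones.append("M")              # default Mid until a mark follows
        elif ch == ACUTE and tones:
            tones[-1] = "H"
        elif ch == GRAVE and tones:
            tones[-1] = "L"
    return tones

print(tone_sequence("Yorùbá"))  # ['M', 'L', 'H']
```

NFD normalization separates each accented vowel into a base character plus a combining mark, so the mark can retroactively relabel the preceding vowel's tone.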
Cross-lingual transfer learning
Foundation models are trained on high-resource languages within the same language family (e.g., Bantu, Semitic), then fine-tuned on low-resource target languages using phone-level alignment.
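One way to realize phone-level alignment, sketched under stated assumptions, is to initialize each target-language output unit from the closest phone in the high-resource source inventory. The inventory and substitution table below are hypothetical examples, not the mappings used in the study.

```python
# Sketch of phone-level alignment for cross-lingual transfer (assumed
# mechanism): shared phones reuse the source unit directly; missing phones
# fall back to a hand-written nearest neighbor, or to a freshly initialized
# unit when no reasonable neighbor exists.

SOURCE_INVENTORY = {"p", "b", "t", "d", "k", "g", "m", "n", "s", "a", "i", "u"}

# Hypothetical nearest-phone substitutions for target phones absent from
# the source language (e.g., labial-velar stops mapped to plain labials).
NEAREST = {"gb": "b", "kp": "p", "ɗ": "d", "ɓ": "b"}

def align_phone(target_phone):
    """Map a target-language phone onto the source inventory."""
    if target_phone in SOURCE_INVENTORY:
        return target_phone                     # shared phone: reuse directly
    return NEAREST.get(target_phone, "<new>")   # nearest phone, or new unit

print([align_phone(p) for p in ["a", "gb", "ɗ", "ʘ"]])
# ['a', 'b', 'd', '<new>']
```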
Data augmentation pipeline
Speed perturbation, SpecAugment, and dialect simulation augmentation applied during training. Dialect simulation uses phoneme substitution rules derived from field recordings.
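The dialect-simulation step can be sketched as probabilistic phoneme substitution over a phone sequence. The rules below are hypothetical placeholders, not the field-derived rules from the paper.

```python
import random

# Sketch of dialect-simulation augmentation: each rule rewrites (or deletes)
# a phone with some probability, producing plausible dialectal variants of a
# training utterance. Rules here are illustrative, e.g. l/r alternation and
# h-dropping.

RULES = {          # phone -> (dialect variant, substitution probability)
    "l": ("r", 0.5),
    "h": ("",  0.3),   # empty variant means deletion (h-dropping)
}

def simulate_dialect(phones, rng):
    out = []
    for p in phones:
        variant, prob = RULES.get(p, (p, 0.0))
        if rng.random() < prob:
            if variant:            # skip appending when the rule deletes
                out.append(variant)
        else:
            out.append(p)
    return out

print(simulate_dialect(list("halo"), random.Random(0)))
# ['h', 'a', 'r', 'o']
```

Passing an explicit seeded generator keeps augmentation reproducible across training runs while still varying per utterance when different seeds are used.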
Community-panel evaluation
Final evaluation uses a native-speaker panel protocol rather than automated metrics alone: three native speakers per language rate each transcription on a 5-point naturalness scale.
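Aggregating the panel scores might look like the sketch below: report the mean naturalness per language alongside a simple agreement signal (mean pairwise absolute difference between raters). The aggregation choice is an assumption for illustration; the paper's protocol may differ.

```python
from itertools import combinations

# Sketch of panel-score aggregation: each utterance gets three 1-5 ratings.
# We summarize with the overall mean rating and the average pairwise
# disagreement between raters (0 = perfect agreement).

def panel_summary(ratings):
    """ratings: list of per-utterance [r1, r2, r3] panel scores (1-5)."""
    scores = [r for utt in ratings for r in utt]
    diffs = [abs(a - b) for utt in ratings for a, b in combinations(utt, 2)]
    mean_score = sum(scores) / len(scores)
    disagreement = sum(diffs) / len(diffs)
    return round(mean_score, 2), round(disagreement, 2)

ratings = [[5, 4, 5], [3, 3, 4], [4, 4, 4]]
print(panel_summary(ratings))  # (4.0, 0.44)
```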
WER reduction by language
Research team
Interested in this research area?
We partner with academic institutions and research organizations on ASR and related topics. Data, compute, and co-authorship available.