Creating a corpus and building models of African languages applied to sustainable development goals

May 27, 2024

Coordinators: Joyce Nakatumba-Nabende, Ekta Vats

Researchers involved: Andrew Katumba (Makerere University, Uganda), Chika Yinko-Banjo (University of Lagos, Nigeria), Solomon Gizaw Tulu (Addis Ababa University, Ethiopia)

Text and voice data are crucial for developing speech recognition and natural language processing technologies. However, voice data is scarce for many African languages, which hinders the development of these technologies for them. Collecting text and voice data is therefore an important step towards improving speech recognition and natural language processing for African languages. We aim to build and expand text and speech corpora for African languages. The target is 9000 hours of voice data across 12 African languages (1000 hours per language). We will also build speech datasets for four domain-specific applications (health, finance, agriculture and education), and we plan to extend the text datasets for Named Entity Recognition in the African context.

Key research areas include:

Provide an evidence base for the amount of speech data required to build domain-specific automatic speech recognition systems for low-resourced African languages.
Build domain-specific applications from these datasets to support the sustainable development goals (e.g., voice-based extension services for smallholder farmers, applications for financial inclusion, applications for online education).
Build models for bias mitigation in machine translation and automatic speech recognition for African languages, for example handling multiple pronunciations and recognizing accents and dialects.
Build Named Entity Recognition models for the African context across several domains.
Improve machine translation models to cater for informal speech in African languages.
Build a gender-unbiased non-functional requirements framework for machine learning (ML) based software development.
Develop sentiment and hate-speech corpora and models for African languages.
