Open Speech and Language Resources


Identifier: SLR129

Summary: A large, high-fidelity, multilingual, and uniquely African speech corpus

Category: Speech

License: Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)

Downloads (use a mirror closer to you):
akuapem-twi.tgz [16G]   ( Akuapem Twi speech and text )   Mirrors: [US]   [EU]   [CN]  
asante-twi.tgz [15G]   ( Asante Twi speech and text )   Mirrors: [US]   [EU]   [CN]  
ewe.tgz [19G]   ( Ewe speech and text )   Mirrors: [US]   [EU]   [CN]  
hausa.tgz [21G]   ( Hausa speech and text )   Mirrors: [US]   [EU]   [CN]  
lingala.tgz [13G]   ( Lingala speech and text )   Mirrors: [US]   [EU]   [CN]  
yoruba.tgz [6.3G]   ( Yoruba speech and text )   Mirrors: [US]   [EU]   [CN]  

About this resource:

BibleTTS is a large, high-quality, open text-to-speech (TTS) dataset with up to 80 hours of single-speaker, studio-quality 48 kHz recordings per language. We release aligned speech and text for six languages spoken in Sub-Saharan Africa, along with unaligned data for four additional languages, all derived from the Biblica project. The data is released under the commercial-friendly CC BY-SA 4.0 license.

This repository contains the data for the six aligned languages of the BibleTTS corpus (Asante Twi, Akuapem Twi, Ewe, Hausa, Lingala, Yoruba).
This data has been automatically verse-aligned and filtered for TTS training.
Each .tgz file contains speech files for individual verses and the corresponding transcripts, organized into the standardized splits for that language (train, dev, test). Files in each split are grouped into subdirectories by book.
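As an illustration of the layout described above, the following minimal sketch indexes an extracted archive by split and book. It assumes the directory passed in directly contains the split folders, so that audio sits at <split>/<book>/<file>; the exact nesting inside each archive, the function name index_corpus, and the default extension are assumptions made for this example and should be checked against the extracted data.

```python
from collections import defaultdict
from pathlib import Path


def index_corpus(root, audio_ext=".flac"):
    """Group audio files by (split, book).

    Assumes a hypothetical layout of root/<split>/<book>/<file>; adjust if
    the extracted archive nests its directories differently.
    """
    root = Path(root)
    index = defaultdict(list)
    for path in sorted(root.rglob(f"*{audio_ext}")):
        parts = path.relative_to(root).parts
        if len(parts) < 3:
            # File is not inside a <split>/<book>/ directory; skip it.
            continue
        split, book = parts[0], parts[1]
        index[(split, book)].append(path)
    return dict(index)
```

Calling index_corpus on the directory that holds train, dev, and test returns a dictionary keyed by (split, book), which makes it easy to count verses per book or to build per-split file lists.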
The speech data is distributed as FLAC files in the original 48 kHz mono format; you may want to resample it for TTS training.
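If you do resample, a rational (polyphase) resampler needs integer up and down factors relative to the 48 kHz source. The sketch below only computes those factors; the target rates shown in the usage note are common choices assumed for illustration, not rates prescribed by the corpus, and the function name resample_factors is hypothetical.

```python
from math import gcd

SOURCE_RATE_HZ = 48_000  # sample rate stated in the corpus description


def resample_factors(target_rate_hz, source_rate_hz=SOURCE_RATE_HZ):
    """Return (up, down) such that target = source * up / down, in lowest terms."""
    if target_rate_hz <= 0 or source_rate_hz <= 0:
        raise ValueError("sample rates must be positive integers")
    divisor = gcd(target_rate_hz, source_rate_hz)
    return target_rate_hz // divisor, source_rate_hz // divisor
```

For example, resample_factors(22050) gives (147, 320) and resample_factors(16000) gives (1, 3). The pair can then be handed to a polyphase resampler such as scipy.signal.resample_poly(x, up, down), after decoding the FLAC audio with a library of your choice.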

For more information, see:

  • [dataset paper] Interspeech 2022 paper describing the corpus and its creation
  • [project page] Masakhane-io project website with TTS models and samples


Citation: If you use the BibleTTS corpus in your work, please cite the dataset paper:

    title={BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus},
    author={Josh Meyer and David Adelani and Edresson Casanova and Alp {\"O}ktem and Daniel Whitenack and Julian Weber and Salomon Kabongo Kabenamualu and Elizabeth Salesky and Iroro Orife and Colin Leong and Perez Ogayo and Chris Chinenye Emezue and Jonathan Mukiibi and Salomey Osei and Apelete Agbolo and Victor Akinode and Bernard Opoku and Olanrewaju Samuel and Jesujoba Alabi and Shamsuddeen Hassan Muhammad},
    publisher = {{ISCA}},