Open Speech and Language Resources



CML-TTS Dataset

Identifier: SLR146

Summary: CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages

Category: Speech

License: CC BY 4.0

Downloads (use a mirror closer to you):
cml_tts_dataset_dutch_v0.1.tar.bz [86G]   ( Dutch speech and transcripts )   Mirrors: [US]   [EU]   [CN]  
cml_tts_dataset_french_v0.1.tar.bz [31G]   ( French speech and transcripts )   Mirrors: [US]   [EU]   [CN]  
cml_tts_dataset_german_v0.1.tar.bz [190G]   ( German speech and transcripts )   Mirrors: [US]   [EU]   [CN]  
cml_tts_dataset_italian_v0.1.tar.bz [14G]   ( Italian speech and transcripts )   Mirrors: [US]   [EU]   [CN]  
cml_tts_dataset_polish_v0.1.tar.bz [5.5G]   ( Polish speech and transcripts )   Mirrors: [US]   [EU]   [CN]  
cml_tts_dataset_portuguese_v0.1.tar.bz [9.7G]   ( Portuguese speech and transcripts )   Mirrors: [US]   [EU]   [CN]  
cml_tts_dataset_spanish_v0.1.tar.bz [48G]   ( Spanish speech and transcripts )   Mirrors: [US]   [EU]   [CN]  
cml_tts_dataset_segments_v0.1.tar.bz [16M]   ( Segment information )   Mirrors: [US]   [EU]   [CN]  
cml_tts_dataset.md5 [560 bytes]   ( Checksums of the files above )   Mirrors: [US]   [EU]   [CN]  

About this resource:

CML-TTS is a recursive acronym for CML-Multi-Lingual-TTS, a Text-to-Speech (TTS) dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias (UFG).

CML-TTS is composed of audiobook readings from the LibriVox project, which uses public-domain books from Project Gutenberg. It consists of recordings in Dutch, German, French, Italian, Polish, Portuguese, and Spanish, with a sampling rate of 24 kHz.

After downloading, verify the MD5 checksum of each file against the values below:

a167148101dee6b6c0089e7bf9084f31  cml_tts_dataset_dutch_v0.1.tar.bz
0f2212fe03e0cc444225a6eb79fa099c  cml_tts_dataset_french_v0.1.tar.bz
332cae87fe03fd43d17d50b2c05bd872  cml_tts_dataset_german_v0.1.tar.bz
cccbc1f885a92594c028ee5ddf622acb  cml_tts_dataset_italian_v0.1.tar.bz
ab6385ed4acc613ee96ba7b75dfd2ba7  cml_tts_dataset_polish_v0.1.tar.bz
743bad054ca861688aa026e505b26aff  cml_tts_dataset_portuguese_v0.1.tar.bz
bb7128ec9f804b60485492a2433e18c7  cml_tts_dataset_spanish_v0.1.tar.bz
f529a908aba26a6d891b4fb17ab3125b  cml_tts_dataset_segments_v0.1.tar.bz
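
The checksum comparison can be scripted. The following is a minimal sketch in Python, assuming the archives were saved under their original names in one directory; the helper names md5_of_file and verify_downloads are hypothetical and not part of the dataset or its tooling. The expected digests are copied from the list above.

```python
import hashlib
from pathlib import Path

# Expected MD5 digests, copied from the list published on this page.
EXPECTED_MD5 = {
    "cml_tts_dataset_dutch_v0.1.tar.bz": "a167148101dee6b6c0089e7bf9084f31",
    "cml_tts_dataset_french_v0.1.tar.bz": "0f2212fe03e0cc444225a6eb79fa099c",
    "cml_tts_dataset_german_v0.1.tar.bz": "332cae87fe03fd43d17d50b2c05bd872",
    "cml_tts_dataset_italian_v0.1.tar.bz": "cccbc1f885a92594c028ee5ddf622acb",
    "cml_tts_dataset_polish_v0.1.tar.bz": "ab6385ed4acc613ee96ba7b75dfd2ba7",
    "cml_tts_dataset_portuguese_v0.1.tar.bz": "743bad054ca861688aa026e505b26aff",
    "cml_tts_dataset_spanish_v0.1.tar.bz": "bb7128ec9f804b60485492a2433e18c7",
    "cml_tts_dataset_segments_v0.1.tar.bz": "f529a908aba26a6d891b4fb17ab3125b",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory, expected=EXPECTED_MD5):
    """Map each expected file name to True (digest matches),
    False (digest differs), or None (file not present)."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        results[name] = (md5_of_file(path) == want) if path.is_file() else None
    return results


if __name__ == "__main__":
    # Assumption: the archives sit in the current working directory.
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, ok in verify_downloads(".").items():
        print(f"{name}: {labels[ok]}")
```

Files reported as "not found" are simply skipped, so the script can be run after downloading only some of the languages. If cml_tts_dataset.md5 follows the usual md5sum layout (an assumption, since its contents are not shown here), running md5sum -c cml_tts_dataset.md5 in the download directory should give an equivalent check.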

You can cite the data using the following BibTeX entry:

@InProceedings{Cmltts2023,
    title="CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages",
    author="Oliveira, Frederico S. and Casanova, Edresson and Junior, Arnaldo Candido and Soares, Anderson S. and Galv{\~a}o Filho, Arlindo R.", 
    editor="Ek{\v{s}}tein, Kamil and P{\'a}rtl, Franti{\v{s}}ek and Konop{\'i}k, Miloslav",
    booktitle="Text, Speech, and Dialogue",
    year="2023",
    publisher="Springer Nature Switzerland",
    address="Cham",
    pages="188--199",
    isbn="978-3-031-40498-6"
}