Open Speech and Language Resources



Contact
jtrmal@gmail.com
(Jan "Yenda" Trmal)

Resources

Resource Name Category Summary
SLR1 Yesno Speech Sixty recordings of one individual saying yes or no in Hebrew; each recording is eight words long.
SLR2 OpenFST Software A mirror of the OpenFst toolkit
SLR3 sph2pipe Software A mirror of the sph2pipe software
SLR4 sctk Software A mirror of the sctk scoring software
SLR5 MSU Switchboard transcripts Text A mirror of the Mississippi State transcripts and lexicon for Switchboard.
SLR6 Vystadial Speech English and Czech data, mirrored from the Vystadial project
SLR7 TED-LIUM Speech English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)
SLR8 Sprakbanken Text Danish pronunciation dictionary generated using eSpeak
SLR9 The AMI pack Text Some auxiliary non-speech data used to build AMI systems with Kaldi
SLR10 SRE Data Misc Various files from SRE data that NIST used to host online
SLR11 LibriSpeech language models, vocabulary and G2P models Text Language modelling resources, for use with the LibriSpeech ASR corpus
SLR12 LibriSpeech ASR corpus Speech Large-scale (1000 hours) corpus of read English speech
SLR13 RWCP Sound Scene Database Speech + Software A database of recordings of real-world sounds and measured room impulse responses
SLR14 BEEP Dictionary Text Phonemic transcriptions of over 250,000 English words. (British English pronunciations)
SLR15 SRE Speaker List Misc A list linking speakers across NIST SRE corpora
SLR16 The AMI Corpus Speech Acoustic speech data and meta-data from The AMI corpus.
SLR17 MUSAN Audio A corpus of music, speech, and noise
SLR18 THCHS-30 Speech A Free Chinese Speech Corpus Released by CSLT@Tsinghua University
SLR19 TED-LIUMv2 Audio TED-LIUM corpus release 2, English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)
SLR20 Aachen Impulse Response Database Audio Aachen Impulse Response database (AIR): a database of room impulse responses (mirrored here)
SLR21 Spanish Word list Text A list of words in Spanish with frequency derived from a large corpus (Spanish Gigaword).
SLR22 THUYG-20 Speech A free Uyghur speech database Released by CSLT@Tsinghua University & Xinjiang University
SLR23 NIST LRE 2007 Key Misc A file containing metadata for the utterances in the LRE 2007 evaluation
SLR24 Iban Speech Iban language text and speech corpora for ASR
SLR25 ALFFA (African Languages in the Field: speech Fundamentals and Automation) Speech Amharic, Swahili and Wolof data, mirrored from the ALFFA git repository
SLR26 Simulated Room Impulse Response Database Audio A database of simulated room impulse responses
SLR27 Cantab-TEDLIUM Release 1.1 (February 2015) Text Cantab Research Language models for the TEDLIUM database
SLR28 Room Impulse Response and Noise Database Audio A database of simulated and real room impulse responses, isotropic and point-source noises. The audio files in this data are all in 16k sampling rate and 16-bit precision.
SLR29 Sprakbanken_Swe Text Swedish pronunciation dictionary
SLR30 Sinhala TTS Speech Sinhalese multi-speaker TTS corpora
SLR31 Mini LibriSpeech ASR corpus Speech Subset of LibriSpeech corpus for purpose of regression testing
SLR32 High quality TTS data for four South African languages (af, st, tn, xh) Speech Multi-speaker TTS data for four South African languages, Afrikaans, Sesotho, Setswana and isiXhosa.
SLR33 Aishell Speech Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.
SLR34 Santiago Spanish Lexicon Text A pronouncing dictionary for the Spanish language.
SLR35 Large Javanese ASR training data set Speech Javanese ASR training data set containing ~185K utterances.
SLR36 Large Sundanese ASR training data set Speech Sundanese ASR training data set containing ~220K utterances.
SLR37 High quality TTS data for Bengali languages Speech Multi-speaker TTS data for Bangladesh Bengali (bn-BD) and Indian Bengali (bn-IN).
SLR38 Free ST Chinese Mandarin Corpus Speech A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102600 utterances from 855 speakers.
SLR39 Heroico Speech Spanish data, mirrored from the LDC
SLR40 Zeroth-Korean Speech Corpus for Automatic Speech Recognition Korean Open-source Speech Corpus for Speech Recognition by Zeroth Project (https://github.com/goodatlas/zeroth)
SLR41 High quality TTS data for Javanese. Speech Multi-speaker TTS data for Javanese (jv-ID)
SLR42 High quality TTS data for Khmer. Speech Multi-speaker TTS data for Khmer (km-KH)
SLR43 High quality TTS data for Nepali. Speech Multi-speaker TTS data for Nepali (ne-NP)
SLR44 High quality TTS data for Sundanese. Speech Multi-speaker TTS data for Sundanese (su-ID)
SLR45 Free ST American English Corpus Speech A free American English corpus by Surfingtech (www.surfing.ai), containing utterances from 10 speakers, each with about 350 utterances.
SLR46 Tunisian_MSA Speech Tunisian Modern Standard Arabic
SLR47 Primewords Chinese Corpus Set 1 Speech Chinese Mandarin corpus released by Shanghai Primewords Co. Ltd. (www.primewords.cn), containing 100 hours of speech data.
SLR48 MADCAT Arabic data splits Other Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus
SLR49 VoxCeleb Data Misc Various files for the VoxCeleb datasets
SLR50 MADCAT Chinese data splits Other Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus
SLR51 TED-LIUM Release 3 Speech TED-LIUM corpus release 3
SLR52 Large Sinhala ASR training data set Speech Sinhala ASR training data set containing ~185K utterances.
SLR53 Large Bengali ASR training data set Speech Bengali ASR training data set containing ~196K utterances.
SLR54 Large Nepali ASR training data set Speech Nepali ASR training data set containing ~157K utterances.
SLR55 CLMAD Text A Chinese Language Model Adaptation Dataset (CLMAD).
SLR56 IAM Aachen splits Other Aachen data splits (train/test/val) for the IAM dataset.
SLR57 African Accented French Speech Recordings of African Accented French speech.
SLR58 Pansori-TEDxKR Speech Korean speech corpus generated from Korean language TEDx talks
SLR59 ParlamentParla Speech Catalan speech corpus generated from Catalan Parliamentary sessions
SLR60 LibriTTS corpus Speech Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus
SLR61 Crowdsourced high-quality Argentinian Spanish speech data set. Speech Data set which contains 5739 recordings of native speakers of Spanish
SLR62 aidatatang_200zh Speech A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers. The transcription accuracy for each sentence is higher than 98%.
SLR63 Crowdsourced high-quality Malayalam multi-speaker speech data set. Speech Data set which contains recordings of native speakers of Malayalam.
SLR64 Crowdsourced high-quality Marathi multi-speaker speech data set. Speech Data set which contains recordings of native speakers of Marathi
SLR65 Crowdsourced high-quality Tamil multi-speaker speech data set. Speech Data set which contains recordings of native speakers of Tamil.
SLR66 Crowdsourced high-quality Telugu multi-speaker speech data set. Speech Data set which contains recordings of native speakers of Telugu.
SLR67 TEDx Spanish Corpus Speech Spanish data taken from the TEDx Talks
SLR68 MAGICDATA Mandarin Chinese Read Speech Corpus Speech The corpus by Magic Data Technology Co., Ltd., containing 755 hours of scripted read speech data from 1080 native speakers of the Mandarin Chinese spoken in mainland China. The sentence transcription accuracy is higher than 98%.
SLR69 Crowdsourced high-quality Catalan speech data set. Speech Data set which contains recordings of Catalan.
SLR70 Crowdsourced high-quality Nigerian English speech data set. Speech Data set which contains recordings of Nigerian English.
SLR71 Crowdsourced high-quality Chilean Spanish speech data set. Speech Data set which contains recordings of Chilean Spanish.
SLR72 Crowdsourced high-quality Colombian Spanish speech data set. Speech Data set which contains recordings of Colombian Spanish.
SLR73 Crowdsourced high-quality Peruvian Spanish speech data set. Speech Data set which contains recordings of Peruvian Spanish.
SLR74 Crowdsourced high-quality Puerto Rico Spanish speech data set. Speech Data set which contains recordings of Puerto Rico Spanish.
SLR75 Crowdsourced high-quality Venezuelan Spanish speech data set. Speech Data set which contains recordings of Venezuelan Spanish.
SLR76 Crowdsourced high-quality Basque speech data set. Speech Data set which contains recordings of Basque.
SLR77 Crowdsourced high-quality Galician speech data set. Speech Data set which contains recordings of Galician.
SLR78 Crowdsourced high-quality Gujarati multi-speaker speech data set. Speech Data set which contains recordings of native speakers of Gujarati.
SLR79 Crowdsourced high-quality Kannada multi-speaker speech data set. Speech Data set which contains recordings of native speakers of Kannada.
SLR80 Crowdsourced high-quality Burmese speech data set. Speech Data set which contains recordings of Burmese.
SLR81 Small Audio Clips Speech Contains 20 one-second audio clips from various sources, for testing compression algorithms
SLR82 CN-Celeb Speech A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University
SLR83 Crowdsourced high-quality UK and Ireland English Dialect speech data set. Speech Data set which contains male and female recordings of English from various dialects of the UK and Ireland.
SLR84 ScribbleLens Handwriting Dutch cursive, 16th-18th century handwriting, pages and lines, for (un)supervised AI and other research.
SLR85 HI-MIA Speech A far-field text-dependent speaker verification database for AISHELL Speaker Verification Challenge 2019
SLR86 Crowdsourced high-quality Yoruba speech data set. Speech Data set which contains recordings of Yoruba.
SLR87 MobvoiHotwords Speech Chinese hotwords detection dataset, provided by Mobvoi CO.,LTD
SLR88 Att-HACK Speech French Expressive Speech Database with Social Attitudes
SLR89 Yoloxóchitl-Mixtec Speech Yoloxóchitl Mixtec Speech with Transcription
SLR92 Puebla-Nahuatl Speech Puebla Nahuatl Speech with Transcription
SLR93 AISHELL-3 Speech Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.
SLR94 Multilingual LibriSpeech (MLS) Speech A large multilingual corpus derived from LibriVox audiobooks
SLR95 Thorsten Müller (German Neutral-TTS dataset) Speech Free single-speaker German dataset (> 23 hours) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training
SLR96 Russian LibriSpeech (RuLS) Speech This dataset is based on LibriVox audiobooks
SLR97 Deeply Korean read speech corpus Speech Pairs of Korean reading the scripts with 3 text sentiments using 3 vocal sentiments. Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
SLR98 Deeply parent-child vocal interaction dataset Speech The interaction of pairs of parent and child (reading fairy tales, singing children’s songs, conversing, and others). Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
SLR99 Deeply Nonverbal Vocalization Dataset Audio A human nonverbal vocal sound dataset by Deeply Inc.
SLR100 Multilingual TEDx Speech A multilingual corpus of TEDx talks for speech recognition and translation
SLR101 speechocean762 Speech Pronunciation scoring dataset, labeled independently by five human experts
SLR102 Kazakh Speech Corpus (KSC) Speech A crowdsourced open-source Kazakh speech corpus developed by ISSAI (330 hours)
SLR103 Multilingual and code-switching ASR Challenge Dataset - sub-task1 Speech Datasets for sub-task1 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/)
SLR104 Multilingual and code-switching ASR Challenge Dataset - sub-task2 Speech Datasets for sub-task2 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/)
SLR105 nicolingua-0003-west-african-radio-corpus Speech West African Radio Corpus
SLR106 nicolingua-0004-west-african-va-asr-corpus Speech West African Virtual Assistant Speech Recognition Corpus
SLR107 Totonac Resources Speech Totonac Speech with Transcription
SLR108 MediaSpeech Speech French, Arabic, Turkish and Spanish media speech datasets
SLR109 Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS) Speech A multi-speaker English dataset for training text-to-speech models
SLR110 Thorsten Müller (German Emotional-TTS dataset) Speech Free emotional single-speaker German dataset (Neutral, Disgusted, Angry, Amused, Surprised, Sleepy, Drunk, Whispering) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training
SLR111 AISHELL-4 Speech A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Beijing Shell Shell Technology Co., Ltd.
SLR112 Samromur 21.05 Speech Samrómur Icelandic Speech corpus approved for release in May 2021
SLR113 SEOUL CORPUS Speech The Korean Corpus of Spontaneous Speech (aka, Seoul Corpus), created from the NRF(Korea)-funded project
SLR114 Golos Speech Russian ASR dataset (1240 hours) with trained acoustic and language models
SLR115 EmoV_DB Speech A database of emotional speech intended to be open-sourced and used for synthesis and generation purposes. It contains data for male and female actors in English (https://github.com/numediart/EmoV-DB)
SLR116 Samrómur Queries 21.12 Speech Samrómur Icelandic Speech corpus focused on queries and approved for release in December 2021
SLR117 Samrómur Children 21.09 Speech Samrómur Icelandic Speech from children (ages 4-17 years) approved for release in September 2021
SLR118 1111 Hours Hindi ASR Challenge Speech Datasets for 1111 Hours Hindi ASR Challenge Closed, Self Supervised Closed and Open - 2022 (https://sites.google.com/view/gramvaaniasrchallenge/home)
SLR119 AliMeeting Speech A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Alibaba Group
SLR120 HI-MIA-CW Speech A Free Mandarin Supplemental Speech Corpus to HI-MIA Database, whose contents are negative samples for wake-up words "Hi, Mia".
SLR121 WenetSpeech Speech A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition
SLR122 Kashmiri Data Corpus Speech An audio and text corpus for the Kashmiri language
SLR123 MAGICDATA Mandarin Chinese Conversational Speech Corpus Speech The corpus by Magic Data Technology Co., Ltd., containing 180 hours of richly annotated Mandarin spontaneous conversational speech data.
SLR124 TIBMD@MUC speech data set Speech A Tibetan multi-dialect speech data set (84.33 hours)
SLR125 Basic LAnguage Resource Kit 1.0 for Faroese Speech Faroese Speech corpus approved for release in July 2022
SLR126 IISc-MILE Kannada ASR Corpus Speech Kannada transcribed speech corpus for ASR
SLR127 IISc-MILE Tamil ASR Corpus Speech Tamil transcribed speech corpus for ASR
SLR128 Samrómur Unverified 22.07 Speech Samrómur Icelandic Speech, 2,200 hours of mostly unverified data approved for release in July 2022
SLR129 BibleTTS Speech A large, high-fidelity, multilingual, and uniquely African speech corpus
SLR130 Samrómur L2 22.09 Speech Samrómur Icelandic Speech, 150 hours from people with Icelandic as a second language. Approved for release in July 2022
SLR131 Samrómur Mimic 22.09 Speech Samrómur Icelandic Speech, 66.7 hours of speech where users mimic utterances. Approved in September 2022
SLR132 Mohammed Speech Arabic speech-to-text Quran data
SLR133 XBMU-AMDO31 Speech Tibetan Amdo dialect speech data from NLIT, Northwest Minzu University
SLR134 SASPEECH Speech Hebrew speech and transcripts by a single speaker (30 hours)
SLR135 Libri-Mixed-Speakers Speech English audio of simultaneous speakers derived from LibriTTS
SLR136 EMNS Speech, text-to-speech, automatic speech recognition An emotive single-speaker dataset for narrative storytelling. EMNS is a dataset containing transcriptions, emotion, emotion intensity, and description of acted speech.
SLR137 Silbo Gomero Speech Corpus Speech Corpus of the Silbo Gomero whistled language, based on 49 minutes of recordings created by 4 whistlers.
SLR138 SHALCAS22A Speech A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd.
SLR139 Audiocite.net Speech Spoken dataset of books read in French, initially collected from audiocite.net by the GETALP team for the LeBenchmark project.
SLR140 Kazakh Speech Dataset (KSD) Speech High-quality open source Kazakh speech corpus developed by the Department of Artificial Intelligence and Big Data of Al-Farabi Kazakh National University (554 hours)
SLR141 LibriTTS-R Speech Sound quality improved version of the LibriTTS corpus which is a large-scale corpus of English speech designed for TTS use
SLR142 The MC Speech Dataset Speech Free speech dataset consisting of 24018 short audio clips of a single speaker reading sentences in Polish
SLR143 Nepali Text-to-Speech Data (Male and Female) Speech Nepali speech and corresponding text data in male and female voice
SLR144 SlideSpeech Audio-Visual Speech A Large-scale English Multi-Modal Audio-Visual Corpus, provided by Alibaba Group
SLR145 LibriSpeech-PC Text LibriSpeech text with Punctuation and Capitalization
SLR146 CML-TTS Dataset Speech CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages
SLR147 Veracruz Orizaba Nahuatl Endangered Language Speech Audio corpus of Orizaba (Veracruz) Nahuatl speech (Glottocode: oriz1235; ISO 639-3: nlv)
SLR148 Tepetzintla Zacatlan Nahuatl Endangered Language Speech Audio corpus of Zacatlán-Ahuacatlán-Tepetzintla (Puebla) Nahuatl speech (Glottocode: zaca1241; ISO 639-3: nhi)
SLR149 Tibetan Greetings Speech Selected Tibetan greetings speech data categorized according to the dialectal region.
SLR150 CHiME-6 Speech English multi-channel far field meeting data used in the CHiME-6 Challenge. It is derived from CHiME-5 by fixing some array synchronization errors.
SLR151 Kallaama Speech Wolof, Pulaar and Sereer data
SLR152 Pragmatic Similarity Judgments Speech Judgments of perceived similarity between utterance pairs from dialogs, in English and Spanish.
SLR153 Yerevan City Magazine Text A Free Armenian News Text Corpus, provided by Qaghaki Amsagir LLC (Yerevan City Magazine, evnmag.com)
SLR154 ArmenianGrqaserAudioBooks Speech Cut, segmented, and processed (speech, text) paired data, derived from the Grqaser.org audiobooks
SLR155 SBCSAE Speech The Santa Barbara Corpus of Spoken American English, mirrored from UCSB
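
Some entries above state an audio format; SLR28, for example, says its files are all at a 16k sampling rate with 16-bit precision. A minimal Python sketch of how such a claim could be checked on a downloaded PCM WAV file, using only the standard library. The function names `wav_format` and `is_16k_16bit` are illustrative and are not part of any resource listed here, and the sketch assumes uncompressed PCM WAV input.

```python
import io
import wave


def wav_format(source):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def is_16k_16bit(source):
    """True if the WAV is sampled at 16 kHz with 16-bit samples."""
    rate, bits, _channels = wav_format(source)
    return rate == 16000 and bits == 16


def make_silent_wav(rate, sample_width_bytes, n_frames=160):
    """Build a small mono WAV of silence in memory (for demonstration only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(sample_width_bytes)
        out.setframerate(rate)
        out.writeframes(b"\x00" * sample_width_bytes * n_frames)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # Synthetic input; replace with a path to a real file from a corpus.
    print(is_16k_16bit(make_silent_wav(16000, 2)))
```

Corpora distributed in other containers (FLAC, SPHERE, MP3) would need conversion or a different reader before a check like this applies.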