| Resource | Name | Category | Summary |
| --- | --- | --- | --- |
| SLR1 | Yesno | Speech | Sixty recordings of one individual saying yes or no in Hebrew; each recording is eight words long. |
| SLR2 | OpenFST | Software | A mirror of the OpenFst toolkit |
| SLR3 | sph2pipe | Software | A mirror of the sph2pipe software |
| SLR4 | sctk | Software | A mirror of the sctk scoring software |
| SLR5 | MSU Switchboard transcripts | Text | A mirror of the Mississippi State transcripts and lexicon for Switchboard. |
| SLR6 | Vystadial | Speech | English and Czech data, mirrored from the Vystadial project |
| SLR7 | TED-LIUM | Speech | English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here) |
| SLR8 | Sprakbanken | Text | Danish pronunciation dictionary generated using eSpeak |
| SLR9 | The AMI pack | Text | Some auxiliary non-speech data used to build AMI systems with Kaldi |
| SLR10 | SRE Data | Misc | Various files from SRE data that NIST used to host online |
| SLR11 | LibriSpeech language models, vocabulary and G2P models | Text | Language modelling resources, for use with the LibriSpeech ASR corpus |
| SLR12 | LibriSpeech ASR corpus | Speech | Large-scale (1000 hours) corpus of read English speech |
| SLR13 | RWCP Sound Scene Database | Speech + Software | A database of recordings of real-world sounds and measured room impulse responses |
| SLR14 | BEEP Dictionary | Text | Phonemic transcriptions of over 250,000 English words. (British English pronunciations) |
| SLR15 | SRE Speaker List | Misc | A list linking speakers across NIST SRE corpora |
| SLR16 | The AMI Corpus | Speech | Acoustic speech data and meta-data from The AMI corpus. |
| SLR17 | MUSAN | Audio | A corpus of music, speech, and noise |
| SLR18 | THCHS-30 | Speech | A Free Chinese Speech Corpus Released by CSLT@Tsinghua University |
| SLR19 | TED-LIUMv2 | Audio | TED-LIUM corpus release 2, English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here) |
| SLR20 | Aachen Impulse Response Database | Audio | Aachen Impulse Response database (AIR): a database of room impulse responses (mirrored here) |
| SLR21 | Spanish Word list | Text | A list of words in Spanish with frequency derived from a large corpus (Spanish Gigaword). |
| SLR22 | THUYG-20 | Speech | A free Uyghur speech database Released by CSLT@Tsinghua University & Xinjiang University |
| SLR23 | NIST LRE 2007 Key | Misc | A file containing metadata for the utterances in the LRE 2007 evaluation |
| SLR24 | Iban | Speech | Iban language text and speech corpora for ASR |
| SLR25 | ALFFA (African Languages in the Field: speech Fundamentals and Automation) | Speech | Amharic, Swahili and Wolof data, mirrored from the ALFFA git repository |
| SLR26 | Simulated Room Impulse Response Database | Audio | A database of simulated room impulse responses |
| SLR27 | Cantab-TEDLIUM Release 1.1 (February 2015) | Text | Cantab Research Language models for the TEDLIUM database |
| SLR28 | Room Impulse Response and Noise Database | Audio | A database of simulated and real room impulse responses, isotropic and point-source noises. The audio files are all sampled at 16 kHz with 16-bit precision. |
| SLR29 | Sprakbanken_Swe | Text | Swedish pronunciation dictionary |
| SLR30 | Sinhala TTS | Speech | Sinhalese multi-speaker TTS corpora |
| SLR31 | Mini LibriSpeech ASR corpus | Speech | Subset of LibriSpeech corpus for purpose of regression testing |
| SLR32 | High quality TTS data for four South African languages (af, st, tn, xh) | Speech | Multi-speaker TTS data for four South African languages, Afrikaans, Sesotho, Setswana and isiXhosa. |
| SLR33 | Aishell | Speech | Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd. |
| SLR34 | Santiago Spanish Lexicon | Text | A pronouncing dictionary for the Spanish language. |
| SLR35 | Large Javanese ASR training data set | Speech | Javanese ASR training data set containing ~185K utterances. |
| SLR36 | Large Sundanese ASR training data set | Speech | Sundanese ASR training data set containing ~220K utterances. |
| SLR37 | High quality TTS data for Bengali languages | Speech | Multi-speaker TTS data for Bangladesh Bengali (bn-BD) and Indian Bengali (bn-IN). |
| SLR38 | Free ST Chinese Mandarin Corpus | Speech | A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102,600 utterances from 855 speakers. |
| SLR39 | Heroico | Speech | Spanish data, mirrored from the LDC |
| SLR40 | Zeroth-Korean | Speech Corpus for Automatic Speech Recognition | Korean Open-source Speech Corpus for Speech Recognition by Zeroth Project (https://github.com/goodatlas/zeroth) |
| SLR41 | High quality TTS data for Javanese. | Speech | Multi-speaker TTS data for Javanese (jv-ID) |
| SLR42 | High quality TTS data for Khmer. | Speech | Multi-speaker TTS data for Khmer (km-KH) |
| SLR43 | High quality TTS data for Nepali. | Speech | Multi-speaker TTS data for Nepali (ne-NP) |
| SLR44 | High quality TTS data for Sundanese. | Speech | Multi-speaker TTS data for Sundanese (su-ID) |
| SLR45 | Free ST American English Corpus | Speech | A free American English corpus by Surfingtech (www.surfing.ai), containing utterances from 10 speakers, each with about 350 utterances. |
| SLR46 | Tunisian_MSA | Speech | Tunisian Modern Standard Arabic |
| SLR47 | Primewords Chinese Corpus Set 1 | Speech | Chinese Mandarin corpus released by Shanghai Primewords Co. Ltd. (www.primewords.cn), containing 100 hours of speech data. |
| SLR48 | MADCAT Arabic data splits | Other | Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus |
| SLR49 | VoxCeleb Data | Misc | Various files for the VoxCeleb datasets |
| SLR50 | MADCAT Chinese data splits | Other | Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus |
| SLR51 | TED-LIUM Release 3 | Speech | TED-LIUM corpus release 3 |
| SLR52 | Large Sinhala ASR training data set | Speech | Sinhala ASR training data set containing ~185K utterances. |
| SLR53 | Large Bengali ASR training data set | Speech | Bengali ASR training data set containing ~196K utterances. |
| SLR54 | Large Nepali ASR training data set | Speech | Nepali ASR training data set containing ~157K utterances. |
| SLR55 | CLMAD | Text | A Chinese Language Model Adaptation Dataset (CLMAD). |
| SLR56 | IAM Aachen splits | Other | Aachen data splits (train/test/val) for the IAM dataset. |
| SLR57 | African Accented French | Speech | Recordings of African Accented French speech. |
| SLR58 | Pansori-TEDxKR | Speech | Korean speech corpus generated from Korean language TEDx talks |
| SLR59 | ParlamentParla | Speech | Catalan speech corpus generated from Catalan Parliamentary sessions |
| SLR60 | LibriTTS corpus | Speech | Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus |
| SLR61 | Crowdsourced high-quality Argentinian Spanish speech data set. | Speech | Data set which contains 5739 recordings of native speakers of Spanish |
| SLR62 | aidatatang_200zh | Speech | A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers. Sentence transcription accuracy is above 98%. |
| SLR63 | Crowdsourced high-quality Malayalam multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Malayalam. |
| SLR64 | Crowdsourced high-quality Marathi multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Marathi |
| SLR65 | Crowdsourced high-quality Tamil multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Tamil. |
| SLR66 | Crowdsourced high-quality Telugu multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Telugu. |
| SLR67 | TEDx Spanish Corpus | Speech | Spanish data taken from the TEDx Talks |
| SLR68 | MAGICDATA Mandarin Chinese Read Speech Corpus | Speech | The corpus by Magic Data Technology Co., Ltd., containing 755 hours of scripted read speech data from 1080 native speakers of the Mandarin Chinese spoken in mainland China. The sentence transcription accuracy is higher than 98%. |
| SLR69 | Crowdsourced high-quality Catalan speech data set. | Speech | Data set which contains recordings of Catalan. |
| SLR70 | Crowdsourced high-quality Nigerian English speech data set. | Speech | Data set which contains recordings of Nigerian English. |
| SLR71 | Crowdsourced high-quality Chilean Spanish speech data set. | Speech | Data set which contains recordings of Chilean Spanish. |
| SLR72 | Crowdsourced high-quality Colombian Spanish speech data set. | Speech | Data set which contains recordings of Colombian Spanish. |
| SLR73 | Crowdsourced high-quality Peruvian Spanish speech data set. | Speech | Data set which contains recordings of Peruvian Spanish. |
| SLR74 | Crowdsourced high-quality Puerto Rico Spanish speech data set. | Speech | Data set which contains recordings of Puerto Rico Spanish. |
| SLR75 | Crowdsourced high-quality Venezuelan Spanish speech data set. | Speech | Data set which contains recordings of Venezuelan Spanish. |
| SLR76 | Crowdsourced high-quality Basque speech data set. | Speech | Data set which contains recordings of Basque. |
| SLR77 | Crowdsourced high-quality Galician speech data set. | Speech | Data set which contains recordings of Galician. |
| SLR78 | Crowdsourced high-quality Gujarati multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Gujarati. |
| SLR79 | Crowdsourced high-quality Kannada multi-speaker speech data set. | Speech | Data set which contains recordings of native speakers of Kannada. |
| SLR80 | Crowdsourced high-quality Burmese speech data set. | Speech | Data set which contains recordings of Burmese. |
| SLR81 | Small Audio Clips | Speech | Contains 20 one-second audio clips from various sources, for testing compression algorithms |
| SLR82 | CN-Celeb | Speech | A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University |
| SLR83 | Crowdsourced high-quality UK and Ireland English Dialect speech data set. | Speech | Data set which contains male and female recordings of English from various dialects of the UK and Ireland. |
| SLR84 | ScribbleLens | Handwriting | Dutch cursive handwriting from the 16th-18th centuries, as pages and lines, for (un)supervised AI and other research. |
| SLR85 | HI-MIA | Speech | A far-field text-dependent speaker verification database for AISHELL Speaker Verification Challenge 2019 |
| SLR86 | Crowdsourced high-quality Yoruba speech data set. | Speech | Data set which contains recordings of Yoruba. |
| SLR87 | MobvoiHotwords | Speech | Chinese hotwords detection dataset, provided by Mobvoi CO.,LTD |
| SLR88 | Att-HACK | Speech | French Expressive Speech Database with Social Attitudes |
| SLR89 | Yoloxóchitl-Mixtec | Speech | Yoloxóchitl Mixtec Speech with Transcription |
| SLR92 | Puebla-Nahuatl | Speech | Puebla Nahuatl Speech with Transcription |
| SLR93 | AISHELL-3 | Speech | Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd. |
| SLR94 | Multilingual LibriSpeech (MLS) | Speech | A large multilingual corpus derived from LibriVox audiobooks |
| SLR95 | Thorsten Müller (German Neutral-TTS dataset) | Speech | Free single-speaker German dataset (> 23 hours) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training |
| SLR96 | Russian LibriSpeech (RuLS) | Speech | This dataset is based on LibriVox audiobooks |
| SLR97 | Deeply Korean read speech corpus | Speech | Pairs of Korean reading the scripts with 3 text sentiments using 3 vocal sentiments. Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone. |
| SLR98 | Deeply parent-child vocal interaction dataset | Speech | Interactions between parent-child pairs (reading fairy tales, singing children’s songs, conversing, and others). Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone. |
| SLR99 | Deeply Nonverbal Vocalization Dataset | Audio | A human nonverbal vocal sound dataset by Deeply Inc. |
| SLR100 | Multilingual TEDx | Speech | A multilingual corpus of TEDx talks for speech recognition and translation |
| SLR101 | speechocean762 | Speech | Pronunciation scoring dataset, labeled independently by five human experts |
| SLR102 | Kazakh Speech Corpus (KSC) | Speech | A crowdsourced open-source Kazakh speech corpus developed by ISSAI (330 hours) |
| SLR103 | Multilingual and code-switching ASR Challenge Dataset - sub-task1 | Speech | Datasets for sub-task1 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/) |
| SLR104 | Multilingual and code-switching ASR Challenge Dataset - sub-task2 | Speech | Datasets for sub-task2 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/) |
| SLR105 | nicolingua-0003-west-african-radio-corpus | Speech | West African Radio Corpus |
| SLR106 | nicolingua-0004-west-african-va-asr-corpus | Speech | West African Virtual Assistant Speech Recognition Corpus |
| SLR107 | Totonac Resources | Speech | Totonac Speech with Transcription |
| SLR108 | MediaSpeech | Speech | French, Arabic, Turkish and Spanish media speech datasets |
| SLR109 | Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS) | Speech | A multi-speaker English dataset for training text-to-speech models |
| SLR110 | Thorsten Müller (German Emotional-TTS dataset) | Speech | Free emotional single-speaker German dataset (Neutral, Disgusted, Angry, Amused, Surprised, Sleepy, Drunk, Whispering) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training |
| SLR111 | AISHELL-4 | Speech | A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Beijing Shell Shell Technology Co., Ltd. |
| SLR112 | Samromur 21.05 | Speech | Samrómur Icelandic Speech corpus approved for release in May 2021 |
| SLR113 | SEOUL CORPUS | Speech | The Korean Corpus of Spontaneous Speech (aka Seoul Corpus), created from the NRF (Korea)-funded project |
| SLR114 | Golos | Speech | Russian ASR dataset (1240 hours) with trained acoustic and language models |
| SLR115 | EmoV_DB | Speech | A database of emotional speech intended to be open-sourced and used for synthesis and generation purposes. It contains data for male and female actors in English (https://github.com/numediart/EmoV-DB) |
| SLR116 | Samrómur Queries 21.12 | Speech | Samrómur Icelandic Speech corpus focused on queries and approved for release in December 2021 |
| SLR117 | Samrómur Children 21.09 | Speech | Samrómur Icelandic Speech from children (ages 4-17 years) approved for release in September 2021 |
| SLR118 | 1111 Hours Hindi ASR Challenge | Speech | Datasets for 1111 Hours Hindi ASR Challenge Closed, Self Supervised Closed and Open - 2022 (https://sites.google.com/view/gramvaaniasrchallenge/home) |
| SLR119 | AliMeeting | Speech | A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Alibaba Group |
| SLR120 | HI-MIA-CW | Speech | A Free Mandarin Supplemental Speech Corpus to HI-MIA Database, whose contents are negative samples for wake-up words "Hi, Mia". |
| SLR121 | WenetSpeech | Speech | A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition |
| SLR122 | Kashmiri Data Corpus | Speech | An audio and text corpus for the Kashmiri language |
| SLR123 | MAGICDATA Mandarin Chinese Conversational Speech Corpus | Speech | The corpus by Magic Data Technology Co., Ltd., containing 180 hours of richly annotated Mandarin spontaneous conversational speech data. |
| SLR124 | TIBMD@MUC speech data set | Speech | A Tibetan multi-dialect speech data set (84.33 hours) |
| SLR125 | Basic LAnguage Resource Kit 1.0 for Faroese | Speech | Faroese Speech corpus approved for release in July 2022 |
| SLR126 | IISc-MILE Kannada ASR Corpus | Speech | Kannada transcribed speech corpus for ASR |
| SLR127 | IISc-MILE Tamil ASR Corpus | Speech | Tamil transcribed speech corpus for ASR |
| SLR128 | Samrómur Unverified 22.07 | Speech | Samrómur Icelandic Speech, 2,200 hours of mostly unverified data approved for release in July 2022 |
| SLR129 | BibleTTS | Speech | A large, high-fidelity, multilingual, and uniquely African speech corpus |
| SLR130 | Samrómur L2 22.09 | Speech | Samrómur Icelandic Speech, 150 hours from people with Icelandic as a second language. Approved for release in July 2022 |
| SLR131 | Samrómur Mimic 22.09 | Speech | Samrómur Icelandic Speech, 66.7 hours of speech where users mimic utterances. Approved in September 2022 |
| SLR132 | Mohammed | Speech | Arabic speech to text Quran data |
| SLR133 | XBMU-AMDO31 | Speech | Tibetan Amdo dialect speech data from NLIT, Northwest Minzu University |
| SLR134 | SASPEECH | Speech | Hebrew speech and transcripts by a single speaker (30 hours) |
| SLR135 | Libri-Mixed-Speakers | Speech | English audio of simultaneous speakers derived from LibriTTS |
| SLR136 | EMNS | Speech, text-to-speech, automatic speech recognition | An emotive single-speaker dataset for narrative storytelling. EMNS is a dataset containing transcriptions, emotion, emotion intensity, and description of acted speech. |
| SLR137 | Silbo Gomero Speech Corpus | Speech | Corpus of the Silbo Gomero whistled language, based on 49 minutes of recordings created by 4 whistlers. |
| SLR138 | SHALCAS22A | Speech | A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd. |
| SLR139 | Audiocite.net | Speech | Spoken dataset of books read in French, initially collected from audiocite.net by the GETALP team for the LeBenchmark project. |
| SLR140 | Kazakh Speech Dataset (KSD) | Speech | High-quality open source Kazakh speech corpus developed by the Department of Artificial Intelligence and Big Data of Al-Farabi Kazakh National University (554 hours) |
| SLR141 | LibriTTS-R | Speech | Sound quality improved version of the LibriTTS corpus which is a large-scale corpus of English speech designed for TTS use |
| SLR142 | The MC Speech Dataset | Speech | Free speech dataset consisting of 24018 short audio clips of a single speaker reading sentences in Polish |
| SLR143 | Nepali Text-to-Speech Data (Male and Female) | Speech | Nepali speech and corresponding text data in male and female voice |
| SLR144 | SlideSpeech | Audio-Visual Speech | A Large-scale English Multi-Modal Audio-Visual Corpus, provided by Alibaba Group |
| SLR145 | LibriSpeech-PC | Text | LibriSpeech text with Punctuation and Capitalization |
| SLR146 | CML-TTS Dataset | Speech | CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages |
| SLR147 | Veracruz Orizaba Nahuatl Endangered Language | Speech | Audio corpus of Orizaba (Veracruz) Nahuatl speech (Glottocode: oriz1235; ISO 639-3: nlv) |
| SLR148 | Tepetzintla Zacatlan Nahuatl Endangered Language | Speech | Audio corpus of Zacatlán-Ahuacatlán-Tepetzintla (Puebla) Nahuatl speech (Glottocode: zaca1241; ISO 639-3: nhi) |
| SLR149 | Tibetan Greetings | Speech | Selected Tibetan greetings speech data categorized according to the dialectal region. |
| SLR150 | CHiME-6 | Speech | English multi-channel far-field meeting data used in the CHiME-6 Challenge. It is derived from CHiME-5 by fixing some array synchronization errors. |
| SLR151 | Kallaama | Speech | Wolof, Pulaar and Sereer data |
| SLR152 | Pragmatic Similarity Judgments | Speech | Judgments of perceived similarity between utterance pairs from dialogs, in English and Spanish. |
| SLR153 | Yerevan City Magazine | Text | A Free Armenian News Text Corpus, provided by Qaghaki Amsagir LLC (Yerevan City Magazine, evnmag.com) |
| SLR154 | ArmenianGrqaserAudioBooks | Speech | Cut, segmented, and processed (speech, text) paired data, derived from the Grqaser.org audiobooks |
| SLR155 | SBCSAE | Speech | The Santa Barbara Corpus of Spoken American English, mirrored from UCSB |
| SLR156 | SMIIP-TV dataset | Speech | A short-term time-varying speaker verification dataset |
| SLR157 | Sagalee | Speech | Automatic Speech Recognition Dataset for Oromo Language |
| SLR158 | NICT-Tib1 | Speech | 33.5-hour Lhasa-Tibetan read-speech corpus with Kaldi-style transcripts |
| SLR159 | AISHELL-5 | Speech | The First Open-Source In-Car Multi-Channel Multi-Speaker Speech Dataset for Automatic Speech Diarization and Recognition, provided by Beijing AISHELL Technology Co., Ltd. |
| SLR160 | Armenian Speech Crowdsourcing Data | Speech | 70 hours of Armenian speech collected via crowdsourcing with Toloka and texts from Yerevan City Magazine. |
| SLR161 | Emozionalmente | Speech | A crowdsourced Italian emotional speech corpus |