Resource | Name | Category | Summary |
SLR1 | Yesno
| Speech
| Sixty recordings of one individual saying yes or no in Hebrew; each recording is eight words long.
|
SLR2 | OpenFST
| Software
| A mirror of the OpenFst toolkit
|
SLR3 | sph2pipe
| Software
| A mirror of the sph2pipe software
|
SLR4 | sctk
| Software
| A mirror of the sctk scoring software
|
SLR5 | MSU Switchboard transcripts
| Text
| A mirror of the Mississippi State transcripts and lexicon for Switchboard.
|
SLR6 | Vystadial
| Speech
| English and Czech data, mirrored from the Vystadial project
|
SLR7 | TED-LIUM
| Speech
| English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)
|
SLR8 | Sprakbanken
| Text
| Danish pronunciation dictionary generated using eSpeak
|
SLR9 | The AMI pack
| Text
| Some auxiliary non-speech data used to build AMI systems with Kaldi
|
SLR10 | SRE Data
| Misc
| Various files from SRE data that NIST used to host online
|
SLR11 | LibriSpeech language models, vocabulary and G2P models
| Text
| Language modelling resources, for use with the LibriSpeech ASR corpus
|
SLR12 | LibriSpeech ASR corpus
| Speech
| Large-scale (1000 hours) corpus of read English speech
|
SLR13 | RWCP Sound Scene Database
| Speech + Software
| A database of recordings of real-world sounds and measured room impulse responses
|
SLR14 | BEEP Dictionary
| Text
| Phonemic transcriptions of over 250,000 English words (British English pronunciations).
|
SLR15 | SRE Speaker List
| Misc
| A list linking speakers across NIST SRE corpora
|
SLR16 | The AMI Corpus
| Speech
| Acoustic speech data and meta-data from The AMI corpus.
|
SLR17 | MUSAN
| Audio
| A corpus of music, speech, and noise
|
SLR18 | THCHS-30
| Speech
| A Free Chinese Speech Corpus Released by CSLT@Tsinghua University
|
SLR19 | TED-LIUMv2
| Audio
| TED-LIUM corpus release 2, English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)
|
SLR20 | Aachen Impulse Response Database
| Audio
| Aachen Impulse Response database (AIR): a database of room impulse responses (mirrored here)
|
SLR21 | Spanish Word list
| Text
| A list of words in Spanish with frequency derived from a large corpus (Spanish Gigaword).
|
SLR22 | THUYG-20
| Speech
| A free Uyghur speech database Released by CSLT@Tsinghua University & Xinjiang University
|
SLR23 | NIST LRE 2007 Key
| Misc
| A file containing metadata for the utterances in the LRE 2007 evaluation
|
SLR24 | Iban
| Speech
| Iban language text and speech corpora for ASR
|
SLR25 | ALFFA (African Languages in the Field: speech Fundamentals and Automation)
| Speech
| Amharic, Swahili and Wolof data, mirrored from the ALFFA git repository
|
SLR26 | Simulated Room Impulse Response Database
| Audio
| A database of simulated room impulse responses
|
SLR27 | Cantab-TEDLIUM Release 1.1 (February 2015)
| Text
| Cantab Research Language models for the TEDLIUM database
|
SLR28 | Room Impulse Response and Noise Database
| Audio
| A database of simulated and real room impulse responses, isotropic and point-source noises. All audio files are sampled at 16 kHz with 16-bit precision.
|
SLR29 | Sprakbanken_Swe
| Text
| Swedish pronunciation dictionary
|
SLR30 | Sinhala TTS
| Speech
| Sinhalese multi-speaker TTS corpora
|
SLR31 | Mini LibriSpeech ASR corpus
| Speech
| Subset of the LibriSpeech corpus for regression testing
|
SLR32 | High quality TTS data for four South African languages (af, st, tn, xh)
| Speech
| Multi-speaker TTS data for four South African languages, Afrikaans, Sesotho, Setswana and isiXhosa.
|
SLR33 | Aishell
| Speech
| Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.
|
SLR34 | Santiago Spanish Lexicon
| Text
| A pronouncing dictionary for the Spanish language.
|
SLR35 | Large Javanese ASR training data set
| Speech
| Javanese ASR training data set containing ~185K utterances.
|
SLR36 | Large Sundanese ASR training data set
| Speech
| Sundanese ASR training data set containing ~220K utterances.
|
SLR37 | High quality TTS data for Bengali languages
| Speech
| Multi-speaker TTS data for Bangladesh Bengali (bn-BD) and Indian Bengali (bn-IN).
|
SLR38 | Free ST Chinese Mandarin Corpus
| Speech
| A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102600 utterances from 855 speakers.
|
SLR39 | Heroico
| Speech
| Spanish data, mirrored from the LDC
|
SLR40 | Zeroth-Korean
| Speech Corpus for Automatic Speech Recognition
| Korean Open-source Speech Corpus for Speech Recognition by Zeroth Project (https://github.com/goodatlas/zeroth)
|
SLR41 | High quality TTS data for Javanese.
| Speech
| Multi-speaker TTS data for Javanese (jv-ID)
|
SLR42 | High quality TTS data for Khmer.
| Speech
| Multi-speaker TTS data for Khmer (km-KH)
|
SLR43 | High quality TTS data for Nepali.
| Speech
| Multi-speaker TTS data for Nepali (ne-NP)
|
SLR44 | High quality TTS data for Sundanese.
| Speech
| Multi-speaker TTS data for Sundanese (su-ID)
|
SLR45 | Free ST American English Corpus
| Speech
| A free American English corpus by Surfingtech (www.surfing.ai), containing utterances from 10 speakers; each speaker has about 350 utterances.
|
SLR46 | Tunisian_MSA
| Speech
| Tunisian Modern Standard Arabic
|
SLR47 | Primewords Chinese Corpus Set 1
| Speech
| Chinese Mandarin corpus released by Shanghai Primewords Co. Ltd. (www.primewords.cn), containing 100 hours of speech data.
|
SLR48 | MADCAT Arabic data splits
| Other
| Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus
|
SLR49 | VoxCeleb Data
| Misc
| Various files for the VoxCeleb datasets
|
SLR50 | MADCAT Chinese data splits
| Other
| Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus
|
SLR51 | TED-LIUM Release 3
| Speech
| TED-LIUM corpus release 3
|
SLR52 | Large Sinhala ASR training data set
| Speech
| Sinhala ASR training data set containing ~185K utterances.
|
SLR53 | Large Bengali ASR training data set
| Speech
| Bengali ASR training data set containing ~196K utterances.
|
SLR54 | Large Nepali ASR training data set
| Speech
| Nepali ASR training data set containing ~157K utterances.
|
SLR55 | CLMAD
| Text
| A Chinese Language Model Adaptation Dataset (CLMAD).
|
SLR56 | IAM Aachen splits
| Other
| Aachen data splits (train/test/val) for the IAM dataset.
|
SLR57 | African Accented French
| Speech
| Recordings of African Accented French speech.
|
SLR58 | Pansori-TEDxKR
| Speech
| Korean speech corpus generated from Korean language TEDx talks
|
SLR59 | ParlamentParla
| Speech
| Catalan speech corpus generated from Catalan Parliamentary sessions
|
SLR60 | LibriTTS corpus
| Speech
| Large-scale corpus of English speech derived from the original materials of the LibriSpeech corpus
|
SLR61 | Crowdsourced high-quality Argentinian Spanish speech data set.
| Speech
| Data set which contains 5739 recordings of native speakers of Spanish
|
SLR62 | aidatatang_200zh
| Speech
| A Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd, containing 200 hours of speech data from 600 speakers. The transcription accuracy for each sentence is higher than 98%.
|
SLR63 | Crowdsourced high-quality Malayalam multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Malayalam.
|
SLR64 | Crowdsourced high-quality Marathi multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Marathi
|
SLR65 | Crowdsourced high-quality Tamil multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Tamil.
|
SLR66 | Crowdsourced high-quality Telugu multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Telugu.
|
SLR67 | TEDx Spanish Corpus
| Speech
| Spanish data taken from the TEDx Talks
|
SLR68 | MAGICDATA Mandarin Chinese Read Speech Corpus
| Speech
| A corpus by Magic Data Technology Co., Ltd., containing 755 hours of scripted read speech data from 1080 native speakers of the Mandarin Chinese spoken in mainland China. The sentence transcription accuracy is higher than 98%.
|
SLR69 | Crowdsourced high-quality Catalan speech data set.
| Speech
| Data set which contains recordings of Catalan.
|
SLR70 | Crowdsourced high-quality Nigerian English speech data set.
| Speech
| Data set which contains recordings of Nigerian English.
|
SLR71 | Crowdsourced high-quality Chilean Spanish speech data set.
| Speech
| Data set which contains recordings of Chilean Spanish.
|
SLR72 | Crowdsourced high-quality Colombian Spanish speech data set.
| Speech
| Data set which contains recordings of Colombian Spanish.
|
SLR73 | Crowdsourced high-quality Peruvian Spanish speech data set.
| Speech
| Data set which contains recordings of Peruvian Spanish.
|
SLR74 | Crowdsourced high-quality Puerto Rico Spanish speech data set.
| Speech
| Data set which contains recordings of Puerto Rico Spanish.
|
SLR75 | Crowdsourced high-quality Venezuelan Spanish speech data set.
| Speech
| Data set which contains recordings of Venezuelan Spanish.
|
SLR76 | Crowdsourced high-quality Basque speech data set.
| Speech
| Data set which contains recordings of Basque.
|
SLR77 | Crowdsourced high-quality Galician speech data set.
| Speech
| Data set which contains recordings of Galician.
|
SLR78 | Crowdsourced high-quality Gujarati multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Gujarati.
|
SLR79 | Crowdsourced high-quality Kannada multi-speaker speech data set.
| Speech
| Data set which contains recordings of native speakers of Kannada.
|
SLR80 | Crowdsourced high-quality Burmese speech data set.
| Speech
| Data set which contains recordings of Burmese.
|
SLR81 | Small Audio Clips
| Speech
| Contains 20 one-second audio clips from various sources, for testing compression algorithms
|
SLR82 | CN-Celeb
| Speech
| A Free Chinese Speaker Recognition Corpus Released by CSLT@Tsinghua University
|
SLR83 | Crowdsourced high-quality UK and Ireland English Dialect speech data set.
| Speech
| Data set which contains male and female recordings of English from various dialects of the UK and Ireland.
|
SLR84 | ScribbleLens
| Handwriting
| Dutch cursive handwriting from the 16th to 18th centuries, as pages and lines, for (un)supervised AI and other research.
|
SLR85 | HI-MIA
| Speech
| A far-field text-dependent speaker verification database for AISHELL Speaker Verification Challenge 2019
|
SLR86 | Crowdsourced high-quality Yoruba speech data set.
| Speech
| Data set which contains recordings of Yoruba.
|
SLR87 | MobvoiHotwords
| Speech
| Chinese hotword detection dataset, provided by Mobvoi Co., Ltd.
|
SLR88 | Att-HACK
| Speech
| French Expressive Speech Database with Social Attitudes
|
SLR89 | Yoloxóchitl-Mixtec
| Speech
| Yoloxóchitl Mixtec Speech with Transcription
|
SLR92 | Puebla-Nahuatl
| Speech
| Puebla Nahuatl Speech with Transcription
|
SLR93 | AISHELL-3
| Speech
| Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd.
|
SLR94 | Multilingual LibriSpeech (MLS)
| Speech
| A large multilingual corpus derived from LibriVox audiobooks
|
SLR95 | Thorsten Müller (German Neutral-TTS dataset)
| Speech
| Free single-speaker German dataset (> 23 hours) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training
|
SLR96 | Russian LibriSpeech (RuLS)
| Speech
| This dataset is based on LibriVox audiobooks
|
SLR97 | Deeply Korean read speech corpus
| Speech
| Korean read speech: scripts with 3 text sentiments read using 3 vocal sentiments. Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
|
SLR98 | Deeply parent-child vocal interaction dataset
| Speech
| Interactions of parent-child pairs (reading fairy tales, singing children’s songs, conversing, and others). Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
|
SLR99 | Deeply Nonverbal Vocalization Dataset
| Audio
| A human nonverbal vocal sound dataset by Deeply Inc.
|
SLR100 | Multilingual TEDx
| Speech
| A multilingual corpus of TEDx talks for speech recognition and translation
|
SLR101 | speechocean762
| Speech
| Pronunciation scoring dataset, labeled independently by five human experts
|
SLR102 | Kazakh Speech Corpus (KSC)
| Speech
| A crowdsourced open-source Kazakh speech corpus developed by ISSAI (330 hours)
|
SLR103 | Multilingual and code-switching ASR Challenge Dataset - sub-task1
| Speech
| Datasets for sub-task1 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/)
|
SLR104 | Multilingual and code-switching ASR Challenge Dataset - sub-task2
| Speech
| Datasets for sub-task2 in Multilingual and code-switching ASR challenges for low resource Indian languages - MUCS 2021 (https://navana-tech.github.io/MUCS2021/)
|
SLR105 | nicolingua-0003-west-african-radio-corpus
| Speech
| West African Radio Corpus
|
SLR106 | nicolingua-0004-west-african-va-asr-corpus
| Speech
| West African Virtual Assistant Speech Recognition Corpus
|
SLR107 | Totonac Resources
| Speech
| Totonac Speech with Transcription
|
SLR108 | MediaSpeech
| Speech
| French, Arabic, Turkish and Spanish media speech datasets
|
SLR109 | Hi-Fi Multi-Speaker English TTS Dataset (Hi-Fi TTS)
| Speech
| A multi-speaker English dataset for training text-to-speech models
|
SLR110 | Thorsten Müller (German Emotional-TTS dataset)
| Speech
| Free emotional single-speaker German dataset (Neutral, Disgusted, Angry, Amused, Surprised, Sleepy, Drunk, Whispering) by Thorsten Müller (voice) and Dominik Kreutz (audio optimization) for TTS training
|
SLR111 | AISHELL-4
| Speech
| A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Beijing Shell Shell Technology Co., Ltd.
|
SLR112 | Samrómur 21.05
| Speech
| Samrómur Icelandic Speech corpus approved for release in May 2021
|
SLR113 | SEOUL CORPUS
| Speech
| The Korean Corpus of Spontaneous Speech (aka Seoul Corpus), created under an NRF (Korea)-funded project
|
SLR114 | Golos
| Speech
| Russian ASR dataset (1240 hours) with trained acoustic and language models
|
SLR115 | EmoV_DB
| Speech
| A database of emotional speech intended to be open-sourced and used for synthesis and generation purposes. It contains data for male and female actors in English (https://github.com/numediart/EmoV-DB)
|
SLR116 | Samrómur Queries 21.12
| Speech
| Samrómur Icelandic Speech corpus focused on queries and approved for release in December 2021
|
SLR117 | Samrómur Children 21.09
| Speech
| Samrómur Icelandic Speech from children (ages 4-17 years) approved for release in September 2021
|
SLR118 | 1111 Hours Hindi ASR Challenge
| Speech
| Datasets for 1111 Hours Hindi ASR Challenge Closed, Self Supervised Closed and Open - 2022 (https://sites.google.com/view/gramvaaniasrchallenge/home)
|
SLR119 | AliMeeting
| Speech
| A Free Mandarin Multi-channel Meeting Speech Corpus, provided by Alibaba Group
|
SLR120 | HI-MIA-CW
| Speech
| A Free Mandarin Supplemental Speech Corpus to the HI-MIA Database, whose contents are negative samples for the wake-up word "Hi, Mia".
|
SLR121 | WenetSpeech
| Speech
| A 10000+ Hours Multi-Domain Mandarin Corpus for Speech Recognition
|
SLR122 | Kashmiri Data Corpus
| Speech
| An audio and text corpus for the Kashmiri language
|
SLR123 | MAGICDATA Mandarin Chinese Conversational Speech Corpus
| Speech
| A corpus by Magic Data Technology Co., Ltd., containing 180 hours of richly annotated Mandarin spontaneous conversational speech data.
|
SLR124 | TIBMD@MUC speech data set
| Speech
| A Tibetan multi-dialect speech data set (84.33 hours)
|
SLR125 | Basic LAnguage Resource Kit 1.0 for Faroese
| Speech
| Faroese Speech corpus approved for release in July 2022
|
SLR126 | IISc-MILE Kannada ASR Corpus
| Speech
| Kannada transcribed speech corpus for ASR
|
SLR127 | IISc-MILE Tamil ASR Corpus
| Speech
| Tamil transcribed speech corpus for ASR
|
SLR128 | Samrómur Unverified 22.07
| Speech
| Samrómur Icelandic Speech, 2,200 hours of mostly unverified data approved for release in July 2022
|
SLR129 | BibleTTS
| Speech
| A large, high-fidelity, multilingual, and uniquely African speech corpus
|
SLR130 | Samrómur L2 22.09
| Speech
| Samrómur Icelandic Speech, 150 hours from people with Icelandic as a second language. Approved for release in July 2022
|
SLR131 | Samrómur Mimic 22.09
| Speech
| Samrómur Icelandic Speech, 66.7 hours of speech where users mimic utterances. Approved in September 2022
|
SLR132 | Mohammed
| Speech
| Arabic speech to text Quran data
|
SLR133 | XBMU-AMDO31
| Speech
| Tibetan Amdo dialect speech data from NLIT, Northwest Minzu University
|
SLR134 | SASPEECH
| Speech
| Hebrew speech and transcripts by a single speaker (30 hours)
|
SLR135 | Libri-Mixed-Speakers
| Speech
| English audio of simultaneous speakers derived from LibriTTS
|
SLR136 | EMNS
| Speech, text-to-speech, automatic speech recognition
| An emotive single-speaker dataset for narrative storytelling. EMNS is a dataset containing transcriptions, emotion, emotion intensity, and description of acted speech.
|
SLR137 | Silbo Gomero Speech Corpus
| Speech
| Corpus of the Silbo Gomero whistled language, based on 49 minutes of recordings created by 4 whistlers.
|
SLR138 | SHALCAS22A
| Speech
| A Chinese Mandarin corpus by Shanghai Acoustics Laboratory, CAS and Wuxi Sandu Intelligent Technology Co., Ltd.
|
SLR139 | Audiocite.net
| Speech
| Spoken dataset of books read in French, initially collected from audiocite.net by the GETALP team for the LeBenchmark project.
|
SLR140 | Kazakh Speech Dataset (KSD)
| Speech
| High-quality open source Kazakh speech corpus developed by the Department of Artificial Intelligence and Big Data of Al-Farabi Kazakh National University (554 hours)
|
SLR141 | LibriTTS-R
| Speech
| Version of the LibriTTS corpus with improved sound quality; LibriTTS is a large-scale corpus of English speech designed for TTS use
|
SLR142 | The MC Speech Dataset
| Speech
| Free speech dataset consisting of 24018 short audio clips of a single speaker reading sentences in Polish
|
SLR143 | Nepali Text-to-Speech Data (Male and Female)
| Speech
| Nepali speech and corresponding text data in male and female voice
|
SLR144 | SlideSpeech
| Audio-Visual Speech
| A Large-scale English Multi-Modal Audio-Visual Corpus, provided by Alibaba Group
|
SLR145 | LibriSpeech-PC
| Text
| LibriSpeech text with Punctuation and Capitalization
|
SLR146 | CML-TTS Dataset
| Speech
| CML-TTS: A Multilingual Dataset for Speech Synthesis in Low-Resource Languages
|
SLR147 | Veracruz Orizaba Nahuatl Endangered Language
| Speech
| Audio corpus of Orizaba (Veracruz) Nahuatl speech (Glottocode: oriz1235; ISO 639-3: nlv)
|
SLR148 | Tepetzintla Zacatlan Nahuatl Endangered Language
| Speech
| Audio corpus of Zacatlán-Ahuacatlán-Tepetzintla (Puebla) Nahuatl speech (Glottocode: zaca1241; ISO 639-3: nhi)
|
SLR149 | Tibetan Greetings
| Speech
| Selected Tibetan greetings speech data categorized according to the dialectal region.
|
SLR150 | CHiME-6
| Speech
| English multi-channel far field meeting data used in the CHiME-6 Challenge. It is derived from CHiME-5 by fixing some array synchronization errors.
|
SLR151 | Kallaama
| Speech
| Wolof, Pulaar and Sereer data
|
SLR152 | Pragmatic Similarity Judgments
| Speech
| Judgments of perceived similarity between utterance pairs from dialogs, in English and Spanish.
|
SLR153 | Yerevan City Magazine
| Text
| A Free Armenian News Text Corpus, provided by Qaghaki Amsagir LLC (Yerevan City Magazine, evnmag.com)
|
SLR154 | ArmenianGrqaserAudioBooks
| Speech
| Cut, segmented, and processed (speech, text) paired data, derived from the Grqaser.org audiobooks
|
SLR155 | SBCSAE
| Speech
| The Santa Barbara Corpus of Spoken American English, mirrored from UCSB
|