Open Speech and Language Resources



Kazakh Speech Dataset (KSD)

Identifier: SLR140

Summary: High-quality open source Kazakh speech corpus developed by the Department of Artificial Intelligence and Big Data of Al-Farabi Kazakh National University (554 hours)

Category: Speech

License: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0 US)

Downloads (use a mirror closer to you):
audio2.zip [12G]   ( Kazakh speech and transcripts (127.13 hours) )   Mirrors: [US]   [EU]   [CN]  
audio3.zip [13G]   ( Kazakh speech and transcripts (142 hours) )   Mirrors: [US]   [EU]   [CN]  
audio4.zip [15G]   ( Kazakh speech and transcripts (147 hours) )   Mirrors: [US]   [EU]   [CN]  
audio5.zip [16G]   ( Kazakh speech and transcripts (137.98 hours) )   Mirrors: [US]   [EU]   [CN]  
checksum.md5 [176 bytes]   ( md5sums of the files )   Mirrors: [US]   [EU]   [CN]  
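
Since the archives are large (12G to 16G each), it may be worth verifying them against checksum.md5 after download. The following is a minimal sketch, not an official tool. It assumes checksum.md5 uses the common md5sum layout of one "<md5 hex>  <filename>" pair per line, which this page does not confirm; the function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_file(path):
    """Parse lines of the ASSUMED form '<md5 hex>  <filename>' into a dict."""
    expected = {}
    for line in Path(path).read_text().splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        # md5sum marks binary mode with a leading '*' on the file name
        expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify_downloads(checksum_path, directory="."):
    """Map each listed file to True (match), False (mismatch) or None (absent)."""
    results = {}
    for name, digest in parse_checksum_file(checksum_path).items():
        target = Path(directory) / name
        results[name] = md5_of_file(target) == digest if target.exists() else None
    return results
```

Calling verify_downloads("checksum.md5") from the download directory returns one entry per listed archive; any False value suggests the corresponding file should be downloaded again.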

About this resource:

High-quality open source Kazakh speech corpus. The corpus contains about 554 hours of transcribed audio recordings comprising 204,250 utterances from male and female speakers of different regions and age groups. All audio files were recorded using mobile devices (iOS and Android). The corpus was selectively checked by native speakers of Kazakh to ensure high quality. The dataset is primarily intended for training automatic speech recognition systems.

Technical characteristics of audio files: .wav format, 16 kB, 22 and 44 kHz.

The founders of the corpus: Nurgali Kadyrbek (https://orcid.org/0000-0002-5461-8899), Madina Mansurova (https://orcid.org/0000-0002-9680-2758)

To cite the dataset, please use the following BibTeX entry:

@inproceedings{mansurova-kadyrbek-2023-kazakh-speech-dataset,
    title = "The Development of a Kazakh Speech Recognition Model Using a Convolutional Neural Network with Fixed Character Level Filters",
    author = "Madina Mansurova and Nurgali Kadyrbek",
    booktitle = "Proceedings of the Big Data and Cognitive Computing",
    month = "July 20",
    year = "2023",
    pages = "5--9",
    url = "https://doi.org/10.3390/bdcc7030132"
}
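
Because the recordings come at two sampling rates (22 and 44 kHz), a training pipeline will probably need to check each file's rate before resampling. The sketch below reads the header of one file with Python's standard wave module. It assumes the files are uncompressed PCM WAV, which this page does not state explicitly; the function name and returned keys are illustrative.

```python
import wave


def describe_wav(path):
    """Return basic header fields of a PCM WAV file (assumed uncompressed)."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            # duration follows from frame count divided by frames per second
            "duration_s": frames / rate if rate else 0.0,
        }
```

Running describe_wav over every extracted file and grouping by sample_rate_hz gives a quick picture of how the two rates are distributed before choosing a common target rate.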