Open Speech and Language Resources


Identifier: SLR40

Summary: Korean Open-source Speech Corpus for Speech Recognition by Zeroth Project

Category: Speech Corpus for Automatic Speech Recognition

License: Attribution 4.0 International (CC BY 4.0)

Downloads (use a mirror closer to you):
zeroth_korean.tar.gz [10G]   (Korean speech data, transcription, lexicon and language model)   Mirrors: [US]   [EU]   [CN]

About this resource:

This is the Zeroth-Korean corpus,
licensed under Attribution 4.0 International (CC BY 4.0).

The data set contains transcribed audio data for Korean. The training data comprises 51.6 hours of transcribed Korean audio (22,263 utterances, 105 speakers, 3,000 sentences), and the test data comprises 1.2 hours of transcribed Korean audio (457 utterances, 10 speakers). The corpus also contains a pre-trained/designed language model, a lexicon and a morpheme-based segmenter (Morfessor).
The Zeroth project introduces a free Korean speech corpus and aims to make Korean speech recognition more broadly accessible to everyone.
This project was developed in collaboration between Lucas Jo (@Atlas Guide Inc.) and Wonkyum Lee (@Gridspace Inc.).
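
The split sizes quoted above can be turned into rough per-utterance and per-speaker figures, which may help when budgeting training time or choosing a maximum utterance length. The following is a minimal sketch: the numbers are copied from the description, while the names SPLITS and mean_utterance_seconds are illustrative and not part of the corpus or its tooling.

```python
# Split statistics copied from the corpus description.
# SPLITS and mean_utterance_seconds are hypothetical names used for illustration.
SPLITS = {
    "train": {"hours": 51.6, "utterances": 22263, "speakers": 105},
    "test": {"hours": 1.2, "utterances": 457, "speakers": 10},
}


def mean_utterance_seconds(hours: float, utterances: int) -> float:
    """Average utterance length in seconds from total hours and utterance count."""
    return hours * 3600.0 / utterances


for name, split in SPLITS.items():
    seconds = mean_utterance_seconds(split["hours"], split["utterances"])
    per_speaker = split["utterances"] / split["speakers"]
    print(f"{name}: {seconds:.1f} s per utterance, "
          f"{per_speaker:.0f} utterances per speaker on average")
```

These are averages derived only from the published totals; the actual distribution of utterance lengths would have to be measured from the audio files themselves.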

Contact: Lucas Jo, Wonkyum Lee

External URL:   Korean Speech data, transcription and language model