NICT-Tib1 is an open, CC BY 4.0-licensed audio corpus intended for developing
and benchmarking automatic speech recognition (ASR) systems for Tibetan.
It contains 33.5 hours of clean read speech recorded in studio conditions
from 20 native speakers of the Lhasa dialect (8 male, 12 female, aged 15–30).
Speakers read news manuscripts aloud; each utterance is provided with a
Kaldi-format transcription (wav.scp, label.txt) so the data
can serve both as training and test material.
Tibetan.zip (~3 GB)
.wav files (one per utterance)
wav.scp: Kaldi mapping of utterance IDs to audio paths
label.txt: Kaldi transcription file (UTF-8 Tibetan script)
data/<spk-id>/<session-id>/
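As a rough illustration of how wav.scp and label.txt might be used together, the sketch below reads both files and pairs each audio path with its transcript. It assumes the usual Kaldi table layout, in which each line holds an utterance ID followed by whitespace and then the value; the exact layout of label.txt is an assumption and should be checked against the unpacked archive. The function names read_kaldi_table and pair_audio_with_text are hypothetical helpers, not part of the corpus or of Kaldi.

```python
def read_kaldi_table(path, encoding="utf-8"):
    """Read a Kaldi-style table: one '<key> <value>' entry per line.

    Assumption: the key is the utterance ID and everything after the
    first run of whitespace is the value (an audio path or a transcript).
    """
    table = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            parts = line.strip().split(maxsplit=1)
            if not parts:
                continue  # skip blank lines
            key = parts[0]
            value = parts[1] if len(parts) > 1 else ""
            table[key] = value
    return table


def pair_audio_with_text(wav_scp_path, label_path):
    """Return sorted (utt_id, audio_path, transcript) triples.

    Only utterance IDs present in both files are kept.
    """
    wavs = read_kaldi_table(wav_scp_path)
    labels = read_kaldi_table(label_path)
    shared_ids = sorted(wavs.keys() & labels.keys())
    return [(utt, wavs[utt], labels[utt]) for utt in shared_ids]
```

Calling pair_audio_with_text("wav.scp", "label.txt") from the directory holding the two files would then give a list that can be fed to a feature extractor or split by speaker. Keeping only IDs found in both files is a design choice that silently drops unmatched entries; comparing the lengths of the two tables first is a cheap way to detect such mismatches.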
All audio and transcripts are distributed under the Creative Commons Attribution 4.0 International licence. You are free to use, share and adapt the material, provided that appropriate credit is given.
Please cite the following paper when using the corpus:
@inproceedings{nict-tib1,
  title     = {{NICT-Tib1: A Public Speech Corpus of Lhasa Dialect for Benchmarking Tibetan Language Speech Recognition Systems}},
  author    = {Kak Soky and Zhuo Gong and Sheng Li},
  booktitle = {Proc. O-COCOSDA},
  pages     = {1--5},
  year      = {2022},
  doi       = {10.1109/O-COCOSDA202257103.2022.9997917}
}
Questions and feedback can be sent to the corpus maintainers via the contact information on the NICT release page.