NICT-Tib1: Lhasa-Tibetan Read-Speech Corpus (v1.0, released 2024-08-27)

NICT-Tib1 is an open, CC BY 4.0-licensed audio corpus intended for developing and benchmarking automatic speech recognition (ASR) systems for Tibetan. It contains 33.5 hours of clean read speech recorded in studio conditions from 20 native speakers of the Lhasa dialect (8 male, 12 female, aged 15–30). Speakers read news manuscripts aloud. Every utterance is indexed in two Kaldi-format files (wav.scp for the audio paths, label.txt for the transcriptions), so the data can serve as both training and test material.

Package contents

Tibetan.zip (~3 GB)
- 16 kHz, 16-bit, mono .wav files (one per utterance)
- wav.scp – Kaldi mapping of utterance IDs to audio paths
- label.txt – Kaldi transcription file (UTF-8 Tibetan script)
- Per-speaker directory structure: data/<spk-id>/<session-id>/
- README (collection protocol, microphone setup, segment duration statistics)
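
As a rough illustration of how the two index files listed above might be read, here is a minimal Python sketch using only the standard library. It assumes that both wav.scp and label.txt follow the usual Kaldi table layout of one "<utt-id> <value>" entry per line, that both files sit directly under the unpacked root directory, and that the paths in wav.scp can be opened as written. None of these assumptions is confirmed by this description, so check them against the README in the package. The function names (read_kaldi_table, load_corpus, check_wav_format) are hypothetical helpers, not part of the release.

```python
import wave
from pathlib import Path


def read_kaldi_table(path):
    """Parse a Kaldi-style table with one '<utt-id> <value>' entry per line."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Split on the first whitespace run only: the value (a path or a
            # transcription) may itself contain spaces.
            parts = line.split(maxsplit=1)
            table[parts[0]] = parts[1] if len(parts) > 1 else ""
    return table


def load_corpus(root):
    """Return {utt_id: (audio_path, transcription)} for IDs present in both files.

    Assumption: wav.scp and label.txt are located directly under `root`.
    """
    root = Path(root)
    wavs = read_kaldi_table(root / "wav.scp")
    labels = read_kaldi_table(root / "label.txt")
    return {utt: (wavs[utt], labels[utt]) for utt in wavs if utt in labels}


def check_wav_format(path, rate=16000, sample_width=2, channels=1):
    """Return the duration in seconds; raise ValueError on a format mismatch.

    Defaults correspond to 16 kHz, 16-bit (2 bytes per sample), mono audio.
    """
    with wave.open(str(path), "rb") as w:
        actual = (w.getframerate(), w.getsampwidth(), w.getnchannels())
        if actual != (rate, sample_width, channels):
            raise ValueError(f"{path}: unexpected format {actual}")
        return w.getnframes() / w.getframerate()
```

Calling load_corpus on the unpacked directory and summing check_wav_format over the returned audio paths gives a quick sanity check of the total duration. If the paths in wav.scp turn out to be relative, they would need to be resolved against the appropriate base directory first.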

Licence

All audio and transcripts are distributed under the Creative Commons Attribution 4.0 International licence. You are free to use, share and adapt the material, provided that appropriate credit is given.

Citation

Please cite the following paper when using the corpus:

@inproceedings{nict-tib1,
  title     = {{NICT-Tib1: A Public Speech Corpus of Lhasa Dialect for Benchmarking Tibetan Language Speech Recognition Systems}},
  author    = {Kak Soky and Zhuo Gong and Sheng Li},
  booktitle = {Proc. O-COCOSDA},
  pages     = {1--5},
  year      = {2022},
  doi       = {10.1109/O-COCOSDA202257103.2022.9997917}
}

Questions and feedback can be sent to the corpus maintainers via the contact information on the NICT release page.