Open Speech and Language Resources


Identifier: SLR24

Summary: Iban language text and speech corpora for ASR

Category: Speech

License: Attribution-ShareAlike 2.0 Generic (CC BY-SA 2.0)

Downloads (use a mirror closer to you):
iban.tar.gz [913M]   ( Iban language corpora )   Mirrors: [US]   [EU]   [CN]  

About this resource:


This package contains Iban language text and speech suitable for Automatic Speech Recognition (ASR) experiments. In addition to transcribed speech, a 2M-token text corpus crawled from an online newspaper site is provided. The news data was provided by a local radio station in Sarawak, Malaysia.


Details on the corpora and our experiments on Iban ASR can be found in the following publications. We would appreciate it if you cite them in any work you publish.
	Author = {Sarah Samson Juan and Laurent Besacier and Solange Rossato},
	Booktitle = {Proceedings of Workshop for Spoken Language Technology for Under-resourced (SLTU)},
	Month = {May},
	Title = {Semi-supervised G2P bootstrapping and its application to ASR for a very under-resourced language: Iban},
	Year = {2014}}

  	Title = {Using Resources from a closely-Related language to develop ASR for a very under-resourced Language: A case study for Iban},
  	Author = {Sarah Samson Juan and Laurent Besacier and Benjamin Lecouteux and Mohamed Dyab},
  	Booktitle = {Proceedings of INTERSPEECH},
  	Year = {2015},
  	Address = {Dresden, Germany},
  	Month = {September}}

Original source of the corpus

This OpenSLR release was created from data originally provided by Sarah Juan, but the format was changed to better fit Kaldi practices. Some of the files were removed, as they are now generated automatically by the Kaldi Iban recipe.

The original source of the corpus is
See the README there for more details; most of it still applies.


Iban data collected and prepared by Sarah Samson Juan and Laurent Besacier. Created at GETALP, Grenoble, France.

We would like to thank the Ministry of Higher Education Malaysia for providing financial support to conduct this study. We also thank The Borneo Post news agency for providing online materials for building the text corpus, and Radio Televisyen Malaysia (RTM), Sarawak, Malaysia, for providing the news data.