Open Speech and Language Resources


Identifier: SLR89

Summary: Yolóxochitl Mixtec Speech with Transcription

Category: Speech

License: Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)

Downloads (use a mirror closer to you):
Yoloxochitl-Mixtec-Data.tgz [86G]   (Yolóxochitl Mixtec Speech and Transcription )   Mirrors: [US]   [EU]   [CN]  
Yoloxochitl-Mixtec-Manifest.tgz [51K]   (Train-Dev-Test Split and Channel Information for Multi-channel Wave )   Mirrors: [US]   [EU]   [CN]  
[1.3G]   (Data for Novice Transcription Correction (Mixtec speech data with novice transcription and expert correction) )   Mirrors: [US]   [EU]   [CN]  
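
Once an archive has been downloaded from one of the mirrors, it can be unpacked with any standard tar tool. The following is a minimal Python sketch, not part of the official resource: the function name `extract_archive` and the destination directory are illustrative assumptions, and only the archive file names come from the list above.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the list of member names found in the archive.
    `extract_archive` is a hypothetical helper, not an OpenSLR tool.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        members = tar.getmembers()
        # Refuse members that would be written outside dest_dir.
        for member in members:
            target = (dest / member.name).resolve()
            if not target.is_relative_to(dest):
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(path=dest, members=members)
    return [m.name for m in members]


if __name__ == "__main__":
    # Assumes the small manifest archive has been downloaded
    # into the current working directory.
    manifest = Path("Yoloxochitl-Mixtec-Manifest.tgz")
    if manifest.exists():
        for name in extract_archive(manifest, "yoloxochitl_mixtec"):
            print(name)
```

Starting with the 51K manifest archive is a cheap way to check the setup before unpacking the 86G data archive, which needs considerably more free disk space than its compressed size.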

About this resource:

Substantive material of Yoloxóchitl Mixtec (Glottocode: yolo1241 | ISO 639-3 = xty) presented here was brought together over a period of just over 10 years by Jonathan D. Amith (PI) and Rey Castillo García, a native-speaker linguist from the community of Yoloxóchitl. Yoloxóchitl Mixtec is spoken in four communities: Yoloxóchitl (16.83425, -98.65281), Arroyo Cumiapa (16.87529, -98.61537), Cuanacaxtitlán (16.79992, -98.63940), and Buena Vista (16.96232, -98.58078). It is one of 52 Mixtec languages as designated by Glottocode and is particularly noteworthy for its high number of lexical tones, with the mora as the tone-bearing unit (9 basic tones: 4 level tones from low [1] to high [4], and 5 basic contour tones [13, 14, 43, 42, 32]). There are other, less common lexical tones, and inflectional morphology creates additional patterns. The greatest number of contrasts documented is 21 for the segmental sequence nama (which has a total of 24 words that vary only in tone).
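
The basic tone inventory described above can be written down as a small lookup. This is an illustrative sketch only: it assumes tones are given as digit strings in the bracketed notation used here (1 = low, 4 = high), and the names `LEVEL_TONES`, `BASIC_CONTOUR_TONES`, and `classify_tone` are hypothetical rather than part of the corpus tooling. The rising/falling label is simply derived from comparing the two digits.

```python
# Basic tone inventory as listed in the description (1 = low, 4 = high).
LEVEL_TONES = {"1", "2", "3", "4"}
BASIC_CONTOUR_TONES = {"13", "14", "43", "42", "32"}


def classify_tone(label):
    """Classify a tone label as level, rising contour, falling contour, or other.

    Labels outside the 9 basic tones (the less common lexical tones and
    morphologically derived patterns) are reported as "other".
    """
    if label in LEVEL_TONES:
        return "level"
    if label in BASIC_CONTOUR_TONES:
        start, end = int(label[0]), int(label[1])
        return "rising contour" if start < end else "falling contour"
    return "other"


if __name__ == "__main__":
    for tone in ["1", "4", "13", "42", "31"]:
        print(tone, classify_tone(tone))
```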

Production of the corpus was supported by the National Science Foundation, Documenting Endangered Languages program, and the Endangered Languages Documentation Programme (ELDP) at the School of Oriental and African Studies:

NSF Award 0966462, “Corpus and lexicon development: Endangered genres of discourse and domains of cultural knowledge in Tu’un ísaví (Mixtec) of Yoloxóchitl, Guerrero”; NSF Award 1500738, “Collaborative Research: Speech technology-enhanced annotation and training tool for Yoloxóchitl Mixtec (xty)”; NSF Award 1761421, “A corpus-based, comparative, and multi-media lexicosemantic resource for Yoloxóchitl Mixtec (xty)”.

ELDP Pilot project PPG0048, “Corpus and lexicon development: Endangered genres of discourse in Tu’un ísaví (Mixtec) of Yoloxóchitl, Guerrero”; ELDP Major Documentation Project MDP0201, “Corpus and lexicon development: Endangered genres of discourse and domains of cultural knowledge in Tu’un ísaví (Mixtec) of Yoloxóchitl, Guerrero”.

All material is made available under the Creative Commons license CC BY-SA (Attribution-ShareAlike). Please cite or use any material as follows (Corresponding author is

Amith, Jonathan D., and Rey Castillo García. n.d. Audio corpus of Yoloxóchitl Mixtec with accompanying time-coded transcriptions in ELAN.

For ASR corpus and corresponding baseline results, please cite (Corresponding author is and

  title={Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yol{\'o}xochitl Mixtec},
  author={Shi, Jiatong and Amith, Jonathan D and Garc{\'\i}a, Rey Castillo and Sierra, Esteban Guadalupe and Duh, Kevin and Watanabe, Shinji},
  booktitle={Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume},