Open Speech and Language Resources

Veracruz Orizaba Nahuatl Endangered Language

Identifier: SLR147

Summary: Audio corpus of Orizaba (Veracruz) Nahuatl speech (Glottocode: oriz1235; ISO 639-3: nlv)

Category: Speech

License: Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)

Downloads (use a mirror closer to you):
info.pdf [61K]   (Document with information about this corpus)   Mirrors: [US]   [EU]   [CN]
[39G]   (Speech data of Veracruz Orizaba Nahuatl, recorded in 48kHz, 16-bit)   Mirrors: [US]   [EU]   [CN]
Veracruz-Orizaba-Nahuat_Collaborators.txt [5.4K]   (List of all native speaker collaborators for this corpus)   Mirrors: [US]   [EU]   [CN]
Veracruz-Orizaba-Nahuatl_File-list.txt [69K]   (List of all filenames with duration)   Mirrors: [US]   [EU]   [CN]
Plant-observations_Veracruz.csv [6.1K]   (List of all plant observations with observation number, family, scientific name, date collected, name of person who identified the plant)   Mirrors: [US]   [EU]   [CN]
Plant-Labels_Tequila-Orizaba-ethnobotanical-field-trip_2023-10-22.pdf [307K]   (Labels for the 81 plant observations the audio of which is included in this corpus)   Mirrors: [US]   [EU]   [CN]

About this resource:

The substantive material of this deposit was gathered over a 13-month period from February 2022 to March 2023. It comprises 657 files totaling approximately 119 hours, 26 minutes, 59 seconds of material. All but 81 files (i.e., 576 files) were recorded by Jonathan D. Amith (Project director) or Amelia Domínguez Alcántara and Ceferino Salgado Castañeda using a Sound Devices 722 digital recorder and Countryman e6 omnidirectional microphones. Most of these recordings are two-channel conversations, with each speaker on a separate channel, although a few (e.g., stories) are single-speaker and single-channel recordings. The remaining 81 recordings, totaling 14 hours, 31 minutes, 59 seconds, are all coded Teotz_BotFl or Ixpal_BotFl. These were made by Mariano Gorostiza Salazar and Miriam Jiménez Chimil during a short trip (7 March to 16 March 2023) to photograph plants that have names in the Nahuatl spoken in the region of Tequila and Orizaba, Veracruz. The (ethno)botanical labels for these 81 plant observations are included as reference in this OpenSLR resource (see the PDF file named Plant-Labels_Tequila-Orizaba-ethnobotanical-field-trip_2023-10-22.pdf). This file will be updated as plants continue to be identified with their scientific names from the field photos.
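The file counts and durations quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch in Python, not part of the deposit. The helper names `to_seconds` and `to_hms` are hypothetical, and because the overall total is stated as approximate, the derived duration for the 576 non-botanical recordings should be read as approximate too.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an hours/minutes/seconds duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds):
    """Convert whole seconds back to an (hours, minutes, seconds) tuple."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return hours, minutes, seconds


# Figures as stated in the corpus description.
TOTAL_FILES = 657
BOTANICAL_FILES = 81  # recordings coded Teotz_BotFl or Ixpal_BotFl
total_duration = to_seconds(119, 26, 59)      # whole corpus (approximate)
botanical_duration = to_seconds(14, 31, 59)   # botanical field-trip recordings

# Derived values (not stated in the source except for the file count).
conversational_files = TOTAL_FILES - BOTANICAL_FILES
conversational_hms = to_hms(total_duration - botanical_duration)

print(conversational_files)   # number of non-botanical recordings
print(conversational_hms)     # their approximate combined duration
```

The file count agrees with the 576 given in the description. The combined duration of those recordings is a derived figure, roughly 104 hours 55 minutes, and is only as precise as the approximate total it is computed from.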

Fieldwork was coordinated locally by Gabriela Citlahua Zepahua, who also participated as a speaker in some of the recordings. Citlahua Zepahua was responsible for contacting the native speakers who generously participated in this research.

Please note that this initial OpenSLR deposit focuses on the audio corpus. Five future enhancements to this corpus are currently envisioned: (1) Completed metadata, particularly a description of the content of each recording; (2) 10 hours of transcription by hand in ELAN, material that will provide the initial basis for transfer ASR; (3) A final deposit of the results of ASR transcriptions; (4) Corrections to the ASR transcriptions by Amith and native speakers of Orizaba Nahuatl; (5) Reference to the ASR end2end recipe (GitHub) used to generate the ASR transcriptions.

The fieldwork for developing this corpus was supported by NSF Dynamic Language Infrastructure grant #2123578 entitled “Collaborative Research: Improving Techniques of Automatic Speech Recognition and Transfer Learning using Documentary Linguistic Corpora” (Jonathan D. Amith, PI). The speech processing facet of this research (Award #2123624) will be carried out by Shinji Watanabe (PI) and his team at Carnegie Mellon University.

All material is made available under the Creative Commons license CC BY-SA (Attribution-ShareAlike). Please cite any material used as follows (the corresponding author is Jonathan D. Amith):

Amith, Jonathan D., Amelia Domínguez Alcántara, Ceferino Salgado Castañeda, Gabriela Citlahua Zepahua, Mariano Gorostiza Salazar, and Miriam Jiménez Chimil, 2022–23, Audio corpus of Orizaba (Veracruz) Nahuatl speech (Glottocode: oriz1235; ISO 639-3: nlv). Accessed [date] at