Identifier: SLR154

Summary: Cut, segmented, and processed (speech, text) paired data derived from audiobooks

Category: Speech

License: Creative Commons Attribution Share Alike 4.0 (CC BY-SA 4.0)

About this resource:

This dataset is part of our effort to increase the amount of data available for low-resource languages such as Armenian and Georgian.
It consists of processed audiobooks. In the raw corpus, each chapter had a single large transcript and one audio recording tens of minutes long.
To make the data ASR/TTS friendly, we converted the raw corpus into many short audio chunks (typically 3-15 seconds), each with its corresponding text.
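
The stated chunk length can be checked on a local copy. The following is a minimal sketch, not part of the original processing pipeline: it assumes the audio files are uncompressed PCM WAV (which Python's standard wave module can read), and the function names are hypothetical.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def chunks_outside_range(audio_dir, min_s=3.0, max_s=15.0):
    """List (file name, duration) for chunks outside [min_s, max_s].

    The 3-15 second defaults mirror the "typical" range stated above;
    chunks outside it are not necessarily errors.
    """
    outliers = []
    # rglob is used because the exact layout inside the directory is an assumption.
    for wav_path in sorted(Path(audio_dir).rglob("*.wav")):
        duration = wav_duration_seconds(wav_path)
        if not (min_s <= duration <= max_s):
            outliers.append((wav_path.name, duration))
    return outliers
```

Calling chunks_outside_range("audios") on the extracted archive would list the chunks that fall outside the typical range.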

We coordinated with the original authors, who agreed on the selection of new books we processed.
To make reconstruction of the books (usually different speakers per book) harder, we encoded the audio file names
and hid the book, chapter, and author information. This is done to prevent voice-cloning attempts with TTS setups (the
majority of the data was collected on a voluntary basis, and cloning the voices of those people is forbidden).

The .tgz file contains the following directories:

  • texts/ - Contains text transcripts in .txt format
  • audios/ - Contains audio files in .wav format
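
A minimal sketch of loading (speech, text) pairs from the extracted archive follows. It rests on assumptions not confirmed above: that texts/ and audios/ sit directly under the extraction root, that each transcript shares its file stem with its audio file, and that transcripts are UTF-8 encoded. The function name pair_utterances is hypothetical.

```python
from pathlib import Path


def pair_utterances(root):
    """Pair each audio file with the transcript that shares its file stem.

    Assumptions: matching stems between texts/*.txt and audios/*.wav,
    and UTF-8 encoded transcripts.
    """
    root = Path(root)
    texts = {p.stem: p for p in (root / "texts").glob("*.txt")}
    pairs = []
    for wav_path in sorted((root / "audios").glob("*.wav")):
        txt_path = texts.get(wav_path.stem)
        if txt_path is None:
            continue  # no transcript with a matching stem; skip this audio file
        transcript = txt_path.read_text(encoding="utf-8").strip()
        pairs.append((wav_path, transcript))
    return pairs
```

If the returned list is empty or much shorter than the number of audio files, the stem-matching assumption probably does not hold for this archive.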

About the original source (Grqaser):
    "Grqaser" is a non-governmental organization dedicated to promoting Armenian language preservation globally through
    the creation of a comprehensive library of Armenian audiobooks. Established in 2015, "Grqaser" aims to facilitate
    access to Armenian literature for diaspora communities and individuals with visual impairments. Their initiative
    provides a valuable resource for listeners to engage with Armenian culture and language through accessible audio
    formats, supporting educational and cultural enrichment worldwide.

Author(s) of the corpus:

Ara Yeroyan, ar23yeroyan

