Open Speech and Language Resources


Identifier: SLR132

Summary: Arabic speech to text Quran data

Category: Speech

License: MIT

Downloads (use a mirror closer to you):
Quran_Speech_Dataset.tar.xz [24G]   Mirrors: [US]   [EU]   [CN]

About this resource:

Quran Speech to Text Dataset

Each audio file is named with a three-digit surah number followed by a three-digit ayah number, both zero-padded:

     001001.mp3 (surah 1, ayah 1)
     114006.mp3 (surah 114, ayah 6)

Note that not all readers have all 6236 ayat of the Quran; some may not even have all 114 surahs.
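
The naming scheme above can be decoded with a few lines of Python. This is a minimal sketch, assuming every audio file follows the six-digit pattern shown in the examples; the function names parse_audio_name and surahs_present are hypothetical and not part of the dataset.

```python
import re
from pathlib import Path

# Assumed pattern: three digits of surah followed by three digits of ayah.
_NAME_RE = re.compile(r"^(\d{3})(\d{3})\.mp3$")


def parse_audio_name(path):
    """Return (surah, ayah) for a name such as '001001.mp3', or None if it does not match."""
    match = _NAME_RE.match(Path(path).name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def surahs_present(paths):
    """Return the sorted surah numbers that appear in an iterable of file paths."""
    parsed = (parse_audio_name(p) for p in paths)
    return sorted({pair[0] for pair in parsed if pair is not None})
```

Because a reader may be missing ayat or whole surahs, comparing surahs_present(...) against range(1, 115) is one way to see what a given reader covers.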

The Arabic text of every surah and ayah is in the all_ayat.json file.
Each JSON key has the form "surah_ayah" with no zero padding, for example "1_1" for surah 1, ayah 1, or "114_2" for surah 114, ayah 2. Each number is up to 3 digits long.
{"tafsir":{"1_1":{"text":"بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"},"1_2":{"text":"الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ"},"1_3":{"text":"الرَّحْمَٰنِ الرَّحِيمِ"},"1_4":{"text":"مَالِكِ يَوْمِ الدِّينِ"},"1_5":{"text":"إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ"},"1_6":{"text":"اهْدِنَا الصِّرَاطَ الْمُسْتَقِيمَ"}, ...}
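
A lookup by surah and ayah might be sketched as follows. It assumes the whole file has the shape of the excerpt above (a top-level "tafsir" object whose values each carry a "text" field); the helper names ayah_key, parse_ayat, load_ayat and ayah_text are hypothetical.

```python
import json


def ayah_key(surah, ayah):
    """Build the JSON key for an ayah, e.g. (1, 1) -> '1_1' (no zero padding)."""
    return f"{surah}_{ayah}"


def parse_ayat(raw_json):
    """Map each 'surah_ayah' key to its Arabic text, given the JSON document as a string."""
    data = json.loads(raw_json)
    # Assumption: all entries sit under the top-level "tafsir" key, as in the excerpt.
    return {key: entry["text"] for key, entry in data["tafsir"].items()}


def load_ayat(json_path):
    """Read all_ayat.json (or a file of the same shape) from disk."""
    with open(json_path, encoding="utf-8") as handle:
        return parse_ayat(handle.read())


def ayah_text(ayat, surah, ayah):
    """Return the text for one ayah, or None if the key is absent."""
    return ayat.get(ayah_key(surah, ayah))
```

For example, ayah_text(load_ayat("all_ayat.json"), 1, 1) would return the text stored under "1_1", provided the file is in the working directory.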

Additional notes for machine learning use:

audo_list.txt lists all mp3 files found in the audio_data directory. transcripts.tsv is a tab-separated file that can be used as input to a machine learning program. Each row has three fields: path, duration (in seconds), and Arabic text.
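
Reading transcripts.tsv could look like the sketch below. It assumes three tab-separated fields per row in the order path, duration, text, UTF-8 encoding, and no quoting; rows that do not fit (including a header row, if the file has one) are skipped. The function name read_transcripts is hypothetical.

```python
import csv


def read_transcripts(lines):
    """Yield (path, duration_seconds, text) tuples from an iterable of TSV lines."""
    # QUOTE_NONE keeps quote characters in the Arabic text from being interpreted.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            continue  # not a path/duration/text row
        path, duration, text = row
        try:
            seconds = float(duration)
        except ValueError:
            continue  # e.g. a header row whose duration field is not a number
        yield path, seconds, text


# Usage with the real file (assumed to be in the working directory):
# with open("transcripts.tsv", encoding="utf-8", newline="") as handle:
#     total_hours = sum(sec for _, sec, _ in read_transcripts(handle)) / 3600
```

Passing an open file handle rather than a path keeps the function usable on any iterable of lines, which also makes it easy to try on a small sample first.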

External URL: