Open Speech and Language Resources

Basic LAnguage Resource Kit 1.0 for Faroese

Identifier: SLR125

Summary: Faroese speech corpus approved for release in July 2022

Category: Speech

License: CC BY 4.0

Downloads (use a mirror closer to you):
BLARK_1.0_update.tar.gz [28G] (Faroese speech and transcripts, lexicon and background text) Mirrors: [US] [EU] [CN]

About this resource:

This resource is a Basic Language Resource Kit (BLARK 1.0) for Faroese. It contains 100 hours of transcribed Faroese speech from over 400 speakers.

The BLARK 1.0 for Faroese was made by the Project Group Ravnur under the Talutøkni Foundation. The project began gathering and creating language resources for Faroese in January 2019 and is set to end with the release of BLARK 1.0 in July 2022. The aim was to create open-source resources for Faroese language technology; the project group's main goal was to obtain resources that can be used for Faroese automatic speech recognition (ASR).

The audio was collected by recording speakers reading texts aloud. The 433 speakers are aged between 18 and 83 and are divided across the six main dialect areas. The recordings were made on TASCAM DR-40 Linear PCM audio recorders using the built-in stereo microphones, in 16-bit WAVE format with a sample rate of 48 kHz. All recordings have been transcribed. Alongside the recordings and transcriptions, there is a pronunciation dictionary complete with PoS tags, a text collection of 25 million words, and an acoustic model and a language model for ASR.
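
To check that a given recording has the format described above (stereo, 16-bit, 48 kHz), a minimal Python sketch using only the standard library could look like the following. It assumes the distributed files keep the original recording format, which this page does not confirm. The function names are hypothetical and not part of the corpus tooling.

```python
import wave


def describe_wav(path):
    """Return (channels, bits_per_sample, sample_rate_hz, duration_s) for a PCM WAVE file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    return channels, bits, rate, duration


def matches_recording_format(path):
    """Check a file against the stated recording format: stereo, 16-bit, 48 kHz.

    Assumption: the released audio was not downmixed or resampled.
    """
    channels, bits, rate, _ = describe_wav(path)
    return (channels, bits, rate) == (2, 16, 48000)
```

Calling `matches_recording_format` on a file from the extracted archive returns `False` if the file has been converted, for example to mono or to a lower sample rate for ASR training.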

You can cite the data using the following BibTeX entry:

@inproceedings{Simonsen_et_al_2022,
  title={{Creating a Basic Language Resource Kit for Faroese}},
  author={Simonsen, Annika and Lamhauge, Sandra Saxov and Debess, Iben Nyholm and Henrichsen, Peter Juel},
  booktitle={Proceedings of the 13th International Conference on Language Resources and Evaluation (LREC 2022)},
  year={2022}
}

External URL: