openslr.org

Open Speech and Language Resources

MAGICDATA Mandarin Chinese Conversational Speech Corpus

Identifier: SLR123

Summary: A corpus by Magic Data Technology Co., Ltd., containing 180 hours of richly annotated spontaneous Mandarin conversational speech.

Category: Speech

License: Attribution-NonCommercial-NoDerivatives 4.0 International Public License (CC BY-NC-ND 4.0)

Downloads (use a mirror closer to you):
MagicData-RAMC.tar.gz [15G] ( All speech and annotations ) Mirrors: [EU] [EU] [CN]

About this resource:

The contents and the corresponding descriptions of the corpus include:

The corpus contains 180 hours of speech data, all of it recorded on mobile devices.
663 speakers from different accent areas in China were invited to participate in the recording.
All speech data are manually labeled, and the transcriptions are proofread by professional inspectors to ensure labeling quality.
Recordings were made in a quiet indoor environment.
The database is divided into a training set, a validation set, and a testing set in a ratio of 15:1:2.
Detailed information, such as speaker and topic information, is preserved in the metadata file.
The dialogue topics are diverse, ranging from science and technology to ordinary life.
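
As a rough illustration of the 15:1:2 split, the sketch below works out how many hours each set would hold if the ratio applied exactly to the 180-hour total. The names TOTAL_HOURS, SPLIT_RATIO and split_hours are hypothetical, and the assumption that the ratio is measured in hours (rather than in conversations or speakers) is ours; the actual set sizes in the released archive may differ slightly.

```python
from fractions import Fraction

# Figures taken from the description above.
TOTAL_HOURS = 180
# Assumption: the 15:1:2 ratio refers to hours of audio per set.
SPLIT_RATIO = {"train": 15, "validation": 1, "test": 2}


def split_hours(total_hours, ratio):
    """Divide total_hours proportionally to the weights in ratio."""
    parts = sum(ratio.values())
    # Fraction keeps the arithmetic exact instead of relying on floats.
    return {name: Fraction(total_hours * weight, parts)
            for name, weight in ratio.items()}


hours = split_hours(TOTAL_HOURS, SPLIT_RATIO)
for name, value in hours.items():
    print(f"{name}: about {float(value):.0f} hours")
```

Under these assumptions the split comes to about 150 hours for training, 10 for validation, and 20 for testing.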

The corpus aims to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields, and is therefore free for academic use.

Citation

You can cite the data using the following BibTeX entry:

@article{yang2022open,
  title={Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset},
  author={Yang, Zehui and Chen, Yifan and Luo, Lei and Yang, Runyan and Ye, Lingxuan and Cheng, Gaofeng and Xu, Ji and Jin, Yaohui and Zhang, Qingqing and Zhang, Pengyuan and others},
  journal={arXiv preprint arXiv:2203.16844},
  year={2022}
}

About us

Magic Data Technology Co., Ltd. (referred to as Magic Data) was established in 2016. Through our high-expertise, high-precision data services, Magic Data has quickly grown into one of the foremost companies in the artificial intelligence industry. We strive to provide efficient, high-quality, one-stop data services for customers in the fields of speech recognition, intelligent imaging, and Natural Language Understanding (NLU). Our services include data scheme design, data collection, and data annotation/transcription.

Contact