Open Speech and Language Resources

MAGICDATA Mandarin Chinese Conversational Speech Corpus

Identifier: SLR123

Summary: A corpus from Magic Data Technology Co., Ltd. containing 180 hours of richly annotated, spontaneous Mandarin conversational speech.

Category: Speech

License: Attribution-NonCommercial-NoDerivatives 4.0 International Public License (CC BY-NC-ND 4.0)

Downloads (use a mirror closer to you):
MagicData-RAMC.tar.gz [15G]   ( All speech and annotations )   Mirrors: [US]   [EU]   [CN]  
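
Once the archive has been downloaded from one of the mirrors, it can be inspected before the full 15G is unpacked. The following is a minimal sketch using Python's standard `tarfile` module. It assumes the file was saved as `MagicData-RAMC.tar.gz` in the current directory, and `list_members` is a hypothetical helper name. The internal directory layout of the archive is not described on this page, so the sketch only prints whatever member names it finds.

```python
import tarfile
from pathlib import Path


def list_members(archive_path, limit=10):
    """Return up to `limit` member names from a gzip-compressed tar archive."""
    names = []
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Iterating reads members lazily, so a large archive is not fully scanned.
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


if __name__ == "__main__":
    # Assumed local download location; adjust to wherever the file was saved.
    archive = Path("MagicData-RAMC.tar.gz")
    if archive.exists():
        for name in list_members(archive):
            print(name)
```

Reading only the first few members keeps the check fast, since a gzip-compressed tar has to be read sequentially.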

About this resource:

The corpus has the following characteristics:

  • The corpus contains 180 hours of speech data, all of it recorded on mobile devices.
  • 663 speakers from different accent areas in China were invited to participate in the recording.
  • All speech data are manually labeled, and the transcriptions are proofread by professional inspectors to ensure labeling quality.
  • Recordings were made in a quiet indoor environment.
  • The database is divided into training, validation, and test sets in a ratio of 15:1:2.
  • Detailed information, such as speaker and topic information, is preserved in the metadata file.
  • The dialogue topics are diverse, ranging from science and technology to ordinary life.
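
As a quick worked example of the 15:1:2 split, the sketch below computes the size of each subset. It assumes the ratio is applied to the 180 hours of audio rather than to the number of conversations or speakers, which this page does not state. The names `TOTAL_HOURS`, `SPLIT_RATIO`, and `split_hours` are illustrative only.

```python
from fractions import Fraction

TOTAL_HOURS = 180  # total corpus size stated above
SPLIT_RATIO = {"train": 15, "validation": 1, "test": 2}  # 15:1:2


def split_hours(total_hours, ratio):
    """Divide total_hours among the named subsets in proportion to ratio."""
    parts = sum(ratio.values())  # 15 + 1 + 2 = 18
    return {name: Fraction(total_hours * share, parts) for name, share in ratio.items()}


hours = split_hours(TOTAL_HOURS, SPLIT_RATIO)
for name, value in hours.items():
    print(f"{name}: {float(value):.0f} h")
# prints:
# train: 150 h
# validation: 10 h
# test: 20 h
```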

The corpus aims to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields. Therefore, the corpus is totally free for academic use.

Citation

You can cite the data using the following BibTeX entry:
  title={Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset},
  author={Yang, Zehui and Chen, Yifan and Luo, Lei and Yang, Runyan and Ye, Lingxuan and Cheng, Gaofeng and Xu, Ji and Jin, Yaohui and Zhang, Qingqing and Zhang, Pengyuan and others},
  journal={arXiv preprint arXiv:2203.16844},

About us

Magic Data Technology Co., Ltd. (referred to as Magic Data) was established in 2016. Through our high-expertise, high-precision data services, Magic Data has quickly grown into one of the foremost companies in the artificial intelligence industry. We strive to provide the most efficient and highest-quality one-stop data services for customers in the fields of speech recognition, intelligent imaging, and Natural Language Understanding (NLU). Our services include data scheme design, data collection, and data annotation/transcription.