Open Speech and Language Resources

MADCAT Chinese data splits

Identifier: SLR50

Summary: Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus

Category: Other

License: Apache 2.0

Downloads (use a mirror closer to you):
[725K]   (dev set)   Mirrors: [US]   [EU]   [CN]
madcat.test.raw.lineid [734K]   (test set)   Mirrors: [US]   [EU]   [CN]
madcat.train.raw.lineid [2.8M]   (train set)   Mirrors: [US]   [EU]   [CN]

About this resource:

These are unofficial data splits for the MADCAT Chinese Pilot Training Set corpus (LDC2014T13). LDC provides only the training data for this corpus, not the original dev/eval sets, so the original training data have been split into three disjoint parts to allow performance to be evaluated in the usual way. The parts are disjoint at the document level: sentences/lines from the same document should not appear in different sets, since each document in the MADCAT data is handwritten/transcribed by a different author.
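The disjointness property can be checked mechanically. The following is only a minimal sketch, assuming each split has already been reduced to a collection of document names; the function name `shared_documents` and the document names are made up for illustration and are not part of the resource.

```python
from itertools import combinations


def shared_documents(splits):
    """Return {(split_a, split_b): shared names} for every overlapping pair.

    `splits` maps a split name to a collection of document names.
    An empty result means the splits are pairwise disjoint.
    """
    overlaps = {}
    for (name_a, docs_a), (name_b, docs_b) in combinations(splits.items(), 2):
        common = set(docs_a) & set(docs_b)
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Hypothetical document names, for illustration only.
splits = {
    "train": {"doc_a", "doc_b"},
    "dev": {"doc_c"},
    "test": {"doc_d"},
}
overlaps = shared_documents(splits)
print(overlaps)
```

With the real split files, the document names would first have to be derived from the file contents; how a "document" maps onto the names in those files is not specified here and would be an assumption.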

Also, please note that the license applies only to the splits. You still need to obtain the original database and respect its license!

Each line of a split file contains the MADCAT XML file name followed by a segment ID (s{1,2,3,4}). For example:

	GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s1
	GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s2
	GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s3
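
Lines in this format can be read with a few lines of Python. The following is a minimal sketch, assuming the two fields on each line are separated by whitespace; the function name `parse_lineid` is hypothetical and not part of the resource.

```python
from collections import defaultdict


def parse_lineid(lines):
    """Group segment IDs by MADCAT XML file name.

    Assumes each non-empty line is "<xml name> <segment id>",
    separated by whitespace.
    """
    segments = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        xml_name, segment_id = line.split()
        segments[xml_name].append(segment_id)
    return dict(segments)


# The three example lines shown above.
sample = [
    "GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s1",
    "GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s2",
    "GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s3",
]
parsed = parse_lineid(sample)
print(parsed)
```

To read a downloaded split, an open file object can be passed directly, for example `parse_lineid(open("madcat.train.raw.lineid", encoding="utf-8"))`; the UTF-8 encoding is an assumption.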