Customized Europarl Corpus

The customized Europarl corpus was extracted from the Europarl corpus to support research on corpus-based translation. It contains 3,152,650 sentences in 21 European languages from 7 language (sub-)families. The corpus is encoded in TEI P5. For more details on this corpus, please refer to:

Islam, Md. Zahurul
Multilingual Text Classification using Information-Theoretic Features
PhD Thesis; Goethe University Frankfurt; 2014

If you use this corpus, please cite the thesis above together with the following paper:

Philipp Koehn
Europarl: A Parallel Corpus for Statistical Machine Translation
MT Summit 2005

Copyright: We are not aware of any copyright restrictions on this resource. If you notice any problems, please let us know.

Acknowledgements: This work is supported by the LOEWE Digital-Humanities Project at the Goethe University Frankfurt.

Download as 7z file (92.4 MB)
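
Because the corpus is distributed as TEI P5, the sentences can be read with any namespace-aware XML parser once the 7z archive has been unpacked with a 7z-capable tool. The following is a minimal Python sketch, not the authors' tooling. It assumes that sentences are marked up with the TEI `<s>` element; this page does not confirm the actual element layout, so check the files and adjust the path expression if needed. The function name `extract_sentences` and the inline `SAMPLE` document are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# Standard namespace of TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_sentences(xml_text):
    """Return the text of every TEI <s> element in the given XML string.

    Assumption: the corpus marks sentences with <s>; this is not
    confirmed by the corpus description.
    """
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        # Join the text of nested elements and normalize whitespace.
        text = " ".join("".join(s.itertext()).split())
        if text:
            sentences.append(text)
    return sentences


# Hypothetical miniature TEI document, used only for illustration.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>Resumption of the session.</s>
    <s>Please rise for a minute of <hi>silence</hi>.</s>
  </p></body></text>
</TEI>"""

if __name__ == "__main__":
    for sentence in extract_sentences(SAMPLE):
        print(sentence)
```

For the real corpus files, which are likely much larger than this sample, reading the file contents into a string may be too memory-hungry; `xml.etree.ElementTree.iterparse` is a streaming alternative from the same standard-library module.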