Customized Europarl Corpus

The customized Europarl corpus was extracted from the Europarl corpus to support research in corpus-based translation studies. It contains 3,152,650 sentences in 21 European languages from 7 language (sub-)families. The corpus is encoded in TEI P5. For more details on this corpus, please refer to [1].

Reference
Islam, Md. Zahurul
Multilingual Text Classification using Information-Theoretic Features
PhD Thesis; Goethe University Frankfurt; 2014

If you use this corpus, please cite the paper above together with the following paper:

Reference
Philipp Koehn
Europarl: A Parallel Corpus for Statistical Machine Translation
MT Summit 2005

Copyright: We are not aware of any copyright restrictions on this resource. If you notice any problems, please let us know.

Acknowledgements: The work is supported by the LOEWE Digital-Humanities Project at the Goethe University Frankfurt.

Download as 7z file (92.4 MB)
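
Because the corpus is distributed in TEI P5, its files can be read with any namespace-aware XML parser. The following is a minimal sketch, not the authors' tooling: it assumes that sentences are marked up with the standard TEI `<s>` element, which this page does not confirm, and the function name `extract_sentences` and the sample document are hypothetical illustrations rather than material from the corpus.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents use this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_sentences(tei_xml: str) -> list[str]:
    """Return the text of every <s> element in a TEI P5 document string.

    Assumption: sentences are encoded as <s> elements; the actual
    layout of the corpus files may differ.
    """
    root = ET.fromstring(tei_xml)
    sentences = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        # itertext() also collects text nested in child elements;
        # split/join collapses runs of whitespace.
        text = " ".join("".join(s.itertext()).split())
        if text:
            sentences.append(text)
    return sentences


# Hypothetical miniature document, not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text xml:lang="en">
    <body>
      <p>
        <s>Resumption of the session.</s>
        <s>I declare the session <hi>resumed</hi>.</s>
      </p>
    </body>
  </text>
</TEI>"""

print(extract_sentences(SAMPLE))
```

For a file unpacked from the archive, `ET.parse(path).getroot()` could replace `ET.fromstring`; if the files use different elements for sentences, the XPath expression would need to be adjusted accordingly.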


[1] M. Z. Islam and A. Mehler, “Customization of the Europarl Corpus for Translation Studies,” in Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), 2012.
[Bibtex]
@InProceedings{Islam:Mehler:2012:a,
  Author         = {Islam, Md. Zahurul and Mehler, Alexander},
  Title          = {Customization of the Europarl Corpus for Translation
                   Studies},
  BookTitle      = {Proceedings of the 8th International Conference on
                   Language Resources and Evaluation (LREC)},
  abstract       = {Currently, the area of translation studies lacks
                   corpora by which translation scholars can validate
                   their theoretical claims, for example, regarding the
                   scope of the characteristics of the translation
                   relation. In this paper, we describe a customized
                   resource in the area of translation studies that mainly
                   addresses research on the properties of the translation
                   relation. Our experimental results show that the
                   Type-Token-Ratio (TTR) is not a universally valid
                   indicator of the simplification of translation.},
  owner          = {zahurul},
  pdf            = {http://www.lrec-conf.org/proceedings/lrec2012/pdf/729_Paper.pdf},
  timestamp      = {2012.02.02},
  year           = 2012
}