English Wikipedia Corpus

The English Wikipedia corpus was extracted from English Wikipedia articles in 2012 to support research on multilingual text readability analysis. It contains 641 documents, 277,691 sentences, and 5,949,254 tokens, and is distributed in TEI P5 format. For more details on this corpus, please refer to:

[1] M. Z. Islam and A. Mehler, “Automatic Readability Classification of Crowd-Sourced Data based on Linguistic and Information-Theoretic Features,” in 14th International Conference on Intelligent Text Processing and Computational Linguistics, 2013.
[BibTeX]
@InProceedings{Islam:Mehler:2013:a,
  Author         = {Islam, Md. Zahurul and Mehler, Alexander},
  Title          = {Automatic Readability Classification of Crowd-Sourced
                   Data based on Linguistic and Information-Theoretic
                   Features},
  BookTitle      = {14th International Conference on Intelligent Text
                   Processing and Computational Linguistics},
  abstract       = {This paper presents a classifier of text readability
                   based on information-theoretic features. The classifier
                   was developed based on a linguistic approach to
                   readability that explores lexical, syntactic and
                   semantic features. For this evaluation we extracted a
                   corpus of 645 articles from Wikipedia together with
                   their quality judgments. We show that
                   information-theoretic features perform as well as their
                   linguistic counterparts even if we explore several
                   linguistic levels at once.},
  owner          = {zahurul},
  pdf            = {http://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/download/1516/1497},
  timestamp      = {2013.01.22},
  website        = {http://www.redalyc.org/articulo.oa?id=61527437002},
  year           = 2013
}
[2] M. Z. Islam, “Multilingual Text Classification using Information-Theoretic Features,” PhD thesis, Goethe University Frankfurt, 2014.
If you use this corpus, please cite the publications above.

Acknowledgements: The work is supported by the LOEWE Digital-Humanities Project at the Goethe University Frankfurt.

Download as ZIP archive (14.6 MB)
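
Because the corpus is distributed as TEI P5, the files can be read with any XML parser. The following is a minimal sketch, not the authors' tooling: it assumes that sentences are marked with <s> elements and tokens with <w> elements in the standard TEI namespace, which should be checked against one of the downloaded files first. The function name count_units and the SAMPLE document are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# TEI P5 documents normally declare this namespace on the root <TEI> element.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def count_units(tei_xml: str) -> tuple[int, int]:
    """Return (sentence_count, token_count) for one TEI document string.

    Assumption: sentences are <s> and tokens are <w>. The corpus files
    may use different elements, so inspect one file before relying on this.
    """
    root = ET.fromstring(tei_xml)
    sentences = root.findall(".//tei:s", TEI_NS)
    tokens = root.findall(".//tei:w", TEI_NS)
    return len(sentences), len(tokens)


# Hypothetical miniature document, not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s><w>Wikipedia</w> <w>is</w> <w>free</w></s>
    <s><w>Read</w> <w>it</w></s>
  </p></body></text>
</TEI>"""

if __name__ == "__main__":
    print(count_units(SAMPLE))
```

To process the extracted archive, the same function could be applied to the text of each XML file and the per-file counts summed. If the totals differ from the figures stated above, the element names assumed here are the first thing to check.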