Bangla Textbook Corpus

The Bangla textbook corpus was extracted from textbooks used for teaching in public schools in Bangladesh. The corpus was collected in 2012 to support research on multilingual text readability analysis. It contains 661 documents, 105,897 sentences, and 1,029,354 tokens. The format of the corpus is TEI P5. For more details on this corpus, please refer to [1].
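Since the corpus is distributed as TEI P5 XML, its sentence and token counts can in principle be recomputed from the files. The following is a minimal sketch, not the authors' tooling. It assumes that sentences are marked with TEI `<s>` elements and tokens with `<w>` elements in the standard TEI namespace; the actual markup of this corpus may differ, so inspect a file first. The function name `count_units` and the inline sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; whether the corpus files declare it is an assumption.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_units(tei_xml: str) -> dict:
    """Count <s> (sentence) and <w> (token) elements in one TEI document.

    Assumes sentence and token boundaries are encoded as <s> and <w>;
    adjust the element names if the corpus uses other tags.
    """
    root = ET.fromstring(tei_xml)
    sentences = root.findall(f".//{{{TEI_NS}}}s")
    tokens = root.findall(f".//{{{TEI_NS}}}w")
    return {"sentences": len(sentences), "tokens": len(tokens)}


# Hypothetical two-sentence document, used only to illustrate the call.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s><w>আমি</w> <w>বই</w> <w>পড়ি</w></s>
    <s><w>সে</w> <w>যায়</w></s>
  </p></body></text>
</TEI>"""

counts = count_units(SAMPLE)
print(counts)
```

Summing these per-file counts over all documents in the unpacked archive would give corpus-level totals to compare against the figures stated above, provided the markup assumptions hold.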

Reference
Islam, Md. Zahurul
Multilingual Text Classification using Information-Theoretic Features
PhD Thesis; Goethe University Frankfurt; 2014
If you use this corpus, please cite the publications above.

Acknowledgements: The work is supported by the LOEWE Digital-Humanities Project at the Goethe University Frankfurt.

Download as ZIP archive (4.25 MB)


[1] Islam, M. Z., Mehler, A., & Rahman, R. (2012). Text Readability Classification of Textbooks of a Low-Resource Language. Paper presented at the 26th Pacific Asia Conference on Language, Information, and Computation (PACLIC 26).
[Bibtex]
@InProceedings{Islam:Mehler:Rahman:2012,
  Author         = {Islam, Md. Zahurul and Mehler, Alexander and Rahman,
                   Rashedur},
  Title          = {Text Readability Classification of Textbooks of a
                   Low-Resource Language},
  BookTitle      = {26th Pacific Asia Conference on
                   Language, Information, and Computation (PACLIC 26)},
  abstract       = {There are many languages considered to be low-density
                   languages, either because the population speaking the
                   language is not very large, or because insufficient
                   digitized text material is available in the language
                   even though millions of people speak the language.
                   Bangla is one of the latter ones. Readability
                   classification is an important Natural Language
                   Processing (NLP) application that can be used to judge
                   the quality of documents and assist writers to locate
                   possible problems. This paper presents a readability
                   classifier of Bangla textbook documents based on
                   information-theoretic and lexical features. The
                   features proposed in this paper result in an F-score
                   that is 50% higher than that for traditional
                   readability formulas.},
  owner          = {zahurul},
  pdf            = {http://www.aclweb.org/anthology/Y12-1059},
  timestamp      = {2012.08.14},
  website        = {http://www.researchgate.net/publication/256648250_Text_Readability_Classification_of_Textbooks_of_a_Low-Resource_Language},
  year           = 2012
}