The Bangla textbook corpus was extracted from textbooks used for teaching in public schools in Bangladesh. It was collected in 2012 to support research on multilingual text readability analysis. The corpus contains 661 documents, 105,897 sentences, and 1,029,354 tokens, and it is encoded in TEI P5. For more details on this corpus, please refer to [1].
Reference
Islam, Md. Zahurul
Multilingual Text Classification using Information-Theoretic Features
PhD Thesis; Goethe University Frankfurt; 2014
Acknowledgements: This work is supported by the LOEWE Digital-Humanities Project at the Goethe University Frankfurt.
Download as ZIP archive (4.25 MB)
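Since the corpus is distributed as a ZIP archive of TEI P5 documents, a first sanity check after downloading is to count the documents and tokens it contains. The following is a minimal sketch, not the authors' tooling. It assumes that each document is a separate `.xml` member of the archive, that the files use the standard TEI namespace, and that the textbook content sits under the TEI `<text>` element. The function names `extract_text` and `corpus_stats` and the file name in the example call are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed to be declared in the corpus files.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def extract_text(xml_bytes):
    """Return the plain text under the first TEI <text> element, or "" if there is none."""
    root = ET.fromstring(xml_bytes)
    # Assumption: the content lives under <text>, not in <teiHeader>.
    body = root.find(f".//{TEI_NS}text")
    if body is None:
        return ""
    # Join all descendant text nodes with a space, then collapse runs of whitespace.
    return " ".join(" ".join(body.itertext()).split())


def corpus_stats(zip_source):
    """Count XML documents and whitespace-separated tokens in a ZIP archive."""
    n_docs = 0
    n_tokens = 0
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directories and non-XML members
            text = extract_text(archive.read(name))
            n_docs += 1
            n_tokens += len(text.split())
    return n_docs, n_tokens


# Example call (hypothetical file name; use the path of the downloaded archive):
# docs, tokens = corpus_stats("bangla_textbook_corpus.zip")
# print(f"{docs} documents, {tokens} tokens")
```

The token count from this sketch relies on plain whitespace splitting, so it need not match the 1,029,354 tokens reported above if the corpus was tokenized differently.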
[1]
Islam, M. Z., Mehler, A., & Rahman, R. (2012). Text Readability Classification of Textbooks of a Low-Resource Language. Paper presented at the 26th Pacific Asia Conference on Language, Information, and Computation (PACLIC 26).
[Bibtex]
@InProceedings{Islam:Mehler:Rahman:2012,
Author = {Islam, Md. Zahurul and Mehler, Alexander and Rahman,
Rashedur},
Title = {Text Readability Classification of Textbooks of a
Low-Resource Language},
BookTitle = {26th Pacific Asia Conference on
Language, Information, and Computation (PACLIC 26)},
abstract = {There are many languages considered to be low-density
languages, either because the population speaking the
language is not very large, or because insufficient
digitized text material is available in the language
even though millions of people speak the language.
Bangla is one of the latter ones. Readability
classification is an important Natural Language
Processing (NLP) application that can be used to judge
the quality of documents and assist writers to locate
possible problems. This paper presents a readability
classifier of Bangla textbook documents based on
information-theoretic and lexical features. The
features proposed in this paper result in an F-score
that is 50% higher than that for traditional
readability formulas.},
owner = {zahurul},
pdf = {http://www.aclweb.org/anthology/Y12-1059},
timestamp = {2012.08.14},
website = {http://www.researchgate.net/publication/256648250_Text_Readability_Classification_of_Textbooks_of_a_Low-Resource_Language},
year = 2012
}