Corpora

GerParCor
Source Parliamentary minutes of the 16 German federal states, the Bundestag, the Bundesrat, and the German parliaments since 1867. In addition: plenary minutes of the Swiss National Council, the Austrian National Council, and the parliament of Liechtenstein. Preprocessed with spaCy and serialized in UIMA-XMI.
Published 2022
URL GerParCor
License CC BY 4.0
Reference [1]
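Since the corpus is serialized in UIMA-XMI, which is an XML format, a document can be inspected with a standard XML parser. The following is a minimal sketch, not the project's own tooling: the sample document, the `demo` namespace, and the `Token` type are invented for illustration, and the real type system of GerParCor may differ. Only the `cas:Sofa` element with its `sofaString` attribute and the `begin`/`end` offsets follow the general XMI CAS layout.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by the UIMA XMI CAS serialization.
NS = {
    "xmi": "http://www.omg.org/XMI",
    "cas": "http:///uima/cas.ecore",
}

# Hand-written stand-in for one document; NOT taken from GerParCor.
# The "demo" namespace and its Token type are hypothetical.
SAMPLE_XMI = """
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:cas="http:///uima/cas.ecore"
         xmlns:demo="http:///org/example/demo.ecore"
         xmi:version="2.0">
  <cas:NULL xmi:id="0"/>
  <cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView"
            mimeType="text/plain" sofaString="Die Sitzung ist eröffnet."/>
  <demo:Token xmi:id="2" sofa="1" begin="0" end="3"/>
  <demo:Token xmi:id="3" sofa="1" begin="4" end="11"/>
</xmi:XMI>
""".strip()


def read_xmi(xml_text):
    """Return (document_text, [(type_name, begin, end, covered_text), ...])."""
    root = ET.fromstring(xml_text)
    # The Sofa ("subject of analysis") element carries the document text.
    text = root.find("cas:Sofa", NS).get("sofaString")
    spans = []
    for elem in root:
        begin, end = elem.get("begin"), elem.get("end")
        if begin is None or end is None:
            continue  # skip elements that are not offset-based annotations
        type_name = elem.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        b, e = int(begin), int(end)
        spans.append((type_name, b, e, text[b:e]))
    return text, spans


text, spans = read_xmi(SAMPLE_XMI)
for type_name, b, e, covered in spans:
    print(type_name, b, e, covered)
```

For real corpus files, a dedicated CAS library together with the corpus's type system description is likely the more robust route; the sketch only shows how text and stand-off annotations relate.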
GermanWordEmbeddingsNER
Primary source Leipzig Corpus 2018, COW Corpus 2016
Published 2018
URL GermanWordEmbeddingsNER
License CC BY 4.0
Reference [2]
FIGURE — Frankfurt Image Gestures
Primary source Annotations of videos of hand and arm gestures
Size 260 gestures
Published 2016
URL FIGURE corpus
License CC BY-SA 4.0
Reference [3] [4]
Language levels Image schemas
Purpose Depiction strategies

TGermaCorp
Primary source Lemmatized and tagged documents.
Size
  • v0.1 (2016): 242 documents, 7,336 sentences, 122,913 tokens
  • v0.2 (2017): 244 documents, 8,941 sentences, 157,210 tokens
Published 2016, 2017
URL
Reference Alexander Mehler
License CC BY-NC-ND 3.0 DE, or restricted
ISLRN 536-382-801-278-5
Language levels Lemma, POS
Purpose

Bible Corpus Tokenization Extension
Primary source http://christos-c.com/bible/
Size 4 files of tokenized Bible data
Published 2015
URL BibleCorpusTokenizationTTLAB extension to http://christos-c.com/bible/
License CC BY-NC-SA 3.0
Reference Armin Hoenen
Language levels Words
Purpose Community extension

Syntactic Language Networks (SLN)
Primary source SLNs are induced from dependency treebanks
Size Networks for 13 languages, 6 freely available
Published 2007-2009
URL Linguistic Networks
License CC BY-NC-ND 3.0 DE, or restricted
Reference Olga Pustylnikov, Alexander Mehler
Language levels Syntax, dependencies
Purpose Language typology
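To illustrate what inducing a network from a dependency treebank can look like, here is a minimal sketch under stated assumptions: nodes are word forms, directed edges run from head to dependent, and edge weights count how often a pair occurs. The toy treebank and the function name `induce_network` are hypothetical, and this is not necessarily the procedure used to build the SLNs.

```python
from collections import Counter

# Hypothetical toy treebank: each sentence is a list of
# (token_id, form, head_id) triples; head_id 0 marks the root.
TREEBANK = [
    [(1, "the", 2), (2, "dog", 3), (3, "barks", 0)],
    [(1, "the", 2), (2, "cat", 3), (3, "sleeps", 0)],
]


def induce_network(treebank):
    """Count directed head -> dependent edges between word forms."""
    edges = Counter()
    for sentence in treebank:
        forms = {tid: form for tid, form, _ in sentence}
        for tid, form, head in sentence:
            if head == 0:  # the root has no governing word
                continue
            edges[(forms[head], form)] += 1
    return edges


network = induce_network(TREEBANK)
nodes = {word for edge in network for word in edge}
print(len(nodes), "nodes,", len(network), "distinct edges")
```

Merging identical word forms across sentences is what turns a set of trees into a single graph; a lemma-based variant would only change the node labels.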
Morphological Derivation Networks
Primary source Networks are induced using a morphological derivation game that implements a word decomposition algorithm
Size Networks for 5 languages
Published 2009
URL Linguistic Networks
License CC BY-SA 4.0 DE
Reference Olga Pustylnikov
Language levels Morphology
Purpose Language typology, Productivity
Wikipedia Networks (WN)
Primary source WNs are induced from language-specific releases of the Wikipedia
Size Networks of different sizes for 264 languages
Published 2008-2009
URL Linguistic Networks
License CC BY-SA 4.0 DE
Reference Alexander Mehler
Language levels Articles, ontologies, categories
Purpose Social ontologies
Lexical Co-occurrence Networks
Primary source Co-occurrence network based on the Patrologia Latina
Size
Published 2007-2010
URL Linguistic Networks
License Restricted, due to primary source
Reference Alexander Mehler
Language levels Words, co-occurrences
Purpose Language change, historical semantics
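A co-occurrence network of words can be sketched as follows. This is an illustration under assumptions, not the method used for the Patrologia Latina network: the co-occurrence window is taken to be the sentence, edges are undirected, and the two toy sentences are invented.

```python
from collections import Counter
from itertools import combinations

# Invented toy input: tokenized sentences (not from the Patrologia Latina).
SENTENCES = [
    ["deus", "pater", "omnipotens"],
    ["deus", "pater"],
]


def cooccurrence_network(sentences):
    """Count, per unordered word pair, the sentences in which both occur."""
    edges = Counter()
    for tokens in sentences:
        # sorted(set(...)) yields each pair once per sentence, in a fixed order
        for u, v in combinations(sorted(set(tokens)), 2):
            edges[(u, v)] += 1
    return edges


cooc = cooccurrence_network(SENTENCES)
for (u, v), weight in sorted(cooc.items()):
    print(u, v, weight)
```

Comparing such edge weights across time slices of a corpus is one way to approach questions of language change.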
Frankfurter OCR-Korpus
Primary source Comparison of a scanned part of the Patrologia Latina (Flodoardi Canonici Remensis Historiae Remensis Ecclesiae Libri Quatuor, ed. Jean-Jacques Migne (Patrologia Latina T. 135), Paris 1853, col. 27A-328B) with the original
Size 5,213 pairs of words (wrongly scanned, correction)
Published 2014
URL Download as ZIP file (33.7 KB)
License CC BY-SA 4.0 DE
Reference Steffen Eger
Language levels Words
Purpose Spelling error correction
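Pairs of a wrongly scanned word and its correction lend themselves to simple error statistics. The sketch below computes the Levenshtein edit distance for a few pairs; the pairs shown are invented examples (long s misread as f), and the actual file format of the corpus is not assumed here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Invented (wrongly scanned, correction) pairs; NOT taken from the corpus.
PAIRS = [
    ("ecclefia", "ecclesia"),
    ("hiftoria", "historia"),
    ("epifcopus", "episcopus"),
]

distances = [levenshtein(wrong, right) for wrong, right in PAIRS]
print(distances)
```

A distribution of such distances over all pairs gives a first impression of how severe the OCR errors are.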
Tascfe
Primary sources Paper sheets of handwritten copies
Size 54 documents, ca. 6500 digitized tokens
Published in preparation
URL Download as ZIP file (84.0 KB)
License CC BY-SA 4.0 DE
Reference Armin Hoenen, Goethe University Frankfurt
Language levels Persian Shahname excerpt
Purpose Copy errors, influence of oral versions in stemmatology, artificial non-Latin script corpus, copy from print, copy from handwriting
Person Database Deutsche Nationalbibliothek filtered
Primary sources DNB dump XML, filtered TSV file
Size ca. 250 MB; 1,912,675 person entries with variant names, occupations and kinship relations
Published in preparation
URL link will follow soon
License CC0 1.0 Universal (CC0 1.0)
Reference Goethe University Frankfurt
Language levels Named entities, persons
Purpose Large database of variant names
Avesta Yasna Ceremony
Primary sources Excel lexicon file with concordance, cf. Geldner (1896) – Titus, fully annotated, TEI file in TTLab format for text
Size 7,744 lexical entries, ca. 30,000 text tokens
Published Jügel forthcoming
URL Available via the Avestan language from Linguistic Networks
License CC BY-SA 4.0 DE
Reference Thomas Jügel, Goethe University Frankfurt
Language levels Ceremonial text
Purpose General analyses (syntactic, semantic, co-occurrence, comparison of Young Avestan and Old Avestan)
Bangla Textbook Corpus
Primary sources Bangla textbooks that have been used in public schools in Bangladesh
Size 661 documents, 105,897 sentences, 1,029,354 tokens
Published 2014
URL Bangla Textbook Corpus
License CC BY-SA 4.0 DE
Reference Zahurul Islam, Goethe University Frankfurt
Language levels Textbooks
Purpose Text readability analysis
English Textbook Corpus
Primary sources English versions of textbooks that have been used in public schools in Bangladesh
Size 519 documents, 95,470 sentences, 1,184,124 tokens
Published 2014
URL English Textbook Corpus
License CC BY-SA 4.0 DE
Reference Zahurul Islam, Goethe University Frankfurt
Language levels Textbooks
Purpose Text readability analysis
English Wikipedia Corpus
Primary sources English Wikipedia articles
Size 641 documents, 277,691 sentences, 5,949,254 tokens
Published 2014
URL English Wikipedia Corpus
License CC BY-SA 4.0 DE
Reference Zahurul Islam, Goethe University Frankfurt
Language levels Articles
Purpose Text readability analysis
Customized Europarl Corpus
Primary sources Europarl corpus
Size 3,152,650 sentences from 21 European languages
Published 2012-2014
URL Customized Europarl Corpus
License CC BY-SA 4.0 DE
Reference Zahurul Islam, Goethe University Frankfurt
Language levels Sentences
Purpose Translation studies
Unified Dependency Treebanks
Primary sources
Size
Published 2008
URL Unified Dependency Treebanks
License CC BY-SA 4.0 DE
Reference [5] [4]
Language levels Sentence
Purpose Dependency Analysis
NLP Resources for Latin and German
Resources for article: “A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin” [6]
Primary source Capitularies Sentence Corpus, Proiel Sentence Corpus, TGermaCorp, Tiger (for training models)
Published 2019
URL
License
Reference [6]
[1] G. Abrami, M. Bagci, L. Hammerla, and A. Mehler, “German Parliamentary Corpus (GerParCor),” in Proceedings of the Language Resources and Evaluation Conference, Marseille, France, 2022, pp. 1900-1906.
@InProceedings{Abrami:Bagci:Hammerla:Mehler:2022,
  author    = {Abrami, Giuseppe  and  Bagci, Mevlüt  and  Hammerla, Leon  and  Mehler, Alexander},
  title     = {German Parliamentary Corpus (GerParCor)},
  booktitle      = {Proceedings of the Language Resources and Evaluation Conference},
  month          = {June},
  year           = {2022},
  address        = {Marseille, France},
  publisher      = {European Language Resources Association},
  pages     = {1900--1906},
  abstract  = {Parliamentary debates represent a large and partly unexploited treasure trove of publicly accessible texts. In the German-speaking area, there is a certain deficit of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliamentary Corpus (GerParCor). GerParCor is a genre-specific corpus of (predominantly historical) German-language parliamentary protocols from three centuries and four countries, including state and federal level data. In addition, GerParCor contains conversions of scanned protocols and, in particular, of protocols in Fraktur converted via an OCR process based on Tesseract. All protocols were preprocessed by means of the NLP pipeline of spaCy3 and automatically annotated with metadata regarding their session date. GerParCor is made available in the XMI format of the UIMA project. In this way, GerParCor can be used as a large corpus of historical texts in the field of political communication for various tasks in NLP.},
  poster   = {https://www.texttechnologylab.org/wp-content/uploads/2022/06/GerParCor_LREC_2022.pdf},
  pdf    = {http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.202.pdf}

}
[2] S. Ahmed and A. Mehler, “Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora,” in Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.
@InProceedings{Ahmed:Mehler:2018,
author = {Sajawel Ahmed and Alexander Mehler},
title = {{Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora}},
abstract = {This study improves the performance of neural named entity recognition by a margin of up to 11% in terms of F-score on the example of a low-resource language like German, thereby outperforming existing baselines and establishing a new state-of-the-art on each single open-source dataset (CoNLL 2003, GermEval 2014 and Tübingen Treebank 2018). Rather than designing deeper and wider hybrid neural architectures, we gather all available resources and perform a detailed optimization and grammar-dependent morphological processing consisting of lemmatization and part-of-speech tagging prior to exposing the raw data to any training process. We test our approach in a threefold monolingual experimental setup of a) single, b) joint, and c) optimized training and shed light on the dependency of downstream-tasks on the size of corpora used to compute word embeddings.},
booktitle = {Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA)},
location = {Orlando, Florida, USA},
pdf = {https://arxiv.org/pdf/1807.10675.pdf},
year = 2018
}
[3] A. Lücking, A. Mehler, D. Walther, M. Mauri, and D. Kurfürst, “Finding Recurrent Features of Image Schema Gestures: the FIGURE corpus,” in Proceedings of the 10th International Conference on Language Resources and Evaluation, 2016.
@InProceedings{Luecking:Mehler:Walther:Mauri:Kurfuerst:2016,
  Author         = {L\"{u}cking, Andy and Mehler, Alexander and Walther,
                   D\'{e}sir\'{e}e and Mauri, Marcel and Kurf\"{u}rst,
                   Dennis},
  Title          = {Finding Recurrent Features of Image Schema Gestures:
                   the {FIGURE} corpus},
  BookTitle      = {Proceedings of the 10th International Conference on
                   Language Resources and Evaluation},
  Series         = {LREC 2016},
  location       = {Portoro\v{z} (Slovenia)},
  pdf            = {http://www.texttechnologylab.org/wp-content/uploads/2016/04/lrec2016-gesture-study-final-version-short.pdf},
  year           = 2016
}
[4] Unknown bibtex entry with key [FIGURE:annotation]
[5] O. Abramov, A. Mehler, and R. Gleim, “A Unified Database of Dependency Treebanks. Integrating, Quantifying and Evaluating Dependency Data,” in Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco), 2008.
@InProceedings{Pustylnikov:Mehler:Gleim:2008,
  Author         = {Abramov, Olga and Mehler, Alexander and Gleim,
                   Rüdiger},
  Title          = {A Unified Database of Dependency Treebanks.
                   Integrating, Quantifying and Evaluating Dependency Data},
  BookTitle      = {Proceedings of the 6th Language Resources and
                   Evaluation Conference (LREC 2008), Marrakech (Morocco)},
  abstract       = {This paper describes a database of 11 dependency
                   treebanks which were unified by means of a
                   two-dimensional graph format. The format was evaluated
                   with respect to storage-complexity on the one hand, and
                   efficiency of data access on the other hand. An example
                   of how the treebanks can be integrated within a unique
                   interface is given by means of the DTDB interface. },
  pdf            = {http://wwwhomes.uni-bielefeld.de/opustylnikov/pustylnikov/pdfs/LREC08_full.pdf},
  year           = 2008
}
[6] R. Gleim, S. Eger, A. Mehler, T. Uslu, W. Hemati, A. Lücking, A. Henlein, S. Kahlsdorf, and A. Hoenen, “A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin,” Journal of Language Modeling, 2019.
@article{Gleim:Eger:Mehler:2019,
  author    = {Gleim, R\"{u}diger and Eger, Steffen and Mehler, Alexander and Uslu, Tolga and Hemati, Wahed and L\"{u}cking, Andy and Henlein, Alexander and Kahlsdorf, Sven and Hoenen, Armin},
  title     = {A practitioner's view: a survey and comparison of lemmatization and morphological tagging in German and Latin},
  journal   = {Journal of Language Modeling},
  year      = {2019},
  pdf = {https://www.texttechnologylab.org/wp-content/uploads/2019/07/jlm-tagging.pdf},
  doi = {10.15398/jlm.v7i1.205},
  url = {http://jlm.ipipan.waw.pl/index.php/JLM/article/view/205} 
}