Source |
Parliamentary minutes of the 16 German federal states, the Bundestag, the Bundesrat, and the German parliaments since 1867. In addition: plenary minutes of the Swiss National Council, the Austrian National Council, and the parliament of Liechtenstein. Preprocessed with spaCy and serialized in UIMA-XMI. |
Published |
2022 |
URL |
GerParCor |
License |
CC BY 4.0 |
Reference |
[1] |
Primary source |
Annotations of videos of hand and arm gestures |
Size |
260 gestures |
Published |
2016 |
URL |
FIGURE corpus |
License |
CC BY-SA 4.0 |
Reference |
[3]
[4] |
Language levels |
Image schemas |
Purpose |
Depiction strategies |
Primary source |
Lemmatized and tagged documents. |
Size |
- v0.1 (2016): 242 documents, 7,336 sentences, 122,913 tokens
- v0.2 (2017): 244 documents, 8,941 sentences, 157,210 tokens
|
Published |
2016, 2017 |
URL |
|
Reference |
Alexander Mehler |
License |
CC BY-NC-ND 3.0 DE, or restricted |
ISLRN |
536-382-801-278-5 |
Language levels |
Lemma, POS |
Purpose |
|
Primary source |
SLNs are induced from dependency treebanks |
Size |
Networks for 13 languages, 6 freely available |
Published |
2007-2009 |
URL |
Linguistic Networks |
License |
CC BY-NC-ND 3.0 DE, or restricted |
Reference |
Olga Pustylnikov, Alexander Mehler |
Language levels |
Syntax, dependencies |
Purpose |
Language typology |
Primary source |
Networks are induced using a morphological derivation game that implements a word decomposition algorithm |
Size |
Networks for 5 languages |
Published |
2009 |
URL |
Linguistic Networks |
License |
CC BY-SA 4.0 DE |
Reference |
Olga Pustylnikov |
Language levels |
Morphology |
Purpose |
Language typology, Productivity |
Primary source |
WNs are induced from language-specific releases of the Wikipedia |
Size |
Networks of different sizes for 264 languages |
Published |
2008-2009 |
URL |
Linguistic Networks |
License |
CC BY-SA 4.0 DE |
Reference |
Alexander Mehler |
Language levels |
Articles, ontologies, categories |
Purpose |
Social ontologies |
Primary source |
Co-occurrence network based on the Patrologia Latina |
Size |
|
Published |
2007-2010 |
URL |
Linguistic Networks |
License |
Restricted, due to primary source |
Reference |
Alexander Mehler |
Language levels |
Words, co-occurrences |
Purpose |
Language change, historical semantics |
Primary source |
Comparison of a scanned subpart of the Patrologia Latina (Flodoardi Canonici Remensis Historiae Remensis Ecclesiae Libri Quatuor, ed. Jean-Jacques Migne (Patrologia Latina T. 135), Paris 1853, col. 27A-328B) with the original |
Size |
5,213 pairs of words (wrongly scanned, correction) |
Published |
2014 |
URL |
Download as ZIP file (33.7 KB) |
License |
CC BY-SA 4.0 DE |
Reference |
Steffen Eger |
Language levels |
Words |
Purpose |
Spelling error correction |
Primary sources |
Paper sheets of handwritten copies |
Size |
54 documents, ca. 6500 digitized tokens |
Published |
in preparation |
URL |
Download as ZIP file (84.0 KB) |
License |
CC BY-SA 4.0 DE |
Reference |
Armin Hoenen, Goethe University Frankfurt |
Language levels |
Persian Shahname excerpt |
Purpose |
Copy errors, influence of oral versions in stemmatology, artificial non-Latin script corpus, copy from print, copy from handwriting |
Primary sources |
DNB XML dump, filtered TSV file |
Size |
ca. 250 MB; 1,912,675 person entries with variant names, occupations and kinship relations |
Published |
in preparation |
URL |
link will follow soon |
License |
CC0 1.0 Universal (CC0 1.0) |
Reference |
Goethe University Frankfurt |
Language levels |
Named entities, persons |
Purpose |
Large database of variant names |
Primary sources |
Excel lexicon file with concordance, cf. Geldner (1896) – Titus, fully annotated, Tei-File TTLab format for text |
Size |
7,744 lexical entries, ca. 30,000 text tokens |
Published |
Jügel forthcoming |
URL |
Available via the Avestan language from Linguistic Networks |
License |
CC BY-SA 4.0 DE |
Reference |
Thomas Jügel, Goethe University Frankfurt |
Language levels |
Ceremonial text |
Purpose |
General analyses (syntactic, semantic, co-occurrence, comparison of Young Avestan and Old Avestan) |
Primary sources |
Bangla textbooks that have been used in public schools in Bangladesh |
Size |
661 documents, 105,897 sentences, 1,029,354 tokens |
Published |
2014 |
URL |
Bangla Textbook Corpus |
License |
CC BY-SA 4.0 DE |
Reference |
Zahurul Islam, Goethe University Frankfurt |
Language levels |
Textbooks |
Purpose |
Text readability analysis |
Primary source |
Capitularies Sentence Corpus, Proiel Sentence Corpus, TGermaCorp, Tiger (for training models) |
Published |
2019 |
URL |
|
License |
|
Reference |
[6] |
[1]
G. Abrami, M. Bagci, L. Hammerla, and A. Mehler, “German Parliamentary Corpus (GerParCor),” in
Proceedings of the Language Resources and Evaluation Conference, Marseille, France, 2022, pp. 1900-1906.
@InProceedings{Abrami:Bagci:Hammerla:Mehler:2022,
author = {Abrami, Giuseppe and Bagci, Mevlüt and Hammerla, Leon and Mehler, Alexander},
title = {German Parliamentary Corpus (GerParCor)},
booktitle = {Proceedings of the Language Resources and Evaluation Conference},
month = {June},
year = {2022},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {1900--1906},
abstract = {Parliamentary debates represent a large and partly unexploited treasure trove of publicly accessible texts. In the German-speaking area, there is a certain deficit of uniformly accessible and annotated corpora covering all German-speaking parliaments at the national and federal level. To address this gap, we introduce the German Parliamentary Corpus (GerParCor). GerParCor is a genre-specific corpus of (predominantly historical) German-language parliamentary protocols from three centuries and four countries, including state and federal level data. In addition, GerParCor contains conversions of scanned protocols and, in particular, of protocols in Fraktur converted via an OCR process based on Tesseract. All protocols were preprocessed by means of the NLP pipeline of spaCy3 and automatically annotated with metadata regarding their session date. GerParCor is made available in the XMI format of the UIMA project. In this way, GerParCor can be used as a large corpus of historical texts in the field of political communication for various tasks in NLP.},
poster = {https://www.texttechnologylab.org/wp-content/uploads/2022/06/GerParCor_LREC_2022.pdf},
pdf = {http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.202.pdf}
}
[2]
S. Ahmed and A. Mehler, “Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora,” in
Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018.
@InProceedings{Ahmed:Mehler:2018,
author = {Sajawel Ahmed and Alexander Mehler},
title = {{Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora}},
abstract = {This study improves the performance of neural named entity recognition by a margin of up to 11% in terms of F-score on the example of a low-resource language like German, thereby outperforming existing baselines and establishing a new state-of-the-art on each single open-source dataset (CoNLL 2003, GermEval 2014 and Tübingen Treebank 2018). Rather than designing deeper and wider hybrid neural architectures, we gather all available resources and perform a detailed optimization and grammar-dependent morphological processing consisting of lemmatization and part-of-speech tagging prior to exposing the raw data to any training process. We test our approach in a threefold monolingual experimental setup of a) single, b) joint, and c) optimized training and shed light on the dependency of downstream-tasks on the size of corpora used to compute word embeddings.},
booktitle = {Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA)},
location = {Orlando, Florida, USA},
pdf = {https://arxiv.org/pdf/1807.10675.pdf},
year = 2018
}
[3]
A. Lücking, A. Mehler, D. Walther, M. Mauri, and D. Kurfürst, “Finding Recurrent Features of Image Schema Gestures: the FIGURE corpus,” in
Proceedings of the 10th International Conference on Language Resources and Evaluation, 2016.
@InProceedings{Luecking:Mehler:Walther:Mauri:Kurfuerst:2016,
Author = {L\"{u}cking, Andy and Mehler, Alexander and Walther,
D\'{e}sir\'{e}e and Mauri, Marcel and Kurf\"{u}rst,
Dennis},
Title = {Finding Recurrent Features of Image Schema Gestures:
the {FIGURE} corpus},
BookTitle = {Proceedings of the 10th International Conference on
Language Resources and Evaluation},
Series = {LREC 2016},
location = {Portoro\v{z} (Slovenia)},
pdf = {http://www.texttechnologylab.org/wp-content/uploads/2016/04/lrec2016-gesture-study-final-version-short.pdf},
year = 2016
}
[4]
Unknown bibtex entry with key [FIGURE:annotation] [Bibtex]
[5]
O. Abramov, A. Mehler, and R. Gleim, “A Unified Database of Dependency Treebanks. Integrating, Quantifying and Evaluating Dependency Data,” in
Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco), 2008.
@InProceedings{Pustylnikov:Mehler:Gleim:2008,
Author = {Abramov, Olga and Mehler, Alexander and Gleim,
Rüdiger},
Title = {A Unified Database of Dependency Treebanks.
Integrating, Quantifying and Evaluating Dependency Data},
BookTitle = {Proceedings of the 6th Language Resources and
Evaluation Conference (LREC 2008), Marrakech (Morocco)},
abstract = {This paper describes a database of 11 dependency
treebanks which were unified by means of a
two-dimensional graph format. The format was evaluated
with respect to storage-complexity on the one hand, and
efficiency of data access on the other hand. An example
of how the treebanks can be integrated within a unique
interface is given by means of the DTDB interface. },
pdf = {http://wwwhomes.uni-bielefeld.de/opustylnikov/pustylnikov/pdfs/LREC08_full.pdf},
year = 2008
}
[6]
R. Gleim, S. Eger, A. Mehler, T. Uslu, W. Hemati, A. Lücking, A. Henlein, S. Kahlsdorf, and A. Hoenen, “A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin,”
Journal of Language Modeling, 2019.
@article{Gleim:Eger:Mehler:2019,
author = {Gleim, R\"{u}diger and Eger, Steffen and Mehler, Alexander and Uslu, Tolga and Hemati, Wahed and L\"{u}cking, Andy and Henlein, Alexander and Kahlsdorf, Sven and Hoenen, Armin},
title = {A practitioner's view: a survey and comparison of lemmatization and morphological tagging in German and Latin},
journal = {Journal of Language Modeling},
year = {2019},
pdf = {https://www.texttechnologylab.org/wp-content/uploads/2019/07/jlm-tagging.pdf},
doi = {10.15398/jlm.v7i1.205},
url = {http://jlm.ipipan.waw.pl/index.php/JLM/article/view/205}
}