
This DFG-funded project (project number: 531750631) develops new methods for the thematic classification of very large text corpora in the digital humanities and social sciences. Focusing on the German Reference Corpus (DeReKo), the world’s largest collection of German-language texts, it addresses the lack of reliable topic-based metadata for heterogeneous and rapidly growing corpora. By combining approaches from computer science and corpus linguistics, the project creates efficient, open-source, and dynamic classification methods that support advanced corpus analysis, stratified sampling, and the study of linguistic variation. The methods are also tested on domain-specific resources, such as Grammis, to enable fine-grained thematic indexing of specialized texts.
Project Locations
The project is carried out at three research institutions:
- Goethe University Frankfurt am Main
  - Project Lead: Prof. Dr. Alexander Mehler
- Saxon Academy of Sciences and Humanities in Leipzig
  - Project Lead: Prof. Dr. Gerhard Heyer
- Leibniz Institute for the German Language
  - Project Leads: Dr. Marc Kupietz, Prof. Dr. Roman Schneider
Team Frankfurt
Publications
@inproceedings{Verma:Mehler:2026,
title = {Predicting Topic (Co-)Occurrence Using Topic Networks Built from
the Project Gutenberg Corpus},
booktitle = {Proceedings of the 15th International Conference on Language Resources
and Evaluation (LREC 2026)},
year = {2026},
author = {Verma, Bhuvanesh and Mehler, Alexander},
keywords = {Topic Evolution, Topic Network, Time-aware Networks, Temporal Autocorrelation, Project Gutenberg, satek},
abstract = {Although temporal topic modeling has been widely applied to scientific
and legal texts, literary corpora have largely been overlooked
in this regard. To address this issue, we analyze topic evolution
in a subset of the Project Gutenberg (PG) corpus. We model this
subset as a sequence of topic networks that capture the emergence,
persistence, and interaction of thematic structures over decades.
Using supervised topic representations, we predict nodes (topics)
and edges (topic pairings) to forecast future topics and their
co-occurrence. Our experiments demonstrate moderate to strong
temporal persistence in topic connectivity patterns across three
topic systems, with ROC-AUC and AP values consistently above 0.85.
We find that the temporal span of topic networks significantly
impacts predictive performance: longer spans improve the stability
and recall of topic presence, while shorter spans better capture
evolving topic relationships. Overall, our findings demonstrate
the predictability of topics in literary texts over time.},
note = {accepted}
}

