
This DFG-funded project (project number: 531750631) develops new methods for the thematic classification of very large text corpora in the digital humanities and social sciences. Focusing on the German Reference Corpus (DeReKo), the world’s largest collection of German-language texts, it addresses the lack of reliable topic-based metadata for heterogeneous and rapidly growing corpora. By combining approaches from computer science and corpus linguistics, the project creates efficient, open-source, and dynamic classification methods that support advanced corpus analysis, stratified sampling, and the study of linguistic variation. The methods are also tested on domain-specific resources, such as Grammis, to enable fine-grained thematic indexing of specialized texts.
Project Locations
The project is carried out at three research institutions:
- Goethe University Frankfurt am Main
  - Project Lead: Prof. Dr. Alexander Mehler
- Saxon Academy of Sciences and Humanities in Leipzig
  - Project Lead: Prof. Dr. Gerhard Heyer
- Leibniz Institute for the German Language
  - Project Leads: Dr. Marc Kupietz, Prof. Dr. Roman Schneider
Team Frankfurt

