SATEK

This DFG-funded project (project number: 531750631) develops new methods for the thematic classification of very large text corpora in the digital humanities and social sciences. It focuses on the German Reference Corpus (DeReKo), the world’s largest linguistically motivated collection of German-language texts, and addresses the lack of reliable topic-based metadata for heterogeneous and rapidly growing corpora. By combining approaches from computer science and corpus linguistics, the project creates efficient, open-source classification methods that can keep pace with a growing corpus and that support advanced corpus analysis, stratified sampling, and the study of linguistic variation. The methods are also tested on domain-specific resources, such as Grammis, to enable fine-grained thematic indexing of specialized texts.


Project Locations

The project is carried out at three research institutions: