The following master/bachelor thesis topics are currently offered by the Text Technology Lab. You are also welcome to bring in your own topics and ideas. If you are interested in one of them, please contact Alexander Mehler.
Multimodal word embeddings (B.Sc., M.Sc.)
Usually, word embeddings are obtained from co-occurrence patterns within texts. Using annotated image databases, these embeddings can be enriched with embeddings of image features, giving rise to so-called multimodal embeddings. Within this broad topic, the candidate develops multimodal embeddings and compares them to unimodal ones on various tasks from computational linguistics, such as word sense disambiguation. Parameters to vary include the choice of image database, the segmentation of image annotations, image feature selection, the integration of verbal and visual embeddings, and the evaluation scenario.
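One simple way to integrate verbal and visual embeddings is to normalize each vector and concatenate the two, optionally weighting the modalities. The following is only a minimal sketch of such a fusion step, not the method prescribed for the thesis. The toy vectors, the function names `l2_normalize`, `fuse`, and `cosine`, and the weight `alpha` are assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def fuse(text_vec, image_vec, alpha=0.5):
    """Concatenate normalized text and image vectors.

    alpha weights the textual part, (1 - alpha) the visual part.
    """
    t = [alpha * x for x in l2_normalize(text_vec)]
    v = [(1 - alpha) * x for x in l2_normalize(image_vec)]
    return t + v

def cosine(a, b):
    """Cosine similarity of two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy embeddings. Real ones would come from a text corpus
# and from image features of an annotated image database.
text_emb = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
image_emb = {"cat": [0.8, 0.2, 0.0], "dog": [0.7, 0.3, 0.0], "car": [0.0, 0.1, 0.9]}

multimodal = {w: fuse(text_emb[w], image_emb[w]) for w in text_emb}
print(cosine(multimodal["cat"], multimodal["dog"]))
print(cosine(multimodal["cat"], multimodal["car"]))
```

Varying `alpha`, or replacing concatenation with a learned projection, corresponds to the "integration of verbal and visual embeddings" parameter mentioned above.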
Inference Engine on Spatial and Temporal Relations (M.Sc.)
To infer new statements from a body of given statements, a so-called inference or reasoning engine is required. The candidate plans and implements an inference engine on temporal and spatial statements such as “is included in” and “earlier than”. A main challenge of this thesis is to account for transitive inference patterns of temporal
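Transitive inference over such statements can be illustrated with a naive forward-chaining loop. This is a sketch under stated assumptions, not the engine the thesis is expected to deliver: statements are modeled as (subject, relation, object) triples, and the relation names `earlier_than` and `included_in` as well as the example facts are hypothetical.

```python
def transitive_closure(facts, transitive_relations):
    """Forward-chain over (subject, relation, object) triples.

    For every transitive relation r, derive (a, r, c) from (a, r, b)
    and (b, r, c) until no new statement can be added.
    """
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(inferred):
            if r1 not in transitive_relations:
                continue
            for (b2, r2, c) in list(inferred):
                if r2 == r1 and b2 == b and (a, r1, c) not in inferred:
                    inferred.add((a, r1, c))
                    changed = True
    return inferred

# Hypothetical example statements.
facts = {
    ("breakfast", "earlier_than", "lunch"),
    ("lunch", "earlier_than", "dinner"),
    ("Frankfurt", "included_in", "Hesse"),
    ("Hesse", "included_in", "Germany"),
}
closure = transitive_closure(facts, {"earlier_than", "included_in"})
new_statements = sorted(closure - facts)
print(new_statements)
```

The loop is quadratic per pass and is meant only to make the inference pattern explicit. A real engine would likely need indexing and a way to handle interactions between different relations.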
Authorship-related Word Embeddings (B.Sc., M.Sc.)
Word embeddings are usually computed on large corpora. The larger and more diverse the corpora, the more diverse the aspects of the respective language that can be represented in the embedding space. In this thesis, embedding spaces for different authors will be calculated and analyzed separately. For this purpose, the candidate will pre-train word embeddings on large corpora and then specialize them on smaller corpora. Finally, the embedding spaces created for the different authors have to be evaluated and analyzed in the context of automatic authorship recognition.
Topic Models (B.Sc., M.Sc.)
There are various methods for classifying texts according to topic. These include established models such as topic models based on Latent Dirichlet Allocation (LDA), but also newer methods such as text2ddc. In this thesis, the differences and similarities of these and related topic models will be analyzed and evaluated comparatively. The candidate will focus on combinations of these topic models to increase their overall performance.
GeoNames-based Modeling and Recognition of Toponyms (B.Sc., M.Sc.)
The recognition of places and geographical units in texts is implemented to a certain extent by existing taggers for named entities. These taggers usually only determine whether a textual expression denotes a place (and is therefore a toponym) or not. This recognition is to be extended by using GeoNames as a data source for place names and their classes. GeoNames is a free data source that contains an ontology for places and geographical areas. The task is to develop a UIMA-based annotator for the recognition and annotation of places and geographical units and to test it with the texts from the BIOfid project. The UIMA annotator has to be implemented as a pipeline for TextImager. Furthermore, an extension of TextAnnotator for GeoNames has to be developed to correct existing annotations or create new ones. Finally, the system has to be evaluated using the BIOfid dataset.
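The difference between a binary place tagger and a class-aware one can be sketched with a small gazetteer lookup. The example below is an illustration only: the mini-gazetteer is hand-written and hypothetical, the single-letter labels merely follow the GeoNames feature class scheme (for example `P` for populated places, `H` for water bodies, `T` for terrain, `A` for administrative units), and a real annotator would load GeoNames data and run inside UIMA rather than as a standalone function.

```python
# Hypothetical mini-gazetteer: token sequence -> GeoNames-style feature class.
# The class labels were assigned by hand for illustration.
GAZETTEER = {
    ("Frankfurt", "am", "Main"): "P",
    ("Main",): "H",
    ("Taunus",): "T",
    ("Hessen",): "A",
}
MAX_LEN = max(len(key) for key in GAZETTEER)

def annotate_toponyms(tokens):
    """Return (start, end, surface, feature_class) spans, longest match first."""
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so that "Frankfurt am Main"
        # wins over the river name "Main".
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in GAZETTEER:
                match = (i, i + n, " ".join(candidate), GAZETTEER[candidate])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

tokens = "The Taunus lies north of Frankfurt am Main in Hessen".split()
for span in annotate_toponyms(tokens):
    print(span)
```

Exact string matching ignores ambiguity (the same name can denote several places) and inflection, both of which the actual annotator would have to address.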
Active Learning for TextAnnotator (M.Sc.)
Active learning (AL) serves as a method of supervised learning to increase the accuracy of classifiers to be trained. Through AL, the learner gains influence on the data it learns from by asking human experts to label selected data items. In this way, a higher performance should be achieved, especially for data that is difficult to classify. The goal of this thesis is to extend TextAnnotator with an AL component and to evaluate it with selected annotation tasks. Ideally, the work is based on ensemble methods such as CRFVoter or LSTMVoter. Thus, an ensemble learner is to be developed that integrates individual classifiers for annotation tasks and performs updates: the AL component continually generates new training and test examples, on which both the classifiers included in the ensemble and the ensemble learner are retrained. The planned architecture should be implemented generically so that it can be applied to different annotation tasks.
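The selection step of an ensemble-based AL loop can be sketched with query-by-committee and vote entropy: the items on which the ensemble members disagree most are passed to the human annotator. This is a minimal sketch of one common selection strategy, not the design required by the thesis. The rule-based classifiers are hypothetical stand-ins (they are not CRFVoter or LSTMVoter), and the names `vote_entropy` and `select_queries` are assumptions.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy of the label distribution among committee votes (0 = full agreement)."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def select_queries(unlabeled, committee, k=1):
    """Pick the k items on which the committee disagrees most."""
    scored = [(vote_entropy([clf(x) for clf in committee]), x) for x in unlabeled]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]

# Hypothetical stand-in classifiers for a toy task:
# is a token a location (LOC) or not (O)?
committee = [
    lambda tok: "LOC" if tok[0].isupper() else "O",
    lambda tok: "LOC" if tok.endswith("furt") or tok.endswith("burg") else "O",
    lambda tok: "LOC" if tok in {"Frankfurt", "Paris"} else "O",
]

pool = ["Frankfurt", "house", "Hamburg", "Monday"]
print(select_queries(pool, committee, k=2))
```

In a full loop, the labels returned by the annotator would be added to the training data, and both the individual classifiers and the ensemble learner would be retrained before the next selection round.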
Wikidition Meets TextImager: Interfacing Big NLP Data (M.Sc.)
TextImager is a framework for mapping the landscape of Natural Language Processing (NLP) tools and making them available to non-experts as well. It integrates tools, which are based on proprietary IO formats, for a range of languages and has a type system that makes these tools interoperable. In this way, the tools are organized in pipelines so that they benefit from each other. TextImager is a scalable distributed system that can handle big data. What is missing is a corresponding visualization component. Wikidition, a MediaWiki-based system for representing corpora as wikis, is suitable for this visualization. The task of the thesis is to improve TextImager as a platform for processing big linguistic data and to extend the visualization functionality of Wikidition. The following subtasks are involved: the database model of TextImager is to be optimized, Wikidition is to be integrated as an interface for the visualization of big linguistic data, and the overall system is to be documented. Required prior knowledge: Java, NoSQL, MediaWiki, interest in software engineering and software architectures.
Modeling Semantic Roles for Verb Sense Disambiguation (M.Sc.)
Verbs are used to describe states, events or processes; they form the syntactic and semantic core of sentences, whereby knowledge of their meanings is central to sentence and text comprehension. According to the Duden dictionary, verbs in German have on average more than two senses, so that their disambiguation is indispensable for making statements about the meaning of sentences and larger units such as texts. The TTLab has already done preliminary work on verb-sense disambiguation (VSD) for German. Among other things, it has created the largest VSD corpus for German to date. Each verb in this corpus is assigned a meaning representation. The verb meanings depend strongly on the arguments with which the verbs co-occur. The occurrences of the verbs can be assigned to the respective senses by means of their theta grids, i.e. via the lists of semantic roles associated with them (agent, patient, instrument, etc.). The aim of this thesis is to extend the TTLab VSD corpus by learning theta grids of verbs, so that more information is available for deep learning, which aims at VSD. Required prior knowledge: German as a native language, machine learning.
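How theta grids can separate verb senses may be illustrated with a simple overlap score between the roles observed in a sentence and the role list of each sense. The sketch below is an assumption-laden toy, not the TTLab approach: the sense inventory is hypothetical, uses an English verb for readability although the thesis targets German, and ranks senses by Jaccard overlap instead of a learned model.

```python
# Hypothetical sense inventory: sense id -> theta grid (set of semantic roles).
THETA_GRIDS = {
    "run.move": {"agent", "path"},
    "run.operate": {"agent", "patient"},
    "run.function": {"theme"},
}

def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_senses(observed_roles, grids):
    """Rank senses by overlap between observed roles and each theta grid."""
    observed = set(observed_roles)
    scores = {sense: jaccard(observed, grid) for sense, grid in grids.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# "She runs the company": an agent and a patient are observed.
print(rank_senses(["agent", "patient"], THETA_GRIDS)[0])
```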