The following master/bachelor thesis topics are currently offered by the Text Technology Lab. You are also welcome to bring in your own topics and ideas. If you are interested in one of them, please contact Alexander Mehler.
Situation-dependent clothing modeling of 3D avatars. (B.Sc., M.Sc.)
Description
If you want to generate a scene with people automatically, you have to pay special attention to their clothing. Clothing is not only a means to an end but is influenced by many different factors: for example, it can reflect a person’s social status, but it can also be culturally conditioned. It can also depend on the time of day, the person’s stage of life, and the locations visited. This thesis addresses exactly this topic. It builds on the work of Mr. Dächert, a UMA-based avatar editor for VAnnotatoR [1] that is realized in VR. In a bachelor thesis, you derive the (preferably orthogonal) constraints on which the choice of dress depends and create a corresponding resource for the UMA models, so that realistically dressed avatars can be generated. In a master thesis, these constraints are additionally to be recognized automatically in text descriptions and processed. UMA: https://github.com/umasteeringgroup/UMA
Multimodal word embeddings (B.Sc., M.Sc.)
Description
Usually, word embeddings are obtained from co-occurrence patterns within texts. Using annotated image databases, these embeddings can be enriched by embeddings of image features, giving rise to so-called multimodal embeddings. Within this broad topic, the candidate develops multimodal embeddings and compares them to unimodal ones on various tasks from computational linguistics, such as word sense disambiguation. Parameters to vary include the choice of image database, the segmentation of image annotations, image feature selection, the integration of verbal and visual embeddings, and the evaluation scenario.
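One simple way to integrate verbal and visual embeddings is weighted concatenation of the normalized vectors. The following is only a minimal sketch of that idea: the function names, the `weight` parameter, and the toy vectors are assumptions made for illustration and are not taken from any existing resource or from the lab's own tooling.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_vec, image_vec, weight=0.5):
    """Concatenate the normalized text and image vectors.

    `weight` (hypothetical parameter) scales the text part,
    `1 - weight` scales the image part.
    """
    text_part = [weight * x for x in l2_normalize(text_vec)]
    image_part = [(1.0 - weight) * x for x in l2_normalize(image_vec)]
    return text_part + image_part


def cosine(a, b):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Made-up toy vectors; real ones would come from a corpus and an image database.
text_emb = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
image_emb = {"cat": [0.8, 0.2, 0.0], "dog": [0.7, 0.3, 0.0], "car": [0.0, 0.1, 0.9]}

multimodal = {word: fuse(text_emb[word], image_emb[word]) for word in text_emb}
sim_cat_dog = cosine(multimodal["cat"], multimodal["dog"])
sim_cat_car = cosine(multimodal["cat"], multimodal["car"])
```

Concatenation keeps both modalities separately inspectable, which makes it a convenient baseline before trying learned fusion methods.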
Inference Engine on Spatial and Temporal Relations (M.Sc.)
Description
In order to infer new statements from a body of given statements, a so-called inference or reasoning engine is required. The candidate plans and implements an inference engine on temporal and spatial statements such as “is included in” and “earlier than”. A main challenge of this thesis is to account for the transitive inference patterns of temporal and spatial relations.
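Transitive inference can be illustrated with naive forward chaining over triples. This is a minimal sketch under assumed conventions: the triple format, the relation names `is_included_in` and `earlier_than`, and the example facts are hypothetical and only serve to show the inference pattern, not a prescribed design for the thesis.

```python
def transitive_closure(facts, transitive_relations):
    """Derive all statements that follow by transitivity.

    facts: set of (subject, relation, object) triples.
    transitive_relations: names of relations treated as transitive.
    Repeats the rule (a R b) and (b R c) => (a R c) until nothing new appears.
    """
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for (a, rel1, b) in list(inferred):
            if rel1 not in transitive_relations:
                continue
            for (c, rel2, d) in list(inferred):
                if rel2 == rel1 and b == c and (a, rel1, d) not in inferred:
                    inferred.add((a, rel1, d))
                    changed = True
    return inferred


# Hypothetical example statements.
facts = {
    ("Frankfurt", "is_included_in", "Hesse"),
    ("Hesse", "is_included_in", "Germany"),
    ("breakfast", "earlier_than", "lunch"),
    ("lunch", "earlier_than", "dinner"),
}

closure = transitive_closure(facts, {"is_included_in", "earlier_than"})
new_statements = closure - facts
```

Here `new_statements` contains the two derived triples, that Frankfurt is included in Germany and that breakfast is earlier than dinner. The naive loop is quadratic per pass, so a real engine would likely index statements by relation and subject.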
Authorship-related Word Embeddings (B.Sc., M.Sc.)
Description
Word embeddings are usually computed on large corpora. The larger and more diverse the corpora, the more diverse the aspects of the respective language that can be represented in the embedding space. In this thesis, embedding spaces for different authors will be calculated and analyzed separately. For this purpose, the candidate will pre-train word embeddings on large corpora and then specialize them on smaller corpora. Finally, the embedding spaces created for the different authors have to be evaluated and analyzed in the context of automatic authorship recognition.
Topic Models (B.Sc., M.Sc.)
Description
There are various methods for classifying texts according to topic. These include established models such as topic models based on Latent Dirichlet Allocation (LDA), but also newer methods such as text2ddc. In this thesis, the differences and similarities of these and related topic models will be analyzed and evaluated comparatively. The candidate will focus on combinations of these topic models to increase their overall performance.
Active Learning for TextAnnotator (M.Sc.)
Description
Active learning (AL) serves as a method of supervised learning to increase the accuracy of the classifiers to be trained. Through AL, the machine learning system gains influence on the data with which it learns by asking human experts about the results for selected data items. In this way, a higher performance should be achieved, especially for data that is difficult to classify. The goal of this thesis is to extend TextAnnotator with an AL component and to evaluate it with selected annotation tasks. Ideally, the work is based on ensemble methods such as CRFVoter or LSTMVoter. Thus, an ensemble learner is to be developed that integrates individual classifiers for annotation tasks and performs updates: the AL component continually generates new training and test examples, whereby both the classifiers included in the ensemble and the ensemble learner are retrained. The planned architecture should be programmed generically so that it can be applied to different annotation tasks.
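The selection step of such an AL component can be sketched with query-by-committee, a standard strategy that asks the expert about the items on which the ensemble members disagree most. This is an illustrative sketch only: it does not reproduce CRFVoter or LSTMVoter, and the threshold classifiers, labels, and data items are hypothetical stand-ins for real annotation models.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's label votes (0 means full agreement)."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_queries(unlabeled, committee, batch_size=1):
    """Return the items with the highest committee disagreement."""
    scored = [(vote_entropy([clf(x) for clf in committee]), x) for x in unlabeled]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:batch_size]]


def make_threshold_classifier(threshold):
    """Hypothetical toy classifier: labels a number by comparing it to a threshold."""
    def classify(x):
        return "POS" if x > threshold else "NEG"
    return classify


committee = [make_threshold_classifier(t) for t in (0.3, 0.5, 0.7)]
unlabeled = [0.1, 0.4, 0.9]

# Items to hand to the human expert next.
queries = select_queries(unlabeled, committee, batch_size=1)
```

In a full loop, the expert's answers for `queries` would be added to the training data, and both the individual classifiers and the ensemble learner would be retrained before the next selection round.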
Wikidition Meets TextImager: Interfacing Big NLP Data (M.Sc.)
Description
TextImager is a framework for mapping the landscape of Natural Language Processing (NLP) tools and also making them available to non-experts. It integrates tools, which are based on proprietary IO formats, for a range of languages and has a type system that makes these tools interoperable. In this way, the tools are organized in pipelines so that they benefit from each other. TextImager is a scalable distributed system that can handle big data. What is missing is a corresponding visualization component. Wikidition, a MediaWiki-based system for representing corpora as wikis, is suitable for this visualization. The task of the thesis is to improve TextImager as a platform for processing big linguistic data and to extend the visualization functionality of Wikidition. The following subtasks are involved: the database model of TextImager is to be optimized, Wikidition is to be integrated as an interface for the visualization of big linguistic data, and the overall system is to be documented. Required prior knowledge: Java, NoSQL, MediaWiki, interest in software engineering and software architectures.
Modeling Semantic Roles for Verb Sense Disambiguation (M.Sc.)
Description
Verbs are used to describe states, events or processes; they form the syntactic and semantic core of sentences, whereby knowledge of their meanings is central to sentence and text comprehension. According to the Duden dictionary, verbs in German have on average more than two senses, so that their disambiguation is indispensable for making statements about the meaning of sentences and larger units such as texts. The TTLab has already done preliminary work on verb-sense disambiguation (VSD) for German. Among other things, it has created the largest VSD corpus for German to date. Each verb in this corpus is assigned a meaning representation. The verb meanings depend strongly on the arguments with which the verbs co-occur. The occurrences of the verbs can be assigned to the respective senses by means of their theta grids, i.e. via the lists of semantic roles associated with them (agent, patient, instrument, etc.). The aim of this thesis is to extend the TTLab VSD corpus by learning theta grids of verbs, so that more information is available for deep learning, which aims at VSD. Required prior knowledge: German as a native language, machine learning.
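How theta grids can separate verb senses may be illustrated by matching the roles observed in a sentence against the grid of each sense. The sketch below is an assumption-laden toy: the sense labels, the grids, and the use of Jaccard overlap as a matching score are hypothetical and do not reflect the format of the TTLab VSD corpus.

```python
def jaccard(a, b):
    """Jaccard overlap of two role sets (1.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_sense(observed_roles, sense_grids):
    """Pick the sense whose theta grid overlaps most with the observed roles."""
    return max(sense_grids, key=lambda sense: jaccard(observed_roles, sense_grids[sense]))


# Hypothetical sense inventory for one verb, each sense with its theta grid.
sense_grids = {
    "open_physically": {"agent", "patient", "instrument"},
    "open_event": {"agent", "theme"},
}

# Roles assumed to have been identified for one occurrence of the verb.
chosen = best_sense({"agent", "patient"}, sense_grids)
```

In this toy case the occurrence is assigned to `open_physically`, since its grid shares two of three roles with the observation. A learned model would replace the fixed grids and the overlap score.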
Sample Bachelor Theses
[1]
A. Mehler, G. Abrami, C. Spiekermann, and M. Jostock, “VAnnotatoR: A Framework for Generating Multimodal Hypertexts,” in Proceedings of the 29th ACM Conference on Hypertext and Social Media, New York, NY, USA, 2018.
[Bibtex]
@InProceedings{Mehler:Abrami:Spiekermann:Jostock:2018,
author = {Mehler, Alexander and Abrami, Giuseppe and Spiekermann, Christian and Jostock, Matthias},
title = {{VAnnotatoR}: {A} Framework for Generating Multimodal Hypertexts},
booktitle = {Proceedings of the 29th ACM Conference on Hypertext and Social Media},
series = {Proceedings of the 29th ACM Conference on Hypertext and Social Media (HT '18)},
year = {2018},
location = {Baltimore, Maryland},
publisher = {ACM},
address = {New York, NY, USA},
pdf = {http://delivery.acm.org/10.1145/3210000/3209572/p150-mehler.pdf}
}