On this page you can find a series of potential topics for undergraduate qualification theses (Bachelor / Master) and research papers as well as opportunities for participation in research projects.
Please note that these suggestions are only a selection; you are welcome to propose topics of your own. If you are interested and require further information, please contact the responsible staff member or arrange an appointment directly with Prof. Dr. Alexander Mehler.
In addition, we offer a free mailing list through which we regularly announce new qualification and research topics as well as other news relating to Texttechnology.
2025
Master Thesis: Can Adversarial Text Snippets Achieve Refusal Dimension Deletion?
Description
The threat of abuse by determined adversaries makes the safety of public-facing LLMs a key priority for developers and researchers alike. Despite intensive efforts, recent research shows that "refusal in language models [may be] mediated by a [one-dimensional subspace in the model's weights]" (Arditi et al., 2024) and that adversarial algorithms can generate text snippets that circumvent harmful-response prevention in open- and closed-source LLMs (Zou et al., 2023). This raises the question of whether these two methods of "jailbreaking" LLMs align, i.e. whether adversarially generated text segments can shift a model's hidden states into a position that effectively approaches refusal dimension deletion.
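One way to make this question measurable is to compare how much of a hidden state lies along the refusal direction before and after an adversarial suffix is appended, and to contrast this with directional ablation, which (as we understand Arditi et al., 2024) removes that component entirely via h' = h - (h · r̂) r̂. The following Python sketch illustrates the idea on toy vectors. All numbers, the three-dimensional space, and the function names are assumptions made for illustration; real hidden states would have to be read out of a model.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]

def projection(h, r):
    """Signed length of the hidden state h along the direction r."""
    r_hat = unit(r)
    return sum(a * b for a, b in zip(h, r_hat))

def ablate(h, r):
    """Remove the component of h along r: h - (h . r_hat) r_hat."""
    r_hat = unit(r)
    p = sum(a * b for a, b in zip(h, r_hat))
    return [a - p * b for a, b in zip(h, r_hat)]

# Toy values, invented for illustration only.
refusal_dir = [0.0, 2.0, 0.0]   # hypothetical refusal direction
h_plain = [1.0, 3.0, -1.0]      # hidden state for a harmful prompt
h_suffix = [1.0, 0.5, -1.0]     # same prompt plus an adversarial suffix

h_ablated = ablate(h_plain, refusal_dir)
print(projection(h_plain, refusal_dir))    # refusal component without the suffix
print(projection(h_suffix, refusal_dir))   # refusal component with the suffix
print(projection(h_ablated, refusal_dir))  # refusal component after ablation
```

If the suffix drives the projection close to the ablated value while leaving the remaining components largely unchanged, the two jailbreak methods would appear to align in this simplified sense.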
Related Work
- Arditi et al., 2024, Refusal in Language Models Is Mediated by a Single Direction
- Mazeika et al., 2024, HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
- Zou et al., 2023, Universal and Transferable Adversarial Attacks on Aligned Language Models
- Chao et al., 2023, Jailbreaking Black Box Large Language Models in Twenty Queries
Corresponding Lab Member:
Master Thesis: Enhancing Audio Transcription with Visual Cues: A Multimodal Approach Utilizing Lip Movements and Facial Expressions for German Language Applications.
Description
Accurate audio transcription remains a challenge in environments with background noise, low-quality recordings, or overlapping speech. While significant progress has been made using audio-only approaches powered by deep learning and automatic speech recognition (ASR) systems (Graves et al., 2013), such methods often fail in adverse acoustic conditions. This thesis proposes the design and implementation of a multimodal transcription tool that integrates visual information, such as lip movements and facial expressions, to improve transcription accuracy, with a focus on adapting this approach to the German language. The proposed tool leverages the correlation between spoken words and their associated visual signals, such as lip shape dynamics and facial expressions, to improve the decoding of ambiguous or misinterpreted audio signals (Chung et al., 2017). It combines deep learning-based audio and video models to refine transcription results (Afouras et al., 2018). Existing datasets will form the basis for training and testing. For English, datasets such as LRS3 (Afouras et al., 2018) will be used. For German, the GLips (German Lips) dataset (Zöllner et al., 2022) provides extensive video data suitable for word-level lip-reading research. An important sub-task is the fine-tuning of existing pre-trained models for German-specific linguistic and phonetic features. This requires transfer learning techniques to adapt models trained on English datasets to German phoneme distributions, articulatory patterns, and grammatical structures. Experimental evaluation will measure transcription accuracy in both English and German, especially under noisy conditions, to quantify the advantages of the multimodal approach. 
This work aims to advance multilingual ASR systems by demonstrating the benefits of integrating audiovisual data for transcription. The results are expected to show how effectively existing datasets and adapted pre-trained models can be combined to improve transcription accuracy in real-world scenarios.
- Graves et al., 2013
- Chung et al., 2017
- Afouras et al., 2018
- Zöllner et al., 2022
Corresponding Lab Member:
Master Thesis: Unlocking Wikipedia for Research: A Modular Toolkit for Structured NLP Applications.
Description
Wikipedia serves as a vast and diverse resource that is widely used in research domains to address a variety of tasks and questions. However, its size, semi-structured form, inconsistent formatting, and noisy elements (e.g., infoboxes) pose significant challenges to its accessibility and usability in structured research applications. This thesis aims to develop a comprehensive framework to overcome these challenges and enable researchers to effectively use Wikipedia's content for NLP and other structured research purposes. The proposed work focuses on the design of a modular, database-driven toolkit that supports the local use of Wikipedia for NLP processing. Key objectives include exploring existing tools and databases, integrating Wikidata, and leveraging different database solutions to address different use cases. Specific tasks include selecting and evaluating databases, designing database schemas, processing Wikipedia dump files as source data, and implementing robust mechanisms for data extraction, parsing (e.g., Wikitext), and updating. Additional challenges such as constructing category and social graphs, managing interlanguage links, handling revisions, and integrating DUUI (Docker Unified UIMA Interface) will also be addressed. The goal of this thesis is to provide a practical toolkit for researchers that facilitates the effective and flexible use of Wikipedia's content for a wide range of applications. See also:
- WikiDragon: A Java Framework For Diachronic Content And Network Analysis Of MediaWikis
- mwparserfromhtml
- WikiExtractor
- Wikidata Query Service/User Manual
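To give an impression of the Wikitext parsing task mentioned in the description, the following Python sketch extracts internal links and category assignments from a snippet of wikitext with a regular expression. It is a minimal illustration only: the function name `extract_links` and the sample text are made up, and templates, file links with nested captions, and category sort keys are deliberately not handled. A toolkit as envisaged here would more likely build on a dedicated parser.

```python
import re

# [[Target]], [[Target|label]] and [[Target#Section|label]]
LINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|([^\[\]]*))?\]\]")

def extract_links(wikitext):
    """Split internal links into article links and category assignments."""
    links, categories = [], []
    for match in LINK.finditer(wikitext):
        target = match.group(1).strip()
        label = (match.group(2) or target).strip()
        if target.startswith("Category:"):
            categories.append(target[len("Category:"):])
        else:
            links.append((target, label))
    return links, categories

# Made-up sample in wikitext syntax.
sample = (
    "'''Frankfurt''' is a city in [[Hesse]], [[Germany|the Federal Republic]]. "
    "See [[Main (river)#Course|the Main]].\n[[Category:Cities in Hesse]]"
)
links, categories = extract_links(sample)
print(links)
print(categories)
```

The resulting (target, label) pairs are the raw material for the link and category graphs named above; resolving redirects and interlanguage links would require additional data from the dump files.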
Corresponding Lab Member:
Bachelor Thesis: Development of an HTML Parser for Efficient Extraction of Search Engine Results.
Description
The exponential growth of online information has made search engines indispensable tools for accessing relevant data. Search engines such as Google, Bing, and Yandex generate results that serve a variety of needs, from academic research to commercial applications. However, accessing and analyzing these results often requires parsing the underlying HTML code of the search results pages. This thesis investigates the design and implementation of an HTML parser capable of extracting, structuring, and analyzing search engine results in a reliable and efficient manner. The goal of this project is to develop a robust HTML parser tailored for extracting search results from multiple search engines, while addressing challenges such as dynamic content loading, anti-scraping measures, and variations in HTML structures. The parser will identify key elements such as titles, URLs, snippets, and metadata, standardize the extracted data into a consistent format, and output it for further analysis or integration with other systems. The implementation involves a combination of web scraping libraries, regular expressions, and advanced parsing techniques, with an emphasis on handling dynamic web content rendered through JavaScript. The project also addresses ethical and legal considerations related to web scraping, and proposes mechanisms for compliance with search engine terms of service and applicable data usage regulations. The developed parser will be evaluated based on its accuracy, speed, and adaptability to changes in search engine HTML structures. Performance benchmarks and use cases, such as competitive analysis and data aggregation, will be presented to demonstrate the utility and versatility of the system. The outcome of this thesis aims to contribute to the fields of data mining and web technologies by providing a fundamental tool for generically accessing and leveraging search engine data.
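The extraction step described above can be sketched with Python's standard `html.parser` module. The markup in `SAMPLE` (a `div` with class `result`, a link, and a `p` with class `snippet`) is an invented layout used purely for illustration; real search engines use different and frequently changing structures, and JavaScript-rendered content would need a rendering step before parsing. The class and function names are likewise assumptions.

```python
from html.parser import HTMLParser

class ResultParser(HTMLParser):
    """Collect title, URL and snippet from a hypothetical result-page layout."""

    def __init__(self):
        super().__init__()
        self.results = []      # finished records
        self._current = None   # record currently being filled
        self._field = None     # open text field: "title" or "snippet"

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "result" in classes:
            self._current = {"title": "", "url": "", "snippet": ""}
        elif self._current is not None and tag == "a" and not self._current["url"]:
            self._current["url"] = attrs.get("href") or ""
            self._field = "title"
        elif self._current is not None and tag == "p" and "snippet" in classes:
            self._field = "snippet"

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "p"):
            self._field = None
        elif tag == "div" and self._current is not None:
            # Simplification: assumes no nested <div> inside a result block.
            self._current = {k: v.strip() for k, v in self._current.items()}
            self.results.append(self._current)
            self._current = None

def parse_results(html):
    """Return a list of {"title", "url", "snippet"} records."""
    parser = ResultParser()
    parser.feed(html)
    parser.close()
    return parser.results

SAMPLE = """
<div class="result">
  <a href="https://example.org/a">First hit</a>
  <p class="snippet">Short description of the first hit.</p>
</div>
<div class="result">
  <a href="https://example.org/b">Second hit</a>
  <p class="snippet">Short description of the second hit.</p>
</div>
"""

for record in parse_results(SAMPLE):
    print(record)
```

Keeping the engine-specific selectors in one place, as in `handle_starttag` here, is one possible way to limit the effort when a search engine changes its HTML structure.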
Corresponding Lab Member:
Bachelor Thesis: Multimodal data integration and processing in DUUI.
Description
The Docker Unified UIMA Interface (DUUI) is a tool designed for the automated analysis of large corpora using a variety of NLP tools. Currently, DUUI supports the processing of text, audio, and video data. To extend its capabilities, additional support for multimodal data, such as that provided by Va.Si.Li-Lab – which includes motion data, object interaction data, and more – should be integrated into DUUI. All integrated data will need to be linked through a new type system tailored to each modality. Furthermore, processes such as motion detection must be incorporated to effectively process and analyze these new data types within DUUI. Bachelor's and Master's theses are invited to explore this multimodal model extension and integration. References:
- Unlocking the Heterogeneous Landscape of Big Data NLP with DUUI
- Towards grounding multimodal semantics in interaction data with Va.Si.Li-Lab
- A Multimodal Data Model for Simulation-Based Learning with Va.Si.Li-Lab
- Va.Si.Li-Lab as a Collaborative Multi-User Annotation Tool in Virtual Reality and Its Potential Fields of Application
Corresponding Lab Member:
Bachelor Thesis: Bridging the Gap Between Virtual Environments and Reality.
Description
Virtual Reality (VR) enables immersive user experiences by providing highly realistic environments and interactions, particularly with advances in hand, eye, and face tracking. These technologies enhance engagement and facilitate more natural communication, effectively reducing the perceived physical distance between users. However, most virtual meeting environments remain entirely synthetic, disconnected from the physical spaces of users. Despite ongoing improvements in realistic digital avatars (e.g., MetaHumans), the creation and accessibility of authentic virtual environments remain limited. To address this, we propose a novel approach using real-time photogrammetry to reconstruct physical spaces in VR accurately. This method enables users to virtually visit each other's physical environments, seamlessly blending virtual and real spaces, thereby narrowing the gap between digital and physical interactions. Bachelor's and Master's theses are invited to experiment with and evaluate these emerging technologies. See also:
- NeuralRecon: Real-Time Coherent 3D Reconstruction from Monocular Video
- NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
- MetaHuman
- A Multimodal Data Model for Simulation-Based Learning with Va.Si.Li-Lab
- Va.Si.Li-Lab
Corresponding Lab Member:
Bachelor Thesis: Affiliation of Speech and Gesture through LLMs.
Description
Most "referential" gestures have a docking point in accompanying speech, known as the lexical affiliate. This bachelor’s thesis leverages this empirical fact to utilize large language models (LLMs) for gesture annotation. Each occurrence of a referential gesture in a multimodal dataset is presented to an LLM, which is tasked with identifying the corresponding affiliate expression in speech. Through this process, a gesture interpretation is derived. Additionally, the approach aims to detect gestures that lack an overt affiliate. Building on the strong performance of LLMs in handling bridging relations, the thesis proposes a frame-based interpretation for such gestures. This work makes a central topic of multimodal communication accessible to modern computational techniques, provides quantitative insights into speech-gesture affiliation, and lays the foundation for further gesture classifications.
Corresponding Lab Member:
Master Thesis: Aristotelian Modification of Nominals.
Description
The standard semantics of noun-modifying adjectives is typically explained in terms of set membership in one way or another. Modern theories often incorporate scales, particularly for measure adjectives. This master's thesis will generalize such approaches by employing more general property spaces, which can be conceptualized as accidental qualities, a notion derived from Aristotle’s linguistic work. The accidental qualities of nominals will be determined by clustering adjectives from large corpora, thereby enriching lexical entries. This thesis complements computational linguistic research on the generative lexicon, has relevance for multimodal speech-gesture integration, and offers a novel perspective on the metaphoric use of adjectives.
Corresponding Lab Member:
Bachelor Thesis: A Comparative Study of Methodologies Used to Identify Human vs. Automatically Generated Text.
Description
With the advent of large language models such as ChatGPT, ethical concerns have grown, highlighting the need for methods that distinguish human-written from automatically generated text. Such detection models are becoming increasingly popular but remain underexplored and not well established. A study is needed to provide an overview of existing work in this area and to evaluate its usefulness. Bachelor's and Master's theses are invited to explore this field through a comparative approach by reimplementing and testing a range of established methods. References:
Corresponding Lab Member:
Bachelor Thesis: How Does Language Bias Affect Pretrained Language Models?
Description
Does language bias exist in pretrained large language models, such as those trained using a masked language modeling objective? What are the core components of these models that tend to produce this bias? Language bias refers to the tendency of multilingual models to prefer answering or selecting responses (e.g., in question-answering or information retrieval tasks) in the same language as the query, even when more likely candidate answers are available in other languages. What are the primary causes of this behavior? Are they linguistic, embedded in the training objective, or influenced by the loss function? These questions remain unresolved. Bachelor's and Master's theses are invited to explore these or related questions. References:
Corresponding Lab Member:
Bachelor Thesis: Exploring Pretrained Retrievers and Embedding-Based Search for Accurate Book Metadata Retrieval in RAG Pipelines.
Description
Retrieving accurate book metadata is essential for enhancing the performance of Retrieval-Augmented Generation (RAG) pipelines. This project explores modern, non-heuristic approaches to metadata retrieval, focusing on the use of pretrained retrievers and embedding-based similarity search. Instead of relying on manually crafted heuristics, these methods leverage embeddings generated by state-of-the-art models to identify the most relevant metadata and associated texts. The experiment will utilize large indexed corpora, such as Wikipedia and online library databases, to evaluate the efficacy of pretrained retrievers and embedding similarity for matching input metadata with incomplete or ambiguous information. The project will involve indexing metadata and textual content from publicly available sources (e.g., Open Library, Google Books, Wikipedia) using vector-based search frameworks. Pretrained models, such as dense retrievers (e.g., DPR, SentenceTransformers), will be used to generate embeddings for both input metadata and indexed corpora. The results will be compared to traditional heuristic-based methods to evaluate retrieval accuracy, scalability, and adaptability to incomplete metadata scenarios. This research addresses a significant bottleneck in RAG pipelines, where retrieval systems must efficiently integrate external knowledge to improve language model performance in answering specific queries. While this study focuses on bibliographic data, the proposed methods are generalizable and applicable to other domains requiring accurate and scalable metadata retrieval. The outcomes will provide insights into the trade-offs between heuristic and non-heuristic approaches and contribute to advancing metadata retrieval techniques for knowledge-intensive NLP tasks. References:
- Dense Passage Retrieval for Open-Domain Question Answering
- Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks
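The embedding-based matching described in this topic can be illustrated with a small, self-contained Python sketch. Please note that `embed` is only a toy stand-in (hashed character trigrams) for a pretrained encoder such as DPR or a SentenceTransformers model, chosen so that the example runs without external dependencies. The dimension, the function names, and the three catalogue records are assumptions made for illustration.

```python
import math
import zlib

DIM = 256  # size of the toy embedding space (arbitrary choice)

def embed(text):
    """Toy stand-in for a pretrained encoder: hashed character trigrams, L2-normalised."""
    vec = [0.0] * DIM
    padded = f"  {text.lower()}  "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u, v):
    """Cosine similarity of two already normalised vectors."""
    return sum(a * b for a, b in zip(u, v))

def search(query, index, top_k=3):
    """Rank indexed records by similarity to the query text."""
    q = embed(query)
    scored = [(cosine(q, vec), record) for record, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Small hand-written catalogue, for illustration only.
records = [
    "Faust. Eine Tragödie, Johann Wolfgang von Goethe, 1808",
    "Die Verwandlung, Franz Kafka, 1915",
    "Buddenbrooks, Thomas Mann, 1901",
]
index = [(record, embed(record)) for record in records]

# Incomplete input metadata: no article, no first name, no year.
for score, record in search("Kafka Verwandlung", index):
    print(f"{score:.3f}  {record}")
```

In an actual experiment, `embed` would be replaced by a pretrained model and the list-based index by a vector search framework; the ranking logic itself would stay largely the same.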
Corresponding Lab Member:
Bachelor Thesis: Developing a Heuristic for Retrieving Specific Book Metadata in Retrieval-Augmented Generation (RAG) Pipelines.
Description
Accurate retrieval of book metadata is a critical challenge in the development of Retrieval-Augmented Generation (RAG) pipelines. This project aims to develop a heuristic-based procedure for retrieving the most valid metadata - and potentially the text - of books from various online library databases using publicly available APIs. These databases contain large collections of book records, often with incomplete or inconsistent metadata. This makes querying for and matching a specific publication a complex task, especially when the input metadata is itself incomplete. The procedure will address cases where multiple books share similar metadata, such as the same title and author, but belong to different editions or publications. The proposed heuristic will analyze and rank the results of API queries to identify the best match for the input data. The approach involves a detailed study of metadata patterns in online libraries and the development of robust matching criteria that account for variations and gaps in the data. This work contributes to an emerging area in natural language processing where RAG pipelines rely on external knowledge sources to augment large language models (LLMs) with domain-specific information. By addressing the challenge of metadata retrieval, this project will improve the accuracy and reliability of downstream tasks, such as answering questions about specific books. Although the focus of this work is on bibliographic data, the developed heuristic has the potential to be generalized for metadata retrieval in other domains. The outcome of this project will be a validated methodology that can be integrated into RAG pipelines, representing a significant step forward in leveraging external databases for high-quality contextual information retrieval. References:
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Heuristic Approaches to Metadata Matching in Digital Libraries
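A ranking heuristic of the kind described in this topic might, as a first approximation, score each API result by a weighted string similarity over the fields present in the input. The Python sketch below illustrates this with the standard `difflib` module. The field names, the weights, and the candidate records are assumptions made for illustration and do not correspond to any particular library API.

```python
from difflib import SequenceMatcher

# Field weights are illustrative guesses, not tuned values.
WEIGHTS = {"title": 0.5, "author": 0.3, "year": 0.1, "publisher": 0.1}

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def score(query, candidate):
    """Weighted similarity over the fields present in the query.

    Fields missing from the query are skipped and the remaining weights are
    renormalised, so incomplete input metadata is not penalised.
    """
    total, weight_sum = 0.0, 0.0
    for field, weight in WEIGHTS.items():
        wanted = query.get(field)
        if not wanted:
            continue
        weight_sum += weight
        total += weight * similarity(str(wanted), str(candidate.get(field, "")))
    return total / weight_sum if weight_sum else 0.0

def rank(query, candidates):
    """Return candidates sorted from best to worst match."""
    return sorted(candidates, key=lambda c: score(query, c), reverse=True)

# Hypothetical API results: same title and author, different editions.
candidates = [
    {"title": "Der Zauberberg", "author": "Thomas Mann", "year": "1991", "publisher": "Fischer Taschenbuch"},
    {"title": "Der Tod in Venedig", "author": "Thomas Mann", "year": "1912", "publisher": "Hyperion"},
    {"title": "Der Zauberberg", "author": "Thomas Mann", "year": "1924", "publisher": "S. Fischer"},
]
query = {"title": "Zauberberg", "author": "Mann, Thomas", "year": "1924"}

best = rank(query, candidates)[0]
print(best["year"], best["publisher"])
```

Here the year is what separates the two editions with identical title and author. A thesis would need to study which fields actually discriminate in real library data and replace the guessed weights accordingly.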
Corresponding Lab Member:
Bachelor/Master Thesis: Live VR Experiment Visualisation.
Description
Va.Si.Li-Lab is a virtual reality-based system designed for tracking and analyzing interpersonal communication by integrating extensive tracking capabilities, such as hand, face, and eye movements, alongside audio data. It enables controlled multi-user scenarios, allowing researchers to assign roles, impose modality-specific restrictions, and analyze communication behavior through aligned multimodal data stored in a central database. However, it is difficult both to monitor the progress of a running experiment and to process the recorded data meaningfully afterwards. This thesis addresses the processing and visualisation of the tracked data, both live and after the experiment.
See also:
Bachelor/Master Thesis: Multimodal VR Data Meets DUUI.
Description
The processing of large and extensive unstructured corpora is a constant challenge for various scientific disciplines. For this purpose, the Docker Unified UIMA Interface (DUUI) was developed, which provides NLP analysis methods based on container services to perform horizontally and vertically distributed big data analyses in a unified, standardized, reusable, and schema-based process. The first steps towards multimodality have already been taken. The task of this thesis is to adapt DUUI processing so that it can also handle multimodal data collected in VR experiments. The main difficulty lies in the alignment of speech, transcription, and movements.
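The alignment problem can be illustrated with a simple time-based join: each transcribed word carries a start and end time, and every motion sample recorded within that span is attached to it. The Python sketch below is a minimal illustration under the assumption that both streams share one clock and that the motion stream is sorted by time. The data layout, the tolerance value, and the function names are hypothetical and are not part of DUUI or Va.Si.Li-Lab.

```python
from bisect import bisect_left

def nearest_index(times, t):
    """Index of the value in the sorted list `times` that is closest to t."""
    i = bisect_left(times, t)
    if i == 0:
        return 0
    if i == len(times):
        return len(times) - 1
    return i if times[i] - t < t - times[i - 1] else i - 1

def align(tokens, motion, tolerance=0.05):
    """Attach to each token the motion samples recorded during it.

    tokens: list of (start_s, end_s, text) from a timestamped transcript
    motion: list of (time_s, payload), sorted by time_s
    A sample belongs to a token if start_s <= time_s < end_s. Tokens without
    such a sample fall back to the nearest sample if it lies within
    `tolerance` seconds of the token start; otherwise they get an empty list.
    """
    times = [t for t, _ in motion]
    aligned = []
    for start, end, text in tokens:
        lo = bisect_left(times, start)
        hi = bisect_left(times, end)
        samples = [payload for _, payload in motion[lo:hi]]
        if not samples and times:
            j = nearest_index(times, start)
            if abs(times[j] - start) <= tolerance:
                samples = [motion[j][1]]
        aligned.append((text, samples))
    return aligned

# Invented example: a 10 Hz hand-tracking stream and three transcribed words.
motion = [(round(0.1 * k, 1), f"hand_pose_{k}") for k in range(10)]
tokens = [(0.00, 0.25, "look"), (0.25, 0.31, "at"), (0.93, 1.20, "this")]

for text, samples in align(tokens, motion):
    print(text, samples)
```

In practice the streams will rarely share a clock or a sampling rate, so clock synchronisation and resampling are likely to be a substantial part of the thesis work.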
See also:
Master Thesis: Natural Human Interactions with LLMs via Audio.
Description
Natural spoken conversation is the norm between people, and it is in principle also possible with large language models (LLMs): human speech is converted to text, which serves as input for the LLM, and the output of the LLM is then converted back to audio. However, due to latency and the nature of audio output, it remains a major challenge to integrate a chatbot that communicates naturally in both text and audio without human interlocutors noticing the delay, especially in multilingual environments. Bachelor's or Master's theses that address these latency issues are therefore invited. See also:
Corresponding Lab Member:
Bachelor Thesis: The emperor's new clothes alias TextAnnotator's new and responsive interface.
Description
TextAnnotator, a web-based tool for platform-independent, simultaneous, and collaborative semi-automated annotation of unstructured corpora based on UIMA, is a flexible and feature-rich solution for annotating various linguistic and semantic features using multiple annotation views and tools. However, TextAnnotator is currently implemented at the visual interface level using an older version of ExtJS, which needs to be upgraded to a modern interface. This upgrade is necessary in the short to medium term to enable the creation and implementation of more modular and interchangeable components. Bachelor's or Master's theses are invited that aim to develop and test new interfaces to enhance TextAnnotator's versatility and attractiveness by leveraging modern web interface technologies. See also:
- TextAnnotator: A UIMA Based Tool for the Simultaneous and Collaborative Annotation of Texts
- Unleashing annotations with TextAnnotator: Multimedia, multi-perspective document views for ubiquitous annotation
- TextAnnotator: A flexible framework for semantic annotations
Corresponding Lab Member:
Bachelor Thesis: Diversification of the container landscape for DUUI.
Description
The processing of large and extensive unstructured corpora remains a significant challenge for various scientific disciplines. To address this, the Docker Unified UIMA Interface (DUUI) was developed. DUUI provides NLP methods through container services to perform horizontally and vertically distributed big data analysis in a unified, standardized, reusable, and schema-based process. In the medium to long term, DUUI can leverage a variety of container services to implement optimal processing solutions tailored to specific scenarios and environmental parameters. This involves the creation, implementation, and evaluation of container services for DUUI that have not yet been integrated. Bachelor's or Master's theses are invited to address this task of services integration. See also:
- Unlocking the Heterogeneous Landscape of Big Data NLP with DUUI
- Efficient, uniform and scalable parallel NLP pre-processing with DUUI: Perspectives and Best Practice for the Digital Humanities
Corresponding Lab Member:
Bachelor Thesis: Retrieval-Augmented Generation (RAG): Synthesizing Knowledge from Large Corpora.
Description
The increase of textual data in scientific and other domains has created an urgent need for tools that can efficiently retrieve accurate information from large corpora. Can large language models help researchers identify critical information - metaphorically, "needles in a haystack"? This research explores Retrieval-Augmented Generation (RAG) as a framework for proposing pipelines and models capable of locating specific units of information in response to user queries. Crucially, this approach avoids the need for explicit fine-tuning of large language models on domain-specific data. Instead, it emphasizes techniques such as prompt engineering, advanced data retrieval mechanisms, and innovative query formulation. Possible methodologies include the use of embedding spaces, graph databases, or hybrid architectures to improve retrieval accuracy and synthesis capabilities. Bachelor's or Master's theses are invited to contribute novel solutions to this interdisciplinary challenge. See also: OPEN SCHOLAR: SYNTHESIZING SCIENTIFIC LITERATURE WITH RETRIEVAL-AUGMENTED LMS; CCC-BERT | Kaggle
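The basic RAG loop without fine-tuning, i.e. retrieving passages and placing them in the prompt, can be sketched as follows. This Python example uses a simple TF-IDF-style term weighting as a stand-in for the embedding, graph, or hybrid retrieval mentioned above, and it stops at prompt construction: no language model is called. The function names, the prompt wording, and the three-sentence corpus (paraphrased from this page) are assumptions made for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case word tokens."""
    return re.findall(r"\w+", text.lower())

def build_index(passages):
    """Per-passage term counts plus inverse document frequencies."""
    counts = [Counter(tokenize(p)) for p in passages]
    df = Counter(term for c in counts for term in c)
    idf = {term: math.log(len(passages) / n) + 1.0 for term, n in df.items()}
    return counts, idf

def retrieve(query, passages, counts, idf, top_k=2):
    """Return up to top_k passages with a non-zero score for the query terms."""
    terms = tokenize(query)
    scores = [sum(c[t] * idf.get(t, 0.0) for t in terms) for c in counts]
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in order[:top_k] if scores[i] > 0]

def build_prompt(query, context):
    """Assemble a prompt that asks the model to answer from the context only."""
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(context))
    return (
        "Answer the question using only the passages below. "
        "Cite passage numbers.\n\n"
        f"{numbered}\n\nQuestion: {query}\nAnswer:"
    )

# Mini-corpus for illustration only.
passages = [
    "DUUI distributes NLP pipelines across container services.",
    "TextAnnotator supports collaborative annotation in the browser.",
    "Va.Si.Li-Lab records hand, face and eye tracking data in VR.",
]
counts, idf = build_index(passages)
query = "Which system records eye tracking data?"
prompt = build_prompt(query, retrieve(query, passages, counts, idf))
print(prompt)
```

The resulting prompt would then be passed to a language model of choice. Comparing different retrieval back ends behind the same `retrieve` interface is one possible experimental design for such a thesis.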
Corresponding Lab Member:
Bachelor Thesis: Bringing Order to Chaos: Structuring Unstructured Documents.
Description
The increasing volume and diversity of text corpora generated daily from various sources poses significant challenges for their processing and analysis. Generic tools that facilitate the exploration and understanding of such corpora in a standardized and intuitive manner are rare. A key issue is the transformation of unstructured plain text into generically structured formats that allow efficient reading, sorting, and searching. This task aims to develop models or algorithms that can process raw, unstructured text and produce structured outputs based on predefined rules, algorithms, or models. These outputs should be compatible with the Docker Unified UIMA Interface (DUUI), our general purpose corpus annotation tool. The structured format must also comply with the UIMA (Unstructured Information Management Architecture) standard. Bachelor's or Master's theses are invited to explore this topic and contribute to the broader goal of making diverse textual corpora more accessible and manageable. See also: Unlocking the Heterogeneous Landscape of Big Data NLP with DUUI - ACL Anthology
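As a rough illustration of a rule-based variant of this task, the following Python sketch turns raw text into paragraph and sentence spans with character offsets, i.e. stand-off annotations in the spirit of UIMA, where annotations point into the unchanged text via begin and end positions. It does not use the UIMA or DUUI APIs; the `Span` class and the segmentation rules are simplifying assumptions, and the naive sentence rule will, for example, split at abbreviations.

```python
import re
from dataclasses import dataclass

@dataclass
class Span:
    kind: str   # "paragraph" or "sentence"
    begin: int  # character offset, inclusive
    end: int    # character offset, exclusive

def trimmed(text, begin, end):
    """Shrink [begin, end) so that it has no leading or trailing whitespace."""
    while begin < end and text[begin].isspace():
        begin += 1
    while end > begin and text[end - 1].isspace():
        end -= 1
    return begin, end

def segment(text):
    """Rule-based paragraph and sentence spans over raw text."""
    spans = []
    # Paragraphs: runs of text that do not contain an empty line.
    for para in re.finditer(r"(?:[^\n]|\n(?!\n))+", text):
        p_begin, p_end = trimmed(text, para.start(), para.end())
        if p_begin == p_end:
            continue
        spans.append(Span("paragraph", p_begin, p_end))
        # Sentences: text up to and including closing punctuation (naive rule).
        for sent in re.finditer(r"[^.!?]+[.!?]*", text[p_begin:p_end]):
            s_begin, s_end = trimmed(text, p_begin + sent.start(), p_begin + sent.end())
            if s_begin < s_end:
                spans.append(Span("sentence", s_begin, s_end))
    return spans

raw = "First sentence. Second one!\n\nNew paragraph here."
for span in segment(raw):
    print(span.kind, span.begin, span.end, repr(raw[span.begin:span.end]))
```

Because the spans only reference offsets, the original text stays untouched, which is what makes such output straightforward to map onto a UIMA type system in a later step.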
Corresponding Lab Member: