
Research assistant
Goethe-Universität Frankfurt am Main
Robert-Mayer-Straße 10
Room 401e
D-60325 Frankfurt am Main
D-60054 Frankfurt am Main (use for package delivery)
Postfach / P.O. Box: 154
Thesis topic proposals
2025
Master Thesis: Streaming Multimodal Understanding with Incremental Prediction and Uncertainty Modeling.
Description
Multimodal signals such as video and audio are ordered chronologically. Processing these signals requires systems capable of incremental prediction and dynamic uncertainty assessment. This master's thesis involves implementing and testing a framework that tracks events or states in real time, predicts them, interprets intermodal interactions, and provides confidence estimates. The proposed topic focuses on multimodal stream learning, in which models update their internal representations continuously as new data arrives, without access to the complete sequence. Unlike offline approaches, streaming systems process asynchronous, context-dependent signals where the same signal can have different meanings depending on accompanying visual or acoustic cues. For instance, the greeting "hello" (audio) accompanied by a smile (image) may imply friendliness, whereas the same greeting accompanied by a clenched fist (image) may imply a threat. This project has many potential applications, one of which is emotion recognition in video streams. Tasks include specifying a streaming-compatible application, reviewing the relevant literature, implementing a basic system, and testing extensions such as online cross-modal fusion, temporal attention, and uncertainty-aware prediction layers.
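To give a feeling for what "incremental prediction with uncertainty" means in practice, the following minimal sketch tracks a per-frame signal (e.g. an emotion score) with Welford's online algorithm: each observation updates the state in O(1), without access to the complete sequence, and the uncertainty estimate shrinks as evidence accumulates. The class name and the use of the standard error as an uncertainty proxy are illustrative assumptions, not part of the proposal.

```python
import math


class StreamingEstimator:
    """Incrementally track the mean and variance of a streamed signal
    (Welford's online algorithm). Each update is O(1) and never re-reads
    earlier observations -- the core requirement of stream learning."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Fold one new observation into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; undefined for fewer than two observations.
        return self._m2 / (self.n - 1) if self.n > 1 else float("inf")

    def predict(self):
        """Return the current estimate together with an uncertainty band
        (standard error of the mean), which shrinks as data accumulates."""
        if self.n > 1:
            stderr = math.sqrt(self.variance / self.n)
        else:
            stderr = float("inf")
        return self.mean, stderr


# Feed per-frame scores as they arrive from the stream:
est = StreamingEstimator()
for score in [1.0, 2.0, 3.0, 4.0]:
    est.update(score)
mean, stderr = est.predict()
```

In the thesis, this scalar tracker would be replaced by a learned model updated per frame, but the pattern is the same: state in, one observation in, updated state and a confidence estimate out.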
Corresponding Lab Member:
Bachelor Thesis: Multimodal Sentiment Analysis via Cross-Attention Fusion in Latent Space.
Description
Accurately estimating sentiment in video data is a complex challenge that requires integrating visual, audio, and textual cues. This proposal involves developing a multimodal sentiment analysis pipeline that uses state-of-the-art models to extract embeddings from the different modalities: Qwen2.5 for text, JEPA 2 for visuals, and WhisperX (pyannote) for audio. The embeddings are then projected into a shared latent space and fused via cross-attention mechanisms to capture intermodal dependencies. A probabilistic output layer then estimates sentiment distributions over fine-grained emotions. To accelerate analysis for real-time applications, the pipeline should use a stream-like vector projection method that updates latent-space representations incrementally instead of reprocessing entire sequences. The pipeline will be built with DUUI to ensure modularity, scalability, and reproducibility. The goal of this thesis is to address the limitations of unimodal sentiment analysis and traditional fusion methods and to achieve efficient, accurate, and scalable multimodal sentiment analysis.
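The projection-and-fusion step described above can be sketched in a few lines of NumPy: two modalities of different widths are projected into a shared latent space, one modality attends over the other via single-head cross-attention, and a softmax head turns the fused representation into a distribution over emotions. All dimensions, the random weights, and the single-head formulation are illustrative assumptions; a real pipeline would use learned multi-head attention over Qwen2.5 / JEPA 2 / WhisperX embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, key_value, w_q, w_k, w_v):
    """Single-head cross-attention: tokens of one modality (query) attend
    over tokens of another modality (key/value) in the shared latent space."""
    q = query @ w_q
    k = key_value @ w_k
    v = key_value @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


# Toy dimensions: text and audio embeddings of different widths,
# each projected into the shared d_latent-dimensional space first.
d_text, d_audio, d_latent = 8, 6, 4
text_tokens = rng.normal(size=(5, d_text))    # 5 text tokens
audio_tokens = rng.normal(size=(7, d_audio))  # 7 audio frames

p_text = rng.normal(size=(d_text, d_latent))   # modality-specific projections
p_audio = rng.normal(size=(d_audio, d_latent))
text_lat = text_tokens @ p_text
audio_lat = audio_tokens @ p_audio

w_q, w_k, w_v = (rng.normal(size=(d_latent, d_latent)) for _ in range(3))
fused = cross_attention(text_lat, audio_lat, w_q, w_k, w_v)
print(fused.shape)  # one fused vector per text token: (5, 4)

# Probabilistic output: pool the fused tokens and map them to a
# distribution over a (hypothetical) set of fine-grained emotions.
n_emotions = 3
w_out = rng.normal(size=(d_latent, n_emotions))
probs = softmax(fused.mean(axis=0) @ w_out)
```

The incremental ("stream-like") variant would keep `audio_lat` and `text_lat` as growing buffers and recompute attention only for newly arrived tokens rather than the whole sequence.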
Corresponding Lab Member:
If you have any suggestions of your own relating to this or our other proposed topics, please do not hesitate to contact us.
In addition, we offer a free mailing list through which we regularly announce new qualification and research work as well as other news relating to Texttechnology.
Publications
2024
Ali Abusaleh and Mehdi Rahim. 2024. A Multitask VAE for Time Series Preprocessing and Prediction of Blood Glucose Level. arXiv:2410.00015.
BibTeX
@misc{Abusaleh:Rahim:2024,
  title         = {A Multitask VAE for Time Series Preprocessing and Prediction of Blood Glucose Level},
  author        = {Ali Abusaleh and Mehdi Rahim},
  year          = {2024},
  eprint        = {2410.00015},
  archiveprefix = {arXiv},
  primaryclass  = {eess.SP},
  url           = {https://arxiv.org/abs/2410.00015}
}
