
Research assistant
Goethe-Universität Frankfurt am Main
Robert-Mayer-Straße 10
Room 401e
D-60325 Frankfurt am Main
D-60629 Frankfurt am Main (use for package delivery)
Postfach / P.O. Box: 154
Thesis topic proposals
2025
Master Thesis: Streaming Multimodal Understanding with Incremental Prediction and Uncertainty Modeling.
Description
Multimodal signals, such as video and audio, are ordered chronologically. Processing them requires systems capable of incremental prediction and dynamic uncertainty assessment. The goal of this master's thesis is to implement and test a framework that tracks and predicts events or states in real time, interprets intermodal interactions, and provides confidence estimates. The topic focuses on multimodal stream learning, in which models update their internal representations continuously as new data arrives, without access to the complete sequence. Unlike offline approaches, streaming systems process asynchronous, context-dependent signals, where the same signal can carry different meanings depending on accompanying visual or acoustic cues. For instance, the greeting "hello" (audio) accompanied by a smile (image) may imply friendliness, whereas the same greeting accompanied by a clenched fist (image) may imply a threat. This project has many potential applications, one of which is emotion recognition in video streams. Tasks include specifying a streaming-compatible application, reviewing relevant literature, implementing a basic system, and testing extensions such as online cross-modal fusion, temporal attention, and uncertainty-aware prediction layers.
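To make the incremental-update idea concrete, here is a minimal, purely illustrative sketch (not part of the proposal): per-frame audio and video embeddings are folded into an exponential moving average per modality, and predictive entropy serves as a simple uncertainty estimate. The class name and the random classifier weights are hypothetical placeholders for a learned model.

```python
import numpy as np

class StreamingFusionSketch:
    """Toy streaming fuser: keeps a running state per modality and
    emits a prediction with an entropy-based uncertainty after every step."""

    def __init__(self, dim, n_classes, decay=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.decay = decay
        self.state = {"audio": np.zeros(dim), "video": np.zeros(dim)}
        # Placeholder classifier weights; a real system would learn these online.
        self.W = rng.normal(scale=0.1, size=(n_classes, 2 * dim))

    def step(self, audio_emb, video_emb):
        # Incremental update: exponential moving average instead of
        # reprocessing the whole sequence seen so far.
        self.state["audio"] = self.decay * self.state["audio"] + (1 - self.decay) * audio_emb
        self.state["video"] = self.decay * self.state["video"] + (1 - self.decay) * video_emb
        fused = np.concatenate([self.state["audio"], self.state["video"]])
        logits = self.W @ fused
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Predictive entropy as a simple uncertainty estimate (0 = fully certain).
        entropy = float(-(probs * np.log(probs + 1e-12)).sum())
        return int(probs.argmax()), probs, entropy
```

A real system would replace the moving average with learned recurrent or attention-based state updates and calibrate the uncertainty estimate, but the constant-memory, one-frame-at-a-time interface is the point of the streaming setting.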
Corresponding Lab Member:
Bachelor Thesis: Multimodal Sentiment Analysis via Cross-Attention Fusion in Latent Space.
Description
Accurately estimating sentiment in video data is a complex challenge that requires integrating visual, audio, and textual cues. This proposal involves developing a multimodal sentiment analysis pipeline that uses state-of-the-art models to extract embeddings from the different modalities: Qwen2.5 for text, JEPA 2 for visuals, and WhisperX (pyannote) for audio. The embeddings are then projected into a shared latent space and fused via cross-attention mechanisms to capture intermodal dependencies. A probabilistic output layer then estimates sentiment distributions over fine-grained emotions. To accelerate analysis for real-time applications, the pipeline should use a stream-like vector projection method that updates latent-space representations incrementally instead of reprocessing entire sequences. The pipeline will be built with DUUI to ensure modularity, scalability, and reproducibility. The goal of this thesis is to overcome the limitations of unimodal sentiment analysis and traditional fusion methods, achieving efficient, accurate, and scalable multimodal sentiment analysis.
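The projection-and-fusion step described above can be sketched roughly as follows. This is a hypothetical illustration only: random matrices stand in for the learned per-modality projections, and it does not reflect the actual Qwen2.5 / JEPA 2 / WhisperX pipeline, only the shape of the computation (text tokens as queries, video and audio tokens as keys and values in a shared latent space).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(text, video, audio, d_shared=16, seed=0):
    """Project per-modality token embeddings into a shared latent space
    and let the text tokens attend over the video and audio tokens."""
    rng = np.random.default_rng(seed)
    # Random projections stand in for the learned ones in a trained pipeline.
    P = {name: rng.normal(scale=0.1, size=(x.shape[1], d_shared))
         for name, x in [("t", text), ("v", video), ("a", audio)]}
    q = text @ P["t"]                                  # queries from text tokens
    kv = np.vstack([video @ P["v"], audio @ P["a"]])   # keys/values from video+audio
    attn = softmax(q @ kv.T / np.sqrt(d_shared), axis=-1)
    fused = attn @ kv                                  # (n_text_tokens, d_shared)
    return fused.mean(axis=0)                          # pooled multimodal vector
```

In the proposed pipeline, the pooled vector would feed the probabilistic output layer over fine-grained emotions; a streaming variant would update `kv` incrementally as new tokens arrive rather than rebuilding it per sequence.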
Corresponding Lab Member:
If you have ideas of your own relating to this or any of our other proposed topics, please do not hesitate to contact us.
In addition, we offer a free mailing list through which we regularly announce new qualification and research projects as well as other news relating to Texttechnology.
Publications
2026
2026. TTLab at AraSentEval: SARF (صرف) Sentiment Analysis via Root-based Fusion for Multi-Dialectal Arabic. Proceedings of the 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT7), co-located with the Language Resources and Evaluation Conference (LREC 2026). accepted.
BibTeX
@inproceedings{Abusaleh:et:al:2026:sarf,
title = {TTLab at AraSentEval: SARF (صرف) Sentiment Analysis via Root-based
Fusion for Multi-Dialectal Arabic},
author = {Abusaleh, Ali and Verma, Bhuvanesh and Mehler, Alexander},
booktitle = {Proceedings of the 7th Workshop on Open-Source Arabic Corpora
and Processing Tools (OSACT7), co-located with the Language Resources
and Evaluation Conference (LREC 2026)},
eventdate = {2026-05},
location = {Palma, Mallorca, Spain},
year = {2026},
keywords = {NLP, Sentiment Analysis, Arabic analysis, new-data-spaces, circlet, satek},
abstract = {Arabic sentiment analysis is challenged by morphological complexity
and lexical variation across Arabic dialects, compounded by subjectivity
in how speakers and writers express sentiment. In this paper,
we present our submission for the AraSentEval 2026 Shared Task
on Arabic Dialect Sentiment Analysis. We propose SARF (صرف), a
multi-view architectural framework that integrates surface-level
context with stemmed and rooted morphological perspectives using
a shared MARBERTv2 encoder. Our system employs a hybrid BERT-CNN-BiLSTM-Attention
architecture to capture both local sentiment n-grams and global
sequential dependencies. Experimental results show that while
individual morphological normalization strategies (stemming or
rooting) may degrade performance, their joint integration via
cross-morphological attention provides robust features across
diverse dialects. Our final system achieved a competitive macro-F1-score
of 0.9263, ranking 2nd out of 15 participating teams.},
note = {accepted}
}
2026. Learning to Detect Cross-Modal Negation: An Analysis of Latent Representations and an Attention-Based Solution. 2026 8th International Conference on Natural Language Processing (ICNLP). accepted.
BibTeX
@inproceedings{Abusaleh:et:al:2026,
title = {Learning to Detect Cross-Modal Negation: An Analysis of Latent
Representations and an Attention-Based Solution},
author = {Abusaleh, Ali and Hammerla, Leon and Mehler, Alexander},
booktitle = {2026 8th International Conference on Natural Language Processing (ICNLP)},
eventdate = {2026-03-20/2026-03-22},
location = {Xi'an, China},
year = {2026},
keywords = {Vision language model, Natural language processing, Cross-modal retrieval, negation detection, video analysis, Multimodal analysis, Political Communication, neglab, new-data-spaces, circlet},
abstract = {Detecting high-level semantic concepts like negation across modalities
remains a challenge for current multimodal systems. We analyze
this as a fundamental representation learning problem, providing
the first evidence that negation does not form a linearly or non-linearly
separable class in the latent spaces of standard vision-language
models (VLMs). We demonstrate that pretrained embeddings primarily
encode modality-specific features, lacking a generalizable negation
signal. To overcome this, we propose a novel cross-modal attention
architecture that explicitly models inter-modal dependencies,
achieving performance gains of up to +7.03% F1 over unimodal baselines.
Our analysis reveals a key asymmetry: while textual negation often
appears independently, visual negation is semantically dependent
on linguistic context, a finding validated through our statistical
analysis of 3,222 political video-text pairs automatically annotated
via Qwen2.5-VL. By combining this analysis with self-supervised
video representations (JEPA2), we advance the modeling of temporal
negation. This work provides new methods and insights for learning
robust, semantically-aligned representations in multimodal systems.},
note = {accepted}
}
2024
2024. A Multitask VAE for Time Series Preprocessing and Prediction of Blood Glucose Level.
BibTeX
@misc{Abusaleh:Rahim:2024,
title = {A Multitask VAE for Time Series Preprocessing and Prediction of
Blood Glucose Level},
author = {Abusaleh, Ali and Rahim, Mehdi},
year = {2024},
eprint = {2410.00015},
archiveprefix = {arXiv},
primaryclass = {eess.SP},
url = {https://arxiv.org/abs/2410.00015}
}
