New publication accepted at IEEE ICNLP 2026

We are pleased to inform you about the acceptance of a new paper at IEEE’s 2026 8th International Conference on Natural Language Processing (ICNLP) entitled:

Learning to Detect Cross-Modal Negation: An Analysis of Latent Representations and an Attention-Based Solution

Ali AbuSaleh, Leon Hammerla and Alexander Mehler. 2026. Learning to Detect Cross-Modal Negation: An Analysis of Latent Representations and an Attention-Based Solution. 2026 8th International Conference on Natural Language Processing (ICNLP), 613–622.

BibTeX

@inproceedings{Abusaleh:et:al:2026,
  author    = {AbuSaleh, Ali and Hammerla, Leon and Mehler, Alexander},
  booktitle = {2026 8th International Conference on Natural Language Processing (ICNLP)},
  title     = {Learning to Detect Cross-Modal Negation: An Analysis of Latent
               Representations and an Attention-Based Solution},
  year      = {2026},
  volume    = {},
  number    = {},
  pages     = {613-622},
  keywords  = {Modeling;Videos;Labeling;Visualization;Signal detection;Large language models;Head;Media;Accuracy;Annotations;Vision language model;Natural language processing;Cross-modal retrieval;negation detection;video analysis;Multimodal analysis;Political Communication, neglab, new-data-spaces, circlet},
  doi       = {10.1109/ICNLP69856.2026.11527861},
  abstract  = {Detecting high-level semantic concepts like negation across modalities
               remains a challenge for current multimodal systems. We analyze
               this as a fundamental representation learning problem, providing
               the first evidence that negation does not form a linearly or non-linearly
               separable class in the latent spaces of standard vision-language
               models (VLMs). We demonstrate that pretrained embeddings primarily
               encode modality-specific features, lacking a generalizable negation
               signal. To overcome this, we propose a novel cross-modal attention
               architecture that explicitly models inter-modal dependencies,
               achieving performance gains of up to +7.03% F1 over unimodal baselines.
               Our analysis reveals a key asymmetry: while textual negation often
               appears independently, visual negation is semantically dependent
               on linguistic context, a finding validated through our statistical
               analysis of 3,222 political video-text pairs automatically annotated
               via Qwen2.5-VL. By combining this analysis with self-supervised
               video representations (JEPA2), we advance the modeling of temporal
               negation. This work provides new methods and insights for learning
               robust, semantically-aligned representations in multimodal systems.}
}