
New publication accepted in ACL Findings 2025

Our paper, “Filling the Temporal Void: Recovering Missing Publication Years in the Project Gutenberg Corpus Using LLMs,” has been accepted to the Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).

Omar Momen, Manuel Schaaf and Alexander Mehler. July, 2025. Filling the Temporal Void: Recovering Missing Publication Years in the Project Gutenberg Corpus Using LLMs. Findings of the Association for Computational Linguistics: ACL 2025, 17318–17334.
BibTeX
@inproceedings{Momen:Schaaf:Mehler:2025,
  title     = {Filling the Temporal Void: Recovering Missing Publication Years
               in the Project Gutenberg Corpus Using {LLM}s},
  author    = {Momen, Omar and Schaaf, Manuel and Mehler, Alexander},
  editor    = {Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
  month     = jul,
  year      = {2025},
  address   = {Vienna, Austria},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.findings-acl.890/},
  pages     = {17318--17334},
  isbn      = {979-8-89176-256-5},
  abstract  = {Analysing texts spanning long periods of time is critical for
               researchers in historical linguistics and related disciplines.
               However, publicly available corpora suitable for such analyses
               are scarce. The Project Gutenberg (PG) corpus presents a significant
               yet underutilized opportunity in this context, due to the absence
               of accurate temporal metadata. We take advantage of language models
               and information retrieval to explore four sources of information
               {--} Open Web, Wikipedia, Open Library API, and PG books texts
               {--} to add missing temporal metadata to the PG corpus. Through
               20 experiments employing state-of-the-art Large Language Models
               (LLMs) and Retrieval-Augmented Generation (RAG) methods, we estimate
               the production years of all PG books. We curate an enriched metadata
               repository for the PG corpus and propose a refined version for
               it, which includes 53,774 books with a total of 3.8 billion tokens
               in 11 languages, produced between 1600 and 2000. This work provides
               a new resource for computational linguistics and humanities studies
               focusing on diachronic analyses. The final dataset and all experiments
               data are publicly available (https://github.com/OmarMomen14/pg-dates).},
  pdf       = {https://aclanthology.org/2025.findings-acl.890.pdf}
}

Best Demo Award at NAACL 2025

We are delighted that our paper “Towards Unified, Dynamic and Annotation-based Visualisations and Exploration of Annotated Big Data Corpora with the Help of Unified Corpus Explorer” has received the Best Demo Award at this year’s annual conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025).

Kevin Bönisch, Giuseppe Abrami and Alexander Mehler. 2025. Towards Unified, Dynamic and Annotation-based Visualisations and Exploration of Annotated Big Data Corpora with the Help of Unified Corpus Explorer. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), 522–534. Best Demo Award.
BibTeX
@inproceedings{Boenisch:et:al:2025,
  title     = {Towards Unified, Dynamic and Annotation-based Visualisations and
               Exploration of Annotated Big Data Corpora with the Help of Unified
               Corpus Explorer},
  author    = {B{\"o}nisch, Kevin and Abrami, Giuseppe and Mehler, Alexander},
  editor    = {Dziri, Nouha and Ren, Sean (Xiang) and Diao, Shizhe},
  booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas
               Chapter of the Association for Computational Linguistics: Human
               Language Technologies (System Demonstrations)},
  year      = {2025},
  address   = {Albuquerque, New Mexico},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.naacl-demo.42/},
  pages     = {522--534},
  isbn      = {979-8-89176-191-9},
  abstract  = {The annotation and exploration of large text corpora, both automatic
               and manual, presents significant challenges across multiple disciplines,
               including linguistics, digital humanities, biology, and legal
               science. These challenges are exacerbated by the heterogeneity
               of processing methods, which complicates corpus visualization,
               interaction, and integration. To address these issues, we introduce
               the Unified Corpus Explorer (UCE), a standardized, dockerized,
               open-source and dynamic Natural Language Processing (NLP) application
               designed for flexible and scalable corpus navigation. Herein,
               UCE utilizes the UIMA format for NLP annotations as a standardized
               input, constructing interfaces and features around those annotations
               while dynamically adapting to the corpora and their extracted
               annotations. We evaluate UCE based on a user study and demonstrate
               its versatility as a corpus explorer based on generative AI.},
  note      = {Best Demo Award},
  pdf       = {https://aclanthology.org/2025.naacl-demo.42.pdf},
  keywords  = {uce,new-data-spaces,circlet,core,core_c08}
}

New publication accepted at NAACL 2025

Our paper, “Towards Unified, Dynamic and Annotation-based Visualisations and Exploration of Annotated Big Data Corpora with the Help of Unified Corpus Explorer,” has been accepted to the System Demonstrations track of the 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025).

In this paper, we present our open-source Unified Corpus Explorer (UCE), a generic corpus explorer in the form of a web portal that takes UIMA-annotated data from any domain and dynamically builds itself around it. The result is an interactive corpus explorer with semantic search, visualizations, document reading capabilities, Wikidition hypertext generation, and chatbot integration.

Kevin Bönisch, Giuseppe Abrami and Alexander Mehler. 2025. Towards Unified, Dynamic and Annotation-based Visualisations and Exploration of Annotated Big Data Corpora with the Help of Unified Corpus Explorer. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), 522–534. Best Demo Award.