Our paper, “Filling the Temporal Void: Recovering Missing Publication Years in the Project Gutenberg Corpus Using LLMs”, has been accepted to the Findings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).
July, 2025.
Filling the Temporal Void: Recovering Missing Publication Years in the Project Gutenberg Corpus Using LLMs. Findings of the Association for Computational Linguistics: ACL 2025, 17318–17334.
BibTeX
@inproceedings{Momen:Schaaf:Mehler:2025,
title = {Filling the Temporal Void: Recovering Missing Publication Years
in the Project Gutenberg Corpus Using {LLM}s},
author = {Momen, Omar and Schaaf, Manuel and Mehler, Alexander},
editor = {Che, Wanxiang and Nabende, Joyce and Shutova, Ekaterina and Pilehvar, Mohammad Taher},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
month = jul,
year = {2025},
address = {Vienna, Austria},
publisher = {Association for Computational Linguistics},
url = {https://aclanthology.org/2025.findings-acl.890/},
pages = {17318--17334},
isbn = {979-8-89176-256-5},
abstract = {Analysing texts spanning long periods of time is critical for
researchers in historical linguistics and related disciplines.
However, publicly available corpora suitable for such analyses
are scarce. The Project Gutenberg (PG) corpus presents a significant
yet underutilized opportunity in this context, due to the absence
of accurate temporal metadata. We take advantage of language models
and information retrieval to explore four sources of information
{--} Open Web, Wikipedia, Open Library API, and PG books texts
{--} to add missing temporal metadata to the PG corpus. Through
20 experiments employing state-of-the-art Large Language Models
(LLMs) and Retrieval-Augmented Generation (RAG) methods, we estimate
the production years of all PG books. We curate an enriched metadata
repository for the PG corpus and propose a refined version for
it, which includes 53,774 books with a total of 3.8 billion tokens
in 11 languages, produced between 1600 and 2000. This work provides
a new resource for computational linguistics and humanities studies
focusing on diachronic analyses. The final dataset and all experiments
data are publicly available (https://github.com/OmarMomen14/pg-dates).},
pdf = {https://aclanthology.org/2025.findings-acl.890.pdf}
}
