
BIOfid (FID Biodiversity Research) aims to make both historical and contemporary biodiversity literature available to researchers across Germany in modern, machine-readable formats.
The Johann Christian Senckenberg University Library leads the project in collaboration with the Senckenberg Society for Nature Research and the Text Technology Working Group at the Goethe University Frankfurt.
Scope and Objectives
In terms of subject focus, BIOfid covers a subfield of the former special collection area of biology. Biodiversity research, as defined within the project, includes:
- Taxonomy
- Systematics
- Evolutionary biology
- Ecology
The initial three-year development phase is dedicated to building an innovative specialised information service (FID), based on close and continuous exchange with the biodiversity research community.
More information about the service can be found at: https://www.biofid.de/en/
Phases of the BIOfid Project:
- Phase I: 2017-2020
- Phase II: 2020-2023
- Phase III: 2023-2026
Project Structure
The BIOfid project is structured into four complementary modules, each addressing a key aspect of making biodiversity literature accessible, usable, and sustainable for research.
Module 1: Text Mining for Biodiversity Literature
- Objective:
- To mobilise structured biodiversity data from existing scientific literature using text-mining techniques.
- Description:
- Module 1 focuses on extracting relevant scientific information from biodiversity literature and transforming it into machine-readable data. Based on a DFG roundtable discussion involving representatives from numerous German research institutions, the project initially concentrates on:
- Publications originating from Germany
- Three organism groups: birds, butterflies, and vascular plants
- The module develops reusable text-mining tools that can later be applied to additional organism groups and geographical regions. This ensures long-term scalability beyond the initial project scope.
- Outcome:
- The extracted and structured data form an extensive, reusable data pool that will support future biodiversity research and enable data-driven analyses.
Module 2: Digitisation of Biodiversity Literature
- Objective:
- To create high-quality digital text corpora from print biodiversity literature.
- Description:
- Module 2 focuses on the digitisation of 20th-century biodiversity literature. This digitised material serves a dual purpose:
- It provides the textual foundation required for text-mining activities in Module 1.
- It forms the content basis for the planned open access journal platform in Module 3.
- An additional goal of this module is to ensure that digitised materials are freely available on the web, thereby improving accessibility for researchers.
- Outcome:
- A curated, digitised corpus of biodiversity literature that supports both automated processing and open scholarly access.
Module 3: Open Access Journal Platform
- Objective:
- To establish a sustainable publishing infrastructure for biodiversity research.
- Description:
- Module 3 involves the development of a platform for open access biodiversity journals, designed as a long-term service for publishers such as professional associations.
- The platform supports:
- The publication of new open access journals
- The digital transfer of journals previously available only in print form
- This module ensures the long-term preservation, visibility, and accessibility of biodiversity literature.
- Outcome:
- A stable and sustainable open access platform that enables continuous dissemination of biodiversity research outputs.
Module 4: Literature Procurement and Supraregional Provision
- Objective:
- To ensure comprehensive access to biodiversity literature, including print-only resources.
- Description:
- Module 4 focuses on the procurement and supraregional provision of specialist biodiversity literature. It ensures that printed literature remains accessible for research purposes and covers the entire spectrum of organismic biodiversity.
- In addition, this module provides supraregional access to specialised databases, including the Global Plants database, for eligible research institutions.
- Outcome:
- Reliable, nationwide access to essential biodiversity literature and databases, complementing the digital services provided by other modules.
Overall Goal
All four modules work together to make biodiversity literature easier to use for research. The project supports digitisation, text mining, publishing, and access to literature so that the content can be analysed with computers as well as read by humans.
Current Work (Ongoing)
As part of the ongoing work in BIOfid, we are focusing on the annotation of biodiversity literature, especially historical German texts. These texts are difficult to process automatically because they contain old spelling forms, changing terminology, and long, complex sentences.
To handle this, we are using large language models (LLMs) to annotate important information in the texts, such as location and time expressions. The models are guided with clear instructions and structured input formats so that annotations are produced in a consistent way.
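The combination of fixed instructions and a structured input format can be illustrated with a minimal sketch. It is not the project's actual prompt or pipeline: the label set, the JSON layout, the function names `build_prompt` and `parse_annotations`, and the example passage are all assumptions made for illustration, and the model reply is a hand-written stand-in rather than real model output.

```python
import json

# Assumed label set, based on the two examples named in the text above.
LABELS = ("LOCATION", "TIME")


def build_prompt(text: str) -> str:
    """Combine fixed instructions with the passage wrapped in a JSON record."""
    instructions = (
        "Annotate every location and time expression in the passage. "
        'Reply with a JSON list of objects with the keys "start", "end" '
        'and "label"; "label" must be one of ' + ", ".join(LABELS) + ". "
        "Offsets are character positions in the passage."
    )
    return instructions + "\n\n" + json.dumps({"passage": text}, ensure_ascii=False)


def parse_annotations(reply: str, text: str) -> list[dict]:
    """Keep only well-formed annotations whose offsets fit inside the passage."""
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    valid = []
    for item in items:
        if not isinstance(item, dict):
            continue
        start, end, label = item.get("start"), item.get("end"), item.get("label")
        if (isinstance(start, int) and isinstance(end, int)
                and 0 <= start < end <= len(text) and label in LABELS):
            valid.append({"start": start, "end": end, "label": label,
                          "surface": text[start:end]})
    return valid


# Invented example passage (not taken from the BIOfid corpus).
passage = "Im Mai 1887 bei Frankfurt beobachtet."
# Hand-written stand-in for a model reply; no model is called here.
reply = (
    '[{"start": 3, "end": 11, "label": "TIME"},'
    ' {"start": 16, "end": 25, "label": "LOCATION"},'
    ' {"start": 30, "end": 99, "label": "TIME"}]'
)
annotations = parse_annotations(reply, passage)
for annotation in annotations:
    print(annotation["label"], annotation["surface"])
```

Validating the reply against the passage, as the third stand-in annotation with its out-of-range offset shows, is one simple way to discard malformed output before it reaches any evaluation step.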
At the same time, human annotations play an important role. A large set of manually annotated data is used as a reference to evaluate and verify the annotations produced by LLMs. This allows us to compare machine-generated annotations with human judgments and to better understand the strengths and limitations of LLMs on historical biodiversity texts.
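One common way to compare machine-generated annotations with a human reference is exact-match precision, recall and F1 over annotated spans. The sketch below assumes this metric and a `(start, end, label)` span representation; neither is confirmed as the measure or format used in BIOfid, and the spans are invented toy data.

```python
Span = tuple[int, int, str]  # (start offset, end offset, label); assumed representation


def span_scores(gold: set[Span], predicted: set[Span]) -> dict[str, float]:
    """Exact-match precision, recall and F1 of predicted spans against gold spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data: three human annotations, four machine annotations, two in common.
gold = {(3, 11, "TIME"), (16, 25, "LOCATION"), (40, 48, "LOCATION")}
predicted = {(3, 11, "TIME"), (16, 25, "LOCATION"), (30, 36, "TIME"), (50, 55, "LOCATION")}

scores = span_scores(gold, predicted)
print(scores)
```

Exact matching is strict: a span that is off by one character counts as both a false positive and a false negative. A more lenient comparison, for example one based on overlap, would be a separate design choice.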
This work is still in progress. It helps improve the quality of machine-readable data and supports future text mining and biodiversity research within BIOfid.
BibTeX
@inproceedings{Dahmann:et:al:2026,
title = {Towards the Generation and Application of Dynamic Web-Based Visualization
of UIMA-based Annotations for Big-Data Corpora with the Help of
Unified Dynamic Annotation Visualizer},
booktitle = {Proceedings of the 15th International Conference on Language Resources
and Evaluation (LREC 2026)},
year = {2026},
author = {Dahmann, Thiemo and Schneider, Julian and Stephan, Philipp and Abrami, Giuseppe
and Mehler, Alexander},
keywords = {NLP, UIMA, Annotations, dynamic visualization, uce},
abstract = {The automatic and manual annotation of unstructured corpora is
a daily task in various scientific fields, which is supported
by a variety of existing software solutions. Despite this variety,
there are currently only limited solutions for visualizing annotations,
especially with regard to dynamic generation and interaction.
To bridge this gap and to visualize and provide annotated corpora
based on user-, project- or corpus-specific aspects, Unified Dynamic
Annotation Visualizer (UDAV) was developed. UDAV is designed as
a web-based solution that implements a number of essential features
which comparable tools do not support to enable a customizable
and extensible toolbox for interacting with annotations, allowing
the integration into existing big data frameworks.},
note = {accepted}
}
@inproceedings{Boenisch:et:al:2025,
title = {Towards Unified, Dynamic and Annotation-based Visualisations and
Exploration of Annotated Big Data Corpora with the Help of Unified
Corpus Explorer},
author = {B{\"o}nisch, Kevin and Abrami, Giuseppe and Mehler, Alexander},
editor = {Dziri, Nouha and Ren, Sean (Xiang) and Diao, Shizhe},
booktitle = {Proceedings of the 2025 Conference of the Nations of the Americas
Chapter of the Association for Computational Linguistics: Human
Language Technologies (System Demonstrations)},
year = {2025},
address = {Albuquerque, New Mexico},
publisher = {Association for Computational Linguistics},
url = {https://aclanthology.org/2025.naacl-demo.42/},
pages = {522--534},
isbn = {979-8-89176-191-9},
abstract = {The annotation and exploration of large text corpora, both automatic
and manual, presents significant challenges across multiple disciplines,
including linguistics, digital humanities, biology, and legal
science. These challenges are exacerbated by the heterogeneity
of processing methods, which complicates corpus visualization,
interaction, and integration. To address these issues, we introduce
the Unified Corpus Explorer (UCE), a standardized, dockerized,
open-source and dynamic Natural Language Processing (NLP) application
designed for flexible and scalable corpus navigation. Herein,
UCE utilizes the UIMA format for NLP annotations as a standardized
input, constructing interfaces and features around those annotations
while dynamically adapting to the corpora and their extracted
annotations. We evaluate UCE based on a user study and demonstrate
its versatility as a corpus explorer based on generative AI.},
note = {Best Demo Award},
pdf = {https://aclanthology.org/2025.naacl-demo.42.pdf},
keywords = {uce,new-data-spaces,circlet,core,core_c08}
}
@article{Abrami:et:al:2025:a,
title = {Docker Unified UIMA Interface: New perspectives for NLP on big data},
journal = {SoftwareX},
volume = {29},
pages = {102033},
year = {2025},
issn = {2352-7110},
doi = {10.1016/j.softx.2024.102033},
url = {https://www.sciencedirect.com/science/article/pii/S2352711024004047},
author = {Giuseppe Abrami and Markos Genios and Filip Fitzermann and Daniel Baumartz
and Alexander Mehler},
keywords = {Docker, Kubernetes, UIMA, Distributed NLP, duui, biofid, neglab, new-data-spaces, circlet, core, core_c08},
abstract = {Processing large amounts of natural language text using machine
learning-based models is becoming important in many disciplines.
This demand is being met by a variety of approaches, resulting
in the heterogeneous deployment of separate, partly incompatible,
not natively scalable applications. To overcome the technological
bottleneck involved, we have developed Docker Unified UIMA Interface,
a system for the standardized, parallel, platform-independent,
distributed and microservices-based solution for processing large
and extensive text corpora with any NLP method. We present DUUI
as a framework that enables automated orchestration of GPU-based
NLP processes beyond the existing Docker Swarm cluster variant,
and in addition to the adaptation to new runtime environments
such as Kubernetes. Therefore, a new driver for DUUI is introduced,
which enables the lightweight orchestration of DUUI processes
within a Kubernetes environment in a scalable setup. In this way,
the paper opens up novel text-technological perspectives for existing
practices in disciplines that deal with the scientific analysis
of large amounts of data based on NLP.}
}
@article{Luecking:et:al:2021,
author = {Andy Lücking and Christine Driller and Manuel Stoeckel and Giuseppe Abrami
and Adrian Pachzelt and Alexander Mehler},
year = {2021},
journal = {Language Resources and Evaluation},
title = {Multiple Annotation for Biodiversity: Developing an annotation
framework among biology, linguistics and text technology},
editor = {Nancy Ide and Nicoletta Calzolari},
doi = {10.1007/s10579-021-09553-5},
pdf = {https://link.springer.com/content/pdf/10.1007/s10579-021-09553-5.pdf},
keywords = {biofid}
}
@inproceedings{Abrami:et:al:2021,
author = {Abrami, Giuseppe and Henlein, Alexander and Lücking, Andy and Kett, Attila
and Adeberg, Pascal and Mehler, Alexander},
title = {Unleashing annotations with {TextAnnotator}: Multimedia, multi-perspective
document views for ubiquitous annotation},
booktitle = {Proceedings of the 17th Joint ACL - ISO Workshop on Interoperable
Semantic Annotation},
series = {ISA-17},
publisher = {Association for Computational Linguistics},
address = {Groningen, The Netherlands (online)},
month = {June},
editor = {Bunt, Harry},
year = {2021},
url = {https://aclanthology.org/2021.isa-1.7},
pages = {65--75},
keywords = {textannotator, biofid},
pdf = {https://iwcs2021.github.io/proceedings/isa/pdf/2021.isa-1.7.pdf},
abstract = {We argue that mainly due to technical innovation in the landscape
of annotation tools, a conceptual change in annotation models
and processes is also on the horizon. It is diagnosed that these
changes are bound up with multi-media and multi-perspective facilities
of annotation tools, in particular when considering virtual reality
(VR) and augmented reality (AR) applications, their potential
ubiquitous use, and the exploitation of externally trained natural
language pre-processing methods. Such developments potentially
lead to a dynamic and exploratory heuristic construction of the
annotation process. With TextAnnotator an annotation suite is
introduced which focuses on multi-mediality and multi-perspectivity
with an interoperable set of task-specific annotation modules
(e.g., for word classification, rhetorical structures, dependency
trees, semantic roles, and more) and their linkage to VR and mobile
implementations. The basic architecture and usage of TextAnnotator
is described and related to the above mentioned shifts in the
field.}
}
@article{Driller:et:al:2020,
author = {Christine Driller and Markus Koch and Giuseppe Abrami and Wahed Hemati
and Andy Lücking and Alexander Mehler and Adrian Pachzelt and Gerwin Kasperek},
title = {Fast and Easy Access to Central European Biodiversity Data with BIOfid},
volume = {4},
number = {},
year = {2020},
doi = {10.3897/biss.4.59157},
publisher = {Pensoft Publishers},
abstract = {The storage of data in public repositories such as the Global
Biodiversity Information Facility (GBIF) or the National Center
for Biotechnology Information (NCBI) is nowadays stipulated in
the policies of many publishers in order to facilitate data replication
or proliferation. Species occurrence records contained in legacy
printed literature are no exception to this. The extent of their
digital and machine-readable availability, however, is still far
from matching the existing data volume (Thessen and Parr 2014).
But precisely these data are becoming more and more relevant to
the investigation of ongoing loss of biodiversity. In order to
extract species occurrence records at a larger scale from available
publications, one has to apply specialised text mining tools.
However, such tools are in short supply especially for scientific
literature in the German language. The Specialised Information
Service Biodiversity Research BIOfid (Koch et al. 2017) aims
at reducing this desideratum, inter alia, by preparing a searchable
text corpus semantically enriched by a new kind of multi-label
annotation. For this purpose, we feed manual annotations into
automatic, machine-learning annotators. This mixture of automatic
and manual methods is needed, because BIOfid approaches a new
application area with respect to language (mainly German of the
19th century), text type (biological reports), and linguistic
focus (technical and everyday language). We will present current
results of the performance of BIOfid’s semantic search engine
and the application of independent natural language processing
(NLP) tools. Most of these are freely available online, such as
TextImager (Hemati et al. 2016). We will show how TextImager is
tied into the BIOfid pipeline and how it is made scalable (e.g.
extendible by further modules) and usable on different systems
(docker containers). Further, we will provide a short introduction
to generating machine-learning training data using TextAnnotator
(Abrami et al. 2019) for multi-label annotation. Annotation reproducibility
can be assessed by the implementation of inter-annotator agreement
methods (Abrami et al. 2020). Beyond taxon recognition and entity
linking, we place particular emphasis on location and time information.
For this purpose, our annotation tag-set combines general categories
and biology-specific categories (including taxonomic names) with
location and time ontologies. The application of the annotation
categories is regimented by annotation guidelines (Lücking et
al. 2020). Within the next years, our work deliverable will be
a semantically accessible and data-extractable text corpus of
around two million pages. In this way, BIOfid is creating a new
valuable resource that expands our knowledge of biodiversity and
its determinants.},
issn = {},
pages = {e59157},
url = {https://doi.org/10.3897/biss.4.59157},
eprint = {https://doi.org/10.3897/biss.4.59157},
journal = {Biodiversity Information Science and Standards},
keywords = {biofid}
}
@inproceedings{Abrami:Mehler:Stoeckel:2020,
author = {Abrami, Giuseppe and Mehler, Alexander and Stoeckel, Manuel},
title = {{TextAnnotator}: A web-based annotation suite for texts},
booktitle = {Proceedings of the Digital Humanities 2020},
series = {DH 2020},
location = {Ottawa, Canada},
year = {2020},
url = {https://dh2020.adho.org/wp-content/uploads/2020/07/547_TextAnnotatorAwebbasedannotationsuitefortexts.html},
doi = {10.17613/tenm-4907},
abstract = {The TextAnnotator is a tool for simultaneous and collaborative
annotation of texts with visual annotation support, integration
of knowledge bases and, by pipelining the TextImager, a rich variety
of pre-processing and automatic annotation tools. It includes
a variety of modules for the annotation of texts, which contains
the annotation of argumentative, rhetorical, propositional and
temporal structures as well as a module for named entity linking
and rapid annotation of named entities. Especially the modules
for annotation of temporal, argumentative and propositional structures
are currently unique in web-based annotation tools. The TextAnnotator,
which allows the annotation of texts as a platform, is divided
into a front- and a backend component. The backend is a web service
based on WebSockets, which integrates the UIMA Database Interface
to manage and use texts. Texts are made accessible by using the
ResourceManager and the AuthorityManager, based on user and group
access permissions. Different views of a document can be created
and used depending on the scenario. Once a document has been opened,
access is gained to the annotations stored within annotation views
in which these are organized. Any annotation view can be assigned
with access permissions and by default, each user obtains his
or her own user view for every annotated document. In addition,
with sufficient access permissions, all annotation views can also
be used and curated. This allows the possibility to calculate
an Inter-Annotator-Agreement for a document, which shows an agreement
between the annotators. Annotators without sufficient rights cannot
display this value so that the annotators do not influence each
other. This contribution is intended to reflect the current state
of development of TextAnnotator, demonstrate the possibilities
of an instantaneous Inter-Annotator-Agreement and trigger a discussion
about further functions for the community.},
keywords = {textannotator, biofid},
poster = {https://hcommons.org/deposits/download/hc:31816/CONTENT/dh2020_textannotator_poster.pdf}
}
@inproceedings{Abrami:Stoeckel:Mehler:2020,
author = {Abrami, Giuseppe and Stoeckel, Manuel and Mehler, Alexander},
title = {TextAnnotator: A UIMA Based Tool for the Simultaneous and Collaborative
Annotation of Texts},
booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
year = {2020},
address = {Marseille, France},
publisher = {European Language Resources Association},
pages = {891--900},
isbn = {979-10-95546-34-4},
abstract = {The annotation of texts and other material in the field of digital
humanities and Natural Language Processing (NLP) is a common task
of research projects. At the same time, the annotation of corpora
is certainly the most time- and cost-intensive component in research
projects and often requires a high level of expertise according
to the research interest. However, for the annotation of texts,
a wide range of tools is available, both for automatic and manual
annotation. Since the automatic pre-processing methods are not
error-free and there is an increasing demand for the generation
of training data, also with regard to machine learning, suitable
annotation tools are required. This paper defines criteria of
flexibility and efficiency of complex annotations for the assessment
of existing annotation tools. To extend this list of tools, the
paper describes TextAnnotator, a browser-based, multi-annotation
system, which has been developed to perform platform-independent
multimodal annotations and annotate complex textual structures.
The paper illustrates the current state of development of TextAnnotator
and demonstrates its ability to evaluate annotation quality (inter-annotator
agreement) at runtime. In addition, it will be shown how annotations
of different users can be performed simultaneously and collaboratively
on the same document from different platforms using UIMA as the
basis for annotation.},
url = {https://www.aclweb.org/anthology/2020.lrec-1.112},
keywords = {textannotator, biofid},
pdf = {http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.112.pdf}
}
@inproceedings{Ahmed:Stoeckel:Driller:Pachzelt:Mehler:2019,
author = {Sajawel Ahmed and Manuel Stoeckel and Christine Driller and Adrian Pachzelt
and Alexander Mehler},
title = {{BIOfid Dataset: Publishing a German Gold Standard for Named Entity
Recognition in Historical Biodiversity Literature}},
publisher = {Association for Computational Linguistics},
year = {2019},
booktitle = {Proceedings of the 23rd Conference on Computational Natural Language
Learning (CoNLL)},
address = {Hong Kong, China},
url = {https://www.aclweb.org/anthology/K19-1081},
doi = {10.18653/v1/K19-1081},
pages = {871--880},
abstract = {The Specialized Information Service Biodiversity Research (BIOfid)
has been launched to mobilize valuable biological data from printed
literature hidden in German libraries for over the past 250 years.
In this project, we annotate German texts converted by OCR from
historical scientific literature on the biodiversity of plants,
birds, moths and butterflies. Our work enables the automatic extraction
of biological information previously buried in the mass of papers
and volumes. For this purpose, we generated training data for
the tasks of Named Entity Recognition (NER) and Taxa Recognition
(TR) in biological documents. We use this data to train a number
of leading machine learning tools and create a gold standard for
TR in biodiversity literature. More specifically, we perform a
practical analysis of our newly generated BIOfid dataset through
various downstream-task evaluations and establish a new state
of the art for TR with 80.23{\%} F-score. In this sense, our paper
lays the foundations for future work in the field of information
extraction in biology texts.},
keywords = {biofid}
}
@inproceedings{Abrami:et:al:2019,
author = {Abrami, Giuseppe and Mehler, Alexander and Lücking, Andy and Rieb, Elias
and Helfrich, Philipp},
title = {{TextAnnotator}: A flexible framework for semantic annotations},
booktitle = {Proceedings of the Fifteenth Joint ACL - ISO Workshop on Interoperable
Semantic Annotation, (ISA-15)},
series = {ISA-15},
location = {Gothenburg, Sweden},
month = {May},
pdf = {https://www.texttechnologylab.org/wp-content/uploads/2019/04/TextAnnotator_IWCS_Göteborg.pdf},
year = {2019},
keywords = {textannotator, biofid},
abstract = {Modern annotation tools should meet at least the following general
requirements: they can handle diverse data and annotation levels
within one tool, and they support the annotation process with
automatic (pre-)processing outcomes as much as possible. We developed
a framework that meets these general requirements and that enables
versatile and browser-based annotations of texts, the TextAnnotator.
It combines NLP methods of pre-processing with methods of flexible
post-processing. In fact, machine learning (ML) requires a lot
of training and test data, but is usually far from achieving perfect
results. Producing high-level annotations for ML and post-correcting
its results are therefore necessary. This is the purpose of TextAnnotator,
which is entirely implemented in ExtJS and provides a range of
interactive visualizations of annotations. In addition, it allows
for flexibly integrating knowledge resources, e.g. in the course
of post-processing named entity recognition. The paper describes
TextAnnotator’s architecture together with three use cases: annotating
temporal structures, argument structures and named entity linking.}
}
@article{Driller:et:al:2018,
author = {Christine Driller and Markus Koch and Marco Schmidt and Claus Weiland
and Thomas Hörnschemeyer and Thomas Hickler and Giuseppe Abrami and Sajawel Ahmed
and Rüdiger Gleim and Wahed Hemati and Tolga Uslu and Alexander Mehler
and Adrian Pachzelt and Jashar Rexhepi and Thomas Risse and Janina Schuster
and Gerwin Kasperek and Angela Hausinger},
title = {Workflow and Current Achievements of BIOfid, an Information Service
Mobilizing Biodiversity Data from Literature Sources},
volume = {2},
number = {},
year = {2018},
doi = {10.3897/biss.2.25876},
publisher = {Pensoft Publishers},
abstract = {BIOfid is a specialized information service currently being developed
to mobilize biodiversity data dormant in printed historical and
modern literature and to offer a platform for open access journals
on the science of biodiversity. Our team of librarians, computer
scientists and biologists produce high-quality text digitizations,
develop new text-mining tools and generate detailed ontologies
enabling semantic text analysis and semantic search by means of
user-specific queries. In a pilot project we focus on German publications
on the distribution and ecology of vascular plants, birds, moths
and butterflies extending back to the Linnaeus period about 250
years ago. The three organism groups have been selected according
to current demands of the relevant research community in Germany.
The text corpus defined for this purpose comprises over 400 volumes
with more than 100,000 pages to be digitized and will be complemented
by journals from other digitization projects, copyright-free and
project-related literature. With TextImager (Natural Language
Processing & Text Visualization) and TextAnnotator (Discourse
Semantic Annotation) we have already extended and launched tools
that focus on the text-analytical section of our project. Furthermore,
taxonomic and anatomical ontologies elaborated by us for the taxa
prioritized by the project’s target group - German institutions
and scientists active in biodiversity research - are constantly
improved and expanded to maximize scientific data output. Our
poster describes the general workflow of our project ranging from
literature acquisition via software development, to data availability
on the BIOfid web portal (http://biofid.de/), and the implementation
into existing platforms which serve to promote global accessibility
of biodiversity data.},
issn = {},
pages = {e25876},
url = {https://doi.org/10.3897/biss.2.25876},
eprint = {https://doi.org/10.3897/biss.2.25876},
journal = {Biodiversity Information Science and Standards},
keywords = {biofid}
}
