Manuel Schaaf

née Stoeckel

Scientific Assistant

Goethe-Universität Frankfurt am Main
Robert-Mayer-Straße 10
Room 401b
D-60325 Frankfurt am Main
D-60054 Frankfurt am Main (use for package delivery)
Postfach / P.O. Box: 154

Office Hour: Wednesday, 10 a.m.–12 p.m.

Projects

The specialised information service BIOfid (www.biofid.de) addresses the specific needs of scientists who research biodiversity topics at research institutions and in natural history collections. Since 2017, BIOfid has been building an infrastructure that helps provide and mobilise research-relevant data in a variety of ways, in line with current developments in biodiversity research.

Specialised Information Service Biodiversity Research (BIOfid). 2017–present. Funded by the DFG (FID 326061700).

BIOfid Publications

Andy Lücking, Christine Driller, Manuel Stoeckel, Giuseppe Abrami, Adrian Pachzelt and Alexander Mehler. 2021. Multiple Annotation for Biodiversity: Developing an annotation framework among biology, linguistics and text technology. Language Resources and Evaluation.
BibTeX
@article{Luecking:et:al:2021,
  author    = {Andy Lücking and Christine Driller and Manuel Stoeckel and Giuseppe Abrami
               and Adrian Pachzelt and Alexander Mehler},
  year      = {2021},
  journal   = {Language Resources and Evaluation},
  title     = {Multiple Annotation for Biodiversity: Developing an annotation
               framework among biology, linguistics and text technology},
  editor    = {Nancy Ide and Nicoletta Calzolari},
  doi       = {10.1007/s10579-021-09553-5},
  pdf       = {https://link.springer.com/content/pdf/10.1007/s10579-021-09553-5.pdf},
  keywords  = {biofid}
}
Sajawel Ahmed, Manuel Stoeckel, Christine Driller, Adrian Pachzelt and Alexander Mehler. 2019. BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 871–880.
BibTeX
@inproceedings{Ahmed:Stoeckel:Driller:Pachzelt:Mehler:2019,
  author    = {Sajawel Ahmed and Manuel Stoeckel and Christine Driller and Adrian Pachzelt
               and Alexander Mehler},
  title     = {{BIOfid Dataset: Publishing a German Gold Standard for Named Entity
               Recognition in Historical Biodiversity Literature}},
  publisher = {Association for Computational Linguistics},
  year      = {2019},
  booktitle = {Proceedings of the 23rd Conference on Computational Natural Language
               Learning (CoNLL)},
  address   = {Hong Kong, China},
  url       = {https://www.aclweb.org/anthology/K19-1081},
  doi       = {10.18653/v1/K19-1081},
  pages     = {871--880},
  abstract  = {The Specialized Information Service Biodiversity Research (BIOfid)
               has been launched to mobilize valuable biological data from printed
               literature hidden in German libraries for over the past 250 years.
               In this project, we annotate German texts converted by OCR from
               historical scientific literature on the biodiversity of plants,
               birds, moths and butterflies. Our work enables the automatic extraction
               of biological information previously buried in the mass of papers
               and volumes. For this purpose, we generated training data for
               the tasks of Named Entity Recognition (NER) and Taxa Recognition
               (TR) in biological documents. We use this data to train a number
               of leading machine learning tools and create a gold standard for
               TR in biodiversity literature. More specifically, we perform a
               practical analysis of our newly generated BIOfid dataset through
               various downstream-task evaluations and establish a new state
               of the art for TR with 80.23{\%} F-score. In this sense, our paper
               lays the foundations for future work in the field of information
               extraction in biology texts.},
  keywords  = {biofid}
}

Teaching

Courses

2025

Master Thesis: Can Adversarial Text Snippets Achieve Refusal Dimension Deletion? Manuel Schaaf and Alexander Mehler. (See Student Research Topics below for the full description.)

Winter Semester, 2024

Practical: Transformer-based Natural Language Processing. Alexander Mehler and Manuel Schaaf.
QIS / OLAT

Summer Semester, 2024

Lecture: NLP-gestützte Data Science (NLP-based Data Science). Alexander Mehler and Manuel Stoeckel.
QIS / OLAT
Practical: Transformer-based Natural Language Processing. Alexander Mehler and Manuel Stoeckel.
QIS / OLAT

Winter Semester, 2023

Lecture: Einführung Computational Humanities (Introduction to Computational Humanities). Alexander Mehler and Manuel Stoeckel.
QIS / OLAT
Practical: Transformer-based Natural Language Processing. Alexander Mehler and Manuel Stoeckel.
QIS / OLAT

Summer Semester, 2023

Lecture: NLP-gestützte Data Science (NLP-based Data Science). Alexander Mehler, Manuel Stoeckel and Giuseppe Abrami.
QIS / OLAT
Practical: Transformer-based Natural Language Processing. Alexander Mehler and Manuel Stoeckel.
QIS / OLAT

Student Research Topics

Keywords: deep learning, ethics, jailbreak, large language models, LLMs, safety

Master Thesis: Can Adversarial Text Snippets Achieve Refusal Dimension Deletion?
Description
The threat of abuse by determined adversaries makes the safety of public-facing LLMs a key priority for developers and researchers alike.
Despite intensive efforts, recent research shows that "refusal in language models [may be] mediated by a [one-dimensional subspace of the model's residual-stream activations]" (Arditi et al., 2024) and that adversarial algorithms can generate text snippets that circumvent harmful-response prevention in open- and closed-source LLMs (Zou et al., 2023). This raises the question of whether these two ways of "jailbreaking" LLMs align, i.e. whether adversarially generated text segments can shift a model's hidden states into a position that effectively approaches refusal dimension deletion; a minimal sketch of the activation-side operation follows below.

Corresponding Lab Members: Manuel Schaaf and Alexander Mehler.
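
For orientation, here is a minimal sketch (Python/PyTorch, under assumed settings) of the activation-side operation the thesis question refers to: estimate a "refusal direction" as a difference of hidden-state means between harmful and harmless prompts and project it out of a layer's activations, in the spirit of Arditi et al. (2024). The model name, layer index and prompts are illustrative placeholders, not the thesis setup; whether adversarial suffixes (Zou et al., 2023) can induce a comparable shift without editing activations is precisely the open question stated above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders: any causal LM exposing hidden states works; gpt2 is only a stand-in.
MODEL_NAME = "gpt2"
LAYER = 6  # hypothetical layer at which the refusal direction is estimated

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def mean_last_hidden(prompts, layer=LAYER):
    """Mean hidden state at the final token position over a set of prompts."""
    states = []
    for p in prompts:
        batch = tok(p, return_tensors="pt")
        out = model(**batch, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so index layer+1 is the
        # output of transformer block `layer`.
        states.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(states).mean(dim=0)

# Difference-in-means estimate of the refusal direction (illustrative prompts).
harmful = ["How do I pick a lock?"]
harmless = ["How do I bake a loaf of bread?"]
refusal_dir = mean_last_hidden(harmful) - mean_last_hidden(harmless)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    coeff = hidden @ direction              # projection coefficients
    return hidden - coeff[..., None] * direction

def hook(_module, _inputs, output):
    """Forward hook that ablates the refusal direction at the chosen layer."""
    h = output[0] if isinstance(output, tuple) else output
    h = ablate(h, refusal_dir)
    return (h,) + output[1:] if isinstance(output, tuple) else h

# GPT-2 block layout; other architectures name their layer modules differently.
handle = model.transformer.h[LAYER].register_forward_hook(hook)
# With the hook in place, model.generate(...) runs with the direction removed;
# handle.remove() restores the original behaviour.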


Publications

Total: 9

2024

Alexander Mehler, Mevlüt Bagci, Patrick Schrottenbacher, Alexander Henlein, Maxim Konca, Giuseppe Abrami, Kevin Bönisch, Manuel Stoeckel, Christian Spiekermann and Juliane Engel. 2024. Towards New Data Spaces for the Study of Multiple Documents with Va.Si.Li-Lab: A Conceptual Analysis. In: Students', Graduates' and Young Professionals' Critical Use of Online Information: Digital Performance Assessment and Training within and across Domains, 259–303. Ed. by Olga Zlatkin-Troitschanskaia, Marie-Theres Nagel, Verena Klose and Alexander Mehler. Springer Nature Switzerland.
BibTeX
@inbook{Mehler:et:al:2024:a,
  author    = {Mehler, Alexander and Bagci, Mevl{\"u}t and Schrottenbacher, Patrick
               and Henlein, Alexander and Konca, Maxim and Abrami, Giuseppe and B{\"o}nisch, Kevin
               and Stoeckel, Manuel and Spiekermann, Christian and Engel, Juliane},
  editor    = {Zlatkin-Troitschanskaia, Olga and Nagel, Marie-Theres and Klose, Verena
               and Mehler, Alexander},
  title     = {Towards New Data Spaces for the Study of Multiple Documents with
               Va.Si.Li-Lab: A Conceptual Analysis},
  booktitle = {Students', Graduates' and Young Professionals' Critical Use of
               Online Information: Digital Performance Assessment and Training
               within and across Domains},
  year      = {2024},
  publisher = {Springer Nature Switzerland},
  address   = {Cham},
  pages     = {259--303},
  abstract  = {The constitution of multiple documents has so far been studied
               essentially as a process in which a single learner consults a
               number (of segments) of different documents in the context of
               the task at hand in order to construct a mental model for the
               purpose of completing the task. As a result of this research focus,
               the constitution of multiple documents appears predominantly as
               a monomodal, non-interactive process in which mainly textual units
               are studied, supplemented by images, text-image relations and
               comparable artifacts. This approach is reflected in the contextual
               fixity of the research design, in which the learners under study
               search for information using suitably equipped computers. If,
               on the other hand, we consider the openness of multi-agent learning
               situations, this scenario lacks the aspects of interactivity,
               contextual openness and, above all, the multimodality of information
               objects, information processing and information exchange. This
               is where the chapter comes in. It describes Va.Si.Li-Lab as an
               instrument for multimodal measurement for studying and modeling
               multiple documents in the context of interactive learning in a
               multi-agent environment. To this end, the chapter places Va.Si.Li-Lab
               in the spectrum of evolutionary approaches that vary the combination
               of human and machine innovation and selection. It also combines
               the requirements of multimodal representational learning with
               various aspects of contextual plasticity to prepare Va.Si.Li-Lab
               as a system that can be used for experimental research. The chapter
               is conceptual in nature, designing a system of requirements using
               the example of Va.Si.Li-Lab to outline an experimental environment
               in which the study of Critical Online Reasoning (COR) as a group
               process becomes possible. Although the chapter illustrates some
               of these requirements with realistic data from the field of simulation-based
               learning, the focus is still conceptual rather than experimental,
               hypothesis-driven. That is, the chapter is concerned with the
               design of a technology for future research into COR processes.},
  isbn      = {978-3-031-69510-0},
  doi       = {10.1007/978-3-031-69510-0_12},
  url       = {https://doi.org/10.1007/978-3-031-69510-0_12}
}
Kevin Bönisch, Manuel Stoeckel and Alexander Mehler. 2024. HyperCausal: Visualizing Causal Inference in 3D Hypertext. Proceedings of the 35th ACM Conference on Hypertext and Social Media, 330–336.
BibTeX
@inproceedings{Boenisch:et:al:2024,
  author    = {B\"{o}nisch, Kevin and Stoeckel, Manuel and Mehler, Alexander},
  title     = {HyperCausal: Visualizing Causal Inference in 3D Hypertext},
  year      = {2024},
  isbn      = {9798400705953},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3648188.3677049},
  doi       = {10.1145/3648188.3677049},
  abstract  = {We present HyperCausal, a 3D hypertext visualization framework
               for exploring causal inference in generative Large Language Models
               (LLMs). HyperCausal maps the generative processes of LLMs into
               spatial hypertexts, where tokens are represented as nodes connected
               by probability-weighted edges. The edges are weighted by the prediction
               scores of next tokens, depending on the underlying language model.
               HyperCausal facilitates navigation through the causal space of
               the underlying LLM, allowing users to explore predicted word sequences
               and their branching. Through comparative analysis of LLM parameters
               such as token probabilities and search algorithms, HyperCausal
               provides insight into model behavior and performance. Implemented
               using the Hugging Face transformers library and Three.js, HyperCausal
               ensures cross-platform accessibility to advance research in natural
               language processing using concepts from hypertext research. We
               demonstrate several use cases of HyperCausal and highlight the
               potential for detecting hallucinations generated by LLMs using
               this framework. The connection with hypertext research arises
               from the fact that HyperCausal relies on user interaction to unfold
               graphs with hierarchically appearing branching alternatives in
               3D space. This approach refers to spatial hypertexts and early
               concepts of hierarchical hypertext structures. A third connection
               concerns hypertext fiction, since the branching alternatives mediated
               by HyperCausal manifest non-linearly organized reading threads
               along artificially generated texts that the user decides to follow
               optionally depending on the reading context.},
  booktitle = {Proceedings of the 35th ACM Conference on Hypertext and Social Media},
  pages     = {330--336},
  numpages  = {7},
  keywords  = {3D hypertext, large language models, visualization},
  location  = {Poznan, Poland},
  series    = {HT '24},
  video     = {https://www.youtube.com/watch?v=ANHFTupnKhI}
}

2022

Andy Lücking, Manuel Stoeckel, Giuseppe Abrami and Alexander Mehler. 2022. I still have Time(s): Extending HeidelTime for German Texts. Proceedings of the 13th Language Resources and Evaluation Conference.
BibTeX
@inproceedings{Luecking:Stoeckel:Abrami:Mehler:2022,
  author    = {L{\"u}cking, Andy and Stoeckel, Manuel and Abrami, Giuseppe and Mehler, Alexander},
  title     = {I still have Time(s): Extending {HeidelTime} for {German} Texts},
  booktitle = {Proceedings of the 13th Language Resources and Evaluation Conference},
  series    = {LREC 2022},
  location  = {Marseille, France},
  year      = {2022},
  url       = {https://aclanthology.org/2022.lrec-1.505},
  pdf       = {https://aclanthology.org/2022.lrec-1.505.pdf}
}

2021

Andy Lücking, Christine Driller, Manuel Stoeckel, Giuseppe Abrami, Adrian Pachzelt and Alexander Mehler. 2021. Multiple Annotation for Biodiversity: Developing an annotation framework among biology, linguistics and text technology. Language Resources and Evaluation.
BibTeX
@article{Luecking:et:al:2021,
  author    = {Andy Lücking and Christine Driller and Manuel Stoeckel and Giuseppe Abrami
               and Adrian Pachzelt and Alexander Mehler},
  year      = {2021},
  journal   = {Language Resources and Evaluation},
  title     = {Multiple Annotation for Biodiversity: Developing an annotation
               framework among biology, linguistics and text technology},
  editor    = {Nancy Ide and Nicoletta Calzolari},
  doi       = {10.1007/s10579-021-09553-5},
  pdf       = {https://link.springer.com/content/pdf/10.1007/s10579-021-09553-5.pdf},
  keywords  = {biofid}
}

2020

Giuseppe Abrami, Alexander Mehler and Manuel Stoeckel. 2020. TextAnnotator: A web-based annotation suite for texts. Proceedings of the Digital Humanities 2020.
BibTeX
@inproceedings{Abrami:Mehler:Stoeckel:2020,
  author    = {Abrami, Giuseppe and Mehler, Alexander and Stoeckel, Manuel},
  title     = {{TextAnnotator}: A web-based annotation suite for texts},
  booktitle = {Proceedings of the Digital Humanities 2020},
  series    = {DH 2020},
  location  = {Ottawa, Canada},
  year      = {2020},
  url       = {https://dh2020.adho.org/wp-content/uploads/2020/07/547_TextAnnotatorAwebbasedannotationsuitefortexts.html},
  doi       = {10.17613/tenm-4907},
  abstract  = {The TextAnnotator is a tool for simultaneous and collaborative
               annotation of texts with visual annotation support, integration
               of knowledge bases and, by pipelining the TextImager, a rich variety
               of pre-processing and automatic annotation tools. It includes
               a variety of modules for the annotation of texts, which contains
               the annotation of argumentative, rhetorical, propositional and
               temporal structures as well as a module for named entity linking
               and rapid annotation of named entities. Especially the modules
               for annotation of temporal, argumentative and propositional structures
               are currently unique in web-based annotation tools. The TextAnnotator,
               which allows the annotation of texts as a platform, is divided
               into a front- and a backend component. The backend is a web service
               based on WebSockets, which integrates the UIMA Database Interface
               to manage and use texts. Texts are made accessible by using the
               ResourceManager and the AuthorityManager, based on user and group
               access permissions. Different views of a document can be created
               and used depending on the scenario. Once a document has been opened,
               access is gained to the annotations stored within annotation views
               in which these are organized. Any annotation view can be assigned
               with access permissions and by default, each user obtains his
               or her own user view for every annotated document. In addition,
               with sufficient access permissions, all annotation views can also
               be used and curated. This allows the possibility to calculate
               an Inter-Annotator-Agreement for a document, which shows an agreement
               between the annotators. Annotators without sufficient rights cannot
               display this value so that the annotators do not influence each
               other. This contribution is intended to reflect the current state
               of development of TextAnnotator, demonstrate the possibilities
               of an instantaneous Inter-Annotator-Agreement and trigger a discussion
               about further functions for the community.},
  keywords  = {textannotator},
  poster    = {https://hcommons.org/deposits/download/hc:31816/CONTENT/dh2020_textannotator_poster.pdf}
}
Giuseppe Abrami, Manuel Stoeckel and Alexander Mehler. 2020. TextAnnotator: A UIMA Based Tool for the Simultaneous and Collaborative Annotation of Texts. Proceedings of The 12th Language Resources and Evaluation Conference, 891–900.
BibTeX
@inproceedings{Abrami:Stoeckel:Mehler:2020,
  author    = {Abrami, Giuseppe and Stoeckel, Manuel and Mehler, Alexander},
  title     = {TextAnnotator: A UIMA Based Tool for the Simultaneous and Collaborative
               Annotation of Texts},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {891--900},
  isbn      = {979-10-95546-34-4},
  abstract  = {The annotation of texts and other material in the field of digital
               humanities and Natural Language Processing (NLP) is a common task
               of research projects. At the same time, the annotation of corpora
               is certainly the most time- and cost-intensive component in research
               projects and often requires a high level of expertise according
               to the research interest. However, for the annotation of texts,
               a wide range of tools is available, both for automatic and manual
               annotation. Since the automatic pre-processing methods are not
               error-free and there is an increasing demand for the generation
               of training data, also with regard to machine learning, suitable
               annotation tools are required. This paper defines criteria of
               flexibility and efficiency of complex annotations for the assessment
               of existing annotation tools. To extend this list of tools, the
               paper describes TextAnnotator, a browser-based, multi-annotation
               system, which has been developed to perform platform-independent
               multimodal annotations and annotate complex textual structures.
               The paper illustrates the current state of development of TextAnnotator
               and demonstrates its ability to evaluate annotation quality (inter-annotator
               agreement) at runtime. In addition, it will be shown how annotations
               of different users can be performed simultaneously and collaboratively
               on the same document from different platforms using UIMA as the
               basis for annotation.},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.112},
  keywords  = {textannotator},
  pdf       = {http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.112.pdf}
}
Manuel Stoeckel, Alexander Henlein, Wahed Hemati and Alexander Mehler. May, 2020. Voting for POS tagging of Latin texts: Using the flair of FLAIR to better Ensemble Classifiers by Example of Latin. Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, 130–135.
BibTeX
@inproceedings{Stoeckel:et:al:2020,
  author    = {Stoeckel, Manuel and Henlein, Alexander and Hemati, Wahed and Mehler, Alexander},
  title     = {{Voting for POS tagging of Latin texts: Using the flair of FLAIR
               to better Ensemble Classifiers by Example of Latin}},
  booktitle = {Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies
               for Historical and Ancient Languages},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
  pages     = {130--135},
  abstract  = {Despite the great importance of the Latin language in the past,
               there are relatively few resources available today to develop
               modern NLP tools for this language. Therefore, the EvaLatin Shared
               Task for Lemmatization and Part-of-Speech (POS) tagging was published
               in the LT4HALA workshop. In our work, we dealt with the second
               EvaLatin task, that is, POS tagging. Since most of the available
               Latin word embeddings were trained on either few or inaccurate
               data, we trained several embeddings on better data in the first
               step. Based on these embeddings, we trained several state-of-the-art
               taggers and used them as input for an ensemble classifier called
               LSTMVoter. We were able to achieve the best results for both the
               cross-genre and the cross-time task (90.64\% and 87.00\%) without
               using additional annotated data (closed modality). In the meantime,
               we further improved the system and achieved even better results
               (96.91\% on classical, 90.87\% on cross-genre and 87.35\% on cross-time).},
  url       = {https://www.aclweb.org/anthology/2020.lt4hala-1.21},
  pdf       = {http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4HALA/pdf/2020.lt4hala-1.21.pdf}
}

2019

Manuel Stoeckel, Wahed Hemati and Alexander Mehler. November, 2019. When Specialization Helps: Using Pooled Contextualized Embeddings to Detect Chemical and Biomedical Entities in Spanish. Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 11–15.
BibTeX
@inproceedings{Stoeckel:Hemati:Mehler:2019,
  title     = {When Specialization Helps: Using Pooled Contextualized Embeddings
               to Detect Chemical and Biomedical Entities in {S}panish},
  author    = {Stoeckel, Manuel and Hemati, Wahed and Mehler, Alexander},
  booktitle = {Proceedings of The 5th Workshop on BioNLP Open Shared Tasks},
  month     = {nov},
  year      = {2019},
  address   = {Hong Kong, China},
  publisher = {Association for Computational Linguistics},
  url       = {https://www.aclweb.org/anthology/D19-5702},
  doi       = {10.18653/v1/D19-5702},
  pages     = {11--15},
  abstract  = {The recognition of pharmacological substances, compounds and proteins
               is an essential preliminary work for the recognition of relations
               between chemicals and other biomedically relevant units. In this
               paper, we describe an approach to Task 1 of the PharmaCoNER Challenge,
               which involves the recognition of mentions of chemicals and drugs
               in Spanish medical texts. We train a state-of-the-art BiLSTM-CRF
               sequence tagger with stacked Pooled Contextualized Embeddings,
               word and sub-word embeddings using the open-source framework FLAIR.
               We present a new corpus composed of articles and papers from Spanish
               health science journals, termed the Spanish Health Corpus, and
               use it to train domain-specific embeddings which we incorporate
               in our model training. We achieve a result of 89.76{\%} F1-score
               using pre-trained embeddings and are able to improve these results
               to 90.52{\%} F1-score using specialized embeddings.}
}
Sajawel Ahmed, Manuel Stoeckel, Christine Driller, Adrian Pachzelt and Alexander Mehler. 2019. BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 871–880.
BibTeX
@inproceedings{Ahmed:Stoeckel:Driller:Pachzelt:Mehler:2019,
  author    = {Sajawel Ahmed and Manuel Stoeckel and Christine Driller and Adrian Pachzelt
               and Alexander Mehler},
  title     = {{BIOfid Dataset: Publishing a German Gold Standard for Named Entity
               Recognition in Historical Biodiversity Literature}},
  publisher = {Association for Computational Linguistics},
  year      = {2019},
  booktitle = {Proceedings of the 23rd Conference on Computational Natural Language
               Learning (CoNLL)},
  address   = {Hong Kong, China},
  url       = {https://www.aclweb.org/anthology/K19-1081},
  doi       = {10.18653/v1/K19-1081},
  pages     = {871--880},
  abstract  = {The Specialized Information Service Biodiversity Research (BIOfid)
               has been launched to mobilize valuable biological data from printed
               literature hidden in German libraries for over the past 250 years.
               In this project, we annotate German texts converted by OCR from
               historical scientific literature on the biodiversity of plants,
               birds, moths and butterflies. Our work enables the automatic extraction
               of biological information previously buried in the mass of papers
               and volumes. For this purpose, we generated training data for
               the tasks of Named Entity Recognition (NER) and Taxa Recognition
               (TR) in biological documents. We use this data to train a number
               of leading machine learning tools and create a gold standard for
               TR in biodiversity literature. More specifically, we perform a
               practical analysis of our newly generated BIOfid dataset through
               various downstream-task evaluations and establish a new state
               of the art for TR with 80.23{\%} F-score. In this sense, our paper
               lays the foundations for future work in the field of information
               extraction in biology texts.},
  keywords  = {biofid}
}