Manuel Stoeckel

Scientific Assistant

Goethe-Universität Frankfurt am Main
Robert-Mayer-Straße 10
Room 401b
D-60325 Frankfurt am Main
D-60054 Frankfurt am Main (use for package delivery)
Postfach / P.O. Box: 154
Phone:
Mail:

Office Hours: Wednesday, 10 AM-12 PM

Teaching

2024

Lecture: NLP-gestützte Data Science. Alexander Mehler and Manuel Stoeckel.
Practical: Transformer-based Natural Language Processing. Alexander Mehler and Manuel Stoeckel.

2023

Lecture: Einführung Computational Humanities. Alexander Mehler and Manuel Stoeckel.
Practical: Transformer-based Natural Language Processing. Alexander Mehler and Manuel Stoeckel.
Lecture: NLP-gestützte Data Science. Alexander Mehler, Manuel Stoeckel and Giuseppe Abrami.
Practical: Transformer-based Natural Language Processing. Alexander Mehler and Manuel Stoeckel.

Projects

The specialised information service BIOfid (www.biofid.de) is oriented towards the special needs of scientists researching biodiversity topics at research institutions and in natural history collections. Since 2017, BIOfid has been building an infrastructure that contributes to the provision and mobilisation of research-relevant data in a variety of ways in the context of current developments in biodiversity research.

Specialised Information Service Biodiversity Research (BIOfid). 2017 – . Funded by DFG (FID 326061700).
BibTeX
@project{biofid,
  name      = {Specialised Information Service Biodiversity Research (BIOfid)},
  abstract  = {The specialised information service BIOfid (www.biofid.de) is
               oriented towards the special needs of scientists researching biodiversity
               topics at research institutions and in natural history collections.
               Since 2017, BIOfid has been building an infrastructure that contributes
               to the provision and mobilisation of research-relevant data in
               a variety of ways in the context of current developments in biodiversity
               research.},
  year      = {2017},
  funded_by = {DFG (FID 326061700)},
  funded_by_url = {https://gepris.dfg.de/gepris/projekt/326061700},
  url       = {https://www.biofid.de/en/},
  repository = {https://github.com/FID-Biodiversity},
  logo      = {/wp-content/uploads/2024/01/logo-BIOfid.png}
}
BIOfid Publications
Sajawel Ahmed, Manuel Stoeckel, Christine Driller, Adrian Pachzelt and Alexander Mehler. 2019. BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 871–880.
Andy Lücking, Christine Driller, Manuel Stoeckel, Giuseppe Abrami, Adrian Pachzelt and Alexander Mehler. 2021. Multiple Annotation for Biodiversity: Developing an annotation framework among biology, linguistics and text technology. Language Resources and Evaluation.

Publications

Total: 8

2024

Kevin Bönisch, Manuel Stoeckel and Alexander Mehler. 2024. HyperCausal: Visualizing Causal Inference in 3D Hypertext. Proceedings of the 35th ACM Conference on Hypertext and Social Media. accepted.
BibTeX
@inproceedings{Boenisch:et:al:2024,
  author    = {B\"{o}nisch, Kevin and Stoeckel, Manuel and Mehler, Alexander},
  abstract  = {We present HyperCausal, a 3D hypertext visualization framework
               for exploring causal inference in generative Large Language Models
               (LLMs). HyperCausal maps the generative processes of LLMs into
               spatial hypertexts, where tokens are represented as nodes connected
               by probability-weighted edges. The edges are weighted by the prediction
               scores of next tokens, depending on the underlying language model.
               HyperCausal facilitates navigation through the causal space of
               the underlying LLM, allowing users to explore predicted word sequences
               and their branching. Through comparative analysis of LLM parameters
               such as token probabilities and search algorithms, HyperCausal
               provides insight into model behavior and performance. Implemented
               using the Hugging Face transformers library and Three.js, HyperCausal
               ensures cross-platform accessibility to advance research in natural
               language processing using concepts from hypertext research. We
               demonstrate several use cases of HyperCausal and highlight the
               potential for detecting hallucinations generated by LLMs using
               this framework. The connection with hypertext research arises
               from the fact that HyperCausal relies on user interaction to unfold
               graphs with hierarchically appearing branching alternatives in
               3D space. This approach refers to spatial hypertexts and early
               concepts of hierarchical hypertext structures. A third connection
               concerns hypertext fiction, since the branching alternatives mediated
               by HyperCausal manifest non-linearly organized reading threads
               along artificially generated texts that the user decides to follow
               optionally depending on the reading context.},
  title     = {HyperCausal: Visualizing Causal Inference in 3D Hypertext},
  year      = {2024},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  booktitle = {Proceedings of the 35th ACM Conference on Hypertext and Social Media},
  keywords  = {3D hypertext, large language models, visualization},
  location  = {Poznan, Poland},
  series    = {HT '24},
  note      = {accepted}
}

2022

Andy Lücking, Manuel Stoeckel, Giuseppe Abrami and Alexander Mehler. June, 2022. I still have Time(s): Extending HeidelTime for German Texts. Proceedings of the Language Resources and Evaluation Conference, 4723–4728.
BibTeX
@inproceedings{Luecking:Stoeckel:Abrami:Mehler:2022,
  author    = {L\"{u}cking, Andy and Stoeckel, Manuel and Abrami, Giuseppe and Mehler, Alexander},
  title     = {I still have Time(s): Extending HeidelTime for German Texts},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {4723--4728},
  abstract  = {HeidelTime is one of the most widespread and successful tools
               for detecting temporal expressions in texts. Since HeidelTime’s
               pattern matching system is based on regular expression, it can
               be extended in a convenient way. We present such an extension
               for the German resources of HeidelTime: HeidelTimeExt. The extension
               has been brought about by means of observing false negatives within
               real world texts and various time banks. The gain in coverage
               is 2.7 \% or 8.5 \%, depending on the admitted degree of potential
               overgeneralization. We describe the development of HeidelTimeExt,
               its evaluation on text samples from various genres, and share
               some linguistic observations. HeidelTimeExt can be obtained from
               https://github.com/texttechnologylab/heideltime.},
  poster    = {https://www.texttechnologylab.org/wp-content/uploads/2022/06/HeidelTimeExt_LREC_2022.pdf},
  pdf       = {http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.505.pdf}
}

2021

Andy Lücking, Christine Driller, Manuel Stoeckel, Giuseppe Abrami, Adrian Pachzelt and Alexander Mehler. 2021. Multiple Annotation for Biodiversity: Developing an annotation framework among biology, linguistics and text technology. Language Resources and Evaluation.
BibTeX
@article{Luecking:et:al:2021,
  author    = {Andy Lücking and Christine Driller and Manuel Stoeckel and Giuseppe Abrami
               and Adrian Pachzelt and Alexander Mehler},
  year      = {2021},
  journal   = {Language Resources and Evaluation},
  title     = {Multiple Annotation for Biodiversity: Developing an annotation
               framework among biology, linguistics and text technology},
  editor    = {Nancy Ide and Nicoletta Calzolari},
  doi       = {10.1007/s10579-021-09553-5},
  pdf       = {https://link.springer.com/content/pdf/10.1007/s10579-021-09553-5.pdf},
  keywords  = {biofid}
}

2020

Manuel Stoeckel, Alexander Henlein, Wahed Hemati and Alexander Mehler. May, 2020. Voting for POS tagging of Latin texts: Using the flair of FLAIR to better Ensemble Classifiers by Example of Latin. Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages, 130–135.
BibTeX
@inproceedings{Stoeckel:et:al:2020,
  author    = {Stoeckel, Manuel and Henlein, Alexander and Hemati, Wahed and Mehler, Alexander},
  title     = {{Voting for POS tagging of Latin texts: Using the flair of FLAIR
               to better Ensemble Classifiers by Example of Latin}},
  booktitle = {Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies
               for Historical and Ancient Languages},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association (ELRA)},
  pages     = {130--135},
  abstract  = {Despite the great importance of the Latin language in the past,
               there are relatively few resources available today to develop
               modern NLP tools for this language. Therefore, the EvaLatin Shared
               Task for Lemmatization and Part-of-Speech (POS) tagging was published
               in the LT4HALA workshop. In our work, we dealt with the second
               EvaLatin task, that is, POS tagging. Since most of the available
               Latin word embeddings were trained on either few or inaccurate
               data, we trained several embeddings on better data in the first
               step. Based on these embeddings, we trained several state-of-the-art
               taggers and used them as input for an ensemble classifier called
               LSTMVoter. We were able to achieve the best results for both the
               cross-genre and the cross-time task (90.64\% and 87.00\%) without
               using additional annotated data (closed modality). In the meantime,
               we further improved the system and achieved even better results
               (96.91\% on classical, 90.87\% on cross-genre and 87.35\% on cross-time).},
  url       = {https://www.aclweb.org/anthology/2020.lt4hala-1.21},
  pdf       = {http://www.lrec-conf.org/proceedings/lrec2020/workshops/LT4HALA/pdf/2020.lt4hala-1.21.pdf}
}
Giuseppe Abrami, Manuel Stoeckel and Alexander Mehler. 2020. TextAnnotator: A UIMA Based Tool for the Simultaneous and Collaborative Annotation of Texts. Proceedings of The 12th Language Resources and Evaluation Conference, 891–900.
BibTeX
@inproceedings{Abrami:Stoeckel:Mehler:2020,
  author    = {Abrami, Giuseppe and Stoeckel, Manuel and Mehler, Alexander},
  title     = {TextAnnotator: A UIMA Based Tool for the Simultaneous and Collaborative
               Annotation of Texts},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {891--900},
  isbn      = {979-10-95546-34-4},
  abstract  = {The annotation of texts and other material in the field of digital
               humanities and Natural Language Processing (NLP) is a common task
               of research projects. At the same time, the annotation of corpora
               is certainly the most time- and cost-intensive component in research
               projects and often requires a high level of expertise according
               to the research interest. However, for the annotation of texts,
               a wide range of tools is available, both for automatic and manual
               annotation. Since the automatic pre-processing methods are not
               error-free and there is an increasing demand for the generation
               of training data, also with regard to machine learning, suitable
               annotation tools are required. This paper defines criteria of
               flexibility and efficiency of complex annotations for the assessment
               of existing annotation tools. To extend this list of tools, the
               paper describes TextAnnotator, a browser-based, multi-annotation
               system, which has been developed to perform platform-independent
               multimodal annotations and annotate complex textual structures.
               The paper illustrates the current state of development of TextAnnotator
               and demonstrates its ability to evaluate annotation quality (inter-annotator
               agreement) at runtime. In addition, it will be shown how annotations
               of different users can be performed simultaneously and collaboratively
               on the same document from different platforms using UIMA as the
               basis for annotation.},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.112},
  keywords  = {textannotator},
  pdf       = {http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.112.pdf}
}
Giuseppe Abrami, Alexander Mehler and Manuel Stoeckel. 2020. TextAnnotator: A web-based annotation suite for texts. Proceedings of the Digital Humanities 2020.
BibTeX
@inproceedings{Abrami:Mehler:Stoeckel:2020,
  author    = {Abrami, Giuseppe and Mehler, Alexander and Stoeckel, Manuel},
  title     = {{TextAnnotator}: A web-based annotation suite for texts},
  booktitle = {Proceedings of the Digital Humanities 2020},
  series    = {DH 2020},
  location  = {Ottawa, Canada},
  year      = {2020},
  url       = {https://dh2020.adho.org/wp-content/uploads/2020/07/547_TextAnnotatorAwebbasedannotationsuitefortexts.html},
  doi       = {10.17613/tenm-4907},
  abstract  = {The TextAnnotator is a tool for simultaneous and collaborative
               annotation of texts with visual annotation support, integration
               of knowledge bases and, by pipelining the TextImager, a rich variety
               of pre-processing and automatic annotation tools. It includes
               a variety of modules for the annotation of texts, which contains
               the annotation of argumentative, rhetorical, propositional and
               temporal structures as well as a module for named entity linking
               and rapid annotation of named entities. Especially the modules
               for annotation of temporal, argumentative and propositional structures
               are currently unique in web-based annotation tools. The TextAnnotator,
               which allows the annotation of texts as a platform, is divided
               into a front- and a backend component. The backend is a web service
               based on WebSockets, which integrates the UIMA Database Interface
               to manage and use texts. Texts are made accessible by using the
               ResourceManager and the AuthorityManager, based on user and group
               access permissions. Different views of a document can be created
               and used depending on the scenario. Once a document has been opened,
               access is gained to the annotations stored within annotation views
               in which these are organized. Any annotation view can be assigned
               with access permissions and by default, each user obtains his
               or her own user view for every annotated document. In addition,
               with sufficient access permissions, all annotation views can also
               be used and curated. This allows the possibility to calculate
               an Inter-Annotator-Agreement for a document, which shows an agreement
               between the annotators. Annotators without sufficient rights cannot
               display this value so that the annotators do not influence each
               other. This contribution is intended to reflect the current state
               of development of TextAnnotator, demonstrate the possibilities
               of an instantaneous Inter-Annotator-Agreement and trigger a discussion
               about further functions for the community.},
  keywords  = {textannotator},
  poster    = {https://hcommons.org/deposits/download/hc:31816/CONTENT/dh2020_textannotator_poster.pdf}
}

2019

Sajawel Ahmed, Manuel Stoeckel, Christine Driller, Adrian Pachzelt and Alexander Mehler. 2019. BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature. Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), 871–880.
BibTeX
@inproceedings{Ahmed:Stoeckel:Driller:Pachzelt:Mehler:2019,
  author    = {Sajawel Ahmed and Manuel Stoeckel and Christine Driller and Adrian Pachzelt
               and Alexander Mehler},
  title     = {{BIOfid Dataset: Publishing a German Gold Standard for Named Entity
               Recognition in Historical Biodiversity Literature}},
  publisher = {Association for Computational Linguistics},
  year      = {2019},
  booktitle = {Proceedings of the 23rd Conference on Computational Natural Language
               Learning (CoNLL)},
  address   = {Hong Kong, China},
  url       = {https://www.aclweb.org/anthology/K19-1081},
  doi       = {10.18653/v1/K19-1081},
  pages     = {871--880},
  abstract  = {The Specialized Information Service Biodiversity Research (BIOfid)
               has been launched to mobilize valuable biological data from printed
               literature hidden in German libraries for over the past 250 years.
               In this project, we annotate German texts converted by OCR from
               historical scientific literature on the biodiversity of plants,
               birds, moths and butterflies. Our work enables the automatic extraction
               of biological information previously buried in the mass of papers
               and volumes. For this purpose, we generated training data for
               the tasks of Named Entity Recognition (NER) and Taxa Recognition
               (TR) in biological documents. We use this data to train a number
               of leading machine learning tools and create a gold standard for
               TR in biodiversity literature. More specifically, we perform a
               practical analysis of our newly generated BIOfid dataset through
               various downstream-task evaluations and establish a new state
               of the art for TR with 80.23{\%} F-score. In this sense, our paper
               lays the foundations for future work in the field of information
               extraction in biology texts.},
  keywords  = {biofid}
}
Manuel Stoeckel, Wahed Hemati and Alexander Mehler. November, 2019. When Specialization Helps: Using Pooled Contextualized Embeddings to Detect Chemical and Biomedical Entities in Spanish. Proceedings of The 5th Workshop on BioNLP Open Shared Tasks, 11–15.
BibTeX
@inproceedings{Stoeckel:Hemati:Mehler:2019,
  title     = {When Specialization Helps: Using Pooled Contextualized Embeddings
               to Detect Chemical and Biomedical Entities in {S}panish},
  author    = {Stoeckel, Manuel and Hemati, Wahed and Mehler, Alexander},
  booktitle = {Proceedings of The 5th Workshop on BioNLP Open Shared Tasks},
  month     = nov,
  year      = {2019},
  address   = {Hong Kong, China},
  publisher = {Association for Computational Linguistics},
  url       = {https://www.aclweb.org/anthology/D19-5702},
  doi       = {10.18653/v1/D19-5702},
  pages     = {11--15},
  abstract  = {The recognition of pharmacological substances, compounds and proteins
               is an essential preliminary work for the recognition of relations
               between chemicals and other biomedically relevant units. In this
               paper, we describe an approach to Task 1 of the PharmaCoNER Challenge,
               which involves the recognition of mentions of chemicals and drugs
               in Spanish medical texts. We train a state-of-the-art BiLSTM-CRF
               sequence tagger with stacked Pooled Contextualized Embeddings,
               word and sub-word embeddings using the open-source framework FLAIR.
               We present a new corpus composed of articles and papers from Spanish
               health science journals, termed the Spanish Health Corpus, and
               use it to train domain-specific embeddings which we incorporate
               in our model training. We achieve a result of 89.76{\%} F1-score
               using pre-trained embeddings and are able to improve these results
               to 90.52{\%} F1-score using specialized embeddings.}
}