When Specialization Helps: Using Pooled Contextualized Embeddings to Detect Chemical and Biomedical Entities in Spanish [1]
Presentation at the BioNLP-OST 2019 workshop, co-located with EMNLP-IJCNLP 2019
Monday, November 4, 2019, Hong Kong
Efficient access to information on chemicals and pharmaceutical units has become increasingly important for researchers in various chemical disciplines. Previous work has been successful in detecting and classifying chemical substances and in extracting complex relations between them. While most NLP research is conducted on English datasets, a considerable number of biomedically relevant texts, e.g. clinical texts, are written in other languages. To advance biomedical and pharmaceutical entity recognition in the face of this linguistic diversity, the PharmaCoNER task challenged participants with Named Entity Recognition (NER) for pharmacological substances, compounds and proteins on a Spanish corpus.
We presented our solution to the PharmaCoNER challenge at the BioNLP-OST 2019 workshop at EMNLP-IJCNLP 2019 in Hong Kong. Our team placed third out of 22 participating teams with an F1-score of 89.88%, using Pooled Contextualized Embeddings, word and sub-word embeddings, and a BiLSTM-CRF tagger.
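The core idea behind pooled contextualized embeddings can be illustrated with a small sketch: every contextual vector produced for a word is stored in a per-word memory, the memory is pooled (for example by mean, min, or max), and the pooled vector is concatenated with the current contextual vector. The code below is a toy illustration under these assumptions, not the FLAIR implementation or our actual model; the class name `PooledEmbeddingMemory` and the two-dimensional vectors are hypothetical and chosen only for readability.

```python
from collections import defaultdict
from typing import Dict, List

Vector = List[float]


class PooledEmbeddingMemory:
    """Toy per-word memory that pools all contextual vectors seen so far.

    Hypothetical helper for illustration only; real systems would use
    vectors from a trained language model.
    """

    def __init__(self, pooling: str = "mean") -> None:
        if pooling not in ("mean", "min", "max"):
            raise ValueError(f"unsupported pooling operation: {pooling}")
        self.pooling = pooling
        # Maps a word string to every contextual vector observed for it.
        self.memory: Dict[str, List[Vector]] = defaultdict(list)

    def _pool(self, vectors: List[Vector]) -> Vector:
        # Pool dimension by dimension across all stored vectors.
        columns = list(zip(*vectors))
        if self.pooling == "mean":
            return [sum(col) / len(col) for col in columns]
        if self.pooling == "min":
            return [min(col) for col in columns]
        return [max(col) for col in columns]

    def embed(self, word: str, contextual: Vector) -> Vector:
        # 1) Remember the new contextual vector for this word.
        self.memory[word].append(list(contextual))
        # 2) Pool everything seen so far for this word.
        pooled = self._pool(self.memory[word])
        # 3) Concatenate the current contextual vector with the pooled one.
        return list(contextual) + pooled


# Made-up contextual vectors for two occurrences of the same word.
memory = PooledEmbeddingMemory(pooling="mean")
first = memory.embed("paracetamol", [1.0, 3.0])
second = memory.embed("paracetamol", [3.0, 5.0])
print(first)   # [1.0, 3.0, 1.0, 3.0]
print(second)  # [3.0, 5.0, 2.0, 4.0]
```

The second occurrence carries information from the first one through the pooled half of the vector, which is why this kind of memory can help with entity mentions that appear in uninformative contexts.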
[Bibtex]
@inproceedings{Stoeckel:Hemati:Mehler:2019,
title = "When Specialization Helps: Using Pooled Contextualized Embeddings to Detect Chemical and Biomedical Entities in {S}panish",
author = "Stoeckel, Manuel and Hemati, Wahed and Mehler, Alexander",
booktitle = "Proceedings of The 5th Workshop on BioNLP Open Shared Tasks",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/D19-5702",
doi = "10.18653/v1/D19-5702",
pages = "11--15",
abstract = "The recognition of pharmacological substances, compounds and proteins is an essential preliminary work for the recognition of relations between chemicals and other biomedically relevant units. In this paper, we describe an approach to Task 1 of the PharmaCoNER Challenge, which involves the recognition of mentions of chemicals and drugs in Spanish medical texts. We train a state-of-the-art BiLSTM-CRF sequence tagger with stacked Pooled Contextualized Embeddings, word and sub-word embeddings using the open-source framework FLAIR. We present a new corpus composed of articles and papers from Spanish health science journals, termed the Spanish Health Corpus, and use it to train domain-specific embeddings which we incorporate in our model training. We achieve a result of 89.76{\%} F1-score using pre-trained embeddings and are able to improve these results to 90.52{\%} F1-score using specialized embeddings.",
}