Resource-Size matters! Neural Named Entity Recognition for German (in 2018)

TTLab improved the performance of neural named entity recognition (NER) for German by up to 11% F-score, closing the gap to high-resource languages such as English and establishing a new state of the art on each of the open-source datasets considered (CoNLL 2003, GermEval 2014 and Tübingen Treebank 2018). Rather than designing deeper and wider hybrid neural architectures, we performed a detailed optimization and applied grammar-dependent morphological processing consisting of lemmatization and part-of-speech (POS) tagging.

Download pre-trained Word Embeddings for German

For the NER task within the BIOfid project, we trained word embeddings with wang2vec, an algorithmic extension of word2vec, on two large German text collections: the extended Leipzig corpus (60 million sentences) and the COW corpus (600 million sentences). We provide the pre-trained embeddings to the research community under the CC BY license, in all of their variants, each of which played a crucial role in boosting performance on the downstream task. The embeddings are targeted at NER but are not limited to it and can be used for other NLP tasks as well.

Processing Method   | Leipzig Corpus 2018         | COW Corpus 2016
Token               | LeipzigMT.lower.wang2vec    | COW.lower.wang2vec
Lemmatization       | LeipzigMT.lemma.wang2vec    | COW.lemma.wang2vec
Lemmatization & POS | LeipzigMT.lemmapos.wang2vec | COW.lemmapos.wang2vec

Further information on the text corpora, the choice of parameters and the processing methods can be found in [1], or on our GitHub repository. Please cite this study if you happen to use our embeddings in your work. In case of further questions, do not hesitate to contact Sajawel Ahmed.
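As a rough illustration of how a downloaded file might be read, here is a minimal sketch in Python. It assumes the files use the plain-text word2vec layout (a header line with vocabulary size and dimension, then one token and its vector per line); this page does not state the file format, so please check it before relying on the code. The function names read_word2vec_text and cosine are our own illustrative helpers, not part of wang2vec, and the three toy vectors are made up for the demonstration.

```python
import io
import math
from typing import Dict, Iterable, List


def read_word2vec_text(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse embeddings stored in the plain-text word2vec layout (assumed)."""
    rows = iter(lines)
    # Header line: "<vocabulary size> <vector dimension>".
    _vocab_size, dim = (int(field) for field in next(rows).split())
    vectors: Dict[str, List[float]] = {}
    for row in rows:
        fields = row.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(fields) - 1}")
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data standing in for a real file (values are invented).
sample = io.StringIO("3 2\nberlin 1.0 0.0\nhamburg 0.8 0.6\napfel 0.0 1.0\n")
toy = read_word2vec_text(sample)
print(round(cosine(toy["berlin"], toy["hamburg"]), 3))

# With a downloaded file, assuming it is UTF-8 text in the layout above:
# with open("COW.lower.wang2vec", encoding="utf-8") as handle:
#     vectors = read_word2vec_text(handle)
```

The file names containing "lower" suggest that tokens were lowercased before training, so lowercasing a query word before the lookup is probably sensible; the key format of the lemma and POS variants is not described here and should be inspected in the files themselves.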

Reference

[1] S. Ahmed and A. Mehler, “Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora,” in Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018 (accepted). PDF: https://arxiv.org/pdf/1807.10675.pdf
[BibTeX]
@InProceedings{Ahmed:Mehler:2018,
author = {Sajawel Ahmed and Alexander Mehler},
title = {{Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora}},
booktitle = {Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA)},
note = {accepted},
location = {Orlando, Florida, USA},
pdf = {https://arxiv.org/pdf/1807.10675.pdf},
year = 2018
}