TTLab improved the performance of neural named entity recognition (NER) by a margin of up to 11% F-score for German, thus closing the gap to high-resource languages such as English and establishing a new state of the art on every open-source dataset (CoNLL 2003, GermEval 2014 and Tübingen Treebank 2018). Rather than designing deeper and wider hybrid neural architectures, we performed a detailed optimization and grammar-dependent morphological processing consisting of lemmatization and part-of-speech (POS) tagging.
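To make the lemma- and POS-based input variants concrete, here is a minimal sketch of this kind of preprocessing. The paper does not prescribe a specific tool here; the use of spaCy's German model and the `lemma_POS` join format are assumptions for illustration, chosen to mirror the embedding variants listed in the table below.

```python
# Minimal sketch of the morphological preprocessing (lemmatization + POS),
# assuming spaCy's German model as tagger/lemmatizer; the original pipeline
# may have used different tools and tag sets.
# Setup: pip install spacy && python -m spacy download de_core_news_sm
import spacy

nlp = spacy.load("de_core_news_sm")

def preprocess(sentence: str, mode: str = "lemmapos") -> str:
    """Map a raw sentence to one of three embedding input variants:
    'token' (lowercased tokens), 'lemma', or 'lemmapos' (lemma_POS)."""
    doc = nlp(sentence)
    if mode == "token":
        out = [tok.text.lower() for tok in doc]
    elif mode == "lemma":
        out = [tok.lemma_ for tok in doc]
    else:  # 'lemmapos': concatenate lemma and POS tag (assumed format)
        out = [f"{tok.lemma_}_{tok.pos_}" for tok in doc]
    return " ".join(out)

print(preprocess("Die Vögel fliegen über den Main."))
# e.g. 'der_DET Vogel_NOUN fliegen_VERB ...' (exact lemmas depend on the model)
```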
Download pre-trained Word Embeddings for German
For the task of NER within the BIOfid project, we trained the word embeddings with an algorithmic extension of word2vec (i.e. wang2vec) on two large German text collections, namely the extended Leipzig corpus (60 million sentences) and the COW corpus (600 million sentences). Here, we provide the pre-trained embeddings in all their variants (under the CC-BY-4.0 license) for the research community; each variant played a crucial role in boosting the performance of the given downstream task. The embeddings are targeted at NER but are not limited to it and can be used for any other NLP task as well; a loading sketch follows the table below.
| Processing Method   | Leipzig Corpus 2018        | COW Corpus 2016       |
|---------------------|----------------------------|-----------------------|
| Token               | LeipzigMT.lower.wang2vec   | COW.lower.wang2vec    |
| Lemmatization       | LeipzigMT.lemma.wang2vec   | COW.lemma.wang2vec    |
| Lemmatization & POS | LeipzigMT.lemmapos.wang2vec | COW.lemmapos.wang2vec |
Further information on the text corpora, the choice of parameters and the processing methods can be found in [1] or in our GitHub repository. Please cite this study if you use our embeddings in your work. For further questions, do not hesitate to contact Sajawel Ahmed.
Reference
[1] Sajawel Ahmed and Alexander Mehler. "Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora." In Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, Florida, USA, 2018. [PDF](https://arxiv.org/pdf/1807.10675.pdf)
@InProceedings{Ahmed:Mehler:2018,
  author    = {Sajawel Ahmed and Alexander Mehler},
  title     = {{Resource-Size matters: Improving Neural Named Entity Recognition with Optimized Large Corpora}},
  abstract  = {This study improves the performance of neural named entity recognition by a margin of up to 11% in terms of F-score on the example of a low-resource language like German, thereby outperforming existing baselines and establishing a new state-of-the-art on each single open-source dataset (CoNLL 2003, GermEval 2014 and Tübingen Treebank 2018). Rather than designing deeper and wider hybrid neural architectures, we gather all available resources and perform a detailed optimization and grammar-dependent morphological processing consisting of lemmatization and part-of-speech tagging prior to exposing the raw data to any training process. We test our approach in a threefold monolingual experimental setup of a) single, b) joint, and c) optimized training and shed light on the dependency of downstream-tasks on the size of corpora used to compute word embeddings.},
  booktitle = {Proceedings of the 17th IEEE International Conference on Machine Learning and Applications (ICMLA)},
  location  = {Orlando, Florida, USA},
  pdf       = {https://arxiv.org/pdf/1807.10675.pdf},
  year      = {2018}
}