Publication

New Publication Accepted at Digital Humanities 2024

The following publication has been accepted at the Digital Humanities Conference 2024 (DH 2024) in Washington, D.C.:

Efficient, uniform and scalable parallel NLP pre-processing with DUUI: Perspectives and Best Practice for the Digital Humanities

Giuseppe Abrami and Alexander Mehler. August, 2024. Efficient, uniform and scalable parallel NLP pre-processing with DUUI: Perspectives and Best Practice for the Digital Humanities. Digital Humanities Conference 2024 (DH 2024). accepted.
BibTeX
@inproceedings{Abrami:Mehler:2024,
  author    = {Abrami, Giuseppe and Mehler, Alexander},
  title     = {Efficient, uniform and scalable parallel NLP pre-processing with
               DUUI: Perspectives and Best Practice for the Digital Humanities},
  year      = {2024},
  month     = {08},
  booktitle = {Digital Humanities Conference 2024 (DH 2024)},
  location  = {Washington, DC, USA},
  series    = {DH},
  keywords  = {duui},
  note      = {accepted}
}
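
DUUI (Docker Unified UIMA Interface) standardizes NLP pre-processing by wrapping analysis components and running them uniformly and in parallel over document collections. As a rough illustration of the underlying idea only, here is a minimal Python sketch of fanning one pipeline out over many documents; it is not DUUI's actual (Java/UIMA-based) API, and the corpus folder and toy token-counting pipeline are hypothetical:

Python
from multiprocessing import Pool
from pathlib import Path

def preprocess(path):
    """Hypothetical per-document pipeline: read the text and count tokens."""
    text = path.read_text(encoding="utf-8")
    tokens = text.split()  # stand-in for real tokenizer/tagger/parser components
    return path.name, len(tokens)

if __name__ == "__main__":
    docs = sorted(Path("corpus").glob("*.txt"))  # hypothetical corpus folder
    with Pool(processes=8) as pool:              # scale workers to the machine
        for name, n_tokens in pool.imap_unordered(preprocess, docs):
            print(f"{name}: {n_tokens} tokens")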

New Publications Accepted at LREC-COLING 2024

The following publications were accepted at LREC-COLING 2024 in Turin, Italy:

Dependencies over Times and Tools (DoTT)

Andy Lücking, Giuseppe Abrami, Leon Hammerla, Marc Rahn, Daniel Baumartz, Steffen Eger and Alexander Mehler. May, 2024. Dependencies over Times and Tools (DoTT). Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 4641–4653.
BibTeX
@inproceedings{Luecking:et:al:2024,
  abstract  = {Purpose: Based on the examples of English and German, we investigate
               to what extent parsers trained on modern variants of these languages
               can be transferred to older language levels without loss. Methods:
               We developed a treebank called DoTT (https://github.com/texttechnologylab/DoTT)
               which covers, roughly, the time period from 1800 until today,
               in conjunction with the further development of the annotation
               tool DependencyAnnotator. DoTT consists of a collection of diachronic
               corpora enriched with dependency annotations using 3 parsers,
               6 pre-trained language models, 5 newly trained models for German,
               and two tag sets (TIGER and Universal Dependencies). To assess
               how the different parsers perform on texts from different time
               periods, we created a gold standard sample as a benchmark. Results:
               We found that the parsers/models perform quite well on modern
               texts (document-level LAS ranging from 82.89 to 88.54) and slightly
               worse on older texts, as expected (average document-level LAS
               84.60 vs. 86.14), but not significantly. For German texts, the
               (German) TIGER scheme achieved slightly better results than UD.
               Conclusion: Overall, this result speaks for the transferability
               of parsers to past language levels, at least dating back to
               around 1800. This very transferability, it is however argued,
               means that studies of language change in the field of dependency
               syntax can draw on dependency distance but miss out on some grammatical
               phenomena.},
  address   = {Torino, Italy},
  author    = {L{\"u}cking, Andy and Abrami, Giuseppe and Hammerla, Leon and Rahn, Marc
               and Baumartz, Daniel and Eger, Steffen and Mehler, Alexander},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational
               Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  editor    = {Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro
               and Sakti, Sakriani and Xue, Nianwen},
  month     = {may},
  pages     = {4641--4653},
  publisher = {ELRA and ICCL},
  title     = {Dependencies over Times and Tools ({D}o{TT})},
  url       = {https://aclanthology.org/2024.lrec-main.415},
  poster    = {https://www.texttechnologylab.org/wp-content/uploads/2024/05/LREC_2024_Poster_DoTT.pdf},
  year      = {2024}
}
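
The abstract reports document-level LAS (labeled attachment score) values, i.e. the share of tokens whose predicted head and dependency relation both match the gold standard. For readers unfamiliar with the metric, here is a minimal Python sketch computing it from a pair of CoNLL-U files; the file names are placeholders and the reader is simplified (multiword token ranges and empty nodes are skipped):

Python
def read_conllu(path):
    """Return one (head, deprel) pair per token line of a CoNLL-U file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.startswith("#") or not line.strip():
                continue                      # skip comments and blank lines
            cols = line.rstrip("\n").split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue                      # skip multiword tokens / empty nodes
            pairs.append((cols[6], cols[7]))  # HEAD and DEPREL columns
    return pairs

def las(gold_path, pred_path):
    gold, pred = read_conllu(gold_path), read_conllu(pred_path)
    assert len(gold) == len(pred), "tokenizations must align"
    return 100.0 * sum(g == p for g, p in zip(gold, pred)) / len(gold)

print(f"document-level LAS: {las('gold.conllu', 'pred.conllu'):.2f}")

Averaging such per-document scores over the texts of a time slice yields document-level averages like the 84.60 vs. 86.14 reported in the abstract.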

German SRL: Corpus Construction and Model Training

Maxim Konca, Andy Lücking and Alexander Mehler. May, 2024. German SRL: Corpus Construction and Model Training. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 7717–7727.
BibTeX
@inproceedings{Konca:et:al:2024,
  abstract  = {A useful semantic role-annotated resource for training semantic
               role models for the German language has been missing. We point
               out some problems of previous resources and provide a new one
               via a combined translation and alignment process: the gold-standard
               CoNLL-2012 semantic role annotations are translated into German,
               and the semantic role labels are transferred by means of alignment
               models. The resulting dataset is used to train a German semantic
               role model. With F1-scores around 0.7, the major roles achieve
               competitive evaluation scores while avoiding limitations of previous
               approaches. The described procedure can be applied to other
               languages as well.},
  address   = {Torino, Italy},
  author    = {Konca, Maxim and L{\"u}cking, Andy and Mehler, Alexander},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational
               Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  editor    = {Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro
               and Sakti, Sakriani and Xue, Nianwen},
  month     = {may},
  pages     = {7717--7727},
  publisher = {ELRA and ICCL},
  title     = {{G}erman {SRL}: Corpus Construction and Model Training},
  url       = {https://aclanthology.org/2024.lrec-main.682},
  poster    = {https://www.texttechnologylab.org/wp-content/uploads/2024/05/LREC_2024_Poster_GERMAN_SRL.pdf},
  year      = {2024}
}
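
The combined translation and alignment process described in the abstract rests on annotation projection: role labels on source-language tokens are carried over to their aligned target-language translations. Here is a minimal Python sketch of that projection step; the toy sentence pair, label inventory and one-to-one alignment are illustrative assumptions, and the authors' actual alignment models may handle spans and one-to-many alignments differently:

Python
def project_roles(src_labels, alignment, tgt_len):
    """Copy each source token's role label to its aligned target token."""
    tgt_labels = ["O"] * tgt_len                # "O" = no role
    for src_i, tgt_i in alignment:
        if src_labels[src_i] != "O":
            tgt_labels[tgt_i] = src_labels[src_i]
    return tgt_labels

# Toy example: "He gave her the book" -> "Er gab ihr das Buch"
src_labels = ["ARG0", "V", "ARG2", "ARG1", "ARG1"]    # PropBank-style roles
alignment = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]  # monotone 1:1 alignment
print(project_roles(src_labels, alignment, tgt_len=5))
# -> ['ARG0', 'V', 'ARG2', 'ARG1', 'ARG1']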

German Parliamentary Corpus (GerParCor) Reloaded

Giuseppe Abrami, Mevlüt Bagci and Alexander Mehler. May, 2024. German Parliamentary Corpus (GerParCor) Reloaded. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 7707–7716.
BibTeX
@inproceedings{Abrami:et:al:2024,
  abstract  = {In 2022, the largest German-language corpus of parliamentary
               protocols from three different centuries, at the national and
               federal level, from Germany, Austria, Switzerland and Liechtenstein,
               was collected and published: GerParCor. GerParCor made it possible
               for the first time to provide various parliamentary protocols
               that had not been available in digital form and, moreover, could
               not be retrieved and processed in a uniform manner. In addition,
               GerParCor was preprocessed using NLP methods and made available
               in XMI format. In this paper, GerParCor is significantly updated
               by including all new parliamentary protocols in the corpus and
               by adding and preprocessing further parliamentary protocols not
               previously covered, so that the corpus now reaches back to 1797.
               Besides integrating new, state-of-the-art NLP preprocessing suited
               to large text corpora, this update also provides an overview of
               the further reuse of GerParCor by presenting various provisioning
               capabilities, such as APIs.},
  address   = {Torino, Italy},
  author    = {Abrami, Giuseppe and Bagci, Mevl{\"u}t and Mehler, Alexander},
  booktitle = {Proceedings of the 2024 Joint International Conference on Computational
               Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  editor    = {Calzolari, Nicoletta and Kan, Min-Yen and Hoste, Veronique and Lenci, Alessandro
               and Sakti, Sakriani and Xue, Nianwen},
  month     = {may},
  pages     = {7707--7716},
  publisher = {ELRA and ICCL},
  title     = {{G}erman Parliamentary Corpus ({G}er{P}ar{C}or) Reloaded},
  url       = {https://aclanthology.org/2024.lrec-main.681},
  pdf       = {https://aclanthology.org/2024.lrec-main.681.pdf},
  poster    = {https://www.texttechnologylab.org/wp-content/uploads/2024/05/GerParCor_Reloaded_Poster.pdf},
  video     = {https://www.youtube.com/watch?v=5X-w_oXOAYo},
  keywords  = {gerparcor,corpus},
  year      = {2024}
}
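
Since GerParCor is made available in XMI format, one way to inspect it from Python is the dkpro-cassis library (pip install dkpro-cassis). A minimal sketch, assuming hypothetical file names and a DKPro Core sentence type that may differ from the corpus's actual type system:

Python
from cassis import load_typesystem, load_cas_from_xmi

with open("TypeSystem.xml", "rb") as f:   # type system shipped with the corpus
    typesystem = load_typesystem(f)
with open("protocol.xmi", "rb") as f:     # one parliamentary protocol
    cas = load_cas_from_xmi(f, typesystem=typesystem)

print(cas.sofa_string[:200])              # raw document text
# Iterate over sentence annotations (type name is an assumption):
for s in cas.select("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"):
    print(s.get_covered_text())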