
Rüdiger Gleim
- 2008 – July 2019: Research assistant at Goethe University, Institute for Computer Science, Texttechnology Lab
- 2004 – 2008: Research staff member at Bielefeld University, Institute for Computational Linguistics
- 1997 – 2004: Studied Computer Science at Darmstadt University
Almost any study in corpus linguistics boils down to constructing, annotating, representing and analyzing linguistic data. The requirements on a proper database are often contradictory:
- It should scale well with ever-growing corpora such as Wikipedia, while still remaining flexible for annotation and editing.
- It should serve a broad spectrum of analyses by minimizing the need to transform data for a specific kind of analysis, while still being space-efficient.
- The data model should be able to mediate between standard formats while not becoming over-generic and difficult to handle.
- ….
Designing and developing linguistic databases has become a major topic for me. Realizing that there is no such thing as the ultimate solution, I am interested in all kinds of database management systems and paradigms, including relational, graph, distributed and NoSQL databases, as well as APIs for persistent storage.
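As a minimal sketch of the relational end of this spectrum (assuming the sqlite-jdbc driver on the classpath; the schema and all names are illustrative, not those of any lab system), tokens and annotations can be kept in two tables so that adding a new annotation layer never requires reshaping the data:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal token/annotation store. Schema and names are illustrative only;
// requires the sqlite-jdbc driver on the classpath.
public class AnnotationStore {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:sqlite:corpus.db")) {
            try (Statement st = con.createStatement()) {
                st.executeUpdate("CREATE TABLE IF NOT EXISTS token ("
                        + "id INTEGER PRIMARY KEY, doc TEXT, pos INTEGER, form TEXT)");
                // One row per annotation: a new layer needs no schema change.
                st.executeUpdate("CREATE TABLE IF NOT EXISTS annotation ("
                        + "token_id INTEGER REFERENCES token(id), layer TEXT, value TEXT)");
            }
            try (PreparedStatement ins = con.prepareStatement(
                    "INSERT INTO token(doc, pos, form) VALUES(?,?,?)")) {
                ins.setString(1, "doc1");
                ins.setInt(2, 0);
                ins.setString(3, "Lexika");
                ins.executeUpdate();
            }
            // last_insert_rowid() is per-connection in SQLite, so this
            // attaches the lemma to the token inserted above.
            try (Statement st = con.createStatement()) {
                st.executeUpdate("INSERT INTO annotation(token_id, layer, value) "
                        + "VALUES(last_insert_rowid(), 'lemma', 'lexikon')");
            }
            // Layer-agnostic retrieval: the same query serves any layer.
            try (PreparedStatement q = con.prepareStatement(
                    "SELECT t.form, a.value FROM token t "
                            + "JOIN annotation a ON a.token_id = t.id WHERE a.layer = ?")) {
                q.setString(1, "lemma");
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                    }
                }
            }
        }
    }
}
```

The narrow annotation table is one way of trading space efficiency against flexibility, which is exactly the tension described above; a column-per-layer schema would be more compact but rigid.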
Total: 46
2020 (1)
-
A. Mehler, R. Gleim, R. Gaitsch, T. Uslu, and W. Hemati, “From Topic Networks to Distributed Cognitive Maps: Zipfian Topic Universes in the Area of Volunteered Geographic Information,” Complexity, vol. 4, pp. 1-47, 2020.
[BibTeX]@Article{Mehler:Gleim:Gaitsch:Uslu:Hemati:2020, author = {Alexander Mehler and R{\"{u}}diger Gleim and Regina Gaitsch and Tolga Uslu and Wahed Hemati}, title = {From Topic Networks to Distributed Cognitive Maps: {Zipfian} Topic Universes in the Area of Volunteered Geographic Information}, journal = {Complexity}, volume = {4}, doi={10.1155/2020/4607025}, pages = {1-47}, issuetitle = {Cognitive Network Science: A New Frontier}, year = {2020}, }
2019 (2)
-
A. Mehler, T. Uslu, R. Gleim, and D. Baumartz, “text2ddc meets Literature – Ein Verfahren für die Analyse und Visualisierung thematischer Makrostrukturen,” in Proceedings of the 6th Digital Humanities Conference in the German-speaking Countries, DHd 2019, 2019.
[Poster][BibTeX]@InProceedings{Mehler:Uslu:Gleim:Baumartz:2019, Author = {Mehler, Alexander and Uslu, Tolga and Gleim, Rüdiger and Baumartz, Daniel}, Title = {{text2ddc meets Literature - Ein Verfahren für die Analyse und Visualisierung thematischer Makrostrukturen}}, BookTitle = {Proceedings of the 6th Digital Humanities Conference in the German-speaking Countries, DHd 2019}, poster = {https://www.texttechnologylab.org/wp-content/uploads/2019/04/DHD_Poster___text2ddc_meets_Literature_Poster.pdf}, Series = {DHd 2019}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2019/04/Preprint_DHd2019_text2ddc_meets_Literature.pdf}, location = {Frankfurt, Germany}, year = 2019 }
-
R. Gleim, S. Eger, A. Mehler, T. Uslu, W. Hemati, A. Lücking, A. Henlein, S. Kahlsdorf, and A. Hoenen, “A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin,” Journal of Language Modeling, 2019.
[BibTeX]@article{Gleim:Eger:Mehler:2019, author = {Gleim, R\"{u}diger and Eger, Steffen and Mehler, Alexander and Uslu, Tolga and Hemati, Wahed and L\"{u}cking, Andy and Henlein, Alexander and Kahlsdorf, Sven and Hoenen, Armin}, title = {A practitioner's view: a survey and comparison of lemmatization and morphological tagging in German and Latin}, journal = {Journal of Language Modeling}, year = {2019}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2019/07/jlm-tagging.pdf}, doi = {10.15398/jlm.v7i1.205}, url = {http://jlm.ipipan.waw.pl/index.php/JLM/article/view/205} }
2018 (6)
-
C. Driller, M. Koch, M. Schmidt, C. Weiland, T. Hörnschemeyer, T. Hickler, G. Abrami, S. Ahmed, R. Gleim, W. Hemati, T. Uslu, A. Mehler, A. Pachzelt, J. Rexhepi, T. Risse, J. Schuster, G. Kasperek, and A. Hausinger, “Workflow and Current Achievements of BIOfid, an Information Service Mobilizing Biodiversity Data from Literature Sources,” Biodiversity Information Science and Standards, vol. 2, p. e25876, 2018.
[Abstract] [BibTeX]BIOfid is a specialized information service currently being developed to mobilize biodiversity data dormant in printed historical and modern literature and to offer a platform for open access journals on the science of biodiversity. Our team of librarians, computer scientists and biologists produce high-quality text digitizations, develop new text-mining tools and generate detailed ontologies enabling semantic text analysis and semantic search by means of user-specific queries. In a pilot project we focus on German publications on the distribution and ecology of vascular plants, birds, moths and butterflies extending back to the Linnaeus period about 250 years ago. The three organism groups have been selected according to current demands of the relevant research community in Germany. The text corpus defined for this purpose comprises over 400 volumes with more than 100,000 pages to be digitized and will be complemented by journals from other digitization projects, copyright-free and project-related literature. With TextImager (Natural Language Processing & Text Visualization) and TextAnnotator (Discourse Semantic Annotation) we have already extended and launched tools that focus on the text-analytical section of our project. Furthermore, taxonomic and anatomical ontologies elaborated by us for the taxa prioritized by the project’s target group - German institutions and scientists active in biodiversity research - are constantly improved and expanded to maximize scientific data output. Our poster describes the general workflow of our project ranging from literature acquisition via software development, to data availability on the BIOfid web portal (http://biofid.de/), and the implementation into existing platforms which serve to promote global accessibility of biodiversity data.
@article{Driller:et:al:2018, author = {Christine Driller and Markus Koch and Marco Schmidt and Claus Weiland and Thomas Hörnschemeyer and Thomas Hickler and Giuseppe Abrami and Sajawel Ahmed and Rüdiger Gleim and Wahed Hemati and Tolga Uslu and Alexander Mehler and Adrian Pachzelt and Jashar Rexhepi and Thomas Risse and Janina Schuster and Gerwin Kasperek and Angela Hausinger}, title = {Workflow and Current Achievements of BIOfid, an Information Service Mobilizing Biodiversity Data from Literature Sources}, volume = {2}, number = {}, year = {2018}, doi = {10.3897/biss.2.25876}, publisher = {Pensoft Publishers}, abstract = {BIOfid is a specialized information service currently being developed to mobilize biodiversity data dormant in printed historical and modern literature and to offer a platform for open access journals on the science of biodiversity. Our team of librarians, computer scientists and biologists produce high-quality text digitizations, develop new text-mining tools and generate detailed ontologies enabling semantic text analysis and semantic search by means of user-specific queries. In a pilot project we focus on German publications on the distribution and ecology of vascular plants, birds, moths and butterflies extending back to the Linnaeus period about 250 years ago. The three organism groups have been selected according to current demands of the relevant research community in Germany. The text corpus defined for this purpose comprises over 400 volumes with more than 100,000 pages to be digitized and will be complemented by journals from other digitization projects, copyright-free and project-related literature. With TextImager (Natural Language Processing & Text Visualization) and TextAnnotator (Discourse Semantic Annotation) we have already extended and launched tools that focus on the text-analytical section of our project. Furthermore, taxonomic and anatomical ontologies elaborated by us for the taxa prioritized by the project’s target group - German institutions and scientists active in biodiversity research - are constantly improved and expanded to maximize scientific data output. Our poster describes the general workflow of our project ranging from literature acquisition via software development, to data availability on the BIOfid web portal (http://biofid.de/), and the implementation into existing platforms which serve to promote global accessibility of biodiversity data.}, issn = {}, pages = {e25876}, URL = {https://doi.org/10.3897/biss.2.25876}, eprint = {https://doi.org/10.3897/biss.2.25876}, journal = {Biodiversity Information Science and Standards} }
-
A. Mehler, W. Hemati, R. Gleim, and D. Baumartz, “VienNA: Auf dem Weg zu einer Infrastruktur für die verteilte interaktive evolutionäre Verarbeitung natürlicher Sprache,” in Forschungsinfrastrukturen und digitale Informationssysteme in der germanistischen Sprachwissenschaft, H. Lobin, R. Schneider, and A. Witt, Eds., Berlin: De Gruyter, 2018, vol. 6.
[BibTeX]@InCollection{Mehler:Hemati:Gleim:Baumartz:2018, Author = {Alexander Mehler and Wahed Hemati and Rüdiger Gleim and Daniel Baumartz}, Title = {{VienNA: }{Auf dem Weg zu einer Infrastruktur für die verteilte interaktive evolutionäre Verarbeitung natürlicher Sprache}}, BookTitle = {Forschungsinfrastrukturen und digitale Informationssysteme in der germanistischen Sprachwissenschaft }, Publisher = {De Gruyter}, Editor = {Henning Lobin and Roman Schneider and Andreas Witt}, Volume = {6}, Address = {Berlin}, year = 2018 }
-
T. Uslu, L. Miebach, S. Wolfsgruber, M. Wagner, K. Fließbach, R. Gleim, W. Hemati, A. Henlein, and A. Mehler, “Automatic Classification in Memory Clinic Patients and in Depressive Patients,” in Proceedings of Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric impairments (RaPID-2), 2018.
[BibTeX]@InProceedings{Uslu:et:al:2018:a, Author = {Tolga Uslu and Lisa Miebach and Steffen Wolfsgruber and Michael Wagner and Klaus Fließbach and Rüdiger Gleim and Wahed Hemati and Alexander Henlein and Alexander Mehler}, Title = {{Automatic Classification in Memory Clinic Patients and in Depressive Patients}}, BookTitle = {Proceedings of Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric impairments (RaPID-2)}, Series = {RaPID}, location = {Miyazaki, Japan}, year = 2018 }
-
A. Mehler, R. Gleim, A. Lücking, T. Uslu, and C. Stegbauer, “On the Self-similarity of Wikipedia Talks: a Combined Discourse-analytical and Quantitative Approach,” Glottometrics, vol. 40, pp. 1-44, 2018.
[BibTeX]@Article{Mehler:Gleim:Luecking:Uslu:Stegbauer:2018, Author = {Alexander Mehler and Rüdiger Gleim and Andy Lücking and Tolga Uslu and Christian Stegbauer}, Title = {On the Self-similarity of {Wikipedia} Talks: a Combined Discourse-analytical and Quantitative Approach}, Journal = {Glottometrics}, Volume = {40}, Pages = {1-44}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2018/03/Glottometrics-Mehler.pdf}, year = 2018 }
-
R. Gleim, A. Mehler, and S. Y. Song, “WikiDragon: A Java Framework For Diachronic Content And Network Analysis Of MediaWikis,” in Proceedings of the 11th edition of the Language Resources and Evaluation Conference, May 7 – 12, Miyazaki, Japan, 2018.
[BibTeX]@InProceedings{Gleim:Mehler:Song:2018, Author = {R{\"u}diger Gleim and Alexander Mehler and Sung Y. Song}, Title = {WikiDragon: A Java Framework For Diachronic Content And Network Analysis Of MediaWikis}, BookTitle = {Proceedings of the 11th edition of the Language Resources and Evaluation Conference, May 7 - 12}, Series = {LREC 2018}, Address = {Miyazaki, Japan}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2018/03/WikiDragon.pdf}, year = 2018 }
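On the input side, frameworks of this kind start from MediaWiki XML dumps, which are large files best processed in a streaming fashion. The following generic StAX sketch prints all page titles from a dump; it only illustrates the dump format and deliberately does not use the WikiDragon API:

```java
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Streams a MediaWiki XML dump and prints each <title> element.
// Generic StAX code; not the WikiDragon API.
public class DumpTitles {
    public static void main(String[] args) throws Exception {
        XMLInputFactory f = XMLInputFactory.newInstance();
        XMLStreamReader r = f.createXMLStreamReader(new FileInputStream(args[0]));
        boolean inTitle = false;
        StringBuilder title = new StringBuilder();
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    inTitle = "title".equals(r.getLocalName());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (inTitle) title.append(r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("title".equals(r.getLocalName())) {
                        System.out.println(title);
                        title.setLength(0);
                        inTitle = false;
                    }
                    break;
            }
        }
        r.close();
    }
}
```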
-
G. Abrami, S. Ahmed, R. Gleim, W. Hemati, A. Mehler, and T. Uslu, Natural Language Processing and Text Mining for BIOfid, 2018.
[BibTeX]@misc{Abrami:et:al:2018b, author = {Abrami, Giuseppe and Ahmed, Sajawel and Gleim, R{\"u}diger and Hemati, Wahed and Mehler, Alexander and Uslu, Tolga}, title = {{Natural Language Processing and Text Mining for BIOfid}}, howpublished = {Presentation at the 1st Meeting of the Scientific Advisory Board of the BIOfid Project}, address = {Goethe-University, Frankfurt am Main, Germany}, year = {2018}, month = {March}, day = {08} }
2017 (1)
-
A. Mehler, R. Gleim, W. Hemati, and T. Uslu, “Skalenfreie online soziale Lexika am Beispiel von Wiktionary,” in Proceedings of 53rd Annual Conference of the Institut für Deutsche Sprache (IDS), March 14-16, Mannheim, Germany, Berlin, 2017. In German; the title translates as: Scale-free Online Social Lexica by the Example of Wiktionary.
[Abstract] [BibTeX]In English: The paper deals with characteristics of the structural, thematic and participatory dynamics of collaboratively generated lexical networks. This is done by example of Wiktionary. Starting from a network-theoretical model in terms of so-called multi-layer networks, we describe Wiktionary as a scale-free lexicon. Systems of this sort are characterized by the fact that their content-related dynamics is determined by the underlying dynamics of collaborating authors. This happens in a way that social structure imprints on content structure. According to this conception, the unequal distribution of the activities of authors results in a correspondingly unequal distribution of the information units documented within the lexicon. The paper focuses on foundations for describing such systems starting from a parameter space which requires to deal with Wiktionary as an issue in big data analysis. In German: Der Beitrag thematisiert Eigenschaften der strukturellen, thematischen und partizipativen Dynamik kollaborativ erzeugter lexikalischer Netzwerke am Beispiel von Wiktionary. Ausgehend von einem netzwerktheoretischen Modell in Form so genannter Mehrebenennetzwerke wird Wiktionary als ein skalenfreies Lexikon beschrieben. Systeme dieser Art zeichnen sich dadurch aus, dass ihre inhaltliche Dynamik durch die zugrundeliegende Kollaborationsdynamik bestimmt wird, und zwar so, dass sich die soziale Struktur der entsprechenden inhaltlichen Struktur aufprägt. Dieser Auffassung gemäß führt die Ungleichverteilung der Aktivitäten von Lexikonproduzenten zu einer analogen Ungleichverteilung der im Lexikon dokumentierten Informationseinheiten. Der Beitrag thematisiert Grundlagen zur Beschreibung solcher Systeme ausgehend von einem Parameterraum, welcher die netzwerkanalytische Betrachtung von Wiktionary als Big-Data-Problem darstellt.
@InProceedings{Mehler:Gleim:Hemati:Uslu:2017, Author = {Alexander Mehler and Rüdiger Gleim and Wahed Hemati and Tolga Uslu}, Title = {{Skalenfreie online soziale Lexika am Beispiel von Wiktionary}}, BookTitle = {Proceedings of 53rd Annual Conference of the Institut für Deutsche Sprache (IDS), March 14-16, Mannheim, Germany}, Editor = {Stefan Engelberg and Henning Lobin and Kathrin Steyer and Sascha Wolfer}, Address = {Berlin}, Publisher = {De Gruyter}, Note = {In German. Title translates into: Scale-free online-social Lexika by Example of Wiktionary}, abstract = {In English: The paper deals with characteristics of the structural, thematic and participatory dynamics of collaboratively generated lexical networks. This is done by example of Wiktionary. Starting from a network-theoretical model in terms of so-called multi-layer networks, we describe Wiktionary as a scale-free lexicon. Systems of this sort are characterized by the fact that their content-related dynamics is determined by the underlying dynamics of collaborating authors. This happens in a way that social structure imprints on content structure. According to this conception, the unequal distribution of the activities of authors results in a correspondingly unequal distribution of the information units documented within the lexicon. The paper focuses on foundations for describing such systems starting from a parameter space which requires to deal with Wiktionary as an issue in big data analysis. In German: Der Beitrag thematisiert Eigenschaften der strukturellen, thematischen und partizipativen Dynamik kollaborativ erzeugter lexikalischer Netzwerke am Beispiel von Wiktionary. Ausgehend von einem netzwerktheoretischen Modell in Form so genannter Mehrebenennetzwerke wird Wiktionary als ein skalenfreies Lexikon beschrieben. Systeme dieser Art zeichnen sich dadurch aus, dass ihre inhaltliche Dynamik durch die zugrundeliegende Kollaborationsdynamik bestimmt wird, und zwar so, dass sich die soziale Struktur der entsprechenden inhaltlichen Struktur aufprägt. Dieser Auffassung gemäß führt die Ungleichverteilung der Aktivitäten von Lexikonproduzenten zu einer analogen Ungleichverteilung der im Lexikon dokumentierten Informationseinheiten. Der Beitrag thematisiert Grundlagen zur Beschreibung solcher Systeme ausgehend von einem Parameterraum, welcher die netzwerkanalytische Betrachtung von Wiktionary als Big-Data-Problem darstellt.}, year = 2017 }
2016 (3)
-
A. Mehler, B. Wagner, and R. Gleim, “Wikidition: Towards A Multi-layer Network Model of Intertextuality,” in Proceedings of DH 2016, 12-16 July, 2016.
[Abstract] [BibTeX]The paper presents Wikidition, a novel text mining tool for generating online editions of text corpora. It explores lexical, sentential and textual relations to span multi-layer networks (linkification) that allow for browsing syntagmatic and paradigmatic relations among the constituents of its input texts. In this way, relations of text reuse can be explored together with lexical relations within the same literary memory information system. Beyond that, Wikidition contains a module for automatic lexiconisation to extract author specific vocabularies. Based on linkification and lexiconisation, Wikidition does not only allow for traversing input corpora on different (lexical, sentential and textual) levels. Rather, its readers can also study the vocabulary of authors on several levels of resolution including superlemmas, lemmas, syntactic words and wordforms. We exemplify Wikidition by a range of literary texts and evaluate it by means of the apparatus of quantitative network analysis.
@InProceedings{Mehler:Wagner:Gleim:2016, Author = {Mehler, Alexander and Wagner, Benno and Gleim, R\"{u}diger}, Title = {Wikidition: Towards A Multi-layer Network Model of Intertextuality}, BookTitle = {Proceedings of DH 2016, 12-16 July}, Series = {DH 2016}, abstract = {The paper presents Wikidition, a novel text mining tool for generating online editions of text corpora. It explores lexical, sentential and textual relations to span multi-layer networks (linkification) that allow for browsing syntagmatic and paradigmatic relations among the constituents of its input texts. In this way, relations of text reuse can be explored together with lexical relations within the same literary memory information system. Beyond that, Wikidition contains a module for automatic lexiconisation to extract author specific vocabularies. Based on linkification and lexiconisation, Wikidition does not only allow for traversing input corpora on different (lexical, sentential and textual) levels. Rather, its readers can also study the vocabulary of authors on several levels of resolution including superlemmas, lemmas, syntactic words and wordforms. We exemplify Wikidition by a range of literary texts and evaluate it by means of the apparatus of quantitative network analysis.}, location = {Kraków}, url = {http://dh2016.adho.org/abstracts/250}, year = 2016 }
-
S. Eger, R. Gleim, and A. Mehler, “Lemmatization and Morphological Tagging in German and Latin: A comparison and a survey of the state-of-the-art,” in Proceedings of the 10th International Conference on Language Resources and Evaluation, 2016.
[BibTeX]@InProceedings{Eger:Mehler:Gleim:2016, Author = {Eger, Steffen and Gleim, R\"{u}diger and Mehler, Alexander}, Title = {Lemmatization and Morphological Tagging in {German} and {Latin}: A comparison and a survey of the state-of-the-art}, BookTitle = {Proceedings of the 10th International Conference on Language Resources and Evaluation}, Series = {LREC 2016}, location = {Portoro\v{z} (Slovenia)}, pdf = {http://www.texttechnologylab.org/wp-content/uploads/2016/04/lrec_eger_gleim_mehler.pdf}, year = 2016 }
-
A. Mehler, R. Gleim, T. vor der Brück, W. Hemati, T. Uslu, and S. Eger, “Wikidition: Automatic Lexiconization and Linkification of Text Corpora,” Information Technology, vol. 58, pp. 70-79, 2016.
[Abstract] [BibTeX]We introduce a new text technology, called Wikidition, which automatically generates large scale editions of corpora of natural language texts. Wikidition combines a wide range of text mining tools for automatically linking lexical, sentential and textual units. This includes the extraction of corpus-specific lexica down to the level of syntactic words and their grammatical categories. To this end, we introduce a novel measure of text reuse and exemplify Wikidition by means of the capitularies, that is, a corpus of Medieval Latin texts.
@Article{Mehler:et:al:2016, Author = {Alexander Mehler and Rüdiger Gleim and Tim vor der Brück and Wahed Hemati and Tolga Uslu and Steffen Eger}, Title = {Wikidition: Automatic Lexiconization and Linkification of Text Corpora}, Journal = {Information Technology}, Volume = {58}, Pages = {70-79}, abstract = {We introduce a new text technology, called Wikidition, which automatically generates large scale editions of corpora of natural language texts. Wikidition combines a wide range of text mining tools for automatically linking lexical, sentential and textual units. This includes the extraction of corpus-specific lexica down to the level of syntactic words and their grammatical categories. To this end, we introduce a novel measure of text reuse and exemplify Wikidition by means of the capitularies, that is, a corpus of Medieval Latin texts.}, doi = {10.1515/itit-2015-0035}, year = 2016 }
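The paper's own measure of text reuse is not reproduced here; as a generic stand-in, the sketch below scores two sentences by character-trigram Jaccard overlap, a common baseline for detecting reuse between text units:

```java
import java.util.HashSet;
import java.util.Set;

// Generic character-trigram Jaccard similarity as a stand-in for a
// text-reuse measure; NOT the measure introduced in the paper.
public class NgramOverlap {
    static Set<String> trigrams(String s) {
        Set<String> grams = new HashSet<>();
        String t = s.toLowerCase().replaceAll("\\s+", " ");
        for (int i = 0; i + 3 <= t.length(); i++) {
            grams.add(t.substring(i, i + 3));
        }
        return grams;
    }

    static double jaccard(String a, String b) {
        Set<String> inter = new HashSet<>(trigrams(a));
        Set<String> union = new HashSet<>(trigrams(a));
        inter.retainAll(trigrams(b));
        union.addAll(trigrams(b));
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    public static void main(String[] args) {
        // Two near-duplicate Latin clauses: high overlap signals reuse.
        String a = "Omnis homo primum bonum vinum ponit";
        String b = "Omnis homo primum vinum bonum ponit";
        System.out.printf("Jaccard(trigram) = %.3f%n", jaccard(a, b));
    }
}
```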
2015 (3)
-
A. Mehler and R. Gleim, “Linguistic Networks — An Online Platform for Deriving Collocation Networks from Natural Language Texts,” in Towards a Theoretical Framework for Analyzing Complex Linguistic Networks, A. Mehler, A. Lücking, S. Banisch, P. Blanchard, and B. Frank-Job, Eds., Springer, 2015.
[BibTeX]@InCollection{Mehler:Gleim:2015:a, Author = {Mehler, Alexander and Gleim, Rüdiger}, Title = {Linguistic Networks -- An Online Platform for Deriving Collocation Networks from Natural Language Texts}, BookTitle = {Towards a Theoretical Framework for Analyzing Complex Linguistic Networks}, Publisher = {Springer}, Editor = {Mehler, Alexander and Lücking, Andy and Banisch, Sven and Blanchard, Philippe and Frank-Job, Barbara}, Series = {Understanding Complex Systems}, year = 2015 }
-
A. Mehler, T. vor der Brück, R. Gleim, and T. Geelhaar, “Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts using the TTLab Latin Tagger,” in Text Mining: From Ontology Learning to Automated text Processing Applications, C. Biemann and A. Mehler, Eds., Berlin/New York: Springer, 2015, pp. 87-112.
[Abstract] [BibTeX]The analysis of longitudinal corpora of historical texts requires the integrated development of tools for automatically preprocessing these texts and for building representation models of their genre- and register-related dynamics. In this chapter we present such a joint endeavor that ranges from resource formation via preprocessing to network-based text representation and classification. We start with presenting the so-called TTLab Latin Tagger (TLT) that preprocesses texts of classical and medieval Latin. Its lexical resource in the form of the Frankfurt Latin Lexicon (FLL) is also briefly introduced. As a first test case for showing the expressiveness of these resources, we perform a tripartite classification task of authorship attribution, genre detection and a combination thereof. To this end, we introduce a novel text representation model that explores the core structure (the so-called coreness) of lexical network representations of texts. Our experiment shows the expressiveness of this representation format and mediately of our Latin preprocessor.
@InCollection{Mehler:Brueck:Gleim:Geelhaar:2015, Author = {Mehler, Alexander and vor der Brück, Tim and Gleim, Rüdiger and Geelhaar, Tim}, Title = {Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts using the TTLab Latin Tagger}, BookTitle = {Text Mining: From Ontology Learning to Automated text Processing Applications}, Publisher = {Springer}, Editor = {Chris Biemann and Alexander Mehler}, Series = {Theory and Applications of Natural Language Processing}, Pages = {87-112}, Address = {Berlin/New York}, abstract = {The analysis of longitudinal corpora of historical texts requires the integrated development of tools for automatically preprocessing these texts and for building representation models of their genre- and register-related dynamics. In this chapter we present such a joint endeavor that ranges from resource formation via preprocessing to network-based text representation and classification. We start with presenting the so-called TTLab Latin Tagger (TLT) that preprocesses texts of classical and medieval Latin. Its lexical resource in the form of the Frankfurt Latin Lexicon (FLL) is also briefly introduced. As a first test case for showing the expressiveness of these resources, we perform a tripartite classification task of authorship attribution, genre detection and a combination thereof. To this end, we introduce a novel text representation model that explores the core structure (the so-called coreness) of lexical network representations of texts. Our experiment shows the expressiveness of this representation format and mediately of our Latin preprocessor.}, website = {http://link.springer.com/chapter/10.1007/978-3-319-12655-5_5}, year = 2015 }
-
R. Gleim and A. Mehler, “TTLab Preprocessor – Eine generische Web-Anwendung für die Vorverarbeitung von Texten und deren Evaluation,” in Accepted in the Proceedings of the Jahrestagung der Digital Humanities im deutschsprachigen Raum, 2015.
[BibTeX]@InProceedings{Gleim:Mehler:2015, Author = {Gleim, Rüdiger and Mehler, Alexander}, Title = {TTLab Preprocessor – Eine generische Web-Anwendung für die Vorverarbeitung von Texten und deren Evaluation}, BookTitle = {Accepted in the Proceedings of the Jahrestagung der Digital Humanities im deutschsprachigen Raum}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/Gleim_Mehler_PrePro_DHGraz2015.pdf}, year = 2015 }
2013 (1)
-
A. Mehler, C. Stegbauer, and R. Gleim, “Zur Struktur und Dynamik der kollaborativen Plagiatsdokumentation am Beispiel des GuttenPlag Wiki: eine Vorstudie,” in Die Dynamik sozialer und sprachlicher Netzwerke. Konzepte, Methoden und empirische Untersuchungen am Beispiel des WWW, B. Frank-Job, A. Mehler, and T. Sutter, Eds., Wiesbaden: VS Verlag, 2013.
[BibTeX]@InCollection{Mehler:Stegbauer:Gleim:2013, Author = {Mehler, Alexander and Stegbauer, Christian and Gleim, Rüdiger}, Title = {Zur Struktur und Dynamik der kollaborativen Plagiatsdokumentation am Beispiel des GuttenPlag Wiki: eine Vorstudie}, BookTitle = {Die Dynamik sozialer und sprachlicher Netzwerke. Konzepte, Methoden und empirische Untersuchungen am Beispiel des WWW}, Publisher = {VS Verlag}, Editor = {Frank-Job, Barbara and Mehler, Alexander and Sutter, Tilman}, Address = {Wiesbaden}, year = 2013 }
2012 (3)
-
A. Mehler, C. Stegbauer, and R. Gleim, “Latent Barriers in Wiki-based Collaborative Writing,” in Proceedings of the Wikipedia Academy: Research and Free Knowledge. June 29 – July 1 2012, Berlin, 2012.
[BibTeX]@InProceedings{Mehler:Stegbauer:Gleim:2012:b, Author = {Mehler, Alexander and Stegbauer, Christian and Gleim, Rüdiger}, Title = {Latent Barriers in Wiki-based Collaborative Writing}, BookTitle = {Proceedings of the Wikipedia Academy: Research and Free Knowledge. June 29 - July 1 2012}, Address = {Berlin}, month = {July}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/12_Paper_Alexander_Mehler_Christian_Stegbauer_Ruediger_Gleim.pdf}, year = 2012 }
-
R. Gleim, A. Mehler, and A. Ernst, “SOA implementation of the eHumanities Desktop,” in Proceedings of the Workshop on Service-oriented Architectures (SOAs) for the Humanities: Solutions and Impacts, Digital Humanities 2012, Hamburg, Germany, 2012.
[Abstract] [BibTeX]The eHumanities Desktop is a system which allows users to upload, organize and share resources using a web interface. Furthermore resources can be processed, annotated and analyzed in various ways. Registered users can organize themselves in groups and collaboratively work on their data. The eHumanities Desktop is platform independent and runs in a web browser. This paper presents the system focusing on its service orientation and process management.
@InProceedings{Gleim:Mehler:Ernst:2012, Author = {Gleim, Rüdiger and Mehler, Alexander and Ernst, Alexandra}, Title = {SOA implementation of the eHumanities Desktop}, BookTitle = {Proceedings of the Workshop on Service-oriented Architectures (SOAs) for the Humanities: Solutions and Impacts, Digital Humanities 2012, Hamburg, Germany}, abstract = {The eHumanities Desktop is a system which allows users to upload, organize and share resources using a web interface. Furthermore resources can be processed, annotated and analyzed in various ways. Registered users can organize themselves in groups and collaboratively work on their data. The eHumanities Desktop is platform independent and runs in a web browser. This paper presents the system focusing on its service orientation and process management.}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/dhc2012.pdf}, year = 2012 }
-
A. Mehler, S. Schwandt, R. Gleim, and A. Ernst, “Inducing Linguistic Networks from Historical Corpora: Towards a New Method in Historical Semantics,” in Proceedings of the Conference on New Methods in Historical Corpora, P. Bennett, M. Durrell, S. Scheible, and R. J. Whitt, Eds., Tübingen: Narr, 2012, vol. 3, pp. 257-274.
[BibTeX]@InCollection{Mehler:Schwandt:Gleim:Ernst:2012, Author = {Mehler, Alexander and Schwandt, Silke and Gleim, Rüdiger and Ernst, Alexandra}, Title = {Inducing Linguistic Networks from Historical Corpora: Towards a New Method in Historical Semantics}, BookTitle = {Proceedings of the Conference on New Methods in Historical Corpora}, Publisher = {Narr}, Editor = {Paul Bennett and Martin Durrell and Silke Scheible and Richard J. Whitt}, Volume = {3}, Series = {Corpus linguistics and Interdisciplinary perspectives on language (CLIP)}, Pages = {257--274}, Address = {Tübingen}, year = 2012 }
2011 (3)
-
A. Mehler, N. Diewald, U. Waltinger, R. Gleim, D. Esch, B. Job, T. Küchelmann, O. Abramov, and P. Blanchard, “Evolution of Romance Language in Written Communication: Network Analysis of Late Latin and Early Romance Corpora,” Leonardo, vol. 44, iss. 3, 2011.
[BibTeX]@Article{Mehler:Diewald:Waltinger:et:al:2010, Author = {Mehler, Alexander and Diewald, Nils and Waltinger, Ulli and Gleim, Rüdiger and Esch, Dietmar and Job, Barbara and Küchelmann, Thomas and Abramov, Olga and Blanchard, Philippe}, Title = {Evolution of Romance Language in Written Communication: Network Analysis of Late Latin and Early Romance Corpora}, Journal = {Leonardo}, Volume = {44}, Number = {3}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/mehler_diewald_waltinger_gleim_esch_job_kuechelmann_pustylnikov_blanchard_2010.pdf}, publisher = {MIT Press}, year = 2011 }
-
R. Gleim, A. Hoenen, N. Diewald, A. Mehler, and A. Ernst, “Modeling, Building and Maintaining Lexica for Corpus Linguistic Studies by Example of Late Latin,” in Corpus Linguistics 2011, 20-22 July, Birmingham, 2011.
[BibTeX]@InProceedings{Gleim:Hoenen:Diewald:Mehler:Ernst:2011, Author = {Gleim, Rüdiger and Hoenen, Armin and Diewald, Nils and Mehler, Alexander and Ernst, Alexandra}, Title = {Modeling, Building and Maintaining Lexica for Corpus Linguistic Studies by Example of Late Latin}, BookTitle = {Corpus Linguistics 2011, 20-22 July, Birmingham}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/Paper-48.pdf}, year = 2011 }
-
A. Mehler, S. Schwandt, R. Gleim, and B. Jussen, “Der eHumanities Desktop als Werkzeug in der historischen Semantik: Funktionsspektrum und Einsatzszenarien,” Journal for Language Technology and Computational Linguistics (JLCL), vol. 26, iss. 1, pp. 97-117, 2011.
[Abstract] [BibTeX]Die Digital Humanities bzw. die Computational Humanities entwickeln sich zu eigenständigen Disziplinen an der Nahtstelle von Geisteswissenschaft und Informatik. Diese Entwicklung betrifft zunehmend auch die Lehre im Bereich der geisteswissenschaftlichen Fachinformatik. In diesem Beitrag thematisieren wir den eHumanities Desktop als ein Werkzeug für diesen Bereich der Lehre. Dabei geht es genauer um einen Brückenschlag zwischen Geschichtswissenschaft und Informatik: Am Beispiel der historischen Semantik stellen wir drei Lehrszenarien vor, in denen der eHumanities Desktop in der geschichtswissenschaftlichen Lehre zum Einsatz kommt. Der Beitrag schliesst mit einer Anforderungsanalyse an zukünftige Entwicklungen in diesem Bereich.
@Article{Mehler:Schwandt:Gleim:Jussen:2011, Author = {Mehler, Alexander and Schwandt, Silke and Gleim, Rüdiger and Jussen, Bernhard}, Title = {Der eHumanities Desktop als Werkzeug in der historischen Semantik: Funktionsspektrum und Einsatzszenarien}, Journal = {Journal for Language Technology and Computational Linguistics (JLCL)}, Volume = {26}, Number = {1}, Pages = {97-117}, abstract = {Die Digital Humanities bzw. die Computational Humanities entwickeln sich zu eigenst{\"a}ndigen Disziplinen an der Nahtstelle von Geisteswissenschaft und Informatik. Diese Entwicklung betrifft zunehmend auch die Lehre im Bereich der geisteswissenschaftlichen Fachinformatik. In diesem Beitrag thematisieren wir den eHumanities Desktop als ein Werkzeug für diesen Bereich der Lehre. Dabei geht es genauer um einen Brückenschlag zwischen Geschichtswissenschaft und Informatik: Am Beispiel der historischen Semantik stellen wir drei Lehrszenarien vor, in denen der eHumanities Desktop in der geschichtswissenschaftlichen Lehre zum Einsatz kommt. Der Beitrag schliesst mit einer Anforderungsanalyse an zukünftige Entwicklungen in diesem Bereich.}, pdf = {http://media.dwds.de/jlcl/2011_Heft1/8.pdf }, year = 2011 }
2010 (3)
-
R. Gleim and A. Mehler, “Computational Linguistics for Mere Mortals – Powerful but Easy-to-use Linguistic Processing for Scientists in the Humanities,” in Proceedings of LREC 2010, Malta, 2010.
[Abstract] [BibTeX]Delivering linguistic resources and easy-to-use methods to a broad public in the humanities is a challenging task. On the one hand users rightly demand easy to use interfaces but on the other hand want to have access to the full flexibility and power of the functions being offered. Even though a growing number of excellent systems exist which offer convenient means to use linguistic resources and methods, they usually focus on a specific domain, as for example corpus exploration or text categorization. Architectures which address a broad scope of applications are still rare. This article introduces the eHumanities Desktop, an online system for corpus management, processing and analysis which aims at bridging the gap between powerful command line tools and intuitive user interfaces.
@InProceedings{Gleim:Mehler:2010:b, Author = {Gleim, Rüdiger and Mehler, Alexander}, Title = {Computational Linguistics for Mere Mortals – Powerful but Easy-to-use Linguistic Processing for Scientists in the Humanities}, BookTitle = {Proceedings of LREC 2010}, Address = {Malta}, Publisher = {ELDA}, abstract = {Delivering linguistic resources and easy-to-use methods to a broad public in the humanities is a challenging task. On the one hand users rightly demand easy to use interfaces but on the other hand want to have access to the full flexibility and power of the functions being offered. Even though a growing number of excellent systems exist which offer convenient means to use linguistic resources and methods, they usually focus on a specific domain, as for example corpus exploration or text categorization. Architectures which address a broad scope of applications are still rare. This article introduces the eHumanities Desktop, an online system for corpus management, processing and analysis which aims at bridging the gap between powerful command line tools and intuitive user interfaces. }, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_mehler_2010.pdf}, year = 2010 }
-
A. Mehler, R. Gleim, U. Waltinger, and N. Diewald, “Time Series of Linguistic Networks by Example of the Patrologia Latina,” in Proceedings of INFORMATIK 2010: Service Science, September 27 – October 01, 2010, Leipzig, 2010, pp. 609-616.
[BibTeX]@InProceedings{Mehler:Gleim:Waltinger:Diewald:2010, Author = {Mehler, Alexander and Gleim, Rüdiger and Waltinger, Ulli and Diewald, Nils}, Title = {Time Series of Linguistic Networks by Example of the Patrologia Latina}, BookTitle = {Proceedings of INFORMATIK 2010: Service Science, September 27 - October 01, 2010, Leipzig}, Editor = {F{\"a}hnrich, Klaus-Peter and Franczyk, Bogdan}, Volume = {2}, Series = {Lecture Notes in Informatics}, Pages = {609-616}, Publisher = {GI}, pdf = {http://subs.emis.de/LNI/Proceedings/Proceedings176/586.pdf}, year = 2010 }
-
R. Gleim, P. Warner, and A. Mehler, “eHumanities Desktop – An Architecture for Flexible Annotation in Iconographic Research,” in Proceedings of the 6th International Conference on Web Information Systems and Technologies (WEBIST ’10), April 7-10, 2010, Valencia, 2010.
[BibTeX]@InProceedings{Gleim:Warner:Mehler:2010, Author = {Gleim, Rüdiger and Warner, Paul and Mehler, Alexander}, Title = {eHumanities Desktop - An Architecture for Flexible Annotation in Iconographic Research}, BookTitle = {Proceedings of the 6th International Conference on Web Information Systems and Technologies (WEBIST '10), April 7-10, 2010, Valencia}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_warner_mehler_2010.pdf}, website = {https://www.researchgate.net/publication/220724277_eHumanities_Desktop_-_An_Architecture_for_Flexible_Annotation_in_Iconographic_Research}, year = 2010 }
2009 (4)
-
A. Mehler, R. Gleim, U. Waltinger, A. Ernst, D. Esch, and T. Feith, “eHumanities Desktop – eine webbasierte Arbeitsumgebung für die geisteswissenschaftliche Fachinformatik,” in Proceedings of the Symposium “Sprachtechnologie und eHumanities”, 26.–27. Februar, Duisburg-Essen University, 2009.
[BibTeX]@InProceedings{Mehler:Gleim:Waltinger:Ernst:Esch:Feith:2009, Author = {Mehler, Alexander and Gleim, Rüdiger and Waltinger, Ulli and Ernst, Alexandra and Esch, Dietmar and Feith, Tobias}, Title = {eHumanities Desktop – eine webbasierte Arbeitsumgebung für die geisteswissenschaftliche Fachinformatik}, BookTitle = {Proceedings of the Symposium "Sprachtechnologie und eHumanities", 26.–27. Februar, Duisburg-Essen University}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/mehler_gleim_waltinger_ernst_esch_feith_2009.pdf}, website = {http://duepublico.uni-duisburg-essen.de/servlets/DocumentServlet?id=37041}, year = 2009 }
-
R. Gleim, A. Mehler, U. Waltinger, and P. Menke, “eHumanities Desktop – An extensible Online System for Corpus Management and Analysis,” in 5th Corpus Linguistics Conference, University of Liverpool, 2009.
[Abstract] [BibTeX]This paper presents the eHumanities Desktop - an online system for corpus management and analysis in support of computing in the humanities. Design issues and the overall architecture are described, as well as an outline of the applications offered by the system.
@InProceedings{Gleim:Mehler:Waltinger:Menke:2009, Author = {Gleim, Rüdiger and Mehler, Alexander and Waltinger, Ulli and Menke, Peter}, Title = {eHumanities Desktop – An extensible Online System for Corpus Management and Analysis}, BookTitle = {5th Corpus Linguistics Conference, University of Liverpool}, abstract = {This paper presents the eHumanities Desktop - an online system for corpus management and analysis in support of computing in the humanities. Design issues and the overall architecture are described, as well as an outline of the applications offered by the system.}, pdf = {http://www.ulliwaltinger.de/pdf/eHumanitiesDesktop-AnExtensibleOnlineSystem-CL2009.pdf}, website = {http://www.ulliwaltinger.de/ehumanities-desktop-an-extensible-online-system-for-corpus-management-and-analysis/}, year = 2009 }
-
R. Gleim, U. Waltinger, A. Ernst, A. Mehler, D. Esch, and T. Feith, “The eHumanities Desktop – An Online System for Corpus Management and Analysis in Support of Computing in the Humanities,” in Proceedings of the Demonstrations Session of the 12th Conference of the European Chapter of the Association for Computational Linguistics EACL 2009, 30 March – 3 April, Athens, 2009.
[BibTeX]@InProceedings{Gleim:Waltinger:Ernst:Mehler:Esch:Feith:2009, Author = {Gleim, Rüdiger and Waltinger, Ulli and Ernst, Alexandra and Mehler, Alexander and Esch, Dietmar and Feith, Tobias}, Title = {The eHumanities Desktop – An Online System for Corpus Management and Analysis in Support of Computing in the Humanities}, BookTitle = {Proceedings of the Demonstrations Session of the 12th Conference of the European Chapter of the Association for Computational Linguistics EACL 2009, 30 March – 3 April, Athens}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_waltinger_ernst_mehler_esch_feith_2009.pdf}, year = 2009 }
-
U. Waltinger, A. Mehler, and R. Gleim, “Social Semantics And Its Evaluation By Means of Closed Topic Models: An SVM-Classification Approach Using Semantic Feature Replacement By Topic Generalization,” in Proceedings of the Biennial GSCL Conference 2009, September 30 – October 2, Universität Potsdam, 2009.
[BibTeX]@InProceedings{Waltinger:Mehler:Gleim:2009:a, Author = {Waltinger, Ulli and Mehler, Alexander and Gleim, Rüdiger}, Title = {Social Semantics And Its Evaluation By Means of Closed Topic Models: An SVM-Classification Approach Using Semantic Feature Replacement By Topic Generalization}, BookTitle = {Proceedings of the Biennial GSCL Conference 2009, September 30 – October 2, Universit{\"a}t Potsdam}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/GSCL_2009_WaltingerMehlerGleim_camera_ready.pdf}, year = 2009 }
2008 (3)
-
O. Abramov, A. Mehler, and R. Gleim, “A Unified Database of Dependency Treebanks. Integrating, Quantifying and Evaluating Dependency Data,” in Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco), 2008.
[Abstract] [BibTeX]This paper describes a database of 11 dependency treebanks which were unified by means of a two-dimensional graph format. The format was evaluated with respect to storage-complexity on the one hand, and efficiency of data access on the other hand. An example of how the treebanks can be integrated within a unique interface is given by means of the DTDB interface.
@InProceedings{Pustylnikov:Mehler:Gleim:2008, Author = {Abramov, Olga and Mehler, Alexander and Gleim, Rüdiger}, Title = {A Unified Database of Dependency Treebanks. Integrating, Quantifying and Evaluating Dependency Data}, BookTitle = {Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco)}, abstract = {This paper describes a database of 11 dependency treebanks which were unified by means of a two-dimensional graph format. The format was evaluated with respect to storage-complexity on the one hand, and efficiency of data access on the other hand. An example of how the treebanks can be integrated within a unique interface is given by means of the DTDB interface. }, pdf = {http://wwwhomes.uni-bielefeld.de/opustylnikov/pustylnikov/pdfs/LREC08_full.pdf}, year = 2008 }
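The unified two-dimensional graph format itself is specified in the paper and not reproduced here; as a hedged illustration of what unified dependency data minimally comprises, a tree can be held as parallel lists of forms, head indices and relation labels:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal dependency-tree representation; illustrative only, not the
// two-dimensional graph format described in the paper.
public class DepTree {
    final List<String> forms = new ArrayList<>();   // token forms, 0-based
    final List<Integer> heads = new ArrayList<>();  // head index, -1 = root
    final List<String> rels = new ArrayList<>();    // dependency relations

    void addToken(String form, int head, String rel) {
        forms.add(form);
        heads.add(head);
        rels.add(rel);
    }

    public static void main(String[] args) {
        DepTree t = new DepTree();
        t.addToken("loves", -1, "root");
        t.addToken("John", 0, "nsubj");
        t.addToken("Mary", 0, "obj");
        for (int i = 0; i < t.forms.size(); i++) {
            System.out.println(t.forms.get(i) + " <-" + t.rels.get(i) + "- "
                    + (t.heads.get(i) < 0 ? "ROOT" : t.forms.get(t.heads.get(i))));
        }
    }
}
```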
-
A. Mehler, R. Gleim, A. Ernst, and U. Waltinger, “WikiDB: Building Interoperable Wiki-Based Knowledge Resources for Semantic Databases,” Sprache und Datenverarbeitung. International Journal for Language Data Processing, vol. 32, iss. 1, pp. 47-70, 2008.
[Abstract] [BibTeX]This article describes an API for exploring the logical document and the logical network structure of wikis. It introduces an algorithm for the semantic preprocessing, filtering and typing of these building blocks. Further, this article models the process of wiki generation based on a unified format of syntactic, semantic and pragmatic representations. This three-level approach to make accessible syntactic, semantic and pragmatic aspects of wiki-based structure formation is complemented by a corresponding database model – called WikiDB – and an API operating thereon. Finally, the article provides an empirical study of using the three-fold representation format in conjunction with WikiDB.
@Article{Mehler:Gleim:Ernst:Waltinger:2008, Author = {Mehler, Alexander and Gleim, Rüdiger and Ernst, Alexandra and Waltinger, Ulli}, Title = {WikiDB: Building Interoperable Wiki-Based Knowledge Resources for Semantic Databases}, Journal = {Sprache und Datenverarbeitung. International Journal for Language Data Processing}, Volume = {32}, Number = {1}, Pages = {47-70}, abstract = {This article describes an API for exploring the logical document and the logical network structure of wikis. It introduces an algorithm for the semantic preprocessing, filtering and typing of these building blocks. Further, this article models the process of wiki generation based on a unified format of syntactic, semantic and pragmatic representations. This three-level approach to make accessible syntactic, semantic and pragmatic aspects of wiki-based structure formation is complemented by a corresponding database model – called WikiDB – and an API operating thereon. Finally, the article provides an empirical study of using the three-fold representation format in conjunction with WikiDB.}, pdf = {http://www.ulliwaltinger.de/pdf/Konvens_2008_WikiDB_Building_Semantic_Databases_MehlerGleimErnstWaltinger.pdf}, year = 2008 }
-
G. Rehm, M. Santini, A. Mehler, P. Braslavski, R. Gleim, A. Stubbe, S. Symonenko, M. Tavosanis, and V. Vidulin, “Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems,” in Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco), 2008.
[Abstract] [BibTeX]We present initial results from an international and multi-disciplinary research collaboration that aims at the construction of a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are rather difficult to capture and to describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the respective genre or genres a web document or a website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. Furthermore, we discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres.
@InProceedings{Rehm:Santini:Mehler:Braslavski:Gleim:Stubbe:Symonenko:Tavosanis:Vidulin:2008, Author = {Rehm, Georg and Santini, Marina and Mehler, Alexander and Braslavski, Pavel and Gleim, Rüdiger and Stubbe, Andrea and Symonenko, Svetlana and Tavosanis, Mirko and Vidulin, Vedrana}, Title = {Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems}, BookTitle = {Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco)}, abstract = {We present initial results from an international and multi-disciplinary research collaboration that aims at the construction of a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are rather difficult to capture and to describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the respective genre or genres a web document or a website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. Furthermore, we discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres. }, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/rehm_santini_mehler_braslavski_gleim_stubbe_symonenko_tavosanis_vidulin_2008.pdf}, website = {http://www.lrec-conf.org/proceedings/lrec2008/summaries/94.html}, year = 2008 }
2007 (5)
-
R. Gleim, A. Mehler, M. Dehmer, and O. Abramov, “Aisles through the Category Forest – Utilising the Wikipedia Category System for Corpus Building in Machine Learning,” in 3rd International Conference on Web Information Systems and Technologies (WEBIST ’07), March 3-6, 2007, Barcelona, 2007, pp. 142-149.
[Abstract] [BibTeX]The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods be developed in order to tackle the problem of finding and organising relevant information. It has often been motivated that semantic classifications of input documents help solving this task. But while approaches of supervised text categorisation perform quite well on genres found in written text, newly evolved genres on the web are much more demanding. In order to successfully develop approaches to web mining, respective corpora are needed. However, the composition of genre- or domain-specific web corpora is still an unsolved problem. It is time consuming to build large corpora of good quality because web pages typically lack reliable meta information. Wikipedia along with similar approaches of collaborative text production offers a way out of this dilemma. We examine how social tagging, as supported by the MediaWiki software, can be utilised as a source of corpus building. Further, we describe a representation format for social ontologies and present the Wikipedia Category Explorer, a tool which supports categorical views to browse through the Wikipedia and to construct domain specific corpora for machine learning.
@InProceedings{Gleim:Mehler:Dehmer:Abramov:2007, Author = {Gleim, Rüdiger and Mehler, Alexander and Dehmer, Matthias and Abramov, Olga}, Title = {Aisles through the Category Forest – Utilising the Wikipedia Category System for Corpus Building in Machine Learning}, BookTitle = {3rd International Conference on Web Information Systems and Technologies (WEBIST '07), March 3-6, 2007, Barcelona}, Editor = {Filipe, Joaquim and Cordeiro, José and Encarnação, Bruno and Pedrosa, Vitor}, Pages = {142-149}, Address = {Barcelona}, abstract = {The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods be developed in order to tackle the problem of finding and organising relevant information. It has often been motivated that semantic classifications of input documents help solving this task. But while approaches of supervised text categorisation perform quite well on genres found in written text, newly evolved genres on the web are much more demanding. In order to successfully develop approaches to web mining, respective corpora are needed. However, the composition of genre- or domain-specific web corpora is still an unsolved problem. It is time consuming to build large corpora of good quality because web pages typically lack reliable meta information. Wikipedia along with similar approaches of collaborative text production offers a way out of this dilemma. We examine how social tagging, as supported by the MediaWiki software, can be utilised as a source of corpus building. Further, we describe a representation format for social ontologies and present the Wikipedia Category Explorer, a tool which supports categorical views to browse through the Wikipedia and to construct domain specific corpora for machine learning.}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2016/10/webist_2007-gleim_mehler_dehmer_pustylnikov.pdf}, year = 2007 }
-
A. Mehler, R. Gleim, and A. Wegner, “Structural Uncertainty of Hypertext Types. An Empirical Study,” in Proceedings of the Workshop “Towards Genre-Enabled Search Engines: The Impact of NLP”, September, 30, 2007, in conjunction with RANLP 2007, Borovets, Bulgaria, 2007, pp. 13-19.
[BibTeX]@InProceedings{Mehler:Gleim:Wegner:2007, Author = {Mehler, Alexander and Gleim, Rüdiger and Wegner, Armin}, Title = {Structural Uncertainty of Hypertext Types. An Empirical Study}, BookTitle = {Proceedings of the Workshop "Towards Genre-Enabled Search Engines: The Impact of NLP", September, 30, 2007, in conjunction with RANLP 2007, Borovets, Bulgaria}, Editor = {Rehm, Georg and Santini, Marina}, Pages = {13-19}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/RANLP.pdf}, year = 2007 }
-
A. Mehler, P. Geibel, R. Gleim, S. Herold, B. Jain, and O. Abramov, “Much Ado About Text Content. Learning Text Types Solely by Structural Differentiae,” in Proceedings of OTT ’06 – Ontologies in Text Technology: Approaches to Extract Semantic Knowledge from Structured Information, Osnabrück, 2007, pp. 63-71.
[Abstract] [BibTeX]In this paper, we deal with classifying texts into classes which denote text types whose textual instances serve more or less homogeneous functions. Other than mainstream approaches to text classification, which rely on the vector space model [30] or some of its descendants [2] and, thus, on content-related lexical features, we solely refer to structural differentiae, that is, to patterns of text structure as determinants of class membership. Further, we suppose that text types span a type hierarchy based on the type-subtype relation [31]. Thus, although we admit that class membership is fuzzy so that overlapping classes are inevitable, we suppose a non-overlapping type system structured into a rooted tree – whether solely based on functional or additional on, e.g., content- or mediabased criteria [1]. What regards criteria of goodness of classification, we perform a classical supervised categorization experiment [30] based on cross-validation as a method of model selection [11]. That is, we perform a categorization experiment in which for all training and test cases class membership is known ex ante. In summary, we perform a supervised experiment of text classification in order to learn functionally grounded text types where membership to these types is solely based on structural criteria.
@InProceedings{Mehler:Geibel:Gleim:Herold:Jain:Pustylnikov:2007, Author = {Mehler, Alexander and Geibel, Peter and Gleim, Rüdiger and Herold, Sebastian and Jain, Brijnesh-Johannes and Abramov, Olga}, Title = {Much Ado About Text Content. Learning Text Types Solely by Structural Differentiae}, BookTitle = {Proceedings of OTT '06 – Ontologies in Text Technology: Approaches to Extract Semantic Knowledge from Structured Information}, Editor = {Mönnich, Uwe and Kühnberger, Kai-Uwe}, Series = {Publications of the Institute of Cognitive Science (PICS)}, Pages = {63-71}, Address = {Osnabrück}, abstract = {In this paper, we deal with classifying texts into classes which denote text types whose textual instances serve more or less homogeneous functions. Other than mainstream approaches to text classification, which rely on the vector space model [30] or some of its descendants [2] and, thus, on content-related lexical features, we solely refer to structural differentiae, that is, to patterns of text structure as determinants of class membership. Further, we suppose that text types span a type hierarchy based on the type-subtype relation [31]. Thus, although we admit that class membership is fuzzy so that overlapping classes are inevitable, we suppose a non-overlapping type system structured into a rooted tree – whether solely based on functional or additional on, e.g., content- or mediabased criteria [1]. What regards criteria of goodness of classification, we perform a classical supervised categorization experiment [30] based on cross-validation as a method of model selection [11]. That is, we perform a categorization experiment in which for all training and test cases class membership is known ex ante. In summary, we perform a supervised experiment of text classification in order to learn functionally grounded text types where membership to these types is solely based on structural criteria.}, pdf = {http://ikw.uni-osnabrueck.de/~ott06/ott06-abstracts/Mehler_Geibel_abstract.pdf}, year = 2007 }
-
R. Gleim, A. Mehler, and H. Eikmeyer, “Representing and Maintaining Large Corpora,” in Proceedings of the Corpus Linguistics 2007 Conference, Birmingham (UK), 2007.
[BibTeX]@InProceedings{Gleim:Mehler:Eikmeyer:2007:a, Author = {Gleim, Rüdiger and Mehler, Alexander and Eikmeyer, Hans-Jürgen}, Title = {Representing and Maintaining Large Corpora}, BookTitle = {Proceedings of the Corpus Linguistics 2007 Conference, Birmingham (UK)}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_mehler_eikmeyer_2007_a.pdf}, year = 2007 }
-
R. Gleim, A. Mehler, H. Eikmeyer, and H. Rieser, “Ein Ansatz zur Repräsentation und Verarbeitung großer Korpora multimodaler Daten,” in Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007, 11.–13. April, Universität Tübingen, Tübingen, 2007, pp. 275-284.
[BibTeX]@InProceedings{Gleim:Mehler:Eikmeyer:Rieser:2007, Author = {Gleim, Rüdiger and Mehler, Alexander and Eikmeyer, Hans-Jürgen and Rieser, Hannes}, Title = {Ein Ansatz zur Repr{\"a}sentation und Verarbeitung gro{\ss}er Korpora multimodaler Daten}, BookTitle = {Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007, 11.–13. April, Universit{\"a}t Tübingen}, Editor = {Rehm, Georg and Witt, Andreas and Lemnitzer, Lothar}, Pages = {275-284}, Address = {Tübingen}, Publisher = {Narr}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_mehler_eikmeyer_rieser_2007.pdf}, year = 2007 }
2006 (5)
-
A. Mehler, R. Gleim, and M. Dehmer, “Towards Structure-Sensitive Hypertext Categorization,” in Proceedings of the 29th Annual Conference of the German Classification Society, March 9-11, 2005, Universität Magdeburg, Berlin/New York, 2006, pp. 406-413.
[Abstract] [BibTeX]Hypertext categorization is the task of automatically assigning category labels to hypertext units. Comparable to text categorization, it stays in the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. In order to outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.
@InProceedings{Mehler:Gleim:Dehmer:2006, Author = {Mehler, Alexander and Gleim, Rüdiger and Dehmer, Matthias}, Title = {Towards Structure-Sensitive Hypertext Categorization}, BookTitle = {Proceedings of the 29th Annual Conference of the German Classification Society, March 9-11, 2005, Universit{\"a}t Magdeburg}, Editor = {Spiliopoulou, Myra and Kruse, Rudolf and Borgelt, Christian and Nürnberger, Andreas and Gaul, Wolfgang}, Pages = {406-413}, Address = {Berlin/New York}, Publisher = {Springer}, abstract = {Hypertext categorization is the task of automatically assigning category labels to hypertext units. Comparable to text categorization, it stays in the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. In order to outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.}, website = {http://www.springerlink.com/content/l7665tm3u241317l/}, year = 2006 }
-
A. Mehler and R. Gleim, “The Net for the Graphs – Towards Webgenre Representation for Corpus Linguistic Studies,” in WaCky! Working Papers on the Web as Corpus, M. Baroni and S. Bernardini, Eds., Bologna: Gedit, 2006, pp. 191-224.
[BibTeX]@InCollection{Mehler:Gleim:2006:b, Author = {Mehler, Alexander and Gleim, Rüdiger}, Title = {The Net for the Graphs – Towards Webgenre Representation for Corpus Linguistic Studies}, BookTitle = {WaCky! Working Papers on the Web as Corpus}, Publisher = {Gedit}, Editor = {Baroni, Marco and Bernardini, Silvia}, Pages = {191-224}, Address = {Bologna}, website = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.510.4125}, year = 2006 }
-
R. Gleim, A. Mehler, and M. Dehmer, “Web Corpus Mining by Instance of Wikipedia,” in Proceedings of the EACL 2006 Workshop on Web as Corpus, April 3-7, 2006, Trento, Italy, 2006, pp. 67-74.
[Abstract] [BibTeX]Workshop organizer: Adam Kilgarriff
@InProceedings{Gleim:Mehler:Dehmer:2006:a, Author = {Gleim, Rüdiger and Mehler, Alexander and Dehmer, Matthias}, Title = {Web Corpus Mining by Instance of Wikipedia}, BookTitle = {Proceedings of the EACL 2006 Workshop on Web as Corpus, April 3-7, 2006, Trento, Italy}, Editor = {Kilgarriff, Adam and Baroni, Marco}, Pages = {67-74}, abstract = {Workshop organizer: Adam Kilgarriff}, pdf = {http://www.aclweb.org/anthology/W06-1710}, website = {http://pub.uni-bielefeld.de/publication/1773538}, year = 2006 }
-
R. Gleim, “HyGraph – Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertextstrukturen,” in Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beiträge zur GLDV-Tagung 2005, Universität Bonn, Frankfurt a. M., 2006, pp. 42-53.
[BibTeX]@InProceedings{Gleim:2006, Author = {Gleim, Rüdiger}, Title = {HyGraph - Ein Framework zur Extraktion, Repr{\"a}sentation und Analyse webbasierter Hypertextstrukturen}, BookTitle = {Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beitr{\"a}ge zur GLDV-Tagung 2005, Universit{\"a}t Bonn}, Editor = {Fisseni, Bernhard and Schmitz, Hans-Christian and Schröder, Bernhard and Wagner, Petra}, Pages = {42-53}, Address = {Frankfurt a. M.}, Publisher = {Lang}, pdf = {https://www.texttechnologylab.org/wp-content/uploads/2016/10/GLDV2005-HyGraph-Framework.pdf}, website = {https://www.researchgate.net/publication/268294000_HyGraph__Ein_Framework_zur_Extraktion_Reprsentation_und_Analyse_webbasierter_Hypertextstrukturen}, year = 2006 }
-
A. Mehler, M. Dehmer, and R. Gleim, “Towards Logical Hypertext Structure – A Graph-Theoretic Perspective,” in Proceedings of the Fourth International Workshop on Innovative Internet Computing Systems (I2CS ’04), Berlin/New York, 2006, pp. 136-150.
[Abstract] [BibTeX]Facing the retrieval problem posed by the overwhelming set of documents online, the adaptation of text categorization to web units has recently been pushed. The aim is to utilize categories of web sites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been utilized, just as HTML tags and link structures have. In spite of promising results, this adaptation stays within the framework of IR-specific models, since it neglects the content-based structuring inherent to hypertext units. This paper approaches hypertext modelling from the perspective of graph theory. It presents an XML-based format for representing websites as hypergraphs. These hypergraphs are used to shed light on the relation of hypertext structure types and their web-based instances. We place emphasis on two characteristics of this relation: in terms of realizational ambiguity, we speak of functional equivalents to the manifestation of the same structure type; in terms of polymorphism, we speak of a single web unit which manifests different structure types. It is shown that polymorphism is a prevalent characteristic of web-based units. This is done by means of a categorization experiment which analyses a corpus of hypergraphs representing the structure and content of pages of conference websites. Against this background, we plead for a revision of text representation models by means of hypergraphs which are sensitive to the manifold structuring of web documents.
@InProceedings{Mehler:Dehmer:Gleim:2006, Author = {Mehler, Alexander and Dehmer, Matthias and Gleim, Rüdiger}, Title = {Towards Logical Hypertext Structure - A Graph-Theoretic Perspective}, BookTitle = {Proceedings of the Fourth International Workshop on Innovative Internet Computing Systems (I2CS '04)}, Editor = {Böhme, Thomas and Heyer, Gerhard}, Series = {Lecture Notes in Computer Science 3473}, Pages = {136-150}, Address = {Berlin/New York}, Publisher = {Springer}, abstract = {Facing the retrieval problem posed by the overwhelming set of documents online, the adaptation of text categorization to web units has recently been pushed. The aim is to utilize categories of web sites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been utilized, just as HTML tags and link structures have. In spite of promising results, this adaptation stays within the framework of IR-specific models, since it neglects the content-based structuring inherent to hypertext units. This paper approaches hypertext modelling from the perspective of graph theory. It presents an XML-based format for representing websites as hypergraphs. These hypergraphs are used to shed light on the relation of hypertext structure types and their web-based instances. We place emphasis on two characteristics of this relation: in terms of realizational ambiguity, we speak of functional equivalents to the manifestation of the same structure type; in terms of polymorphism, we speak of a single web unit which manifests different structure types. It is shown that polymorphism is a prevalent characteristic of web-based units. This is done by means of a categorization experiment which analyses a corpus of hypergraphs representing the structure and content of pages of conference websites. Against this background, we plead for a revision of text representation models by means of hypergraphs which are sensitive to the manifold structuring of web documents.}, website = {http://rd.springer.com/chapter/10.1007/11553762_14}, year = 2006 }
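The XML format itself is not reproduced on this page, so the following mock-up merely illustrates the idea of serializing a website as a hypergraph: pages become nodes, and a hyperedge groups all pages that jointly manifest one logical unit. All element and attribute names are hypothetical, not the paper's actual schema.

# Hypothetical serialization of a website as a hypergraph. Nodes are web
# pages; a hyperedge connects all pages that jointly manifest one logical
# hypertext unit. Element names are invented for illustration.
import xml.etree.ElementTree as ET

graph = ET.Element("hypergraph", site="http://conference.example.org")
for url in ("index.html", "cfp.html", "committee.html"):
    ET.SubElement(graph, "node", id=url)

# One logical unit ("call for papers") realized across two pages:
edge = ET.SubElement(graph, "hyperedge", type="call-for-papers")
ET.SubElement(edge, "member", ref="index.html")
ET.SubElement(edge, "member", ref="cfp.html")

print(ET.tostring(graph, encoding="unicode"))

Unlike an ordinary link graph, a hyperedge here can span any number of pages, which is what makes the representation sensitive to logical units that do not coincide with single web pages.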
2005 (2)
-
A. Mehler and R. Gleim, “Polymorphism in Generic Web Units. A corpus linguistic study,” in Proceedings of Corpus Linguistics ’05, July 14-17, 2005, University of Birmingham, Great Britain, 2005.
[Abstract] [BibTeX]Corpus linguistics and related disciplines which focus on statistical analyses of textual units have a substantial need for large corpora. More specifically, genre- or register-specific corpora are needed which allow studying variations in language use. Along with the incredible growth of the internet, the web became an important source of linguistic data. Of course, web corpora face the same problem of acquiring genre-specific corpora. Amongst other things, web mining is a framework of methods for automatically assigning category labels to web units and thus may be seen as a solution to this corpus acquisition problem as far as genre categories are applied. The paper argues that this approach faces the problem of a many-to-many relation between expression units on the one hand and content or function units on the other. A quantitative study is performed which supports the argument that functions of web-based communication are very often concentrated on single web pages and thus interfere with any effort to directly apply the classical apparatus of categorization at the web page level. The paper outlines a two-level algorithm as an alternative approach to category assignment which is sensitive to genre-specific structures and thus may be used to tackle the problem of acquiring genre-specific corpora.
@InProceedings{Mehler:Gleim:2005:a, Author = {Mehler, Alexander and Gleim, Rüdiger}, Title = {Polymorphism in Generic Web Units. A corpus linguistic study}, BookTitle = {Proceedings of Corpus Linguistics '05, July 14-17, 2005, University of Birmingham, Great Britain}, Volume = {Corpus Linguistics Conference Series 1(1)}, abstract = {Corpus linguistics and related disciplines which focus on statistical analyses of textual units have a substantial need for large corpora. More specifically, genre- or register-specific corpora are needed which allow studying variations in language use. Along with the incredible growth of the internet, the web became an important source of linguistic data. Of course, web corpora face the same problem of acquiring genre-specific corpora. Amongst other things, web mining is a framework of methods for automatically assigning category labels to web units and thus may be seen as a solution to this corpus acquisition problem as far as genre categories are applied. The paper argues that this approach faces the problem of a many-to-many relation between expression units on the one hand and content or function units on the other. A quantitative study is performed which supports the argument that functions of web-based communication are very often concentrated on single web pages and thus interfere with any effort to directly apply the classical apparatus of categorization at the web page level. The paper outlines a two-level algorithm as an alternative approach to category assignment which is sensitive to genre-specific structures and thus may be used to tackle the problem of acquiring genre-specific corpora.}, issn = {1747-9398}, pdf = {http://www.birmingham.ac.uk/Documents/college-artslaw/corpus/conference-archives/2005-journal/Thewebasacorpus/AlexanderMehlerandRuedigerGleimCorpusLinguistics2005.pdf}, year = 2005 }
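The two-level algorithm is only outlined in the abstract. Assuming that level one assigns functional categories to segments of a page and level two derives page-level labels from them, a sketch could look as follows; segment() and classify_module() are invented stand-ins, not the authors' method.

# Sketch of a two-level category assignment in the spirit of the abstract:
# level 1 labels the functional modules of a page, level 2 derives the
# page-level genre labels from the multiset of module labels.
from collections import Counter

def segment(page_html):
    # Split a page into candidate functional modules.
    # (Trivially by a horizontal rule here; a real system would parse HTML.)
    return [part for part in page_html.split("<hr>") if part.strip()]

def classify_module(module_text):
    # Level 1: assign a functional category to a single module.
    keywords = {"submission": "cfp", "hotel": "venue", "schedule": "program"}
    for word, label in keywords.items():
        if word in module_text.lower():
            return label
    return "other"

def classify_page(page_html):
    # Level 2: aggregate module labels into page-level genre labels.
    counts = Counter(classify_module(m) for m in segment(page_html))
    return [label for label, _ in counts.most_common() if label != "other"]

page = "Paper submission deadline ... <hr> Conference schedule ..."
print(classify_page(page))  # ['cfp', 'program'] -> a polymorphic page

The design point is that category assignment never happens directly at the page level: a page's labels are derived from its functional parts, so a page concentrating several functions naturally receives several labels.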
-
A. Mehler, M. Dehmer, and R. Gleim, “Zur Automatischen Klassifikation von Webgenres,” in Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beiträge zur GLDV-Frühjahrstagung ’05, 30. März – 01. April 2005, Universität Bonn, Frankfurt a. M., 2005, pp. 158-174.
[BibTeX]@InProceedings{Mehler:Dehmer:Gleim:2005, Author = {Mehler, Alexander and Dehmer, Matthias and Gleim, Rüdiger}, Title = {Zur Automatischen Klassifikation von Webgenres}, BookTitle = {Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beitr{\"a}ge zur GLDV-Frühjahrstagung '05, 30. M{\"a}rz – 01. April 2005, Universit{\"a}t Bonn}, Editor = {Fisseni, Bernhard and Schmitz, Hans-Christian and Schröder, Bernhard and Wagner, Petra}, Pages = {158-174}, Address = {Frankfurt a. M.}, Publisher = {Lang}, year = 2005 }
2004 (1)
-
M. Dehmer, A. Mehler, and R. Gleim, “Aspekte der Kategorisierung von Webseiten,” in INFORMATIK 2004 – Informatik verbindet, Band 2, Beiträge der 34. Jahrestagung der Gesellschaft für Informatik e.V. (GI). Workshop Multimedia-Informationssysteme, 2004, pp. 39-43.
[Abstract] [BibTeX]In the course of web-based communication, the question arises to what extent web pages can be categorized for the purpose of content-oriented filtering. This study examines two phenomena which concern the precondition of the possibility of such a categorization (see [6]): with the notion of functional equivalence, we refer to the phenomenon that the same function or content category can be manifested by completely different building blocks of web-based documents; with the notion of polymorphism, we refer to the phenomenon that the same document can manifest several function or content categories at once. The central hypothesis is that both phenomena are characteristic of web-based hypertext structures. If this is the case, the automatic categorization of hypertexts [2, 10] can no longer be understood as an unambiguous assignment in which exactly one category is assigned to a document. In this sense, the paper addresses the question of how to adequately model multimedia documents.
@InProceedings{Dehmer:Mehler:Gleim:2004, Author = {Dehmer, Matthias and Mehler, Alexander and Gleim, Rüdiger}, Title = {Aspekte der Kategorisierung von Webseiten}, BookTitle = {INFORMATIK 2004 – Informatik verbindet, Band 2, Beitr{\"a}ge der 34. Jahrestagung der Gesellschaft für Informatik e.V. (GI). Workshop Multimedia-Informationssysteme}, Editor = {Dadam, Peter and Reichert, Manfred}, Volume = {51}, Series = {Lecture Notes in Informatics}, Pages = {39-43}, Publisher = {GI}, abstract = {Im Zuge der Web-basierten Kommunikation tritt die Frage auf, inwiefern Webpages zum Zwecke ihrer inhaltsorientierten Filterung kategorisiert werden können. Diese Studie untersucht zwei Ph{\"a}nomene, welche die Bedingung der Möglichkeit einer solchen Kategorisierung betreffen (siehe [6]): Mit dem Begriff der funktionalen {\"A}quivalenz beziehen wir uns auf das Ph{\"a}nomen, dass dieselbe Funktions- oder Inhaltskategorie durch völlig verschiedene Bausteine Web-basierter Dokumente manifestiert werden kann. Mit dem Begriff der Polymorphie beziehen wir uns auf das Ph{\"a}nomen, dass dasselbe Dokument zugleich mehrere Funktions- oder Inhaltskategorien manifestieren kann. Die zentrale Hypothese lautet, dass beide Ph{\"a}nomene für Web-basierte Hypertextstrukturen charakteristisch sind. Ist dies der Fall, so kann die automatische Kategorisierung von Hypertexten [2, 10] nicht mehr als eindeutige Zuordnung verstanden werden, bei der einem Dokument genau eine Kategorie zugeordnet wird. In diesem Sinne thematisiert das Papier die Frage nach der ad{\"a}quaten Modellierung multimedialer Dokumente.}, pdf = {http://subs.emis.de/LNI/Proceedings/Proceedings51/GI-Proceedings.51-11.pdf}, website = {https://www.researchgate.net/publication/221385316_Aspekte_der_Kategorisierung_von_Webseiten}, year = 2004 }
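One practical reading of the polymorphism hypothesis is that page categorization becomes a multi-label problem, so a single document may legitimately receive several function or content categories at once. The sketch below (my illustration with scikit-learn, not the paper's method; the pages and labels are invented) shows such a multi-label assignment.

# Illustrative consequence of polymorphism: categorization as multi-label
# assignment, where one page may receive several category labels.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

pages = [
    "call for papers submission deadline",
    "hotel directions travel venue",
    "submission deadline and hotel information",  # polymorphic page
]
labels = [{"cfp"}, {"venue"}, {"cfp", "venue"}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)          # binary indicator matrix
X = TfidfVectorizer().fit_transform(pages)

# One binary classifier per category; each page gets a set of labels.
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(mlb.inverse_transform(clf.predict(X)))  # one label tuple per page

The one-vs-rest construction drops the assumption of an unambiguous, exactly-one-category assignment that the abstract argues against.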