Rüdiger Gleim

External doctoral candidate

Linguistic Databases

Almost any study in corpus linguistics boils down to constructing, annotating, representing and analyzing linguistic data. The requirements for a proper database are often contradictory:

  • It should scale well with ever-growing corpora such as Wikipedia, while still remaining flexible for annotation and editing.
  • It should serve a broad spectrum of analyses by minimizing the need to transform data for a specific kind of analysis, while still being space-efficient.
  • The data model should mediate between standard formats without becoming over-generic and difficult to handle.
  • ….

Designing and developing linguistic databases has become a major topic for me. Realizing that there is no such thing as the ultimate solution, I am interested in all kinds of database management systems and paradigms, including relational, graph, distributed and NoSQL databases, as well as APIs for persistent storage.
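To make the idea of such a storage API concrete, here is a minimal sketch in Java. All type and method names (AnnotationStore, MemoryStore, addToken, annotate) are hypothetical illustrations, not taken from any of the systems or publications listed below; the point is only that analysis code programs against an interface while the backing store remains exchangeable:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Optional;

    // Hypothetical storage-agnostic API: analysis code sees only this interface.
    interface AnnotationStore {
        long addToken(String surface);                 // returns a token id
        void annotate(long tokenId, String layer, String value);
        Optional<String> getAnnotation(long tokenId, String layer);
    }

    // In-memory reference backend; a relational, graph or NoSQL implementation
    // could be substituted behind the same interface without touching any
    // analysis code.
    final class MemoryStore implements AnnotationStore {
        private final List<String> tokens = new ArrayList<>();
        private final Map<Long, Map<String, String>> annotations = new HashMap<>();

        public long addToken(String surface) {
            tokens.add(surface);
            return tokens.size() - 1;
        }

        public void annotate(long tokenId, String layer, String value) {
            annotations.computeIfAbsent(tokenId, k -> new HashMap<>()).put(layer, value);
        }

        public Optional<String> getAnnotation(long tokenId, String layer) {
            return Optional.ofNullable(annotations.getOrDefault(tokenId, Map.of()).get(layer));
        }
    }

    public class AnnotationDemo {
        public static void main(String[] args) {
            AnnotationStore store = new MemoryStore();
            long id = store.addToken("lupus");
            store.annotate(id, "lemma", "lupus");  // annotation layers are free-form
            store.annotate(id, "pos", "NN");
            System.out.println(store.getAnnotation(id, "lemma").orElse("?"));  // prints "lupus"
        }
    }

None of this prescribes a particular backend; keeping persistence behind an interface is precisely what allows the paradigms above to be compared against each other.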

Publications

Total: 46

2020 (1)

  • [DOI] A. Mehler, R. Gleim, R. Gaitsch, T. Uslu, and W. Hemati, “From Topic Networks to Distributed Cognitive Maps: Zipfian Topic Universes in the Area of Volunteered Geographic Information,” Complexity, vol. 4, pp. 1-47, 2020.
    [BibTeX]

    @Article{Mehler:Gleim:Gaitsch:Uslu:Hemati:2020,
      Author         = {Alexander Mehler and R{\"{u}}diger Gleim and Regina Gaitsch and Tolga Uslu and Wahed Hemati},
      Title          = {From Topic Networks to Distributed Cognitive Maps: {Zipfian} Topic Universes in the Area of Volunteered Geographic Information},
      Journal        = {Complexity},
      Volume         = {4},
      Pages          = {1-47},
      issuetitle     = {Cognitive Network Science: A New Frontier},
      doi            = {10.1155/2020/4607025},
      year           = 2020
    }

2019 (2)

  • [PDF] A. Mehler, T. Uslu, R. Gleim, and D. Baumartz, “text2ddc meets Literature – Ein Verfahren für die Analyse und Visualisierung thematischer Makrostrukturen,” in Proceedings of the 6th Digital Humanities Conference in the German-speaking Countries, DHd 2019, 2019.
    [Poster][BibTeX]

    @InProceedings{Mehler:Uslu:Gleim:Baumartz:2019,
      Author         = {Mehler, Alexander and Uslu, Tolga and Gleim, Rüdiger and Baumartz, Daniel},
      Title          = {{text2ddc meets Literature - Ein Verfahren für die Analyse und Visualisierung thematischer Makrostrukturen}},
      BookTitle      = {Proceedings of the 6th Digital Humanities Conference in the German-speaking Countries, DHd 2019},
      poster   = {https://www.texttechnologylab.org/wp-content/uploads/2019/04/DHD_Poster___text2ddc_meets_Literature_Poster.pdf},
      Series         = {DHd 2019},
      pdf     = {https://www.texttechnologylab.org/wp-content/uploads/2019/04/Preprint_DHd2019_text2ddc_meets_Literature.pdf},
      location       = {Frankfurt, Germany},
      year           = 2019
    }
  • [PDF] [http://jlm.ipipan.waw.pl/index.php/JLM/article/view/205] [DOI] R. Gleim, S. Eger, A. Mehler, T. Uslu, W. Hemati, A. Lücking, A. Henlein, S. Kahlsdorf, and A. Hoenen, “A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin,” Journal of Language Modeling, 2019.
    [BibTeX]

    @article{Gleim:Eger:Mehler:2019,
      author    = {Gleim, R\"{u}diger and Eger, Steffen and Mehler, Alexander and Uslu, Tolga and Hemati, Wahed and L\"{u}cking, Andy and Henlein, Alexander and Kahlsdorf, Sven and Hoenen, Armin},
      title     = {A practitioner's view: a survey and comparison of lemmatization and morphological tagging in German and Latin},
      journal   = {Journal of Language Modeling},
      year      = {2019},
      pdf = {https://www.texttechnologylab.org/wp-content/uploads/2019/07/jlm-tagging.pdf},
      doi = {10.15398/jlm.v7i1.205},
      url = {http://jlm.ipipan.waw.pl/index.php/JLM/article/view/205} 
    }

2018 (6)

  • [https://doi.org/10.3897/biss.2.25876] [DOI] C. Driller, M. Koch, M. Schmidt, C. Weiland, T. Hörnschemeyer, T. Hickler, G. Abrami, S. Ahmed, R. Gleim, W. Hemati, T. Uslu, A. Mehler, A. Pachzelt, J. Rexhepi, T. Risse, J. Schuster, G. Kasperek, and A. Hausinger, “Workflow and Current Achievements of BIOfid, an Information Service Mobilizing Biodiversity Data from Literature Sources,” Biodiversity Information Science and Standards, vol. 2, p. e25876, 2018.
    [Abstract] [BibTeX]

    BIOfid is a specialized information service currently being developed to mobilize biodiversity data dormant in printed historical and modern literature and to offer a platform for open access journals on the science of biodiversity. Our team of librarians, computer scientists and biologists produce high-quality text digitizations, develop new text-mining tools and generate detailed ontologies enabling semantic text analysis and semantic search by means of user-specific queries. In a pilot project we focus on German publications on the distribution and ecology of vascular plants, birds, moths and butterflies extending back to the Linnaeus period about 250 years ago. The three organism groups have been selected according to current demands of the relevant research community in Germany. The text corpus defined for this purpose comprises over 400 volumes with more than 100,000 pages to be digitized and will be complemented by journals from other digitization projects, copyright-free and project-related literature. With TextImager (Natural Language Processing & Text Visualization) and TextAnnotator (Discourse Semantic Annotation) we have already extended and launched tools that focus on the text-analytical section of our project. Furthermore, taxonomic and anatomical ontologies elaborated by us for the taxa prioritized by the project’s target group - German institutions and scientists active in biodiversity research - are constantly improved and expanded to maximize scientific data output. Our poster describes the general workflow of our project ranging from literature acquisition via software development, to data availability on the BIOfid web portal (http://biofid.de/), and the implementation into existing platforms which serve to promote global accessibility of biodiversity data.
    @article{Driller:et:al:2018,
            author = {Christine Driller and Markus Koch and Marco Schmidt and Claus Weiland and Thomas Hörnschemeyer and Thomas Hickler and Giuseppe Abrami and Sajawel Ahmed and Rüdiger Gleim and Wahed Hemati and Tolga Uslu and Alexander Mehler and Adrian Pachzelt and Jashar Rexhepi and Thomas Risse and Janina Schuster and Gerwin Kasperek and Angela Hausinger},
            title = {Workflow and Current Achievements of BIOfid, an Information Service Mobilizing Biodiversity Data from Literature Sources},
            volume = {2},
            year = {2018},
            doi = {10.3897/biss.2.25876},
            publisher = {Pensoft Publishers},
            abstract = {BIOfid is a specialized information service currently being developed to mobilize biodiversity data dormant in printed historical and modern literature and to offer a platform for open access journals on the science of biodiversity. Our team of librarians, computer scientists and biologists produce high-quality text digitizations, develop new text-mining tools and generate detailed ontologies enabling semantic text analysis and semantic search by means of user-specific queries. In a pilot project we focus on German publications on the distribution and ecology of vascular plants, birds, moths and butterflies extending back to the Linnaeus period about 250 years ago. The three organism groups have been selected according to current demands of the relevant research community in Germany. The text corpus defined for this purpose comprises over 400 volumes with more than 100,000 pages to be digitized and will be complemented by journals from other digitization projects, copyright-free and project-related literature. With TextImager (Natural Language Processing & Text Visualization) and TextAnnotator (Discourse Semantic Annotation) we have already extended and launched tools that focus on the text-analytical section of our project. Furthermore, taxonomic and anatomical ontologies elaborated by us for the taxa prioritized by the project’s target group - German institutions and scientists active in biodiversity research - are constantly improved and expanded to maximize scientific data output. Our poster describes the general workflow of our project ranging from literature acquisition via software development, to data availability on the BIOfid web portal (http://biofid.de/), and the implementation into existing platforms which serve to promote global accessibility of biodiversity data.},
            pages = {e25876},
            URL = {https://doi.org/10.3897/biss.2.25876},
            eprint = {https://doi.org/10.3897/biss.2.25876},
            journal = {Biodiversity Information Science and Standards}
    }
  • A. Mehler, W. Hemati, R. Gleim, and D. Baumartz, “VienNA: Auf dem Weg zu einer Infrastruktur für die verteilte interaktive evolutionäre Verarbeitung natürlicher Sprache,” in Forschungsinfrastrukturen und digitale Informationssysteme in der germanistischen Sprachwissenschaft, H. Lobin, R. Schneider, and A. Witt, Eds., Berlin: De Gruyter, 2018, vol. 6.
    [BibTeX]

    @InCollection{Mehler:Hemati:Gleim:Baumartz:2018,
      Author         = {Alexander Mehler and Wahed Hemati and Rüdiger Gleim
                       and Daniel Baumartz},
      Title          = {{VienNA: }{Auf dem Weg zu einer Infrastruktur für die verteilte
                       interaktive evolutionäre Verarbeitung natürlicher
                       Sprache}},
      BookTitle      = {Forschungsinfrastrukturen und digitale
                       Informationssysteme in der germanistischen
                       Sprachwissenschaft},
      Publisher      = {De Gruyter},
      Editor         = {Henning Lobin and Roman Schneider and Andreas Witt},
      Volume         = {6},
      Address        = {Berlin},
      year           = 2018
    }
  • T. Uslu, L. Miebach, S. Wolfsgruber, M. Wagner, K. Fließbach, R. Gleim, W. Hemati, A. Henlein, and A. Mehler, “Automatic Classification in Memory Clinic Patients and in Depressive Patients,” in Proceedings of Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric impairments (RaPID-2), 2018.
    [BibTeX]

    @InProceedings{Uslu:et:al:2018:a,
      Author         = {Tolga Uslu and Lisa Miebach and Steffen Wolfsgruber
                       and Michael Wagner and Klaus Fließbach and Rüdiger
                       Gleim and Wahed Hemati and Alexander Henlein and
                       Alexander Mehler},
      Title          = {{Automatic Classification in Memory Clinic Patients
                       and in Depressive Patients}},
      BookTitle      = {Proceedings of Resources and ProcessIng of linguistic,
                       para-linguistic and extra-linguistic Data from people
                       with various forms of cognitive/psychiatric impairments
                       (RaPID-2)},
      Series         = {RaPID},
      location       = {Miyazaki, Japan},
      year           = 2018
    }
  • [PDF] A. Mehler, R. Gleim, A. Lücking, T. Uslu, and C. Stegbauer, “On the Self-similarity of Wikipedia Talks: a Combined Discourse-analytical and Quantitative Approach,” Glottometrics, vol. 40, pp. 1-44, 2018.
    [BibTeX]

    @Article{Mehler:Gleim:Luecking:Uslu:Stegbauer:2018,
      Author         = {Alexander Mehler and Rüdiger Gleim and Andy Lücking
                       and Tolga Uslu and Christian Stegbauer},
      Title          = {On the Self-similarity of {Wikipedia} Talks: a
                       Combined Discourse-analytical and Quantitative Approach},
      Journal        = {Glottometrics},
      Volume         = {40},
      Pages          = {1-44},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2018/03/Glottometrics-Mehler.pdf},
      year           = 2018
    }
  • [PDF] R. Gleim, A. Mehler, and S. Y. Song, “WikiDragon: A Java Framework For Diachronic Content And Network Analysis Of MediaWikis,” in Proceedings of the 11th edition of the Language Resources and Evaluation Conference, May 7 – 12, Miyazaki, Japan, 2018.
    [BibTeX]

    @InProceedings{Gleim:Mehler:Song:2018,
      Author         = {R{\"u}diger Gleim and Alexander Mehler and Sung Y.
                       Song},
      Title          = {WikiDragon: A Java Framework For Diachronic Content
                       And Network Analysis Of MediaWikis},
      BookTitle      = {Proceedings of the 11th edition of the Language
                       Resources and Evaluation Conference, May 7 - 12},
      Series         = {LREC 2018},
      Address        = {Miyazaki, Japan},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2018/03/WikiDragon.pdf},
      year           = 2018
    }
  • G. Abrami, S. Ahmed, R. Gleim, W. Hemati, A. Mehler, and T. Uslu, Natural Language Processing and Text Mining for BIOfid, 2018.
    [BibTeX]

    @misc{Abrami:et:al:2018b,
     author = {Abrami, Giuseppe and Ahmed, Sajawel and Gleim, R{\"u}diger and Hemati, Wahed and Mehler, Alexander and Uslu, Tolga},
     title = {{Natural Language Processing and Text Mining for BIOfid}},
     howpublished = {Presentation at the 1st Meeting of the Scientific Advisory Board of the BIOfid Project},
     address = {Goethe-University, Frankfurt am Main, Germany},
     year = {2018},
     month = {March},
     day = {08},
    }

2017 (1)

  • A. Mehler, R. Gleim, W. Hemati, and T. Uslu, “Skalenfreie online soziale Lexika am Beispiel von Wiktionary,” in Proceedings of 53rd Annual Conference of the Institut für Deutsche Sprache (IDS), March 14-16, Mannheim, Germany, Berlin, 2017. In German; the title translates as: Scale-free Online Social Lexica by the Example of Wiktionary.
    [Abstract] [BibTeX]

    In English: The paper deals with characteristics of the structural, thematic and participatory dynamics of collaboratively generated lexical networks. This is done by example of Wiktionary. Starting from a network-theoretical model in terms of so-called multi-layer networks, we describe Wiktionary as a scale-free lexicon. Systems of this sort are characterized by the fact that their content-related dynamics is determined by the underlying dynamics of collaborating authors. This happens in a way that social structure imprints on content structure. According to this conception, the unequal distribution of the activities of authors results in a correspondingly unequal distribution of the information units documented within the lexicon. The paper focuses on foundations for describing such systems starting from a parameter space which requires to deal with Wiktionary as an issue in big data analysis.

    In German: Der Beitrag thematisiert Eigenschaften der strukturellen, thematischen und partizipativen Dynamik kollaborativ erzeugter lexikalischer Netzwerke am Beispiel von Wiktionary. Ausgehend von einem netzwerktheoretischen Modell in Form so genannter Mehrebenennetzwerke wird Wiktionary als ein skalenfreies Lexikon beschrieben. Systeme dieser Art zeichnen sich dadurch aus, dass ihre inhaltliche Dynamik durch die zugrundeliegende Kollaborationsdynamik bestimmt wird, und zwar so, dass sich die soziale Struktur der entsprechenden inhaltlichen Struktur aufprägt. Dieser Auffassung gemäß führt die Ungleichverteilung der Aktivitäten von Lexikonproduzenten zu einer analogen Ungleichverteilung der im Lexikon dokumentierten Informationseinheiten. Der Beitrag thematisiert Grundlagen zur Beschreibung solcher Systeme ausgehend von einem Parameterraum, welcher die netzwerkanalytische Betrachtung von Wiktionary als Big-Data-Problem darstellt.
    @InProceedings{Mehler:Gleim:Hemati:Uslu:2017,
      Author         = {Alexander Mehler and Rüdiger Gleim and Wahed Hemati
                       and Tolga Uslu},
      Title          = {{Skalenfreie online soziale Lexika am Beispiel von
                       Wiktionary}},
      BookTitle      = {Proceedings of 53rd Annual Conference of the Institut
                       für Deutsche Sprache (IDS), March 14-16, Mannheim,
                       Germany},
      Editor         = {Stefan Engelberg and Henning Lobin and Kathrin Steyer
                       and Sascha Wolfer},
      Address        = {Berlin},
      Publisher      = {De Gruyter},
      Note           = {In German; the title translates as: Scale-free
                       Online Social Lexica by the Example of Wiktionary},
      abstract       = {In English: The paper deals with characteristics of
    the structural, thematic and participatory dynamics of
    collaboratively generated lexical networks. This is
    done by example of Wiktionary. Starting from a
    network-theoretical model in terms of so-called
    multi-layer networks, we describe Wiktionary as a
    scale-free lexicon. Systems of this sort are
    characterized by the fact that their content-related
    dynamics is determined by the underlying dynamics of
    collaborating authors. This happens in a way that
    social structure imprints on content structure.
    According to this conception, the unequal distribution
    of the activities of authors results in a
    correspondingly unequal distribution of the information
    units documented within the lexicon. The paper focuses
    on foundations for describing such systems starting
    from a parameter space which requires to deal with
    Wiktionary as an issue in big data analysis. 
    In German:
    Der Beitrag thematisiert Eigenschaften der
    strukturellen, thematischen und partizipativen Dynamik
    kollaborativ erzeugter lexikalischer Netzwerke am
    Beispiel von Wiktionary. Ausgehend von einem
    netzwerktheoretischen Modell in Form so genannter
    Mehrebenennetzwerke wird Wiktionary als ein
    skalenfreies Lexikon beschrieben. Systeme dieser Art
    zeichnen sich dadurch aus, dass ihre inhaltliche
    Dynamik durch die zugrundeliegende
    Kollaborationsdynamik bestimmt wird, und zwar so, dass
    sich die soziale Struktur der entsprechenden
    inhaltlichen Struktur aufprägt. Dieser Auffassung
    gemäß führt die Ungleichverteilung der Aktivitäten
    von Lexikonproduzenten zu einer analogen
    Ungleichverteilung der im Lexikon dokumentierten
    Informationseinheiten. Der Beitrag thematisiert
    Grundlagen zur Beschreibung solcher Systeme ausgehend
    von einem Parameterraum, welcher die
    netzwerkanalytische Betrachtung von Wiktionary als
    Big-Data-Problem darstellt.},
      year           = 2017
    }

2016 (3)

  • [http://dh2016.adho.org/abstracts/250] A. Mehler, B. Wagner, and R. Gleim, “Wikidition: Towards A Multi-layer Network Model of Intertextuality,” in Proceedings of DH 2016, 12-16 July, 2016.
    [Abstract] [BibTeX]

    The paper presents Wikidition, a novel text mining tool for generating online editions of text corpora. It explores lexical, sentential and textual relations to span multi-layer networks (linkification) that allow for browsing syntagmatic and paradigmatic relations among the constituents of its input texts. In this way, relations of text reuse can be explored together with lexical relations within the same literary memory information system. Beyond that, Wikidition contains a module for automatic lexiconisation to extract author specific vocabularies. Based on linkification and lexiconisation, Wikidition does not only allow for traversing input corpora on different (lexical, sentential and textual) levels. Rather, its readers can also study the vocabulary of authors on several levels of resolution including superlemmas, lemmas, syntactic words and wordforms. We exemplify Wikidition by a range of literary texts and evaluate it by means of the apparatus of quantitative network analysis.
    @InProceedings{Mehler:Wagner:Gleim:2016,
      Author         = {Mehler, Alexander and Wagner, Benno and Gleim,
                       R\"{u}diger},
      Title          = {Wikidition: Towards A Multi-layer Network Model of
                       Intertextuality},
      BookTitle      = {Proceedings of DH 2016, 12-16 July},
      Series         = {DH 2016},
      abstract       = {The paper presents Wikidition, a novel text mining
    tool for generating online editions of text corpora. It
    explores lexical, sentential and textual relations to
    span multi-layer networks (linkification) that allow
    for browsing syntagmatic and paradigmatic relations
    among the constituents of its input texts. In this way,
    relations of text reuse can be explored together with
    lexical relations within the same literary memory
    information system. Beyond that, Wikidition contains a
    module for automatic lexiconisation to extract author
    specific vocabularies. Based on linkification and
    lexiconisation, Wikidition does not only allow for
    traversing input corpora on different (lexical,
    sentential and textual) levels. Rather, its readers can
    also study the vocabulary of authors on several levels
    of resolution including superlemmas, lemmas, syntactic
    words and wordforms. We exemplify Wikidition by a range
    of literary texts and evaluate it by means of the
    apparatus of quantitative network analysis.},
      location       = {Kraków},
      url            = {http://dh2016.adho.org/abstracts/250},
      year           = 2016
    }
  • [PDF] S. Eger, R. Gleim, and A. Mehler, “Lemmatization and Morphological Tagging in German and Latin: A comparison and a survey of the state-of-the-art,” in Proceedings of the 10th International Conference on Language Resources and Evaluation, 2016.
    [BibTeX]

    @InProceedings{Eger:Mehler:Gleim:2016,
      Author         = {Eger, Steffen and Gleim, R\"{u}diger and Mehler,
                       Alexander},
      Title          = {Lemmatization and Morphological Tagging in {German}
                       and {Latin}: A comparison and a survey of the
                       state-of-the-art},
      BookTitle      = {Proceedings of the 10th International Conference on
                       Language Resources and Evaluation},
      Series         = {LREC 2016},
      location       = {Portoro\v{z} (Slovenia)},
      pdf            = {http://www.texttechnologylab.org/wp-content/uploads/2016/04/lrec_eger_gleim_mehler.pdf},
      year           = 2016
    }
  • [DOI] A. Mehler, R. Gleim, T. vor der Brück, W. Hemati, T. Uslu, and S. Eger, “Wikidition: Automatic Lexiconization and Linkification of Text Corpora,” Information Technology, vol. 58, pp. 70-79, 2016.
    [Abstract] [BibTeX]

    We introduce a new text technology, called Wikidition, which automatically generates large scale editions of corpora of natural language texts. Wikidition combines a wide range of text mining tools for automatically linking lexical, sentential and textual units. This includes the extraction of corpus-specific lexica down to the level of syntactic words and their grammatical categories. To this end, we introduce a novel measure of text reuse and exemplify Wikidition by means of the capitularies, that is, a corpus of Medieval Latin texts.
    @Article{Mehler:et:al:2016,
      Author         = {Alexander Mehler and Rüdiger Gleim and Tim vor der
                       Brück and Wahed Hemati and Tolga Uslu and Steffen Eger},
      Title          = {Wikidition: Automatic Lexiconization and
                       Linkification of Text Corpora},
      Journal        = {Information Technology},
      Volume   = {58}, 
      Pages          = {70-79},
      abstract       = {We introduce a new text technology, called Wikidition,
    which automatically generates large scale editions of
    corpora of natural language texts. Wikidition combines
    a wide range of text mining tools for automatically
    linking lexical, sentential and textual units. This
    includes the extraction of corpus-specific lexica down
    to the level of syntactic words and their grammatical
    categories. To this end, we introduce a novel measure
    of text reuse and exemplify Wikidition by means of the
    capitularies, that is, a corpus of Medieval Latin
    texts.},
      doi            = {10.1515/itit-2015-0035},
      year           = 2016
    }

2015 (3)

  • A. Mehler and R. Gleim, “Linguistic Networks — An Online Platform for Deriving Collocation Networks from Natural Language Texts,” in Towards a Theoretical Framework for Analyzing Complex Linguistic Networks, A. Mehler, A. Lücking, S. Banisch, P. Blanchard, and B. Frank-Job, Eds., Springer, 2015.
    [BibTeX]

    @InCollection{Mehler:Gleim:2015:a,
      Author         = {Mehler, Alexander and Gleim, Rüdiger},
      Title          = {Linguistic Networks -- An Online Platform for Deriving
                       Collocation Networks from Natural Language Texts},
      BookTitle      = {Towards a Theoretical Framework for Analyzing Complex
                       Linguistic Networks},
      Publisher      = {Springer},
      Editor         = {Mehler, Alexander and Lücking, Andy and Banisch, Sven
                       and Blanchard, Philippe and Frank-Job, Barbara},
      Series         = {Understanding Complex Systems},
      year           = 2015
    }
  • A. Mehler, T. vor der Brück, R. Gleim, and T. Geelhaar, “Towards a Network Model of the Coreness of Texts: An Experiment in Classifying Latin Texts using the TTLab Latin Tagger,” in Text Mining: From Ontology Learning to Automated text Processing Applications, C. Biemann and A. Mehler, Eds., Berlin/New York: Springer, 2015, pp. 87-112.
    [Abstract] [BibTeX]

    The analysis of longitudinal corpora of historical texts requires the integrated development of tools for automatically preprocessing these texts and for building representation models of their genre- and register-related dynamics. In this chapter we present such a joint endeavor that ranges from resource formation via preprocessing to network-based text representation and classification. We start with presenting the so-called TTLab Latin Tagger (TLT) that preprocesses texts of classical and medieval Latin. Its lexical resource in the form of the Frankfurt Latin Lexicon (FLL) is also briefly introduced. As a first test case for showing the expressiveness of these resources, we perform a tripartite classification task of authorship attribution, genre detection and a combination thereof. To this end, we introduce a novel text representation model that explores the core structure (the so-called coreness) of lexical network representations of texts. Our experiment shows the expressiveness of this representation format and mediately of our Latin preprocessor.
    @InCollection{Mehler:Brueck:Gleim:Geelhaar:2015,
      Author         = {Mehler, Alexander and vor der Brück, Tim and Gleim,
                       Rüdiger and Geelhaar, Tim},
      Title          = {Towards a Network Model of the Coreness of Texts: An
                       Experiment in Classifying Latin Texts using the TTLab
                       Latin Tagger},
      BookTitle      = {Text Mining: From Ontology Learning to Automated text
                       Processing Applications},
      Publisher      = {Springer},
      Editor         = {Chris Biemann and Alexander Mehler},
      Series         = {Theory and Applications of Natural Language Processing},
      Pages          = {87-112},
      Address        = {Berlin/New York},
      abstract       = {The analysis of longitudinal corpora of historical
    texts requires the integrated development of tools for
    automatically preprocessing these texts and for
    building representation models of their genre- and
    register-related dynamics. In this chapter we present
    such a joint endeavor that ranges from resource
    formation via preprocessing to network-based text
    representation and classification. We start with
    presenting the so-called TTLab Latin Tagger (TLT) that
    preprocesses texts of classical and medieval Latin. Its
    lexical resource in the form of the Frankfurt Latin
    Lexicon (FLL) is also briefly introduced. As a first
    test case for showing the expressiveness of these
    resources, we perform a tripartite classification task
    of authorship attribution, genre detection and a
    combination thereof. To this end, we introduce a novel
    text representation model that explores the core
    structure (the so-called coreness) of lexical network
    representations of texts. Our experiment shows the
    expressiveness of this representation format and
    mediately of our Latin preprocessor.},
      website        = {http://link.springer.com/chapter/10.1007/978-3-319-12655-5_5},
      year           = 2015
    }
  • [PDF] R. Gleim and A. Mehler, “TTLab Preprocessor – Eine generische Web-Anwendung für die Vorverarbeitung von Texten und deren Evaluation,” in Proceedings of the Jahrestagung der Digital Humanities im deutschsprachigen Raum, 2015.
    [BibTeX]

    @InProceedings{Gleim:Mehler:2015,
      Author         = {Gleim, Rüdiger and Mehler, Alexander},
      Title          = {TTLab Preprocessor – Eine generische Web-Anwendung
                       für die Vorverarbeitung von Texten und deren
                       Evaluation},
      BookTitle      = {Accepted in the Proceedings of the Jahrestagung der
                       Digital Humanities im deutschsprachigen Raum},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/Gleim_Mehler_PrePro_DHGraz2015.pdf},
      year           = 2015
    }

2013 (1)

  • A. Mehler, C. Stegbauer, and R. Gleim, “Zur Struktur und Dynamik der kollaborativen Plagiatsdokumentation am Beispiel des GuttenPlag Wiki: eine Vorstudie,” in Die Dynamik sozialer und sprachlicher Netzwerke. Konzepte, Methoden und empirische Untersuchungen am Beispiel des WWW, B. Frank-Job, A. Mehler, and T. Sutter, Eds., Wiesbaden: VS Verlag, 2013.
    [BibTeX]

    @InCollection{Mehler:Stegbauer:Gleim:2013,
      Author         = {Mehler, Alexander and Stegbauer, Christian and Gleim,
                       Rüdiger},
      Title          = {Zur Struktur und Dynamik der kollaborativen
                       Plagiatsdokumentation am Beispiel des GuttenPlag Wiki:
                       eine Vorstudie},
      BookTitle      = {Die Dynamik sozialer und sprachlicher Netzwerke.
                       Konzepte, Methoden und empirische Untersuchungen am
                       Beispiel des WWW},
      Publisher      = {VS Verlag},
      Editor         = {Frank-Job, Barbara and Mehler, Alexander and Sutter,
                       Tilman},
      Address        = {Wiesbaden},
      year           = 2013
    }

2012 (3)

  • [PDF] A. Mehler, C. Stegbauer, and R. Gleim, “Latent Barriers in Wiki-based Collaborative Writing,” in Proceedings of the Wikipedia Academy: Research and Free Knowledge. June 29 – July 1 2012, Berlin, 2012.
    [BibTeX]

    @InProceedings{Mehler:Stegbauer:Gleim:2012:b,
      Author         = {Mehler, Alexander and Stegbauer, Christian and Gleim,
                       Rüdiger},
      Title          = {Latent Barriers in Wiki-based Collaborative Writing},
      BookTitle      = {Proceedings of the Wikipedia Academy: Research and
                       Free Knowledge. June 29 - July 1 2012},
      Address        = {Berlin},
      month          = {July},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/12_Paper_Alexander_Mehler_Christian_Stegbauer_Ruediger_Gleim.pdf},
      year           = 2012
    }
  • [PDF] R. Gleim, A. Mehler, and A. Ernst, “SOA implementation of the eHumanities Desktop,” in Proceedings of the Workshop on Service-oriented Architectures (SOAs) for the Humanities: Solutions and Impacts, Digital Humanities 2012, Hamburg, Germany, 2012.
    [Abstract] [BibTeX]

    The eHumanities Desktop is a system which allows users to upload, organize and share resources using a web interface. Furthermore resources can be processed, annotated and analyzed in various ways. Registered users can organize themselves in groups and collaboratively work on their data. The eHumanities Desktop is platform independent and runs in a web browser. This paper presents the system focusing on its service orientation and process management.
    @InProceedings{Gleim:Mehler:Ernst:2012,
      Author         = {Gleim, Rüdiger and Mehler, Alexander and Ernst,
                       Alexandra},
      Title          = {SOA implementation of the eHumanities Desktop},
      BookTitle      = {Proceedings of the Workshop on Service-oriented
                       Architectures (SOAs) for the Humanities: Solutions and
                       Impacts, Digital Humanities 2012, Hamburg, Germany},
      abstract       = {The eHumanities Desktop is a system which allows users
                       to upload, organize and share resources using a web
                       interface. Furthermore resources can be processed,
                       annotated and analyzed in various ways. Registered
                       users can organize themselves in groups and
                       collaboratively work on their data. The eHumanities
                       Desktop is platform independent and runs in a web
                       browser. This paper presents the system focusing on its
                       service orientation and process management.},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/dhc2012.pdf},
      year           = 2012
    }
  • A. Mehler, S. Schwandt, R. Gleim, and A. Ernst, “Inducing Linguistic Networks from Historical Corpora: Towards a New Method in Historical Semantics,” in Proceedings of the Conference on New Methods in Historical Corpora, P. Bennett, M. Durrell, S. Scheible, and R. J. Whitt, Eds., Tübingen: Narr, 2012, vol. 3, pp. 257-274.
    [BibTeX]

    @InCollection{Mehler:Schwandt:Gleim:Ernst:2012,
      Author         = {Mehler, Alexander and Schwandt, Silke and Gleim,
                       Rüdiger and Ernst, Alexandra},
      Title          = {Inducing Linguistic Networks from Historical Corpora:
                       Towards a New Method in Historical Semantics},
      BookTitle      = {Proceedings of the Conference on New Methods in
                       Historical Corpora},
      Publisher      = {Narr},
      Editor         = {Paul Bennett and Martin Durrell and Silke Scheible and
                       Richard J. Whitt},
      Volume         = {3},
      Series         = {Corpus linguistics and Interdisciplinary perspectives
                       on language (CLIP)},
      Pages          = {257--274},
      Address        = {Tübingen},
      year           = 2012
    }

2011 (3)

  • [PDF] A. Mehler, N. Diewald, U. Waltinger, R. Gleim, D. Esch, B. Job, T. Küchelmann, O. Abramov, and P. Blanchard, “Evolution of Romance Language in Written Communication: Network Analysis of Late Latin and Early Romance Corpora,” Leonardo, vol. 44, iss. 3, 2011.
    [BibTeX]

    @Article{Mehler:Diewald:Waltinger:et:al:2010,
      Author         = {Mehler, Alexander and Diewald, Nils and Waltinger,
                       Ulli and Gleim, Rüdiger and Esch, Dietmar and Job,
                       Barbara and Küchelmann, Thomas and Abramov, Olga and
                       Blanchard, Philippe},
      Title          = {Evolution of Romance Language in Written
                       Communication: Network Analysis of Late Latin and Early
                       Romance Corpora},
      Journal        = {Leonardo},
      Volume         = {44},
      Number         = {3},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/mehler_diewald_waltinger_gleim_esch_job_kuechelmann_pustylnikov_blanchard_2010.pdf},
      publisher      = {MIT Press},
      year           = 2011
    }
  • [PDF] R. Gleim, A. Hoenen, N. Diewald, A. Mehler, and A. Ernst, “Modeling, Building and Maintaining Lexica for Corpus Linguistic Studies by Example of Late Latin,” in Corpus Linguistics 2011, 20-22 July, Birmingham, 2011.
    [BibTeX]

    @InProceedings{Gleim:Hoenen:Diewald:Mehler:Ernst:2011,
      Author         = {Gleim, Rüdiger and Hoenen, Armin and Diewald, Nils
                       and Mehler, Alexander and Ernst, Alexandra},
      Title          = {Modeling, Building and Maintaining Lexica for Corpus
                       Linguistic Studies by Example of Late Latin},
      BookTitle      = {Corpus Linguistics 2011, 20-22 July, Birmingham},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/Paper-48.pdf},
      year           = 2011
    }
  • [PDF] A. Mehler, S. Schwandt, R. Gleim, and B. Jussen, “Der eHumanities Desktop als Werkzeug in der historischen Semantik: Funktionsspektrum und Einsatzszenarien,” Journal for Language Technology and Computational Linguistics (JLCL), vol. 26, iss. 1, pp. 97-117, 2011.
    [Abstract] [BibTeX]

    Die Digital Humanities bzw. die Computational Humanities entwickeln sich zu eigenständigen Disziplinen an der Nahtstelle von Geisteswissenschaft und Informatik. Diese Entwicklung betrifft zunehmend auch die Lehre im Bereich der geisteswissenschaftlichen Fachinformatik. In diesem Beitrag thematisieren wir den eHumanities Desktop als ein Werkzeug für diesen Bereich der Lehre. Dabei geht es genauer um einen Brückenschlag zwischen Geschichtswissenschaft und Informatik: Am Beispiel der historischen Semantik stellen wir drei Lehrszenarien vor, in denen der eHumanities Desktop in der geschichtswissenschaftlichen Lehre zum Einsatz kommt. Der Beitrag schliesst mit einer Anforderungsanalyse an zukünftige Entwicklungen in diesem Bereich.
    @Article{Mehler:Schwandt:Gleim:Jussen:2011,
      Author         = {Mehler, Alexander and Schwandt, Silke and Gleim,
                       Rüdiger and Jussen, Bernhard},
      Title          = {Der eHumanities Desktop als Werkzeug in der
                       historischen Semantik: Funktionsspektrum und
                       Einsatzszenarien},
      Journal        = {Journal for Language Technology and Computational
                       Linguistics (JLCL)},
      Volume         = {26},
      Number         = {1},
      Pages          = {97-117},
      abstract       = {Die Digital Humanities bzw. die Computational
                       Humanities entwickeln sich zu eigenst{\"a}ndigen
                       Disziplinen an der Nahtstelle von Geisteswissenschaft
                       und Informatik. Diese Entwicklung betrifft zunehmend
                       auch die Lehre im Bereich der geisteswissenschaftlichen
                       Fachinformatik. In diesem Beitrag thematisieren wir den
                       eHumanities Desktop als ein Werkzeug für diesen
                       Bereich der Lehre. Dabei geht es genauer um einen
                       Brückenschlag zwischen Geschichtswissenschaft und
                       Informatik: Am Beispiel der historischen Semantik
                       stellen wir drei Lehrszenarien vor, in denen der
                       eHumanities Desktop in der geschichtswissenschaftlichen
                       Lehre zum Einsatz kommt. Der Beitrag schliesst mit
                       einer Anforderungsanalyse an zukünftige Entwicklungen
                       in diesem Bereich.},
      pdf            = {http://media.dwds.de/jlcl/2011_Heft1/8.pdf},
      year           = 2011
    }

2010 (3)

  • [PDF] R. Gleim and A. Mehler, “Computational Linguistics for Mere Mortals – Powerful but Easy-to-use Linguistic Processing for Scientists in the Humanities,” in Proceedings of LREC 2010, Malta, 2010.
    [Abstract] [BibTeX]

    Delivering linguistic resources and easy-to-use methods to a broad public in the humanities is a challenging task. On the one hand users rightly demand easy to use interfaces but on the other hand want to have access to the full flexibility and power of the functions being offered. Even though a growing number of excellent systems exist which offer convenient means to use linguistic resources and methods, they usually focus on a specific domain, as for example corpus exploration or text categorization. Architectures which address a broad scope of applications are still rare. This article introduces the eHumanities Desktop, an online system for corpus management, processing and analysis which aims at bridging the gap between powerful command line tools and intuitive user interfaces.
    @InProceedings{Gleim:Mehler:2010:b,
      Author         = {Gleim, Rüdiger and Mehler, Alexander},
      Title          = {Computational Linguistics for Mere Mortals –
                       Powerful but Easy-to-use Linguistic Processing for
                       Scientists in the Humanities},
      BookTitle      = {Proceedings of LREC 2010},
      Address        = {Malta},
      Publisher      = {ELDA},
      abstract       = {Delivering linguistic resources and easy-to-use
                       methods to a broad public in the humanities is a
                       challenging task. On the one hand users rightly demand
                       easy to use interfaces but on the other hand want to
                       have access to the full flexibility and power of the
                       functions being offered. Even though a growing number
                       of excellent systems exist which offer convenient means
                       to use linguistic resources and methods, they usually
                       focus on a specific domain, as for example corpus
                       exploration or text categorization. Architectures which
                       address a broad scope of applications are still rare.
                       This article introduces the eHumanities Desktop, an
                       online system for corpus management, processing and
                       analysis which aims at bridging the gap between
                       powerful command line tools and intuitive user
                       interfaces. },
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_mehler_2010.pdf},
      year           = 2010
    }
  • [PDF] A. Mehler, R. Gleim, U. Waltinger, and N. Diewald, “Time Series of Linguistic Networks by Example of the Patrologia Latina,” in Proceedings of INFORMATIK 2010: Service Science, September 27 – October 01, 2010, Leipzig, 2010, pp. 609-616.
    [BibTeX]

    @InProceedings{Mehler:Gleim:Waltinger:Diewald:2010,
      Author         = {Mehler, Alexander and Gleim, Rüdiger and Waltinger,
                       Ulli and Diewald, Nils},
      Title          = {Time Series of Linguistic Networks by Example of the
                       Patrologia Latina},
      BookTitle      = {Proceedings of INFORMATIK 2010: Service Science,
                       September 27 - October 01, 2010, Leipzig},
      Editor         = {F{\"a}hnrich, Klaus-Peter and Franczyk, Bogdan},
      Volume         = {2},
      Series         = {Lecture Notes in Informatics},
      Pages          = {609-616},
      Publisher      = {GI},
      pdf            = {http://subs.emis.de/LNI/Proceedings/Proceedings176/586.pdf},
      year           = 2010
    }
  • [PDF] R. Gleim, P. Warner, and A. Mehler, “eHumanities Desktop – An Architecture for Flexible Annotation in Iconographic Research,” in Proceedings of the 6th International Conference on Web Information Systems and Technologies (WEBIST ’10), April 7-10, 2010, Valencia, 2010.
    [BibTeX]

    @InProceedings{Gleim:Warner:Mehler:2010,
      Author         = {Gleim, Rüdiger and Warner, Paul and Mehler, Alexander},
      Title          = {eHumanities Desktop - An Architecture for Flexible
                       Annotation in Iconographic Research},
      BookTitle      = {Proceedings of the 6th International Conference on Web
                       Information Systems and Technologies (WEBIST '10),
                       April 7-10, 2010, Valencia},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_warner_mehler_2010.pdf},
      website        = {https://www.researchgate.net/publication/220724277_eHumanities_Desktop_-_An_Architecture_for_Flexible_Annotation_in_Iconographic_Research},
      year           = 2010
    }

2009 (4)

  • [PDF] A. Mehler, R. Gleim, U. Waltinger, A. Ernst, D. Esch, and T. Feith, “eHumanities Desktop – eine webbasierte Arbeitsumgebung für die geisteswissenschaftliche Fachinformatik,” in Proceedings of the Symposium “Sprachtechnologie und eHumanities”, 26.–27. Februar, Duisburg-Essen University, 2009.
    [BibTeX]

    @InProceedings{Mehler:Gleim:Waltinger:Ernst:Esch:Feith:2009,
      Author         = {Mehler, Alexander and Gleim, Rüdiger and Waltinger,
                       Ulli and Ernst, Alexandra and Esch, Dietmar and Feith,
                       Tobias},
      Title          = {eHumanities Desktop – eine webbasierte
                       Arbeitsumgebung für die geisteswissenschaftliche
                       Fachinformatik},
      BookTitle      = {Proceedings of the Symposium "Sprachtechnologie und
                       eHumanities", 26.–27. Februar, Duisburg-Essen
                       University},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/mehler_gleim_waltinger_ernst_esch_feith_2009.pdf},
      website        = {http://duepublico.uni-duisburg-essen.de/servlets/DocumentServlet?id=37041},
      year           = 2009
    }
  • [PDF] R. Gleim, A. Mehler, U. Waltinger, and P. Menke, “eHumanities Desktop – An extensible Online System for Corpus Management and Analysis,” in 5th Corpus Linguistics Conference, University of Liverpool, 2009.
    [Abstract] [BibTeX]

    This paper presents the eHumanities Desktop - an online system for corpus management and analysis in support of computing in the humanities. Design issues and the overall architecture are described, as well as an outline of the applications offered by the system.
    @InProceedings{Gleim:Mehler:Waltinger:Menke:2009,
      Author         = {Gleim, Rüdiger and Mehler, Alexander and Waltinger,
                       Ulli and Menke, Peter},
      Title          = {eHumanities Desktop – An extensible Online System
                       for Corpus Management and Analysis},
      BookTitle      = {5th Corpus Linguistics Conference, University of
                       Liverpool},
      abstract       = {This paper presents the eHumanities Desktop - an
                       online system for corpus management and analysis in
                       support of computing in the humanities. Design issues
                       and the overall architecture are described, as well as
                       an outline of the applications offered by the system.},
      pdf            = {http://www.ulliwaltinger.de/pdf/eHumanitiesDesktop-AnExtensibleOnlineSystem-CL2009.pdf},
      website        = {http://www.ulliwaltinger.de/ehumanities-desktop-an-extensible-online-system-for-corpus-management-and-analysis/},
      year           = 2009
    }
  • [PDF] R. Gleim, U. Waltinger, A. Ernst, A. Mehler, D. Esch, and T. Feith, “The eHumanities Desktop – An Online System for Corpus Management and Analysis in Support of Computing in the Humanities,” in Proceedings of the Demonstrations Session of the 12th Conference of the European Chapter of the Association for Computational Linguistics EACL 2009, 30 March – 3 April, Athens, 2009.
    [BibTeX]

    @InProceedings{Gleim:Waltinger:Ernst:Mehler:Esch:Feith:2009,
      Author         = {Gleim, Rüdiger and Waltinger, Ulli and Ernst,
                       Alexandra and Mehler, Alexander and Esch, Dietmar and
                       Feith, Tobias},
      Title          = {The eHumanities Desktop – An Online System for
                       Corpus Management and Analysis in Support of Computing
                       in the Humanities},
      BookTitle      = {Proceedings of the Demonstrations Session of the 12th
                       Conference of the European Chapter of the Association
                       for Computational Linguistics EACL 2009, 30 March – 3
                       April, Athens},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_waltinger_ernst_mehler_esch_feith_2009.pdf},
      year           = 2009
    }
  • [PDF] U. Waltinger, A. Mehler, and R. Gleim, “Social Semantics And Its Evaluation By Means of Closed Topic Models: An SVM-Classification Approach Using Semantic Feature Replacement By Topic Generalization,” in Proceedings of the Biennial GSCL Conference 2009, September 30 – October 2, Universität Potsdam, 2009.
    [BibTeX]

    @InProceedings{Waltinger:Mehler:Gleim:2009:a,
      Author         = {Waltinger, Ulli and Mehler, Alexander and Gleim,
                       Rüdiger},
      Title          = {Social Semantics And Its Evaluation By Means of Closed
                       Topic Models: An SVM-Classification Approach Using
                       Semantic Feature Replacement By Topic Generalization},
      BookTitle      = {Proceedings of the Biennial GSCL Conference 2009,
                       September 30 – October 2, Universit{\"a}t Potsdam},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/GSCL_2009_WaltingerMehlerGleim_camera_ready.pdf},
      year           = 2009
    }

2008 (3)

  • [PDF] O. Abramov, A. Mehler, and R. Gleim, “A Unified Database of Dependency Treebanks. Integrating, Quantifying and Evaluating Dependency Data,” in Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco), 2008.
    [Abstract] [BibTeX]

    This paper describes a database of 11 dependency treebanks which were unified by means of a two-dimensional graph format. The format was evaluated with respect to storage-complexity on the one hand, and efficiency of data access on the other hand. An example of how the treebanks can be integrated within a unique interface is given by means of the DTDB interface.
    @InProceedings{Pustylnikov:Mehler:Gleim:2008,
      Author         = {Abramov, Olga and Mehler, Alexander and Gleim,
                       Rüdiger},
      Title          = {A Unified Database of Dependency Treebanks.
                       Integrating, Quantifying and Evaluating Dependency Data},
      BookTitle      = {Proceedings of the 6th Language Resources and
                       Evaluation Conference (LREC 2008), Marrakech (Morocco)},
      abstract       = {This paper describes a database of 11 dependency
                       treebanks which were unified by means of a
                       two-dimensional graph format. The format was evaluated
                       with respect to storage-complexity on the one hand, and
                       efficiency of data access on the other hand. An example
                       of how the treebanks can be integrated within a unique
                       interface is given by means of the DTDB interface. },
      pdf            = {http://wwwhomes.uni-bielefeld.de/opustylnikov/pustylnikov/pdfs/LREC08_full.pdf},
      year           = 2008
    }
  • [PDF] A. Mehler, R. Gleim, A. Ernst, and U. Waltinger, “WikiDB: Building Interoperable Wiki-Based Knowledge Resources for Semantic Databases,” Sprache und Datenverarbeitung. International Journal for Language Data Processing, vol. 32, iss. 1, pp. 47-70, 2008.
    [Abstract] [BibTeX]

    This article describes an API for exploring the logical document and the logical network structure of wikis. It introduces an algorithm for the semantic preprocessing, filtering and typing of these building blocks. Further, this article models the process of wiki generation based on a unified format of syntactic, semantic and pragmatic representations. This three-level approach to make accessible syntactic, semantic and pragmatic aspects of wiki-based structure formation is complemented by a corresponding database model – called WikiDB – and an API operating thereon. Finally, the article provides an empirical study of using the three-fold representation format in conjunction with WikiDB.
    @Article{Mehler:Gleim:Ernst:Waltinger:2008,
      Author         = {Mehler, Alexander and Gleim, Rüdiger and Ernst,
                       Alexandra and Waltinger, Ulli},
      Title          = {WikiDB: Building Interoperable Wiki-Based Knowledge
                       Resources for Semantic Databases},
      Journal        = {Sprache und Datenverarbeitung. International Journal
                       for Language Data Processing},
      Volume         = {32},
      Number         = {1},
      Pages          = {47-70},
      abstract       = {This article describes an API for exploring the
                       logical document and the logical network structure of
                       wikis. It introduces an algorithm for the semantic
                       preprocessing, filtering and typing of these building
                       blocks. Further, this article models the process of
                       wiki generation based on a unified format of syntactic,
                       semantic and pragmatic representations. This
                       three-level approach to make accessible syntactic,
                       semantic and pragmatic aspects of wiki-based structure
                       formation is complemented by a corresponding database
                       model – called WikiDB – and an API operating
                       thereon. Finally, the article provides an empirical
                       study of using the three-fold representation format in
                       conjunction with WikiDB.},
      pdf            = {http://www.ulliwaltinger.de/pdf/Konvens_2008_WikiDB_Building_Semantic_Databases_MehlerGleimErnstWaltinger.pdf},
      year           = 2008
    }
  • [PDF] G. Rehm, M. Santini, A. Mehler, P. Braslavski, R. Gleim, A. Stubbe, S. Symonenko, M. Tavosanis, and V. Vidulin, “Towards a Reference Corpus of Web Genres for the Evaluation of Genre Identification Systems,” in Proceedings of the 6th Language Resources and Evaluation Conference (LREC 2008), Marrakech (Morocco), 2008.
    [Abstract] [BibTeX]

    We present initial results from an international and multi-disciplinary research collaboration that aims at the construction of a reference corpus of web genres. The primary application scenario for which we plan to build this resource is the automatic identification of web genres. Web genres are rather difficult to capture and to describe in their entirety, but we plan for the finished reference corpus to contain multi-level tags of the respective genre or genres a web document or a website instantiates. As the construction of such a corpus is by no means a trivial task, we discuss several alternatives that are, for the time being, mostly based on existing collections. Furthermore, we discuss a shared set of genre categories and a multi-purpose tool as two additional prerequisites for a reference corpus of web genres.
    @InProceedings{Rehm:Santini:Mehler:Braslavski:Gleim:Stubbe:Symonenko:Tavosanis:Vidulin:2008,
      Author         = {Rehm, Georg and Santini, Marina and Mehler, Alexander
                       and Braslavski, Pavel and Gleim, Rüdiger and Stubbe,
                       Andrea and Symonenko, Svetlana and Tavosanis, Mirko and
                       Vidulin, Vedrana},
      Title          = {Towards a Reference Corpus of Web Genres for the
                       Evaluation of Genre Identification Systems},
      BookTitle      = {Proceedings of the 6th Language Resources and
                       Evaluation Conference (LREC 2008), Marrakech (Morocco)},
      abstract       = {We present initial results from an international and
                       multi-disciplinary research collaboration that aims at
                       the construction of a reference corpus of web genres.
                       The primary application scenario for which we plan to
                       build this resource is the automatic identification of
                       web genres. Web genres are rather difficult to capture
                       and to describe in their entirety, but we plan for the
                       finished reference corpus to contain multi-level tags
                       of the respective genre or genres a web document or a
                       website instantiates. As the construction of such a
                       corpus is by no means a trivial task, we discuss
                       several alternatives that are, for the time being,
                       mostly based on existing collections. Furthermore, we
                       discuss a shared set of genre categories and a
                       multi-purpose tool as two additional prerequisites for
                       a reference corpus of web genres. },
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/rehm_santini_mehler_braslavski_gleim_stubbe_symonenko_tavosanis_vidulin_2008.pdf},
      website        = {http://www.lrec-conf.org/proceedings/lrec2008/summaries/94.html},
      year           = 2008
    }

2007 (5)

  • [PDF] R. Gleim, A. Mehler, M. Dehmer, and O. Abramov, “Aisles through the Category Forest – Utilising the Wikipedia Category System for Corpus Building in Machine Learning,” in 3rd International Conference on Web Information Systems and Technologies (WEBIST ’07), March 3-6, 2007, Barcelona, 2007, pp. 142-149.
    [Abstract] [BibTeX]

    The World Wide Web is a continuous challenge to machine learning. Established approaches have to be enhanced and new methods be developed in order to tackle the problem of finding and organising relevant information. It has often been motivated that semantic classifications of input documents help in solving this task. But while approaches of supervised text categorisation perform quite well on genres found in written text, newly evolved genres on the web are much more demanding. In order to successfully develop approaches to web mining, respective corpora are needed. However, the composition of genre- or domain-specific web corpora is still an unsolved problem. It is time-consuming to build large corpora of good quality because web pages typically lack reliable meta information. Wikipedia along with similar approaches of collaborative text production offers a way out of this dilemma. We examine how social tagging, as supported by the MediaWiki software, can be utilised as a source for corpus building. Further, we describe a representation format for social ontologies and present the Wikipedia Category Explorer, a tool which supports categorical views to browse through the Wikipedia and to construct domain-specific corpora for machine learning.
    @InProceedings{Gleim:Mehler:Dehmer:Abramov:2007,
      Author         = {Gleim, Rüdiger and Mehler, Alexander and Dehmer,
                       Matthias and Abramov, Olga},
      Title          = {Aisles through the Category Forest – Utilising the
                       Wikipedia Category System for Corpus Building in
                       Machine Learning},
      BookTitle      = {3rd International Conference on Web Information
                       Systems and Technologies (WEBIST '07), March 3-6, 2007,
                       Barcelona},
      Editor         = {Filipe, Joaquim and Cordeiro, José and Encarnação,
                       Bruno and Pedrosa, Vitor},
      Pages          = {142-149},
      Address        = {Barcelona},
      abstract       = {The World Wide Web is a continuous challenge to machine
                       learning. Established approaches have to be enhanced
                       and new methods be developed in order to tackle the
                       problem of finding and organising relevant information.
                       It has often been motivated that semantic
                        classifications of input documents help in solving this
                       task. But while approaches of supervised text
                       categorisation perform quite well on genres found in
                       written text, newly evolved genres on the web are much
                       more demanding. In order to successfully develop
                       approaches to web mining, respective corpora are
                       needed. However, the composition of genre- or
                       domain-specific web corpora is still an unsolved
                        problem. It is time-consuming to build large corpora of
                       good quality because web pages typically lack reliable
                       meta information. Wikipedia along with similar
                       approaches of collaborative text production offers a
                       way out of this dilemma. We examine how social tagging,
                       as supported by the MediaWiki software, can be utilised
                        as a source for corpus building. Further, we describe a
                       representation format for social ontologies and present
                       the Wikipedia Category Explorer, a tool which supports
                       categorical views to browse through the Wikipedia and
                        to construct domain-specific corpora for machine
                       learning.},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2016/10/webist_2007-gleim_mehler_dehmer_pustylnikov.pdf},
      year           = 2007
    }
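
    The corpus-building idea of this entry – start from a seed category and collect the pages reachable through its subcategories – amounts to a traversal of the category graph. Below is a minimal Python sketch over a toy graph; the data and function names are invented, and this is not the Wikipedia Category Explorer itself, which in practice would read such edges from a MediaWiki database dump.

    from collections import deque

    # Toy category graph: category -> (subcategories, member pages).
    CATEGORY_GRAPH = {
        "Linguistics": (["Corpus linguistics"], ["Language"]),
        "Corpus linguistics": ([], ["Text corpus", "Treebank"]),
    }

    def collect_pages(seed, max_depth=2):
        """Breadth-first traversal of the category graph, gathering member pages."""
        pages, seen = set(), {seed}
        queue = deque([(seed, 0)])
        while queue:
            category, depth = queue.popleft()
            subcats, members = CATEGORY_GRAPH.get(category, ([], []))
            pages.update(members)
            if depth < max_depth:
                for sub in subcats:
                    if sub not in seen:
                        seen.add(sub)
                        queue.append((sub, depth + 1))
        return pages

    print(collect_pages("Linguistics"))  # e.g. {'Language', 'Text corpus', 'Treebank'}
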
  • [PDF] A. Mehler, R. Gleim, and A. Wegner, “Structural Uncertainty of Hypertext Types. An Empirical Study,” in Proceedings of the Workshop “Towards Genre-Enabled Search Engines: The Impact of NLP”, September, 30, 2007, in conjunction with RANLP 2007, Borovets, Bulgaria, 2007, pp. 13-19.
    [BibTeX]

    @InProceedings{Mehler:Gleim:Wegner:2007,
      Author         = {Mehler, Alexander and Gleim, Rüdiger and Wegner,
                       Armin},
      Title          = {Structural Uncertainty of Hypertext Types. An
                       Empirical Study},
      BookTitle      = {Proceedings of the Workshop "Towards Genre-Enabled
                       Search Engines: The Impact of NLP", September, 30,
                       2007, in conjunction with RANLP 2007, Borovets,
                       Bulgaria},
      Editor         = {Rehm, Georg and Santini, Marina},
      Pages          = {13-19},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/RANLP.pdf},
      year           = 2007
    }
  • [PDF] A. Mehler, P. Geibel, R. Gleim, S. Herold, B. Jain, and O. Abramov, “Much Ado About Text Content. Learning Text Types Solely by Structural Differentiae,” in Proceedings of OTT ’06 – Ontologies in Text Technology: Approaches to Extract Semantic Knowledge from Structured Information, Osnabrück, 2007, pp. 63-71.
    [Abstract] [BibTeX]

    In this paper, we deal with classifying texts into classes which denote text types whose textual instances serve more or less homogeneous functions. Unlike mainstream approaches to text classification, which rely on the vector space model [30] or some of its descendants [2] and, thus, on content-related lexical features, we solely refer to structural differentiae, that is, to patterns of text structure as determinants of class membership. Further, we suppose that text types span a type hierarchy based on the type-subtype relation [31]. Thus, although we admit that class membership is fuzzy so that overlapping classes are inevitable, we suppose a non-overlapping type system structured into a rooted tree – whether solely based on functional or additionally on, e.g., content- or media-based criteria [1]. As regards criteria of goodness of classification, we perform a classical supervised categorization experiment [30] based on cross-validation as a method of model selection [11]. That is, we perform a categorization experiment in which for all training and test cases class membership is known ex ante. In summary, we perform a supervised experiment of text classification in order to learn functionally grounded text types where membership to these types is solely based on structural criteria.
    @InProceedings{Mehler:Geibel:Gleim:Herold:Jain:Pustylnikov:2007,
      Author         = {Mehler, Alexander and Geibel, Peter and Gleim,
                       Rüdiger and Herold, Sebastian and Jain,
                       Brijnesh-Johannes and Abramov, Olga},
      Title          = {Much Ado About Text Content. Learning Text Types
                       Solely by Structural Differentiae},
      BookTitle      = {Proceedings of OTT '06 – Ontologies in Text
                       Technology: Approaches to Extract Semantic Knowledge
                       from Structured Information},
      Editor         = {Mönnich, Uwe and Kühnberger, Kai-Uwe},
      Series         = {Publications of the Institute of Cognitive Science
                       (PICS)},
      Pages          = {63-71},
      Address        = {Osnabrück},
      abstract       = {In this paper, we deal with classifying texts into
                       classes which denote text types whose textual instances
                        serve more or less homogeneous functions. Unlike
                       mainstream approaches to text classification, which
                       rely on the vector space model [30] or some of its
                       descendants [2] and, thus, on content-related lexical
                       features, we solely refer to structural differentiae,
                       that is, to patterns of text structure as determinants
                       of class membership. Further, we suppose that text
                       types span a type hierarchy based on the type-subtype
                       relation [31]. Thus, although we admit that class
                       membership is fuzzy so that overlapping classes are
                       inevitable, we suppose a non-overlapping type system
                       structured into a rooted tree – whether solely based
                        on functional or additionally on, e.g., content- or
                        media-based criteria [1]. As regards criteria of
                       goodness of classification, we perform a classical
                       supervised categorization experiment [30] based on
                       cross-validation as a method of model selection [11].
                       That is, we perform a categorization experiment in
                       which for all training and test cases class membership
                       is known ex ante. In summary, we perform a supervised
                       experiment of text classification in order to learn
                       functionally grounded text types where membership to
                       these types is solely based on structural criteria.},
      pdf            = {http://ikw.uni-osnabrueck.de/~ott06/ott06-abstracts/Mehler_Geibel_abstract.pdf},
      year           = 2007
    }
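
    The experimental setup summarized above – a supervised categorization experiment on purely structural features, with cross-validation for model selection – corresponds to a standard pipeline. A minimal sketch with scikit-learn follows; the feature vectors (section count, tree depth, link count) and the class labels are invented toy data, not the paper's structural differentiae.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Invented structural features per document: [number of sections,
    # maximum depth of the logical document tree, number of links];
    # y holds toy text-type labels.
    X = np.array([[3, 2, 5], [4, 2, 7], [12, 5, 40], [10, 4, 35],
                  [2, 1, 1], [3, 1, 2], [11, 5, 38], [13, 6, 44]])
    y = np.array([0, 0, 1, 1, 0, 0, 1, 1])

    clf = DecisionTreeClassifier(random_state=0)
    scores = cross_val_score(clf, X, y, cv=4)  # 4-fold cross-validation
    print("mean accuracy:", scores.mean())
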
  • [PDF] R. Gleim, A. Mehler, and H. Eikmeyer, “Representing and Maintaining Large Corpora,” in Proceedings of the Corpus Linguistics 2007 Conference, Birmingham (UK), 2007.
    [BibTeX]

    @InProceedings{Gleim:Mehler:Eikmeyer:2007:a,
      Author         = {Gleim, Rüdiger and Mehler, Alexander and Eikmeyer,
                       Hans-Jürgen},
      Title          = {Representing and Maintaining Large Corpora},
      BookTitle      = {Proceedings of the Corpus Linguistics 2007 Conference,
                       Birmingham (UK)},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_mehler_eikmeyer_2007_a.pdf},
      year           = 2007
    }
  • [PDF] R. Gleim, A. Mehler, H. Eikmeyer, and H. Rieser, “Ein Ansatz zur Repräsentation und Verarbeitung großer Korpora multimodaler Daten,” in Data Structures for Linguistic Resources and Applications. Proceedings of the Biennial GLDV Conference 2007, 11.–13. April, Universität Tübingen, Tübingen, 2007, pp. 275-284.
    [BibTeX]

    @InProceedings{Gleim:Mehler:Eikmeyer:Rieser:2007,
      Author         = {Gleim, Rüdiger and Mehler, Alexander and Eikmeyer,
                       Hans-Jürgen and Rieser, Hannes},
      Title          = {Ein Ansatz zur Repr{\"a}sentation und Verarbeitung
                       gro{\ss}er Korpora multimodaler Daten},
      BookTitle      = {Data Structures for Linguistic Resources and
                       Applications. Proceedings of the Biennial GLDV
                       Conference 2007, 11.–13. April, Universit{\"a}t
                       Tübingen},
      Editor         = {Rehm, Georg and Witt, Andreas and Lemnitzer, Lothar},
      Pages          = {275-284},
      Address        = {Tübingen},
      Publisher      = {Narr},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2015/08/gleim_mehler_eikmeyer_rieser_2007.pdf},
      year           = 2007
    }

2006 (5)

  • A. Mehler, R. Gleim, and M. Dehmer, “Towards Structure-Sensitive Hypertext Categorization,” in Proceedings of the 29th Annual Conference of the German Classification Society, March 9-11, 2005, Universität Magdeburg, Berlin/New York, 2006, pp. 406-413.
    [Abstract] [BibTeX]

    Hypertext categorization is the task of automatically assigning category labels to hypertext units. Comparable to text categorization it stays in the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. In order to outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.
    @InProceedings{Mehler:Gleim:Dehmer:2006,
      Author         = {Mehler, Alexander and Gleim, Rüdiger and Dehmer,
                       Matthias},
      Title          = {Towards Structure-Sensitive Hypertext Categorization},
      BookTitle      = {Proceedings of the 29th Annual Conference of the
                       German Classification Society, March 9-11, 2005,
                       Universit{\"a}t Magdeburg},
      Editor         = {Spiliopoulou, Myra and Kruse, Rudolf and Borgelt,
                       Christian and Nürnberger, Andreas and Gaul, Wolfgang},
      Pages          = {406-413},
      Address        = {Berlin/New York},
      Publisher      = {Springer},
      abstract       = {Hypertext categorization is the task of automatically
                       assigning category labels to hypertext units.
                       Comparable to text categorization it stays in the area
                       of function learning based on the bag-of-features
                       approach. This scenario faces the problem of a
                       many-to-many relation between websites and their hidden
                       logical document structure. The paper argues that this
                       relation is a prevalent characteristic which interferes
                        with any effort to apply the classical apparatus of
                       categorization to web genres. This is confirmed by a
                       threefold experiment in hypertext categorization. In
                       order to outline a solution to this problem, the paper
                       sketches an alternative method of unsupervised learning
                       which aims at bridging the gap between statistical and
                       structural pattern recognition (Bunke et al. 2001) in
                       the area of web mining.},
      website        = {http://www.springerlink.com/content/l7665tm3u241317l/},
      year           = 2006
    }
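
    The alternative sketched at the end of this abstract – unsupervised learning over structural rather than lexical representations – can be illustrated by clustering pages according to a structural distance. A toy SciPy sketch under that reading; the structural profiles are invented, and a faithful implementation would use a proper graph or tree distance rather than Euclidean distance over profile vectors.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # Invented structural profiles (depth, fan-out, link count) of six pages.
    profiles = np.array([[2, 3, 4], [2, 2, 5], [6, 10, 30],
                         [5, 9, 28], [2, 3, 3], [6, 11, 31]], dtype=float)

    distances = pdist(profiles)                  # pairwise distances
    tree = linkage(distances, method="average")  # agglomerative clustering
    labels = fcluster(tree, t=2, criterion="maxclust")
    print(labels)  # two structure-induced clusters, e.g. [1 1 2 2 1 2]
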
  • A. Mehler and R. Gleim, “The Net for the Graphs – Towards Webgenre Representation for Corpus Linguistic Studies,” in WaCky! Working Papers on the Web as Corpus, M. Baroni and S. Bernardini, Eds., Bologna: Gedit, 2006, pp. 191-224.
    [BibTeX]

    @InCollection{Mehler:Gleim:2006:b,
      Author         = {Mehler, Alexander and Gleim, Rüdiger},
      Title          = {The Net for the Graphs – Towards Webgenre
                       Representation for Corpus Linguistic Studies},
      BookTitle      = {WaCky! Working Papers on the Web as Corpus},
      Publisher      = {Gedit},
      Editor         = {Baroni, Marco and Bernardini, Silvia},
      Pages          = {191-224},
      Address        = {Bologna},
      website        = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.510.4125},
      year           = 2006
    }
  • [PDF] R. Gleim, A. Mehler, and M. Dehmer, “Web Corpus Mining by Instance of Wikipedia,” in Proceedings of the EACL 2006 Workshop on Web as Corpus, April 3-7, 2006, Trento, Italy, 2006, pp. 67-74.
    [Abstract] [BibTeX]

    Workshop organizer: Adam Kilgarriff
    @InProceedings{Gleim:Mehler:Dehmer:2006:a,
      Author         = {Gleim, Rüdiger and Mehler, Alexander and Dehmer,
                       Matthias},
      Title          = {Web Corpus Mining by Instance of Wikipedia},
      BookTitle      = {Proceedings of the EACL 2006 Workshop on Web as
                       Corpus, April 3-7, 2006, Trento, Italy},
      Editor         = {Kilgarriff, Adam and Baroni, Marco},
      Pages          = {67-74},
      abstract       = {Workshop organizer: Adam Kilgarriff},
      pdf            = {http://www.aclweb.org/anthology/W06-1710},
      website        = {http://pub.uni-bielefeld.de/publication/1773538},
      year           = 2006
    }
  • [PDF] R. Gleim, “HyGraph – Ein Framework zur Extraktion, Repräsentation und Analyse webbasierter Hypertextstrukturen,” in Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beiträge zur GLDV-Tagung 2005, Universität Bonn, Frankfurt a. M., 2006, pp. 42-53.
    [BibTeX]

    @InProceedings{Gleim:2006,
      Author         = {Gleim, Rüdiger},
      Title          = {HyGraph - Ein Framework zur Extraktion,
                       Repr{\"a}sentation und Analyse webbasierter
                       Hypertextstrukturen},
      BookTitle      = {Sprachtechnologie, mobile Kommunikation und
                       linguistische Ressourcen. Beitr{\"a}ge zur GLDV-Tagung
                       2005, Universit{\"a}t Bonn},
      Editor         = {Fisseni, Bernhard and Schmitz, Hans-Christian and
                       Schröder, Bernhard and Wagner, Petra},
      Pages          = {42-53},
      Address        = {Frankfurt a. M.},
      Publisher      = {Lang},
      pdf            = {https://www.texttechnologylab.org/wp-content/uploads/2016/10/GLDV2005-HyGraph-Framework.pdf},
      website        = {https://www.researchgate.net/publication/268294000_HyGraph__Ein_Framework_zur_Extraktion_Reprsentation_und_Analyse_webbasierter_Hypertextstrukturen},
      year           = 2006
    }
  • A. Mehler, M. Dehmer, and R. Gleim, “Towards Logical Hypertext Structure – A Graph-Theoretic Perspective,” in Proceedings of the Fourth International Workshop on Innovative Internet Computing Systems (I2CS ’04), Berlin/New York, 2006, pp. 136-150.
    [Abstract] [BibTeX]

    Facing the retrieval problem posed by the overwhelming set of documents online, the adaptation of text categorization to web units has recently been pushed. The aim is to utilize categories of web sites and pages as an additional retrieval criterion. In this context, the bag-of-words model has been utilized, just as have HTML tags and link structures. In spite of promising results, this adaptation stays within the framework of IR-specific models, since it neglects the content-based structuring inherent to hypertext units. This paper approaches hypertext modelling from the perspective of graph theory. It presents an XML-based format for representing websites as hypergraphs. These hypergraphs are used to shed light on the relation of hypertext structure types and their web-based instances. We place emphasis on two characteristics of this relation: In terms of realizational ambiguity we speak of functional equivalents to the manifestation of the same structure type. In terms of polymorphism we speak of a single web unit which manifests different structure types. It is shown that polymorphism is a prevalent characteristic of web-based units. This is done by means of a categorization experiment which analyses a corpus of hypergraphs representing the structure and content of pages of conference websites. Against this background we plead for a revision of text representation models by means of hypergraphs which are sensitive to the manifold structuring of web documents.
    @InProceedings{Mehler:Dehmer:Gleim:2006,
      Author         = {Mehler, Alexander and Dehmer, Matthias and Gleim,
                       Rüdiger},
      Title          = {Towards Logical Hypertext Structure - A
                       Graph-Theoretic Perspective},
      BookTitle      = {Proceedings of the Fourth International Workshop on
                       Innovative Internet Computing Systems (I2CS '04)},
      Editor         = {Böhme, Thomas and Heyer, Gerhard},
      Series         = {Lecture Notes in Computer Science 3473},
      Pages          = {136-150},
      Address        = {Berlin/New York},
      Publisher      = {Springer},
      abstract       = {Facing the retrieval problem posed by the
                        overwhelming set of documents online, the adaptation of
                       text categorization to web units has recently been
                       pushed. The aim is to utilize categories of web sites
                       and pages as an additional retrieval criterion. In this
                        context, the bag-of-words model has been utilized,
                        just as have HTML tags and link structures. In spite
                        of promising results, this adaptation stays within the
                        framework of IR-specific models, since it neglects the
                        content-based
                       structuring inherent to hypertext units. This paper
                       approaches hypertext modelling from the perspective of
                        graph theory. It presents an XML-based format for
                       representing websites as hypergraphs. These hypergraphs
                       are used to shed light on the relation of hypertext
                       structure types and their web-based instances. We place
                       emphasis on two characteristics of this relation: In
                       terms of realizational ambiguity we speak of functional
                       equivalents to the manifestation of the same structure
                       type. In terms of polymorphism we speak of a single web
                       unit which manifests different structure types. It is
                       shown that polymorphism is a prevalent characteristic
                       of web-based units. This is done by means of a
                       categorization experiment which analyses a corpus of
                       hypergraphs representing the structure and content of
                        pages of conference websites. Against this background we
                       plead for a revision of text representation models by
                       means of hypergraphs which are sensitive to the
                       manifold structuring of web documents.},
      website        = {http://rd.springer.com/chapter/10.1007/11553762_14},
      year           = 2006
    }
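
    The XML-based hypergraph format mentioned in this abstract can be pictured as follows: pages are vertices, and a hyperedge groups the pages that jointly manifest one logical unit. A hypothetical miniature built with Python's standard library; the element and attribute names are invented and do not reproduce the paper's actual schema.

    import xml.etree.ElementTree as ET

    # Hypothetical website-as-hypergraph document: three pages, one
    # hyperedge for the pages that jointly manifest a call for papers.
    root = ET.Element("hypergraph", site="conference.example.org")
    for url in ("index.html", "cfp.html", "venue.html"):
        ET.SubElement(root, "vertex", url=url)
    edge = ET.SubElement(root, "hyperedge", type="call-for-papers")
    for url in ("index.html", "cfp.html"):
        ET.SubElement(edge, "member", url=url)

    print(ET.tostring(root, encoding="unicode"))
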

2005 (2)

  • [PDF] A. Mehler and R. Gleim, “Polymorphism in Generic Web Units. A corpus linguistic study,” in Proceedings of Corpus Linguistics ’05, July 14-17, 2005, University of Birmingham, Great Britain, 2005.
    [Abstract] [BibTeX]

    Corpus linguistics and related disciplines which focus on statistical analyses of textual units have substantial need for large corpora. More specifically, genre or register specific corpora are needed which allow studying variations in language use. Along with the incredible growth of the internet, the web became an important source of linguistic data. Of course, web corpora face the same problem of acquiring genre specific corpora. Amongst other things, web mining is a framework of methods for automatically assigning category labels to web units and thus may be seen as a solution to this corpus acquisition problem as far as genre categories are applied. The paper argues that this approach is faced with the problem of a many-to-many relation between expression units on the one hand and content or function units on the other hand. A quantitative study is performed which supports the argumentation that functions of web-based communication are very often concentrated on single web pages and thus interfere with any effort to directly apply the classical apparatus of categorization at web page level. The paper outlines a two-level algorithm as an alternative approach to category assignment which is sensitive to genre specific structures and thus may be used to tackle the problem of acquiring genre specific corpora.
    @InProceedings{Mehler:Gleim:2005:a,
      Author         = {Mehler, Alexander and Gleim, Rüdiger},
      Title          = {Polymorphism in Generic Web Units. A corpus linguistic
                       study},
      BookTitle      = {Proceedings of Corpus Linguistics '05, July 14-17,
                        2005, University of Birmingham, Great Britain},
      Volume         = {Corpus Linguistics Conference Series 1(1)},
      abstract       = {Corpus linguistics and related disciplines which focus
                       on statistical analyses of textual units have
                       substantial need for large corpora. More specifically,
                       genre or register specific corpora are needed which
                       allow studying variations in language use. Along with
                       the incredible growth of the internet, the web became
                       an important source of linguistic data. Of course, web
                       corpora face the same problem of acquiring genre
                       specific corpora. Amongst other things, web mining is
                       a framework of methods for automatically assigning
                       category labels to web units and thus may be seen as a
                       solution to this corpus acquisition problem as far as
                       genre categories are applied. The paper argues that
                       this approach is faced with the problem of a
                       many-to-many relation between expression units on the
                       one hand and content or function units on the other
                       hand. A quantitative study is performed which supports
                       the argumentation that functions of web-based
                       communication are very often concentrated on single web
                        pages and thus interfere with any effort to directly
                        apply the classical apparatus of categorization at
                       web page level. The paper outlines a two-level
                       algorithm as an alternative approach to category
                       assignment which is sensitive to genre specific
                       structures and thus may be used to tackle the problem
                       of acquiring genre specific corpora.},
      issn           = {1747-9398},
      pdf            = {http://www.birmingham.ac.uk/Documents/college-artslaw/corpus/conference-archives/2005-journal/Thewebasacorpus/AlexanderMehlerandRuedigerGleimCorpusLinguistics2005.pdf},
      year           = 2005
    }
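
    The two-level algorithm outlined in this abstract can be caricatured as: first segment a page into functional units, then assign categories per unit, so that a page may legitimately receive several genre labels. In the Python sketch below, the segmenter and the unit classifier are crude placeholders; only the two-stage control flow mirrors the abstract.

    # Caricature of two-level category assignment.

    def segment(page):
        """Placeholder segmentation: one functional unit per non-empty line."""
        return [line for line in page.splitlines() if line.strip()]

    def classify_unit(unit):
        """Placeholder unit classifier based on keyword cues."""
        if "contact" in unit.lower():
            return "contact-information"
        if "publication" in unit.lower():
            return "publication-list"
        return "other"

    def categorize(page):
        """Level two: the page's categories are the set of its units' labels."""
        return {classify_unit(unit) for unit in segment(page)}

    page = "Contact: mail@example.org\nPublications of the group\nNews"
    print(categorize(page))  # {'contact-information', 'publication-list', 'other'}
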
  • A. Mehler, M. Dehmer, and R. Gleim, “Zur Automatischen Klassifikation von Webgenres,” in Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen. Beiträge zur GLDV-Frühjahrstagung ’05, 10. März – 01. April 2005, Universität Bonn, Frankfurt a. M., 2005, pp. 158-174.
    [BibTeX]

    @InProceedings{Mehler:Dehmer:Gleim:2005,
      Author         = {Mehler, Alexander and Dehmer, Matthias and Gleim,
                       Rüdiger},
      Title          = {Zur Automatischen Klassifikation von Webgenres},
      BookTitle      = {Sprachtechnologie, mobile Kommunikation und
                       linguistische Ressourcen. Beitr{\"a}ge zur
                       GLDV-Frühjahrstagung '05, 10. M{\"a}rz – 01. April
                       2005, Universit{\"a}t Bonn},
      Editor         = {Fisseni, Bernhard and Schmitz, Hans-Christian and
                       Schröder, Bernhard and Wagner, Petra},
      Pages          = {158-174},
      Address        = {Frankfurt a. M.},
      Publisher      = {Lang},
      year           = 2005
    }

2004 (1)

  • [PDF] M. Dehmer, A. Mehler, and R. Gleim, “Aspekte der Kategorisierung von Webseiten,” in INFORMATIK 2004 – Informatik verbindet, Band 2, Beiträge der 34. Jahrestagung der Gesellschaft für Informatik e.V. (GI). Workshop Multimedia-Informationssysteme, 2004, pp. 39-43.
    [Abstract] [BibTeX]

    In the course of web-based communication the question arises to what extent web pages can be categorized for the purpose of content-oriented filtering. This study examines two phenomena which concern the precondition of such a categorization (see [6]): By the notion of functional equivalence we refer to the phenomenon that the same function or content category can be manifested by completely different building blocks of web-based documents. By the notion of polymorphism we refer to the phenomenon that the same document can simultaneously manifest several function or content categories. The central hypothesis is that both phenomena are characteristic of web-based hypertext structures. If this is the case, the automatic categorization of hypertexts [2, 10] can no longer be understood as an unambiguous assignment in which exactly one category is assigned to a document. In this sense, the paper addresses the question of how to adequately model multimedia documents.
    @InProceedings{Dehmer:Mehler:Gleim:2004,
      Author         = {Dehmer, Matthias and Mehler, Alexander and Gleim,
                       Rüdiger},
      Title          = {Aspekte der Kategorisierung von Webseiten},
      BookTitle      = {INFORMATIK 2004 – Informatik verbindet, Band 2,
                       Beitr{\"a}ge der 34. Jahrestagung der Gesellschaft für
                       Informatik e.V. (GI). Workshop
                       Multimedia-Informationssysteme},
      Editor         = {Dadam, Peter and Reichert, Manfred},
      Volume         = {51},
      Series         = {Lecture Notes in Informatics},
      Pages          = {39-43},
      Publisher      = {GI},
      abstract       = {Im Zuge der Web-basierten Kommunikation tritt die
                       Frage auf, inwiefern Webpages zum Zwecke ihrer
                       inhaltsorientierten Filterung kategorisiert werden
                       können. Diese Studie untersucht zwei Ph{\"a}nomene,
                       welche die Bedingung der Möglichkeit einer solchen
                       Kategorisierung betreffen (siehe [6]): Mit dem Begriff
                        der funktionalen {\"A}quivalenz beziehen wir uns auf das
                       Ph{\"a}nomen, dass dieselbe Funktions- oder
                       Inhaltskategorie durch völlig verschiedene Bausteine
                       Web-basierter Dokumente manifestiert werden kann. Mit
                        dem Begriff der Polymorphie beziehen wir uns auf das
                       Ph{\"a}nomen, dass dasselbe Dokument zugleich mehrere
                       Funktions- oder Inhaltskategorien manifestieren kann.
                       Die zentrale Hypothese lautet, dass beide Ph{\"a}nomene
                       für Web-basierte Hypertextstrukturen charakteristisch
                       sind. Ist dies der Fall, so kann die automatische
                       Kategorisierung von Hypertexten [2, 10] nicht mehr als
                       eindeutige Zuordnung verstanden werden, bei der einem
                       Dokument genau eine Kategorie zugeordnet wird. In
                       diesem Sinne thematisiert das Papier die Frage nach der
                       ad{\"a}quaten Modellierung multimedialer Dokumente.},
      pdf            = {http://subs.emis.de/LNI/Proceedings/Proceedings51/GI-Proceedings.51-11.pdf},
      website        = {https://www.researchgate.net/publication/221385316_Aspekte_der_Kategorisierung_von_Webseiten},
      year           = 2004
    }
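
    If polymorphism holds, automatic categorization can no longer be an unambiguous one-category-per-document assignment; the natural output type is a set of categories per document. A minimal Python illustration of the difference; the labels, scores and the 0.4 threshold are invented.

    # Single-label assignment forces a choice; a multi-label assignment
    # (one document, a set of categories) accommodates polymorphism.
    scores = {"homepage": 0.48, "publication-list": 0.45, "contact": 0.07}

    single_label = max(scores, key=scores.get)  # 'homepage'
    multi_label = {c for c, s in scores.items() if s >= 0.4}

    print(single_label)  # discards a nearly equal second reading
    print(multi_label)   # {'homepage', 'publication-list'}
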