Talk by Alexander Mehler, Wahed Hemati, Tolga Uslu and Rüdiger Gleim at D2K 2016 (From Digitization to Knowledge 2016)
July 11, 2016, Krakow
Exploring Wikipedia as a Text-technological Resource: From Natural Language Processing to Modeling Language Change
Wikipedia and related projects such as Wiktionary are widely used as resources for solving various tasks in Natural Language Processing and Computational Linguistics. In the area of semantics, this relates, for example, to explicit semantic analysis and topic labeling as well as to wikification, ontology mining, merging and enrichment. A central prerequisite of any such endeavor is the accurate and detailed pre-processing of these resources, including tasks such as automatic tagging of parts of speech and grammatical categories, disambiguation, dependency parsing, relation extraction and topic modeling. In this talk, I will address these tasks using the example of the German Wikipedia. The aim is to show how they scale with the size of Wikipedia as it grows over time. This is a prerequisite for studying laws of semantic change by analyzing several consecutive stages of Wikipedia, thereby giving insights into the time and space complexity of time-related lexical-semantic analyses. We take the point of view of complex network theory and show how the entropy of lexical networks derived from Wikipedia develops as a function of time, in order to obtain sparse vector-embedding representations for the various tasks of semantic modeling. A special focus will be on visualizing the output of such analyses with the help of the TextImager. The UIMA-based TextImager automatically extracts a wide range of linguistic information from input texts to derive representational images of these texts.
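To give a concrete sense of the kind of network entropy measure involved, the following is a minimal sketch that computes the Shannon entropy of the degree distribution of a toy lexical co-occurrence network. Note that this is only an illustration: the specific entropy measure and network construction used in the talk's analyses of Wikipedia may differ, and the example words and edges are invented.

```python
import math
from collections import Counter

def degree_entropy(edges):
    """Shannon entropy (in bits) of the degree distribution of an
    undirected graph given as a list of (u, v) edges.

    This is one common network-entropy measure from complex network
    theory; it is used here purely as an illustration.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Histogram of degrees: degree value -> number of nodes with that degree
    counts = Counter(degree.values())
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Toy lexical network: an edge links two words that co-occur
# (hypothetical example data, not taken from Wikipedia)
edges = [("wiki", "page"), ("wiki", "link"),
         ("page", "link"), ("link", "anchor")]
print(degree_entropy(edges))  # → 1.5
```

Tracking such an entropy value across consecutive snapshots of a Wikipedia-derived network is one simple way to quantify how the network's structure changes over time.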