The following master/bachelor thesis topics are currently offered by the Text Technology Lab. You are also welcome to bring in your own topics and ideas. If you are interested in one of them, please contact Alexander Mehler.
Theses in document classification
- Word Sense Disambiguation using Support Vector Machines (Master or Bachelor)
A word can have several meanings or readings. For example, the word bank can denote the sloping side of a river or a financial institution. The word work can refer to the act of performing some kind of labor or to the correct functioning of an electronic or mechanical device (e.g., the computer does not work anymore). Word sense disambiguation aims to determine the meaning of a word automatically from the other words appearing in its context. The goal of this thesis is to devise such a word sense disambiguation system using a support vector machine. A support vector machine is a supervised machine learning method which separates the data into two classes by means of a hyperplane. This hyperplane is automatically calculated from an annotated data set.
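As a minimal sketch of this setting (assuming scikit-learn and a toy, hand-annotated set of contexts for the word bank; the data and labels are invented for illustration), the approach could look as follows:

```python
# Minimal word sense disambiguation sketch using a linear SVM.
# The words surrounding the ambiguous word serve as bag-of-words features;
# the annotated senses are the two classes separated by the hyperplane.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy training data: contexts of "bank", each annotated with its sense.
contexts = [
    "deposit money at the bank account",
    "the bank approved the loan",
    "sat on the bank of the river",
    "fishing from the grassy bank of the stream",
]
senses = ["finance", "finance", "river", "river"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contexts)

clf = LinearSVC()  # learns the separating hyperplane
clf.fit(X, senses)

test = vectorizer.transform(["she opened an account at the bank"])
print(clf.predict(test)[0])  # likely "finance", since "account" is a finance cue
```

In a real thesis, the toy contexts would be replaced by a sense-annotated corpus and the bag-of-words features by richer context representations.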
- Document classification with nearest neighbor (Master or Bachelor)
In many cases it is of interest to assign documents to topics automatically. Such a classification can, for example, be used to route customer mails to the correct person in charge or to separate spam from non-spam. One way to do this automatically is to employ the so-called nearest neighbor method, which is the subject of the proposed thesis. The nearest neighbor method is a machine learning approach which determines the class (here the topic) of a data point by investigating only those data points which are most similar (according to certain features) to the one to be classified.
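A minimal sketch of such a classifier (assuming scikit-learn; the mails and topic labels are invented toy data) might look like this:

```python
# Nearest neighbor topic classification sketch: a new document receives
# the topic of its most similar training document (cosine similarity
# on tf-idf bag-of-words vectors).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

docs = [
    "please reset my password for the customer portal",
    "my invoice shows the wrong billing amount",
    "win a free prize click this link now",
    "cheap pills click here limited offer",
]
topics = ["support", "billing", "spam", "spam"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# k = 1: take the single most similar training document.
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(X, topics)

query = vectorizer.transform(["click here for a free prize"])
print(knn.predict(query)[0])  # likely "spam": it overlaps most with the spam mails
```

Choosing k and the similarity features (words, n-grams, tf-idf weighting) is exactly the kind of design decision the thesis would investigate.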
- Development of a tool to annotate the document structure (Master or Bachelor)
In some cases, assigning an entire document to a single topic is not sufficient. Frequently, a document consists of loosely connected sub-documents which focus on different aspects of the superordinate topic. For instance, a book about mathematics could contain chapters on algebra as well as on calculus. The goal of this thesis is to develop a hierarchical, GUI-based annotation tool that allows for the manual assignment of chapters, sections and paragraphs to topics. Note that the documents as well as the topics are usually hierarchically ordered.
Theses in author identification
- Author Identification in copies of ancient manuscripts (Master)
Author identification has been performed on various documents and with various methods. Not only James Bond and intelligence agencies around the world make use of it; NLP scientists do, too. A largely unexplored subfield is copyist identification. Author identification is often treated as a machine learning problem, e.g. with support vector machines: a training set of different texts by one author is used to compute statistical values that are combined into a feature vector. Most commonly, a feature vector is computed for each author, containing features such as the frequencies of the words he/she uses, which set him/her apart from other authors. In a copying process, however, as observed in modern classrooms as well as in ancient manuscripts, word vectors cannot be used, since a copied text reflects no lexical choices of the copyist himself. In copies, often the only available information consists of errors or unusual spellings, a good source of which are ancient texts from times without binding orthographic conventions. The goal of this thesis is to develop a tool for the identification of copyists.
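One conceivable starting point (a sketch only, assuming scikit-learn; the "manuscript copies" and scribal spelling habits below are entirely invented for illustration) is to replace word features by sub-word features that capture spelling habits:

```python
# Sketch of copyist identification: since the wording is fixed by the
# exemplar being copied, features are drawn from spelling variants
# instead of word choice.
# Hypothetical toy data: the same sentence copied by two scribes with
# systematic (invented) spelling habits, e.g. u/v and e/ae variation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

copies = [
    "in principio creauit deus celum et terram",   # scribe A
    "in principio creauit deus caelum et teram",   # scribe A
    "in principio creavit deus caelum et terram",  # scribe B
    "in principio creavit deus celum et terram",   # scribe B
]
scribes = ["A", "A", "B", "B"]

# Character n-grams capture sub-word spelling habits rather than lexis.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 4))
X = vectorizer.fit_transform(copies)

clf = LinearSVC()
clf.fit(X, scribes)

unseen = vectorizer.transform(["in principio creavit deus caelum et terram"])
print(clf.predict(unseen)[0])
```

The open research question of the thesis is which features (errors, abbreviations, spelling variants) actually generalize across copies by the same scribe.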
- Multi-word unit (MWU) extraction
Multi-word unit (MWU) extraction is a promising area of research that is vital for many areas of NLP such as machine translation, question answering, information retrieval, etc. Non-compositional MWUs are especially challenging because their meaning cannot be inferred from the meanings of their components. The importance as well as the difficulty of correctly detecting non-compositional MWUs is revealed by the crude word-by-word translation of the German sentence "Er reißt sich bei der Hausaufgabe die Beine aus" (literally: "He tears his legs out at his homework"; idiomatically: "He goes to great lengths with his homework"). Recently, Fazly, Cook and Stevenson showed that lexical and syntactic fixedness are good indicators of idiomaticity. Their information-theoretic approach uses pointwise mutual information (PMI) to estimate lexical and syntactic fixedness. Lexical fixedness measures the strength of the association between the idiom's constituents. Syntactic fixedness measures whether the usage of a MWU is limited to one or a few syntactic patterns. The task of this thesis is to implement these measures and apply them to a Latin corpus, the Patrologia Latina.
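As a much simplified sketch of the lexical fixedness idea (the counts are hypothetical, not from any corpus, and the full measure of Fazly et al. additionally involves generating variants from a thesaurus and a syntactic fixedness component): the PMI of the target verb-noun pair is compared, as a z-score, with the PMIs of variants in which a constituent is replaced by a similar word.

```python
# Simplified lexical fixedness sketch (after Fazly et al.): a pair whose
# PMI stands far above the PMIs of its lexical variants is likely fixed,
# i.e. idiomatic.
import math
from statistics import mean, stdev

def pmi(pair_count, v_count, n_count, total):
    """Pointwise mutual information of a verb-noun pair from raw counts."""
    return math.log((pair_count * total) / (v_count * n_count))

def lexical_fixedness(target_pmi, variant_pmis):
    """z-score of the target pair's PMI within its variant set."""
    scores = [target_pmi] + variant_pmis
    return (target_pmi - mean(scores)) / stdev(scores)

# Hypothetical toy counts (invented for illustration):
total = 1_000_000
target = pmi(pair_count=50, v_count=500, n_count=400, total=total)  # e.g. "kick the bucket"
variants = [
    pmi(pair_count=2, v_count=500, n_count=900, total=total),  # e.g. "kick the pail"
    pmi(pair_count=1, v_count=500, n_count=300, total=total),  # e.g. "kick the tub"
]
print(round(lexical_fixedness(target, variants), 2))  # high positive score: fixed pair
```

For the Patrologia Latina, the counts would come from lemmatized corpus frequencies, and the variant sets from a Latin lexical resource.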
Fazly, Afsaneh; Cook, Paul; Stevenson, Suzanne (2009). Unsupervised type and token identification of idiomatic expressions. Computational Linguistics 35(1): 61–103.