Document Type

Theses, Ph.D


This item is available under a Creative Commons License for non-commercial use only



Publication Details

Successfully submitted for the award of Doctor of Philosophy (Ph.D) to the Dublin Institute of Technology, October, 2011.


Text classification approaches are used extensively to solve real-world challenges. The success or failure of text classification systems hangs on the datasets used to train them; without a good dataset it is impossible to build a quality system. This thesis examines the applicability of active learning in text classification for the rapid and economical creation of labelled training data. Four main contributions are made in this thesis. First, we present two novel selection strategies to choose the most informative examples for manual labelling. One is an approach that uses an advanced aggregated confidence measurement, instead of the direct output of classifiers, to measure the confidence of a prediction and chooses the examples with the least confidence for querying. The other is a simple but effective exploration-guided active learning selection strategy which uses only the notions of density and diversity, based on similarity. Second, we propose new methods of using deterministic clustering algorithms to help bootstrap the active learning process. We first illustrate the problems of using non-deterministic clustering for selecting initial training sets, showing how non-deterministic clustering methods can result in inconsistent behaviour in the active learning process. We then compare various deterministic clustering techniques with commonly used non-deterministic ones, and show that deterministic clustering algorithms are as good as non-deterministic clustering algorithms at selecting initial training examples for the active learning process. More importantly, we show that the use of deterministic approaches stabilises the active learning process. Our third contribution is in the area of visualising the active learning process. We apply an existing visualisation technique to active learning selection strategies and show that it leads to a better understanding of how those strategies behave. Finally, for active learning to be practical and useful as a general dataset labelling methodology, it is desirable that actively labelled datasets can be reused more widely rather than being limited to one particular classifier. We compare the reusability of popular active learning methods for text classification and identify the best classifiers to use in active learning for text classification. This thesis is concerned with using active learning methods to label large unlabelled textual datasets. Our domain of interest is text classification, but most of the methods proposed are quite general and so are applicable to other domains that have large collections of high-dimensional data.
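As a rough illustration of the least-confidence querying idea mentioned in the abstract, the following minimal Python sketch ranks unlabelled documents by the probability of their most likely class and returns the least confident ones for manual labelling. It shows only the plain baseline that uses the classifier's direct output, not the aggregated confidence measurement proposed in the thesis. The function name `least_confident`, the document ids, and the probability values are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Sequence


def least_confident(
    probabilities: Dict[str, Sequence[float]], batch_size: int = 1
) -> List[str]:
    """Return ids of the pool examples whose top class probability is lowest."""
    # Confidence of a prediction = probability of its most likely class.
    confidence = {doc_id: max(p) for doc_id, p in probabilities.items()}
    # Sort by confidence, breaking ties by id so the result is deterministic.
    ranked = sorted(confidence, key=lambda doc_id: (confidence[doc_id], doc_id))
    return ranked[:batch_size]


# Hypothetical class-probability outputs for four unlabelled documents.
pool = {
    "doc_a": [0.95, 0.05],
    "doc_b": [0.55, 0.45],
    "doc_c": [0.70, 0.30],
    "doc_d": [0.51, 0.49],
}

# The two documents the classifier is least sure about are queried first.
queries = least_confident(pool, batch_size=2)
print(queries)
```

In a full active learning loop, the returned documents would be labelled by an annotator, added to the training set, and the classifier retrained before the next round of selection.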