Document Type

Dissertation

Rights

This item is available under a Creative Commons License for non-commercial use only

Disciplines

Computer Sciences

Publication Details

A dissertation submitted in partial fulfilment of the requirements of Dublin Institute of Technology for the degree of M.Sc. in Computing (Data Analytics) 2017.

Abstract

The recent growth in the use of short messages, such as micro-blog posts and SMS communications, has created an opportunity to harvest a vast amount of information through machine-based classification. However, traditional classification methods have failed to achieve accuracies on short texts comparable to those obtained on longer texts. Several approaches have been employed to extend traditional methods to overcome this problem. One is to enhance the original texts by constructing associations with external enrichment sources, ranging from thesauri and semantic nets such as WordNet to pre-built online taxonomies such as Wikipedia. Other avenues of investigation have used more formal extensions, such as Latent Semantic Analysis (LSA), to extend or replace the more basic traditional methods, which are better suited to the classification of longer texts. This work examines how the classification accuracy of a small selection of classification methods, combined with a variety of enhancement methods, changes as target text length decreases. The experimental data is a corpus of micro-blog (Twitter) posts obtained from the ‘Sentiment140’ sentiment classification and analysis project run by Stanford University and described by Go, Bhayani and Huang (2009); the corpus has been split into sub-corpora differentiated by text length.
