This item is available under a Creative Commons License for non-commercial use only
The recent growth in the use of short messages, such as micro-blog posts and SMS communications, has created an opportunity to harvest a vast amount of information through machine-based classification. However, traditional classification methods have failed to produce accuracies comparable to those obtained when classifying longer texts. Several approaches have been employed to extend traditional methods to overcome this problem, including enhancing the original texts by constructing associations with external data enrichment sources, ranging from thesauri and semantic nets such as WordNet to pre-built online taxonomies such as Wikipedia. Other avenues of investigation have used more formal extensions such as Latent Semantic Analysis (LSA) to extend or replace the more basic, traditional methods better suited to classification of longer texts. This work examines how the classification accuracy of a small selection of classification methods, combined with a variety of enhancement methods, changes as target text length decreases. The experimental data is a corpus of micro-blog (Twitter) posts obtained from the ‘Sentiment140’ sentiment classification and analysis project run by Stanford University and described by Go, Bhayani and Huang (2009), which has been split into sub-corpora differentiated by text length.
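As a rough illustration of splitting a corpus into sub-corpora by text length, the following minimal Python sketch groups labelled posts into length bands. It is not the procedure used in the dissertation: the function name `split_by_length`, the use of character counts, the band boundaries (35, 70, 105) and the two sample posts are all assumptions made up for this example.

```python
from collections import defaultdict


def split_by_length(posts, boundaries=(35, 70, 105)):
    """Group (text, label) pairs into sub-corpora keyed by length band.

    `boundaries` are hypothetical inclusive upper limits, in characters,
    given in ascending order. Texts longer than the last boundary fall
    into a final open-ended band.
    """
    buckets = defaultdict(list)
    for text, label in posts:
        length = len(text)
        for upper in boundaries:
            if length <= upper:
                buckets[f"<={upper}"].append((text, label))
                break
        else:
            # No boundary matched: the text is longer than every band.
            buckets[f">{boundaries[-1]}"].append((text, label))
    return dict(buckets)


# Invented sample posts, for illustration only.
sample = [
    ("so tired", "negative"),
    ("loving the sunshine this morning, great start to the day", "positive"),
]
sub_corpora = split_by_length(sample)
```

Each resulting sub-corpus could then be passed to the same classifier in turn, so that accuracy can be compared across length bands.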
McCartney, A. (2017) “How short is a piece of string?”: An Investigation into the Impact of Text Length on Short-text Classification Accuracy, Masters Dissertation, Dublin Institute of Technology.