OFAI-TR-98-30 (28 kB gzipped PostScript file, 295 kB PDF file)

A Study Using n-gram Features for Text Categorization

Johannes Fürnkranz

In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm RIPPER indicate that, after the removal of stop words, word sequences of length 2 or 3 are most useful. Using longer sequences reduces classification performance.
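To make the notion of word n-gram features concrete, here is a minimal Python sketch that counts all word sequences of length 1 to max_n after stop words have been removed. It is an illustration only, not the efficient generation algorithm used in the report: the function name word_ngrams, the tiny STOP_WORDS list, and the whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "and", "to", "is"})


def word_ngrams(text, max_n=3, stop_words=STOP_WORDS):
    """Count word n-grams of length 1..max_n after stop-word removal.

    Tokenization is a plain lowercase whitespace split, which is an
    assumption of this sketch rather than the report's preprocessing.
    """
    tokens = [t for t in text.lower().split() if t not in stop_words]
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the filtered token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = word_ngrams(
    "the effect of using word sequences in text categorization", max_n=2
)
for ngram, count in sorted(features.items()):
    print(ngram, count)
```

Because stop words are dropped before the window is applied, words that were separated by a stop word become adjacent, so "effect of using" yields the bigram "effect using". This naive version enumerates every n-gram; a practical feature generator would usually also prune infrequent sequences.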

Keywords: Machine Learning, Text Categorization

Citation: Fürnkranz J.: A Study Using n-gram Features for Text Categorization, Austrian Research Institute for Artificial Intelligence, Vienna, TR-98-30, 1998.