Project MetaL: On-line Bibliography on Meta-Learning

Empirical Comparisons

This page is part of the On-line Bibliography on Meta-Learning, which is provided within the ESPRIT METAL Project (26.357) A Meta-Learning Assistant for Providing User Support in Machine Learning Mining. This part of the bibliography covers empirical studies of different types of learning algorithms.

If you think of a paper that should be here, please let us know.

19 references, last updated Fri Mar 9 13:02:51 MET 2001

[Bauer and Kohavi, 1999]
Eric Bauer and Ron Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36:105-139, 1999.
Abstract: Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and real-world datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (3 variants) and a naive-Bayes inducer. The purpose of the study is to improve our understanding of why and when these algorithms, which use perturbation, re-weighting, and combination techniques, affect classification error. We provide a bias and variance decomposition of the error to show how different methods and variants influence these two terms. This allowed us to determine that bagging reduced variance of unstable models, while boosting methods (AdaBoost and Arc-x4) reduced both the bias and variance of unstable models but increased the variance for naive-Bayes which was very stable. We observed that Arc-x4 behaves differently than AdaBoost if re-weighting is used instead of resampling, indicating a fundamental difference. Voting variants, some of which are introduced in this paper, include: pruning versus no-pruning, use of probabilistic estimates, weight perturbations(Wagging) and backfitting of data. We found that Bagging improves when probabilistic estimates in conjunction with no-pruning are used, as well as when data was backfit. We measure tree sizes and show an interesting positive correlation between the increase in the average tree size in AdaBoost trials and its success in reducing the error. We compare the mean-squared error of voting methods to non-voting methods and show that the voting methods lead to a large and significant reductions in the mean-squared error. Practical problems that arise in the implementation of boosting algorithms are explored, including numerical instabilities and underflows. We use scatterplots that graphically show how AdaBoost reweights instances, emphasizing not only "hard" areas but also outliers and noise

[Buntine and Niblett, 1992]
Wray Buntine and Tim Niblett. A further comparison of splitting rules for decision-tree induction. Machine Learning, 8:75-85, 1992.

[Chan and Stolfo, 1998]
Philip K. Chan and Salvatore J. Stolfo. Learning with non-uniform class and cost distributions: Effects and a distributed multi-classifier approach. In Proceedings of the KDD-98 Workshop on Distributed Data Mining, pages 1-9, 1998.
Comment: The authors study how the class distribution of the training examples influences four learning algorithms (C4.5, CART, Ripper, Bayes). They also propose a cost-sensitive multi-classifier meta-classification scheme.

[Dietterich, 2000]
Thomas G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139-157, 2000.
Abstract: Bagging and boosting are methods that generate a diverse ensemble of classifiers by manipulating the training data given to a base learning algorithm. Breiman has pointed out that they rely for their effectiveness on the instability of the base learning algorithm. An alternative approach to generating an ensemble is to randomize the internal decisions made by the base algorithm. This general approach has been studied previously by Ali and Pazzani and by Dietterich and Kong. This paper compares the effectiveness of randomization, bagging, and boosting for improving the performance of the decision-tree algorithm C4.5. The experiments show that in situations with little or no classification noise, randomization is competitive with (and perhaps slightly superior to) bagging but not as accurate as boosting. In situations with substantial classification noise, bagging is much better than boosting, and sometimes better than randomization.

[Feng et al., 1993]
C. Feng, A. Sutherland, R. King, S. Muggleton, and R. Henery. Comparison of machine learning classifiers to statistics and neural networks. In Proceedings of the 4th International Workshop on AI and Statistics, pages 41-52, Florida, 1993.
Comment: The authors perform a comparison of 17 algorithms from the areas of machine learning, neural networks and statistics. No algorithm or method was uniformly better than the others. The authors also give some evidence that the algorithms' performance depends on measurable dataset characteristics, and show a clustering of the algorithms that has been derived on the basis of such measures.

[Hilario and Kalousis, 2000]
Melanie Hilario and Alexandros Kalousis. Quantifying the resilience of inductive algorithms for classification. In Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 106-115. Springer, September 2000.
Abstract: Selecting the most appropriate learning algorithm for a given task has become a crucial research issue since the advent of multi-paradigm data mining tool suites. To address this issue, researchers have tried to extract dataset characteristics which might provide clues as to the most appropriate learning algorithm. We propose to extend this research by extracting inducer profiles, i.e., sets of metalevel features which characterize learning algorithms from the point of view of their representation and functionality, efficiency, practicality, and resilience. Values for these features can be determined on the basis of author specifications, expert consensus or previous case studies. However, there is a need to characterize learning algorithms in more quantitative terms on the basis of extensive, controlled experiments. This paper illustrates the proposed approach and reports empirical findings on one resilience-related characteristic of classifier inducers, namely their tolerance to irrelevant variables in training data.

[Jammernegg et al., 1998]
Werner Jammernegg, Mikulas Luptacik, Gholamreza Nakhaeizadeh, and Alexander Schnabl. Ist ein fairer Vergleich von Data Mining Algorithmen moeglich? In Gholamreza Nakhaeizadeh, editor, Data Mining. Theoretische Aspekte und Anwendungen, pages 225-240. Physica, Heidelberg, Brighton, UK, 1998.

[Kalousis and Hilario, 2000]
A. Kalousis and M. Hilario. Knowledge discovery from incomplete data. In Proceedings of the 2nd International Conference on Data Mining. WIT Press, June 2000.
Abstract: Incomplete data can raise more or less serious problems in knowledge discovery systems depending on the quantity and pattern of missing values as well as the generalization method used. For instance, some methods are inherently resilient to missing values while others have built-in methods for coping with them. Still others require that none of the values are missing; for such methods, preliminary imputation of missing values is indispensable. After a quick overview of current practice in the machine learning field, we explore the problem of missing values from a statistical perspective. In particular, we adopt the well-known distinction between three patterns of missing value (missing completely at random (MCAR), missing at random (MAR) and not missing at random (NMAR)) to focus a comparative study of eight learning algorithms from the point of view of their tolerance to incomplete data. Experiments on 47 datasets reveal a rough ranking from the most resilient (e.g., Naive Bayes) to the most sensitive (e.g., IB1 and surprisingly C50rules). More importantly, results show that for a given amount of missing values, their dispersion among the predictive variables is at least as important as the pattern of missingness

[King et al., 1995]
R. D. King, C. Feng, and A. Sutherland. STATLOG: comparison of classification algorithms on large real-world problems. Applied Artificial Intelligence, 9(3):289-334, 1995.

[Lim et al., 2000]
Tjen-Sien Lim, Wei-Yin Loh, and Yu-Shan Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40(3):203-228, 2000.
Abstract: Twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two datasets in terms of classification accuracy, training time, and (in the case of trees) number of leaves. Classification accuracy is measured by mean error rate and mean rank of error rate. Both criteria place a statistical, spline-based, algorithm called POLYCLSSS at the top, although it is not statistically significantly different from twenty other algorithms. Another statistical algorithm, logistic regression, is second with respect to the two accuracy criteria. The most accurate decision tree algorithm is QUEST with linear splits, which ranks fourth and fifth, respectively. Although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times. POLYCLASS, for example, is third last in terms of median training time. It often requires hours of training compared to seconds for other algorithms. The QUEST and logistic regression algorithms are substantially faster. Among decision tree algorithms with univariate splits, C4.5, IND-CART, and QUEST have the best combinations of error rate and speed. But C4.5 tends to produce trees with twice as many leaves as those from IND-CART and QUEST.
Comment: This paper essentially extends the results of the StatLog project, both in size and scope. It reports on a comparison of 22 decision-tree, 9 statistical and 2 neural network algorithms on 33 datasets in terms of classification accuracy, training time and (in the case of trees) number of leaves. In addition to the standard settings, the paper also studies the effect of noise and examines scalability as the sample size increases. Interestingly (especially in the light of Wolpert's work), the results show that most algorithms adapt to noise quite well and the mean error rates of many algorithms are sufficiently similar that their differences are statistically insignificant. Thus the authors suggest that the differences are also probably insignificant in practical terms, that is, criteria other than predictive accuracy should be considered in selecting algorithms. This paper is a gold mine of empirical results.

[Lindner et al., 1998]
Guido Lindner, Robert Engels, and Rudi Studer. Model selection for tasks. In Willi Kloesgen and Jan Zytkow, editors, Handbook for Knowledge Discovery and Data Mining, chapter 6.2. Oxford Press, USA, 1998.

[Michie et al., 1994]
D. Michie, D.J. Spiegelhalter, and C.C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
Comment: This is the book that came out of the STATLOG project. It contains a number of empirical and theoretical studies of different learning algorithms and some preliminary work on meta-learning.

[Mingers, 1989a]
John Mingers. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227-243, 1989.

[Mingers, 1989b]
John Mingers. An empirical comparison of selection measures for decision-tree induction. Machine Learning, 3:319-342, 1989.

[Quinlan, 1988]
J. Ross Quinlan. An empirical comparison of genetic and decision-tree classifiers. In Proceedings of the 5th International Conference on Machine Learning (ICML-88), pages pp.135-141, 1988.

[Quinlan, 1993]
J. R. Quinlan. Comparing connectionist and symbolic learning methods. In S. Hanson, G. Drastal, and R. Rivest, editors, Computational Learning Theory and Natural Learning Systems: Constraints and Ppects. MIT Press, 1993.
Comment: The author review a number of comparative studies and drew two main conclusions: (1) no single method is superior to the others, and (2) problem-specific knowledge can help to determine the right method. In particular, he discriminates between parallel tasks, in which a combination of all or most of the features will be important, and sequential tasks, in which only a few attributes are relevant to the classification. Neural networks should be used for the fomer type of problem, while symbolic methods are appropriate for the latter. (summary based on Schaffer, MLJ 1993)

[Seewald et al., 2001]
Alexander K. Seewald, Johann Petrak, and Gerhard Widmer. Hybrid decision tree learners with alternative leaf classifiers: An empirical study. In Proceedings of the 14th International FLAIRS Conference (FLAIRS-2001), Key West, Florida, 2001.
Abstract: There has been surprisingly little research so far that systematically investigated the possibility of constructing hybrid learning algorithms by simple local modifications to decision tree learners. In this paper we analyze three variants of a C4.5-style learner, introducing alternative leaf models (Naive Bayes, IB1, and multi-response linear regression, respectively) which can replace the original C4.5 leaf nodes during reduced error post-pruning. We empirically show that these simple modifications can improve upon the performance of the original decision tree algorithm and even upon both constituent algorithms. We see this as a step towards the construction of learners that locally optimize their bias for different regions of the instance space.

[Shavlik et al., 1991]
Jude W. Shavlik, Raymond J. Mooney, and Geoffrey G. Towell. Symbolic and neural learning algorithms: An experimental comparison. Machine Learning, 6:111-143, 1991.

[Thrun et al., 1991]
S.B. Thrun, J. Bala, E. Bloedorn, I. Bratko, B. Cestnik, J. Cheng, K. De Jong, S. Dzeroski, S.E. Fahlman, D. Fisher, R. Hamann, K. Kaufman, S. Keller, I. Kononenko, J. Kreuziger, R.S. Michalski, T. Mitchell, P. Pachowitz, Y. Reich, H. Vafaie, W. Van de Welde, W. Wenzel, J. Wnek, and J. Zhang. The MONK's problems: A performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University, December 1991.
Comment: A report that describes the results of a wide variety of learning algorithms (mostly by used by their authors) on three artificial problems.