OFAI

Project MetaL: On-line Bibliography on Meta-Learning

Algorithm and Data Measurements

This page is part of the On-line Bibliography on Meta-Learning, which is provided within the ESPRIT METAL Project (26.357), A Meta-Learning Assistant for Providing User Support in Machine Learning and Data Mining. This part of the bibliography is concerned with measures for characterizing datasets and measuring algorithm performance.

If you think of a paper that should be here, please let us know.


33 references, last updated Sat Jan 4 23:28:26 2003

[Alpaydin, 1999]
Ethem Alpaydin. Combined 5x2cv F test for comparing supervised classification learning algorithms. Neural Computation, 11(8):1885-1892, 1999.
Abstract: Dietterich (1998) reviews five statistical tests and proposes the 5x2cv t test for determining whether there is a significant difference between the error rates of two classifiers. In our experiments, we noticed that the 5x2cv t test result may vary depending on factors that should not affect the test, and we propose a variant, the combined 5x2cv F test, that combines multiple statistics to get a more robust test. Simulation results show that this combined version of the test has lower Type I error and higher power than 5x2cv proper.
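Illustration: a minimal sketch of the combined 5x2cv F statistic described above; the function name and array layout are illustrative, not taken from the paper. The statistic is referred to an F(10, 5) distribution.

    import numpy as np
    from scipy import stats

    def combined_5x2cv_f_test(diffs):
        # diffs: array of shape (5, 2); diffs[i, j] is the difference in
        # error rate between the two classifiers on fold j of replication i.
        diffs = np.asarray(diffs, dtype=float)
        fold_means = diffs.mean(axis=1)                               # p-bar_i
        variances = ((diffs - fold_means[:, None]) ** 2).sum(axis=1)  # s_i^2
        f_stat = (diffs ** 2).sum() / (2.0 * variances.sum())
        p_value = stats.f.sf(f_stat, 10, 5)
        return f_stat, p_value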

[Bailey and Elkan, 1993]
Timothy L. Bailey and Charles Elkan. Estimating the accuracy of learned concepts. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI-93), pages 895-900, Chambery, France, 1993.

[Bradley, 1997]
A. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145-1159, 1997.
Abstract: In this paper we investigate the use of the area under the receiver operating characteristic (ROC) curve (AUC) as a performance measure for machine learning algorithms. As a case study we evaluate six machine learning algorithms (C4.5, Multiscale Classifier, Perceptron, Multi-layer Perceptron, k-nearest neighbours, and a quadratic discriminant) on six real world medical diagnosis data sets. We compare and discuss the use of AUC with the more conventional overall accuracy and find that AUC exhibits a number of desirable properties when compared to overall accuracy: increased sensitivity in Analysis of Variance (ANOVA) tests; a standard error that decreased as both AUC and the number of test samples increased; independence of the decision threshold; and invariance to a priori class probabilities. The paper concludes with the recommendation that AUC be used in preference to overall accuracy for single number evaluation of machine learning algorithms.
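Illustration: the AUC of a scoring classifier can be computed directly from the rank-sum (Mann-Whitney) identity; the sketch below assumes NumPy/SciPy and binary 0/1 labels and is not code from the paper.

    import numpy as np
    from scipy.stats import rankdata

    def auc_from_scores(y_true, scores):
        # y_true: 0/1 labels; scores: larger values mean "more positive".
        y_true = np.asarray(y_true)
        ranks = rankdata(scores)                      # average ranks for ties
        n_pos = int((y_true == 1).sum())
        n_neg = int((y_true == 0).sum())
        rank_sum_pos = ranks[y_true == 1].sum()
        return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)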

[Dietterich, 1998]
Thomas G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7):1895-1924, 1998.
Comment: A critique of paired t-tests.

[Domingos, 2000]
Pedro Domingos. A unified bias-variance decomposition and its applications. In Pat Langley, editor, Machine Learning, Proceedings of the 17th International Conference. Morgan Kaufmann, 2000.
Abstract: This paper presents a unified bias-variance decomposition that is applicable to squared loss, zero-one loss, variable misclassification costs, and other loss functions. The unified decomposition sheds light on a number of significant issues: the relation between some of the previously-proposed decompositions for zero-one loss and the original one for squared loss, the relation between bias, variance and Schapire et al.'s (1997) notion of margin, and the nature of the trade-off between bias and variance in classification. While the bias-variance behavior of zero-one loss and variable misclassification costs is quite different from that of squared loss, this difference derives directly from the different definitions of loss. We have applied the proposed decomposition to decision tree learning, instance-based learning and boosting on a large suite of benchmark data sets, and made several significant observations.
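Illustration: a rough Monte Carlo estimate of bias and variance under zero-one loss, in the spirit of the decomposition discussed above; it approximates the optimal prediction by the observed test label and omits the differential weighting of variance on biased examples, so it is a simplified sketch rather than Domingos' exact decomposition. Inputs are assumed to be NumPy arrays, and fit_predict is a hypothetical user-supplied helper.

    import numpy as np
    from collections import Counter

    def bias_variance_01(fit_predict, X, y, X_test, y_test, rounds=50, seed=0):
        # fit_predict(Xtr, ytr, Xte) trains a fresh model on (Xtr, ytr) and
        # returns class predictions for Xte; training sets are bootstrap
        # resamples of (X, y).
        rng = np.random.default_rng(seed)
        preds = []
        for _ in range(rounds):
            idx = rng.integers(0, len(X), size=len(X))
            preds.append(fit_predict(X[idx], y[idx], X_test))
        preds = np.array(preds)                          # (rounds, n_test)
        bias = variance = 0.0
        for j in range(len(X_test)):
            main_pred = Counter(preds[:, j]).most_common(1)[0][0]
            bias += float(main_pred != y_test[j])        # loss of the main prediction
            variance += float(np.mean(preds[:, j] != main_pred))
        return bias / len(X_test), variance / len(X_test)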

[Efron, 1982]
Bradley Efron. The Jackknife, the Bootstrap, and Other Resampling Plans. Society for Industrial and Applied Mathematics, 1982.
Keywords: cross-validation, bootstrapping

[Efron, 1983]
Bradley Efron. Estimating the error rate of a prediction rule: Improvements on cross-validation. Journal of the American Statistical Association, 78:316-331, 1983.
Keywords: cross-validation, bootstrapping

[Engels and Theusinger, 1998a]
Robert Engels and Christiane Theusinger. Support for data transformation in machine learning applications. In C. Giraud-Carrier and M. Hilario, editors, ECML'98 Workshop Notes - Upgrading Learning to the Meta-Level: Model Selection and Data Transformation, Technical Report CSR-98-02, Fakultät für Informatik, Technische Universität Chemnitz, 1998.

[Engels and Theusinger, 1998b]
Robert Engels and Christiane Theusinger. Using a data metric for offering preprocessing advice in data-mining applications. In Henri Prade, editor, Proceedings of the 13th European Conference on Artificial Intelligence, pages 430-434, Brighton, UK, 1998. IOS Press, Amsterdam.

[Engels, 1999]
Robert Engels. Component-Based User Guidance in Knowledge Discovery and Data Mining. PhD thesis, Dissertationen zur Künstlichen Intelligenz, ISBN 3-89838-211-7, 1999.

[Feelders and Verkooijen, 1995]
A. Feelders and W. Verkooijen. Which method learns most from the data? Methodological issues in the analysis of comparative studies. In Proceedings of the 5th International Workshop on Artificial Intelligence and Statistics, pages 219-225, Fort Lauderdale, Florida, 1995.
Comment: A theoretical paper showing one way for doing comparisons of a number of learning algorithms in a statistically sound manner. It covers both regression and classification.

[Gama and Brazdil, 2000]
Joao Gama and Pavel Brazdil. Cascade generalization. Machine Learning, 41(3):315-343, 2000.
Comment: Like stacking, cascading is a meta-classification scheme that uses the predictions of base classifiers to improve predictions with a meta-level classifier. It differs from stacking in that the input is enriched with the predictions of a single classifier at each level (rather than replaced by the predictions of several classifiers), and in that several meta levels can be used, each enriching the dataset of the previous layer with additional features (hence a cascade of classifiers).
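Illustration: a very small sketch of the cascading idea described in the comment, written with scikit-learn estimators; obtaining the class probabilities on the same data each level was trained on is a simplification and not necessarily the exact procedure of the paper.

    import numpy as np
    from sklearn.base import clone
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    def cascade_fit_predict(X_train, y_train, X_test, levels):
        # levels: classifiers applied in order; each level's class
        # probabilities are appended as extra input features for the next
        # level, and the last level yields the final prediction.
        Xtr, Xte = np.asarray(X_train, float), np.asarray(X_test, float)
        for clf in levels[:-1]:
            model = clone(clf).fit(Xtr, y_train)
            Xtr = np.hstack([Xtr, model.predict_proba(Xtr)])
            Xte = np.hstack([Xte, model.predict_proba(Xte)])
        return clone(levels[-1]).fit(Xtr, y_train).predict(Xte)

    # e.g. a two-level cascade, naive Bayes feeding a decision tree:
    # y_pred = cascade_fit_predict(X_tr, y_tr, X_te,
    #                              [GaussianNB(), DecisionTreeClassifier()])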

[Gordon and Segre, 1996]
G. Gordon and A. Segre. Nonparametric statistical methods for experimental evaluation of speedup learning. In L. Saitta, editor, Machine Learning, Proceedings of the 13th International Conference. Morgan Kaufmann, 1996.

[Halkidi et al., 2000]
M. Halkidi, M. Vazirgiannis, and Y. Batistakis. Quality scheme assessment in the clustering process. In D.A. Zighed, J. Komorowski, and J. Zytkow, editors, Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD2000), pages 265-276. Springer, 2000.

[Hilario and Kalousis, 1999]
M. Hilario and A. Kalousis. Building algorithm profiles for prior model selection in knowledge discovery systems. In Proceedings of IEEE SMC'99, International Conference on Systems, Man and Cybernetics. IEEE Press, October 1999.
Abstract: We propose the use of learning algorithm profiles to address the model selection problem in knowledge discovery systems. These profiles consist of metalevel feature-value vectors which describe learning algorithms from the point of view of their representation and functionality, efficiency, robustness, and practicality. Values for these features are assigned on the basis of author specifications, expert consensus or previous empirical studies. We review past evaluations of the better known learning algorithms and suggest an experimental strategy for building algorithm profiles on more quantitative grounds. Preliminary experiments have disconfirmed expert judgments on certain algorithm features, thus showing the need to build and refine such profiles via controlled experiments.
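Illustration: one way to realize such profiles is a metalevel feature-value vector per algorithm; the feature names and values below are invented for illustration and are not taken from the paper.

    # Hypothetical algorithm profiles as metalevel feature-value vectors.
    algorithm_profiles = {
        "C4.5": {
            "representation": "decision tree",
            "handles_missing_values": True,
            "handles_numeric_features": True,
            "incremental": False,
        },
        "k-NN": {
            "representation": "instance-based",
            "handles_missing_values": False,
            "handles_numeric_features": True,
            "incremental": True,
        },
    }

    def candidate_algorithms(requirements, profiles=algorithm_profiles):
        # Return the algorithms whose profile satisfies every requirement.
        return [name for name, profile in profiles.items()
                if all(profile.get(k) == v for k, v in requirements.items())]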

[Hilario and Kalousis, 2000]
M. Hilario and A. Kalousis. Building algorithm profiles for prior model selection in knowledge discovery systems. Engineering Intelligent Systems, 8(2), 2000.
Abstract: We propose the use of learning algorithm profiles to address the model selection problem in knowledge discovery systems. These profiles consist of metalevel feature- value vectors which describe learning algorithms from the point of view of their representation and functionality, efficiency, resilience, and practicality. Values for these features are assigned on the basis of author specifications, expert consensus or previous empirical studies. We review past evaluations of the better known learning algorithms and suggest an experimental strategy for building algorithm profiles on more quantitative grounds. Preliminary experiments have disconfirmed expert judgments on certain algorithm features, thus showing the need to build and refine such profiles via controlled experiments.

[Hilderman and Hamilton, 2000]
R.J. Hilderman and H.J. Hamilton. Applying objective interestingness measures in data mining systems. In D.A. Zighed, J. Komorowski, and J. Zytkow, editors, Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD2000), pages 432-439. Springer, 2000.

[Hilderman et al., 1999]
R.J. Hilderman, H.J. Hamilton, and B. Barber. Ranking the interestingness of summaries from data mining systems. In Proceedings of the Twelfth International Florida AI Research Society Conference (FLAIRS'99), pages 100-106, 1999.
Comment: Treats summarization as a kind of table compression; defines four interestingness measures; shows empirically that the measures are correlated whenever this can be anticipated from their definitions; and shows that the interestingness measures give preference to low-complexity summaries.

[Hilderman, 2000]
R.J. Hilderman. Mining Summaries from Databases Using Domain Generalization Graphs and Objective Measures of Interestingness. PhD thesis, Computer Science Department, University of Regina, 2000.

[James and Hastie, 1997]
Gareth James and Trevor Hastie. Generalizations of the bias/variance decomposition for prediction error. Technical report, Statistics Department, Stanford University, 1997.
Abstract: The bias and variance of a real valued random variable, using squared error loss, are well understood. However, because of recent developments in classification techniques it has become desirable to extend these concepts to general random variables and loss functions. The 0-1 (misclassification) loss function with categorical random variables has been of particular interest. We explore the concepts of variance and bias and develop a decomposition of the prediction error into functions of the systematic and variable parts of our predictor. After providing some examples we conclude with a discussion of the various definitions that have been proposed.

[Kohavi and Wolpert, 1996]
Ron Kohavi and David H. Wolpert. Bias plus variance decomposition for zero-one loss functions. In L. Saitta, editor, Machine Learning, Proceedings of the 13th International Conference. Morgan Kaufmann, 1996.

[Kohavi, 1995]
Ron Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In C.S. Mellish, editor, Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pages 1137-1143, Montreal, Canada, 1995. Morgan Kaufmann.
Keywords: cross-validation, bootstrapping

[Kononenko and Bratko, 1991]
Igor Kononenko and Ivan Bratko. Information-based evaluation criterion for classifier's performance. Machine Learning, 6:67-80, 1991.
Comment: The paper suggests the use of a measure called ``information score'' to assess the performance of classifiers on problems with skewed class distributions.
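Illustration: a sketch of the average information score as it is commonly stated, where a prediction is rewarded only insofar as the classifier's probability for the correct class improves on its prior (so guessing the majority class scores close to zero); the exact formulation should be checked against the paper.

    import numpy as np

    def average_information_score(priors, posteriors, true_classes):
        # priors: dict class -> prior probability (relative frequency);
        # posteriors: list of dicts with the classifier's probability
        # estimates per test example; true_classes: the correct classes.
        scores = []
        for post, c in zip(posteriors, true_classes):
            p, q = priors[c], post.get(c, 0.0)
            if q >= p:
                scores.append(-np.log2(p) + np.log2(q))
            else:
                scores.append(-(-np.log2(1 - p) + np.log2(1 - q)))
        return float(np.mean(scores))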

[Lindner and Studer, 1999]
Guido Lindner and Rudi Studer. AST: Support for algorithm selection with a CBR approach. In Jan Rauch and Jan Zytkow, editors, Proceedings of the 3rd International Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-99), pages 418-423, Prague, Czech Republic, 1999. Springer, Berlin.

[Muyeba and Keane, 2000]
M.K. Muyeba and J.A. Keane. Interestingness in attribute-oriented induction (AOI): Multiple-level rule induction. In D.A. Zighed, J. Komorowski, and J. Zytkow, editors, Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD2000), pages 542-549. Springer, 2000.
Comment: Summarization over attributes with hierarchical values; groups cases with similar attribute values (taking the hierarchy into account); the paper is unclear.

[Nakhaeizadeh and Schnabl, 1997]
Gholamreza Nakhaeizadeh and Alexander Schnabl. Development of multi-criteria metrics for evaluation of data mining algorithms. In David Heckerman, Heikki Mannila, and Daryl Pregibon, editors, Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (KDD-97), pages 37-42, Newport Beach, CA, 1997. AAAI Press, CA.
Comment: When multiple performance criteria are of interest, the problem of mapping the performance results onto comparable scalar values arises. This work proposes the ``Data Envelopment Analysis'' (DEA) methodology as a solution: positive performance metrics, like accuracy, and negative ones, like training time, are combined into a single performance ratio called ``efficiency''. A common weighting scheme is not required; the weights are computed individually for each of the algorithms under inspection. Despite the intuition that such a weighting scheme would be biased, it is shown that an objective comparison of the algorithms is possible. However, a methodology for using the comparison results to select the most appropriate algorithm for a new problem is not proposed.
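Illustration: a sketch of the standard CCR multiplier form of DEA, which computes each algorithm's efficiency under the weights most favourable to it while keeping every algorithm's ratio at or below one; this is one common DEA formulation and may differ in detail from the variant used in the paper. It assumes SciPy's linear-programming routine.

    import numpy as np
    from scipy.optimize import linprog

    def dea_efficiency(outputs, inputs, k):
        # outputs: (n_algos, n_pos) matrix of "positive" metrics (e.g. accuracy);
        # inputs:  (n_algos, n_neg) matrix of "negative" metrics (e.g. training time);
        # k: index of the algorithm whose efficiency is computed.
        Y, X = np.asarray(outputs, float), np.asarray(inputs, float)
        n, p = Y.shape
        q = X.shape[1]
        c = np.concatenate([-Y[k], np.zeros(q)])           # maximise u . y_k
        A_eq = np.concatenate([np.zeros(p), X[k]])[None]   # subject to v . x_k = 1
        A_ub = np.hstack([Y, -X])                          # and u . y_j - v . x_j <= 0
        res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                      A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (p + q))
        return -res.fun                                    # efficiency of algorithm k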

[Rendell and Cho, 1990]
Larry A. Rendell and H. Cho. Empirical learning as a function of concept character. Machine Learning, 5:267-298, 1990.
Comment: This paper characterises and investigates the extensive role of data character as a determiner of system behavior in empirical concept learning. The main contribution is the consideration of concepts as functions or surfaces over the instance space, which 1) leads to a useful characterisation of complexity based on shape, size and concentration, and 2) allows systematic artificial data generation. The results, which focus on measures of complexity, show that shape and especially concentration have significant effects. Furthermore, the degree of degradation due to class error may depend on other data characteristics such as concept size. Inspired by some of the results, it is suggested that constructive induction focus on transforming the instance space so as to merge peaks (i.e., increase concentration). The results are based on only two learning algorithms: ID3 and PLS1.

[Ricci and Aha, 1998]
F. Ricci and D. Aha. Bias, variance, and error correction output codes for local learners. Technical Report AIC-095-025, Naval Research Laboratory, Navy Center for Applied Research in Artificial Intelligence, 1998.
Abstract: This paper focuses on a bias variance decomposition analysis of a local learning algorithm, the nearest neighbor classifier, that has been extended with error correcting output codes. This extended algorithm often considerably reduces the 0-1 (i.e., classification) error in comparison with nearest neighbor (Ricci & Aha, 1997). The analysis presented here reveals that this performance improvement is obtained by drastically reducing bias at the cost of increasing variance. We also show that, even in classification problems with few classes (m<=5), extending the codeword length beyond the limit that assures column separation yields an error reduction. This error reduction is not only in the variance, which is due to the voting mechanism used for error-correcting output codes, but also in the bias.
Keywords: Error-correcting output codes, lazy learning algorithms, feature selection, bias/variance decomposition
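Illustration: a tiny sketch of classification with error-correcting output codes around a nearest-neighbour base learner, using scikit-learn; the code matrix and helper function are illustrative and not taken from the report.

    import numpy as np
    from sklearn.base import clone
    from sklearn.neighbors import KNeighborsClassifier

    def ecoc_fit_predict(X_train, y_train, X_test, code_matrix, classes,
                         base=KNeighborsClassifier(n_neighbors=1)):
        # code_matrix: (n_classes, n_bits) array of 0/1 codewords, one row
        # per entry of `classes`.  One binary learner is trained per bit; a
        # test point is assigned to the class whose codeword is nearest in
        # Hamming distance to the vector of predicted bits.
        code_matrix = np.asarray(code_matrix)
        class_index = {c: i for i, c in enumerate(classes)}
        bit_preds = []
        for b in range(code_matrix.shape[1]):
            yb = np.array([code_matrix[class_index[c], b] for c in y_train])
            bit_preds.append(clone(base).fit(X_train, yb).predict(X_test))
        bits = np.column_stack(bit_preds)                        # (n_test, n_bits)
        hamming = np.abs(bits[:, None, :] - code_matrix[None]).sum(axis=-1)
        return np.asarray(classes)[hamming.argmin(axis=1)]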

[Rissanen, 1978]
J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.
Keywords: minimum-description length principle

[Sahar, 1999]
S. Sahar. Interestingness via what is not interesting. In S. Chaudhuri and D. Madigan, editors, Proceedings of the Fifth International Conference on Knowledge Discovery & Data Mining, pages 332-336. ACM, 1999.

[Salzberg, 1997]
S. Salzberg. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3):317-328, 1997.

[Stone, 1974]
M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society B, 36:111-147, 1974.
Keywords: cross-validation

[Wallace and Boulton, 1968]
C. S. Wallace and D. M. Boulton. An information measure for classification. Computer Journal, 11:185-194, 1968.
Keywords: minimum-description length principle