
Project MetaL: On-line Bibliography on Meta-Learning

Meta-Learning

This page is part of the On-line Bibliography on Meta-Learning, which is provided within the ESPRIT METAL Project (26.357), A Meta-Learning Assistant for Providing User Support in Machine Learning and Data Mining. This part of the bibliography covers meta-learning approaches.

If you think of a paper that should be here, please let us know.


73 references, last updated Fri Mar 9 13:02:50 MET 2001

[Aha, 1992]
D. Aha. Generalizing from case studies: A case study. In D. Sleeman and P. Edwards, editors, Machine Learning, Proceedings of the 9th International Conference. Morgan Kaufmann, 1992.
Abstract: Most empirical evaluations of machine learning algorithms are case studies - evaluations of multiple algorithms on multiple databases. Authors of case studies implicitly or explicitly hypothesize that the pattern of their results, which often suggests that one algorithm performs significantly better than others, is not limited to the small number of databases investigated, but instead holds for some general class of learning problems. However, these hypotheses are rarely supported with additional evidence, which leaves them suspect. This paper describes an empirical method for generalizing results from case studies and an example application. This method yields rules describing when some algorithms significantly outperform others on some dependent measures. Advantages for generalizing from case studies and limitations of this particular approach are also described.
Comment: Aha proposes to construct parametrized variants of datasets and to study the behaviour of algorithms on these artificial datasets in order to obtain more knowledge about their behaviour under different circumstances than would be possible with a single experiment. He demonstrates his technique in a case study using the letter recognition database (which is easy to parametrize).
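Sketch: A minimal rendering of this methodology, assuming scikit-learn; the dataset generator, the parameter grid and the two learners are illustrative choices, not those used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

learners = {"tree": DecisionTreeClassifier(random_state=0),
            "knn": KNeighborsClassifier()}

records = []
for n_samples in (100, 500, 2000):
    for noise in (0.0, 0.1, 0.3):                 # fraction of flipped labels
        X, y = make_classification(n_samples=n_samples, n_features=10,
                                   n_informative=5, flip_y=noise,
                                   random_state=0)
        scores = {name: cross_val_score(clf, X, y, cv=5).mean()
                  for name, clf in learners.items()}
        records.append((n_samples, noise, max(scores, key=scores.get), scores))

# 'records' plays the role of the case-study results from which rules such as
# "knn wins when noise is low and examples are plentiful" could be induced.
for rec in records:
    print(rec)
```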

[Bay and Pazzani, 2000]
Stephen D. Bay and Michael J. Pazzani. Characterizing model errors and differences. In Pat Langley, editor, Machine Learning, Proceedings of the 17th International Conference. Morgan Kaufmann, 2000.
Comment: Bay and Pazzani propose to use a meta-classification scheme for characterizing model errors. The idea is to train a meta-learner to discriminate between correct and incorrect predictions of a base learner. However, they did not use this approach for decision making, but instead aimed at providing insight into the domain regions in which a learner is not able to discriminate well. Of the two level-1 learners they analyzed, they found that C5.0 does not produce sufficiently understandable concept descriptions, while their own algorithm STUCCO was a little better.
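Sketch: A minimal version of the error-characterization idea, assuming scikit-learn; the base learner, meta-learner and dataset are illustrative stand-ins for those used in the paper.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)

# Out-of-fold predictions of the base learner (avoids optimistically
# labelled meta-data).
base_pred = cross_val_predict(GaussianNB(), X, y, cv=10)
correct = (base_pred == y).astype(int)          # 1 = correct, 0 = error

# The meta-learner is used to describe the regions of the input space where
# the base learner fails, not for decision making.
meta = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, correct)
print(export_text(meta))
```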

[Bensusan and Giraud-Carrier, 2000a]
H. Bensusan and C. Giraud-Carrier. Discovering task neighbourhoods through landmark learning performances. In D.A. Zighed, J. Komorowski, and J. Zytkow, editors, Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD2000), pages 325-330. Springer, 2000.
Abstract: Meta-learning is concerned with the selection of a suitable learning tool for a given application. Landmarking is a novel approach to meta-learning. It uses simple, quick and efficient learners to describe tasks and therefore to locate the problem in the space of expertise areas of the learners being considered. It relies on the performance of a set of selected learning algorithms to uncover the sort of learning tool that the task requires. The paper presents landmarking and reports its performance in experiments involving both artificial and real-world databases. The experiments are performed in a supervised learning scenario where a task is classified according to the most suitable learner from a pool. Meta-learning hypotheses are constructed from some tasks and tested on others. The experiments contrast the new technique with an information-theoretical approach to meta-learning. Results show that landmarking outperforms its competitor and satisfactorily selects suitable learning tools in all cases examined.

[Bensusan and Giraud-Carrier, 2000b]
H. Bensusan and C. Giraud-Carrier. If you see La Sagrada Familia, you know where you are: Landmarking the learner space. Technical report, Department of Computer Science, University of Bristol, 2000.

[Bensusan and Giraud-Carrier, 2000c]
Hilan Bensusan and Christophe Giraud-Carrier. Casa Batlló is in Passeig de Gràcia or landmarking the expertise space. In J. Keller and C. Giraud-Carrier, editors, Proceedings of the ECML-00 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, pages 29-46, Barcelona, Spain, 2000.
Abstract: Task description is crucial not only to every meta-learning enterprise but also to related endeavours like transfer of learning. This paper evaluates the performance of a newly introduced method of task description, landmarking, in a supervised meta-learning scenario. The method relies on correlations between simple and more sophisticated learning algorithms to select the best learner for a task. The results compare favourably with an information-based method and suggest that landmarking holds promise.

[Bensusan and Giraud-Carrier, 2000d]
Hilan Bensusan and Christophe Giraud-Carrier. Harmonia loosely praestabilita: discovering adequate inductive strategies. In Proceedings of the 22nd Annual Meeting of the Cognitive Science Society, pages 609-614. Cognitive Science Society, August 2000.
Abstract: Landmarking is a novel approach to inductive model selection in Machine Learning. It uses simple, bare-bone inductive strategies to describe tasks and induce correlations between tasks and strategies. The paper presents the technique and reports experiments showing that landmarking performs well in a number of different scenarios. It also discusses the implications of landmarking to our understanding of inductive refinement.

[Bensusan and Williams, 1997]
H. Bensusan and P. Williams. Learning to learn Boolean tasks by decision tree descriptors. In M. van Someren and G. Widmer, editors, 9th European Conference on Machine Learning (Poster Papers), pages 1-11, 1997.

[Bensusan et al., 2000a]
Hilan Bensusan, Christophe Giraud-Carrier, and Claire Kennedy. A higher-order approach to meta-learning. In Proceedings of the ECML'2000 workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, pages 109-117. ECML'2000, June 2000.
Abstract: Meta-learning, as applied to model selection, consists of inducing mappings from tasks to learners. Traditionally, tasks are characterised by the values of pre-computed meta-attributes, such as statistical and information-theoretic measures, induced decision trees' characteristics and/or landmarkers' performances. In this position paper, we propose to (meta-)learn directly from induced decision trees, rather than rely on an ad hoc set of pre-computed characteristics. Such meta-learning is possible within the framework of the typed higher-order inductive learning framework we have developed.

[Bensusan et al., 2000b]
Hilan Bensusan, Christophe Giraud-Carrier, and Bernhard Pfahringer. What works well tells us what works better. In Proceedings of ICML'2000 workshop on What Works Well Where, pages 1-8. ICML'2000, June 2000.
Abstract: We have now a large number of learning algorithms available. What works well where? In order to find correlations between areas of expertise of learning algorithms and learning tasks, we can resort to meta-learning. Several meta-learning scenarios have been proposed. In one scenario, we are searching for the best learning algorithm for a problem. The decision can be made using different strategies. In any approach to meta-learning, it is crucial to choose relevant features to describe a task. Different strategies of task description have been proposed: some strategies based on statistical features of the dataset, some based on information-theoretic properties, others based on a learning algorithm's representation of the task. In this work we present a novel approach to task description, called landmarking.

[Bensusan, 1998a]
H. Bensusan. God doesn't always shave with Occam's razor - learning when and how to prune. In C. Nédellec and C. Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning, pages 119-124, 1998.
Abstract: This paper shows how a meta-learning technique can be applied to decide when to prune, how much pruning is appropriate and what the best pruning technique is for a given learning task. The meta-learning technique uses unpruned decision trees and information from decision tree construction to describe the learning tasks. Results concerning two decision tree pruning strategies (Quinlan's error-based pruning and a cost-complexity pruning) show that the technique improves overall accuracy. The paper suggests that induction on the connection between tasks and biases is the way to decide about the convenience of the different simplicity biases.

[Bensusan, 1998b]
Hilan Bensusan. Odd bites into bananas don't make you blind: learning about simplicity and attribute addition. In Proceedings of the ECML'98 Workshop on Upgrading Learning to the Meta-Level: Model Selection and Data Transformation, pages 30-42. ECML'98, April 1998.
Abstract: This paper shows how a meta-learning technique can be applied to decisions about pruning and representation adequacy. It describes a meta-learning technique that uses unpruned decision trees and information from decision tree construction to describe the learning tasks. Based on previous experience, a meta-learner decides which attribute addition strategy is the more appropriate for a given new learning problem. The technique is applied to simplicity and representation issues. In the simplicity camp, it is used to decide when to prune, how much pruning is appropriate and what the best pruning technique is for a given learning task. In constructive induction, it is used to decide between a pool of alternative new attribute constructors. Results suggest that induction on the connection between problems and simplicity and representation biases improves learning performance.

[Bensusan, 1999]
Hilan Bensusan. Automatic bias learning: An inquiry into the inductive basis of induction. PhD thesis, School of Cognitive and Computing Sciences, University of Sussex, July 1999.
Abstract: This thesis combines an epistemological concern about induction with a computational exploration of inductive mechanisms. It aims to investigate how inductive performance could be improved by using induction to select appropriate generalisation procedures. The thesis revolves around a meta-learning system, called The Entrencher, designed to investigate how inductive performances could be improved by using induction to select appropriate generalisation procedures. The performance of The Entrencher is discussed against the background of epistemological issues concerning induction, such as the role of theoretical vocabularies and the value of simplicity. After an introduction about machine learning and epistemological concerns with induction, Part I looks at learning mechanisms. It reviews some concepts and issues in machine learning and presents The Entrencher. The system is the first attempt to develop a learning system that induces over learning mechanisms through algorithmic representations of tasks. Part II deals with the need for theoretical terms in induction. Experiments where The Entrencher selects between different strategies for representation change are reported. The system is compared to other methods and some conclusions are drawn concerning how best to use the system. Part III considers the connection between simplicity and inductive success. Arguments for Occam's razor are considered and experiments are reported where The Entrencher is used to select when, how and how much a decision tree needs to be pruned. Part IV looks at some philosophical consequences of the picture of induction that emerges from the experiments with The Entrencher and goes over the motivations for meta-learning. Based on the picture of induction that emerges in the thesis, a new position in the scientific realism debate, transcendental surrealism, is proposed and defended. The thesis closes with some considerations concerning induction, justification and epistemological naturalism.

[Berrer et al., 2000]
Helmut Berrer, Iain Paterson, and Joerg Keller. Evaluation of machine-learning algorithm ranking advisors. In Pavel Brazdil and Alipio Jorge, editors, Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support, Meta-Learning and ILP: Forum for Practical Problem Presentation and Prospective Solutions, Lyon, France, 2000.
Keywords: Meta-learning, Ranking, k-NN

[Bloedorn et al., 1993]
E. Bloedorn, R.S. Michalski, and J. Wnek. Multistrategy constructive induction: AQ17-MCI. In Proceedings of the 2nd International Workshop on Multistrategy Learning, pages 188-202, 1993.
Comment: In this paper, meta-rules are built from meta-data characterizing datasets to guide the selection of operators. According to the information source used to select operators and attributes, constructive induction methods are classified into three categories: data-driven, hypothesis-driven, and knowledge-driven methods. In data-driven constructive induction, information from the training examples is used; for example, in order to select attributes from which new ones will be derived, one may use the attributes' information gain. In hypothesis-driven constructive induction, results from the analysis of the form of intermediate hypotheses are used.

[Brazdil and Henery, 1994]
P. Brazdil and R. Henery. Analysis of results. Chapter 10 of Machine Learning, Neural and Statistical Classification, 1994.

[Brazdil and Soares, 1999a]
P. Brazdil and C. Soares. Exploiting past experience in ranking classifiers: Comparison between different ranking methods. In C. Giraud-Carrier and B. Pfahringer, editors, Recent Advances in Meta-Learning and Future Work, pages 48-58. Jozef Stefan Institute, 1999.

[Brazdil and Soares, 1999b]
P. Brazdil and C. Soares. Exploiting past experience in ranking classifiers. In H. Bacelar-Nicolau, F. Costa Nicolau, and J. Janssen, editors, Applied Stochastic Models and Data Analysis, pages 299-304. Instituto Nacional de Estatística, 1999.
Comment: Presents three methods to aggregate performance information into a ranking of candidate algorithms; describes an evaluation methodology for ranking based on rank correlation; compares ranking methods using results of 6 algorithms on 16 datasets
Keywords: Ranking Methods, Ranking Evaluation, Rank Correlation, Meta-Learning
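Sketch: The core of a rank-correlation-based evaluation of a ranking advisor, assuming SciPy; the ranks below are invented for illustration, and the papers' exact correlation variant may differ from plain Spearman.

```python
from scipy.stats import spearmanr

# Each position corresponds to one candidate algorithm; rank 1 = best.
recommended_rank = [1, 2, 3, 4, 5, 6]    # ranking proposed by the advisor
observed_rank    = [2, 1, 3, 6, 4, 5]    # ranking measured on the new dataset

rho, _ = spearmanr(recommended_rank, observed_rank)
print(f"Spearman correlation between the two rankings: {rho:.2f}")
```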

[Brazdil and Soares, 2000a]
P. Brazdil and C. Soares. A comparison of ranking methods for classification algorithm selection. In R.L. de Mántaras and E. Plaza, editors, Machine Learning: Proceedings of the 11th European Conference on Machine Learning ECML2000, pages 63-74. Springer, 2000.
Comment: Presents three methods to aggregate performance information into a ranking of candidate algorithms; describes an evaluation methodology for ranking based on rank correlation; compares ranking methods using results of 6 algorithms on 16 datasets
Keywords: Ranking Methods, Ranking Evaluation, Rank Correlation, Meta-Learning

[Brazdil and Soares, 2000b]
P. Brazdil and C. Soares. Ranking classification algorithms based on relevant performance information. In J. Keller and C. Giraud-Carrier, editors, Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, pages 61-71, 2000.
Comment: presents an IBL-based ranking method that uses general, statistical and information-theoretic measures to characterize datasets; presents a way of combining success rate and time in a performance measure; presents results based on meta-data for 6 algorithms on 16 datasets
Keywords: Meta-Learning, Ranking, Multicriterion Evaluation
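Sketch: A toy version of the IBL-based ranking idea; the dataset measures, the distance and the aggregation by mean rank are simplifications (in particular, the combination of success rate and execution time used in the paper is omitted), and all numbers are invented for illustration.

```python
import numpy as np

# Meta-data: dataset characteristics and the observed ranks of three algorithms.
meta_features = np.array([[1000, 10, 2],      # n_examples, n_attributes, n_classes
                          [500, 30, 5],
                          [20000, 8, 2],
                          [3000, 50, 10]], dtype=float)
observed_ranks = np.array([[1, 2, 3],         # rank of algorithm A, B, C per dataset
                           [3, 1, 2],
                           [1, 3, 2],
                           [2, 1, 3]], dtype=float)
algo_names = ["A", "B", "C"]

def recommend(new_features, k=2):
    # Normalise the measures, find the k nearest stored datasets,
    # and order algorithms by their mean rank on those neighbours.
    mu, sigma = meta_features.mean(axis=0), meta_features.std(axis=0)
    scaled = (meta_features - mu) / sigma
    query = (np.asarray(new_features, dtype=float) - mu) / sigma
    nearest = np.argsort(np.linalg.norm(scaled - query, axis=1))[:k]
    mean_rank = observed_ranks[nearest].mean(axis=0)
    return [algo_names[i] for i in np.argsort(mean_rank)]

print(recommend([800, 12, 2]))   # recommended ordering for the new dataset
```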

[Brazdil and Soares, 2000c]
P. Brazdil and C. Soares. Zoomed ranking: Selection of classification algorithms based on relevant performance information. In D. Oblinger, I. Rish, J. Hellerstein, and R. Vilalta, editors, "What Works Well Where?" Workshop of ICML2000, 2000.
Comment: presents an IBL-based ranking method that uses general, statistical and information-theoretic measures to characterize datasets; presents a way of combining success rate and time in a performance measure
Keywords: Meta-Learning, Ranking, Multicriterion Evaluation

[Brazdil and Soares, 2000d]
P.B. Brazdil and C. Soares. A comparison of ranking methods for classification algorithm selection. In Machine Learning: ECML-00. LNAI, Springer Verlag, 2000.

[Brazdil and Soares, 2001]
P. Brazdil and C. Soares. Reducing rankings of classifiers by eliminating redundant cases. In JOCLAD 2001: VII Jornadas de Classificação e Análise de Dados, pages 76-79, 2001.
Comment: presents a method to eliminate the algorithms in a ranking that are not expected to bring any improvement over the others; presents an evaluation methodology capable of handling rankings of uneven length
Keywords: Ranking, Ranking Evaluation, Ranking Redundancy Elimination

[Brazdil et al., 1994]
P. Brazdil, J. Gama, and R. Henery. Characterizing the applicability of classification algorithms using meta level learning. In F. Bergadano and L. de Raedt, editors, Machine Learning - ECML-94. LNAI 784, Springer Verlag, 1994.
Comment: An interesting approach is proposed in this paper (and further explored in (Gama and Brazdil, 1995)). They used results from the STATLOG project and proposed a method to automatically derive rules that guide classifier selection. The approach is based on the characteristics of the data. They define a set of characteristics that are expected to affect the performance of the classifiers (performance defined in terms of predictive accuracy). Then, they invoke machine learning techniques to create models that associate the characteristics with the performance measures of the classifiers. The main advantage of this approach is the automated procedure for producing new knowledge on the expected performance of each new classifier. However, the method has only been used on a limited number of problems and incorporates a relatively limited set of data characteristics. Moreover, accuracy is the sole performance measure.

[Breiman, 1996]
Leo Breiman. Stacked regressions. Machine Learning, 24:49-64, 1996.
Abstract: Stacking Regressions is a method for forming linear combinations of different predictors to give improved prediction accuracy. The idea is to use cross-validation data and least squares under non-negativity constraints to determine the coefficients in the combination. Its effectiveness is demonstrated in stacking regression trees of different sizes and in a simulation stacking linear subset and ridge regressions. Reasons why this method works are explored. The idea of stacking originated with Wolpert [1992].
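Sketch: The core of stacked regressions as described in the abstract, assuming scikit-learn and SciPy; the two base regressors and the synthetic data are illustrative.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=4, random_state=0)]

# Column j holds the cross-validated predictions of base model j.
Z = np.column_stack([cross_val_predict(m, X, y, cv=10) for m in base_models])

# Least squares under non-negativity constraints yields the stacking coefficients.
coef, _ = nnls(Z, y)
print("stacking coefficients:", coef)

# To predict, refit the base models on all data and combine their outputs.
fitted = [m.fit(X, y) for m in base_models]
y_hat = sum(c * m.predict(X) for c, m in zip(coef, fitted))
print("training MSE of the stack:", np.mean((y_hat - y) ** 2))
```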

[Brodley, 1995]
C. E. Brodley. Recursive automatic bias selection for classifier construction. Machine Learning, 20:63-94, 1995.
Comment: Each algorithm has a ``selective superiority'', i.e., it is better than the rest for specific types of problems. This happens because each algorithm has a so-called ``inductive bias'' caused by the assumptions it makes in order to generalize from the training data to unseen examples. The system described here is MCS. In that system, the selection of models and algorithms from a pool of available ones is performed on the basis of existing knowledge from the expert, encoded in the form of rules. However, those rules are built into the system and cannot be extended when new models and algorithms become available.

[Chan and Stolfo, 1995]
Philip Kin-Wah Chan and Salvatore J. Stolfo. A comparative evaluation of voting and meta-learning on partitioned data. In Proceedings of the 12th International Conference on Machine Learning (ICML-95), pages 90-98, 1995.
Comment: This paper describes the idea of arbiters and combiners from Chan's thesis (Chan, 1996) and shows some experimental results.

[Chan and Stolfo, 1996]
Philip Kin-Wah Chan and Salvatore J. Stolfo. Sharing learned models among remote database partitions by local meta-learning. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 2-7, 1996.

[Chan and Stolfo, 1997]
Philip Kin-Wah Chan and Salvatore J. Stolfo. On the accuracy of meta-learning for scalable data mining. Journal of Intelligent Information Systems, 8:5-28, 1997.

[Chan, 1996]
Philip Kin-Wah Chan. An Extensible Meta-Learning Approach for Scalable and Accurate Inductive Learning. PhD thesis, Department of Computer Science, Columbia University, New York, NY, 1996. Technical Report CUCS-044-96.
Comment: Chan's notion of meta-learning is comparable to stacking, cascading, etc. He proposes the concepts of combiners and arbiters. A combiner is learned from a training set that uses the predictions of the base classifiers as attributes (possibly in addition to the original attributes) and the correct prediction as the class attribute. An arbiter is a special classifier that is trained on the subset of the available instances on which the base classifiers disagree. It is used for tie-breaking in such cases.
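Sketch: A much-simplified illustration of the combiner and arbiter ideas on a single training set (Chan's setting is partitioned data), assuming scikit-learn; the base and meta learners are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
base = [GaussianNB(), DecisionTreeClassifier(max_depth=3, random_state=0)]

# Cross-validated predictions of the base classifiers become meta-attributes.
P = np.column_stack([cross_val_predict(b, X, y, cv=10) for b in base])

# Combiner: learns the correct class from the base predictions,
# here together with the original attributes.
combiner = DecisionTreeClassifier(random_state=0).fit(np.hstack([X, P]), y)

# Arbiter: trained only on the instances where the base classifiers disagree,
# and consulted for tie-breaking on such instances at prediction time.
disagree = P[:, 0] != P[:, 1]
if disagree.any():
    arbiter = DecisionTreeClassifier(random_state=0).fit(X[disagree], y[disagree])
```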

[Fan et al., 1999]
Wei Fan, Salvatore J. Stolfo, and Philip K. Chan. Using conflicts among multiple base classifiers to measure the performance of stacking. In C. Giraud-Carrier and B. Pfahringer, editors, Proceedings of the ICML-99 Workshop on Recent Advances in Meta-Learning and Future Work, pages 10-17. Jozef Stefan Institute, 1999.
Comment: The authors study the task of predicting the performance of stacking from characteristics of the base classifiers. They claim that their own characteristic CB-accuracy (which is derived from a complete table of possible meta-instances and their frequency of occurrence) outperforms other characteristics like error correlation, diversity, and specialty. However, the study is only performed on 2 natural and 2 artificial data sets.

[Gama and Brazdil, 1995]
J. Gama and P. Brazdil. Characterization of classification algorithms. In C. Pinto Ferreira and N. Mamede, editors, Progress in Artificial Intelligence - EPIA95. LNAI 990, Springer Verlag, 1995.
Abstract: This paper is concerned with the problem of characterization of classification algorithms. The aim is to determine under what circumstances a particular classification algorithm is applicable. The method used involves generation of different kinds of models. These include regression and rule models, piecewise linear models (model trees) and instance based models. These are generated automatically on the basis of dataset characteristics and given test results. The lack of data is compensated for by various types of preprocessing. The models obtained are characterized by quantifying their predictive capability and the best models are identified.

[Gama and Brazdil, 2000]
Joao Gama and Pavel Brazdil. Cascade generalization. Machine Learning, 41(3):315-343, 2000.
Comment: Like stacking, cascading is a meta-classification scheme that uses the predictions of base classifiers to improve predictions with a meta-level classifier. It differs from stacking in that the original input is enriched with the outputs of a single classifier (rather than being replaced by the predictions of several base classifiers), and in that several levels can be used, each enriching the dataset of the previous level with additional features (hence a cascade of classifiers).
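Sketch: A two-level cascade along the lines described above, assuming scikit-learn; the learners are illustrative, and the out-of-fold probability estimates are a common precaution rather than necessarily the paper's exact protocol.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Level 0: class probabilities of a single classifier, estimated out-of-fold
# here as a precaution against leakage.
proba0 = cross_val_predict(GaussianNB(), X, y, cv=10, method="predict_proba")

# Level 1: the original attributes enriched with the level-0 outputs.
X1 = np.hstack([X, proba0])
level1 = DecisionTreeClassifier(random_state=0).fit(X1, y)

# At prediction time new examples are enriched in the same way.
nb = GaussianNB().fit(X, y)

def predict(x_new):
    x_new = np.atleast_2d(x_new)
    return level1.predict(np.hstack([x_new, nb.predict_proba(x_new)]))

print(predict(X[:3]), y[:3])
```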

[Kalousis and Hilario, 2000a]
A. Kalousis and M. Hilario. A comparison of inducer selection via instance-based and boosted decision-tree meta-learning. In Proceedings of the 5th International Workshop on Multistrategy Learning, Guimaraes, Portugal, June 2000.
Abstract: The selection of an appropriate inducer is crucial for performing effective classification. In previous work we presented a system called NOEMON which relied on a mapping between dataset characteristics and inducer performance to propose inducers for specific datasets. Instance-based learning was used to create that mapping. Here we extend and refine the set of data characteristics; we also use a wider range of base-level inducers and a much larger collection of datasets to create the meta-models. We compare the performance of meta-models produced by instance-based learners and boosted decision trees. The results show that boosted decision tree models enhance the performance of the system.

[Kalousis and Hilario, 2000b]
A. Kalousis and M. Hilario. Model selection via meta-learning: a comparative study. In Proceedings of the 12th International IEEE Conference on Tools with AI, Vancouver, November 2000. IEEE press.
Abstract: The selection of an appropriate inducer is crucial for performing effective classification. In previous work we presented a system called Noemon which relied on a mapping between dataset characteristics and inducer performance to propose inducers for specific datasets. Instance-based learning was used to create that mapping. Here we extend and refine the set of data characteristics; we also use a wider range of base-level inducers and a much larger collection of datasets to create the meta-models. We compare the performance of meta-models produced by instance-based learners, decision trees and boosted decision trees. The results show that decision trees and boosted decision trees models enhance the performance of the system.

[Kalousis and Hilario, 2001]
A. Kalousis and M. Hilario. Feature selection for meta-learning. In Proceedings of the 5th Pacific Asia Conference on Knowledge Discovery and Data Mining. Springer, April 2001.
Abstract: The selection of an appropriate inducer is crucial for performing effective classification. In previous work we presented a system called Noemon which relied on a mapping between dataset characteristics and inducer performance to propose inducers for specific datasets. Instance-based learning was applied to meta-learning problems, each one associated with a specific pair of inducers. The generated models were used to provide a ranking of inducers on new datasets. Instance-based learning assumes that all the attributes have the same importance. We discovered that the best set of discriminating attributes is different for every pair of inducers. We applied a feature selection method on the meta-learning problems, to get the best set of attributes for each problem. The performance of the system is significantly improved.

[Kalousis and Theoharis, 1999]
A. Kalousis and T. Theoharis. Noemon: Design, implementation and performance results of an intelligent assistant for classifier selection. Intelligent Data Analysis, 3(5):319-337, November 1999.
Abstract: The selection of an appropriate classification model and algorithm is crucial for effective knowledge discovery on a dataset. For large databases, common in data mining, such a selection is necessary, because the cost of invoking all alternative classifiers is prohibitive. This selection task is impeded by two factors: First, there are many performance criteria, and the behaviour of a classifier varies considerably with them. Second, a classifier's performance is strongly affected by the characteristics of the dataset. Classifier selection implies mastering a lot of background information on the dataset, the models and the algorithms in question. An intelligent assistant can reduce this effort by inducing helpful suggestions from background information. In this study, we present such an assistant, NOEMON. For each registered classifier, NOEMON measures its performance for a collection of datasets. Rules are induced from those measurements and accommodated in a knowledge base. The suggestion on the most appropriate classifier(s) for a dataset is then based on those rules. Results on the performance of an initial prototype are also given.

[Kalousis et al., 1997]
A. Kalousis, G. Zarkadakis, and T. Theoharis. Noemon: Adding intelligence to the knowledge discovery process. In Proceedings of the 17th SGES International Conference on Knowledge Based Systems and Applied AI, pages 235-249. SGES Publications, 1997.
Abstract: In this paper an architecture is proposed that supports the selection of task, model and algorithm in the knowledge discovery process by utilising artificial intelligence techniques. The proposed system acts as an assistant to the analyst by suggesting possible selections. It improves its data mining performance by observing the analysis sequences applied by the analyst to new data sets. The system associates data characteristics with specific selections that lead to positive or negative results and uses this information to guide later analysis.

[Kaynak and Alpaydin, 2000]
C. Kaynak and Ethem Alpaydin. Multistage cascading of multiple classifiers: One man's noise is another man's data. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), pages 455-462, Stanford, CA, 2000.
Comment: This work aims at building classification systems with high accuracy and low cost through a multistage approach, called cascading, that uses a small number of classifiers of increasing complexity. An early simple classifier handles a majority of cases and a more complex classifier is only used for a small portion of the examples, thereby not significantly increasing the overall complexity. Early classifiers are generally semi-parametric (e.g., MLP) and the final classifier is always non-parametric (e.g., kNN). Thus the system can be viewed as creating rules in the early layers and catching exceptions at the final layer. Furthermore, different inductive biases are used (thus improving the value of the ensemble) and the final kNN layer can be used to place a limit on the number of layers in the cascade. At each layer, examples in the training set are drawn based on the confidence of the previous layer's classification. Hence, examples with low confidence at layer i are more likely to be sampled while training layer i+1. Cascading is similar to bagging and boosting. Significant advantages include: (i) cascading combines a small number of classifiers to reduce complexity, (ii) cascading uses different classifiers (i.e., different learning biases are available in the resulting ensemble), (iii) cascading is multistage whilst bagging/boosting are multiexpert (i.e., not all classifiers need be consulted in testing) and (iv) cascading takes into account the distance to the discriminant, whereas bagging/boosting only check if the input is on the right side of the discriminant or not. Empirical results show that cascading produces higher accuracy at lower cost than its individual components, and bagging and boosting.
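Sketch: A bare-bones two-stage version of the confidence-based routing described above (the confidence-weighted resampling of the training set is omitted), assuming scikit-learn; the stages, threshold and dataset are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stage1 = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)    # cheap, parametric
stage2 = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)  # costly, non-parametric

proba = stage1.predict_proba(X_te)
confident = proba.max(axis=1) >= 0.9          # examples stage 1 keeps for itself

pred = np.empty_like(y_te)
pred[confident] = stage1.classes_[proba[confident].argmax(axis=1)]
if (~confident).any():                        # remaining examples go to stage 2
    pred[~confident] = stage2.predict(X_te[~confident])

print("accuracy:", (pred == y_te).mean(),
      "| fraction routed to stage 2:", (~confident).mean())
```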

[Keller et al., 2000]
Joerg Keller, Iain Paterson, and Helmut Berrer. An integrated concept for multi-criteria ranking of data mining algorithms. In Joerg Keller and Christophe Giraud-Carrier, editors, Proceedings of the ECML-00 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, Barcelona, Spain, 2000.
Keywords: Multi-Criteria Ranking, DEA, Meta-Data, Meta-Learning

[Koepf et al., 2000a]
Christian Koepf, Charles C. Taylor, and Joerg Keller. Meta-analysis: Data characterisation for classification and regression on a meta-level. In Antony Unwin, Adalbert Wilhelm, and Ulrike Hofmann, editors, Proceedings of the International Symposium on Data Mining and Statistics, Lyon, France, 2000.
Abstract: Meta-Analysis can serve as a base for Meta-Learning: to support the user with automated guidance in model selection and data transformation. The first application field in METAL (Meta-Learning assistant, ESPRIT project 26.357) was classification, where data characteristics, measures, and tests had to be evaluated; now, in the phase of regression learning, they had to be proved. We describe necessary statistics and data characteristics for regression learning. As a new approach, we present Meta-Regression: regression learning performed on the meta level for Meta-Learning. This new direction could ``sharpen'' the accuracy of Meta-Learning, in particular when compared to a classification of error rates.
Keywords: Meta-Learning, Data Characterisation, Meta-Regression

[Koepf et al., 2000b]
Christian Koepf, Charles C. Taylor, and Joerg Keller. Meta-analysis: From data characterisation for meta-learning to meta-regression. In Pavel Brazdil and Alipio Jorge, editors, Proceedings of the PKDD-00 Workshop on Data Mining, Decision Support, Meta-Learning and ILP: Forum for Practical Problem Presentation and Prospective Solutions, Lyon, France, 2000.
Abstract: An extended Meta-Analysis fertilizes a Meta-Learning, which is applied to support the user with an automated guidance in model selection and data transformation. Two major application fields were selected in METAL (Meta-Learning assistant, ESPRIT project 26.357): classification and regression learning. In phase 1 of the project, the data characteristics, measures and tests have been evaluated for an automated use of classification algorithms. For regression learning, the statistics, information-theoretical measures and tests had to be proved. This paper works out necessary statistics and tests for regression learning. The new approach of this paper is to use a Meta-Regression: a regression learning on the meta level for Meta-Learning. In comparison to a classification of error rates, calculated for cross-validation tests, our new approach could improve the accuracy for Meta-Learning.
Keywords: Meta-Learning, Data Characterisation, Classification, Meta-Regression

[Kohavi and John, 1995]
Ron Kohavi and George H. John. Automatic parameter selection by minimizing estimated error. In A. Prieditis and S. Russell, editors, Proceedings of the 12th International Conference on Machine Learning (ICML-95), 1995.
Comment: The problem of finding the right parameter settings for a learning algorithm is an interesting problem for meta-learning. This study, which employs best-first search and cross-validation, should be the basic benchmark for such an attempt.
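Sketch: A simpler stand-in for the idea, assuming scikit-learn: parameters are chosen by minimizing cross-validated error over an exhaustive grid rather than by the best-first search used in the paper; the learner and grid are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate parameter settings; the grid is illustrative.
grid = {"max_depth": [2, 4, 8, None],
        "min_samples_leaf": [1, 5, 20]}

# Pick the setting with the lowest cross-validated error estimate.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```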

[Lagoudakis and Littman, 2000]
Michail G. Lagoudakis and Michael L. Littman. Algorithm selection using reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML-2000), pages 511-518, Stanford, CA, 2000.
Comment: This work focuses on algorithm selection in the context of dynamically choosing an algorithm to attack an instance of a problem with the goal of minimising execution time. The paper shows how, given (i) a set of algorithms that are equivalent in terms of the problem they solve (e.g., sort), but can differ in say scalability and (ii) a set of instance features (e.g., problem size) that can be used to select the most appropriate algorithm from the set for a given problem instance, a reinforcement learning approach can be used to select the right algorithm for each instance at run-time based on the instance features. The focus is on selection while the instance is being solved, so that each instance is solved by a mixture of algorithms formed dynamically at run-time. The example used is that of sorting, where depending on available algorithms and the size of the data, different algorithms may be used (e.g., start with mergesort and recurse until the sizes are small enough for shellsort or bubblesort to be applied more efficiently). Although this paper is not about learning algorithm selection, some ML algorithms are recursive (e.g., C4.5) so that the approach presented may have some relevance in that context. Certainly, the goals are the same; it is the problem instances that change. It may be worth investigating.

[Lindner and Studer, 1999]
Guido Lindner and Rudi Studer. AST: Support for algorithm selection with a CBR approach. In C. Giraud-Carrier and B. Pfahringer, editors, Proceedings of the ICML-99 Workshop on Recent Advances in Meta-Learning and Future Work, Bled, Slovenia, 1999.

[Petrak, 2000]
Johann Petrak. Fast subsampling performance estimates for classification algorithm selection. In J. Keller and C. Giraud-Carrier, editors, Proceedings of the ECML-00 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, pages 3-14, Barcelona, Spain, 2000.
Abstract: The typical data mining process is characterized by the prospective and iterative application of a variety of different data mining algorithms from an algorithm toolbox. While it would be desirable to check many different algorithms and algorithm combinations for their performance on a database, it is often not feasible because of time and other resource constraints. This paper investigates the effectiveness of simple and fast subsampling strategies for algorithm selection. We show that even such simple strategies perform quite well in many cases and propose to use them as a base-line for comparison with meta-learning and other advanced algorithm selection strategies.

[Pfahringer et al., 2000]
Bernhard Pfahringer, Hilan Bensusan, and Christophe Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In Pat Langley, editor, Proceedings of the 17th International Conference on Machine Learning (ICML-2000), pages 743-750, Stanford, CA, 2000.
Abstract: Landmarking is a novel approach to describing tasks in meta-learning. Previous approaches to meta-learning mostly considered only statistics-inspired measures of the data as a source for the definition of meta-attributes. Contrary to such approaches, landmarking tries to determine the location of a specific learning problem in the space of all learning problems by directly measuring the performance of some simple and efficient learning algorithms themselves. The experiments reported show how such a use of landmark values can help to distinguish between areas of the learning space favouring different learners. Experiments, both with artificial and real-world databases, show that landmarking selects, with moderate but reasonable level of success, the best performing of a set of learning algorithms.
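Sketch: Computing landmark meta-features for one dataset, assuming scikit-learn; the particular landmarkers (decision stump, naive Bayes, 1-NN) are merely indicative of the kind of simple learners used in the paper.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

landmarkers = {
    "decision_stump": DecisionTreeClassifier(max_depth=1, random_state=0),
    "naive_bayes": GaussianNB(),
    "one_nn": KNeighborsClassifier(n_neighbors=1),
}

X, y = load_wine(return_X_y=True)

# The cross-validated accuracies of the landmarkers become the
# meta-attributes that describe this task.
landmarks = {name: cross_val_score(clf, X, y, cv=10).mean()
             for name, clf in landmarkers.items()}
print(landmarks)
```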

[Provost and Buchanan, 1995]
F.J. Provost and B. G. Buchanan. Inductive policy: The pragmatics of bias selection. Machine Learning, 20:35-61, 1995.
Comment: In SBS, Provost and Buchanan consider the problem of model and algorithm selection as a search in the meta-space of possible representation models and in the meta-spaces of the possible ways to traverse each of those models. Movements in those meta-spaces are performed by dedicated operators implemented in SBS. SBS does not produce new knowledge and can only utilize existing knowledge, which can be given in the form of preconditions for the operators. The analyst must encode this knowledge explicitly.

[Rendell et al., 1987]
Larry Rendell, R. Seshu, and D. Tcheng. Layered concept learning and dynamically variable bias management. In Proceedings of the 10th International Joint Conference on AI, pages 308-314, 1987.
Comment: Rendell et al. describe a system called VBMS. VBMS tries to predict which of the available algorithms will perform best for a given classification problem. It uses problem characteristics like the number of examples and the number of attributes. The system actually produces new knowledge for each new problem it faces, by associating problem characteristics with algorithm performance. The main disadvantage of VBMS is that it is trained as new classification tasks are presented to it, which makes it quite slow. Moreover, only a single performance criterion is considered, namely execution time.

[Sashima and Hiraki, 2000]
A. Sashima and K. Hiraki. Learning to learn by modular neural networks. In Proceedings of the Twenty-Second Annual Conference of the Cognitive Science Society (CogSci-2000), 2000.
Comment: A one-page paper discussing the need for modular representation in the context of learning to learn. The motivation comes from human learning, where it is clear that, as they encounter tasks, humans learn not only knowledge of the current tasks but also biases for learning future tasks. The idea is that if each module learns a reusable basic function at the initial task, a mixture of modules can learn various complex functions at future tasks. Hence, the generality and reusability of each module enable learning to learn. An example using neural networks trained on two simple functions A and B is given, where it is shown that the networks trained on AB (i.e., A first then B) can correctly approximate function B, whilst the networks trained on BB (i.e., B twice) cannot approximate it.

[Schaffer, 1993]
C. Schaffer. Selecting a classification method by cross-validation. Machine Learning, 13(1):135-143, 1993.
Abstract: If we lack relevant problem-specific knowledge, cross-validation methods may be used to select a classification method empirically. We examine this idea here to show in what senses cross-validation does and does not solve the selection problem. As illustrated empirically, cross-validation may lead to higher average performance than application of any single classification strategy, and it also cuts the risk of poor performance. On the other hand, cross-validation is no more or less a form of bias than simpler strategies, and applying it appropriately ultimately depends in the same way on prior knowledge. In fact, cross-validation may be seen as a way of applying partial information about the applicability of alternative classification strategies.
Comment: This paper investigates the simplest approach to meta-learning: trying all algorithms and selecting the one that seems to perform best. This is the base-line that meta-learning should beat (certainly with respect to efficiency, but if possible also with respect to accuracy). Schaffer's experiments with three base learners (C4.5, C4.5rules, and back-propagation neural networks) on five UCI datasets confirm two hypotheses, namely that the average performance of the technique is better than that of the best base algorithm, and that its performance is nearly as good as that of the best base algorithm on each individual problem. On the other hand, he notes that cross-validation is just another form of bias, and applying it appropriately ultimately depends in the same way on prior knowledge. He also notes that using more base learners will increase the chance that one might accidentally look good and therefore decrease the advantage of cross-validation.
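Sketch: The cross-validation baseline in code, assuming scikit-learn; the candidate learners stand in for C4.5, C4.5rules and back-propagation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

candidates = {"tree": DecisionTreeClassifier(random_state=0),
              "mlp": MLPClassifier(max_iter=2000, random_state=0),
              "nb": GaussianNB()}

X, y = load_breast_cancer(return_X_y=True)

# Estimate each candidate's accuracy by cross-validation and pick the winner.
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> selected:", best)
```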

[Schaffer, 1994]
Cullen Schaffer. Cross-validation, stacking and bi-level stacking: Meta-methods for classification learning. In P. Cheeseman and R. W. Oldford, editors, Selecting Models from Data: Artificial Intelligence and Statistics IV, pages 51-59. Springer-Verlag, 1994.

[Seewald and Fürnkranz, 2001]
Alexander K. Seewald and Johannes Fürnkranz. Grading classifiers. Technical Report OEFAI-TR-2001-01, Austrian Research Institute for Artificial Intelligence, Wien, Austria, 2001. Submitted for publication.
Abstract: In this paper, we introduce grading, a novel meta-classification scheme. While stacking uses the predictions of the base classifiers as meta-level attributes, we use ``graded'' predictions (i.e., predictions that have been marked as correct or incorrect) as meta-level classes. For each base classifier, one meta classifier is learned whose task is to predict when the base classifier will err. Hence, just like stacking may be viewed as a generalization of voting, grading may be viewed as a generalization of selection by cross-validation and therefore fills a conceptual gap in the space of meta-classification schemes. Grading may also be interpreted as a technique for turning the error-characterizing technique introduced by Bay and Pazzani (2000) into a powerful learning algorithm by resorting to an ensemble of meta-classifiers. Our experimental evaluation shows that this step results in a performance gain that is quite comparable to that achieved by stacking, while both grading and stacking outperform their simpler counterparts, voting and selection by cross-validation.
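Sketch: A rough rendering of grading, assuming scikit-learn; the base learners, the graders and the confidence-weighted combination rule are illustrative simplifications of the scheme described in the abstract.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
bases = [GaussianNB(), KNeighborsClassifier(),
         DecisionTreeClassifier(random_state=0)]

# For each base classifier, grade its cross-validated predictions and train
# a meta classifier ("grader") to predict when the base classifier is correct.
graders = []
for b in bases:
    graded = (cross_val_predict(b, X, y, cv=10) == y).astype(int)
    graders.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, graded))

fitted = [b.fit(X, y) for b in bases]
n_classes = len(np.unique(y))

def predict(x_new):
    x_new = np.atleast_2d(x_new)
    votes = np.zeros((x_new.shape[0], n_classes))
    for b, g in zip(fitted, graders):
        pred = b.predict(x_new)
        # Estimated probability that this base classifier is correct here.
        conf = (g.predict_proba(x_new)[:, list(g.classes_).index(1)]
                if 1 in g.classes_ else np.zeros(x_new.shape[0]))
        votes[np.arange(len(pred)), pred] += conf
    return votes.argmax(axis=1)

print(predict(X[:5]), y[:5])
```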

[Sleeman et al., 1995]
D. Sleeman, M. Rissakis, S. Craw, N. Graner, and S. Sharma. Consultant-2: Pre and post-processing of machine learning applications. International Journal of Human Computer Studies, 43:43-63, 1995.
Abstract: The knowledge acquisition bottleneck in the development of large knowledge-based applications has not yet been resolved. One approach which has been advocated is the systematic use of Machine Learning (ML) techniques. However, ML technology poses difficulties to domain experts and knowledge engineers who are not familiar with it. This paper discusses Consultant-2, a system which makes a first step towards providing system support for a pre- and post-processing methodology where a cyclic process of experiments with an ML tool, its data, data description language and parameters attempts to optimize learning performance. Consultant-2 has been developed to support the use of the Machine Learning Toolbox (MLT), an integrated architecture of 10 ML tools, and has evolved from a series of earlier systems. Consultant-0 and Consultant-1 had knowledge only about how to choose an ML algorithm based on the nature of the domain data. Consultant-2 is the more sophisticated. It, additionally, has knowledge about how ML experts and domain experts pre-process domain data before a run with the ML algorithm, and how they further manipulate the data and reset parameters after a run of the selected ML algorithm, to achieve a more acceptable result. How these several KBs were acquired and encoded is described. In fact, this knowledge has been acquired by interacting both with the ML algorithm developers and with domain experts who had been using the MLT toolbox on real-world tasks. A major aim of the MLT project was to enable a domain expert to use the toolbox directly: i.e. without necessarily having to involve either an ML specialist or a knowledge engineer. Consultant's principal goal was to provide specific advice to ease this process.
Comment: An expert system called Consultant is presented here. The system is built to support the use of the Machine Learning Toolbox, an integrated architecture of ten machine learning tools. It relies heavily on close interaction with the user; it poses several questions trying to determine the nature of the application and the nature of the data. It does not examine the data. At the end of the interaction a list of possible algorithms is presented and the user may select one of them. The system is not expandable (i.e., it cannot incorporate new algorithms) and in the end relies heavily on the user for selecting the appropriate algorithm.

[Soares and Brazdil, 2000a]
C. Soares and P. Brazdil. Ranking classification algorithms with dataset selection: Using accuracy and time results. In R.S. Michalski and P.B. Brazdil, editors, Proceedings of the Fifth International Workshop on Multistrategy Learning (MSL 2000), pages 126-135, 2000.
Comment: presents an IBL-based ranking method that uses general, statistical and information-theoretic measures to characterize datasets; presents a way of combining success rate and time in a performance measure; presents results based on meta-data for 6 algorithms on 16 datasets
Keywords: Meta-Learning, Ranking, Multicriterion Evaluation

[Soares and Brazdil, 2000b]
C. Soares and P. Brazdil. Zoomed ranking: Selection of classification algorithms based on relevant performance information. In D.A. Zighed, J. Komorowski, and J. Zytkow, editors, Proceedings of the Fourth European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD2000), pages 126-135. Springer, 2000.
Comment: presents an IBL-based ranking method that uses general, statistical and information-theoretic measures to characterize datasets; presents a way of combining success rate and time in a performance measure; presents results based on meta-data for 6 algorithms on 16 datasets
Keywords: Meta-Learning, Ranking, Multicriterion Evaluation

[Soares et al., 2000a]
C. Soares, P. Brazdil, and J. Costa. Measures to compare rankings of classification algorithms. In H.A.L. Kiers, J.-P. Rasson, P.J.F. Groenen, and M. Schader, editors, Data Analysis, Classification and Related Methods, Proceedings of the Seventh Conference of the International Federation of Classification Societies IFCS, pages 119-124. Springer, 2000.
Comment: discussion of three measures to evaluate rankings
Keywords: Ranking Evaluation, Rank Correlation

[Soares et al., 2000b]
C. Soares, J. Costa, and P. Brazdil. Distance to reference: A simple measure to evaluate rankings of supervised classification algorithms. In JOCLAD 2000: VI Jornadas de Classificação e Análise de Dados, pages 61-66, 2000.
Comment: presents a multicriterion ranking evaluation measure
Keywords: Multicriterion Evaluation, Ranking Evaluation

[Soares et al., 2000c]
C. Soares, J. Costa, and P. Brazdil. A simple and intuitive measure for multicriteria evaluation of classification algorithms. In J. Keller and C. Giraud-Carrier, editors, ECML 2000 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, pages 87-96, 2000.
Comment: presents a multicriterion ranking evaluation measure
Keywords: Multicriterion Evaluation, Ranking Evaluation

[Soares et al., 2001]
C. Soares, J. Costa, and P. Brazdil. Improved statistical support for matchmaking: Rank correlation taking rank importance into account. In JOCLAD 2001: VII Jornadas de Classificação e Análise de Dados, pages 72-75, 2001.
Comment: presents a weighted rank correlation coefficient
Keywords: Rank Correlation, Weighted Correlation

[Soares, 1999]
C. Soares. Ranking classification algorithms on past performance. Master's thesis, Faculty of Economics, University of Porto, 1999.

[Sohn, 1999]
S.Y. Sohn. Meta analysis of classification algorithms for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(11):1137-1144, 1999.
Abstract: Various classification algorithms became available due to a surge of interdisciplinary research interests in the areas of data mining and knowledge discovery. We develop a statistical meta-model which compares the classification performances of several algorithms in terms of data characteristics. This empirical model is expected to aid decision making processes of finding the best classification tool in the sense of providing the minimum classification error among alternatives.
Comment: This paper details the construction of a statistical meta-model to predict the expected classification performance of 11 learning algorithms as a function of (subsets of) 13 data characteristics. The selected algorithms are: Fisher's linear discriminant, a quadratic discriminant function, a logistic discriminant function, kNN, backpropagation, LVQ, Kohonen, RBF, IndCART, C4.5 and a Bayesian decision tree. The 13 data characteristics include basic descriptive statistics (e.g., number of classes) as well as multivariate statistics (e.g., mean skewness) and derived statistics (e.g., square root of the ratio of the number of feature variables to the number of training examples). 17 to 18 of the 22 datasets used in StatLog are used here to fit the statistical meta-model. A number of useful conclusions regarding the impact of certain characteristics on performance are drawn. The meta-model's performance is high and it can be used to rank algorithms by means of Spearman's rank correlation.

[Sykacek, 1999]
Peter Sykacek. Metalevel learning - is more than model selection necessary? In C. Giraud-Carrier and B. Pfahringer, editors, Proceedings of the ICML-99 Workshop on Recent Advances in Meta-Learning and Future Work, pages 66-73, Ljubljana, Slovenia, 1999.

[Tcheng et al., 1989]
D. Tcheng, B. Lambert, S. Lu, and Larry Rendell. Building robust learning systems by combining induction & optimization. In Proceedings of the 11th International Joint Conference on AI, pages 806-812, 1989.
Comment: In this work the CRL/ISO system is presented, a system that uses optimization to search the inductive bias space. The CRL component is a learning system that manages a set of diverse inductive biases and produces hybrid concept representations. The ISO component is the optimization component that searches the inductive bias space for an optimal bias. The system has been used in a specific engineering application with good results, but it has not been tested in other fields.

[Ting and Witten, 1997]
Kai Ming Ting and Ian H. Witten. Stacked generalization: When does it work? In Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI-97), pages 866-873, Nagoya, Japan, 1997. Morgan Kaufmann.
Abstract: Stacked generalization is a general method of using a high-level model to combine lower-level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a `black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms.

[Ting and Witten, 1999]
Kai Ming Ting and Ian H. Witten. Issues in stacked generalization. Journal of Artificial Intelligence Research, 10:271-289, 1999.
Abstract: Stacked generalization is a general method of using a high-level model to combine lower level models to achieve greater predictive accuracy. In this paper we address two crucial issues which have been considered to be a `black art' in classification tasks ever since the introduction of stacked generalization in 1992 by Wolpert: the type of generalizer that is suitable to derive the higher-level model, and the kind of attributes that should be used as its input. We find that best results are obtained when the higher-level model combines the confidence (and not just the predictions) of the lower-level ones. We demonstrate the effectiveness of stacked generalization for combining three different types of learning algorithms for classification tasks. We also compare the performance of stacked generalization with majority vote and published results of arcing and bagging.

[Todorovski and Dzeroski, 1999]
Ljupco Todorovski and Saso Dzeroski. Experiments in meta-level learning with ILP. In Proceedings of the 3rd European Conference on Principles of Data Mining and Knowledge Discovery (PKDD-99), pages 98-106. Springer-Verlag, 1999.
Abstract: When considering new datasets for analysis with machine learning algorithms, we encounter the problem of choosing the algorithm which is best suited for the task at hand. The aim of meta-level learning is to relate the performance of different machine learning algorithms to the characteristics of the dataset. The relation is induced on the basis of empirical data about the performance of machine learning algorithms on the different datasets. In the paper, an Inductive Logic Programming (ILP) framework for meta-level learning is presented. The performance of three machine learning algorithms (the tree learning system C4.5, the rule learning system CN2 and the k-NN nearest neighbour classifier) were measured on twenty datasets from the UCI repository in order to obtain the dataset for meta-learning. The results of applying ILP on this meta-learning problem are presented and discussed.
Comment: This paper suggests the use of inductive logic programming (ILP) to deal with the fact that datasets have varying numbers of attributes. While conventional statistical approaches to data characterisation have to use summary statistics over all attributes, ILP algorithms can access the information of each individual attribute.

[Todorovski and Dzeroski, 2000]
Ljupco Todorovski and Saso Dzeroski. Combining multiple models with meta decision trees. In D. A. Zighed, J. Komorowski, and J. Zytkow, editors, Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD-2000), pages 54-64, Lyon, France, 2000. Springer-Verlag.
Abstract: The paper introduces meta decision trees (MDTs), a novel method for combining multiple models. Instead of giving a prediction, MDT leaves specify which model should be used to obtain a prediction. We present an algorithm for learning MDTs based on the C4.5 algorithm for learning ordinary decision trees (ODTs). An extensive experimental evaluation of the new algorithm is performed on twenty-one data sets, combining models generated by five learning algorithms: two algorithms for learning decision trees, a rule learning algorithm, a nearest neighbor algorithm and a naive Bayes algorithm. In terms of performance, MDTs combine models better than voting and stacking with ODTs. In addition, MDTs are much more concise than ODTs used for stacking and are thus a step towards comprehensible combination of multiple models.
Comment: This work aims at directly predicting which classifier is best suited to classify an individual example. To this end, it uses statistical properties of the predicted class distributions as meta-attributes and predicts the most suitable algorithm from them. The approach is not modular (in the sense that any algorithm could be used at the meta-level) but is implemented as a modification of the decision tree learner C4.5.
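
The rough sketch below approximates this idea with an ordinary scikit-learn decision tree at the meta-level (the paper itself modifies C4.5): the meta-attributes are simple properties of each base model's predicted class distribution, and the meta-target is which base model to trust on each example. The choice of base learners and of the meta-target construction is ours, not the paper's.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_tr, X_meta, y_tr, y_meta = train_test_split(X, y, random_state=0)

    bases = [GaussianNB().fit(X_tr, y_tr), KNeighborsClassifier().fit(X_tr, y_tr)]
    probas = [m.predict_proba(X_meta) for m in bases]

    # Meta-attributes: per base model, the highest class probability and the
    # entropy of its predicted class distribution.
    feats = []
    for p in probas:
        feats.append(p.max(axis=1))
        feats.append(-(p * np.log(p + 1e-12)).sum(axis=1))
    Z = np.column_stack(feats)

    # Meta-target: the most confident base model that is correct on the example,
    # falling back to model 0 when neither is correct (one simple choice).
    correct = np.array([m.predict(X_meta) == y_meta for m in bases])
    conf = np.array([p.max(axis=1) for p in probas])
    target = np.where(correct.any(axis=0), (conf * correct).argmax(axis=0), 0)

    meta_tree = DecisionTreeClassifier(max_depth=3).fit(Z, target)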

[Todorovski et al., 2000]
L. Todorovski, P. Brazdil, and C. Soares. Report on the experiments with feature selection in meta-level learning. In P. Brazdil and A. Jorge, editors, Proceedings of the Data Mining, Decision Support, Meta-Learning and ILP Workshop at PKDD2000, pages 27-39, 2000.
Comment: A study on feature selection in meta-level learning, using ranking and algorithm selection based on general, statistical and information-theoretic measures (a generic sketch of such meta-feature selection follows below).
Keywords: Meta-Learning, Feature Selection, Ranking, Algorithm Selection
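
A generic wrapper-style sketch of meta-feature selection for algorithm selection; this is not the paper's procedure, and the meta-dataset below is a synthetic placeholder (rows stand for datasets, columns for general, statistical and information-theoretic measures, and the target for the winning algorithm).

    import numpy as np
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    meta_X = rng.normal(size=(60, 10))            # 60 datasets, 10 meta-features
    best_algorithm = rng.integers(0, 3, size=60)  # index of the winning algorithm

    # Forward selection keeps the meta-features that most help a k-NN
    # meta-learner predict the best algorithm under cross-validation.
    selector = SequentialFeatureSelector(
        KNeighborsClassifier(n_neighbors=3),
        n_features_to_select=4,
        direction="forward",
        cv=5,
    )
    selector.fit(meta_X, best_algorithm)
    print(selector.get_support())  # mask of the retained meta-features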

[van Someren, to appear]
Maarten van Someren. Model class selection and construction: Beyond the procrustean approach to machine learning applications. In V. Karkaletsis, G. Paliouras, and C.D. Spyropoulos, editors, Machine Learning and Applications. Springer-Verlag, to appear.
Comment: A most insightful paper that should be considered by anyone who is serious about meta-learning. The paper discusses the issue of model class selection in machine learning; its goal is to find a way to assign each learning problem to a fitting rather than a Procrustean bed. Four methods for model class selection are outlined:
1) the monkey and the toolbox: select all possible models, learn within each model, and evaluate using cross-validation (a minimal sketch of this method follows after this comment);
2) use prior knowledge (e.g., linear separability, noise, relevance, variance, etc.);
3) use 'cheap' properties of the data, i.e., relate properties of the distribution to the performance of learning techniques;
4) divide-and-conquer, i.e., model combination, applying different methods to different subsets of the data.
Under the third method, three types of studies are characterised:
1) type 1 studies focus on model evaluation, relating individual datasets to the performance of individual methods; the result is a table of dataset x technique x performance, which is difficult to generalise;
2) type 2 studies relate properties of the data to the effect of learning methods (the StatLog approach); they face the problem of meta-feature selection and offer no insight as to why techniques work well on some datasets but not others (a purely empirical, a-theoretical approach);
3) type 3 studies relate properties of both datasets and methods to the effect of learning methods.
Type 3 studies are the ultimate goal. (Interestingly, landmarking seems to get closer to a type 3 study than DCT.) The paper also addresses the issue of data transformation and its relation to model class selection.
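
A minimal sketch, in scikit-learn, of method 1 above (the monkey and the toolbox): learn within every candidate model class and let cross-validation pick the winner. The dataset and candidate classes are chosen arbitrarily.

    from sklearn.datasets import load_wine
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_wine(return_X_y=True)

    # One representative per model class; cross-validation acts as the arbiter.
    candidates = {
        "linear": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "tree": DecisionTreeClassifier(random_state=0),
        "naive_bayes": GaussianNB(),
    }
    scores = {name: cross_val_score(model, X, y, cv=10).mean()
              for name, model in candidates.items()}
    print(max(scores, key=scores.get), scores)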

[Vilalta and Oblinger, 2000]
Ricardo Vilalta and Daniel Oblinger. A quantification of distance bias between evaluation metrics in classification. In Pat Langley, editor, Machine Learning, Proceedings of the 17th International Conference. Morgan Kaufmann, 2000.

[Widmer, 1996]
Gerhard Widmer. Recognition and exploitation of contextual clues via incremental meta-learning. In L. Saitta, editor, Proceedings of the 13th International Conference on Machine Learning (ICML'96), 1996.
Abstract: Daily experience shows that in the real world, the meaning of many concepts heavily depends on some implicit context, and changes in that context can cause more or less radical changes in the concepts. Incremental concept learning in such domains requires the ability to recognize and adapt to such changes. This paper presents a solution for incremental learning tasks where the domain provides explicit clues as to the current context (e.g., attributes with characteristic values). We present a general two-level learning model, and its realization in a system named MetaL(B), that can learn to detect certain types of contextual clues, and can react accordingly when a context change is suspected. The model consists of a base level learner that performs the regular on-line learning and classification task, and a meta-learner that identifies potential contextual clues. Context learning and detection occur during regular on-line learning, without separate training phases for context recognition. Experiments with synthetic domains as well as a `real-world' problem show that MetaL(B) is robust in a variety of dimensions and produces substantial improvement over simple object-level learning in situations with changing contexts. The meta-learning framework is very general, and a number of instantiations and extensions of the model are conceivable. Some of these are briefly discussed.

[Widmer, 1997]
Gerhard Widmer. Tracking context changes through meta-learning. Machine Learning, 27(3):259-286, 1997.
Abstract: The article deals with the problem of learning incrementally (`on-line') in domains where the target concepts are context-dependent, so that changes in context can produce more or less radical changes in the associated concepts. In particular, we concentrate on a class of learning tasks where the domain provides explicit clues as to the current context (e.g., attributes with characteristic values). A general two-level learning model is presented that effectively adjusts to changing contexts by trying to detect (via `meta-learning') contextual clues and using this information to focus the learning process. Context learning and detection occur during regular on-line learning, without separate training phases for context recognition. Two operational systems based on this model are presented that differ in the underlying learning algorithm and in the way they use contextual information: MetaL(B) combines meta-learning with a Bayesian classifier, while MetaL(IB) is based on an instance-based learning algorithm. Experiments with synthetic domains as well as a number of `real-world' problems show that the algorithms are robust in a variety of dimensions, and that meta-learning can produce substantial improvement over simple object-level learning in situations with changing contexts.
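
The sketch below is a deliberately simplified rendering of the two-level idea, not Widmer's MetaL(B) or MetaL(IB): a windowed base-level classifier is paired with a meta-level that flags attributes whose value changes coincide with bursts of prediction errors, treating them as candidate contextual clues and shrinking the window when a context change is suspected. Window sizes, thresholds, and the reaction to a suspected change are all invented here.

    from collections import deque

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    class TwoLevelLearner:
        """Windowed base learner plus a meta-level watching for contextual clues."""

        def __init__(self, window_size=200, burst_threshold=0.4):
            self.window = deque(maxlen=window_size)  # recent (x, y) pairs
            self.recent_errors = deque(maxlen=50)    # 0/1 error indicators
            self.burst_threshold = burst_threshold
            self.base = GaussianNB()
            self.fitted = False
            self.last_x = None
            self.suspected_clues = set()             # indices of candidate context attributes

        def process(self, x, y):
            x = np.asarray(x, dtype=float)
            pred = self.base.predict([x])[0] if self.fitted else None
            if pred is not None:
                self.recent_errors.append(int(pred != y))
            # Meta level: an error burst right after some attribute changed value
            # makes that attribute a suspected contextual clue; the window is then
            # shrunk so the base level forgets the old context faster.
            if (self.last_x is not None and self.recent_errors
                    and np.mean(self.recent_errors) > self.burst_threshold
                    and np.any(x != self.last_x)):
                self.suspected_clues.update(np.flatnonzero(x != self.last_x).tolist())
                keep = list(self.window)[-20:]
                self.window.clear()
                self.window.extend(keep)
                self.recent_errors.clear()
            self.last_x = x
            self.window.append((x, y))
            # Base level: relearn from the (possibly shrunk) window.
            X = np.array([w[0] for w in self.window])
            Y = np.array([w[1] for w in self.window])
            if len(np.unique(Y)) > 1:
                self.base.fit(X, Y)
                self.fitted = True
            return pred

Feeding a stream of (x, y) pairs to process() and inspecting suspected_clues after a drift would show which attributes this toy meta-level blames for the change.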

[Wolpert, 1992]
David H. Wolpert. Stacked generalization. Neural Networks, 5(2):241-260, 1992.
Comment: Stacking is a meta-classification scheme which, instead of training a meta-learner to select the best algorithm for a given problem, trains a meta-classifier to combine the predictions of multiple base classifiers. The training set for the meta-classifier is constructed by using the predictions of the base classifiers as its features. This paper lays the theoretical foundation for this approach.
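
A compact sketch of the scheme described in this comment: the level-1 training set is built from out-of-fold predictions of the base classifiers, and a meta-classifier is trained on it. The base and meta learners chosen here are arbitrary, not prescribed by the paper.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    base_classifiers = [KNeighborsClassifier(), GaussianNB()]

    # Out-of-fold predictions of each base classifier become the meta-features,
    # so the meta-classifier never sees predictions made on the training folds.
    meta_features = np.column_stack(
        [cross_val_predict(clf, X, y, cv=5) for clf in base_classifiers]
    )
    meta_classifier = LogisticRegression(max_iter=1000).fit(meta_features, y)

    # At prediction time the base classifiers are refit on all the data and
    # their predictions on new examples are fed to the meta-classifier.
    for clf in base_classifiers:
        clf.fit(X, y)
    new_meta = np.column_stack([clf.predict(X[:5]) for clf in base_classifiers])
    print(meta_classifier.predict(new_meta))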