# Project MetaL: On-line Bibliography on Meta-Learning

## Theoretical Comparisons

This page is part of the On-line Bibliography on Meta-Learning, which is provided within the ESPRIT METAL Project (26.357) A Meta-Learning Assistant for Providing User Support in Machine Learning Mining. This part of the bibliography covers theoretical studies of different types of learning algorithms.

If you think of a paper that should be here, please let us know.

17 references, last updated Fri Mar 9 13:02:50 MET 2001

- [Blumer et al., 1987]
- Anselm Blumer, Andrzej
Ehrenfeucht, David Haussler, and Manfred K. Warmuth.
Occam's razor.
Information Processing Letters, 24:377-380, 1987.
- [Cheesman, 1990]
- Peter Cheesman.
On finding the most probable model.
In J. Schrager and P. Langley, editors, Computational Models of Discovery
and Theory Formation, chapter 3, pages 73-95. Morgan Kaufmann, 1990.
**Comment:**The Bayesian view on model selection. - [Domingos, 1999]
- Pedro Domingos.
The role of Occam's Razor in knowledge discovery.
Data Mining and Knowledge Discovery, 3(4):409-425, December 1999.
**Comment:**Good summary of previous work on Occam's Razor in Machine Learning. Domingos claims that not the complexity but the search effort needed to find a theory is relevant. - [Floyd and Warmuth, 1995]
- Sally Floyd and
Manfred Warmuth.
Sample compression, learnability, and the vapnik-chervonenkis dimension.
Machine Learning, 21:269-304, 1995.
**Abstract:**Within the framework of pac-learning, we explore the learnability of concepts from samples using the paradigm of sample compression schemes. A sample compression scheme of size k for a concept class C subseteq 2^{X}consists of a compression function and a reconstruction function. The compression function receives a finite sample set consistent with some concept in C and chooses a subset of k examples as the compression set. The reconstruction function forms a hypothesis on X from a compression set of k examples. For any sample set of a concept in C the compression set produced by the compression function must lead to a hypothesis consistent with the whole original sample set when it is fed to the reconstruction function. We demonstrate that the existence of a sample compression scheme of fixed-size for a class C is sufficient to ensure that the class C is pac-learnable. Previous work has shown that a class is pac-learnable if and only if the Vapnik-Chervonenkis (VC) dimension of the class is finite. In the second half of this paper we explore the relationship between sample compression schemes and the VC dimension. We define maximum and maximal classes of VC dimension d. For every maximum class of VC dimension d, there is a sample compression scheme of size d, and for sufficiently-large maximum classes there is no sample compression scheme of size less than d. We discuss briefly classes of VC dimension d that are maximal but not maximum. It is an open question whether every class of VC dimension d has a sample compression scheme of size O(d).**Keywords:**Sample compression, Vapnik-Chervonenkis dimension, pac-learning - [Fürnkranz, 1999]
- Johannes Fürnkranz.
Separate-and-conquer rule learning.
Artificial Intelligence Review, 13(1):3-54, February 1999.
**Abstract:**This paper is a survey of inductive rule learning algorithms that use a separate-and-conquer strategy. This strategy can be traced back to the AQ learning system and still enjoys popularity as can be seen from its frequent use in Inductive Logic Programming systems. We will put this wide variety of algorithms into a single framework and analyze them along three different dimensions, namely their search, language and overfitting avoidance biases. - [Gordon and desJardin, 1995]
- F. Gordon and
M. desJardin.
Evaluation and selection of biases.
Machine Learning, 20:5-22, 1995.
**Comment:**The model of an algorithm actually defines the ``search space'' or ``hypothesis space'', such as k-DNF or k-CNF forms, linear discriminant functions, rules etc. The algorithm searches this space for the right hypothesis, i.e. for the hypothesis that better fits the data. The algorithm determines the order of visiting the states in this space. For example, between two algorithms that both search in DNF space, one might start the search from DNF forms that contain the complete set of features, while the other might start from sets consisting of only one feature. Hence, the wrong choice of algorithm may result in slow convergence towards the right hypothesis, or may even end at a suboptimal solution due to a local minimum. A wrong choice of model can have a more severe impact: A hypothesis appropriate for the problem at hand might be ignored because it is not contained in the model's search space. A formalization of the bias spaces is given by Gordon and dJardin, where various levels of bias are described. At the lowest level we have a specific hypothesis space and a way to search it. Above it we have the representational space having as states various different representations, and various search spaces defined for each of the states of the represantational space. The problem of selecting the appropriate bias then becomes a problem of searching those spaces. - [Haussler et al., 1994]
- David Haussler,
Michael Kearns, and Robert E. Schapire.
Bounds on the sample complexity of bayesian learning using information theory
and the vc dimension.
Machine Learning, 14:83-113, 1994.
- [Langley, 1999]
- Pat Langley.
Tractable average-case analysis of naive Bayesian classifiers.
In I. Bratko and S. Dzeroski, editors, Machine Learning, Proceedings of
the 16th International Conference. Morgan Kaufmann, 1999.
- [Pfahringer, 1995]
- Bernhard Pfahringer.
A new MDL measure for robust rule induction (extended abstract).
In Nada Lavrac and Stefan Wrobel, editors, Machine Learning: ECML-95
(Proc. European Conf. on Machine Learning, 1995), Lecture Notes in
Artificial Intelligence 912, pages 331 -- 334, Berlin, Heidelberg, New York,
1995. Springer Verlag.
- [Quinlan, 1995]
- R. Quinlan.
Mdl and categorical theories (continued).
In A. Prieditis and S. Russel, editors, Machine Learning Proc. of 12th
International Conference. Morgan Kaufmann, 1995.
- [Schaffer, 1993]
- Cullen Schaffer.
Overfitting avoidance as bias.
Machine Learning, 10:153-178, 1993.
- [Schaffer, 1994]
- C. Schaffer.
A conservation law for generalization of performance.
In W. Cohen and H. Hirsh, editors, Machine Learning, Proceedings of the
11th International Conference. Morgan Kaufmann, 1994.
**Comment:**A less mathematical description of the ``no-free-lunch'' theorems saying that all learning algorithms are equal over all possible test tests for a given training set. - [Shinohara, 1994]
- A. Shinohara.
Complexity of computing generalized VC-dimensions.
In F. Bergadano and L. de Raedt, editors, Machine Learning -
ECML-94. LNAI 784, Springer Verlag, 1994.
- [Wolpert, 1993]
- David H. Wolpert.
On overfitting avoidance as bias.
Technical Report SFI TR 92-03-5001, The Santa Fe Institute, Santa Fe, NM,
1993.
- [Wolpert, 1996a]
- David H. Wolpert.
The existence of a priori distinctions between learning algorithms.
Neural Computation, 8, 1996.
**Comment:**Describes scenarios that are not covered in the companion paper on the ``no free lunch'' theorems. - [Wolpert, 1996b]
- David H. Wolpert.
The lack of a priori distinctions between learning algorithms.
Neural Computation, 8:1381-1390, 1996.
**Comment:**A detailed description of the so-called ``no free lunch'' theorems. - [Wolpert, 2000]
- D.H. Wolpert.
Any two learning algorithms are (almost) exactly identical.
In Proceedings of the ICML-2000 Workshop on What Works Well Where,
2000.
**Comment:**One of the ramifications of the no-free-lunch (NFL) theorems is that averaged over all targets, the generalisation performance of any two learning algorithms is exactly identical. This paper further shows that for almost every single target, the generalisation error of any two learning algorithms is almost exactly identical (for zero-one loss). This emphasizes just how important prior information about the target is. Unless you restrict the set of allowed targets to an extremely small set, any pair of algorithms considered will perform essentially the same. This is appears most relevant to meta-learning.