NeuroCOLT

Neural Networks and Computational Learning Theory

 

About NeuroCOLT

Papers Archive

Research Areas

Partners

Coordinator

Events

info@neurocolt.org

 

NeuroCOLT workshop
on
Generalisation Bounds Less than 0.5
Windsor, 29 April - 2 May 2002
Cumberland Lodge

"PAC-Bayesian Bounds for Kernel Classifiers based on Compression and Margin"

Thore Graepel


The PAC-Bayesian framework is a candidate framework for generalisation error bounds below one because it provides aposteriori bounds that depend on the training sample observed. The resulting bounds are low if the prior belief as expressed by a prior distribution over hypothesis space is supported by the data. Two basic quantities have been argued to have an influence on the generalisation performance of classifiers: compression ratio and margin. The compression ratio is inherently related to example-based classification algorithms that make use of (a subset of) the training sample directly for the classification of test examples such as, for example, the k-nearest-neighbours classifier. The margin characterises hypothesis-based classifiers that derive the classification from thresholding a real-valued function. It turns out that kernel classifiers share important properties of both example-based and hypothesis-based classification algorithms and can thus be characterised by compression ratio and margin. In this talk I give an introduction to the PAC-Bayesian framework and show how it can be used to derive bounds based on both compression ratio and margin for kernel classifiers. For the margin bound a PAC-Bayesian prior is defined over weight space, and the margin appears as a measure of the fraction of hypotheses with empirical risk below a certain threshold. In the case of compression the prior is defined over the space of expansion coefficients of the weight vector in terms of training inputs. Empirical results indicate that the bounds provide values below one for realistic sample size and may be used for model selection. Not surprisingly, however, they are unable to predict the generalisation error accurately for real-world examples.