NeuroCOLT

Neural Networks and Computational Learning Theory

 

About NeuroCOLT

Papers Archive

1994 1995
1996 1997
1998 1999
2000 2001

Books

info@neurocolt.org

NeuroCOLT Technical Report NC-TR-00-079

Text Classification using String Kernels

Huma Lodhi John Shawe-Taylor
Nello Cristianini Chris Watkins

Department of Computer Science
Royal Holloway, University of London

 

Abstract
We introduce a novel kernel for comparing two text documents. The kernel is an inner product in the feature space consisting of all subsequences of length $k$. A subsequence is any ordered sequence of $k$\ characters occurring in the text though not necessarily contiguously. The subsequences are weighted by an exponentially decaying factor of their full length in the text, hence emphasising those occurrences which are close to contiguous. A direct computation of this feature vector would involve a prohibitive amount of computation even for modest values of $k$, since the dimension of the feature space grows exponentially with $k$. The paper describes how despite this fact the inner product can be efficiently evaluated by a dynamic programming technique. A preliminary experimental comparison of the performance of the kernel compared with a standard word feature space kernel~ \cite{Joachims98} is made showing encouraging results.

Download Compressed Postscript