|
NeuroCOLT
Technical Report NC-TR-00-079
Text Classification using String Kernels
Huma
Lodhi John Shawe-Taylor
Nello Cristianini Chris Watkins
Department
of Computer Science
Royal Holloway, University of London
Abstract
We introduce a novel kernel for comparing two text documents. The
kernel is an inner product in the feature space consisting of all
subsequences of length $k$. A subsequence is any ordered sequence
of $k$\ characters occurring in the text though not necessarily contiguously.
The subsequences are weighted by an exponentially decaying factor
of their full length in the text, hence emphasising those occurrences
which are close to contiguous. A direct computation of this feature
vector would involve a prohibitive amount of computation even for
modest values of $k$, since the dimension of the feature space grows
exponentially with $k$. The paper describes how despite this fact
the inner product can be efficiently evaluated by a dynamic programming
technique. A preliminary experimental comparison of the performance
of the kernel compared with a standard word feature space kernel~
\cite{Joachims98} is made showing encouraging results.
Download Compressed
Postscript
|