NeuroCOLT
workshop
on
Applications of Learning to Text and Images
Windsor, 30 April - 2 May 2001
Cumberland
Lodge
"Machine
Learning for Natural Language"
Michael Collins
Online
Presentation (Postscript file)
Many problems in natural language processing require learning
of a mapping from one discrete structure to another. Examples
are parsing (learning a function from strings to trees); named
entity extraction (learning a function from strings to an underlying
segmentation identifying people, places, locations and so on);
and machine translation (learning a function from sentences in
a source language to sentences in a target language). This talk
will concentrate on problems of this kind. A very common approach
to these tasks is to use some kind of generative model, for example
a hidden markov model, or a stochastic context-free grammar (SCFG).
Another method is to use "history-based models", as introduced
by researchers at IBM. These approaches are probabilistic, in
that they attempt to model a joint or conditional probability
distribution over possible structures.
Methods from machine learning and computational learning theory
have had relatively little application in natural language, and
are appealing in several ways -- both from a theoretical perspective
(e.g., distribution-free properties) and in practical terms (e.g.,
the opportunities that kernels or boosting can provide in terms
of the feature spaces they can utilize). There are, however, several
challenges (and research oppotunities!) in applying these methods
to natural language. NLP problems involve rich, discrete structures.
Feature spaces are often very high dimensional, and very sparse.
The tasks I will talk about are not classification problems, instead
involving the choice of correct structure from a large number
of candidate alternatives. As a case study, I will talk in some
detail about natural language parsing. In the early 1990s a rich
source of supervised training data was released to the NLP community:
the Penn treebank -- roughly 1 million words of Wall Street Journal
text annotated for syntactic structure. Since then there has been
much work, and steady progress, on machine learning for parsing.
I'll explain how simple SCFGs can be applied to the task, and
why they perform poorly. I'll then describe models which are still
generative in nature, but are more sensitive to lexical information,
giving a substantial increase in performance. Finally, I'll describe
more recent work on learning methods which promise to be considerably
more flexible in the features they can incorporate. I'll describe
how three such approaches -- boosting, support vector machines
and markov random fields -- can be applied to parsing. In particular,
I will describe kernels for various NLP structures such as trees,
hidden markov models and dependency structures. To conclude the
talk I will return to related problems such as named entity extraction
or machine translation, and describe how similar methods might
be applied to these tasks. I'll also talk about "question answering"
as a motivating application, and explain why I think it's a natural
progression from standard Information Retrieval (*document* retrieval)
to retrieving results that are more specific to a user's query
(*sub-document* retrieval).
Joint
work with Steve Abney, Nigel Duffy, Rob Schapire and Yoram Singer.