NeuroCOLT workshop on Applications of Learning to Text and Images
Windsor, 30 April - 2 May 2001
Cumberland Lodge
"Matrix Decomposition Methods in Information Retrieval"
Thomas Hofmann, Brown University
Many problems in information retrieval and information filtering
involve data that can be represented in the form of a sparse matrix
with binary values or frequency counts. This includes document-term
frequencies, user ratings on a set of items, and adjacency matrices
encoding the hyperlink graph or citation structure in document
repositories. There are a number of generic questions that typically
occur in this context. Most prominently, one would like to overcome
the sparseness problem, i.e., reliably estimate probabilities
for unobserved or rare events. In addition, the derivation of
low-dimensional data representations and the identification of
latent factors are often of considerable interest, both as a preprocessing
step for later analysis and for visualization. This
talk will introduce and discuss methods for matrix decomposition
and dimension reduction that address these questions. Several
example applications from information retrieval will be used to
illustrate the effectiveness of this class of methods. These applications will
include (i) estimating document-specific language models in ad
hoc retrieval, (ii) deriving topic-centered document representations
for document categorization, (iii) decomposing user preferences
for collaborative filtering, and (iv) learning stochastic models for
hyperlink and paper citation graphs. Algorithmic and scalability
issues will also be discussed in detail.