Recent pubs of Manfred K. Warmuth

A bit outdated: Check out Manfred's recent home page

We have developed a framework for deriving and analysing simple on-line learning algorithms. A learning problem is given by a parameterized class of predictors, a divergence measure in the labelling domain (the loss function), and a divergence measure in the parameter domain. My favorite divergences are Bregman divergences.
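For reference, the standard definition (a general fact about Bregman divergences, not quoted from any one paper): for a strictly convex, differentiable function $F$, the Bregman divergence between parameter vectors $\tilde{w}$ and $w$ is the error of the first-order Taylor expansion of $F$ at $w$:

\[
\Delta_F(\tilde{w}, w) \;=\; F(\tilde{w}) - F(w) - \nabla F(w)^\top (\tilde{w} - w).
\]

Choosing $F(w) = \frac{1}{2}\|w\|_2^2$ gives the squared Euclidean distance, and choosing $F(w) = \sum_i w_i \ln w_i$ on the probability simplex gives the relative entropy.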

A tutorial that shows the usefulness of Bregman divergences

A more recent tutorial (July 25, Dagstuhl, Germany) that contrasts the two main families of algorithms: the geometric and the information-theoretic algorithms. The first family is motivated by using the squared Euclidean distance as the parameter divergence, the second by the relative entropy.

A paper that contains some of the details of how we use Bregman divergences for proving relative loss bounds. In particular, we make the connection between Bregman divergences and the exponential family of distributions.

We did one case of density estimation with an exponential family optimally. In the following paper we prove relative expected loss bounds for the last trial. We use the leave-one-out loss to do this.

The first paper in which we derive learning algorithms in a simple framework trading off the divergence in the parameter domain against the divergence in the labelling domain. The focus is on linear regression. We analyse the gradient descent algorithm (Widrow-Hoff) and a corresponding algorithm that does multiplicative updates to its weights, which is called the exponentiated gradient algorithm (a short sketch of both updates follows these summaries).

In the following paper we discuss when gradient descent and related algorithms can be much worse than the new exponentiated gradient algorithm.

In the following paper we prove relative loss bounds for single neurons with one output. This includes one-class logistic regression. We also introduce a notion of "matching loss" associated with any increasing transfer function. When this loss is used, the algorithms can be analysed.

We have generalized the above to multiclass outputs, including multiclass logistic regression. In this paper we also make a connection between the matching loss and Bregman divergences.

We have also applied our methods for proving relative loss bounds to classification with a linear threshold function. We analyze the Perceptron algorithm as well as normalized Winnow. The bounds are expressed in terms of an average margin of the examples. Previous methods (including the Perceptron Convergence Theorem) used worst-case margins.

A paper that gives a partial Bayesian interpretation for Winnow-like updates for learning disjunctions.
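Here is a toy sketch of the two update rules for on-line linear regression with the square loss. Variable names and learning rates are mine; the papers analyse these updates with carefully tuned rates.

```python
import numpy as np

def widrow_hoff_update(w, x, y, eta=0.02):
    """Gradient descent (Widrow-Hoff): additive update of the weights.
    Motivated by the squared Euclidean distance as parameter divergence."""
    y_hat = w @ x
    return w - eta * (y_hat - y) * x

def eg_update(w, x, y, eta=0.1):
    """Exponentiated gradient (EG): multiplicative update followed by
    normalization.  Motivated by the relative entropy as parameter
    divergence; the weights stay a probability vector."""
    y_hat = w @ x
    v = w * np.exp(-eta * (y_hat - y) * x)
    return v / v.sum()

# Toy usage: track a sparse target with both algorithms.
rng = np.random.default_rng(0)
n, T = 20, 500
target = np.zeros(n); target[3] = 1.0   # sparse targets favor EG
w_gd = np.zeros(n)
w_eg = np.full(n, 1.0 / n)              # EG starts at the uniform vector
for _ in range(T):
    x = rng.standard_normal(n)
    y = target @ x
    w_gd = widrow_hoff_update(w_gd, x, y)
    w_eg = eg_update(w_eg, x, y)
```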

Bregman divergences can also be used to derive Boosting algorithms.

An older paper for finding a distribution w.r.t. equality constraints. Among all distributions that satisfy the constraints, the one of maximum entropy (i.e. minimum relative entropy to the uniform distribution) is targeted. This amounts to a projection onto the constraints where the divergence function is the relative entropy. (The algorithms actually do an approximate projection motivated by gradient descent.)

My most recent work on boosting. We start with a linear programming problem for maximizing the margin and then add a relative entropy as a regularization.
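In generic form (standard convex-programming notation, not quoted from the paper), such a projection picks, among all distributions $d$ satisfying the equality constraints, the one closest to a prior $d^0$ in relative entropy:

\[
\min_{d}\; \sum_i d_i \ln \frac{d_i}{d^0_i}
\quad \text{s.t.}\quad \sum_i d_i\, a_{ij} = b_j \ \text{for all } j,
\qquad \sum_i d_i = 1,\ \ d_i \ge 0.
\]

With $d^0$ uniform, minimizing this relative entropy is equivalent to maximizing the entropy of $d$.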

In the following paper (preliminary version) we explain the simplest learning algorithms in terms of link functions and prove relative loss bounds for continuous-time versions of the algorithms. We also discuss Bregman divergences at a high level and introduce the dual updates.
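The link-function view can be summarized in one generic update (standard form, with loss $L_t$ and learning rate $\eta$ assumed): the link $f = \nabla F$ maps the weights into a dual space where the update is additive,

\[
f(w_{t+1}) \;=\; f(w_t) - \eta\, \nabla L_t(w_t).
\]

The identity link recovers gradient descent, while the componentwise logarithm as link yields the multiplicative EG-style updates.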

In the following two papers we develop and analyze algorithms for a simple mixture problem which can be used to predict stock-market portfolios.
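One style of algorithm here is a multiplicative update of the portfolio weights. The following is a minimal sketch of that style of update under my own naming, assuming x is the vector of price relatives for one day; it is illustrative, not the papers' exact algorithm.

```python
import numpy as np

def eg_portfolio_update(w, x, eta=0.05):
    """Multiplicative portfolio update: increase the weight of assets whose
    price relative x_i beat the portfolio's return w @ x, then renormalize
    so the weights stay a probability vector."""
    growth = w @ x                     # today's wealth factor of the portfolio
    v = w * np.exp(eta * x / growth)   # exponentiated-gradient step on log-wealth
    return v / v.sum()

# Usage on a synthetic sequence of price-relative vectors (n assets, T days).
rng = np.random.default_rng(1)
n, T = 5, 250
w = np.full(n, 1.0 / n)
wealth = 1.0
for _ in range(T):
    x = rng.lognormal(mean=0.0, sigma=0.01, size=n)  # synthetic price relatives
    wealth *= w @ x
    w = eg_portfolio_update(w, x)
```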

Below are some examples where we applied our method for deriving learning algorithms for more complicated models.

The following paper gives an alternative method for deriving on-line learning algorithms that does not use Bregman divergences. Under the assumption that the current trial is the last one, the algorithm predicts with the minimax optimal parameter.

The following four papers prove bounds when the target is shifting.

The first is a new paper where we do well compared to a shifting sequence of experts drawn from a small pool. We solve an open problem posed by Yoav Freund (and won the $500 prize):

Original paper where we do well compared to a shifting sequence of experts:
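A standard way to track a shifting best expert is to mix a small fraction of the weight back over all experts after each loss update. The following fixed-share-style sketch (my notation, with share rate alpha and learning rate eta assumed) illustrates the idea:

```python
import numpy as np

def shifting_experts_update(w, losses, eta=0.5, alpha=0.01):
    """Loss update followed by a share update.  The loss update downweights
    experts exponentially in their loss on this trial; the share update
    redistributes a fraction alpha of the weight uniformly, so an expert
    that was bad in the past can recover quickly when the best expert
    shifts."""
    v = w * np.exp(-eta * losses)       # loss update
    v /= v.sum()
    n = len(w)
    return (1 - alpha) * v + alpha / n  # share update (uniform mixing)
```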

The following papers focus on the expert model. Here the loss of the algorithm is compared to the loss of the best expert.
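The basic algorithm in this model predicts with a weighted average of the experts' predictions, where the weights decay exponentially in each expert's loss. A minimal sketch (my variable names; the absolute loss and the learning rate eta are assumptions for illustration):

```python
import numpy as np

def expert_predict(w, expert_preds):
    """Predict with the weighted average of the experts' predictions."""
    return w @ expert_preds

def expert_update(w, expert_preds, outcome, eta=2.0):
    """Downweight each expert exponentially in its loss on this trial."""
    losses = np.abs(expert_preds - outcome)  # absolute loss, for illustration
    v = w * np.exp(-eta * losses)
    return v / v.sum()
```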

In the following two papers we prove relative loss bounds for Temporal Difference Learning.

The first paper gives an alternative to TD(lambda) for which we can prove strong relative loss bounds.

The second paper derives a new second-order algorithm that is essentially a generalization of on-line linear least squares.
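For reference, the classical TD(lambda) update that both papers relate to, in standard textbook form with a linear value function and eligibility traces (this is not the new algorithms from the papers):

```python
import numpy as np

def td_lambda_step(w, z, phi, phi_next, reward, gamma=0.9, lam=0.8, eta=0.05):
    """One step of classical TD(lambda) with linear value function w @ phi.
    z is the eligibility trace, a decayed sum of past feature vectors."""
    delta = reward + gamma * (w @ phi_next) - (w @ phi)  # temporal-difference error
    z = gamma * lam * z + phi                            # decay and refresh the trace
    w = w + eta * delta * z                              # credit past states via z
    return w, z

# Usage: initialize w = np.zeros(d), z = np.zeros(d), then call once per transition.
```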

When learning, for example, orthogonal rectangles, a subsample of up to four examples (the left-, right-, top-, and bottom-most positive examples) represents a hypothesis that is consistent with the whole sample. This is an example of a compression scheme for the concept class of orthogonal rectangles, i.e. a mapping from samples to subsamples that represent a hypothesis consistent with the whole original sample.
One of my favorite conjectures is the following (posted as an open problem on the COLT web page): for any concept class of VC dimension d there is a compression scheme of size d. In the case of orthogonal rectangles the VC dimension is four and the size of the simple compression scheme is four as well. The following paper proves a special case of this conjecture and gives lots of background discussion.
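A minimal sketch of this compression scheme (function names and representation are mine): compress keeps at most four extreme positive examples, and reconstruct maps any subsample back to a hypothesis that is consistent with the whole original sample.

```python
def compress(sample):
    """Keep the left-, right-, top-, and bottom-most positive examples.
    sample: list of ((x, y), label) pairs with label in {0, 1}."""
    pos = [p for p, label in sample if label == 1]
    if not pos:
        return []
    extremes = {min(pos, key=lambda p: p[0]), max(pos, key=lambda p: p[0]),
                min(pos, key=lambda p: p[1]), max(pos, key=lambda p: p[1])}
    return list(extremes)  # at most four points

def reconstruct(subsample):
    """Smallest axis-parallel rectangle containing the subsample, returned
    as a classifier; the empty subsample yields the all-negative hypothesis."""
    if not subsample:
        return lambda p: 0
    xs = [p[0] for p in subsample]; ys = [p[1] for p in subsample]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return lambda p: int(lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y)

# The defining property of a compression scheme: the reconstructed
# hypothesis is consistent with every example of the original sample.
sample = [((1, 2), 1), ((4, 3), 1), ((2, 5), 1), ((0, 0), 0), ((6, 6), 0)]
h = reconstruct(compress(sample))
assert all(h(p) == label for p, label in sample)
```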

The original unpublished manuscript that introduces compression schemes and gives sample complexity bounds.