Recent pubs of Manfred K. Warmuth
We have developed a framework for deriving
and analysing simple on-line learning algorithms.
A learning problem is given by a parameterized
class of predictors, a divergence measure in the
labelling domain (loss function)
and a divergence measure in the parameter domain.
My favorite divergences are Bregman divergences.
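For concreteness, here is a minimal sketch (not code taken from any of the papers) of how a Bregman divergence is computed from a convex function F; the squared Euclidean distance and the relative entropy below are the two instances used most often in this work.

    import numpy as np

    def bregman(F, grad_F, u, v):
        # Bregman divergence induced by the convex function F:
        #   B_F(u, v) = F(u) - F(v) - <grad F(v), u - v>
        return F(u) - F(v) - np.dot(grad_F(v), u - v)

    # F(w) = 0.5 * ||w||^2 gives the squared Euclidean distance.
    squared_euclidean = lambda u, v: bregman(
        lambda w: 0.5 * np.dot(w, w), lambda w: w, u, v)

    # F(w) = sum_i w_i log(w_i) gives the (unnormalized) relative entropy.
    relative_entropy = lambda u, v: bregman(
        lambda w: np.sum(w * np.log(w)), lambda w: np.log(w) + 1.0, u, v)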
A tutorial that shows the usefulness of Bregman divergences.
A more recent tutorial (July 25, 2001, Dagstuhl,
Germany) that contrasts the two main families of
algorithms: the geometric and the
information theoretic algorithms.
The first family is motivated by using the
squared Euclidean distance as the parameter
divergence and the second family by the relative entropy.
"Proving Relative Loss Bounds for On-Line
Learning Algorithms Using Bregman
COLT 00, June 28 - July 1, 2000, Stanford
University.(with C. Gentile)
[PDF of references]
A paper that contains some of the details of how we use Bregman divergences
for proving relative loss bounds.
In particular, we make the connection between Bregman divergences
and the exponential family of distributions.
"On-line Learning - Methods and Open Problems"
Dagstuhl, Germany, July 25, 2001.
We did one case of density estimation with an exponential family of distributions:
"Relative Loss Bounds for On-line Density Estimation with the Exponential
Family of Distributions",
in a special issue of the
Journal of Machine Learning on
Theoretical Advances in On-line Learning, Game
Theory and Boosting,
edited by Yoram Singer, Vol. 43(3), pp. 211--246,
(with K. Azoury)
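In simplified notation (a sketch of the connection, not the paper's exact statement): for an exponential family p_theta(x) = exp(theta . x - G(theta)) h(x) with cumulant function G, the relative entropy between two members is a Bregman divergence between their natural parameters,

    KL( p_{theta_1} || p_{theta_2} ) = Delta_G(theta_2, theta_1),
    where Delta_G(u, v) = G(u) - G(v) - grad G(v) . (u - v).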
In the following paper we prove relative
expected loss bounds for the last trial.
We use the leave-one-out loss to do this.
"The Minimax Algorithm for Gaussian Density Estimation,"
in COLT 00, June 28 - July 1, 2000, pp.
100-106, Stanford University (with Eiji Takimoto)
The first paper in which we derive learning algorithms in a simple framework
trading off the divergence in the parameter domain with the divergence
in the labelling domain. The focus is on linear regression. We analyse
the gradient descent algorithm (Widrow-Hoff) and a corresponding algorithm
that does multiplicative updates to its weights, which is called the
exponentiated gradient algorithm.
"Relative Expected Instantaneous Loss Bounds,"
in a special issue
of the Journal of Computer and System Sciences,
Vol. 64-1, pp. 76-102, February 2002
(with Juergen Forster)
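To make the two update families concrete, here is a simplified sketch for on-line linear regression with squared loss and prediction y_hat = w . x (constants from the gradient are absorbed into eta; the exact algorithms and their loss bounds are in the papers).

    import numpy as np

    def widrow_hoff_update(w, x, y, eta):
        # Gradient descent (Widrow-Hoff): additive update, motivated by the
        # squared Euclidean distance as the parameter divergence.
        return w - eta * (np.dot(w, x) - y) * x

    def eg_update(w, x, y, eta):
        # Exponentiated gradient: multiplicative update, motivated by the
        # relative entropy; the weights stay positive and sum to one.
        v = w * np.exp(-eta * (np.dot(w, x) - y) * x)
        return v / np.sum(v)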
In the following paper we discuss when gradient descent and related algorithms
can be much worse than the new exponentiated gradient algorithm.
"Additive Versus Exponentiated Gradient Updates for Linear Prediction,"
in Journal of Information and Computation, vol. 132, no. 1, pp.
1-64, January 1997, an extended abstract appeared in STOC 95 (with J.
In the following paper we prove relative loss bounds for single neurons
with one output. This includes one-class logistic regression. We also introduce
a notion of "matching loss" associated with any increasing transfer function.
When this loss is used the algorithms can be analysed.
"The Perceptron Algorithm vs. Winnow: Linear vs. Logarithmic Mistake Bounds
When Few Input Variables are Relevant," in the special issue of Artificial
Intelligence on Relevance, Vol. 97(1--2), pp. 325--343, December 1997.
(with J. Kivinen and P. Auer)
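Roughly, the matching loss for an increasing transfer function phi is an integral of phi between the two pre-images; a small sketch (notation mine) that can be evaluated numerically:

    from scipy.integrate import quad

    def matching_loss(phi, a, a_hat):
        # M_phi(a, a_hat) = integral from a_hat to a of (phi(z) - phi(a_hat)) dz,
        # where a and a_hat are the pre-images of the desired and predicted outputs.
        value, _ = quad(lambda z: phi(z) - phi(a_hat), a_hat, a)
        return value

    # phi(z) = z recovers half the squared loss; the logistic transfer function
    # gives the relative entropy between the corresponding Bernoulli outcomes.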
We have generalized the above to multiclass outputs including multiclass
logistic regression. In this paper we also make a connection between the
matching loss and Bregman divergences.
"Relative Loss Bounds for Single Neurons," IEEE Transactions on Neural
Networks, Vol. 10(6), pp.
1291--1304, November 1999 (with D.P. Helmbold and J. Kivinen)
We also have applied our methods for proving relative loss bounds to classification
with a linear threshold function. We analyze the Perceptron algorithm as
well as normalized Winnow. The bounds are expressed in terms of an average
margin of the examples. Previous methods (including the Perceptron Convergence
Theorem) used worst-case margins.
"Relative Loss Bounds for Multidimensional Regression Problems."
Journal of Machine Learning
Vol. 45(3), pp. 301-329
(with J. Kivinen) [PDF file]
A paper that gives a partial Bayesian interpretation for Winnow-like updates
for learning disjunctions.
"Linear Hinge Loss and Average Margin," in NIPS 98,
pp. 225-231, December 1998 (with C. Gentile)
"Direct and Indirect Algorithms for On-line Learning of Disjunctions",
In a special issue of Theoretical Computer Science for EUROCOLT 99,
Vol. 284(1), pp. 109-142, July 2002
(with D. P. Helmbold and S. Panizza)
Bregman divergences can also be used to derive Boosting algorithms.
An older paper for finding a distribution w.r.t. equality constraints.
Among all distributions that satisfy the constraints, the one of maximum
entropy (i.e. min. relative entropy to the uniform distribution) is targeted.
This amounts to a projection onto the constraints where the divergence
function is the relative entropy. (The algorithms actually do an approximate
projection motivated by gradient descent.)
"Boosting as Entropy Projection," in COLT 99,
pp. 134-144, July 1999, UC Santa Cruz
(with J. Kivinen)
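A minimal sketch (notation mine) of the exponential update viewed as an approximate entropy projection: d is the current distribution on the examples and u_i is +1 or -1 depending on whether example i is classified correctly by the current weak hypothesis; the constraint is sum_i d_i u_i = 0.

    import numpy as np

    def approx_entropy_projection(d, u, alpha):
        # Multiplicative update toward the constraint; choosing alpha so that the
        # constraint holds exactly after the update gives the exact relative-entropy
        # projection, a fixed alpha gives an approximate one.
        d_new = d * np.exp(-alpha * u)
        return d_new / d_new.sum()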
My most recent work on boosting. We start with a
linear programming problem for maximizing the margin and then add a relative
entropy as a regularization.
"Bounds on Approximate Steepest Descent for Likelihood Maximization in
Exponential Families," IEEE Transaction on Information Theory,
40, No. 4, pp. 1215--1220 (July 1994) (with N. Cesa-Bianchi and A. Krogh
"Barrier Boosting," in COLT 00, pp. 170-179,
June 28 - July 1, 2000, Stanford University
(with G. Raetsch, S. Mika, T.
Onoda, S. Lemm, K. R. Mueller)
In the following paper (preliminary version) we explain the simplest
learning algorithms in terms of link functions and prove relative loss
bounds for continuous time versions of the algorithms. We also discuss Bregman
divergences at a high level and introduce the dual updates.
"Continuous Versus Discrete-Time Non-linear Gradient Descent: Relative
Loss Bounds and Convergence," in Fifth International Symposium
on Artificial Intelligence and Mathematics,
Florida, January 1997 (with
A. K. Jagota)
In the following two papers we develop and analyze algorithms for a
simple mixture problem which can be used to predict stock market portfolios.
"A comparison of new and old algorithms for a mixture estimation problem,"
Journal of Machine Learning, Vol. 27(1), pp. 97-119, 1997 (with D.P. Helmbold,
R.E. Schapire, and Y. Singer)
"On-line portfolio selection using multiplicative updates," Mathematical
Finance, 8(4), pp. 325--347, 1998 (with D.P. Helmbold, R.E. Schapire,
and Y. Singer)
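A sketch of a multiplicative update of the kind analyzed in these papers (simplified notation): w holds the current fractions of wealth per stock, x the price relatives for the trading period.

    import numpy as np

    def multiplicative_portfolio_update(w, x, eta):
        r = np.dot(w, x)              # wealth growth factor for the period
        v = w * np.exp(eta * x / r)   # multiplicative (EG-style) update
        return v / v.sum()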
Below are some examples where we applied our method for deriving learning
algorithms for more complicated models.
"Training Algorithms for Hidden Markov Models Using Entropy Based Distance
Function," in Advances in Neural Information Processing Systems 9 (NIPS `96),
Morgan Kaufmann Publishers, pp. 641-647 (with Y. Singer)
Slides of talk on the new learning algorithms for Hidden Markov Models.
"Batch and On-line Parameter Estimation of
Gaussian Mixtures Based on the Joint Entropy",
in Advances in Neural Information Processing Systems 11 (NIPS'98),
MIT Press, Cambridge, MA, pp. 578--584
(with Y. Singer)
The following paper gives an alternative
method for deriving on-line learning
algorithms that does not use Bregman
divergences. Under the assumption that
the current trial is the last one, the
algorithm predicts with the minimax optimal prediction.
"The Last-step Minimax Algorithm",
ALT 00, Springer Verlag Lecture Notes in AI,
vol. 1968, pp. 279--290, December 2000 (with Eiji Takimoto)
The following four papers prove bounds when the target is shifting.
The first is a new paper where we do
well compared to a shifting sequence of
experts that are from a small pool.
We solve an open problem posed by Yoav
Freund (and won the $500 prize):
"Tracking a Small Set of Experts by Mixing Past Posteriors,"
(with Olivier Bousquet)
[PDF of Talk]
Journal of Machine Learning Research,
Vol. 3(Nov):363-396, 2002
[PDF of Paper]
Original paper where we do well compared
to a shifting sequence of experts:
"Tracking the Best Expert,"
Journal of Machine Learning, Vol. 32(2), pp. 151-178,
August 1998 (with Mark Herbster)
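A sketch of a Fixed-Share-style update in the spirit of this line of work (the simple uniform-mixing variant; the papers analyze such updates in detail): a loss update followed by a share update that lets the algorithm switch to a different expert.

    import numpy as np

    def fixed_share_update(w, losses, eta, alpha):
        v = w * np.exp(-eta * losses)               # loss update (exponential weights)
        v = v / v.sum()
        return (1.0 - alpha) * v + alpha / len(v)   # share update: mix in uniform weight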
"Tracking the Best Disjunction," Journal of Machine Learning Vol.
32(2), pp. 127-150, August 1998 (with P. Auer)
"Tracking the Best Linear Predictor,"
Journal of Machine Learning Research,
Vol. 1, pp. 281-309, September 2001 (with Mark Herbster)
The following papers focus on the expert model. Here the loss of the
algorithm is compared to the loss of the best expert.
"Averaging Expert Predictions," in EUROCOLT 99,
Springer Verlag, pp. 153--167, March 1999
(with J. Kivinen)
"Predicting Nearly as Well as the Best Pruning of a Planar Decision Graph",
Theoretical Computer Science
288(2): 217-235 (2002)
(with E. Takimoto)
"Theory and Applications of Predictors that Specialize," in Twentyninth Annual
ACM Symposium on Theory of Computing (STOC
97),, pp. 334--343
(with Y. Freund, R.E. Schapire and Y. Singer)
"How to Use Expert Advice," in Journal of the ACM, Vol. 44(3), pp.
427-485, May 1997 (with N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D.
Haussler, and R.E. Schapire) [Postscript]
"Sequential Prediction of Individual Sequences Under General Loss Functions,"
IEEE Transactions on Information Theory, Vol. 44(5), pp. 1906-1925, September
1998 (with D. Haussler and J. Kivinen) [Postscript]
"The Weighted Majority Algorithm," in Information and Computation,
Vol. 108, No. 2, pp. 212-261 (February 1, 1994). An extended abstract appeared
in COLT 89 (with N. Littlestone) [Postscript]
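For reference, the basic (binary) Weighted Majority update in a small sketch: experts that err have their weights multiplied by a factor beta in (0,1), and the prediction is a weighted vote.

    import numpy as np

    def wm_predict(w, expert_predictions):          # predictions in {0, 1}
        return int(np.dot(w, expert_predictions) >= 0.5 * w.sum())

    def wm_update(w, expert_predictions, outcome, beta=0.5):
        # Multiply the weight of every expert that predicted incorrectly by beta.
        return np.where(expert_predictions != outcome, beta * w, w)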
In the following two papers we prove relative loss bounds for
Temporal Difference Learning.
The first paper gives an alternative to TD(lambda) for which we
can prove strong relative loss bounds.
The second paper derives a new second-order algorithm.
"On the Worst-case Analysis of Temporal-difference Learning Algorithms,"
Journal of Machine Learning, Vol. 22 (1,2,3), pp. 95-121 (with
R. Schapire) [Postscript]
This second-order algorithm is essentially a generalization of on-line linear least squares:
"Relative Loss Bounds for Temporal-Difference Learning,"
Journal of Machine Learning,
Vol. 51-1, pp. 23-50,
(with Juergen Forster)
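For context, the textbook TD(lambda) update with a linear value function V(s) = w . phi(s) and an eligibility trace e (a sketch only; the two papers analyze their own variants of this kind of update):

    import numpy as np

    def td_lambda_step(w, e, phi_s, phi_s_next, reward, gamma, lam, eta):
        e = gamma * lam * e + phi_s                                         # eligibility trace
        delta = reward + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)   # TD error
        return w + eta * delta * e, e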
When learning, for example,
orthogonal rectangles, a subsample
of up to four examples (the left-, right-,
top- and bottom-most positive examples) represents a
hypothesis that is consistent with
the whole sample.
This is an example of a sample compression scheme
for the concept class of orthogonal rectangles,
i.e. a mapping from samples to subsamples
that represent a hypothesis consistent with
the whole original sample.
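A small sketch of this compression scheme (assuming the sample is consistent with some axis-aligned rectangle): compress keeps at most four positive examples and reconstruct returns the smallest rectangle containing them.

    def compress(sample):                        # sample: list of ((x, y), label) pairs
        pos = [p for p, label in sample if label]
        if not pos:
            return []
        return [min(pos, key=lambda p: p[0]), max(pos, key=lambda p: p[0]),
                min(pos, key=lambda p: p[1]), max(pos, key=lambda p: p[1])]

    def reconstruct(subsample):                  # hypothesis consistent with the whole sample
        if not subsample:
            return lambda p: False
        xs = [p[0] for p in subsample]
        ys = [p[1] for p in subsample]
        return lambda p: min(xs) <= p[0] <= max(xs) and min(ys) <= p[1] <= max(ys)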
One of my favorite conjectures is the following
(posted as an open problem on the COLT web page):
for any concept class of VC dimension d
there is a compression scheme of size d. In
the case of orthogonal rectangles the VC
dimension is four and the size of the simple
compression scheme is four as well.
The following paper proves a special case
of this conjecture and gives lots of examples.
The original unpublished manuscript that introduces
compression schemes and gives sample complexity bounds based on them.
"Sample Compression, Learnability, and the Vapnik-Chervonenkis Dimension",
Machine Learning, Vol.21 (3), pp. 269--304, December 1995 (with Sally Floyd)
"Relating Data Compression and
unpublished manuscript, June 10, 1986
(with Nick Littlestone)