Recent pubs of Manfred K. Warmuth
We have developed a framework for deriving
and analysing simple online learning algorithms.
A learning problem is given by a parameterized
class of predictors, a divergence measure in the
labelling domain (the loss function),
and a divergence measure in the parameter domain.
My favorite divergences are Bregman divergences.
A tutorial that shows the usefulness of Bregman divergences:

"Proving Relative Loss Bounds for On-Line
Learning Algorithms Using Bregman
Divergences",
COLT 00, June 28 - July 1, 2000, Stanford
University (with C. Gentile)
[PDF file]
[PDF of references]
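The common thread of this work is the Bregman divergence Delta_F(u, w) = F(u) - F(w) - gradF(w).(u - w) of a convex function F. A minimal sketch (NumPy; the function names are mine, not from the tutorial) showing that F(w) = (1/2)||w||^2 yields the squared Euclidean distance, while the negative entropy yields the relative entropy:

```python
import numpy as np

def bregman(F, gradF, u, w):
    """Bregman divergence Delta_F(u, w) = F(u) - F(w) - gradF(w).(u - w)."""
    return F(u) - F(w) - gradF(w) @ (u - w)

# F(w) = 0.5 ||w||^2  ->  (half the) squared Euclidean distance
sq = lambda w: 0.5 * (w @ w)
sq_grad = lambda w: w

# F(w) = sum_i w_i ln w_i (negative entropy)  ->  relative entropy,
# when u and w both lie on the probability simplex
negent = lambda w: np.sum(w * np.log(w))
negent_grad = lambda w: np.log(w) + 1.0

u = np.array([0.2, 0.3, 0.5])
w = np.array([0.4, 0.4, 0.2])

d_euc = bregman(sq, sq_grad, u, w)          # equals 0.5 * ||u - w||^2
d_re = bregman(negent, negent_grad, u, w)   # equals sum_i u_i ln(u_i / w_i)
```

On the simplex the `+ 1.0` term in the entropy gradient cancels, which is why the divergence collapses to the familiar relative entropy.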
A more recent tutorial (July 25, 2001, Dagstuhl,
Germany) that contrasts the two main families of
algorithms: the geometric and the
information-theoretic algorithms.
The first family is motivated by using the
squared Euclidean distance as the parameter
divergence, and the second by the relative
entropy.

"Online Learning - Methods and Open Problems"
Dagstuhl, Germany, July 25, 2001.
[PDF]
A paper that contains some of the details of how we use Bregman divergences
for proving relative loss bounds.
In particular, we make the connection between Bregman divergences
and the exponential family of distributions:

"Relative Loss Bounds for Online Density Estimation with the Exponential
Family of Distributions",
in a special issue of
Machine Learning on
Theoretical Advances in Online Learning, Game
Theory and Boosting,
edited by Yoram Singer, Vol. 43(3), pp. 211-246,
June 2001
(with K. Azoury)
[Postscript file]
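The connection mentioned above can be sketched in one line (notation assumed here: p_theta(x) = exp(theta.x - G(theta)) h(x) is an exponential family with cumulant function G, so that E_theta[x] = grad G(theta)): the KL divergence between two members is the Bregman divergence of G with swapped arguments.

```latex
\mathrm{KL}(p_{\theta} \,\|\, p_{\theta'})
  = \mathbb{E}_{\theta}\!\left[ (\theta - \theta')\cdot x - G(\theta) + G(\theta') \right]
  = G(\theta') - G(\theta) - \nabla G(\theta)\cdot(\theta' - \theta)
  = \Delta_{G}(\theta', \theta).
```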
We solved one case of density estimation with an exponential
family optimally:

"The Minimax Algorithm for Gaussian Density Estimation,"
in COLT 00, June 28 - July 1, 2000, pp.
100-106, Stanford University (with Eiji Takimoto)
[Postscript]
In the following paper we prove relative
expected loss bounds for the last trial.
We use the leave-one-out loss to do this:

"Relative Expected Instantaneous Loss Bounds,"
in a special issue
of the Journal of Computer and System Sciences
for COLT 00,
Vol. 64(1), pp. 76-102, February 2002
(with Juergen Forster)
[Postscript]
The first paper in which we derive learning algorithms in a simple framework
trading off the divergence in the parameter domain against the divergence
in the labelling domain. The focus is on linear regression. We analyse
the gradient descent algorithm (Widrow-Hoff) and a corresponding algorithm
that does multiplicative updates to its weights, called the exponentiated
gradient algorithm.

"Additive Versus Exponentiated Gradient Updates for Linear Prediction,"
in Information and Computation, Vol. 132, No. 1, pp.
1-64, January 1997; an extended abstract appeared in STOC 95 (with J.
Kivinen)
[Postscript file]
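A hedged sketch of the two updates compared in this paper (the learning rate, the sparse target, and the simplex start are illustrative choices of mine): gradient descent takes the additive step w <- w - eta (w.x - y) x, while the exponentiated gradient update is multiplicative and renormalizes onto the simplex.

```python
import numpy as np

def gd_update(w, x, y, eta):
    """Widrow-Hoff (gradient descent) update for linear regression."""
    return w - eta * (w @ x - y) * x

def eg_update(w, x, y, eta):
    """Exponentiated gradient update: a multiplicative step on each
    weight, followed by renormalization onto the probability simplex."""
    v = w * np.exp(-eta * (w @ x - y) * x)
    return v / v.sum()

# A sparse target on the simplex: only the first of 8 weights is relevant.
target = np.zeros(8)
target[0] = 1.0
w_gd = np.full(8, 1 / 8)
w_eg = np.full(8, 1 / 8)
for _ in range(100):            # 100 passes over the 8 unit vectors
    for j in range(8):
        x = np.eye(8)[j]
        y = target @ x
        w_gd = gd_update(w_gd, x, y, eta=0.5)
        w_eg = eg_update(w_eg, x, y, eta=0.5)
```

Both updates recover the target here; the paper's point is that their relative loss bounds scale very differently when few components are relevant.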
In the following paper we discuss when gradient descent and related algorithms
can be much worse than the new exponentiated gradient algorithm.

"The Perceptron Algorithm vs. Winnow: Linear vs. Logarithmic Mistake Bounds
When Few Input Variables are Relevant," in the special issue of Artificial
Intelligence on Relevance, Vol. 97(1-2), pp. 325-343, December 1997.
(with J. Kivinen and P. Auer)
[Postscript file]
In the following paper we prove relative loss bounds for single neurons
with one output. This includes one-class logistic regression. We also introduce
a notion of "matching loss" associated with any increasing transfer function.
When this loss is used, the algorithms can be analysed.

"Relative Loss Bounds for Single Neurons," IEEE Transactions on Neural
Networks, Vol. 10(6), pp.
1291-1304, November 1999 (with D.P. Helmbold and J. Kivinen)
[Postscript file]
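The matching loss of an increasing transfer function phi can be sketched numerically (the integral form below is one common way to state the definition; the step count and test values are mine): for phi the identity it reduces to half the square loss, and for the sigmoid it becomes the binary relative entropy.

```python
import math

def matching_loss(phi, phi_inv, y, y_hat, steps=100000):
    """Matching loss of an increasing transfer function phi:
    the integral of (phi(z) - y) dz from phi_inv(y) to phi_inv(y_hat),
    approximated here with the trapezoid rule."""
    a, b = phi_inv(y), phi_inv(y_hat)
    h = (b - a) / steps
    total = 0.0
    for i in range(steps + 1):
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * (phi(a + i * h) - y)
    return total * h

# phi = identity gives (half) the square loss ...
identity = lambda z: z
half_square = matching_loss(identity, identity, 1.0, 0.25)  # 0.5 * 0.75**2

# ... and phi = sigmoid gives the binary relative entropy.
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
logit = lambda p: math.log(p / (1.0 - p))
y, y_hat = 0.9, 0.3
m = matching_loss(sigmoid, logit, y, y_hat)
rel_ent = y * math.log(y / y_hat) + (1 - y) * math.log((1 - y) / (1 - y_hat))
```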
We have generalized the above to multiclass outputs, including multiclass
logistic regression. In this paper we also make a connection between the
matching loss and Bregman divergences.

"Relative Loss Bounds for Multidimensional Regression Problems,"
Machine Learning,
Vol. 45(3), pp. 301-329
(with J. Kivinen) [PDF file]
We have also applied our methods for proving relative loss bounds to classification
with a linear threshold function. We analyze the Perceptron algorithm as
well as normalized Winnow. The bounds are expressed in terms of an average
margin of the examples. Previous methods (including the Perceptron Convergence
Theorem) used worst-case margins.

"Linear Hinge Loss and Average Margin," in NIPS 98,
pp. 225-231, December 1998 (with C. Gentile)
[Postscript file]
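The distinction matters because a single hard example can ruin the worst-case margin while barely moving the average. A toy illustration (the data and weight vector are made up):

```python
def margins(w, examples):
    """Margins y_i * (w . x_i) of a linear threshold hypothesis w."""
    return [y * sum(wi * xi for wi, xi in zip(w, x)) for x, y in examples]

# One example sits much closer to the decision boundary than the rest,
# so the worst-case margin is far smaller than the average margin.
examples = [((2.0, 1.0), 1), ((1.5, 2.0), 1),
            ((-2.0, -1.5), -1), ((0.1, 0.05), 1)]
w = (1.0, 1.0)
ms = margins(w, examples)
worst = min(ms)                  # 0.15, set by the single hard example
average = sum(ms) / len(ms)      # about 2.54
```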
A paper that gives a partial Bayesian interpretation for Winnow-like updates
for learning disjunctions.

"Direct and Indirect Algorithms for Online Learning of Disjunctions",
in a special issue of Theoretical Computer Science for EUROCOLT 99,
Vol. 284(1), pp. 109-142, July 2002
(with D. P. Helmbold and S. Panizza)
[Postscript file]
Bregman divergences can also be used to derive Boosting algorithms.

"Boosting as Entropy Projection," in COLT 99,
pp. 134-144, July 1999, UC Santa Cruz
(with J. Kivinen)
[PDF file]
An older paper on finding a distribution subject to equality constraints.
Among all distributions that satisfy the constraints, the one of maximum
entropy (i.e. minimum relative entropy to the uniform distribution) is targeted.
This amounts to a projection onto the constraints where the divergence
function is the relative entropy. (The algorithms actually do an approximate
projection motivated by gradient descent.)

"Bounds on Approximate Steepest Descent for Likelihood Maximization in
Exponential Families," IEEE Transactions on Information Theory,
Vol. 40, No. 4, pp. 1215-1220, July 1994 (with N. Cesa-Bianchi and A. Krogh)
[Postscript file]
My most recent work on boosting. We start with a
linear programming problem for maximizing the margin and then add a relative
entropy as a regularization.

"Barrier Boosting," in COLT 00, pp. 170-179,
June 28 - July 1, 2000, Stanford University
(with G. Raetsch, S. Mika, T.
Onoda, S. Lemm, and K.-R. Mueller)
[PDF file]
In the following paper (preliminary version) we explain the simplest
learning algorithms in terms of link functions and prove relative loss
bounds for continuous-time versions of the algorithms. We also discuss Bregman
divergences at a high level and introduce the dual updates.

"Continuous Versus Discrete-Time Nonlinear Gradient Descent: Relative
Loss Bounds and Convergence," in Fifth International Symposium
on Artificial Intelligence and Mathematics,
Florida, January 1997 (with
A. K. Jagota)
[Postscript file]
In the following two papers we develop and analyze algorithms for a
simple mixture problem, which can be used to predict stock-market portfolios.

"A comparison of new and old algorithms for a mixture estimation problem,"
Machine Learning, Vol. 27(1), pp. 97-119, 1997 (with D.P. Helmbold,
R.E. Schapire, and Y. Singer)
[Postscript file]

"Online portfolio selection using multiplicative updates," Mathematical
Finance, 8(4), pp. 325-347, 1998 (with D.P. Helmbold, R.E. Schapire,
and Y. Singer)
[Postscript file]
Below are some examples where we applied our method for deriving learning
algorithms to more complicated models.

"Training Algorithms for Hidden Markov Models Using Entropy Based Distance
Functions," in Advances in Neural Information Processing Systems 9 (NIPS `96),
Morgan Kaufmann Publishers, pp. 641-647 (with Y. Singer)
[Postscript file]

Slides of talk on the new learning algorithms for Hidden Markov Models.
[Postscript file]

"Batch and Online Parameter Estimation of
Gaussian Mixtures Based on the Joint Entropy",
in Advances in Neural Information Processing Systems 11 (NIPS'98),
MIT Press, Cambridge, MA, pp. 578-584
(with Y. Singer)
[PDF file]
The following paper gives an alternative
method for deriving online learning
algorithms that does not use Bregman
divergences. Under the assumption that
the current trial is the last one, the
algorithm predicts with the minimax-optimal
parameter.

"The Last-Step Minimax Algorithm",
ALT 00, Springer Verlag Lecture Notes in Artificial Intelligence,
vol. 1968, pp. 279-290, December 2000 (with Eiji Takimoto)
[Postscript]
The following four papers prove bounds when the target is shifting.
The first is a new paper where we do
well compared to a shifting sequence of
experts drawn from a small pool.
We solve an open problem posed by Yoav
Freund (and won the $500 prize):

"Tracking a Small Set of Experts by Mixing
Past Posteriors,"
COLT 2001
(with Olivier Bousquet)
[PDF of Talk]
Journal of Machine Learning Research,
Vol. 3(Nov):363-396, 2002
[PDF of Paper]
The original paper where we do well compared
to a shifting sequence of experts:

"Tracking the Best Expert,"
Machine Learning, Vol. 32(2), pp. 151-178,
August 1998 (with Mark Herbster)
[PDF file]
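The style of update used for tracking can be sketched as a multiplicative loss update followed by a share step (the parameter values and the loss sequence are illustrative; this is a sketch of the idea, not the paper's exact algorithm):

```python
import math

def fixed_share_update(w, losses, eta, alpha):
    """One trial: multiplicative loss update (exponential weights),
    then share a fraction alpha of each weight equally among the
    other experts so the algorithm can recover when the best expert
    changes."""
    n = len(w)
    v = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
    z = sum(v)
    v = [vi / z for vi in v]
    return [(1 - alpha) * vi + alpha * (1.0 - vi) / (n - 1) for vi in v]

# Expert 0 is best for the first 30 trials, expert 2 afterwards.
w = [1 / 3] * 3
for t in range(60):
    best = 0 if t < 30 else 2
    losses = [0.0 if i == best else 1.0 for i in range(3)]
    w = fixed_share_update(w, losses, eta=1.0, alpha=0.05)
```

Without the share step the weight of a discarded expert can shrink so small that it takes too long to recover; sharing keeps a floor under every expert's weight.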

"Tracking the Best Disjunction," Machine Learning, Vol.
32(2), pp. 127-150, August 1998 (with P. Auer)
[Postscript file]

"Tracking the Best Linear Predictor,"
Journal of Machine Learning Research,
Vol. 1, pp. 281-309, September 2001 (with Mark Herbster)
[Postscript file]
The following papers focus on the expert model. Here the loss of the
algorithm is compared to the loss of the best expert.

"Averaging Expert Predictions," in EUROCOLT 99,
Springer Verlag, pp. 153-167, March 1999
(with J. Kivinen)
[PDF file]

"Predicting Nearly as Well as the Best Pruning of a Planar Decision Graph",
Theoretical Computer Science,
Vol. 288(2), pp. 217-235, 2002
(with E. Takimoto)
[Postscript file]

"Theory and Applications of Predictors that Specialize," in Twenty-ninth Annual
ACM Symposium on Theory of Computing (STOC
97), pp. 334-343
(with Y. Freund, R.E. Schapire and Y. Singer)
[Postscript file]

"How to Use Expert Advice," in Journal of the ACM, Vol. 44(3), pp.
427-485, May 1997 (with N. Cesa-Bianchi, Y. Freund, D.P. Helmbold, D.
Haussler, and R.E. Schapire)
[Postscript file]

"Sequential Prediction of Individual Sequences Under General Loss Functions,"
IEEE Transactions on Information Theory, Vol. 44(5), pp. 1906-1925, September
1998 (with D. Haussler and J. Kivinen)
[Postscript file]

"The Weighted Majority Algorithm," in Information and Computation,
Vol. 108, No. 2, pp. 212-261, February 1994. An extended abstract appeared
in COLT 89 (with N. Littlestone)
[Postscript file]
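A minimal sketch of the Weighted Majority algorithm (the expert predictions below are made up; beta = 1/2 is the classic penalty factor):

```python
def weighted_majority(expert_preds, labels, beta=0.5):
    """Weighted Majority: predict the weighted vote of the experts,
    then multiply the weight of each expert that erred by beta.
    Returns the master algorithm's mistake count."""
    n = len(expert_preds[0])
    w = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, labels):
        vote_for_1 = sum(wi for wi, p in zip(w, preds) if p == 1)
        guess = 1 if vote_for_1 >= 0.5 * sum(w) else 0
        if guess != y:
            mistakes += 1
        w = [wi * (beta if p != y else 1.0) for wi, p in zip(w, preds)]
    return mistakes

# Three experts; expert 0 is always right, the other two are noisy.
trials = [([1, 0, 1], 1), ([0, 0, 1], 0), ([1, 1, 0], 1),
          ([0, 1, 1], 0), ([1, 0, 0], 1)]
expert_preds = [p for p, _ in trials]
labels = [y for _, y in trials]
m = weighted_majority(expert_preds, labels)
```

On this toy sequence the perfect expert quickly dominates the vote, so the master makes no mistakes.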
In the following two papers we prove relative loss bounds for
temporal difference learning.
The first paper gives an alternative to TD(lambda) for which we
can prove strong relative loss bounds:

"On the Worst-case Analysis of Temporal-difference Learning Algorithms,"
Machine Learning, Vol. 22(1-3), pp. 95-121 (with
R. Schapire) [Postscript]
The second paper derives a new second-order algorithm
that is essentially a generalization of online linear least squares:

"Relative Loss Bounds for Temporal-Difference Learning,"
Machine Learning,
Vol. 51(1), pp. 23-50
(with Juergen Forster)
[Postscript]
When learning, for example,
orthogonal rectangles, a subsample
of up to four examples (the left-, right-,
top- and bottom-most positive examples) represents a
hypothesis that is consistent with
the whole sample.
This is an example of a
compression scheme
for the concept class of orthogonal rectangles,
i.e. a mapping from samples to subsamples
that represent a hypothesis consistent with
the whole original sample.
One of my favorite conjectures is the
following
(posted as an open problem on the COLT web page):
for any concept class of VC dimension d
there is a compression scheme of size d. In
the case of orthogonal rectangles the VC
dimension is four and the size of the simple
compression scheme is four as well.
The following paper proves a special case
of this conjecture and gives extensive
background discussion.

"Sample Compression, Learnability, and the VapnikChervonenkis Dimension",
Machine Learning, Vol. 21(3), pp. 269-304, December 1995 (with Sally Floyd)
[Postscript file]
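The rectangle example above can be sketched directly (the helper names are mine): compress a labeled sample to at most four extreme positive points, reconstruct the smallest enclosing rectangle, and check consistency with the full sample.

```python
def compress(sample):
    """Keep at most four positive examples: the left-, right-, top- and
    bottom-most ones.  `sample` is a list of ((x, y), label) pairs."""
    pos = [p for p, label in sample if label]
    if not pos:
        return []
    keep = {min(pos, key=lambda p: p[0]), max(pos, key=lambda p: p[0]),
            min(pos, key=lambda p: p[1]), max(pos, key=lambda p: p[1])}
    return list(keep)

def reconstruct(subsample):
    """Smallest axis-parallel rectangle containing the subsample,
    returned as a membership predicate."""
    if not subsample:
        return lambda p: False
    xs = [p[0] for p in subsample]
    ys = [p[1] for p in subsample]
    lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
    return lambda p: lo_x <= p[0] <= hi_x and lo_y <= p[1] <= hi_y

# A sample labeled by the target rectangle [1, 4] x [1, 3]:
rect = lambda p: 1 <= p[0] <= 4 and 1 <= p[1] <= 3
points = [(2, 2), (1, 3), (4, 1), (0, 0), (5, 2), (3, 3), (2, 0)]
sample = [(p, rect(p)) for p in points]
h = reconstruct(compress(sample))
consistent = all(h(p) == label for p, label in sample)
```

The reconstructed rectangle lies inside the target (it is the smallest one containing the positives), so it classifies every example in the original sample correctly.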
The original unpublished manuscript that introduces
compression schemes and gives sample
complexity bounds:

"Relating Data Compression and
Learnability,"
unpublished manuscript, June 10, 1986
(with Nick Littlestone)
[PDF file]