
7 Conclusions and future research

For applications that can afford the computing cost, Dirichlet mixture regularizers are clearly the best choice. In fact, they are so close to the theoretical optimum for regularizers that there seems to be little point in searching for better ones. Other evaluations of regularizers, based on searches in biological contexts, have also found Dirichlet mixtures to be superior [17,11], validating the more information-theoretic approach taken here.
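To make the computation concrete, the sketch below shows the standard way a trained Dirichlet mixture turns observed counts into estimated probabilities: each component's posterior weight is computed from the counts, and the final estimate is the weighted average of the components' posterior means. This is a minimal illustration, not code from this report; the variable names (q for the mixture coefficients, alphas for the component parameter vectors) are mine.

```python
from math import exp, lgamma, log

def dirichlet_mixture_estimate(counts, q, alphas):
    """Estimate a column's amino-acid distribution from observed counts
    with a trained Dirichlet mixture.

    counts -- observed counts for the 20 amino acids
    q      -- mixture coefficients, one per component
    alphas -- Dirichlet parameter vectors, one per component
    """
    n_total = sum(counts)

    def log_evidence(alpha):
        # log P(counts | component), omitting terms that depend only on the
        # counts themselves and therefore cancel when the component weights
        # are normalized below.
        a_total = sum(alpha)
        val = lgamma(a_total) - lgamma(n_total + a_total)
        for n_i, a_i in zip(counts, alpha):
            val += lgamma(n_i + a_i) - lgamma(a_i)
        return val

    # Posterior weight of each mixture component given the observed counts.
    log_w = [log(q_j) + log_evidence(a_j) for q_j, a_j in zip(q, alphas)]
    shift = max(log_w)
    w = [exp(lw - shift) for lw in log_w]   # shift to avoid underflow
    w_sum = sum(w)
    w = [w_j / w_sum for w_j in w]

    # Each component predicts (n_i + alpha_i) / (|n| + |alpha|); the final
    # estimate is the weighted average of those predictions.
    estimate = [0.0] * len(counts)
    for w_j, a_j in zip(w, alphas):
        a_total = sum(a_j)
        for i in range(len(counts)):
            estimate[i] += w_j * (counts[i] + a_j[i]) / (n_total + a_total)
    return estimate
```

For a mixture of nine or more components this costs on the order of a few hundred lgamma calls per column, which is what makes the method comparatively expensive when it must be run inside a tight loop.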

For applications in which there is little data to train a regularizer, pseudocounts are probably the best choice, as they perform reasonably well with few parameters. Dirichlet mixtures and substitution matrices have comparable numbers of parameters and so require comparable amounts of training data. If the regularizers do not need to be re-evaluated frequently, then Dirichlet mixtures are the preferred choice.
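By contrast, a pseudocount regularizer is just a single vector of 20 offsets added to the counts before normalizing, which is why it can be trained from very little data. A minimal sketch, assuming the trained offset vector z is given:

```python
def pseudocount_estimate(counts, z):
    """Add a trained pseudocount vector z to the observed counts and
    renormalize to get the estimated amino-acid probabilities."""
    total = sum(counts) + sum(z)
    return [(n_i + z_i) / total for n_i, z_i in zip(counts, z)]
```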

Although most applications (such as training hidden Markov models or building profiles from multiple alignments) do not require frequent evaluation of regularizers, there are some applications (such as Gibbs sampling) that require recomputing the regularizers inside an inner loop. For these applications, the substitution matrix plus pseudocounts plus scaled counts is probably the best choice, as it has only about 0.03 bits more excess entropy than the Dirichlet mixtures, but does not require evaluating Gamma functions.
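The speed difference comes from the fact that such a regularizer needs only additions, multiplications, and a single matrix-vector product per column. The sketch below shows the general shape of this kind of estimate; the exact way the report combines the three ingredients, and the parameter names used here (z for the pseudocount offsets, scale for the count scaling, M for the substitution-derived matrix), are illustrative rather than quoted from the earlier sections.

```python
def matrix_pseudocount_estimate(counts, z, scale, M):
    """Estimate probabilities from observed counts using fixed pseudocounts z,
    counts scaled by `scale`, and extra pseudocounts propagated through a
    substitution-style matrix M (M[j][i] ~ expected fraction of amino acid i
    given an observed amino acid j).  Only simple arithmetic is needed --
    no lgamma calls -- so it stays cheap inside an inner loop such as a
    Gibbs sampler."""
    k = len(counts)
    x = [z[i] + scale * counts[i] for i in range(k)]
    for j in range(k):
        if counts[j]:
            for i in range(k):
                x[i] += counts[j] * M[j][i]
    total = sum(x)
    return [x_i / total for x_i in x]
```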

One weakness of the empirical analysis done in this report is that all the data were taken from the BLOCKS database, which contains only highly conserved blocks. While this gives us high confidence in the alignments, it also means that the regularizers do not have to do much work. The appropriate regularizers for more variable columns may look somewhat different, though one would expect the pseudocount and substitution-matrix methods to degrade more than the Dirichlet mixtures, which naturally handle high variability.

To get significantly better performance than a Dirichlet mixture regularizer, we need to incorporate more information than just the sample of amino acids seen in the context. There are two ways to do this: one uses more information about the column (such as solvent accessibility or secondary structure) and the other uses more information about the sequence (such as a phylogenetic tree relating it to other sequences).

Using extra information about a column could improve the performance of a regularizer up to the ``full'' row shown in Table 1, but no more, since the full row assumes that the extra information uniquely identifies the column. About 0.6 bits could be gained by using such information (relative to a sample size of 5), far more than the difference between the best regularizer and a crude zero-offset regularizer.
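For readers skimming only the conclusions, the bit counts in this comparison measure excess entropy, which can be written (in notation of my choosing, not copied from the earlier sections) as the extra encoding cost of a column when the regularized estimate is used in place of the distribution obtained from the full column:

\[
H_{\mathrm{excess}} \;=\; \left\langle \sum_{i=1}^{20} p_i \log_2 \frac{p_i}{\hat p_i(\vec n)} \right\rangle ,
\]

where $p$ is the distribution estimated from the full column, $\hat p(\vec n)$ is the regularized estimate built from a small sample $\vec n$ of that column, and the average is taken over columns and over samples of a given size.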

Incorporating sequence-specific information may yield even larger gains than using column-specific information. Based on preliminary work at UCSC, there may be a full bit per column to be gained by taking into account phylogenetic relationships among sequences in a multiple alignment.

Another way to use sequence-specific information would be to apply modified regularizers to residues that are in contact, adjusting the probabilities for one position based on the amino acid present at the contacting position.





