Evolutionary trace method –
overview
INTRODUCTION
“It is a fundamental
axiom of biology that the three-dimensional structure of a protein determines
its function. Understanding function through structure is a primary goal of
structural biology.” [14] Proteins have a variety of functional and structural
roles, such as catalysis, binding of small molecules (ligands) or large
molecules (proteins, DNA, RNA) and switching.
Catalytic proteins are
known as enzymes and are responsible for the regulation of the rate of certain
biochemical processes. During a reaction they often undergo modification and
structural changes, which are reversed at the end of the catalyzed process. An
example of a catalytic protein is aspartate aminotransferase. This enzyme is
involved in the formation of oxaloacetic acid and glutamic acid (otherwise
known as glutamate). Aspartate aminotransferase interacts with a phosphate
cofactor and lysine to create a molecule, which undergoes rearrangement to give
the final product [14].
Other proteins perform
their function through binding of small molecules. For example, the potassium
channel is a membrane protein, which is responsible for the transportation of K+
ions across the membrane. The potassium channel has a selectivity filter, which
is specifically designed for accurate and ligand-specific binding to potassium
ions [15]. The accuracy of binding is achieved through the coordination of
backbone oxygen atoms, which interact with the K+ ion and keep it in
place within the channel.
Some proteins can also
bind large molecules, such as other proteins, DNA or RNA. A well-known example
of one such protein is DNA polymerase I, which combines several distinct
activities (a polymerase, a 5'-3' exonuclease and a 3'-5' exonuclease), binds
DNA, and is involved in the DNA replication process through the polymerization
and editing of newly synthesized DNA strands.
Finally, some proteins
also act as switches. An example of such a protein is the GTPase,
which is responsible for the binding and hydrolysis of GTP into GDP. GTPase interacts not only with GTP, but also with other
proteins, which are involved in the regulation of the hydrolytic process [14].
As evident from the above
examples, proteins play a fundamental role in biological processes. Therefore,
understanding their function is of great importance. Since structure and
function seem to be closely related, for many years research has been focused
on understanding how the three-dimensional structure of functional sites (such
as active and binding sites) affects the function of proteins. This is a matter
of special interest, also because a large number of proteins exist, for which
structure has been determined through experimental or other studies, but no
known function has been found. Therefore, understanding the correlation of
protein sequence and structure to function can allow us to resolve a variety of
biological problems, such as functional annotation of proteins, drug discovery,
etc. If we could predict function and functional sites on a large scale solely
based on sequence and structural information, we would be able to develop new
drugs more easily, as well as design proteins for various functional purposes.
The next section of this study focuses on an overview of the physical, chemical
and structural properties of functional sites and attempts to provide some
motivation for the interest in prediction of functional sites.
BACKGROUND ON PHYSICAL, CHEMICAL AND STRUCTURAL
PROPERTIES OF FUNCTIONAL SITES
Given the diverse nature
of functional sites, as mentioned in the previous section, proteins interact
with a variety of molecules, such as small ligands, other proteins, DNA, RNA,
carbohydrates, etc. Certain unique chemical, physical and structural features characterize
each of these interactions and this section focuses on the description of those
features.
Protein-protein
interactions and interfaces are very common among proteins. Many studies have
been focused on the examination of the properties and characteristics of those
interfaces. Protein-protein interfaces tend to exhibit structural
complementarity, where, for example, one of the participating structures is
concave and the other - convex in the area of their interaction. This idea has
been used in a number of algorithms for the identification of protein-protein
interfaces and functional sites [1]. Unfortunately, it fails to account for the
diversity in the nature of protein-protein interactions. Complementarity is
more typical of protein-small ligand interactions and characterizes other
protein-molecule interfaces less well. A related problem arises from the
need to be able to identify sites, where a small ligand such as an inhibitor is
able to attach, even though it might not be a typical binding area for a small
molecule. In drug design, for example, one often needs to design an inhibitor
of a particular protein-protein interaction. An algorithm relying on the
complementarity property would not be able to identify an appropriate binding
site, because under normal circumstances the protein does not bind small
ligands [1].
As expected from this diversity, protein-protein interfaces vary widely in
size. A study by Lo Conte, based on 75 protein-protein interfaces from a set of
proteins with diverse functions, shows that an average protein-protein
interface covers approximately 1600 +/- 400 Å² and does not require major
conformational changes [2]. Some large interfaces, however, reach between 2000
and 4600 Å² and require a large conformational change for the protein-protein
interaction to occur [2]. The study also concludes that on average the
participating proteins contribute equally to the overall area of the interface
(approximately 50% each).
From a chemical point of
view and based on observations, protein-protein interfaces are primarily
composed of non-polar residues, which represent approximately 53% of all
interface residues [2]; this is not higher than the average percentage of
non-polar residues found on the surfaces of globular proteins. The percentage
of polar non-charged residues is somewhat higher for protein-protein interfaces
than the observed average for the surfaces of globular proteins. Thus, within
an interface hydrophobic residues are juxtaposed to hydrophobic residues on the
opposite surface and can interact through van der Waals interactions to
stabilize the interface. On the other hand, the presence of polar non-charged
and charged residues accounts for the existence of approximately 10 +/- 5
hydrogen bonds between backbone atoms and side-chain atoms, as well as a small
number of salt bridges [3].
The same study ([2])
looks at the distribution of all 20 amino acids on protein-protein interfaces
and concludes that such interfaces are somewhat rich in aromatic residues, such
as Trp, Phe, His, and Tyr, which is not typical of protein surfaces. The study
also finds a higher concentration of aliphatic residues, such as Leu, Val, Met,
etc. Charged residues seem to be rarer with the exception of Arg, which has the
highest single residue contribution to protein-protein interfaces [2]. Finally,
some specific protein-protein interfaces such as protease-inhibitor interfaces
have a higher concentration of Cys residues in the
area of the interface, which could be attributed to disulfide bridges.
The majority of the
features discussed so far have been primarily concerned with large
protein-protein interfaces. Unlike those interfaces, which are characterized by
their relatively planar, large and relatively easily accessible functional
sites [4], interfaces between proteins and small ligands have a much rougher
surface [1]. A study by Frank Pettit ([1]) established that there is a
difference between the concavity and the roughness of a functional site and
attempted to use a fractal dimension (estimating the relationship between the
surface area of a protein and the area accessible by a rolling sphere) measure
to detect and predict functional sites. However, the results are based on a
very small test set and, like the complementarity-based algorithms, the
fractal dimension algorithm is limited to interfaces between proteins and
small ligands, which are naturally characterized by rougher surfaces.
Interestingly enough, in
the case of protein-DNA interactions, a study by Katalin Nadassy [5] shows
that even though the size of an interface can range between 1120 and 5800 Å²,
the area of a functional module is approximately 1600 +/- 400 Å² (the same as
the standard size of a protein-protein interface) and is formed between
approximately 24 +/- 6 amino acids and 12 +/- 3 nucleotides [5]. Enzymes that
interact with DNA show a slightly larger interface area, needed primarily for
the active site. However, protein-DNA interfaces have a somewhat different
distribution of the 20 amino acids on the protein surface at the area of
interaction as compared to protein-protein interfaces. The protein surface at
the interface has a larger concentration of charged and generally polar
residues, especially positively charged residues, such as Arg and
Protein-RNA interfaces
are not expected to differ largely from protein-DNA interfaces. However, as Lichtarge [4] points out, studies have shown that even
though one does expect to see a higher concentration of Arg on the surface of
the protein-RNA interface, a larger percentage of aromatic residues, such as
Tyr, Trp, His, and Phe, is also observed. It could be speculated
that stacking of aromatic residues and bases can help stabilize the protein-RNA
complex.
EVOLUTIONARY TRACE METHOD - DESCRIPTION
As seen from the above
observations, protein functional sites have a number of similar features.
However, they are also quite unique. Therefore, purely structural methods for
the prediction of functional sites in proteins are not fully capable of
generalizing enough to be accurate and applicable on a large scale, as seen
from the isolated cases of the complementarity-based and surface
roughness-based algorithms.
On the other hand,
sequence-based algorithms have been used for a long time for finding conserved
sequence motifs and mapping those onto function, through the use of proteins
with known structure and function.
In order to be able to
explore the information more fully, one can incorporate both sequence and
structure in a functional site prediction method. One such method that relies
on both sequence and structural information is the evolutionary trace method.
The evolutionary trace
method was first described in 1996 by Olivier Lichtarge and has been applied
in a variety of studies since [16]. In its most basic
form it requires a multiple sequence alignment of a protein family and an
evolutionary tree, based on sequence identity, which can approximate the
functional classification of the protein sequences.
The first step in the
method involves the subdivision of the protein sequences into groups, based on
the evolutionary tree. At trace 1, the only group that exists is the group that
contains all sequences and encompasses the whole phylogenetic tree. At trace 2,
the protein family is divided into two groups, based on the separation of the
tree into 2 distinct branches. The sequences on the same branch belong to one
group and the sequences on the other branch belong to the other group. At each
trace and therefore at each subdivision, the method refers back to the multiple
sequence alignment and reports the conservation of residues within the
subgroup. If at a particular position of the alignment the residue is
identical in all sequences (invariant) within each subgroup for that trace,
but varies between subgroups, the residue is named a trace or class-specific
residue and is assigned a rank. The rank of a residue represents “the minimum
number of branches that the tree must be divided for it to become a trace
residue” [4]. The same procedure is applied to all residues until they are all
assigned an evolutionary rank.
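The ranking step described above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not the published
implementation: the alignment is a plain dictionary of equal-length strings, the
tree has already been cut into groups for each trace, and the names
trace_ranks, is_class_specific, alignment and partitions are hypothetical.
Positions that contain a gap are left unranked, in line with the basic method.

```python
from typing import Dict, List, Optional

GAP = "-"


def is_class_specific(alignment: Dict[str, str],
                      groups: List[List[str]], col: int) -> bool:
    """Return True if column `col` is invariant within every group."""
    for group in groups:
        residues = {alignment[name][col] for name in group}
        if len(residues) != 1:
            return False
    return True


def trace_ranks(alignment: Dict[str, str],
                partitions: List[List[List[str]]]) -> List[Optional[int]]:
    """Assign each alignment column the smallest trace at which it is
    invariant within every group. partitions[k - 1] holds the groups
    obtained when the tree is cut into k branches (an assumed input)."""
    length = len(next(iter(alignment.values())))
    ranks: List[Optional[int]] = []
    for col in range(length):
        column = [seq[col] for seq in alignment.values()]
        if GAP in column:
            ranks.append(None)  # basic method: gapped positions are left out
            continue
        rank = None
        for k, groups in enumerate(partitions, start=1):
            if is_class_specific(alignment, groups, col):
                rank = k
                break
        ranks.append(rank)
    return ranks


# Toy family of four sequences (hypothetical data).
alignment = {"s1": "ACDK", "s2": "ACEK", "s3": "AGFR", "s4": "AGFR"}
partitions = [
    [["s1", "s2", "s3", "s4"]],       # trace 1: the whole tree
    [["s1", "s2"], ["s3", "s4"]],     # trace 2: two branches
    [["s1"], ["s2"], ["s3", "s4"]],   # trace 3: three branches
]
ranks = trace_ranks(alignment, partitions)
print(ranks)
```

In this toy family the first column is invariant in all sequences and receives
rank 1, while the third column only becomes class-specific once the tree is cut
into three branches.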
The next step of the
process is to take the top-ranked residues (those with the lowest rank
numbers) and look at their spatial distribution on the three-dimensional
structure of a family member whose structure is known. The idea behind using
these residues is that a low rank is supposed to represent evolutionary and
functional importance.
Thus, residues of vital importance to the function of all proteins in the
family are likely to remain invariant throughout evolution and not undergo any
mutation events. Residues that are important to function, but contribute more
to the functional specificity of a particular subgroup of proteins in the
family are likely to be conserved, but not invariant. Therefore, even though
they might undergo mutations, the rate of mutation is still minimal. And
finally, residues that are not important to function at all are also not under
evolutionary pressure to remain intact and can be subjected to a higher rate of
mutation [4].
The final step of the method is to look for clustering of low ranked residues. If any such cluster is found, it is presumed to be a functional site, which can be experimentally tested through site-directed mutagenesis.
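The search for clusters can be illustrated with a small sketch. It assumes one
representative coordinate per residue (for example the C-alpha atom), links two
selected residues when they lie within a distance cutoff, and reports the
connected groups. The function name cluster_residues, the 8 Å default and the
toy coordinates are assumptions made for illustration and are not taken from
the original method.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


def cluster_residues(coords: Dict[int, Point], selected: Sequence[int],
                     cutoff: float = 8.0) -> List[List[int]]:
    """Group the selected residues into connected components, linking two
    residues when their representative points are within `cutoff` angstroms."""
    unvisited = set(selected)
    clusters: List[List[int]] = []
    while unvisited:
        seed = unvisited.pop()
        cluster = {seed}
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            near = {r for r in unvisited
                    if math.dist(coords[current], coords[r]) <= cutoff}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(sorted(cluster))
    # Largest clusters first; ties broken by residue number for determinism.
    return sorted(clusters, key=lambda c: (-len(c), c))


# Hypothetical coordinates (angstroms) and evolutionary ranks.
coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 0.0, 0.0), 3: (6.0, 0.0, 0.0),
          4: (30.0, 0.0, 0.0), 5: (31.0, 0.0, 0.0), 6: (60.0, 0.0, 0.0)}
residue_ranks = {1: 1, 2: 2, 3: 1, 4: 2, 5: 9, 6: 1}

low_ranked = sorted(r for r, rank in residue_ranks.items() if rank <= 2)
clusters = cluster_residues(coords, low_ranked)
print(clusters)
```

Here residues 1, 2 and 3 form a single cluster, which would be the candidate
functional site, while residues 4 and 6 remain isolated.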
For a visual depiction of
the process that the evolutionary trace method follows in the search for
functional sites, I have reproduced an illustration from an overview of
the method by Lichtarge and Sowa [4] (the original caption is also included):
Fig. 1: The ET method. (a) All of the sequences in a protein family are aligned
and a tree is generated to illustrate the relatedness of individual family
members. The tree can then be delineated into groups (i)
approximating functional classes (in this case, three classes). For each class,
a consensus sequence is created and these are then compared to form the ET
sequence. Residue positions that are invariant within each class, but that vary
among them are called class-specific or trace residues (labeled X in the ET
sequence, colored red) and those that are class-specific at rank i = 1 are denoted by amino acid single-letter code
in the ET sequence and colored blue. The number of classes into which the tree
has to be divided for a residue to become class-specific is called the rank of
that residue. Finally, trace residues are mapped onto the three-dimensional
structure of a family member, with clusters of trace residues indicating a
functional site [yellow line in panel (b)]. (b) The process described in (a) can be repeated from
rank 1 to N (N = total number of sequences), so that each residue
position is assigned a rank. Residues with lower numbered ranks are considered
to be more important than those with higher numbered ranks (Lichtarge
2002).
EVOLUTIONARY TRACE METHOD – AN EXAMPLE WITH EXPERIMENTAL
VALIDATION
This
section focuses on the description of a study, which uses the evolutionary
trace method to predict a functional site and site-directed mutagenesis of the
predicted functionally important residues to confirm the validity of the
prediction. It is one of the best and most elegant examples of how the method
could be used.
The
original study by Sowa and Lichtarge [6] was focused
on the RGS family of proteins. The interest in this particular family arose
from previous work with heterotrimeric G proteins, known as Gαβγ
proteins, which are a part of a well-known biological pathway. The process can
be briefly described in several steps. An extracellular signal, such as a
hormone, binds to a transmembrane receptor, such as a GPCR protein. Once
activated, the receptor catalyzes the exchange of GDP for GTP in a G protein,
and either the GTP-bound Gα or the Gβγ complex then interacts with an effector
protein, which results in an amplification of the primary signal [6]. However,
before Gα can participate in the next cycle, it must hydrolyze its bound GTP
and become inactive again. The rate of this inactivation is set by the rate of
hydrolysis of GTP, which in turn is regulated by the RGS family of proteins. Of
specific interest to the study was how the specificity of the RGS-Gα
interaction is achieved, given that multiple RGS proteins can coexist with
multiple types of Gα proteins and yet maintain a high level of specificity [6].
The authors speculated that the interaction between the RGS protein, the Gα
and an effector protein is responsible for the regulation of the process.
The evolutionary trace method was an excellent
approach to the testing of such a hypothesis, since it could look for potential
functionally important residues. A cluster of such residues could serve as a
binding site for the effector protein. The trace method found two large
clusters on the surface of the RGS domain. The first cluster consisted of 10 of
the 11 residues that form the RGS-Gα interface. However, none of the ten
residues seemed to affect or control the specificity of RGS-Gα binding. The
figure below is from the original results of the study [6].
Fig. 2. An evolutionarily privileged surface on the RGS domain. (A) The
secondary structure elements are shown with the ET-identified residues at rank
20 (invariant residues, colored red; class-specific residues of higher rank,
colored blue). Class-specific residues forming RGS site 2 are not contiguous in
the primary sequence, yet cluster spatially in the structure. (B) A surface on
the RGS domain is identified containing both invariant and class-specific
residues, including 10 of the 11 RGS-Gα contact residues (Sowa 2000).
The trace method additionally identified a second
cluster of 5 residues that were spatially located close to the RGS-Gα
interface, but did not participate directly in the interaction between the two
proteins. Given their close proximity to the RGS-Gα
binding site (the cluster is situated right above the interface), one could
speculate that the cluster of residues serves as a binding site for a molecule
that can regulate the RGS activity [6].
Fig. 3. A cluster of class-specific residues at the RGS-Gα interface. (A)
ET-identified residues cluster above the RGS-Gα binding interface. The RGS
protein is shown in white with ET-identified residues colored according to
Fig. 1, while Gα is shown in yellow. (B) The trace-identified residues are
found in the helices α3 (r77), α5 (r115 and r117), α6 (r141 and r134), and in
the α5-α6 connecting loop (r121, r122, and r124). In addition to these five
surface residues, two additional class-specific residues (r123 and r127) are
buried within the RGS domain (Sowa 2000).
In order to be able to investigate the importance
of the newly found cluster of residues the study focused next on the effect of
the gamma subunit of the cGMP phosphodiesterase
(PDEγ). PDEγ is a known effector protein that
interacts with the RGS domain and affects RGS GAP activity. Interestingly
enough, a comparison between different RGS domains that interact with the
PDEγ ligand showed a large biochemical difference in the nature of the
residues found at the position identified to be a part of the cluster,
depending on whether PDEγ inhibited or enhanced the effect of the RGS
protein domain [6]. For example, at residue position 77 (on the RGS4 domain), identified
to be a part of the cluster, in all RGS domains whose activity was inhibited by
PDEγ, one could find a basic or hydrophobic residue (
Based on the above results, the close
proximity of the cluster on the RGS domain to the RGS-Gα
interface, as well as the identification of a similarly significant cluster of
residues on the Gα surface, the authors hypothesized that the newly
found cluster of residues serves as a direct binding site for an effector
protein, such as PDEγ, that can regulate the activity and binding
specificity of the RGS domains.
The next step of the process was to experimentally validate
the above stated hypothesis. The validation was performed about a year later on
the same group of proteins. It should be noted that the initial RGS test
domains included RGS4 (from rat), RGS7 (mouse), RGS9 (bovine) and RGS16
(human), as well as some other domains. RGS9 was the only one that showed
enhanced activity after binding PDEγ. Its closest homologue in sequence was
RGS7 with a 48% sequence identity. Therefore, the experimental study focused on
those two domains. The idea behind the experiment was to perform site-directed
mutagenesis on the RGS7 domain and study the changes in function and binding to
PDEγ. The mutagenesis was restricted to residues that were identified as
members of the cluster found by the evolutionary trace method and that differ
from the residues at the corresponding positions in RGS9 [7].
Fig. 4: Sequence alignment of selected RGS domains. Selected portions of bovine RGS9, mouse RGS7, rat RGS4
and human RGS16 are aligned for easy comparison of residue numbers. Red = Trace
residues, dark red boxes = Trace residues mutated in RGS7 mutant constructs to
those in RGS9. Lower case letters are used for generic identification of
corresponding sequence positions in different RGS proteins (Sowa 2001).
The results of the experimental
mutagenesis show that residues b, c, and e (as shown above in the
sequence alignment) are directly involved in the interaction with the PDEγ
molecule. Specifically, as shown in the figure below (figure and caption from
the original paper), wild-type RGS9 showed, as expected, low levels of activity
in the unbound state and high levels of activity in the PDEγ-bound state.
Conversely, wild-type RGS7 showed high levels of activity in the unbound state
and low levels of activity in the PDEγ-bound state. However, when residues b
and c in RGS7 were changed to their corresponding RGS9 residues, the unbound
RGS7 mutant adopted the same activity as RGS9 in the unbound state. Finally,
when residue e in the mutant RGS7 domain was also changed, RGS7 exhibited the
same activity as RGS9 in the PDEγ-bound state [7].
Fig. 5: A model for regulation of RGS activity via positions bc and e. a, Trace
residues form a pathway including the α5/α6 connecting loop, position b,
located N-terminal to the α5/α6 loop, and position e, located C-terminal to the
loop, which may allow changes at bc to influence RGS catalytic activity at the
Gα binding interface. The Gly at e in RGS9 allows greater backbone freedom than
the Ser in RGS7, allowing for greater influence of PDE on the α5/α6 loop. b–g,
Trace residues at positions b and c are located in a position where they could
influence the conformation of the α5/α6 connecting loop (shown as the line
connecting b to e; low GAP activity = red line; high GAP activity = green line;
RGS9 residues = dark red circles; RGS7 residues = white circles; RGS7* = mutant
RGS7), and thus modulate the activity of the RGS domain (graphs to the right of
the drawings). b, In the absence of Gt bound effector, the GAP activity of the
RGS9 catalytic core domain is low. c, When Gt is bound to PDE, the activity of
RGS9 is enhanced. d, RGS7 has a high activity when Gt is not bound to PDE. e,
When PDE is bound to Gt, the activity of RGS7 is inhibited. f, Changes at
positions b and c in RGS7 to their corresponding residues from RGS9 result in a
protein that is similar to PDE-inhibited RGS7. g, When the RGS7 residue at
position e is switched to its corresponding RGS9 residue in conjunction with
the bc change, the resulting protein behaves similar to RGS9 bound to the
Gt–PDE complex (Sowa 2001).
These results directly confirm that some
of the residues identified by the evolutionary trace method serve as a
functional site for the binding of an effector protein involved in the
regulation of the RGS domain specificity and activity.
REVIEW OF SOME MODIFIED
VERSIONS OF THE BASIC EVOLUTIONARY TRACING METHOD
This
section of the review focuses on an overview of some modified versions of the
basic evolutionary trace method, as well as some similar methods and provides a
critical assessment of their success in comparison to the basic method.
1. Evolutionary tracing with allowed gaps in the multiple sequence alignment
This method was described in a study by Madabushi et al. [8], which applied a
modified version of the basic evolutionary trace method. The modification
allows gaps within the multiple sequence alignment. Its rationale is that the
basic method does not deal with gaps and, as the authors point out, the
elimination of gaps is a bottleneck in the application of the basic method
[8].
In the basic method as sequences are selected for
the creation of the multiple sequence alignment, an additional difficulty is
introduced, because one needs to focus on the introduction of a minimal number
of gaps into the alignment. If such gaps are introduced, the basic method
leaves out the corresponding residue positions, based on the logic that if a
gap exists in the alignment at a certain position, then the residues at that
position could not have been conserved and are therefore not functionally
important. However, it has also been observed that a better representation of
the family requires a large number of divergent sequences that are members of
the family. Even though one might achieve better sequence representation that
way, the coverage of the alignment decreases as more divergent sequences are
introduced. One can remove the sequences that introduce the largest number of
gaps, but this choice is arbitrary and has no theoretical justification [8].
For the above stated reasons the modification of
the basic method is a reasonable step toward the improvement of the method
itself and the results obtained from its application. In this case, the
modification treats a gap as a 21st type of amino acid and thereby permits
gap-tolerant multiple alignments. Another reason for the importance of such a
modification is that gaps sometimes occur in blocks, and a deletion (or an
insertion) of a particular residue from (or into) a block of sequences from a
group can have functional significance [8].
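The difference between the two treatments of gaps can be shown on a single
alignment position. The sketch below is a simplified illustration, not the
code used in [8]; the function name class_specific and its inputs are
hypothetical. Each string holds the residues of one group at one position, and
the gap character is treated as an ordinary 21st symbol when gap_tolerant is
set.

```python
from typing import List, Optional

GAP = "-"


def class_specific(group_columns: List[str],
                   gap_tolerant: bool) -> Optional[bool]:
    """Decide whether one alignment position is invariant within every group.

    group_columns holds one string per group, containing that group's
    residues at the position. In the basic mode a position with any gap is
    skipped (None); in the gap-tolerant mode the gap counts as a 21st
    residue type.
    """
    if not gap_tolerant and any(GAP in column for column in group_columns):
        return None  # basic method: the position is left out of the trace
    return all(len(set(column)) == 1 for column in group_columns)


# A block deletion that is specific to the second group.
block_deletion = ["KKK", "---"]
print(class_specific(block_deletion, gap_tolerant=False))  # None
print(class_specific(block_deletion, gap_tolerant=True))   # True

# A stray gap inside a group breaks invariance in the gap-tolerant mode.
print(class_specific(["KK-", "RRR"], gap_tolerant=True))   # False
```

With this convention a deletion shared by a whole group is reported as
class-specific instead of being discarded, which is the situation the
paragraph above describes.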
As the results from a comparative study between the
predictions from the application of the basic method and modified gap-tolerant
method show, the gap-tolerant method achieved an overall higher rate of
identification of statistically significant clusters. The measure of
statistical significance was based either on the number of identified clusters,
the size of the largest cluster, or a combination of those two statistics.
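One way to make such a cluster statistic concrete is an empirical comparison
against random residue selections. The sketch below is an assumption made for
illustration and is not the significance test used in [8]: it takes the size of
the largest cluster among the selected residues and estimates how often an
equally large cluster arises when the same number of residues is drawn at
random. The names largest_cluster and cluster_p_value, the 8 Å cutoff, the
number of trials and the toy coordinates are all hypothetical.

```python
import math
import random
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float, float]


def largest_cluster(coords: Dict[int, Point], selected: Sequence[int],
                    cutoff: float) -> int:
    """Size of the largest connected group of selected residues, linking
    residues whose representative points lie within `cutoff` angstroms."""
    remaining = set(selected)
    best = 0
    while remaining:
        frontier = [remaining.pop()]
        size = 1
        while frontier:
            current = frontier.pop()
            near = {r for r in remaining
                    if math.dist(coords[current], coords[r]) <= cutoff}
            remaining -= near
            size += len(near)
            frontier.extend(near)
        best = max(best, size)
    return best


def cluster_p_value(coords: Dict[int, Point], selected: Sequence[int],
                    cutoff: float = 8.0, trials: int = 1000,
                    seed: int = 0) -> Tuple[int, float]:
    """Return the observed largest-cluster size and the fraction of random
    selections of the same size that cluster at least as strongly."""
    rng = random.Random(seed)
    residues = sorted(coords)
    observed = largest_cluster(coords, selected, cutoff)
    hits = sum(
        largest_cluster(coords, rng.sample(residues, len(selected)),
                        cutoff) >= observed
        for _ in range(trials)
    )
    return observed, hits / trials


# Hypothetical structure: 20 residues spread 10 angstroms apart along a line,
# except residues 0, 1 and 2, which sit close together.
coords = {i: (10.0 * i, 0.0, 0.0) for i in range(20)}
coords[1] = (3.0, 0.0, 0.0)
coords[2] = (6.0, 0.0, 0.0)

observed, p_value = cluster_p_value(coords, [0, 1, 2])
print(observed, p_value)
```

A small fraction suggests that the observed clustering is unlikely to arise by
chance, which gives an objective counterpart to judging the clusters by eye.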
This modification is generally useful, also because
it applies to the method no matter what the protein family of interest is. In
the case when the evolutionary trace method is applied to a family (or a subset
of a family) of proteins with a high sequence homology, where gaps are less
important and prevalent, the gap-tolerant method basically reduces to the basic
method and the need for taking gaps into account is eliminated.
2. Weighted evolutionary tracing
This method was described
in a study by Landgraf et al. [9] and is a modification of the basic
evolutionary trace method particularly applicable to the identification of
functional clusters of residues in a family, or subfamily, of proteins from a
set of highly homologous sequences, where the residue variability is of primary
importance. The logic behind such a modification is that if one is interested
in the specificity of a set of sequences within the family of proteins, the
residue at a given position could be conserved within that set, but highly
variable outside of it. Then, one could infer that the particular residue
position is functionally important to the given subfamily and contributes to
the functional specificity of the subset of proteins. Therefore, if there were
any changes or mutations to a residue at a given position within the set of
highly homologous sequences, one would want to assign a higher weight to the
sequences that the mutation came from, because they would be contributing to
the variability at that residue position and would most likely be important.
The modified method
allows gaps in the multiple sequence alignment, but assigns them a maximum
substitution penalty, based on a measure of variability that uses the Gonnet
substitution matrix [9].
The method was tested on a
set of sequences that represent the heregulin family of proteins, a subset of
the family of EGF-like growth and differentiation factors. The application of
the weighted evolutionary trace method identified two distinct clusters of
residues, which are supposed to represent binding sites that are specific to
the heregulin family and “reflect differences between hrg (heregulin) ligands
and the EGF-like ligands as a whole. Besides the preference for a different
subset of receptors (HER2, 3 and 4 versus EGFR), hrg also shows a strong preference
for the interaction with receptor heterodimers versus the homodimeric
interaction seen between EGF and EGFR.” [9]
Even though the weighted
evolutionary tracing method proved useful in this particular study, it should
be noted that it is not general enough to be applicable to any type of sequence
data and thus cannot be applied on a large scale. As noted by the authors their
method is useful mostly when one is interested in a subfamily with a high level
of sequence similarity between the sequences used in the multiple sequence
alignment. The method could also be of use in the search for clusters that
identify functional sites, specific to a particular subfamily and contributing
to the specificity of the functional site.
This method was described
in a study on the dimerization of G-protein-coupled receptors (GPCRs) by Dean et al. [10]. This specific
modification relied on one of the weaknesses of the basic evolutionary trace
method. Besides the lack of treatment of gaps, the original method also could
not objectively determine the size of an identified cluster. It relies on a
visual evaluation of the clustering results. “The user must recognize, by eye,
clusters of top-ranked residues in 3D space and visually estimate their
significance based on the level of scattered signal throughout the protein.”
[8]
The subjectivity of such
assessment can lead to error in the cluster analysis and the modification
presented by this method attempts to reduce this type of error. It introduces
another measure of statistical significance of the identified clusters,
different from the measures mentioned earlier in this review.
In order to estimate the
“transition point between ordered clustering around the functional sites and
random scattering over the surface” [10], this method relied on two different
Both
This method was described
in a study by Hannenhalli and Russell [11] and it
used a slightly different approach from the original evolutionary trace method.
The modified method also uses multiple sequence alignments, but it groups
proteins in the multiple sequence alignment, based on certain criteria, looking
for patterns of residue variation. It uses those patterns to connect them to
functional specificity and identify residue positions related to functional
specificity, based on the use of an HMM [4].
The Hannenhalli and Russell
method was applied to four types of protein families (nucleotidyl
cyclases, protein kinases, serine proteases and lactate dehydrogenases),
looking for positions that give specificity to the protein subfamilies. The
groupings for the data were derived from the PFAM and SWISSPROT databases.
The method was highly successful in the assignment of subtypes: it
correctly assigned subtypes at a rate of 91.2% for 2593 sequences at a 20%
sequence similarity threshold, and 94% at a 30% sequence similarity threshold
[11].
Similarly to the basic
evolutionary trace method, this method could be applied on a large scale and
shows potential for its application to genome-scale and proteome-scale studies.
The method differs from the basic evolutionary trace method in that it handles
non-identical positions by means of an HMM and amino acid exchange matrices.
“Incorporation of exchange matrix data will permit amino acids not seen in the
current set of known sequences from a sub-type, if they have sufficiently
similar physicochemical properties” [11].
The method seems to be
closer to another method, developed by Kimmen Sjolander, but it does not include the use of phylogenetic
information. As pointed out by the authors of the paper, this method is likely to
be most useful in the analysis of superfamilies, in
the characterization of a protein sequence of unknown
sub-type that has low sequence similarity to other family members, and finally
in the characterization of the sub-types of orphan protein family members [11].
This method is described in
a study by Landgraf, Xenarios
and Eisenberg [12] and is an extension of the basic evolutionary trace method.
Similarly to the basic method, the 3D cluster analysis method makes use of a
multiple sequence alignment and a representative structure of the protein
family of interest. Unlike the basic method though, it does not use a
phylogenetic tree approximation of the functional classification of the
proteins. The authors of the method justify their choice, based on the
hypothesis that an evolutionary tree does not adequately represent functional
relationships within protein families. They speculate that in a phylogenetic
tree “similarity relationships of a highly conserved residue cluster could
dominate” [12] and overshadow functional clusters, related to secondary
structures, as well as that similarity relationships for a number of functions
are averaged out in the phylogenetic tree. Therefore, they do not believe that
an evolutionary tree provides useful input information for the detection of
clusters in three-dimensional space.
Below, I include a brief
visual depiction of the steps of the method along with a short description
(from the original paper) [12]:
Fig. 6. Basic steps
in 3D cluster analysis. The
extraction of regional alignments for each residue in the reference structure
links structural information to the sequence alignment. (I) For
each residue x, all structurally adjacent residues within a given radius
(e.g. 10 Å) are identified. (II) The identified positions (highlighted as gray
blocks) are extracted from the global alignment A. These blocks are joined to
form a regional alignment with N sequences. (III) Two similarity
matrices of dimension N × N are generated, a global similarity
matrix (M) representing the relationship of all full-length sequences
and a regional similarity matrix (M(x)) representing the
relationship of all sequences in the regional alignment, A(x) (Landgraf 2001).
The
two similarity matrices, global and regional, are used in the final steps of
the algorithm to generate two scores for each residue: a similarity deviation
score, which estimates how far the similarity relationships in the regional
environment of the residue deviate from the global ones, and a regional
conservation score, which estimates the difference in conservation between the
global and the regional scale.
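The regional alignment and the two similarity matrices can be sketched in a few lines. This is a simplified illustration, not the scoring scheme of Landgraf et al. [12]. It assumes that every alignment column maps to one residue of the reference structure, uses plain sequence identity as the similarity measure, and summarizes the deviation as a mean absolute difference. All function names and the toy data are hypothetical.

```python
import math

def identity(a, b):
    """Fraction of identical residues over positions where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def similarity_matrix(seqs):
    """N x N matrix of pairwise identities (stand-in for M or M(x))."""
    return [[identity(a, b) for b in seqs] for a in seqs]

def regional_columns(coords, x, radius=10.0):
    """Step I: indices of residues within `radius` angstroms of residue x."""
    return [i for i, c in enumerate(coords) if math.dist(c, coords[x]) <= radius]

def regional_alignment(alignment, columns):
    """Step II: join the selected columns into the regional alignment A(x)."""
    return ["".join(seq[c] for c in columns) for seq in alignment]

def similarity_deviation(alignment, coords, x, radius=10.0):
    """Step III plus a toy score: mean |M(x) - M| over all sequence pairs."""
    cols = regional_columns(coords, x, radius)
    m_global = similarity_matrix(alignment)
    m_region = similarity_matrix(regional_alignment(alignment, cols))
    n = len(alignment)
    diffs = [abs(m_region[i][j] - m_global[i][j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(diffs) / len(diffs)

# Toy data (made up): four residues on a line, three aligned sequences.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0), (25.0, 0.0, 0.0)]
alignment = ["ACDE", "ACDF", "GCDE"]
print(regional_columns(coords, 0))              # [0, 1]
print(similarity_deviation(alignment, coords, 0))
```

A residue whose neighbourhood groups the sequences differently from the full-length alignment receives a larger score, which is the intuition behind the similarity deviation described above.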
The
results from the test set of 35 protein families showed that the method could
detect 72% of interface residues at a false positive rate of 6% and an e-value
threshold of 10^-20. The authors could also conclude that additional
information was gained from the use of a 3D structure of a representative
protein as a part of the input [12].
On
the other hand, omitting the phylogenetic tree does not necessarily aid the
identification process, and using one does not necessarily prevent the identification
of functional sites. It is argued, as mentioned earlier, that the use of a
phylogenetic tree overshadows secondary functions. However, there are examples of the
application of the basic evolutionary trace method which show that
secondary functions can also be detected when a phylogenetic
tree is used for functional classification. RGS domains, whose primary function is to
bind G proteins, also bind an effector protein that regulates their interaction
with the G protein [6]. Even though one of these functions is secondary to the
other, both are detected by the basic trace method.
This method was described
in a study by Aloy and co-authors [13] and it focused on
a modification of the original evolutionary trace method, characterized by the
search for functional site clusters of invariant polar residues. The method
begins the search for a functional site cluster of
residues at the lowest level of sequence identity in the multiple sequence
alignment. If no clusters are identified, the sequence
alignment is modified by removing sequences in order to achieve a higher level of sequence
identity, until a functional site is found [13]. Identified sites were deemed
significant if there was at least 50% overlap with known active sites. The
results from a test on 86 proteins with a sequence identity of 30% or lower
showed that in 79% of proteins, there was at least a 50% overlap between the
cluster for a predicted functional site and the actual active site. In 15% of
the proteins, there was less than 50% overlap, and in 6% there was no overlap [4].
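Two ingredients of this approach, the search for invariant polar positions and the 50% overlap criterion, can be sketched as follows. This is an illustrative sketch and not the published procedure of Aloy et al. The set of residues treated as polar is an assumption, as is the definition of overlap as the fraction of known active-site residues recovered by the prediction. The function names are hypothetical, and the spatial clustering step is omitted.

```python
# Assumption: residues treated as polar in this sketch; the original
# study may use a different set.
POLAR = set("DEHKNQRSTY")

def invariant_polar_positions(alignment):
    """Columns whose residue is identical in all sequences and polar."""
    hits = []
    for col in range(len(alignment[0])):
        residues = {seq[col] for seq in alignment}
        if len(residues) == 1 and residues <= POLAR:
            hits.append(col)
    return hits

def overlap_fraction(predicted, known):
    """Fraction of known active-site positions found in the prediction (assumed definition)."""
    known = set(known)
    return len(set(predicted) & known) / len(known)

def classify_overlap(predicted, known):
    """Bin a prediction into the three outcome classes used in the text."""
    frac = overlap_fraction(predicted, known)
    if frac == 0:
        return "no overlap"
    if frac >= 0.5:
        return "at least 50% overlap"
    return "less than 50% overlap"

# Toy alignment (made up): columns 1 and 2 are invariant and polar.
alignment = ["MDHKA", "LDHRA", "MDHKG"]
predicted = invariant_polar_positions(alignment)
print(predicted)                             # [1, 2]
print(classify_overlap(predicted, [2, 3]))   # at least 50% overlap
```

In the full method the invariant polar positions would additionally have to cluster in three-dimensional space, and the alignment would be pruned towards higher sequence identity whenever no cluster is found.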
The Aloy
method did not rely on an approach significantly different from the basic
method. However, it was not as general as the basic approach, because of the
focus on functional sites of only invariant polar residues. It was also not as
useful for proteins with a higher level of sequence identity (above 30%), for
which only 14% of the predicted sites had more than 50% overlap with the actual
active sites, and 58% of the predicted clusters had no overlap with actual
active sites [13].
FUTURE WORK IN EVOLUTIONARY TRACING
As Lichtarge points out in a review of the evolutionary
tracing method [4], there are 3 main types of criteria that could be used to
determine the success of a method for prediction of functional sites:
As seen from a number of
the methods described above, evolutionary tracing has been and can be
successful by all of the above criteria, provided that a more uniform measure of the
statistical significance of the identified clusters is adopted. The main focus
of future work could be directed towards functional annotation and drug design.
Given that a very small percentage of available sequences have experimentally
determined structures and that functional annotation for a large number of
proteins is based on homology modeling [4], correct annotation would be of
primary importance and value. Some large scale studies performed with the
evolutionary trace method show promise in that regard. The combination of
sequence, structural and functional information will hopefully lead us to a
better understanding of the proteome at large.
REFERENCES:
[1] F.K. Pettit and J.U. Bowie, Protein surface roughness and small molecular binding sites. J Mol Biol 285 (1999), pp. 1377–1382.
[2] L.L. Conte, C. Chothia and J. Janin, The atomic structure of protein–protein recognition sites. J Mol Biol 285 (1999), pp. 2177–2198.
[3] A.H. Elcock, D. Sept and J.A. McCammon, Computer simulation of protein–protein interactions. J Phys Chem B 105 (2001), pp. 1504–1518.
[4] O. Lichtarge and M.E. Sowa, Evolutionary predictions of binding surfaces and interactions. Curr Opin Struct Biol 12 (2002), pp. 21–27.
[5] K. Nadassy, S.J. Wodak and J. Janin, Structural features of protein–nucleic acid recognition sites. Biochemistry 38 (1999), pp. 1999–2017.
[6] M.E. Sowa, W. He, T.G. Wensel and O. Lichtarge, A regulator of G protein signaling interaction surface linked to effector specificity. Proc Natl Acad Sci USA 97 (2000), pp. 1483–1488.
[7] M.E. Sowa, W. He, K.C. Slep, M.A. Kercher, O. Lichtarge and T.G. Wensel, Prediction and confirmation of a site critical for effector regulation of RGS domain activity. Nat Struct Biol 8 (2001), pp. 234–237.
[8] Madabushi S, Yao H, Marsh M, Kristensen DM,
[9] R. Landgraf, D. Fischer and D. Eisenberg, Analysis of heregulin symmetry by weighted evolutionary tracing. Protein Eng 12 (1999), pp. 943–951.
[10] M.K. Dean, C. Higgs, R.E. Smith, R.P. Bywater, C.R. Snell, P.D. Scott, G.J.G. Upton, T.J. Howe and C.A. Reynolds, Dimerization of G-protein coupled receptors. J Med Chem 44 (2001), pp. 4595–4614.
[11] S.S. Hannenhalli and R.B. Russell, Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol 303 (2000), pp. 61–76.
[12] R. Landgraf, I. Xenarios and D. Eisenberg, Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol 307 (2001), pp. 1487–1502.
[13] P. Aloy, E. Querol, F.X. Aviles and M.J. Sternberg, Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J Mol Biol 311 (2001), pp. 395–408.
[14] G. Petsko and D. Ringe, Protein Structure and Function. New Science Press Ltd. (2004).
[15] R. MacKinnon, Potassium channels. FEBS Lett 555 (2003), pp. 62–65.
[16] O. Lichtarge, H.R. Bourne and F.E. Cohen, An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 257 (1996), pp. 342–358.