Evolutionary trace method – overview

 

 


INTRODUCTION

 

“It is a fundamental axiom of biology that the three-dimensional structure of a protein determines its function. Understanding function through structure is a primary goal of structural biology.” [14] Proteins have a variety of functional and structural roles, such as catalysis, binding of small molecules (ligands) or large molecules (proteins, DNA, RNA) and switching.

Catalytic proteins are known as enzymes and are responsible for regulating the rate of certain biochemical processes. During a reaction they often undergo modification and structural changes, which are reversed at the end of the catalyzed process. An example of a catalytic protein is aspartate aminotransferase. This enzyme catalyzes the reversible transfer of an amino group from aspartate to α-ketoglutarate, which yields oxaloacetic acid and glutamic acid (otherwise known as glutamate). The enzyme uses a pyridoxal phosphate cofactor, bound to a lysine residue, to form an intermediate, which undergoes rearrangement to give the final product [14].

Other proteins perform their function through binding of small molecules. For example, the potassium channel is a membrane protein, which is responsible for the transportation of K+ ions across the membrane. The potassium channel has a selectivity filter, which is specifically designed for accurate and ligand-specific binding to potassium ions [15]. The accuracy of binding is achieved through the coordination of backbone oxygen atoms, which interact with the K+ ion and keep it in place within the channel.

Some proteins can also bind large molecules, such as other proteins, DNA or RNA. A well-known example of one such protein is DNA polymerase I, which combines several functions (a polymerase, a 5’-3’ exonuclease and a 3’-5’ exonuclease), binds DNA, and is involved in DNA replication through the polymerization and editing of newly synthesized DNA strands.

Finally, some proteins also act as switches. An example of such a protein is the GTPase, which is responsible for the binding and hydrolysis of GTP into GDP. GTPase interacts not only with GTP, but also with other proteins, which are involved in the regulation of the hydrolytic process [14].

As evident from the above examples, proteins play a fundamental role in biological processes. Therefore, understanding their function is of great importance. Since structure and function seem to be closely related, for many years research has focused on understanding how the three-dimensional structure of functional sites (such as active and binding sites) affects the function of proteins. This is a matter of special interest because there are many proteins whose structure has been determined through experimental or other studies, but for which no function is known. Understanding how protein sequence and structure correlate with function can therefore help resolve a variety of biological problems, such as functional annotation of proteins, drug discovery, etc. If we could predict function and functional sites on a large scale solely based on sequence and structural information, we would be able to develop new drugs more easily, as well as design proteins for various functional purposes. The next section of this study gives an overview of the physical, chemical and structural properties of functional sites and attempts to provide some motivation for the interest in prediction of functional sites.

 

BACKGROUND ON PHYSICAL, CHEMICAL AND STRUCTURAL PROPERTIES OF FUNCTIONAL SITES

 

As mentioned in the previous section, functional sites are diverse in nature: proteins interact with a variety of molecules, such as small ligands, other proteins, DNA, RNA, carbohydrates, etc. Certain unique chemical, physical and structural features characterize each of these interactions, and this section describes those features.

Protein-protein interactions and interfaces are very common among proteins, and many studies have examined the properties and characteristics of those interfaces. Protein-protein interfaces tend to exhibit structural complementarity, where, for example, one of the participating structures is concave and the other is convex in the area of their interaction. This idea has been used in a number of algorithms for the identification of protein-protein interfaces and functional sites [1]. Unfortunately, it fails to account for the diversity in the nature of protein-protein interactions. Complementarity is more typical of protein-small ligand interactions and does not characterize other protein-molecule interfaces as well. A related problem arises from the need to identify sites where a small ligand such as an inhibitor is able to attach, even though the site might not be a typical binding area for a small molecule. In drug design, for example, one often needs to design an inhibitor of a particular protein-protein interaction. An algorithm relying on the complementarity property would not be able to identify an appropriate binding site, because under normal circumstances the protein does not bind small ligands [1].

Protein-protein interfaces also vary largely in their size. A study by Lo Conte of 75 protein-protein interfaces from a set of proteins with diverse functions shows that the size of an average protein-protein interface is approximately 1600 +/- 400 Å² and that forming such an interface does not require major conformational changes [2]. Some large interfaces, however, can reach a size of between 2000 and 4600 Å² and require a large conformational change in order for the protein-protein interaction to occur [2]. The study also concludes that on average the participating proteins contribute equally to the overall area of the interface (approximately 50% each).

From a chemical point of view, protein-protein interfaces are primarily composed of non-polar residues, which represent approximately 53% of all interface residues [2]. This is not higher than the average percentage of non-polar residues found on the surfaces of globular proteins. The percentage of polar non-charged residues is somewhat higher for protein-protein interfaces than the observed average for the surfaces of globular proteins. Within an interface, hydrophobic residues are juxtaposed to hydrophobic residues on the opposite surface and can interact through Van der Waals interactions to stabilize the interface. On the other hand, the presence of polar non-charged and charged residues accounts for the existence of approximately 10 +/- 5 hydrogen bonds between backbone atoms and side-chain atoms, as well as a small number of salt bridges [3].

The same study ([2]) looks at the distribution of all 20 amino acids on protein-protein interfaces and concludes that such interfaces are somewhat rich in aromatic residues, such as Trp, Phe, His, and Tyr, which is not typical of protein surfaces. The study also finds a higher concentration of aliphatic residues, such as Leu, Val, Met, etc. Charged residues seem to be rarer with the exception of Arg, which has the highest single residue contribution to protein-protein interfaces [2]. Finally, some specific protein-protein interfaces such as protease-inhibitor interfaces have a higher concentration of Cys residues in the area of the interface, which could be attributed to disulfide bridges.

The majority of the features discussed so far have been primarily concerned with large protein-protein interfaces. Unlike those interfaces, which are characterized by their relatively planar, large and relatively easily accessible functional sites [4], interfaces between proteins and small ligands have a much rougher surface [1]. A study by Frank Pettit ([1]) established that there is a difference between the concavity and the roughness of a functional site and attempted to use a fractal dimension (estimating the relationship between the surface area of a protein and the area accessible by a rolling sphere) measure to detect and predict functional sites. However, their results are based on a very small test set and as in the case of the algorithms based on protein complementarity in large proteins, the use of the fractal dimension algorithm is limited only to interfaces between proteins and small ligands, which are naturally characterized by rougher surfaces.

Interestingly enough, in the case of protein-DNA interactions, a study by Katalin Nadassy [5] shows that even though the size of an interface can range between 1120 and 5800 Å², the functional modules’ area is approximately 1600 +/- 400 Å² (the same as the standard size of a protein-protein interface) and is formed between approximately 24 +/- 6 amino acids and 12 +/- 3 nucleotides [5]. Enzymes that interact with DNA show a slightly larger interface area, needed primarily for the active site. However, protein-DNA interfaces have a somewhat different distribution of the 20 amino acids on the protein surface at the area of interaction as compared to protein-protein interfaces. The protein surface at the interface has a larger concentration of charged and generally polar residues, especially positively charged residues, such as Arg and Lys [5]. This observation is not completely unexpected, because the positively charged residues can interact with the negatively charged phosphate groups and stabilize and strengthen the protein-DNA complex. Hydrogen bonds can be found at a rate of approximately 1 bond per 125 Å².

Protein-RNA interfaces are not expected to differ largely from protein-DNA interfaces. However, as Lichtarge [4] points out, studies have shown that even though one does expect to see a higher concentration of Arg on the surface of the protein-RNA interface, a larger percentage of aromatic residues, such as Tyr, Trp, His, and Phe, is also observed. It could be speculated that stacking of aromatic residues and bases can help stabilize the protein-RNA complex.

 

EVOLUTIONARY TRACE METHOD - DESCRIPTION

 

As seen from the above observations, protein functional sites have a number of similar features. However, they are also quite unique. Therefore, purely structural methods for the prediction of functional sites in proteins are not fully capable of generalizing enough to be accurate and applicable on a large scale, as seen from the isolated cases of the complementarity-based and surface roughness-based algorithms.

On the other hand, sequence-based algorithms have been used for a long time for finding conserved sequence motifs and mapping those onto function, through the use of proteins with known structure and function.

In order to be able to explore the information more fully, one can incorporate both sequence and structure in a functional site prediction method. One such method that relies on both sequence and structural information is the evolutionary trace method.

The evolutionary trace method was first described in 1996 by Olivier Lichtarge and has been applied in a variety of studies since [16]. In its most basic form it requires a multiple sequence alignment of a protein family and an evolutionary tree, based on sequence identity, which can approximate the functional classification of the protein sequences.

The first step in the method involves the subdivision of the protein sequences into groups, based on the evolutionary tree. At trace 1, the only group that exists is the group that contains all sequences and encompasses the whole phylogenetic tree. At trace 2, the protein family is divided into two groups, based on the separation of the tree into 2 distinct branches. The sequences on the same branch belong to one group and the sequences on the other branch belong to the other group. At each trace, and therefore at each subdivision, the method refers back to the multiple sequence alignment and reports the conservation of residues within each subgroup. If at a particular position of the alignment the residue is identical in all sequences of a subgroup (invariant), and the same holds within every other subgroup for that trace, but the residue varies between subgroups, the position is named a trace or class-specific residue and is assigned a rank. The rank of a residue represents “the minimum number of branches that the tree must be divided for it to become a trace residue.”[4]. The same procedure is applied to all residues until they are all assigned an evolutionary rank.
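The ranking step can be illustrated with a minimal Python sketch. It assumes an ungapped alignment and takes the successive subdivisions of the tree as a precomputed list of groups; the function name `trace_ranks` and the toy sequences are hypothetical and are not taken from the original papers.

```python
def trace_ranks(alignment, partitions):
    """Assign each alignment column the smallest number of groups at which
    every group is invariant at that column (its evolutionary rank).

    alignment:  list of equal-length, ungapped sequences (assumption).
    partitions: partitions[k] lists the groups of sequence indices obtained
                when the tree is cut into k + 1 branches (assumed to be
                derived from the evolutionary tree beforehand).
    """
    ranks = []
    for column in zip(*alignment):
        rank = None
        for groups in partitions:
            # A position is class-specific when each group is invariant.
            if all(len({column[i] for i in group}) == 1 for group in groups):
                rank = len(groups)
                break
        ranks.append(rank)
    return ranks


# Toy family of four sequences and the nested subdivisions of its tree.
alignment = ["ACDE", "ACDF", "AGDF", "AGHF"]
partitions = [
    [[0, 1, 2, 3]],            # trace 1: the whole family
    [[0, 1], [2, 3]],          # trace 2: two branches
    [[0], [1], [2, 3]],        # trace 3
    [[0], [1], [2], [3]],      # trace 4: every sequence on its own
]
print(trace_ranks(alignment, partitions))  # [1, 2, 4, 3]
```

In this toy case the first column is invariant in the whole family (rank 1), while the second column becomes class-specific as soon as the tree is split into two branches (rank 2).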

The next step of the process is to take the top-ranked residues (those with the lowest rank numbers) and look at their spatial distribution on a three-dimensional structure of a protein with a known structure from the family. The idea behind using these residues is that a low rank number is supposed to represent evolutionary functional importance. Thus, residues of vital importance to the function of all proteins in the family are likely to remain invariant throughout evolution and not undergo any mutation events. Residues that are important to function, but contribute more to the functional specificity of a particular subgroup of proteins in the family, are likely to be conserved, but not invariant. Therefore, even though they might undergo mutations, the rate of mutation is still minimal. Finally, residues that are not important to function at all are not under evolutionary pressure to remain intact and can be subjected to a higher rate of mutation [4].

The final step of the method is to look for clustering of low ranked residues. If any such cluster is found, it is presumed to be a functional site, which can be experimentally tested through site-directed mutagenesis.
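A simple way to picture the search for clusters is to link top-ranked residues that lie close together in space and collect the connected groups. The sketch below is only an illustration under stated assumptions: the function name `find_clusters`, the use of one coordinate per residue, and the 6.5 Å distance cutoff are hypothetical choices, not parameters taken from the original method.

```python
import math


def find_clusters(coords, ranks, max_rank, cutoff=6.5):
    """Group residues with rank <= max_rank into spatial clusters.

    coords: residue number -> (x, y, z), e.g. alpha carbon positions.
    ranks:  residue number -> evolutionary rank.
    cutoff: linking distance in angstroms (hypothetical value).
    """
    unvisited = {r for r in coords if ranks[r] <= max_rank}
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        stack, members = [seed], {seed}
        while stack:
            current = stack.pop()
            # Residues close to the current one join the same cluster.
            near = {r for r in unvisited
                    if math.dist(coords[current], coords[r]) <= cutoff}
            unvisited -= near
            members |= near
            stack.extend(near)
        clusters.append(sorted(members))
    # Largest clusters first: these are the candidate functional sites.
    return sorted(clusters, key=len, reverse=True)
```

Called with the coordinates of a family member and the ranks from the trace, the first element of the returned list is the largest cluster of low-ranked residues, which would be the first candidate for testing by site-directed mutagenesis.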

For a visual depiction of the process, which the evolutionary trace method goes through in the search for functional sites, I have provided an illustration included in an overview of the method by Lichtarge and Sowa [4] (the original caption has been also included):


 


Fig. 1: The ET method. (a) All of the sequences in a protein family are aligned and a tree is generated to illustrate the relatedness of individual family members. The tree can then be delineated into groups (i) approximating functional classes (in this case, three classes). For each class, a consensus sequence is created and these are then compared to form the ET sequence. Residue positions that are invariant within each class, but that vary among them are called class-specific or trace residues (labeled X in the ET sequence, colored red) and those that are class-specific at rank i = 1 are denoted by amino acid single-letter code in the ET sequence and colored blue. The number of classes into which the tree has to be divided for a residue to become class-specific is called the rank of that residue. Finally, trace residues are mapped onto the three-dimensional structure of a family member, with clusters of trace residues indicating a functional site [yellow line in panel (b)]. (b) The process described in (a) can be repeated from rank 1 to N (N = total number of sequences), so that each residue position is assigned a rank. Residues with lower numbered ranks are considered to be more important than those with higher numbered ranks (Lichtarge 2002).
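The construction of the ET sequence described in the caption can be sketched in a few lines of Python. This is a hedged illustration only: it assumes an ungapped alignment that has already been split into classes, and the function names and the use of "." for positions that are variable within a class are hypothetical conventions.

```python
def consensus(sequences):
    """Residue at each position if invariant within the class, else '.'."""
    return "".join(col[0] if len(set(col)) == 1 else "."
                   for col in zip(*sequences))


def et_sequence(classes):
    """Compare the class consensus sequences to form the ET sequence."""
    result = []
    for col in zip(*(consensus(seqs) for seqs in classes)):
        if "." in col:
            result.append(".")      # variable within at least one class
        elif len(set(col)) == 1:
            result.append(col[0])   # invariant in the whole family
        else:
            result.append("X")      # class-specific (trace) position
    return "".join(result)


# Two toy classes of two sequences each.
print(et_sequence([["ACDE", "ACDF"], ["AGDF", "AGHF"]]))  # AX..
```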

 

EVOLUTIONARY TRACE METHOD – AN EXAMPLE WITH EXPERIMENTAL VALIDATION

 

This section focuses on the description of a study, which uses the evolutionary trace method to predict a functional site and site-directed mutagenesis of the predicted functionally important residues to confirm the validity of the prediction. It is one of the best and most elegant examples of how the method could be used.

The original study by Sowa and Lichtarge [6] was focused on the RGS family of proteins. The interest in this particular family arose from previous work with heterotrimeric G proteins, known as Gαβγ proteins, which are a part of a well-known biological pathway. The process can be briefly described in several steps. An extracellular signal, such as a hormone, binds to a transmembrane receptor, such as a GPCR protein. In the process the receptor is activated and catalyzes the exchange of GDP for GTP in a G protein. Either the GTP-bound Gα or the Gβγ complex then interacts with an effector protein, and the process results in an amplification of the primary signal [6]. However, before Gα can participate in the next cycle, it needs to quickly hydrolyze its bound GTP and become inactive again. The rate of this inactivation is set by the rate of hydrolysis of GTP, which in turn is regulated by the RGS family of proteins. Of specific interest to the study was how the specificity of the RGS-Gα interaction is achieved, given that multiple RGS proteins can coexist with multiple types of Gα proteins and yet maintain a high level of specificity [6]. The authors speculated that the interaction between the RGS protein, the Gα and an effector protein is responsible for the regulation of the process.

The evolutionary trace method was an excellent approach to the testing of such a hypothesis, since it could look for potential functionally important residues. A cluster of such residues could serve as a binding site for the effector protein. The trace method found two large clusters on the surface of the RGS domain. The first cluster consisted of 10 of the 11 residues that form the RGS-Gα interface. However, none of the ten residues seemed to affect or control the specificity of RGS-Gα binding. The figure below is from the original results of the study [6].

 


Fig. 2.   An evolutionarily privileged surface on the RGS domain. (A) The secondary structure elements are shown with the ET-identified residues at rank 20 (invariant residues, colored red; class-specific residues of higher rank, colored blue). Class-specific residues forming RGS site 2 are not contiguous in the primary sequence, yet cluster spatially in the structure. (B) A surface on the RGS domain is identified containing both invariant and class-specific residues, including 10 of the 11 RGS-Galpha contact residues (Sowa 2000).

 

The trace method additionally identified a second cluster of 5 residues that were spatially located close to the RGS-Gα interface, but did not participate directly in the interaction between the two proteins. Given their close proximity to the RGS-Gα binding site (the cluster is situated right above the interface), one could speculate that the cluster of residues serves as a binding site for a molecule that can regulate the RGS activity [6].

 


Fig. 3.   A cluster of class-specific residues at the RGS-Galpha interface. (A) ET-identified residues cluster above the RGS-Galpha binding interface. The RGS protein is shown in white with ET-identified residues colored according to Fig. 1, while Galpha is shown in yellow. (B) The trace-identified residues are found in the helices alpha 3 (r77), alpha 5 (r115 and r117), alpha 6 (r141 and r134), and in the alpha 5-alpha 6 connecting loop (r121, r122, and r124). In addition to these five surface residues, two additional class-specific residues (r123 and r127) are buried within the RGS domain (Sowa 2000).

 

In order to investigate the importance of the newly found cluster of residues, the study focused next on the effect of the gamma subunit of the cGMP phosphodiesterase (PDEγ). PDEγ is a known effector protein that interacts with the RGS domain and affects RGS GAP activity. Interestingly enough, a comparison between different RGS domains that interact with PDEγ showed a large biochemical difference in the nature of the residues found at the positions identified as part of the cluster, depending on whether PDEγ inhibited or enhanced the effect of the RGS protein domain [6]. For example, at residue position 77 (on the RGS4 domain), identified as part of the cluster, all RGS domains whose activity was inhibited by PDEγ had a basic or hydrophobic residue (Lys, His, Arg, Leu). However, the RGS9 domain, whose activity was enhanced by PDEγ, instead had a negatively charged residue at position 77 (Glu). Similarly, at position 117, all of the RGS domains whose activity was inhibited by PDEγ had a negatively charged residue. The RGS9 domain, whose activity was enhanced by PDEγ, was the only one that had a hydrophobic residue at position 117.

Based on the above results, the close proximity of the cluster on the RGS domain to the RGS-Gα interface, as well as the identification of a similarly significant cluster of residues on the Gα surface, the authors hypothesized that the newly found cluster of residues serves as a direct binding site for an effector protein, such as PDEγ, that can regulate the activity and binding specificity of the RGS domains.

The next step of the process was to experimentally validate the above stated hypothesis. The validation was performed about a year later on the same group of proteins. It should be noted that the initial RGS test domains included RGS4 (from rat), RGS7 (mouse), RGS9 (bovine) and RGS16 (human), as well as some other domains. RGS9 was the only one that showed enhanced activity after binding PDEγ. Its closest homologue in sequence was RGS7, with a 48% sequence identity. Therefore, the experimental study focused on those two domains. The idea behind the experiment was to perform site-directed mutagenesis on the RGS7 domain and study the changes in function and binding to PDEγ. The mutagenesis was restricted to residues that were identified as members of the cluster found by the evolutionary trace method and that differed from the residues at the corresponding positions in RGS9 [7].

 


Fig. 4: Sequence alignment of selected RGS domains. Selected portions of bovine RGS9, mouse RGS7, rat RGS4 and human RGS16 are aligned for easy comparison of residue numbers. Red = Trace residues, dark red boxes = Trace residues mutated in RGS7 mutant constructs to those in RGS9. Lower case letters are used for generic identification of corresponding sequence positions in different RGS proteins (Sowa 2001).

 

 

 

 

The results of the experimental mutagenesis show that residues b, c, and e (as shown above in the sequence alignment) are directly involved in the interaction with the PDEγ molecule. Specifically, as shown in the figure below (figure and caption from the original paper), when the native residues were kept in place in RGS9, RGS9 showed, as expected, low levels of activity in the unbound state and high levels of activity in the PDEγ-bound state. Similarly, when the native residues were kept in place in RGS7, RGS7 showed, as expected, high levels of activity in the unbound state and low levels of activity in the PDEγ-bound state. However, when residues b and c in RGS7 were changed to their corresponding RGS9 residues, the unbound RGS7 mutant adopted the same activity as RGS9 in the unbound state. Finally, when residue e in the mutant RGS7 domain was also changed, RGS7 exhibited the same activity as RGS9 in the PDEγ-bound state [7].

 

 


Fig.5: A model for regulation of RGS activity via positions bc and e. a, Trace residues form a pathway including the alpha5/alpha6 connecting loop, position b, located N-terminal to the alpha5/alpha6 loop, and position e, located C-terminal to the loop, which may allow changes at bc to influence RGS catalytic activity at the Galpha binding interface. The Gly at e in RGS9 allows greater backbone freedom than the Ser in RGS7, allowing for greater influence of PDE on the alpha5/alpha6 loop. b–g, Trace residues at positions b and c are located in a position where they could influence the conformation of the alpha5/alpha6 connecting loop (shown as the line connecting b to e; low GAP activity = red line; high GAP activity = green line; RGS9 residues = dark red circles; RGS7 residues = white circles; RGS7* = mutant RGS7), and thus modulate the activity of the RGS domain (graphs to the right of the drawings). b, In the absence of Gtalpha bound effector, the GAP activity of the RGS9 catalytic core domain is low. c, When Gtalpha is bound to PDE, the activity of RGS9 is enhanced. d, RGS7 has a high activity when Gtalpha is not bound to PDE. e, When PDE is bound to Gtalpha, the activity of RGS7 is inhibited. f, Changes at positions b and c in RGS7 to their corresponding residues from RGS9 result in a protein that is similar to PDE-inhibited RGS7. g, When the RGS7 residue at position e is switched to its corresponding RGS9 residue in conjunction with the bc change, the resulting protein behaves similar to RGS9 bound to the Gtalpha–PDE complex (Sowa 2001).

 

These results directly confirm that some of the residues identified by the evolutionary trace method serve as a functional site for the binding of an effector protein involved in the regulation of the RGS domain specificity and activity.

 

 

REVIEW OF SOME MODIFIED VERSIONS OF THE BASIC EVOLUTIONARY TRACING METHOD

 

This section of the review focuses on an overview of some modified versions of the basic evolutionary trace method, as well as some similar methods, and provides a critical assessment of their success in comparison to the basic method.

         

1.     Evolutionary tracing with allowed gaps in the multiple sequence alignment

 

This method was described in a study by Madabushi et al. [8], which applied a modified version of the basic evolutionary trace method. The modification allows gaps within the multiple sequence alignment. The logic behind it is that the basic method does not deal with gaps and, as the paper points out, the elimination of gaps is a bottleneck in the application of the basic method [8].

In the basic method, the selection of sequences for the multiple sequence alignment introduces an additional difficulty, because one needs to keep the number of gaps in the alignment to a minimum. If gaps are introduced, the basic method leaves out the corresponding residue positions, on the logic that if a gap exists in the alignment at a certain position, then the residues at that position could not have been conserved and are therefore not functionally important. However, it has also been observed that in order to get a better representation of the family, it is necessary to include a large number of divergent sequences that are members of the family. Even though one might achieve better sequence representation that way, the coverage of the alignment decreases as more divergent sequences are introduced. One can remove the sequences that introduce the largest number of gaps, but this choice is arbitrary and there is no theoretical justification for which sequences to remove from the multiple sequence alignment [8].

For the above stated reasons the modification of the basic method is a reasonable step toward the improvement of the method itself and the results obtained from its application. In this case, the particular modification of the basic method involved the treatment of gaps as a 21st type of amino acid and the use of gap-tolerant multiple alignments. Another reason for the importance of such a modification is that gaps sometimes tend to occur in blocks, and a deletion (or an insertion) of a particular residue from (or into) a block of sequences from a group can have functional significance [8].
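The difference between the two treatments of gaps can be shown on a single alignment column. The following Python sketch is an illustration under assumptions: the function name and the `gap_tolerant` flag are hypothetical, and the subdivisions of the tree are supplied as precomputed groups of sequence indices.

```python
def column_rank(column, partitions, gap_tolerant=False):
    """Rank of one alignment column, with or without gap tolerance.

    column:       string with one character per sequence ('-' is a gap).
    partitions:   successive subdivisions of the tree into groups of
                  sequence indices (assumed to be precomputed).
    gap_tolerant: if True, '-' is compared like a 21st amino acid type;
                  if False, gapped positions are left out (basic method).
    """
    if not gap_tolerant and "-" in column:
        return None
    for groups in partitions:
        if all(len({column[i] for i in group}) == 1 for group in groups):
            return len(groups)
    return None


partitions = [[[0, 1, 2, 3]], [[0, 1], [2, 3]]]
# A block deletion that is specific to one branch of the tree.
print(column_rank("--KK", partitions))                     # None
print(column_rank("--KK", partitions, gap_tolerant=True))  # 2
```

With gap tolerance the block deletion is reported as a class-specific position at rank 2, whereas the basic treatment discards the column entirely.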

As the results from a comparative study between the predictions from the application of the basic method and modified gap-tolerant method show, the gap-tolerant method achieved an overall higher rate of identification of statistically significant clusters. The measure of statistical significance was based either on the number of identified clusters, the size of the largest cluster, or a combination of those two statistics.

This modification is generally useful, also because it applies to the method no matter what the protein family of interest is. In the case when the evolutionary trace method is applied to a family (or a subset of a family) of proteins with a high sequence homology, where gaps are less important and prevalent, the gap-tolerant method basically reduces to the basic method and the need for taking gaps into account is eliminated.

 

2.     Weighted evolutionary tracing

 

This method was described in a study by Landgraf et al. [9] and is a modification of the basic evolutionary trace method that is particularly applicable to the identification of functional clusters of residues in a family, or subfamily, of proteins from a set of highly homologous sequences, where the residue variability is of primary importance. The logic behind such a modification is that if one is interested in the specificity of a set of sequences within the family of proteins, the residue at a given position could be conserved within the subfamily, but highly variable outside of it. Then, one could infer that the particular residue position is functionally important to the given subfamily and contributes to the functional specificity of the subset of proteins. Therefore, if there were any changes or mutations to a residue at a given position within the set of highly homologous sequences, one would want to assign a higher weight to the sequences that the mutation came from, because they would be contributing to the variability at that residue position and would most likely be important.

The modified method allowed gaps in the multiple sequence alignment, but assigned them a maximum substitution penalty, based on a measure of variability that involved the use of the Gonnet substitution matrix [9].
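The exact scoring scheme of the weighted method is not reproduced here. The sketch below only illustrates the general idea of a weighted, matrix-based variability score for one alignment position. Everything in it is an assumption made for illustration: the residue groups and penalty values are toy numbers rather than Gonnet matrix entries, and the function names and the product form of the pair weights are hypothetical.

```python
from itertools import combinations

# Toy residue groups and penalties (hypothetical, NOT Gonnet values).
GROUPS = [set("AVLIMC"), set("FWYH"), set("KR"), set("DE"),
          set("STNQ"), set("GP")]
MAX_PENALTY = 1.0


def penalty(a, b):
    """Substitution penalty between two alignment characters."""
    if a == b:
        return 0.0
    if "-" in (a, b):
        return MAX_PENALTY          # residue against gap: maximum penalty
    if any(a in group and b in group for group in GROUPS):
        return 0.5                  # conservative substitution
    return MAX_PENALTY


def weighted_variability(column, weights):
    """Weighted mean pairwise penalty at one alignment position."""
    total = norm = 0.0
    for i, j in combinations(range(len(column)), 2):
        pair_weight = weights[i] * weights[j]
        total += pair_weight * penalty(column[i], column[j])
        norm += pair_weight
    return total / norm if norm else 0.0
```

A fully conserved position scores 0.0, and giving a larger weight to a sequence that carries a substitution raises the score of that position, which mirrors the reasoning about sequence weights given above.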

The method was tested on a set of sequences that represent the heregulin family of proteins, a subset of the family of EGF-like growth and differentiation factors. The application of the weighted evolutionary trace method identified two distinct clusters of residues, which are supposed to represent binding sites that are specific to the heregulin family and “reflect differences between hrg (heregulin) ligands and the EGF-like ligands as a whole. Besides the preference for a different subset of receptors (HER2, 3 and 4 versus EGFR), hrg also shows a strong preference for the interaction with receptor heterodimers versus the homodimeric interaction seen between EGF and EGFR.” [9]

Even though the weighted evolutionary tracing method proved useful in this particular study, it should be noted that it is not general enough to be applicable to any type of sequence data and thus cannot be applied on a large scale. As noted by the authors their method is useful mostly when one is interested in a subfamily with a high level of sequence similarity between the sequences used in the multiple sequence alignment. The method could also be of use in the search for clusters that identify functional sites, specific to a particular subfamily and contributing to the specificity of the functional site.

 

3.     Monte Carlo enhancement of the evolutionary tracing method

 

This method was described in a study on the dimerization of G-protein-coupled receptors (GPCRs) by Dean et al. [10]. This specific modification addressed one of the weaknesses of the basic evolutionary trace method. Besides the lack of treatment of gaps, the original method also could not objectively determine the size and significance of an identified cluster. It relies on a visual evaluation of the clustering results. “The user must recognize, by eye, clusters of top-ranked residues in 3D space and visually estimate their significance based on the level of scattered signal throughout the protein.” [8]

The subjectivity of such assessment can lead to error in the cluster analysis and the modification presented by this method attempts to reduce this type of error. It introduces another measure of statistical significance of the identified clusters, different from the measures mentioned earlier in this review.

To estimate the “transition point between ordered clustering around the functional sites and random scattering over the surface” [10], this method relied on two different Monte Carlo envelope-based techniques. The first technique looks at the neighbors of the trace residues (as determined by the trace method) in close proximity to the cluster, based on a pre-defined radius from their alpha carbons to the alpha carbons of the trace residues that are members of the cluster. The second technique also looks at trace residue neighbors, but identifies cluster members based on whether each pair of residues can contact a water molecule rolled over the van der Waals surface of the protein.

Both Monte Carlo techniques were used for the assessment of the significance and robustness of the clustering, as determined by the trace method. The results confirmed the non-randomness of the identified clusters. The results did not include a measure of the improvement over results from the basic method, since the original method had not been applied to the 700 GPCRs used in this study. However, in terms of future work, the Monte Carlo modification does provide a more rigorous measure of assessment of the significance of clustering and could be used in future studies.
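The idea behind the first, radius-based technique can be illustrated with a minimal Python sketch. It is not the procedure of Dean et al. [10]: the function names, the 7 Å radius, the number of trials, the use of a simple contact count as the clustering statistic, and the toy coordinates are all assumptions made for illustration. The sketch counts pairs of trace residues whose alpha carbons lie within the radius and compares that count with the counts obtained for randomly chosen residue sets of the same size.

```python
import random
from itertools import combinations
from math import dist


def count_contacts(coords, selected, radius):
    """Count pairs of selected residues whose alpha carbons lie within `radius`."""
    return sum(
        1
        for i, j in combinations(sorted(selected), 2)
        if dist(coords[i], coords[j]) <= radius
    )


def monte_carlo_p_value(coords, trace_residues, radius=7.0, n_trials=2000, seed=0):
    """Estimate how often random residue sets cluster as tightly as the trace set."""
    rng = random.Random(seed)
    observed = count_contacts(coords, trace_residues, radius)
    indices = range(len(coords))
    hits = 0
    for _ in range(n_trials):
        sample = rng.sample(indices, len(trace_residues))
        if count_contacts(coords, sample, radius) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_trials + 1)


# Toy alpha-carbon coordinates (hypothetical): six residues packed together,
# followed by 34 residues spaced 10 Angstrom apart along a line.
coords = [(float(k), 0.0, 0.0) for k in range(6)]
coords += [(100.0 + 10.0 * k, 0.0, 0.0) for k in range(34)]
trace = [0, 1, 2, 3, 4, 5]

observed, p_value = monte_carlo_p_value(coords, trace)
print(observed, p_value)
```

A small estimated p-value indicates that random residue sets rarely cluster as tightly as the trace residues, which is the kind of non-randomness the study set out to confirm. A real application would use coordinates from a solved structure and a statistic closer to the published one.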

 

  1. Method for the identification of determinants of known functional domains

 

This method was described in a study by Hannenhalli and Russell [11] and it used a slightly different approach from the original evolutionary trace method. The modified method also uses multiple sequence alignments, but it groups the proteins in the alignment according to certain criteria and looks for patterns of residue variation. It then relates those patterns to functional specificity and, with the help of an HMM, identifies the residue positions that determine that specificity [4].

The Hannenhalli method was applied to four different types of protein families (nucleotidyl cyclases, protein kinases, serine proteases and lactate dehydrogenases), looking for positions that give specificity to the protein subfamilies. The groupings for the data were derived from the PFAM and SWISSPROT databases, and the method showed a high success rate in the assignment of subtypes: it correctly assigned subtypes at a rate of 91.2% for 2593 sequences at a 20% sequence similarity threshold and at a rate of 94% at a 30% sequence similarity threshold [11].

Similarly to the basic evolutionary trace method, this method could be applied on a large scale and shows potential for its application to genome-scale and proteome-scale studies. The method differs from the basic evolutionary trace method in that it handles non-identical positions by means of an HMM and amino acid exchange matrices. “Incorporation of exchange matrix data will permit amino acids not seen in the current set of known sequences from a sub-type, if they have sufficiently similar physicochemical properties” [11].

The method seems to be closer to another method, developed by Kimmen Sjolander, but it does not use phylogenetic information. As pointed out by the authors of the paper, this method is going to be mostly useful in the analysis of superfamilies, in the characterization of the sub-type of a protein sequence with an unknown subtype that has low sequence similarity to other family members, and finally in the characterization of the sub-types of orphan protein family members [11].

 

  1. 3D cluster analysis method

 

This method is described in a study by Landgraf, Xenarios and Eisenberg [12] and is an extension of the basic evolutionary trace method. Similarly to the basic method, the 3D cluster analysis method makes use of a multiple sequence alignment and a representative structure of the protein family of interest. Unlike the basic method though, it does not use a phylogenetic tree approximation of the functional classification of the proteins. The authors of the method justify their choice, based on the hypothesis that an evolutionary tree does not adequately represent functional relationships within protein families. They speculate that in a phylogenetic tree “similarity relationships of a highly conserved residue cluster could dominate” [12] and overshadow functional clusters, related to secondary structures, as well as that similarity relationships for a number of functions are averaged out in the phylogenetic tree. Therefore, they do not believe that an evolutionary tree provides useful input information for the detection of clusters in three-dimensional space.

Below, I include a brief visual depiction of the steps of the method along with a short description (from the original paper) [12]:

 


Fig. 6. Basic steps in 3D cluster analysis. The extraction of regional alignments for each residue in the reference structure links structural information to the sequence alignment. (I) For each residue x, all structurally adjacent residues within a given radius (e.g. 10 Å) are identified. (II) The identified positions (highlighted as gray blocks) are extracted from the global alignment A. These blocks are joined to form a regional alignment with N sequences. (III) Two similarity matrices of dimension N × N are generated, a global similarity matrix (M) representing the relationship of all full-length sequences and a regional similarity matrix (M(x)) representing the relationship of all sequences in the regional alignment, A(x) (Landgraf 2001).

 

          The two similarity matrices – global and regional – are used in the final steps of the algorithm to generate a similarity deviation score for each residue, used to estimate deviations between similarity relationships from the global and the regional environment of the residue, as well as a regional conservation score, used to estimate the difference in conservation on a global and regional scale.
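Steps I to III from the figure caption can be sketched in Python as follows. This is only an illustration under stated assumptions: alignment columns are assumed to map one-to-one onto residues of the reference structure (no gaps in the reference sequence), fractional sequence identity stands in for the similarity measure used in the paper, and all function names and toy data are hypothetical. The published similarity deviation and regional conservation scores are not reproduced here.

```python
from math import dist


def regional_columns(coords, center, radius=10.0):
    """Step I: indices of reference residues within `radius` of residue `center`."""
    return [i for i, xyz in enumerate(coords) if dist(xyz, coords[center]) <= radius]


def extract_alignment(alignment, columns):
    """Step II: join the selected columns of the global alignment A into A(x)."""
    return ["".join(seq[c] for c in columns) for seq in alignment]


def identity_matrix(alignment):
    """Step III: N x N matrix of fractional identity (a stand-in similarity)."""
    n = len(alignment)
    length = len(alignment[0])
    return [
        [
            sum(a == b for a, b in zip(alignment[i], alignment[j])) / length
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy data (hypothetical): six residues spaced 4 Angstrom apart along a line,
# and a three-sequence alignment whose first row is the reference sequence.
coords = [(4.0 * k, 0.0, 0.0) for k in range(6)]
alignment = ["ACDKLM", "ACDRLM", "GCDKIV"]

columns = regional_columns(coords, center=0, radius=10.0)
regional = extract_alignment(alignment, columns)
M_global = identity_matrix(alignment)     # M: all full-length sequences
M_regional = identity_matrix(regional)    # M(x): region around residue x = 0

print(columns, regional)
print(M_global[0][2], M_regional[0][1])
```

Comparing M with M(x) entry by entry shows where the sequence relationships in the neighborhood of a residue depart from those of the full-length sequences, which is the information the similarity deviation score is built on.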

          The results from the test set of 35 protein families showed that the method could detect 72% of interface residues at a false positive rate of 6% and an e-value threshold of 10^-20. The authors could also conclude that additional information was gained from the use of a 3D structure of a representative protein as a part of the input [12].

          On the other hand, the lack of a phylogenetic tree does not necessarily aid the identification process. Similarly, its use does not necessarily prevent the identification of functional sites. It is argued, as mentioned earlier, that the use of a phylogenetic tree overshadows secondary functions. However, there are examples of the application of the basic evolutionary trace method which show that secondary functions can also be detected when a phylogenetic tree is used for functional classification. RGS domains, whose primary function is to bind G proteins, also bind an effector protein that regulates their interaction with the G protein [6]. Even though one of these functions is secondary to the other, both are detected by the basic trace method.

 

  1. Modified version of the basic evolutionary trace method with a focus on invariant polar residues

 

This method was described in a study by Aloy and co-authors [13] and it focused on a modification of the original evolutionary trace method, characterized by the search for functional site clusters of invariant polar residues. The method would begin the search for a functional site cluster of residues at the lowest level of sequence identity in the multiple sequence alignment. If no clusters were identified, one could modify the sequence alignment and remove sequences in order to achieve a higher level of sequence identity until a functional site was found [13]. Identified sites were deemed significant if there was at least 50% overlap with known active sites. The results from a test on 86 proteins with a sequence identity of 30% or lower showed that in 79% of proteins, there was at least a 50% overlap between the cluster for a predicted functional site and the actual active site. In 15% of the proteins, there was less than 50% overlap, and in 6% there was no overlap [4].

The Aloy method did not rely on an approach significantly different from the basic method. However, it was not as general as the basic approach, because of the focus on functional sites of only invariant polar residues. It was also not as useful for proteins with a higher level of sequence identity (above 30%), for which only 14% of the predicted sites had more than 50% overlap with the actual active sites, and 58% of the predicted clusters had no overlap with actual active sites [13].
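The three outcome categories reported above can be made concrete with a short Python sketch. The text does not state how overlap was measured, so the definition used here (the fraction of known active-site residues recovered by the predicted cluster) is an assumption, as are the function names, category labels, and residue numbers.

```python
def overlap_fraction(predicted, known):
    """Fraction of known active-site residues found in the predicted cluster."""
    known = set(known)
    if not known:
        raise ValueError("the known active site must contain at least one residue")
    return len(known & set(predicted)) / len(known)


def classify_prediction(predicted, known, threshold=0.5):
    """Sort a prediction into one of the three reported outcome categories."""
    fraction = overlap_fraction(predicted, known)
    if fraction >= threshold:
        return "at least 50% overlap"
    if fraction > 0:
        return "less than 50% overlap"
    return "no overlap"


# Hypothetical residue numbers, used only to exercise the three branches.
known_site = {57, 102, 195, 214}
print(classify_prediction({57, 102, 195, 189}, known_site))
print(classify_prediction({57, 40, 41}, known_site))
print(classify_prediction({10, 11, 12}, known_site))
```

Defining overlap relative to the predicted cluster instead of the known site would change the denominator and could move borderline cases between categories, so the choice should be checked against the original paper before any figures are compared.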

 

 

 

FUTURE WORK IN EVOLUTIONARY TRACING

 

          As Lichtarge points out in a review of the evolutionary tracing method [4], there are 3 main types of criteria that could be used to determine the success of a method for prediction of functional sites:

 

  1. Can it serve as a guiding tool for experimental studies, such as mutational and protein engineering studies?
  2. Are results from the method statistically significant?
  3. Can it be applied on a large scale to the whole proteome?

 

As seen from a number of the methods described above, evolutionary tracing has met, and can continue to meet, all of the above criteria, provided that a more uniform measure for the statistical significance of the identified clusters is adopted. The main focus of future work could be directed towards functional annotation and drug design. Given that a very small percentage of available sequences have experimentally determined structures and that functional annotation for a large number of proteins is based on homology modeling [4], correct annotation would be of primary importance and value. Some large-scale studies performed with the evolutionary trace method show promise in that regard. The combination of sequence, structural and functional information will hopefully lead us to a better understanding of the proteome at large.

 

 

 

REFERENCES:

 

[1] F.K. Pettit and J.U. Bowie, Protein surface roughness and small molecular binding sites. J Mol Biol 285 (1999), pp. 1377–1382.

[2] L.L. Conte, C. Chothia and J. Janin, The atomic structure of protein–protein recognition sites. J Mol Biol 285 (1999), pp. 2177–2198.

[3] A.H. Elcock, D. Sept and J.A. McCammon, Computer simulation of protein-protein interactions. J Phys Chem B 105 (2001), pp. 1504–1518.

[4] O. Lichtarge and M.E. Sowa, Evolutionary predictions of binding surfaces and interactions. Curr Opin Struct Biol 12 (2002), pp. 21–27.

[5] K. Nadassy, S.J. Wodak and J. Janin, Structural features of protein-nucleic acid recognition sites. Biochemistry 38 (1999), pp. 1999–2017.

[6] M.E. Sowa, W. He, T.G. Wensel and O. Lichtarge, A regulator of G protein signaling interaction surface linked to effector specificity. Proc Natl Acad Sci USA 97 (2000), pp. 1483–1488.

[7] M.E. Sowa, W. He, K.C. Slep, M.A. Kercher, O. Lichtarge and T.G. Wensel, Prediction and confirmation of a site critical for effector regulation of RGS domain activity. Nat Struct Biol 8 (2001), pp. 234–237.

[8] S. Madabushi, H. Yao, M. Marsh, D.M. Kristensen, A. Philippi and O. Lichtarge, Structural clusters of evolutionary trace residues are statistically significant and common in proteins. J Mol Biol 316 (2002), pp. 139–154.

[9] R. Landgraf, D. Fischer and D. Eisenberg, Analysis of heregulin symmetry by weighted evolutionary tracing. Protein Eng 12 (1999), pp. 943–951.

[10] M.K. Dean, C. Higgs, R.E. Smith, R.P. Bywater, C.R. Snell, P.D. Scott, G.J.G. Upton, T.J. Howe and C.A. Reynolds, Dimerization of G-protein coupled receptors. J Med Chem 44 (2001), pp. 4595–4614.

[11] S.S. Hannenhalli and R.B. Russell, Analysis and prediction of functional sub-types from protein sequence alignments. J Mol Biol 303 (2000), pp. 61–76.

[12] R. Landgraf, I. Xenarios and D. Eisenberg, Three-dimensional cluster analysis identifies interfaces and functional residue clusters in proteins. J Mol Biol 307 (2001), pp. 1487–1502.

[13] P. Aloy, E. Querol, F.X. Aviles and M.J. Sternberg, Automated structure-based prediction of functional sites in proteins: applications to assessing the validity of inheriting protein function from homology in genome annotation and to protein docking. J Mol Biol 311 (2001), pp. 395–408.

[14] G. Petsko and D. Ringe, Protein Structure and Function. New Science Press Ltd. (2004).

[15] R. MacKinnon, Potassium channels. FEBS Lett 555 (2003), pp. 62–65.

[16] O. Lichtarge, H.R. Bourne and F.E. Cohen, An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 257 (1996), pp. 342–358.