Journal of Bacteriology, July 2003, p. 3990-3993, Vol. 185, No. 14
0021-9193/03/$08.00+0 DOI: 10.1128/JB.185.14.3990-3993.2003
Copyright © 2003, American Society for Microbiology. All Rights Reserved.
Pretty Good Guessing: Protein Structure Prediction at CASP5
Rosemarie Swanson* and Jerry Tsai
Biochemistry & Biophysics Department, Texas A&M University, College Station, Texas 77843-2128

INTRODUCTION
In this special issue of the
Journal of Bacteriology, bacteriologists
look into the smallest organisms even deeper than before, down
to the molecular level. The focus is on experimentally determined
molecular structures. However, structure prediction from amino
acid sequence data is becoming a usable source of protein structure
information as well.
Interest in protein structure prediction is old, but success is new. About 9 years ago, John Moult and others organized the first effort known as Critical Assessment of Protein Structure Prediction (CASP). They arranged with experimentalists to provide amino acid sequence information for soon-to-be-determined protein structures and invited the protein prediction community to try their methods on these target unknowns. Predictors submitted their results to the organizers for evaluation against the true structures when they became available. The format of using a community-wide experiment and a meeting to present the evaluations to the predictors propelled the improvement of methods. Last December the fifth evaluation meeting of the biennial CASP effort (CASP5) was held at Asilomar Conference Grounds in Pacific Grove, Calif. (7).
The success of the best of the predictors in the last two CASP evaluations (7, 8) warrants mention of the methods and results here. Methods for prediction are different for easy and hard cases. The choice of method depends on the degree of similarity between the amino acid sequence of the unknown and the sequences of known structures.

THE HARDEST TEST
Even though they have the worst agreement with the experimental
results, the most exciting predictions are the successes in
the "new fold" category, where the sequence of the unknown has
no significant similarity to the sequence of any known structure.
Five of the eighty-odd domains available for prediction fell
into this category in CASP5. In this most difficult category,
the evaluator considered that at least one "excellent" prediction
was made for each target. Of 165 predictors who attempted these
difficult targets, nine had a prediction among the best ten
(out of hundreds) for three or more of the targets. So some
techniques consistently perform better than the rest.
In the new fold category, a respectable result means that the predicted chain has the same kinds of pieces in the same relative orientations, not that the pieces superimpose on each other. The degree of agreement might be similar to that between photographs of the same person at age 20 and at age 80. In fact, predicting a new fold is like drawing a face that the artist has never seen. Structure prediction methods also resemble, in an important sense, the methods used by police artists. A witness is shown a gallery of faces and asked to pick out parts from them that individually resemble parts of the suspect's face. The police artist then combines the parts into a whole that resembles the witness's memory of the face. The most successful methods of structure prediction for new folds similarly rely on the assembly of a unique whole from fragments selected from a gallery of protein structures.

ONE OF THE GOOD METHODS
In a coarse description of the most successful method of new
fold prediction, the first step is to obtain secondary structure
(helix, beta strand, etc.) predictions for the unknown and to
divide the sequence of the unknown into short fragments (nine
amino acids). Then known structures (the equivalent of the gallery
of faces) are searched for fragments that are similar in secondary
structure and/or sequence profile to the unknown's fragments.
A library of these fragments from known structures is constructed
(the equivalent of the collection of witness-selected individual
features). The starting guess for the unknown structure is a
completely extended chain (equivalent to the blank paper), but
randomly selected suitable fragments repeatedly replace sections
of the extended chain. After each fragment placement ("move"),
the chain is checked for collisions and other bad and good features,
and the move is rejected or accepted. After a large number (thousands)
of fragment placements, a folded chain has been created (the
equivalent of a single face). In contrast to the limited number
of faces an artist could produce, however, tens of thousands
of candidate structures are produced. The candidate structures
are clustered according to their structural similarity to each
other, and the centers of the few largest clusters are selected
as the best candidate structures. Final adjustments to the candidates
are made to make the models more physically realistic. The method's
increasing power lies in the improving selection of the contents
of the fragment library and in the improving rules for accepting
or rejecting a fragment placement. (For further detail and other
methods, see reference 7.)
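The fragment-replacement loop described above can be illustrated with a toy sketch. This is not the method used by any CASP5 group: the chain is reduced to a list of (phi, psi) backbone angles, the scoring function `toy_score` and the two-fragment library from `toy_library` are hypothetical stand-ins, and the Metropolis acceptance rule is an assumption, since the text says only that each move is accepted or rejected.

```python
import math
import random

FRAGMENT_LENGTH = 9            # fragment size given in the text
EXTENDED = (180.0, 180.0)      # (phi, psi) of a fully extended residue
HELIX = (-60.0, -45.0)         # illustrative helical angles (assumption)
STRAND = (-120.0, 130.0)       # illustrative beta-strand angles (assumption)


def toy_score(chain):
    """Hypothetical score, lower is better: distance from helical angles."""
    return sum(abs(phi - HELIX[0]) + abs(psi - HELIX[1]) for phi, psi in chain) / 100.0


def toy_library(n_residues):
    """Hypothetical library: one helical and one strand fragment per window."""
    windows = n_residues - FRAGMENT_LENGTH + 1
    return [[[HELIX] * FRAGMENT_LENGTH, [STRAND] * FRAGMENT_LENGTH]
            for _ in range(windows)]


def assemble(n_residues, library, score, n_moves=2000, temperature=1.0, seed=0):
    """Build one candidate chain by repeated fragment replacement."""
    rng = random.Random(seed)
    chain = [EXTENDED] * n_residues          # start from the extended chain
    current = score(chain)
    for _ in range(n_moves):
        # Pick a window and a suitable fragment for that window at random.
        start = rng.randrange(n_residues - FRAGMENT_LENGTH + 1)
        fragment = rng.choice(library[start])
        trial = chain[:start] + list(fragment) + chain[start + FRAGMENT_LENGTH:]
        trial_score = score(trial)
        delta = trial_score - current
        # Metropolis rule (assumption): always accept an improvement,
        # occasionally accept a worse chain so the search can keep moving.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            chain, current = trial, trial_score
    return chain, current
```

Calling `assemble(20, toy_library(20), toy_score)` yields one candidate. A real protocol would repeat this with different seeds to produce the tens of thousands of candidates that are then clustered.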
In CASP5, the method just described was used effectively not only for new folds but also for loop regions in unknowns where a structure for a related sequence was available. The loops were modeled by the new fold method, but otherwise the prediction was closely guided by the template ("comparative modeling"). Why use a template?

OLD FOLDS: EASIER WORK, BETTER ANSWERS
Predictors do get a more accurate answer (the same face at ages
20 and 35) in cases where a template exists, that is, where a structure
for a protein with a similar fold has already been determined
by experiment. About four-fifths of the sequences provided to
the CASP5 predictors turned out to have templates. Identifying
such a template shades from easy to difficult. At the easy end
(comparative modeling, about half the unknowns), the template
can be identified by similarity between the sequence of the
unknown and the sequence of the template. Sophisticated methods
may be invoked to detect that similarity. In difficult cases
("fold recognition"), sequence similarity is too low to provide
an unambiguous choice of template, and a different method has
to be used.

THE "GLASS SLIPPER" APPROACH
When sequence-based comparisons do not unambiguously identify
a unique template in the Protein Data Bank (PDB), predictors
can nevertheless proceed by the method pioneered by Cinderella's
prince: search the PDB for a structure that fits the sequence.
Predictors using this approach do not search the whole PDB but
a representative subset of structures. They may not find a fit,
in which case they may be dealing with a new fold, which requires
the methods described above.
But how does one determine whether a sequence fits a given structure? Different amino acids prefer different surroundings. Statistically the most obvious distinction is that hydrophobic amino acids prefer to have other hydrophobic amino acids as neighbors, and charged or polar amino acids prefer to be on the surface of a protein, in contact with water. Threading a sequence through a structural template positions amino acids relative to each other. Predictors evaluate whether the resulting neighbor clusters are consistent with what is known about amino acid neighborhoods in experimentally determined structures and decide whether the structure is a viable template or not. An important consideration is how to align the unknown's sequence with the structural template. A very effective method for generating and testing different alignments using a "genetic algorithm" approach was presented at CASP5.
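As a rough illustration of the fit test described above, the sketch below scores a sequence against a template by checking whether hydrophobic residues land in buried positions and polar or charged residues land on the surface. It is a deliberately crude stand-in for real threading potentials: the `HYDROPHOBIC` residue set, the +1/-1 scoring, and the burial flags are all assumptions, and the alignment is taken as given (ungapped, equal length), although the text notes that finding the alignment is itself a central difficulty.

```python
# Residues treated as hydrophobic in this sketch (a simplifying assumption).
HYDROPHOBIC = set("AVILMFW")


def threading_score(sequence, buried):
    """Score an ungapped placement of `sequence` on a template.

    `buried` holds one boolean per template position: True if the
    position is in the protein interior, False if it is exposed.
    Each residue in a matching environment adds 1; a mismatch subtracts 1.
    """
    if len(sequence) != len(buried):
        raise ValueError("sequence and template must have the same length")
    return sum(1 if (aa in HYDROPHOBIC) == is_buried else -1
               for aa, is_buried in zip(sequence, buried))


def best_template(sequence, templates):
    """Return the name of the template whose burial pattern fits best."""
    return max(templates, key=lambda name: threading_score(sequence, templates[name]))
```

For the four-residue sequence `"LKVE"`, a template whose burial pattern alternates buried/exposed scores 4 and the opposite pattern scores -4, so `best_template` selects the first.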
The bad news is that all the procedures described here are computationally complex. The good news is that web servers provide public access to some of the expertly implemented procedures.

AUTOMATION
Very welcome at the meeting were the good results of CAFASP3
(Critical Assessment of Fully Automated Protein Structure Prediction,
third evaluation). Individual fully automated servers started
from a submitted amino acid sequence and without human intervention
carried out a series of tasks that ended with a set of alpha-carbon
coordinates for the sequence. (Of course, a server's procedures
are based on an automation of the successful procedures of its
human creators.) Meta-servers, the next level up, did not themselves
create predictions but operated on the results of individual
prediction servers. They outperformed individual servers because
different methods implemented in individual servers have different
strengths and do well on some but not all targets. As one evaluator
said, there are many different ways to be wrong but only one
way to be right. By a consensus approach, a meta-server pulls
out the best answers from a collection. The success of meta-servers
depends on having a variety of independent methods available
from individual servers, so that even though each individual
server has weaknesses its contribution still improves the overall
results. Meta-servers were up to 60% more likely than individual
servers to choose the correct structural class (5, 6) of a sequence
and up to 30% more likely to score correct answers higher than
incorrect answers.
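The consensus idea can be sketched in a few lines. The code below is an illustration and not the algorithm of any particular meta-server: it assumes each server's model is a list of alpha-carbon coordinates already placed in a common frame, and it uses a hypothetical similarity measure (the number of residues within a 3.5-Å cutoff). The model that agrees most with all the others is selected, on the reasoning that wrong answers tend to disagree with each other.

```python
import math


def similarity(model_a, model_b, cutoff=3.5):
    """Count residues whose alpha-carbons lie within `cutoff` angstroms.

    Assumes both models cover the same residues in the same order and
    are already superimposed (an assumption of this sketch).
    """
    return sum(1 for a, b in zip(model_a, model_b) if math.dist(a, b) <= cutoff)


def consensus_pick(models):
    """Return the name of the model best supported by all other models."""
    def support(name):
        return sum(similarity(models[name], other)
                   for other_name, other in models.items()
                   if other_name != name)
    return max(models, key=support)
```

With two servers returning nearly identical models and a third returning an outlier, `consensus_pick` returns one of the two that agree.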
Could meta-meta servers do even better? Yes. 3D-Jury is a meta-meta server that collected and analyzed results from all the other servers in CAFASP3. Although it was not entered in CAFASP3, 3D-Jury itself would have scored highest among all servers. The best servers did better than two-thirds of the human prediction groups. A few predictors compared the results of automatic predictions with human-aided predictions that were made either by an expert human or by trained but inexperienced humans. The interesting result was that sometimes human intervention helped and sometimes it harmed. In the experience of one extremely good team, the easier the unknown was, the less likely human intervention was to improve a prediction.

HOW GOOD AN ANSWER?
"Unknowns" have been described as easy or difficult. The easiest
unknowns (the comparative modeling category) are those that
have a significant amount (1/3 or more) of sequence identity
with a protein for which a three-dimensional (3D) structure
is already known. In the best of these cases, alpha-carbon positions
were predicted with better than a 0.9-Å root mean square
difference from the experimental answer. The median was around
2 Å. In other measuring terms, at 50% sequence identity,
95% of backbone rotation angles (phi and psi angles) were correct
within 30°. Of course, predictors, evaluators, and users
of predictions set the bar higher in cases like this and want
to know the positions of side chain atoms as well as main chain
atoms ("homology modeling" subcategory: 27 unknowns, nine servers
evaluated). Side chain predictions are more difficult and less
accurate than high-identity main chain predictions; the
level of accuracy is more like 50% of side chains having their
first bond rotation angle correct within 40°.
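The two accuracy measures quoted above are straightforward to compute. The sketch below assumes the predicted and experimental alpha-carbon coordinates are already optimally superimposed and listed in the same residue order; the function names are illustrative. Angle comparisons account for wrap-around, so that 170° and -170° count as 20° apart.

```python
import math


def rmsd(predicted, experimental):
    """Root mean square difference between matched coordinate lists."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(p, e) ** 2 for p, e in zip(predicted, experimental))
    return math.sqrt(total / len(predicted))


def angular_difference(a, b):
    """Smallest difference in degrees between two angles."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def fraction_within(predicted_angles, experimental_angles, tolerance):
    """Fraction of angles predicted within `tolerance` degrees."""
    hits = sum(1 for p, e in zip(predicted_angles, experimental_angles)
               if angular_difference(p, e) <= tolerance)
    return hits / len(predicted_angles)
```

For example, a model whose every atom is displaced by exactly 1 Å from the experimental position has an RMSD of 1.0 Å.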

PROBLEM AREAS: OLD ENEMIES AND NEW FRIENDS
Predicting the details of side chain conformations is still
a challenging problem, as mentioned above. Another side chain-related
area of importance and difficulty is developing methods to predict
long-range contacts (side chains that are far apart in sequence
but near in space). Improving a comparative modeling main chain
prediction beyond agreement with a homologous structure ("refinement")
is also difficult to do. But people new to the field might be
surprised to learn that a fourth significant difficulty is recognizing
which predicted model is the best. In CASP5, prediction groups
were permitted to submit up to five models for each unknown,
ranked according to believed correctness. However, it was sometimes
the case that the truly best model was not the top-ranked model
of the five. An area of active investigation is the question
of how to reliably identify the best from a group of likely
models. A fifth area of surprising difficulty is the question
of correct sequence-to-structure alignment. As one of the judges
said, alignment is still a hard problem. Why? Probably for the
same reason that structure prediction is a hard problem. The
fold of a protein is the result of a lot of weak and not very
specific interactions. The general message of a stretch of amino
acids may be "form a helix," but the surrounding contacts may
alter where the helix begins or ends. The same amino acid may
play a somewhat different structural role in different members
of a family.
New directions of effort include predicting sequences and predicting disorder.

PREDICTING AMINO ACID SEQUENCES
The availability of sequence variants enables aligning a set
of related sequences to obtain a sequence profile. A profile-based
search is better able to identify a structural template in the
PDB than a single-sequence search is. Some CASP attendees implemented
an interesting twist on this idea: design amino acid sequences
for a structure! This is related to the "inverse folding problem,"
first discussed in 1991 (2): what sequences are consistent with
a given structure? (Think "Imelda Marcos": design shoes
to fit a particular foot.) For structures in the PDB that do
not have a good number of natural homolog sequences available,
predict a large set of amino acid sequences that are consistent
with a given structure, form a profile from these, and use this
profile to search genomes for sequences which are compatible
with that structure. In each of 40+ genomes, this reverse procedure
identified a previously unrecognized structure template for
one or more sequences (4).
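A sequence profile, whether built from natural homologs or from designed sequences, can be pictured as a table of per-position amino acid frequencies. The sketch below is a minimal version under simplifying assumptions: the sequences are already aligned with no gaps, no pseudocounts or sequence weighting are applied, and a candidate is scored by summing the frequencies of its residues. Real profile methods are considerably more elaborate.

```python
from collections import Counter


def build_profile(aligned_sequences):
    """Return one {amino acid: frequency} dictionary per alignment column."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")
    n = len(aligned_sequences)
    profile = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        profile.append({aa: count / n for aa, count in counts.items()})
    return profile


def profile_score(profile, candidate):
    """Sum the profile frequencies of the candidate's residues."""
    if len(candidate) != len(profile):
        raise ValueError("candidate and profile must have the same length")
    return sum(column.get(aa, 0.0) for column, aa in zip(profile, candidate))
```

A profile built from `["ACD", "ACE", "GCD"]` scores the candidate `"ACD"` higher than `"GCE"`, because `"ACD"` uses the more common residue at the first and third positions.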

DISORDER AS STRUCTURE
Another new facet to the concept of structure prediction at
CASP5 was that naturally or conditionally disordered regions
are a predictable and functional structural category also encoded
by amino acid sequences (3). A survey of the literature had
suggested that functionally important regions of disorder exist
in many proteins. Nineteen unknowns for CASP5 had at least 5%
disorder. One (intentionally chosen) was completely disordered,
and three others had disordered regions of between 15 and 40%
of their total length. Six groups attempted disorder predictions,
with success rates in identifying disordered regions up to 100%
in favorable cases. Disordered regions, by organizing only in
the presence of a ligand, could provide high binding specificity
without tight (and therefore hard to disengage) binding, because
the binding energy would be taken up in organizing the disordered
region. Such behavior could be important in regulatory settings (3).

OBTAINING A PREDICTION
As a public service (1), members of the CASP community have
started a "ten-most-wanted" list (TMW), important sequences
whose structures are desired by members of the biological community,
to be worked on by a number of predictors. The first round is
in progress (January 2003). A second round will start when the
first is finished (http://www.doe-mbi.ucla.edu/TMW [excellent
one-page introduction to TMW]; http://tmw.llnl.gov [gives more
detail, including where to go to find out how to suggest a sequence
of interest]).
Automated sites.
The BioInfo site (overview at http://bioinfo.pl; 3D-Jury at http://BioInfo.PL/Meta) is the most successful meta-server. A sequence can be submitted there, and the submitter will be able to download automatically generated results to examine and interpret. Turnaround time is 7 days or less.
A successful individual server site in CASP5 (with a particularly clear user interface) was ORNL-Prospect (http://compbio.ornl.gov/PROSPECT/), offering secondary structure predictions, 3D predictions by threading a sequence through candidate structures, and a 3D prediction pipeline for using a comprehensive set of tools and more submitter-provided information than sequence alone. The pipeline offers all-atom models and evaluates their quality. The output is well presented.
The 3D Jigsaw server (http://www.bmm.icnet.uk/servers/3djigsaw/) returns side chain-containing models, not just alpha-carbon models.
The 3dpssm server (http://www.sbg.bio.ic.ac.uk/3dpssm/) was a high-scoring server in CAFASP2.
Other servers.
Links to 18 meta-servers and individual servers are listed on the BioInfo page mentioned above.
One of the first efforts to offer structure prediction as a service (originally at Heidelberg) has evolved into the site at http://cubic.bioc.columbia.edu/pp/.
The PDB site is http://www.rcsb.org/pdb/.

TRUTH IN ADVERTISING
Some quantitative information in this article is preliminary
because it comes from the meeting reports, not from the refereed
reports that are to follow from the meeting (7). Further, in
describing methodology we showcased the hardest area of structure
prediction because we consider its recent success a triumph.
This demanding area still requires in-house expertise for good
results, but we anticipate that in a year or two public servers
will be available for such prediction methods. In the hardest
area of prediction there are other promising and different techniques
(7) besides the prominent one that we describe. Also, because
automation is still under development, many of the excellent
CAFASP3 servers are not publicly accessible yet; this is another
reason to try the BioInfo site, which does have access to them.
An area that we gave less coverage to is comparative modeling,
the prediction of structures from existing information about
other structures that are expected to be quite similar. Good
public servers are already available for comparative modeling
predictions (URLs above). A very good review (9) of the meeting
by the comparative modeling assessor was published after we
had submitted this commentary. Finally, and most importantly,
predictors do not want users to blindly accept predictions.
Getting an automated prediction is easy; understanding and evaluating
it take experience and effort. But we are saying that it can
now be worth that effort.
The views expressed in this Commentary do not necessarily reflect the views of the journal or of ASM.

FOOTNOTES
* Corresponding author. Mailing address: Biochemistry & Biophysics Department, Texas A&M University, 2128 TAMU, College Station, TX 77843-2128. Phone: (979) 845-6842. Fax: (979) 845-9274. E-mail: rosmar@tamu.edu.


REFERENCES
1 - Abbott, A. 2001. Computer modelers seek out 'Ten Most Wanted' proteins. Nature 409:6816.
2 - Bowie, J. U., R. Luthy, and D. Eisenberg. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253:164-170.
3 - Dunker, A. K., C. J. Brown, and Z. Obradovic. 2002. Identification and functions of usefully disordered proteins. Adv. Protein Chem. 62:25-49.
4 - Larson, S. M., A. Garg, J. R. Desjarlais, and V. S. Pande. 2003. Increased detection of structural templates using alignments of designed sequences. Proteins 51:390-396.
5 - Murzin, A. G., S. E. Brenner, T. Hubbard, and C. Chothia. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247:536-540.
6 - Orengo, C. A., F. M. Pearl, and J. M. Thornton. 2003. The CATH domain structure database. Methods Biochem. Anal. 44:249-271.
7 - Proteins. Fifth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction. Proteins, in press.
8 - Proteins. Fourth Meeting on the Critical Assessment of Techniques for Protein Structure Prediction. 2001. Proteins 45, Supplement 5.
9 - Tramontano, A. 2003. Of men and machines. Nat. Struct. Biol. 10:87-90.