19 Mar 2001

Biol 6312

Prediction of Protein Structure

1. Overview

Key Reference:

Protein structure prediction in the postgenomic era
[Review article]
David T Jones
Current Opinion in Structural Biology 2000, 10:371-379.

There continues to be a large gap between the number of proteins of known amino acid sequence and the number of proteins of known 3-d structure. This gap may never be eliminated.

But protein structure is essential for understanding the function of the protein:

  1. mechanism of protein folding
  2. mechanism of enzyme catalysis
  3. analysis of stability
  4. interactions with other molecules
    1. proteins
    2. ligands
    3. substrates
    4. inhibitors

Can the prediction of protein structure from sequence be improved enough to eliminate the need to crystallize each protein?

Probably not in the near future, but predictions can generate useful and generally reliable information.

There are 3 levels of analysis in the overall prediction scheme:

  1. Motif recognition in the primary sequence
  2. Secondary structure prediction
  3. Tertiary structure/fold prediction

A starting point is often to search for proteins with sequences that are similar to the protein under study. This usually involves a BLAST search:

BLAST (Basic Local Alignment Search Tool)

Altschul SF, Madden TL, Schaffer AA, Zhang JH, Zhang Z, Miller W, Lipman DJ:
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res 1997, 25: 3389–3402.

From this analysis one might learn about the function of similar proteins, and whether a 3-d structure exists for any of them. Depending on the degree of amino acid identity of the closest "neighbors", one might surmise the function of the protein.

Protein Motif Recognition

Prosite is a database of sequence motifs for functions such as:

post-translational modifications of N- or C-termini, signal sequences for localization, sites of lipid attachment, sites of phosphorylation, or markers of particular types of enzymes

Example: phosphorylase kinase   KRKQISVR   Reference: Kemp & Pearson (1990) TiBS 15:342-346

PROSITE Website

A protein sequence can be scanned against the PROSITE database from the PROSITE website.

There are other databases of sequence motifs. Aligned sequences can be generated from their websites.
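A motif scan of this kind can be sketched in a few lines. The example below is a minimal illustration, not the PROSITE scanning software: it translates a PROSITE-style pattern into a regular expression and reports every hit. The function names `prosite_to_regex` and `scan` and the test sequence are hypothetical, and the N-terminal and C-terminal anchors (`<`, `>`) of the pattern syntax are not handled. The pattern used, N-{P}-[ST]-{P}, is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        # [ST]-style alternatives and single letters are already valid regex.
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


def scan(sequence: str, pattern: str) -> list:
    """Return (1-based start, matched text) for every hit, overlaps included."""
    # The lookahead makes each match zero-width so overlapping hits are found.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Hypothetical sequence, for illustration only.
hits = scan("MKNGSAANPTLNCTQ", "N-{P}-[ST]-{P}")
```

A fixed motif such as KRKQISVR can be searched the same way by writing it as K-R-K-Q-I-S-V-R.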

PRINTS  a Protein Fingerprint database

BLOCKS

ProDom

Prediction of Secondary Structure

In some cases this is preliminary to prediction of tertiary structure.

a) Statistical methods, first developed by Peter Chou and Gerald Fasman

These are based on the tendencies of particular amino acids to be found in the different types of secondary structure. They are considered to be about 60% accurate.

Server for Chou-Fasman Prediction

Original references are too early for on-line abstracts etc.

Chou PY, Fasman GD.
Empirical predictions of protein conformation.
Annu Rev Biochem. 1978;47:251-76. Review.

Chou PY, Fasman GD.
Prediction of protein conformation.
Biochemistry. 1974 Jan 15;13(2):222-45.

Chou PY, Fasman GD.
Conformational parameters for amino acids in helical, beta-sheet, and random coil regions calculated from proteins.
Biochemistry. 1974 Jan 15;13(2):211-22.

Garnier (GOR1) Predictor of secondary structure

Garnier J, Osguthorpe DJ, Robson B.
Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins.
J Mol Biol. 1978 Mar 25;120(1):97-120.

Garnier J, Gibrat JF, Robson B.
GOR method for predicting protein secondary structure from amino acid sequence.
Methods Enzymol. 1996;266:540-53.

b) neural net predictions of Secondary Structure:

They use a training set of known proteins, and usually utilize multiple sequence alignments during the prediction. They are considered to be 70-80% accurate.

PredictProtein (Now at Columbia U.)

Rost B, Schneider R, Sander C.
Progress in protein structure prediction?
Trends Biochem Sci. 1993 Apr;18(4):120-3.

PsiPred David T. Jones

Jones DT:
Protein secondary structure prediction based on position-specific scoring matrices.
J Mol Biol 1999, 292: 195–202.

Membrane-spanning segments of integral membrane proteins can be predicted in similar ways. We will discuss this later.

Prediction of Tertiary Structure

a) ab initio methods are those based on physical principles, e.g. energetics. They assume that a protein will fold to the lowest available, or accessible, free energy state.

b) modeling by homology to a similar protein of known structure.

2. Prediction of Secondary Structure

A. Statistical methods for the prediction of Secondary Structure

Chou-Fasman Method

The tendencies of each type of amino acid to be found in each of the 3 types of secondary structure (helix, sheet, loop) are calculated from a database of high-resolution structures (2 Å):

Original database, 1974 contained 15 proteins (2473 amino acids)

Revised, 1978, containing 29 proteins (4741 amino acids)

Simply increasing the database did not increase the accuracy of the predictions.

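The propensity of amino acid type i for secondary structure type j is

$$P_{ij} = \frac{n_{ij} / n_i}{N_j / N}$$

      where the i's correspond to the amino acid type (20) and the j's are the secondary structures (3)

e.g. $P_{\mathrm{Ala},\alpha}$ is the propensity for Ala to be found in an alpha-helix, with:

  $n_{\mathrm{Ala},\alpha}$ = the number of Ala residues found in alpha-helix (in the database)

  $n_{\mathrm{Ala}}$ = the number of Ala residues (in the database)

  $N_{\alpha}$ = the number of amino acids found in alpha-helix (in the database)

  $N$ = the number of amino acid residues in the database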

So, the top of the propensity expression represents the fraction of all Ala residues that are alpha-helical,

while the bottom represents the fraction of ALL residues that are alpha-helical. Therefore, this ratio represents the tendency of each amino acid relative to the average amino acid.

A propensity of >1 indicates more likely than chance, while <1 indicates less likely than chance. A value of 1.0 indicates no prediction.
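The ratio described above (the fraction of residues of one type found in a given structure, divided by the fraction of all residues found in that structure) can be computed directly from a set of annotated sequences. The sketch below is illustrative only: the function name `propensities` and the toy data are hypothetical, and the one-letter state codes (`H` for helix, `C` for everything else) are an assumption of the example.

```python
from collections import Counter


def propensities(proteins):
    """Compute propensities from (sequence, structure) string pairs.

    Each structure string assigns one state letter to each residue.
    Returns {(residue, state): propensity} for every observed pair.
    """
    pair_counts = Counter()     # residues of type i found in state j
    residue_counts = Counter()  # residues of type i
    state_counts = Counter()    # residues found in state j
    total = 0                   # all residues in the database
    for sequence, structure in proteins:
        if len(sequence) != len(structure):
            raise ValueError("sequence and structure must have equal length")
        for residue, state in zip(sequence, structure):
            pair_counts[(residue, state)] += 1
            residue_counts[residue] += 1
            state_counts[state] += 1
            total += 1
    return {
        (residue, state): (count / residue_counts[residue])
        / (state_counts[state] / total)
        for (residue, state), count in pair_counts.items()
    }


# Toy "database" of one six-residue protein, for illustration only.
toy = propensities([("AAGAGG", "HHHCCC")])
# Two of three Ala are helical, while half of all residues are helical,
# so the helix propensity of Ala is (2/3) / (1/2).
```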

Since secondary structure is usually formed by several consecutive residues, it is more meaningful to take running averages of 5 or more amino acids at a time. This is called the window. A prediction will tend to be most accurate when the window matches the size of the actual segment of secondary structure.

The entire length of the protein is analyzed for each set of secondary structure propensities (helix, sheet, turn). The final prediction is made by comparing the 3 sets of values.
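The windowed comparison can be sketched as follows. This is a simplified illustration of the running-average idea only, not the full Chou-Fasman procedure (which also uses nucleation and extension rules). The propensity values are placeholders chosen for the example and are not the published parameters; residues missing from a table are given the neutral value 1.0, and windows are truncated at the termini, both of which are assumptions.

```python
# Placeholder propensities for a few residues (NOT the published values).
HELIX = {"A": 1.4, "E": 1.5, "L": 1.2, "G": 0.6, "P": 0.6, "V": 1.1}
SHEET = {"A": 0.8, "E": 0.4, "L": 1.3, "G": 0.8, "P": 0.6, "V": 1.7}
TURN = {"A": 0.7, "E": 0.7, "L": 0.6, "G": 1.6, "P": 1.5, "V": 0.5}


def window_average(sequence, table, window=5):
    """Centered running average of propensities along the sequence."""
    half = window // 2
    averages = []
    for i in range(len(sequence)):
        # The window is shortened where it would run past either terminus.
        segment = sequence[max(0, i - half): i + half + 1]
        averages.append(sum(table.get(r, 1.0) for r in segment) / len(segment))
    return averages


def predict(sequence, window=5):
    """Assign each residue the state with the highest windowed average."""
    profiles = {
        "H": window_average(sequence, HELIX, window),
        "E": window_average(sequence, SHEET, window),
        "T": window_average(sequence, TURN, window),
    }
    return "".join(
        max(profiles, key=lambda state: profiles[state][i])
        for i in range(len(sequence))
    )


prediction = predict("AEAEAEAE")
```

Changing the `window` argument shows how the prediction depends on how well the window matches the length of the real segment.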

These calculations can be done in a spreadsheet using the "LOOKUP" function (as it is called in Excel). This will be demonstrated later for transmembrane spans.

Special applications:

1) Regions that have a high propensity for both alpha-helix and beta-sheet are candidates for conformational changes.

2) The effects of mutations can be predicted by changing the sequence.

3) Turns can be predicted (with fair accuracy) on a residue-by-residue basis, because they are not extended structures.

Limitations: Accuracy of this method seems to be limited because of the limited range of propensities. Most types of amino acids can be found often in any secondary structure. Few are really excluded from certain secondary structures.

An extension of this approach uses position-specific propensities. This works well with turns or with the termini of helices/sheets.

Protein Sci 1994 Dec;3(12):2207-16
A revised set of potentials for beta-turn formation in proteins.
Hutchinson EG, Thornton JM

B. Neural net predictions

Turns can be predicted this way:

Protein Sci 1999 May;8(5):1045-55
Prediction of the location and type of beta-turns in proteins using neural networks.
Shepherd AJ, Gorse D, Thornton JM

In general these predictions are obtained from servers through the world wide web.

PredictProtein from EMBL or Columbia University

Rost B, Schneider R, Sander C.
Progress in protein structure prediction?
Trends Biochem Sci. 1993 Apr;18(4):120-3.

Marcotte EM, Pellegrini M, Thompson MJ, Yeates TO, Eisenberg D:
A combined algorithm for genome-wide prediction of protein function.
Nature 1999, 402: 83–86.

Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA:
Protein interaction maps for complete genomes based on gene fusion events.
Nature 1999, 402: 86–90.

Bowie JU, Lüthy R, Eisenberg D:
A method to identify protein sequences that fold into a known three-dimensional structure.
Science 1991, 253: 164–170.

Jones DT:
GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences.
J Mol Biol 1999, 287: 797–815.

Moult J, Hubbard T, Fidelis K, Pedersen JT:
Critical assessment of methods of protein structure prediction (CASP): round III.
Proteins 1999, S3: 2–6.

Fischer D, Barret C, Bryson K, Elofsson A, Godzik A, Jones D, Karplus KJ, Kelley LA, MacCallum RM, Pawlowski K et al.:
CAFASP-1: critical assessment of fully automated structure prediction methods.
Proteins 1999, S3: 209–217.

Petersen TN, Lundegaard C, Nielsen M, Bohr H, Bohr J, Brunak S, Gippert GP, Lund O:
Prediction of protein secondary structure at 80% accuracy.
Proteins 2000, 41: 17–20.

CAFASP: Critical Assessment of Fully Automated Structure Prediction Website

3. Predictions of Tertiary Structure

These sites offer www access to predictions of tertiary structure:

3D-PSSM Imperial Cancer Research Fund, London

Fugue University of Cambridge

SAM-T99 UC Santa Cruz

In general there are two approaches

a) Try to model an amino acid sequence by homology or by compatibility to known structures

   Identification of topological fold is often the goal.

b) Try to fold an amino acid sequence based on physical principles

A) Modelling Approach

Look for sequences that have >30% sequence identity with a protein of known structure

(Sequences of 15-30% identity can be attempted)
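As a rough sketch of how these thresholds might be applied, the example below computes percent identity from a pair of already-aligned sequences and maps it onto the ranges given above. The function names and the test alignment are hypothetical. Identity is counted over columns where neither sequence has a gap, which is only one of several conventions in use, and the label returned below 15% is an assumption of the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are skipped, not counted as mismatches
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def modelling_outlook(identity):
    """Map percent identity onto the ranges quoted in the notes."""
    if identity > 30:
        return "homology modelling"
    if identity >= 15:
        return "can be attempted"
    return "consider threading"  # label assumed for this example


identity = percent_identity("ACDE-FG", "ACDQKFG")  # 5 of 6 compared columns
outlook = modelling_outlook(identity)
```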

Basic principles:

1) Buried amino acid residues are hydrophobic

2) Surface amino acid residues are polar

3) Within a family of homologous proteins, buried and active site residues are conserved.

4) Within a family of homologous proteins, surface residues are variable.

5) Elements of secondary structure will be more highly conserved than amino acid sequence.
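Principles 3 and 4 suggest looking at how variable each column of a multiple sequence alignment is. One common way to measure this, used here as an illustration and not taken from these notes, is the Shannon entropy of each column: 0 for a fully conserved position, rising as the position becomes more variable. The function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(alignment):
    """Shannon entropy (in bits) of each column of an alignment.

    alignment: list of equal-length strings, one per sequence.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    entropies = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column)
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
        entropies.append(entropy)
    return entropies


# Toy alignment: two conserved columns followed by one variable column.
entropies = column_entropy(["ACD", "ACE", "ACF", "ACG"])
```

Under the principles above, low-entropy columns would be candidates for buried or active site positions, and high-entropy columns for surface positions.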

3 steps in the procedure

1) Sequence alignment

2) Build sequence into secondary structure

3) Energy minimize to improve tertiary structure

If no homologous protein can be identified by sequence comparisons, the compatibility of the a.a. sequence of the target can be determined for representations of all known folds (templates).

This is called threading.

Example:

J Mol Biol 1997 Apr 11;267(4):1026-38
A 3D-1D substitution matrix for protein fold recognition that includes predicted secondary structure of the sequence.
Rice DW, Eisenberg D

They have derived a scoring matrix from a database of 119 pairs of proteins of known structure with the same fold, but with <30% sequence identity.

882 elements = 7 x 3 x 2 x 7 x 3

7 classes of amino acids (Cys; Trp; Arg,Lys; Tyr,Phe; Ile,Leu,Val,Met; Ala,Gly,Ser,Thr,Pro; Asp,Glu,Asn,Gln,His)

3 types of secondary structure (helix, sheet, turn)

2 locations (buried, exposed)

and in the target sequence 7 classes of a.a. and 3 types of secondary structure (from PredictProtein)
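The size of the matrix follows from enumerating every combination of the five factors. The sketch below writes the seven residue classes with one-letter codes and counts the combinations; the variable and function names are hypothetical.

```python
from itertools import product

# The seven residue classes listed above, in one-letter code.
RESIDUE_CLASSES = ["C", "W", "RK", "YF", "ILVM", "AGSTP", "DENQH"]
SECONDARY = ["helix", "sheet", "turn"]
BURIAL = ["buried", "exposed"]

# Map each of the 20 amino acids to its class.
CLASS_OF = {aa: group for group in RESIDUE_CLASSES for aa in group}


def matrix_keys():
    """All (template class, template structure, template burial,
    target class, target predicted structure) combinations."""
    return list(
        product(RESIDUE_CLASSES, SECONDARY, BURIAL, RESIDUE_CLASSES, SECONDARY)
    )


n_elements = len(matrix_keys())  # 7 x 3 x 2 x 7 x 3
```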

First, obtain the secondary structure prediction from PredictProtein.

Second, calculate the score for each of the 119 folds:

Example:

The highest score is for a Trp, predicted to be in a helix, that matches a buried Trp in a helix: score = 4.5

A basic residue predicted to be in a sheet that matches an exposed basic residue in a sheet: score = 2.3

The same basic residue matched to an exposed basic residue in a helix would score -9 (the lowest score)
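To show how such a matrix might be used, the sketch below sums matrix entries over the aligned positions of a target and a template. It is a simplified illustration, not the published method: the alignment is assumed to be fixed and ungapped, only the three scores quoted above are filled in, and every other entry falls back to a placeholder default of 0.0. All names are hypothetical.

```python
RESIDUE_CLASSES = ["C", "W", "RK", "YF", "ILVM", "AGSTP", "DENQH"]
CLASS_OF = {aa: group for group in RESIDUE_CLASSES for aa in group}

# Key: (template class, template structure, template burial,
#       target class, target predicted structure).
# Only the three example scores from the text are included.
SCORES = {
    ("W", "helix", "buried", "W", "helix"): 4.5,
    ("RK", "sheet", "exposed", "RK", "sheet"): 2.3,
    ("RK", "helix", "exposed", "RK", "sheet"): -9.0,
}


def threading_score(template, target, scores=SCORES, default=0.0):
    """Sum matrix scores over aligned positions.

    template: list of (residue, structure, burial) tuples
    target:   list of (residue, predicted structure) tuples
    """
    if len(template) != len(target):
        raise ValueError("template and target must be the same length")
    total = 0.0
    for (t_res, t_ss, t_burial), (q_res, q_ss) in zip(template, target):
        key = (CLASS_OF[t_res], t_ss, t_burial, CLASS_OF[q_res], q_ss)
        total += scores.get(key, default)  # missing entries use the placeholder
    return total


good = threading_score(
    [("W", "helix", "buried"), ("K", "sheet", "exposed")],
    [("W", "helix"), ("R", "sheet")],
)
bad = threading_score([("K", "helix", "exposed")], [("R", "sheet")])
```

The fold with the highest total over all templates would be taken as the best candidate.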

B) Ab initio Approach (physical principles)

This approach can work even in the absence of homology to known structures, but overall the reliability is low.

LINUS is one example: Local Independently Nucleated Units of Structure

50 amino acids are folded at a time, in an overlapping fashion: 1-50, 26-75, ...
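The overlapping segments can be generated as below. This is a minimal sketch of the windowing only: the step of 25 residues is inferred from the pattern 1-50, 26-75, and shortening the final segment to end at the C-terminus is an assumption of the example, not a documented detail of LINUS. The function name is hypothetical.

```python
def overlapping_segments(length, size=50, step=25):
    """Return 1-based, inclusive (start, end) pairs covering the chain."""
    if length < 1:
        return []
    segments = []
    start = 1
    while True:
        # The last segment is shortened so that it ends at the C-terminus.
        end = min(start + size - 1, length)
        segments.append((start, end))
        if end == length:
            break
        start += step
    return segments


segments = overlapping_segments(120)
```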

It is based on the idea that actual proteins fold by forming local secondary structure first.

Side chains are simplified. Only 3 interactions are used:

1 repulsive: steric

2 attractive: H-bonds and hydrophobic

Possible conformations are then evaluated in a search for the lowest free energy.

Proteins 1995 Jun;22(2):81-99
LINUS: a hierarchic procedure to predict the fold of a protein.
Srinivasan R, Rose GD

Proc. Natl. Acad. Sci. USA Vol. 96, Issue 25, 14258-14263, December 7, 1999
A physical basis for protein secondary structure
Rajgopal Srinivasan and George D. Rose


Comments/questions: svik@mail.smu.edu

Copyright 2001, Steven B. Vik, Southern Methodist University

Last modified 4/3/01