Evolution of proteins in sequence space

From: pruest@pop.dplanet.ch
Date: Thu Aug 02 2001 - 11:18:27 EDT

Next message: George Hammond: "Re: WHY 15-BILLION YEARS = 6000 YEARS"

Previous message: Darryl Maddox: "Re: possible future shortages of other resources"
Next in thread: bivalve: "Re: Evolution of proteins in sequence space"
Reply: bivalve: "Re: Evolution of proteins in sequence space"
Reply: Lawrence Johnston: "Re: Evolution of proteins in sequence space"
Reply: Howard J. Van Till: "Re: Evolution of proteins in sequence space"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Proteins may evolve in two basically different modes. One mode is by a
sequence of point mutations. The other mode is by genetic recombination
of preexisting modules or fragments. (Let me ignore deletions, which
presumably are deleterious in the vast majority of cases, except
perhaps for occasional deletions of entire codons.) Each new sequence
produced must then be accepted (and fixed in the population) by natural
selection or by random drift; if it is lost, it does not contribute to
evolution. Novel sequence information is generated only in the first
mode, by a series of several point mutations.

Basically, any sequence within the transastronomically huge
combinatorial space of the 20^L possible sequences of proteins of length
L would be accessible during evolution, if there is a mutational path
which leads from an existing sequence to the target considered and which
does not contain any intermediates which are selected against (or even
lethal). In order to evaluate this mechanism of evolution and the
probability of its success, we should have an idea about the frequency
of useful sequences in sequence space. This information has been
missing, but now some indications about it are available.
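
To give a feeling for the size of this space, here is a minimal Python
sketch of the 20^L count. The function names and the example lengths are
my own choices for illustration, and the alphabet is assumed to be the 20
standard amino acids.

```python
import math

AMINO_ACIDS = 20  # size of the standard amino acid alphabet (assumption)


def sequence_space_size(length: int) -> int:
    """Exact number of distinct sequences of the given length (20^L)."""
    return AMINO_ACIDS ** length


def log10_sequence_space(length: int) -> float:
    """Order of magnitude of 20^L, computed without huge integers."""
    return length * math.log10(AMINO_ACIDS)


# Illustrative lengths only, chosen for this sketch.
for length in (50, 100, 200):
    print(f"L = {length}: roughly 10^{log10_sequence_space(length):.0f} sequences")
```

Working with the logarithm keeps the numbers readable; the exact integer
is available from sequence_space_size() if it is ever needed.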

Keefe A.D., Szostak J.W., "Functional proteins from a random-sequence
library", Nature 410 (2001), 715-718, generated a library of 6x10^12
proteins, each containing 80 contiguous random amino acids, and enriched
those proteins that bound to ATP. They found four new families of
ATP-binding proteins unrelated to each other and unrelated to the
natural ones. The selectively enriched substitutions were distributed
over 62 of the 80 randomized amino acids, and a core domain of 45 amino
acids sufficient for ATP-binding was defined. Keefe et al. estimated
that roughly 1 in 10^11 of all random-sequence proteins have ATP-binding
activity.
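
As a rough check on the scale of this experiment, the following sketch
compares the library size with the full space of 80-residue sequences and
applies the 1 in 10^11 estimate to the library. It assumes that every
library member is a distinct sequence and that all 20 amino acids are
equally allowed at each position; the variable names are mine.

```python
import math
from fractions import Fraction

LIBRARY_SIZE = 6 * 10**12                  # proteins in the library, as cited
RANDOM_LENGTH = 80                         # random amino acids per protein
FUNCTIONAL_FRACTION = Fraction(1, 10**11)  # rough estimate by Keefe et al.

# Share of all possible 80-mers that the library could cover at most,
# expressed as a power of ten (assumes no duplicate sequences).
log10_sampled = math.log10(LIBRARY_SIZE) - RANDOM_LENGTH * math.log10(20)

# Number of ATP binders expected in the library if the estimate holds.
expected_binders = LIBRARY_SIZE * FUNCTIONAL_FRACTION

print(f"library covers about 10^{log10_sampled:.0f} of the 80-mer space")
print(f"expected ATP binders in the library: {float(expected_binders):.0f}")
```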

Silverman J.A., Balakrishnan R., Harbury P.B., "Reverse engineering the
(beta/alpha)8 barrel fold", Proceedings of the National Academy of
Sciences USA 98 (2001), 3092-3097, analyzed the most commonly occurring
fold among protein catalysts, the TIM (triosephosphate isomerase) barrel
consisting of 8 analogous units of beta sheet, loop, alpha helix, and
turn, which together form a barrel accommodating a variable active site,
used in a large family of different enzymes. Silverman et al. applied
combinatorial mutagenesis of 182 amino acid positions in the barrel and
functional selection for TIM activity in E. coli, requiring a minimal
threshold of 10^-4 of wild-type activity. They estimate that fewer than
1 in 10^10 of the sequences in their degenerate library are able to
complement in vivo.

Thus, the two estimates agree quite well, even though they are derived
in very different ways. If we look at protein sequence space, fewer (how
many fewer?) than 1 in 10^10 sequences are triosephosphate isomerase
enzymes, and 1 in 10^11 sequences binds ATP, which is a partial activity
of many enzymes.

As the human genome contains an estimated 30,000 genes, and the number
of different protein folds is estimated to be a few thousand, we may, as
a very rough approximation, assume that there are fewer than 10^4
basically different protein families in the biosphere, within each of
which a number of similar proteins can be derived from each other by
feasible evolutionary paths.

The question is whether each of the 10^4 different protein families can
be similarly derived from one or very few initial sequences, or by
random mutational walks. If a novel enzyme or other functional protein
is to arise, which is not easily derivable by a few selected mutations
from an already existing one, we need a mutational random walk. The
probability that a given random sequence has the required activity is
about 10^-11. If, at a given moment in the evolution of a species, any
one of 10^4 different novel activities would prove advantageous, the
probability that a given sequence has any one of them is about
10^4 x 10^-11 = 10^-7.
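
The step from 10^-11 to 10^-7 can be sketched as follows. This is only an
illustration under an assumption I am adding here, namely that the 10^4
activities are independent of each other and each has the same
per-sequence probability; the function name is hypothetical.

```python
import math


def prob_any_activity(p_single: float, n_activities: int) -> float:
    """P(a random sequence has at least one of n activities).

    Assumes independent activities with equal probability p_single.
    Computes 1 - (1 - p)^n in a form that stays accurate for tiny p.
    """
    return -math.expm1(n_activities * math.log1p(-p_single))


p_single = 1e-11      # per-sequence probability for one activity
n_activities = 10**4  # number of activities that would be advantageous

p_any = prob_any_activity(p_single, n_activities)
approx = n_activities * p_single  # simple product used in the text
expected_trials = 1 / p_any       # mean number of random sequences per hit

print(f"exact: {p_any:.3e}, product: {approx:.3e}")
print(f"expected sequences to try: {expected_trials:.3e}")
```

For probabilities this small, the simple product and the exact expression
are practically indistinguishable.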

These estimates assume that directed evolution in the lab is a valid
model for natural evolution. Of course, this is not the case, as in
directed evolution one does not have to bother about the viability of
each intermediate organism in a linear sequence of point mutations, but
only about the isolated activity of a new protein sequence after several
or many mutations. Directed evolution jumps around in sequence space,
whereas natural evolution is limited to single-step paths, and none of
these steps may go downhill on the fitness surface.

How, then, is it possible that any one of the 10^3 or 10^4 basically
different protein folds (families) arose (anywhere in the biosphere),
let alone all of them? If 10^3 different searches were needed, each with
a success probability of around 10^-10, it seems a hopeless proposition.
(And the few million years available for the formation of the first
viable organism appear transastronomically inadequate.)

The only way out seems to be to claim that all of the different protein
families used in the biosphere are intimately connected in sequence
space, such that simple linear
sequences of point mutations, with all intermediates naturally selected,
will do for all proteins. In this case, more than 99.999999999% (eleven
nines altogether) of sequence space is barren for life and was never
visited by any sequence during evolution. Whether this is a feasible
proposition will have to be shown experimentally.
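
The percentage can be checked with exact decimal arithmetic. The sketch
below assumes a useful fraction of exactly 10^-11, which is only the
rough figure used above; the variable names are mine.

```python
from decimal import Decimal

# Assumed fraction of sequence space that is useful (rough figure only).
useful_fraction = Decimal(1) / Decimal(10) ** 11

# Remaining, barren share of sequence space, as a percentage.
barren_percent = (Decimal(1) - useful_fraction) * 100

# Count the nines in the decimal representation of the percentage.
nines = str(barren_percent).count("9")

print(f"barren share: {barren_percent.normalize()} % ({nines} nines)")
```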

This still leaves us with the mystery of the origin of the first living
organism capable of natural evolution.

But the very interesting finding of the two papers mentioned is that the
protein sequence space is extremely sparsely populated with useful
sequences. This makes evolution (which, for theological reasons, I
believe has happened) an astonishingly marvellous process.

Peter

-- 
--------------------------------------------------------------
Dr Peter Ruest			Biochemistry
Wagerten			Creation and evolution
CH-3148 Lanzenhaeusern		Tel.:	++41 31 731 1055
Switzerland			E-mail:	<pruest@dplanet.ch>
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
	In biology - there's no free lunch -
		and no information without an adequate source.
	In Christ - there is free and limitless grace -
		for those of a contrite heart.
--------------------------------------------------------------

This archive was generated by hypermail 2b29 : Thu Aug 02 2001 - 11:18:42 EDT