Re: Evolvability of new functions

From: Tim Ikeda (tikeda@sprintmail.com)
Date: Mon Oct 30 2000 - 23:43:32 EST

    Thanks for the feedback Peter.
    >Hi Tim,
    >thank you for your extensive comments!
    >
    >Tim Ikeda <tikeda@sprintmail.com> wrote:
    >>
    >> David Campbell wrote:
    >> >> I would see the combination of parts from several
    >> >> different genes followed by selection for a new function as a relatively
    >> >> substantial innovation.
    >> ...
    >> Peter:
    >>>I agree that this is a relatively substantial innovation. But,
    >>>nevertheless, I would consider the amount of novel information gained
    >>>to be relatively small. ("Information" has even more different meanings
    >>>than "micro-" and "macroevolution"!) I would justify this claim as
    >>>follows: After a single nucleotide mutation, the mutant and the
    >>>wild-type are subject to natural selection, whose "answer" to the
    >>>mutation is "yes" or "no" or something in-between, i.e. at most 1 bit of
    >>>information. The same consideration applies to any more complex
    >>>mutation, such as a new gene composed of shuffled exons: as far as
    >>>natural selection is concerned, the gain of information from the
    >>>environment is at most 1 bit. If this seems counter-intuitive, we must
    >>>ask whether this new construct was produced in a single step, such as an
    >>>unequal crossing-over. If yes, then it was a simple step, like a simple
    >>>mutation or deletion. If it required a series of coordinated steps, the
    >>>intermediates in this path probably were not under any selection, and
    >>>the probability of end product formation may have been extremely small.

    Me:
    >> Hmm... Sounds like Lee Spetner...
    Peter:
    >Never heard of Spetner...

    That's definitely a plus in my book. ;^)

    Me:
    >> There is a very serious difficulty, IMO (and others' - consult earlier
    >> discussions in sci.bio.info-theory), in relating sequence and structural
    >> information measurements with metrics derived from selection. For example,
    Peter:
    >I fully agree that a metric derived from selection cannot be used for
    >estimating sequence and structural information. But this is NOT what I
    >am doing. What I call functional or semantic information, given by
    >sequence / structure in a given environment, cannot be measured in any
    >way I know of. The closest we can get is what H.P.Yockey did
    >("Information theory and molecular biology" (Cambridge: Cambridge
    >Univ.Press, 1992), p.254) for the presumptive functional information
    >contained in a modern protein (family). But even this doesn't tell us
    >how this information arose. Presumably, the earliest structures
    >displaying this function were very much simpler. The only source for
    >functional information in biological systems we know of is the
    >environment acting in natural selection. Each event of fixation of a
    >genetic change of any type and size is at most a yes/no answer: at most
    >1 bit of information.

    This information metric is entirely dependent on the environmental
    conditions in which the organism finds itself. And since environmental
    conditions change, this metric will change as well. For example,
    one may select for or against strains of bacteria carrying a gene which
    imparts tetracycline resistance. On media with tetracycline, the resistant
    strains are selected (an increase of one bit by your metric). However,
    when spread on Bochner plates, these same strains carrying tetracycline
    resistance alleles may be selected against -- but in this case, the
    loss of the resistance allele (via deletion or point mutation) must
    also be scored as a one bit increase which originates from environmental
    sources of information.
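
    As a toy illustration of this bookkeeping problem (not something taken
    from the thread itself), the sketch below scores every fixation as one
    bit and runs a lineage through both media in turn. The allele labels,
    the 0/1 fitness values and the function names are all invented for the
    example.

```python
# Illustrative only: these relative fitness values are invented, not measured.
RELATIVE_FITNESS = {
    "tetracycline medium": {"tetR": 1.0, "tetS": 0.0},
    "Bochner plate": {"tetR": 0.0, "tetS": 1.0},
}


def selected_allele(environment):
    """Return the allele with the higher relative fitness in this environment."""
    fitness = RELATIVE_FITNESS[environment]
    return max(fitness, key=fitness.get)


def score_history(environments, start="tetS"):
    """Walk a lineage through a series of environments.

    Every change of the fixed allele is scored as one bit, which is the
    most the 'one fixation = at most one bit' metric allows.
    """
    allele, bits = start, 0
    for environment in environments:
        winner = selected_allele(environment)
        if winner != allele:
            allele = winner
            bits += 1
    return allele, bits


final_allele, total_bits = score_history(["tetracycline medium", "Bochner plate"])
print(final_allele, total_bits)
```

    In this toy model the lineage ends on the allele it started with, yet
    the tally reads two bits, so the score says nothing about what changed
    at the sequence level.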

    So clearly selection alone, or interactions with the environment, do not provide
    terribly useful measures of biological information, particularly at
    the sequence level as you've noted. While measures of fitness may provide
    information about how some alleles become distributed across populations,
    the models become largely uncomputable when more than a few genes are
    considered.

    Peter:
    >But natural selection can only test a functional feature already present
    >to some minimal degree. If we consider the entire historical
    >developmental path of a functionality (e.g. an enzyme), including all of
    >the functional information contained in it, its specific activity must
    >have started sometime with a minimal amount of activity just sufficient
    >to make it selectable. And before that? This is the interesting part of
    >its history, because without selection, we can estimate a probability of
    >random emergence.

    Without selection, one can estimate such probabilities if one has a good
    idea about the initial state of the system -- which requires information
    about the makeup of the organism and the environment in which it is found
    (& how these variables have changed over time). And this is only viable
    if the gene or protein in question can be known to have evolved in isolation.
    Include interactions with other components of the cell or the environment
    and one has a difficult time predicting or assigning probabilities to
    changes. I know of no such isolated systems except possibly, those
    examined in vitro.

    Peter:
    >Afterwards, normal darwinian evolution sets in, and I
    >see no way of estimating probabilities. There may be many other critical
    >points in the evolution of a new function, but this is certainly the
    >first one of them - and it is habitually ignored by evolutionary
    >biologists.

    Because the answer cannot be determined at this time? I don't know if the
    question is ignored or whether computational models are simply not known.

    Me:
    >> a single point mutation can allow a bacterium to survive on one growth
    >> medium where it couldn't previously. What could that point mutation have
    >> done? Is there only one bit of change involved?
    >>
    >> It could have changed one amino acid to another in a protein. What is
    >> the net change in the information content of the protein?
    >>
    >> It could have erased a stop codon, permitting expression of a longer
    >> protein. How many bits of information are in the longer sequence?
    >>
    >> Or, the mutation could have wiped out a promoter, preventing the
    >> expression of the protein. Is that information change positive or
    >> negative with respect to the protein?
    >>
    >> Or, the mutation could have generated a new splice site -- How much
    >> information change in the resultant protein?
    >>
    >> Or, the mutation could have replaced a proline with an aspartate,
    >> taking the break out of an alpha-helix. What's the difference
    >> in information content?
    >>
    >> These cases are not readily quantifiable. The question is: With
    >> respect to what is the information metric derived? Sure, the
    >> difference between my having one or two hundred-dollar bills in
    >> my pocket may represent an informational difference of one bit,
    >> but I can do a heck of a lot more with two of those bills than I
    >> can with one.

    Peter:
    >What is not quantifiable here is the amount of functional information
    >acquired by the system in its entire history. I was only considering the
    >last step of selection - which yields at most one bit of additional
    >information, no matter what type of change this last step represented.

    Well, if selection can favor either the fixation of a new gene or the
    deletion of an earlier one (e.g. the tetracycline example I gave), I'm
    not sure how the increase in information as defined by selection
    coefficients maps to sequence variations, even at the last step in
    the process. Thus, I don't see how this discussion could apply to
    protein evolution. (Perhaps that wasn't the point anyway -- I've come
    to the conversation a bit late and I can definitely be dense.)

    Peter:
    >The only reason I brought it up at all is because natural selection is
    >the only natural source of biological information we know of. Of course,
    >the probabilities of the different types of changes which might have
    >produced the new function may be very different, and are usually not
    >estimable. Even if this last step alone produced a new function never
    >before found in the biosphere, the functional properties of the new
    >protein are certainly a consequence of the sequence / function
    >properties of its precursor(s). I would not consider this to be nothing,
    >even if it didn't display any of the new function at all, because it
    >represents a very specific prerequisite for the new function: you cannot
    >splice together any two odd sequences and obtain a specific function
    >required at the moment.

    Peter:
    >>> But to assume that ALL functionalities emerged in such a manner,
    >>> without any non-selectable intermediates, is entirely speculative. How
    >>> do you know this is "the vast majority" of genes? You yourself concede
    >>> that the origin of "the first gene" is not dealt with. There are an
    >>> estimated 1000 different protein folds (each grouping a series of protein
    >>> families or superfamilies) in the biosphere, considering the globular,
    >>> water-soluble proteins only (Y.I.Wolf, N.V.Grishin, E.V.Koonin,
    >>> "Estimating the number of protein folds and families from complete
    >>> genome data", J.Mol.Biol. 299 (2000), 897-905). Almost by definition,
    >>> these 1000 folds are not related to each other by exon shuffling and
    >>> gene duplication.

    Tim:
    >> I think that may be hard to tell. For example, alpha-helices can move and
    >> be rearranged by recombination and duplication. I think some porins and
    >> other transmembrane proteins have likely arisen from events such as these.
    Peter:
    >Ok, I reduce my claim by adding "usually".

    Peter:
    >>> Each one of them had to originate somewhere at least once during the past
    >>> 3.8 billion years. Thus, it would be more realistic to talk about "the
    >>> first 1000 genes" whose emergence cannot be accounted for at present.
    >>> These are the cases I am considering when I talk about a mutational
    >>> random walk without intermediate selection until a minimal selectable
    >>> activity happens to be produced. These are cases I consider
    >>> macroevolutionary steps posing considerable informational problems
    >>> deserving careful attempts at estimating their probability and at
    >>> possibly finding more realistic evolutionary scenarios than merely
    >>> assuming that "it must have happened somehow" through selectable
    >>> intermediates.
    >>>
    >>> You may call these the most elementary cases of Behe's "irreducibly
    >>> complex systems" - whose non-existence has not yet been made plausible.

    Me:
    >> One thing about the "first 1000 folds" (I think fewer perhaps, but
    >> nevermind), is that they seem to be common to all the major divisions
    >> of life. I'm not sure how to peer behind the curtain of 2-3 billion
    >> years ago when the major divisions appear to have split. However, one
    >> thing that comes to mind is that horizontal transfer may have been a
    >> major factor in early life (which may account for the relatedness
    >> between groups). With horizontal transfer, the pool is a little bigger
    >> and testing may go somewhat faster.
    Peter:
    >Wolf et al.'s estimate of 1000 different folds refers to the entire
    >biosphere; horizontal transfers are already taken into consideration.

    Me:
    >> One other thing you've brought up previously was the suggestion that
    >> the different protein families may represent local optima for possible
    >> (or viable?) structures.
    Peter:
    >I don't recall ... Was it the structural requirements for a compact,
    >stable fold, in addition to the functional requirements for catalysis?
    >This was an argument against assuming small peptides could serve as
    >viable proteins.

    Yes, that was what I recalled. And yes, small peptides may not always
    serve as viable proteins in today's environment. Well heck, even large
    polypeptides don't always serve as well in some catalyses: For example,
    they've finally nailed the case for RNA serving as the principal
    catalytic component for protein synthesis.

    Me:
    >> Those regions of "evolutionary stability"
    >> may be attractors for structural convergence. I'm not sure what may
    >> represent the first steps toward these stable regions, but is it possible
    >> that once these steps begin, some convergence to a stable form would
    >> occur?
    Peter:
    >What do you mean by "attractors for structural convergence"? Chaotic
    >attractors? Selection peaks of a fitness surface in parameter space? I
    >don't see the connection with the problem of finding the first minimal
    >activity for a given function. At those points, by definition, the
    >fitness surface is absolutely flat: nothing is selectable, we can only
    >have random walks. Once the selectable steps begin, of course, normal
    >darwinian evolution is possible. What do you mean by "evolutionary
    >stability" in this context?

    Yes, I am proposing the idea of "attractors" and fitness peaks in this
    context. I agree that in isolation, and with a flat fitness surface,
    random walks are what you get. But I don't think that 'flat fitness
    surfaces' necessarily remain flat forever -- The topology will change
    with the local environment. While this could work against evolutionary
    change (or directional change), I can see how it continually 'jostles'
    the surface and can move sequences to peaks. Now do these thousand
    or so 'folds' represent fitness peaks that sequences will tend
    toward? And are these peaks inaccessible from the starting sequences?
    That's tough to tell and I do not believe there is sufficient information
    to perform a decent calculation.

    David:
    >>>> Obviously, examining every known gene sequence to determine the relative
    >>>> frequency of gene duplication, exon shuffling, and the like is not
    >>>> feasible. However, the general pattern that emerges is that, as one
    >>>> examines a gene, one finds related genes with different functions. If
    >>>> there are 1000 truly novel genes, that is still a lot less than the
    >>>> total number of genes in humans, for example.
    >>>> I did not mean to imply that all functions evolved by duplication and
    >>>> modification of existing genes, but rather that it was extremely common.
    Peter:
    >>> If each selected mutational step adds 1 bit of information from the
    >>> environment to a genome, the biosphere can collect quite a lot of
    >>> information from the environment. But how about the "truly novel genes"?
    Me:
    >> The counter you're making seems to be that in instances where it is
    >> clear that a modified sequence gives rise to a function which it didn't
    >> possess before, that these aren't truly novel genes but an un-novel
    >> mixing of old ones.
    Peter:
    >Not quite. What I call a truly novel gene is one whose function has
    >never before existed in the entire biosphere, no matter what led to the
    >last step which originated the first minimal amount of the new activity.

    As mentioned in another reply, that's an impossible criterion to
    meet because one would have to have information that's simply not
    available. We can't completely survey the current biosphere let alone
    conditions in the past.

    Peter:
    >If it is a mixing of old genes, the new gene may display a combination
    >of the old functions (whose novelty is a matter of definition, but these
    >cases need not concern us here), or possibly (but very unlikely)
    >something entirely new, while the old functions no longer exist (perhaps
    >due to clipping). For a reasonable discussion of such a possibility, we
    >should have actual examples where this happened.

    Such as the crystallins or anti-freeze proteins? Or the nylon digesting
    enzyme discovered in a bacterium?

    Peter:
    >Maybe I should distinguish between (1) the emergence of one of Wolf et
    >al.'s 1000 folds and (2) a novel function whose initial emergence
    >required 2 or more changes (mutations, shufflings, ...) going through
    >non-selectable intermediates. I just assumed that cases of (2) are most
    >likely to be found among (1). But this doesn't imply that each (1) must
    >be a (2), or that each (2) leads to a new (1).

    OK.

    Me:
    >> I wonder what "truly novel genes" one would expect
    >> to find from duplication, recombination, mutation and deletion of
    >> _previously existing_ sequences? To what does "novelty" apply: the
    >> new function, the new arrangement of DNA sequences, or the _ultimate_
    >> original origin of the sequences from which the components of the new
    >> function were derived?
    Peter:
    >Novelty applies to the biological function having never existed before
    >in the entire biosphere. There might be many different ways in which
    >novelty may emerge, but the easiest conceivable way (IMO) is a sequence
    >of point mutations in a gene duplicate (possibly in a pseudogene state)
    >leading to a minimal combination of specific amino acid occupations
    >defining a new active site in the protein product.

    Such as transaldolases and transketolases? Or the many different
    dehydrogenases? I think the hemoglobin and myoglobin families present
    interesting varieties. Yes, they bind oxygen (and NO -- which is
    another interesting diversion...), but their physiological properties
    can be very different.

    One additional mechanism with the potential of increasing diversity
    is recombination (which could also involve duplication). This could
    'encourage' the co-evolution of different parts of protein after
    a fusion welds new parts together.

    Peter:
    > In order to bypass,
    >for the moment, difficulties with the definition of the amount of
    >functional information, we'd better not begin with cases where some
    >previous function is incorporated into the new one.

    Me:
    >> Because it's clear that new functions can and do
    >> arise from biochemical mechanisms which we have observed.
    Peter:
    >Are there known cases which fit my definition of novelty?

    None meeting the requirement that the functionality could not
    have ever existed anywhere or anytime previously. But the
    anti-freeze proteins are one of the easiest examples to see
    if we open the field to functions which didn't exist previously
    in a particular lineage.

    Were I to look for more ancient examples, I might examine the
    members of a particular family of folds to see if they all have the
    same catalytic activity. In particular, I would want to focus on
    those families where the catalytic activities can be found entirely
    contained within a subunit, and at 'half-sites' where the catalytic
    groups are shared at interfaces between subunits or at different
    locations in the subunits. The idea behind this approach is that
    one might be more likely to see evidence of other functions
    arising on different sites of the proteins. Also, I might
    look at families that have metal-binding complexes at the centers
    of their structural motifs. In some cases the metal complex serves
    a catalytic function, in others a primarily structural function, and
    in other cases they serve as sensors for oxidation states.

    Me:
    >> Given that
    >> sequences do tend to fall into families (with vertical and sometimes
    >> horizontal linkages) in which many members can exhibit different
    >> functions, this suggests (to me, at least), that much of the variation
    >> seen can be understood by descent with modification rather than by
    >> "spontaneous insertion".
    Peter:
    >This is the reason why I concentrate on folds (i.e. sequences without
    >any recognizable homology), rather than families. What do you mean by
    >"spontaneous insertion"?

    E.g.: Close encounters of the third kind; Or God did it. Or skinheads
    from Mars.

    Peter:
    >>> Their minimally active form must have arisen by truly random-walk
    >>> mutagenesis. Of which type of information - step-by-step selected or
    >>> random-walk generated - is there more in the biosphere? I think we don't
    >>> know. But what I am getting at is the challenge of the random-walk type.
    >>> Even if this concerns only a few percent of all existing genes, it poses
    >>> a big problem, as darwinian evolution cannot be invoked. Don't you think
    >>> so?
    Me:
    >> This is certainly an interesting problem. It's also related to what
    >> "minimal" activity is, which is a relative question. Is the fitness
    >> topology absolutely flat between peaks in most areas or not? This
    >> is very difficult to determine. No studies of possible mutational
    >> variation can be exhaustive and we're kidding ourselves if going through
    >> a few thousand or a few million variations of an existing sequence
    >> will tell us what we need to know about the possible activity of the
    >> ur-protein, especially if we may not be certain of the original
    >> function and context of the original sequence.
    Peter:
    >I agree, and I never intended to approach the problem in this way. I
    >define "minimal" activity by an absolutely flat fitness topology some
    >distance away from where the fitness starts to go up (cf. the above
    >definition of "novelty"). Of course we don't know any ur-protein
    >sequence. At best we might approach a last-common-ancestor sequence.
    >But the minimal protein for a given function presumably is still much
    >simpler.

    Or very 'complex'. It depends on where the function initially arose.
    The function may be less efficient but it could reside on a huge
    protein which is doing something else in the cell.

    >It is much more hypothetical, too. But we may estimate the
    >probability of emergence for a given number of specific amino acid
    >occupations. According to my model estimate, this number cannot be
    >higher than 2 (my post of 22 Sep 2000). We could then compare this
    >number with the known invariances found in protein families, and
    >possibly folds, certainly much higher than 2.

    That is a post-hoc determination of a probability for a particular
    model of a pathway. It's not a bad first approximation considering
    the uncertainty of the route. But how can we determine whether
    your pathway is correct and reflects the initial conditions and
    steps along the way?
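
    For a sense of scale only: if each of n positions must hold one
    particular residue, and residues are assumed to be uniform and
    independent over the 20 standard amino acids, the chance is (1/20)^n.
    The sketch below just evaluates that. It is a naive baseline, not
    Peter's model from the 22 Sep post (whose details are not given here),
    and the function name is made up for the example.

```python
from fractions import Fraction

AMINO_ACIDS = 20  # size of the standard amino acid alphabet


def p_specific_occupations(n_sites):
    """Chance that n_sites given positions each hold one specified residue.

    Assumes uniform, independent residues at every position, which is a
    strong simplification: it ignores codon usage, pathway, selection and
    initial conditions.
    """
    return Fraction(1, AMINO_ACIDS) ** n_sites


for n in (1, 2, 5, 10):
    print(n, float(p_specific_occupations(n)))
```

    The number moves by a factor of 20 per site, so the answer is driven
    almost entirely by how many sites one assumes are required and by
    whether the starting sequence already satisfies some of them.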

    David:
    >>>> The example of a pseudogene reactivated, discussed in other posts,
    >>>> would be a case of passing through unselected "random" intermediates
    >>>> before arriving at a useful function.
    Peter:
    >>> Yes, and this is exactly one of the interesting cases. Do you know of
    >>> any case where such a path via unselected intermediates has been
    >>> documented in a real biological system, not just stated as a general
    >>> hypothesis? I am eager to find such cases!
    Me:
    >> Hmm... I recall a series of papers by Daniel Dykhuizen (and Dan Hartl) in
    >> the late-'80s & early '90s about natural variation in genes of the lactose
    >> operon in bacteria (I think E. coli but possibly S. typhimurium). Using
    >> metabolic modelling and competition experiments in chemostats they showed
    >> that although natural variations in the lac permease and b-galactosidase
    >> sequences were often effectively neutral with respect to growth on lactose,
    >> there were real differences to be found during growth on other carbohydrates.
    >> In modelling the pathway, these relationships correlated with the measured
    >> activities of the enzyme variants. So, in the absence of these alternate
    >> carbon sources (some of which you wouldn't expect the bacterium to see
    >> often), the lac system could tolerate variation with little effect on the
    >> net metabolic flux. Thus under "normal" conditions some intermediates
    >> appeared to have escaped selection... at least until conditions changed
    >> (different growth environments) when suddenly those variants which
    >> arose from previously unselected intermediates became fixed via selection.
    Peter:
    >I haven't searched for this paper.

    Dykhuizen also has a chapter in Methods in Enzymology. I think the
    title was "Evolution in chemostats".

    > But I happen to have a copy of
    >B.G.Hall & H.S.Malik, "Determining the evolutionary potential of a
    >gene", Mol.Biol.Evol. 15 (1998), 1055. They analyzed a cryptic E.coli
    >beta-galactosidase ebgA ("evolved b-galactosidase"). In the absence of
    >the normal (paralogous) lacZ b-galactosidase, ebgA can be used, and
    >after 2 specific mutations it works as efficiently as lacZ. Why does it
    >exist? 25 years earlier, ebgA had been thought to represent a newly
    >evolved function. Now, a phylogenetic tree of 14 b-galactosidases
    >indicates that the separation between ebgA and lacZ must have occurred
    >more than 2.2 billion years ago. Apparently, an occasional use for ebgA
    >ensured its persistence during this time. The same may be true of
    >Dykhuizen & Hartl's cryptic enzymes. Such cases, therefore, don't
    >provide clear evidence for evolution of a new function by a random
    >mutational walk.

    Dykhuizen wasn't working with cryptic sequences. These are fully
    expressed. The point I wanted to make was that even with proteins which
    are subject to selection and can even be considered to be optimized for
    a particular condition, there can still be (and often are), variations
    carried which are neutral under those conditions but not in others.
    Basically I'd like to point out the importance of the local environment
    in determining the fitness landscape at any particular moment. For
    growth on lactose, the utilization pathway was optimized and relatively
    insensitive to the variations in proteins examined (a flat landscape
    with no significant differences in activity). However, for growth on
    alternate carbohydrates, strains carrying the same variations were no
    longer on a flat landscape: The combinations could be differentiated.
    Under these alternate conditions, a new trajectory is found.

    Me:
    >> There is another interesting case related to the "most elementary cases
    >> of Behe's 'irreducibly complex systems'". This is a little off the main
    >> topic of protein origins, but I think an elementary case can be found in
    >> the evolution of streptomycin resistance. It's been known that some
    >> mutations which give rise to streptomycin resistance can reduce the growth
    >> rates of the bacteria relative to "wild-type" strains on media without
    >> the antibiotic. So it was thought that if the selective pressure of
    >> streptomycin resistance was removed the resistant strains would eventually
    >> become less common in the environment. But studies showed that these
    >> resistant strains persisted, even though they had not encountered
    >> streptomycin in a long while. It turned out that these strains had
    >> acquired a second mutation which suppressed the problems of carrying
    >> streptomycin resistance. When either of these mutations were carried in
    >> separate strains (strains with either the streptomycin resistance gene
    >> or the suppressor gene), growth was slower, compared to strains without
    >> both mutant genes. When both were present the strains grew as well as
    >> those lacking both genes. This represents an "elementary" IC system:
    >> a strain lacking one of the two mutations could not compete against the
    >> wild-type in normal growth -- both mutations were necessary. Interestingly,
    >> this system arose in much the way that one would expect an IC system
    >> to evolve: indirectly, through steps of selection under conditions that
    >> were not the same as where the system finally emerged.
    Peter:
    >If I remember correctly, streptomycin resistance occurs by a ribosomal
    >mutation.

    Yes, it often occurs there but can arise elsewhere. In this case, I'm
    fairly sure it was a ribosomal mutation.

    >It is to be expected that, in the absence of the antibiotic,
    >the mutant would be worse off than the wild-type; and it would not be
    >surprising if in the presence of the antibiotic, the mutant would be
    >under some selective pressure to get another mutation elsewhere that
    >would mitigate the damage done by the first one, without eliminating the
    >protection from streptomycin.

    Yes, exactly. In the presence of the antibiotic the resistant bacteria
    still have each other to compete against. If a suppressor can arise to
    restore faster growth, then that mutation will come to predominate in the
    culture. Many cases of this sort of adaptability were documented
    a long time ago in the microbial fields.

    >If this should turn out to be the case, it would not constitute an IC
    >system, as each mutation can be selected by itself and the intermediate
    >is viable.

    ICness, as I understand it from reading Behe's book, has nothing
    directly to say about the route by which a system arose but how
    the system responds to the removal of one of its components. It's from
    this criterion that Behe tries to make a general case about the
    evolvability of such systems. If we take strains carrying the two
    mutations and restore either one of the mutant genes with the
    original, wild-type form, then the resultant strains cannot compete
    against the wild-type strain carrying no mutations. The doubly-mutant
    system is IC in the sense that the removal or alteration of a single
    component leads to a loss of functionality.

    You are correct that the intermediate steps can be viable -- But
    that depends on the pathway. In the absence of streptomycin, the
    chances of getting those two complementary mutations in the
    same bacterium at the same time are very small. That's because
    a step-by-step random walk under those conditions is more likely
    to require steps with drops in relative fitness. However, under
    different conditions, such as prolonged growth in the presence
    of streptomycin, it's almost a certain outcome. Thus the
    probability of an evolutionary route is not simply a function
    of the starting and final sequence but also of the local conditions
    and the history of the system. These must be taken into
    consideration in any calculation. The only trouble is, how do we
    get that information?
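
    A minimal sketch of this pathway dependence, assuming made-up fitness
    values that only respect the ordering described above (each single
    mutant grows slower than wild-type without the antibiotic; the double
    mutant grows as well as wild-type). The labels "strR" and "sup", the
    numbers and the function name are hypothetical.

```python
from itertools import permutations

# Genotypes are frozensets of the mutations carried.
WT = frozenset()
R = frozenset({"strR"})          # resistance mutation only
S = frozenset({"sup"})           # suppressor mutation only
RS = frozenset({"strR", "sup"})  # both mutations

# Invented relative fitness values, one table per growth condition.
FITNESS = {
    "no streptomycin": {WT: 1.0, R: 0.9, S: 0.9, RS: 1.0},
    "streptomycin": {WT: 0.0, R: 0.6, S: 0.0, RS: 1.0},
}


def selectable_paths(environment):
    """Return the orders of acquiring both mutations in which every
    single step strictly raises fitness in the given environment."""
    fitness = FITNESS[environment]
    paths = []
    for order in permutations(["strR", "sup"]):
        genotype = frozenset()
        uphill = True
        for mutation in order:
            next_genotype = genotype | {mutation}
            if fitness[next_genotype] <= fitness[genotype]:
                uphill = False
                break
            genotype = next_genotype
        if uphill:
            paths.append(order)
    return paths


for environment in FITNESS:
    print(environment, selectable_paths(environment))
```

    With these numbers there is no fully selectable route to the double
    mutant without streptomycin, and exactly one (resistance first, then
    the suppressor) with it. The start and end genotypes are identical in
    both cases; only the conditions differ.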

    Regards,
    Tim Ikeda
    tikeda@sprintmail.com


