Random origin of biological information

From: pruest@pop.dplanet.ch
Date: Thu Oct 19 2000 - 11:11:24 EDT

    Wayne wrote:

    >
    > Date: Sat, 7 Oct 2000 04:18:41 EDT
    > From: Dawsonzhu@aol.com
    > Subject: Re: Random origin of biological information
    >
    > Peter (pruest@dplanet.ch) wrote:
    >
    > <<
    > As a basis for discussion, I repeat the definition of the 5 different
    > cases:
    > > > (a) search for a meaningful letter sequence among random ones,
    > > > (b) artificial selection of a functional ribozyme from a collection of
    > > > random RNA sequences,
    > > > (c) evolution of a functional ribozyme in RNA world organisms,
    > > > (d) evolution of a protein by mutation of the DNA and natural selection
    > > > of the protein,
    > > > (e) a random DNA mutational walk finding a minimally active protein.
    >
    > The problem we keep running into is that you assume that (a) and (b) are
    > representative for (d) and (e), which I contest. I group the points
    > discussed under different headings, A **** etc.:
    > >>
    >
    > Another major problem in these discussions is that we all come
    > at this with different backgrounds with their own languages.
    > Consequently, whenever these discussions surface, "words" are
    > exchanged back and forth but "meaning" is usually the victim
    > and almost no further understanding is reached.
    >
    > If I can summarize what you are saying here.... I think what you
    > *mean* is that the biological systems are "complex". Typically, a
    > researcher will focus on RNA or DNA or proteins (and only a
    > specific subset of even that!), so a general knowledge of the
    > totality of this issue may very likely be a human impossibility
    > given the limits of our life times. Moreover, even a subset of
    > that knowledge (say RNA --- my subject area) is not achievable,
    > and the same certainly goes for DNA or protein research.
    > Finally, experimental work is always a reduction to a small
    > subset under highly controlled environmental conditions which
    > may or may not properly reflect what is really going on in an
    > open dynamic system.
    >
    > Hence, one thing I see you saying here is that because
    > of our ignorance, it is pure hubris to claim that we have it all
    > figured out. Is this something of your point?

    No, this is not what I'm getting at, although I agree about the
    complexity. The question I'm dealing with is the origin of biological
    information by means of "evolution", and under this heading I insist
    that the five cases (a) to (e) mentioned are not comparable processes,
    although many people believe they are. In particular, if we want to find
    out what happened on earth 3.8 billion years ago (and what is happening
    in the biosphere up to now), we must be very careful with drawing
    conclusions from cases (a), letter sequences, and (b), artificial
    ribozymes, which are only analogs of evolution. In this discussion, I am
    not focussing on the origin of life (including a possible RNA world,
    (c)), either, because we hardly know anything about it, and the problem
    is much too complex to venture into estimates of probability at this
    time.

    This leaves us with the two cases of (d), genuine Darwinian evolution
    with natural selection after each single mutation, and (e), the
    emergence of a genuinely novel functionality by a random-walk path of
    several mutations without intermediate selection (which can start only
    when the new function is available to at least a minimal degree). Both
    (d) and (e) presuppose the existence of an organism having a nucleic
    acid genome and capable of translating it into protein. As it is quite
    unlikely that the first living organism ("protogenote" or the like)
    already was in possession of all biological functions occurring in
    today's biosphere, both (e) and (d) must have been going on for 3.8
    billion years now. It is only with case (e) that we have, at present,
    any chance of being able to estimate any realistic probabilities of
    spontaneous occurrence.
     
    > Indeed, one of the dangers of unbridled arrogance is the havoc
    > that can result. The Titanic is a nice example
    > of late 19th/early 20th century hubris. If we make the same
    > blind folly in regards to biology, we're likely to sink more
    > than a ship in the wake of our haughty blasphemy. Scientific
    > discovery should make us humble and hard lessons are the
    > rewards that befall the proud; but the "rain" will likely fall
    > on both the evil and the good.
    >
    > I can agree that the system is complex when considered as a
    > whole, but I do not commit myself to any position on this matter.
    >
    > I hope that your faith is not built on the impotence of
    > our current knowledge to achieve a more representative system.

    Of course not! In fact, my two PSCF articles in 44 (1992), 80, and 51
    (1999), 231, indicate that I believe that (1) God "hides his footsteps"
    in creation, in order to protect our personal freedom of deciding for or
    against him; (2) he is just as active in all "natural" events as in any
    "supernatural" ones; (3) he used evolution as one of his means of
    creating the biosphere. Nevertheless, I believe there are plenty of
    indications in the creation pointing to God's authorship (Romans 1:20).
    I don't expect to prove any "gaps", but to find plenty of fine-tuning
    reflected in extremely low probabilities of spontaneous emergence of
    certain circumstances and functions - the anthropic principle in
    cosmology, geology, and biology.

    > I will not *exclude* the possibility that God has left an "Intel
    > inside" written somewhere for us to find, but a survey of
    > 400 years of science and the scientific method strongly speaks
    > against proposing that God's ways are Intel's ways. The nuts and
    > bolts of God's handiwork may be entirely indistinguishable from
    > the creation. Indeed, a masterpiece of engineering is like that
    > isn't it? We may know the maker, but no name even need be
    > mentioned it is so great. And God, who even suffered his son
    > to die on the cross can do far greater works than that.
    > So I think it is also important to remember that relying
    > on ignorance as proof of the handiwork of God is a somewhat
    > precarious position to stand on.

    I fully agree.
     
    > I can accept your argument of our blithering
    > ignorance, but I'm not sure that this is a good reason to argue
    > for the existence of God. Your point is more sophisticated than
    > the run of the mill "god of the Gaps strategy" (caps purposefully
    > inverted --- WKD), but I strongly encourage you to keep aiming for
    > the "God" rather than getting overly trapped in the "Gaps".
    >
    > <<
    > A **** Is it necessary to distinguish (a) and (b) from (d) and (e)?
    >
    > > I raised that only as a response to your contention
    > > that proteins wouldn't behave as does an RNA. I think
    > > the evidence says that they do.
    >
    > They don't: a nucleotide is worth 2 bits, an amino acid about 4.3 bits
    > which can only be selected as a whole. This may not amount to much
    > difference if each mutational step is selected individually, but
    > whenever you have intermediates without functional improvement, the
    > probability factors are multiplied at each step. RNA can be made by
    > "organisms" consisting of 1 RNA molecule each, in a soup containing RNA
    > polymerase and 4 nucleotide triphosphates, whereas a selection system
    > doing translation of DNA (on which mutation works) across RNA into
    > protein (on which selection works) requires a bacterium. You may
    > mutagenize RNA at rates of 10^(-4), perhaps also at 10^(-3) per
    > nucleotide and generation, but a bacterium will hardly survive such
    > treatments (the usual, i.e. naturally optimized, mutation rate is
    > 10^(-8)). This rate is also multiplied in each time a step leads to an
    > unselected intermediate.
    > >>
    >
    > OK, I confess, I work with RNA. However, prions (mad cow disease
    > or "transmissible spongiform encephalopathies" (TSE))
    > are an example of a self-replicating protein.

    A prion is not a self-replicating protein, but a special, aggregating
    conformer of a normal cellular protein. Now, molecules of the prion
    conformation catalyze the conversion of molecules of the normal cellular
    conformation into the prion conformation. It is an autocatalytic process
    which confers an epigenetic modification on an organism (heritable in
    the cell line, e.g. [PSI+] in yeast). It has nothing to do with genome
    replication, and (IMHO) nothing with evolution (despite the contrary
    claim of H.L.True & S.L.Lindquist, Nature 407 (2000), 477), only with
    adaptability, and possibly microevolution - although this is fully
    speculative to date.

    > There is debate
    > about whether TSE is really caused by prions, for example
    >
    > http://www.pbs.org/wgbh/nova/madcow/prions.html
    >
    > and the mechanism of replication is not at all clear....
    >
    > (for a general discussion, this might be a good
    > place to start
    >
    > http://www-micro.msb.le.ac.uk/335/Prions.html
    >
    > A list of sources:
    > http://www.cyber-dyne.com/~tom/quick_links.html)
    >
    > In any case, with the long list of qualifiers now
    > introduced, this does indicate that the analogy
    > may quite likely be there.
    >
    > It is far less clear how the complex RNA/DNA/protein
    > system developed or evolved, but studying the separate
    > parts in isolation is a start. It may not be so easy
    > to explain how DNA and RNA developed (or codeveloped),
    > but without any other information to work with, it would
    > be reasonable to assume that they began separately and
    > eventually integrated. It might be a cautionary note
    > to realize that this might not be the case, but presently
    > we don't know.

    As I said above, in this context I'm not talking about the origin of
    life.

    > As to information content, I really think that we have
    > to get away from individual bases defining information.
    > That *may* not be correct. Consider how Chinese characters
    > are recorded in the computer now. Each Chinese character
    > carries a specific meaning. The context of the character
    > within a sentence defines whether it functions as a noun,
    > an adjective, a verb, and so forth. An example would be
    > "go" (or move) (Chinese: "dong"; Japanese: "dou"
    > or "ugoku", depending on the context). In both cases (be it
    > C: "dong" or J: "dou") each character is saved as two
    > bytes rather than one, and the way they are stored is quite
    > different and mutually incompatible. (Conventional alphabets
    > can be saved as one byte per character.) However, the fact
    > that more than 10000 words can even be stored
    > as two bytes means that many "words" can be "compressed"
    > into much smaller units if we are only concerned with
    > translating the "meaning". These can later be expanded
    > by appropriate decompression methods --- an analogy
    > similar to the transcription of mRNA into a protein.

    I don't know any Chinese or Japanese, but I understand what you are
    saying (by the way, 2 bytes can distinguish 2^16 = 65536 words, not only
    10000, and without any compression).

    I have never thought information was countable by individual bases. At
    least since my talk on "The unbelievable belief that almost any DNA
    sequence will specify life" at the 1988 Tacoma, WA, conference on
    "Sources of Information Content in DNA", I distinguish between the
    "combinatorial information potential" of a DNA of length L, given by 4^L
    different sequence possibilities, and the much smaller "semantic
    information content", which is usually unknown. An attempt to estimate the
    size of the latter value was made by H.P.Yockey (J. Theoret. Biol. 67 (1977),
    377; Information Theory and Molecular Biology, Cambridge Univ. Press,
    1992, p.254) for the small protein cytochrome c. However, this estimates
    the amount of information needed to specify the sequence of any one of
    the 2 x 10^93 different sequences putatively active as
    (iso-1)-cytochromes c, assuming they are all equivalent. Yockey does not
    count all DNA bases, but only those protein sequence features which are
    invariant among all known cytochromes c (invariant amino acids or
    invariant functional amino acid groups). Such procedures measure
    "meaning" (or "instructions") in a biochemical context, taking into
    account redundancies through synonymy or other compressible features.

    Now, all this concerns modern, highly active proteins evolved over a
    long time by natural selection. Therefore, we cannot estimate the
    probability of their emergence by random mutational walks, as only
    mutations were random, but not selection - process (d) above. The first
    sequence of the biosphere minimally active (i.e. selectable) as a
    cytochrome c presumably was much simpler and more easily reachable by a
    random mutational walk - process (e) above, but by how much? We have no
    idea.

    > By analogy then, if protein folds are the "nouns", and
    > linkages form "verbs", a simple instruction can be built
    > from a comparatively short collection of bytes. So I would
    > prefer to see the argument really look at the more
    > difficult issue of what *instructions* (aka information)
    > are actually *in* a completely folded structure. For

    These considerations have been taken into account, by Yockey and by
    myself.

    > example, with RNA, the sequence
    >
    > AAAAAAAAAAAAAAA CCCC UUUUUUUUUUUUUUU
    >
    > forms a beautiful fold of A-RNA, but I think you would agree
    > that very little information is in the sequence because it is
    > a repetitive sequence of letters with a simple algorithm to
    > generate it. However, I could reconstruct this with
    >
    > AGUAACGAGCAUUAG GAAA UUAAUGCUCGUUACU
    >
    > you will still get a loop, but the sequence information is
    > much greater than the fold information that this sequence will
    > generate. This is why I am beginning to find it objectionable
    > when I see these "sequence" arguments.

    A biological function is given by a molecular structure, which in turn
    is given by a sequence of the molecule (in a given cellular
    environment). The two hairpin RNAs you give contain the same structural
    information (presumably even with the unorthodox GU pair). However, even
    here, the second sequence may have some additional properties in a given
    cellular environment: it may be a substrate for an RNA restriction
    enzyme, or something else. But of course, we cannot simply read the
    "meaning" of a sequence, or estimate its semantic information content,
    given this single sequence alone.

    > I confess that I am not a diehard bioinformatic spokesman and
    > maybe I will be found completely wrong, but I think
    > that finding structure and function exclusively from a
    > sequence without profuse appeals
    > to structural physics and chemistry is a truly mad way to do
    > business. In one sense, the complexity grows and the wonder
    > becomes more awesome as this structure information comes in,
    > but in another way, these subtle arguments about information
    > in the sequence become increasingly meaningless.

    That's exactly why I want to concentrate on as simple a question as
    possible, such as the one I posed to Glenn Morton on 22 Sep 2000
    13:51:34 +0200 (asa-digest V1 #1804, Date: 23 Sep 2000 09:20:01 -0000):
    how many specific amino acid substitutions can we reasonably expect in a
    random mutational walk before selection for a newly emerging "meaning"
    (or function or enzyme) sets in? However, up to now, no one on this list
    has tried to deal with it.
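
    To make the question concrete, here is a minimal sketch of the naive
    arithmetic behind it. It rests on illustrative assumptions only: each
    required substitution is treated as an independent event with the
    per-nucleotide, per-generation rate of 10^(-8) quoted above, and no
    intermediate is selected, so the per-step probabilities simply
    multiply. The function name is hypothetical, and the model ignores
    population size, number of generations, and alternative paths.

```python
import math

def log10_unselected_walk_probability(steps, rate_per_step=1e-8):
    """log10 of the chance that `steps` specific, independent changes all occur.

    Assumption (illustration only): every step has the same probability
    `rate_per_step` and there is no selection on intermediates, so the
    per-step probabilities are multiplied.
    """
    return steps * math.log10(rate_per_step)

if __name__ == "__main__":
    for steps in (1, 2, 3, 5):
        exponent = log10_unselected_walk_probability(steps)
        print(f"{steps} unselected step(s): about 10^{exponent:.0f}")
```

    The point of the sketch is only the scaling: every additional
    unselected step costs another factor of the per-step rate.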

    > Glenn's comment:.....
    > > I think eventually we will find the same thing in proteins,
    > > and we have found it in RNAs. The solution that life uses,
    > > which seems so limiting, is merely the solution that life
    > > chose early in its evolution.
    >
    > This is a reasonable position, however, don't forget that we
    > only have one example right now. If (or when) we find life
    > on other planets, we will be in a better position to say
    > whether this is true or not. Moreover, we don't know how the
    > "dirt" really works yet. Selection is always done with "dirt"
    > that has the function already. I think it is too early to
    > commit to the idea that *any* way is ok. Fitness landscapes
    > may suggest this point, but they don't handle the real
    > complexity either, so to what extent the complexity could
    > "fine tune" these matters is still unknown.
    >
    > Peter responded:
    > <<
    > That different sequences of the same protein family (having recognizable
    > sequence similarities) often have the same function (but in different
    > organisms or environments!) is clear. The experimental evidence for
    > different folds having the same function, however, is very meager if
    > they occur at all (I don't know of any example, although it might be
    > feasible occasionally).
    > >>
    >
    > I'm not sure I understand you here.
    > Protein folds are usually divided into families
    > that correspond to similar functions.

    Proteins are grouped into families of sequences of clear homology
    (sequence similarity), similar function, and virtually identical
    tertiary structure. Some protein families may be grouped into
    superfamilies if they have detectable homology, but somewhat or clearly
    different functions, yet still very similar tertiary structure. Protein
    families and superfamilies are grouped into the same fold if they have a
    similar tertiary structure, even if they lack detectable homology or
    similarity of function. One estimates that there might be about 1000
    different folds in the biosphere. (One semantic problem here is that
    "fold" designates (1) a particular folding of a part of a given protein
    sequence, or (2) the native tertiary structure of the complete sequence
    of a specific protein, or (3) a supergroup in the classification of all
    proteins.)

    > If you consider the cytochrome P450 family as an oxidant,
    > then there are no "different functions", but if you consider
    > how many cytochrome P450 proteins there are, and the fact that
    > they serve to detoxify various natural toxins that plants
    > produce to protect themselves from predators (aka humans),
    > and that many such proteins are involved in the metabolizing
    > of drugs (which is a major issue related to dosage), then
    > there are a huge number of "different functions" that
    > the "same folds" can do. Of course, there will be limits.

    So, I assume the cytochromes P450 constitute a family (or perhaps
    superfamily) as defined above, their primary function being oxidation,
    but having many different variants for different substrates (and
    different species), yet all of them belonging to the same fold.

    > Furthermore, the sentence that makes the folds is like
    > instructions in Japanese, or Russian, or English, or some African
    > language. The "words" as such are the sequence. The "enzyme"
    > (collection of folds) is the meaning.
    > Don't confuse the two. Instructions given in
    > an African language would go right past me, but I could understand
    > the same instructions in English. They could be exactly the same
    > instructions, corresponding to the "fold" --- the meaning that
    > is to be understood. So "sound" (or words) and "meaning" are not
    > the same thing. Some of the sentences in a language are easily
    > mutated into other sentences, some are not. There will also be
    > limits on the types of sentences that can be created, but that
    > would suggest that they can serve "different functions".

    Perhaps we can compare the different languages to different taxonomic
    groups which may use different representatives of the same protein
    family (or superfamily) for the same or similar purposes, although they
    are exchangeable between very closely related taxa only. I am not sure,
    though, whether the dissimilarities between different language groups
    are not greater than the dissimilarities between the biochemistries of
    different biological phyla, which have at least the same genetic code
    (or virtually so).

    > In fact, you say yourself to Glenn....
    > <<
    > You misunderstand. I said "different folds having the same activity"!
    > A "fold" in this sense is a set of protein families without recognizable
    > sequence similarities between them, but folding into the same tertiary
    > structure.
    > >>
    >
    > I suppose the family of cytochrome P450 does the same "activity", but
    > it is a very loose word. We can define "activity" up to the level
    > of "enzyme" under the current circumstances.
    >
    > [large snip --- sorry, this is just too long, and too cluttered to
    > respond to.]
    >
    > <<
    > B **** What is the frequency of active RNA's in ribozyme selection (b)?
    >
    > Glenn wrote:
    > > The question is how efficient is nature at finding solutions.
    > > The experiments with biopolymers that I have cited clearly show that
    > > functionality occurs at a rate of 10^-13 or so. In the case of one of
    > > Joyce's RNAs the classical probability argument would say that he had
    > > something like a 1 chance in 10^236 of finding a useful sequence. But
    > > Joyce has been showing that he can find functionality in a vat of 10^13
    > > ribozymes. Surely that must cause the anti-evolutionist pause because at
    > > that rate, there are 10^223 or so different sequences that will perform
    > > a given function. I really fail to see how someone can not see the
    > > implication of this except for theological reasons.
    >
    > Peter responded:
    > To which paper are you referring? We would have to look at the details.
    > Exactly the opposite conclusion was drawn in C.Wilson, J.W.Szostak,
    > Nature 374 (1995), 777: "A pool of 5 x 10^14 different random sequence
    > RNAs was generated... On average, any given 28-nucleotide sequence has a
    > 50% probability of being represented... Remarkably, a single sequence
    > accounted for more than 90% of the selected pool... This result
    > indicates that there are relatively few solutions to the problem of
    > binding biotin." The probability of accidentally hitting on a functional
    > combination composed of L nucleotides is 4^(-L), no matter how large N, the
    > length of the randomized sequence, is. Your conclusion that with N=392
    > (10^236 different sequences), finding one active sequence among 10^13
    > (L=22) implies that there are 10^236/10^13 = 10^223 active sequences of
    > length 392 is formally correct but completely irrelevant, as the
    > 392-22=370 other nucleotide positions add nothing at all to the
    > functionality. If L=370, instead, a completely different overall
    > probability results. Your insistence on the 10^13 to 10^14 figure is
    > entirely arbitrary. That this same figure keeps popping up in different
    > experiments may just mean that this amount of RNA is practical to work
    > with.
    > >>
    >
    > I'm not quite sure I see your argument here. One single sequence out
    > of the vat of "random" sequences was selected in 90% of the cases.
    > On that basis you think that there is still only a very
    > small number of possible sequences that will effectively result in
    > selection? Is this the first argument?

    Wilson & Szostak observed that >90% of their yield of active RNA
    molecules had the same sequence. They conclude that there are not many
    different biotin-binding RNA sequences - in contrast to Glenn's claim.
    Otherwise, they should have reaped a large number of different
    sequences, all having quite a high activity (a broad distribution of
    different sequences). It's not my argument, but that of Wilson & Szostak
    who did the experiment (and I agree with them).

    > I don't really follow the rest of your argument. The "392-22=370"
    > means the segment of the ribozyme that is selected is 22, and the
    > rest is unchanged?

    There are 10^13 different RNA sequences of length 22. So, if one finds
    one RNA molecule of a given ribozyme activity among a pool of 10^13
    different RNA sequences, one can conclude that about all these 22 bases
    are required to specify the activity selected for. But if the RNAs in
    the pool are longer than 22 bases, the extra bases don't contribute to
    the activity, their nature is irrelevant (not necessarily the same in
    all molecules, or unchanged during the experiment). Glenn's conclusion
    that in a pool of 10^236 different sequences of length 392 (that is, a
    pool containing all possible sequences of this length - which of course
    is impossible, as there are only 10^80 nucleons in the known universe),
    one would find 10^236 / 10^13 = 10^223 active sequences is entirely
    arbitrary. If the 22 required bases (in the required spatial
    distribution) also happen to be present among the 370 extra bases, this
    doesn't help much: it might at best increase the number of active long
    sequences a few times, if their placement within the long RNA is
    irrelevant (which we don't know).
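
    The orders of magnitude used here can be checked with a few lines of
    Python. This is only a sketch of the arithmetic; the function name is
    mine, and it assumes a 4-letter alphabet with all sequences equally
    likely. It shows that the count of long sequences carrying a fixed
    22-base motif (about 10^223) is just 4^370, i.e. the number of ways of
    filling the 370 irrelevant positions, while the fraction of active
    sequences stays at 4^(-22).

```python
import math

def log10_sequence_count(length, alphabet=4):
    """log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet)

if __name__ == "__main__":
    short = log10_sequence_count(22)     # the 22 required bases
    full = log10_sequence_count(392)     # the whole randomized RNA
    filler = log10_sequence_count(370)   # the 370 positions that do not matter

    print(f"4^22  is about 10^{short:.1f}")
    print(f"4^392 is about 10^{full:.1f}")
    print(f"4^370 is about 10^{filler:.1f}")
    # The active fraction depends only on the 22 required bases:
    print(f"active fraction is about 10^{filler - full:.1f}")
```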

    > <<
    > Even in RNA selection, probabilities depend very much on the
    > length of the RNA sequence selected, WHICH function is being selected,
    > as well as other details. So you cannot generalize. And especially, you
    > cannot draw conclusions regarding natural selection in a DNA-to-protein
    > organism from results of artificial RNA selection.
    > >>
    >
    > Here, your argument seems to be in regards to a length dependence
    > on the segment being selected. That is a good point, this does depend
    > on the unchanged existing functionality of the
    > ribozyme to aid in the selection process.
    > The selectivity measurements would all be consistent because
    > the same active region is being tested and it ignores the rest of the
    > structure. I am willing to agree that there *could* be a length
    > dependence. However, whereas that adds to the complexity of the whole,
    > it still does not say that it can not follow such a route. It leaves
    > the problem in the state of appealing to future results
    > (or null results), as the case may be.

    I am sorry I didn't make it clear what I meant by "length" in this
    context. It is the number of bases required to be a given one of the 4
    possibilities, a "length" of 22 implying 44 bits of information. Of
    course, if in a given position any purine would do, this counts as 1 bit
    only. Or, any Watson-Crick pair between two given positions will amount
    to an average of 1 bit per position, etc. This "length of 22" may be
    distributed over an RNA segment of greater physical length. The point of
    my length argument was the "amount of information" (in bits or half this
    number of bases) which is required to specify the function being
    selected. There may be functions sufficiently simple to be specifiable
    by 10 bits, others may need 100 bits. The yield of different active RNAs
    will be quite different in these two cases.
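
    A small sketch of this bookkeeping follows. The function names are
    hypothetical, and it assumes uniformly random sequences and a pool of
    10^13 molecules, the figure discussed above. A fully specified base
    costs 2 bits, "any purine" costs 1 bit, and a Watson-Crick pair (4
    allowed pairs out of 16) costs 2 bits over two positions, i.e. 1 bit
    per position.

```python
import math

def bits_for_position(allowed, alphabet=4):
    """Bits needed to restrict a site to `allowed` out of `alphabet` options."""
    return math.log2(alphabet / allowed)

def expected_hits(pool_size, required_bits):
    """Expected number of pool members meeting a `required_bits` specification,
    assuming every sequence in the pool is uniformly random."""
    return pool_size * 2.0 ** (-required_bits)

if __name__ == "__main__":
    print("specific base:", bits_for_position(1), "bits")
    print("any purine:", bits_for_position(2), "bit")
    print("Watson-Crick pair:", bits_for_position(4, alphabet=16), "bits per pair")

    pool = 1e13  # pool size taken from the discussion above
    for bits in (10, 44, 100):
        print(f"{bits} bits: about {expected_hits(pool, bits):.3g} expected hits")
```

    Under these assumptions a 10-bit function would turn up billions of
    times in such a pool, a 44-bit one about once, and a 100-bit one
    practically never.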

    > Glenn wrote:
    > > Oxytocin has only 8 amino acids. Several others have that
    > > also. An enzyme does not a priori have to have a long sequence.
    >
    > Peter responds:
    > Oxytocin is a biologically active peptide, not an enzyme. There are lots
    > of small, but biologically active things, down to ions like Ca++. Active
    > peptides usually aren't even translated from an mRNA (I'm not sure about
    > oxytocin), but synthesized by rather large enzyme complexes. Enzymes and
    > other biologically active proteins have sizes of usually a few hundred,
    > and up to a few thousand amino acids. They often are composed of domains
    > with their own tertiary structure, where domains are usually around 100
    > amino acids. As an enzyme has to fold into a more or less fixed steric
    > structure, in order to very specifically hold one or more substrates and
    > catalyze a very specific reaction, it cannot be too short.
    > >>
    >
    > OK, your appeal to "length" and "functionality" is clear.
    > However, I still find it somewhat puzzling that you can firmly
    > (and correctly) appeal to the structure of the protein,
    > yet you still insist on the antiquated ways of viewing the
    > information coded on the sequence. I think you need to consider
    > structuring your arguments around minimum information required to form
    > an enzyme, rather than focus on the maximum possible information
    > that might be contained on a given sequence. I sense that these issues
    > are "orthogonal" in your view. I'm not convinced that they are.

    Here, you formulate the same two types of "information" I defined above
    as the "combinatorial information potential" and the "semantic
    information content". The only relationship between them is that the
    latter cannot exceed the former, but is usually very much smaller. In my
    previous paragraph, I tried to clarify what I mean by the semantic
    information content for an RNA of a given function (requiring 22
    specific occupations in a sequence of unspecified length). If you do an
    analogous estimate for proteins, you get around 30 specific amino acid
    occupations (in a domain of about 100 amino acids or a little bit more)
    for cytochromes c, see Yockey, or for ribonucleases. I don't see why
    this procedure would be antiquated.

    > Anyway, it takes me a long time to read through these things,
    > try to understand the arguments, hopefully check references,
    > and finally comment on it, so I'm not likely to answer any
    > response for a while.
    >
    > By Grace alone do we proceed,
    > Wayne

    Agreed!
    Peter


