Re: pure chance

billgr@cco.caltech.edu
Fri, 10 Jan 1997 13:26:30 -0800 (PST)

Brian Harper:

[...]

> I'm starting to wonder if I should drop out of discussions
> of information theory altogether. Every time I say
> something it seems like I have to take it back a few
> days later :). I looked up overlapping genes in Yockey's
> book and found a lot of interesting stuff. But reading
> through this interesting stuff I found that I had made
> a mistake in my last post. I'm going to have to
> think about it some more but it seems I'll probably
> have to retract what I said previously, at least kind of.
> Of course, this is probably going to get Greg really mad
> at me :).
>
> Greg has been asking about the usefulness of info-theory.
> It looks like overlapping genes may be an example where
> information theory is useful. Before getting into that
> let me go over some of my musings regarding my last
> post. To refresh our memories we had the situation where
> several codons specify the same amino acid regardless of
> the third position. I had taken this as an example of
> the type of thing Greg was talking about since a mutation
> at one of these positions results in no change in the
> information content of the protein. Glenn then asked whether
> the information content in the DNA might change even
> though the info-content of the protein remained the same.
> The answer to this, after my further reading is apparently
> yes. So good job to Glenn for asking the right question.

I'm not mad yet. :-) I wasn't talking about this in particular,
but multiple use *is* an example of the sort of long-range order
I was talking about.

> I got myself muddled by getting off into sources and receivers,
> so let's go back over that business a little. One thing that
> I hope is obvious is that one cannot receive more information
> than is sent. One can however send more information than
> is received. I was aware of this previously but wasn't
> aware of how crucial the point is for the genetic info
> system. I was thinking that the main reason for information
> loss was noise, various types of mistakes in encoding and
> decoding. It turns out that a lot of information will also
> be lost if the source has a larger alphabet than the
> receiving alphabet (which is the case for 61 codons in
> DNA and 20 amino acids). This turns out to be crucial since
> it is the information loss which guarantees that information
> is transferred in only one direction, from DNA --> protein
> but never protein --> DNA. We might then consider this a
> useful contribution of information theory in that this
> important principle of the genetic information system is
> a direct consequence of the mathematics of coding theory.
> Information transfer will always be unidirectional when
> the entropy of the source exceeds the entropy of the
> receiver.

Hmmm. This seems like an interesting line of inquiry.
The numbers you gave below correspond to a loss of about 1.6
bits for each codon->amino acid (de)coding operation. I'm
not sure I understand what you meant in the last part here. Are
you saying that this 1.6 bits is like a 'threshold' that makes
information incapable of moving in reverse (from proteins to DNA)?
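
Just to show where I'm getting that number--a quick check in
Python, using the same equiprobable-alphabet assumption as your
figures below:

  from math import log2

  H_codon = log2(61)          # max entropy per codon, ~5.93 bits
  H_amino = log2(20)          # max entropy per amino acid, ~4.32 bits
  print(H_codon - H_amino)    # ~1.61 bits lost per codon -> amino acid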

If I were making a guess (and that's what it would have to be), I
would say the reason for unidirectionality is the protein folding
problem--proteins are made linearly and then fold up; the folded
state is stable, so they can't unfold; and there aren't any enzymes
that do reverse coding from the tertiary structure (I'm not sure
whether any exist for the primary structure...). Since all, or most,
of the DNA-relevant information is hidden inside the scrunched-up
protein, the DNA information that went in can never be gotten out. How is
this guess related to the information direction? Can it be rephrased
in information-theoretic terms (less information in the surface topography
of a protein than in its internal structure, or something?), or does
info theory help us here?

[...]

> to the large difference in information content between
> the source (with 61 "letters") and the receiver (with
> 20 "letters"). A common mistake at this point is to
> forget that unequal frequencies of the occurrence of
> the letters decreases the information content. Nevertheless
> we can still get a rough comparison between DNA and
> protein info content by taking the maximum entropy
> which always occurs when all letters appear at the
> same frequency. For this case, the entropy of a sequence
> of N characters is:
>
> S = -N*sum|i=1..P|[(1/P)*Log2(1/P)] = N*Log2(P)
>
> where P is the number of characters in the alphabet (61 for
> DNA and 20 for protein). Thus, the max entropy is 5.93 bits
> per symbol (5.93*N total) for DNA and 4.32 bits per symbol
> (4.32*N total) for protein.
>
> Now for my retraction. Given that the authors were discussing
> the entropy at the source, I think their conclusions and math
> are correct. The comments I made were, I think, correct also,
> except I was (unknowingly) discussing a different entropy,
> the so-called mutual entropy, which is a measure
> of the information being passed through a communication
> system. The mutual entropy is defined as:
>
> I(A;B) = H(x) - H(x|y)
>
> Where I is the mutual entropy. A represents the source
> alphabet with an ensemble of messages x. A similar
> interpretation is given to B and y. H(x) is the entropy
> of the source (what the authors of the study were analyzing).
> H(x|y) is the conditional entropy, the average uncertainty about
> which message xi was sent given that message yj was received.
> Basically, H(x|y) is a
> measure of the information lost during the transmission
> of the message. H(x|y) can be further split into two
> components, one representing the information lost due
> to noise and the other the information lost due to the
> source having a larger alphabet than the receiver.
>
> Sorry about the confusion. I personally have learned a
> great deal from this exercise. Let's continue on
> with these ideas to see some further useful insights
> from info-theory.

Uh, oh, now I'm partially mad. :-) :-) I thought I had
persuaded you that the assumption of equal probabilities
for all the codons was a poor estimate of the ensemble (or
at least *could* be a poor estimate, and so needed some
justification). Is this what you are going back on?

The issue of redundant coding letters in the DNA is relevant
to the thing we were talking about above, but I never
meant this when I talked about that paper before--I was
just talking about the 'source entropy' (as you called it).

I think your discussion above is fine in that you identify
which ensemble you are talking about--the ensemble of codons
considered independently. I'm just not sure how realistic it
is to apply this to the genome, that's all.
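
Under that independent, equiprobable-codon ensemble, here is
roughly how the pieces of I(A;B) = H(x) - H(x|y) come out for the
codon -> amino acid map. This is just a little Python sketch of my
own (not from Yockey); the degeneracy counts are those of the
standard genetic code:

  from math import log2

  # Degeneracy of the standard code: {codons per amino acid: number
  # of amino acids}.  Met and Trp have 1 codon, Ile has 3, etc.
  degeneracy = {1: 2, 2: 9, 3: 1, 4: 5, 6: 3}
  assert sum(n * k for n, k in degeneracy.items()) == 61   # sense codons

  # Assumption in question: every codon equally likely.
  H_x = log2(61)                     # source entropy, ~5.93 bits/codon

  # The codon -> amino acid map is deterministic, so the only loss is
  # H(x|y); with uniform codons, p(amino acid) = n/61.
  H_y = -sum(k * (n / 61) * log2(n / 61) for n, k in degeneracy.items())
  H_x_given_y = H_x - H_y            # information lost to degeneracy
  I_xy = H_x - H_x_given_y           # "mutual entropy"; equals H_y here

  print(H_x, H_y, H_x_given_y, I_xy)   # roughly 5.93, 4.14, 1.79, 4.14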

> First of all, the redundancy in the genetic code that
> results from the source having a larger alphabet than
> the receiver is often referred to in a pejorative sense
> as degeneracy. But we see from the above that this
> is necessary to guarantee a unidirectional transfer of
> information. Yockey also points out that redundancy is
> crucial for error correction.

It is also a kind of ECC (error-correcting code)--that is,
degeneracy like this is exactly what ECCs rely on to decrease
the probability of error. You make the codewords in clusters so
that their Hamming distance is larger than if you used densely
packed codewords. I think it would be interesting to see if the
genetic code were optimal in some sense in this way. It
should be fairly easy to figure out--do you know if anyone
has done so?
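
A crude first pass is easy enough, though: just count how often a
single-base change stays inside a synonymous cluster. Here is a
little Python sketch of my own (the standard codon table, every
substitution weighted equally--a toy count, not anyone's published
analysis):

  bases  = "TCAG"
  codons = [a + b + c for a in bases for b in bases for c in bases]
  aas    = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
  table  = dict(zip(codons, aas))        # standard genetic code, '*' = stop

  sense = [c for c in codons if table[c] != "*"]

  same = total = 0
  for c in sense:
      for pos in range(3):
          for b in bases:
              if b == c[pos]:
                  continue
              mutant = c[:pos] + b + c[pos + 1:]
              if table[mutant] == "*":
                  continue               # ignore hits to stop codons
              total += 1
              if table[mutant] == table[c]:
                  same += 1

  print(same / total)   # fraction of (non-stop) point mutations that are silent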

Yockey:
[...]
> code has this property, which results in the source
> alphabet having a larger entropy than the receiving
> alphabet. No code exists such that messages can be sent
> from the receiver alphabet B to the source in alphabet
> A if alphabet A has a larger entropy. Information
> cannot be transmitted from protein to DNA or mRNA for
> this reason. On the other hand, the information capacity
> of DNA may be used more fully by recording genetic messages
> in two or even three reading frames. Thus the phenomenon
> of the overlapping genes is related mathematically to
> the Central Dogma and the redundance of the genetic code.
> ======================================================

I think I see what Yockey is getting at here, but the difference
in max entropies is not a sufficient reason for the Central
Dogma. One can translate back and forth between two codes even
of very different entropy per symbol, and the process quickly
reaches a limit cycle. For example, suppose my code was that
vowels go to '1' and consonants go to '2'. Translating from
English I get

compressthissentence -> 21222122221221221221

Obviously, I can't recover the original information in the
sentence, but for now I don't care--I'll assume the functional
information is in the 1s and 2s. So now I do a reverse translation, using
'a' for 1 and 'b' for 2: babbbabbbbabbabbabba.

Further codings and decodings will just map between the 'a' and 'b'
string and the '1' and '2' string.
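
Spelled out in code (just the toy maps above, in Python), the limit
cycle looks like this:

  VOWELS = set("aeiou")

  def to_digits(s):             # vowels -> '1', consonants -> '2'
      return "".join("1" if ch in VOWELS else "2" for ch in s)

  def to_letters(s):            # '1' -> 'a', '2' -> 'b'
      return "".join("a" if ch == "1" else "b" for ch in s)

  msg = "compressthissentence"
  d1 = to_digits(msg)           # 21222122221221221221
  l1 = to_letters(d1)           # babbbabbbbabbabbabba
  print(to_digits(l1) == d1)    # True: after one pass we just bounce
                                # between the same two strings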

So the fact that DNA has more information than protein is *evidence*
that the Central Dogma is true; it is not *the reason that* the Central
Dogma is true. This is especially the case in the DNA/protein system,
where the fact that DNA has higher information capacity is, as I
understand it, what drives neutralism. The fact that most mutations
either have no effect at all on the protein, or have an effect that
is irrelevant to the protein's function, means that the DNA can move
towards a condition of maximum entropy within the constraints
of the proteins needed for duplication. This is the basis for all the
'molecular clock' studies. I don't think Yockey's statement here is
correct, at least as far as I understand it. It is true in a sense,
but it is a sense that makes no difference to biology.
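
To illustrate the neutralist picture I have in mind, here is a toy
simulation of my own (not a real population-genetics model): accept
only those point mutations that leave the protein unchanged, and
watch the DNA sequence drift anyway:

  import random

  random.seed(0)
  bases  = "TCAG"
  codons = [a + b + c for a in bases for b in bases for c in bases]
  aas    = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
  table  = dict(zip(codons, aas))            # standard genetic code, '*' = stop

  def translate(dna):
      return "".join(table[dna[i:i + 3]] for i in range(0, len(dna), 3))

  sense = [c for c in codons if table[c] != "*"]
  gene = "".join(random.choice(sense) for _ in range(100))   # toy 100-codon gene
  original, protein = gene, translate(gene)

  accepted = 0
  for _ in range(50000):
      i = random.randrange(len(gene))
      b = random.choice(bases)
      if b == gene[i]:
          continue
      mutant = gene[:i] + b + gene[i + 1:]
      if translate(mutant) == protein:       # keep only silent changes
          gene, accepted = mutant, accepted + 1

  diverged = sum(1 for x, y in zip(original, gene) if x != y)
  print(accepted, diverged, translate(gene) == protein)
  # many accepted substitutions, substantial DNA divergence, same protein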

-Greg