Re: pure chance

Brian D. Harper (harper.10@osu.edu)
Fri, 27 Dec 1996 13:04:31 -0500

At 10:30 PM 12/19/96 -0800, Greg wrote:
>Brian Harper:
>
>[...]
>
>> Now I would like to return to another question, whether the
>> information content (as defined by Shannon) increases during
>> evolution or more specifically due to a mutation. Based primarily
>> on what little I know about info-theory and my intuition I had
>> indicated in a "conversation" with Steve Jones that a random
>> mutation would increase the info-content. This was in response
>> to a bold assertion by Steve that information content would
>> never increase due to random mutation. There was also a
>

Greg:==
>I just attended a series of talks about how information theory
>might help figure out how neurons in the brain work. Mixed in
>was a more general discussion of how Info Theory might apply to
>biology. The problem is that the Shannon theorem is extremely
>broad--it is a theorem that given three or four (depending on
>how you look at it) assumptions, the Shannon entropy measure IS
>the correct measure for information. No-one wants to abandon
>any of the assumptions. Trouble is, how does one apply that
>to biology? Do critters try to maximize the information in
>their genome? Doesn't seem obvious--genomes are ordered, and a
>good thing, too!
>

Application of info theory to biology is, apparently, very controversial.
Shortly after I became interested in info theory I made a post to
bionet.info-theory with several newbie type questions. I remember
one question had to do with Hubert Yockey (his reputation etc.) as
my initial interest in the field was due primarily to Yockey's book
and papers in J. Theor. Biol. I found out that Yockey is also very
controversial, but his "credentials" are impeccable. Another question
I asked was about how widely biologists accept applications of info
theory to biology. Here I got a bit of a surprise. Several people
followed up with horror stories of how they were nearly stoned after
presenting their papers at meetings. Apparently there is great
resistance among many biologists toward the idea of applying any
type of mathematics to biology.

I am not an expert in either biology or info theory so I prefer
to just watch the controversy from a distance :). Several on
bionet.info-theory have given what seemed to me very eloquent
and reasoned defenses of the appropriateness of applying
info-theory to biology.

Now, let's go back to your final comment, repeated below:

Do critters try to maximize the information in
their genome? Doesn't seem obvious--genomes are ordered, and a
good thing, too!

Of course critters are not trying to maximize the information in
their genome; the interesting question is whether it happens anyway.
My own view is that the information content tends to increase during
evolution, with possibly some exceptions.
A more extreme view is that there is something like the second
law applying to information wherein it *must* increase. This is
the view of Brooks and Wiley <Evolution as Entropy, Univ. of
Chicago Press, 1988>.

I believe Glenn already pinged you on the word "order"; nevertheless,
I think you were on the right track with your comment even though
you used the wrong word. Better to say organized. As I alluded to
earlier, one of the biggest controversies and difficulties with
complexology is the basic task of deciding on a definition of
"complex". Most "complexologists" have abandoned Shannon entropy
and Kolmogorov complexity as definitions of complexity for the
reason you are suggesting (if I'm understanding you correctly).
The "problem" with both of these measures is that they are maximal
for random strings. Avoidance of this little "problem" has led to
a lot of confusion, with talk of "interesting complexity",
"meaningful complexity", "functional complexity", etc. The problem,
though, is that an objective, intrinsic measure cannot determine
meaning or functionality. I think that when people talk about these
things they are really talking about organization, and there is
some hope that organization can be objectively measured. So I think
Kolmogorov complexity is a great measure of complexity, but one
needs more than this. Biologically "interesting" complexity should
be Kolmogorov complex but also organized. Best to keep both words
and refer to organized complexity.
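
To make the "maximal for random strings" point concrete, here is a
little Python sketch (my own toy example, nothing from the papers)
comparing the per-symbol Shannon entropy of a random string of DNA
letters with that of a completely ordered one. The random string
comes out near the 2 bits/symbol maximum, the ordered one at zero;
of course nobody would call the random string biologically
interesting, which is exactly the difficulty.

----------------------------------------------------------------
import math
import random
from collections import Counter

def shannon_entropy(s):
    """Empirical Shannon entropy of a string, in bits per symbol."""
    counts = Counter(s)
    n = len(s)
    # sum of p * log2(1/p) over the symbols actually observed
    return sum((c / n) * math.log2(n / c) for c in counts.values())

random.seed(0)
random_string = "".join(random.choice("ACGT") for _ in range(10000))
ordered_string = "A" * 10000   # utterly ordered, utterly boring

print(shannon_entropy(random_string))   # ~2.0 bits/symbol (the maximum)
print(shannon_entropy(ordered_string))  # 0.0 bits/symbol
----------------------------------------------------------------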

Well, I've started rambling again. In a nutshell I believe complexity
does tend to increase due to random mutations. Does organized
complexity increase? No, I don't think so, not *just* due to random
mutations anyway. I think there would need to be some other
principle at work, what Casti referred to as the optimistic arrow
of time.

[...]

BH:==
>> J. S. Rao, C.P. Geevan and G.S. Rao (1982). "Significance of the
>> Information Content of DNA in Mutations and Evolution,"
>> <J. Theor. Biology> 96:571-577.
>>
>> Here the authors consider one point mutations and show that the
>> only requirement for the Shannon IC to increase is that the
>> frequency of the codon which mutates must be larger than the
>> frequency of the codon to which it is mutated.
>

Greg:==
>This doesn't seem convincing. Surely there must be other ingredients?
>

Below is the derivation given by the authors. Inspecting the
equations for the entropy, it seems pretty clear that the only
ingredients that can affect the result are the two codon frequencies.

===================
Consider two DNA molecules which differ from each other
by a single base change, as a result of which codon 1 in
the unmutated DNA changes to codon 2 in the mutated DNA.
Then if P1,P2,...P64 represent the codon frequencies in the
unmutated DNA and P1',P2',...P64', the corresponding
frequencies in the mutated DNA, the only changes that
would occur would be in the frequencies of the codons
1 and 2 and these are given by

P1 = N1/N; P1' = (N1-1)/N
P2 = N2/N; P2' = (N2+1)/N

and

Pi = Pi' (i=3,4,...64)

where Ni is the number of codons of type i, and N is
the total number of codons in the coding region of
DNA.

The information content of the unmutated DNA is given
by

H = -P1*logP1-P2*logP2-SUM[3..64]Pi*logPi

Similarly, for the mutant DNA we have,

H' = -P1'*logP1'-P2'*logP2'-SUM[3..64]Pi*logPi

Since the mutation should lead to an increase in
information content, we require,

H' - H > 0,

or, as P1+P2 = P1' + P2' = P,

dH/d(-P1) > 0   (valid for large N)
   <actually a partial derivative; the mutation changes P1 by
    only -1/N, so for large N the sign of H'-H is the sign of
    this derivative --BH>

Since

H = -P1*logP1-(P-P1)*log(P-P1)-SUM[3..64]Pi*logPi

we have

dH/d(-P1) = log(P1/(P-P1)) > 0

Thus

P1>P-P1

or

P1>P2

Thus, we see that, if the information content has to
increase in a mutation, the frequency of the codon which
mutates has to be greater than the frequency of the codon
to which it mutated.
============================
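
Just as a sanity check on my own understanding (this is my toy
calculation, not the authors'), here is a little Python sketch that
moves one codon's worth of frequency from codon 1 to codon 2 and
looks at which way the Shannon entropy of the codon distribution
goes. Strictly speaking, with finite counts the entropy increases
when N1 > N2 + 1 and is unchanged when N1 = N2 + 1; the condition
P1 > P2 quoted above is the large-N version of this.

----------------------------------------------------------------
import math

def codon_entropy(counts):
    """Shannon entropy (bits) of a codon count distribution."""
    n = sum(counts)
    return sum((c / n) * math.log2(n / c) for c in counts if c > 0)

def delta_H(counts, i, j):
    """Change in entropy when one codon of type i mutates to type j."""
    mutated = list(counts)
    mutated[i] -= 1
    mutated[j] += 1
    return codon_entropy(mutated) - codon_entropy(counts)

# Toy "genome": counts for 4 codon types (stand-ins for the 64 real
# ones); N = 1000, so P = [0.4, 0.1, 0.3, 0.2].
counts = [400, 100, 300, 200]

print(delta_H(counts, 0, 1))   # P1 > P2: positive, entropy increases
print(delta_H(counts, 1, 0))   # P1 < P2: negative, entropy decreases
----------------------------------------------------------------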

[...]

BH:
>> The above statement appears in the discussion section and is
>> offered as a way of understanding the empirical results presented.
>> The authors analyzed a bunch of data for one point mutations in
>> the human haemoglobin gene and found that out of a total of
>> 204 one point mutations, 139 resulted in an increase in IC,
>> 2 resulted in no change in IC, 54 in a decrease with 9 being
>> uncertain. So, one point mutations resulted in an increase in
>> information content about 70% of the time.
>

Greg:
>How did they measure the information content of the genome?
>The other problem with using Shannon measures is that you have
>to have a complete prior distribution. Assuming something as
>important as this seems fraught with peril to me. Perhaps the
>biologists know something I don't, though....Can anyone enlighten
>me? It would seem that to measure the information content of
>the genome in question, you would have to figure out how much
>information is conveyed to you when you are told which genome
>actually is present. To do this, you need a prior distribution
>over all genomes. I have a hard time believing the biologists
>know this distribution.
>

Here is the authors' description of how they obtained the codon
frequencies:

=======================
In order to analyse the mutations in the human haemoglobin
gene, we need data on the codon frequencies in human DNA.
Unfortunately, only very small regions of human DNA have
been sequenced, and this data is quite insufficient for
a meaningful analysis. We have, therefore, taken as our
data the codon frequencies averaged over all the vertebrate
genes that have been sequenced so far (ref.). These
frequencies are given in Table 1.

Using the above data, we have examined the one point
mutations in the alpha, beta, gamma and delta chains,
i.e. those mutations which can be attributed to a
single base substitution in the corresponding codons
of the normal gene. The sequence of the alpha, beta,
gamma and delta chains of the normal gene has already
been determined (refs.), and as we know the amino acid
substitutions (ref.) it can be seen what single base
substitutions in the codon would lead to the observed
mutation. In other words, in most cases, we know what
the mutated codon is. Using Table 1, it is then a simple
matter to see whether in the mutation the condition
satisfied is P1>P2 or not. ....
==================

Now to your questions, which raise some very valid points. First,
note that it is unnecessary to actually know what the information
content is in order to know whether or not it increases. This is
determined solely by the frequencies P1 and P2. This raises another
question: how accurately are P1 and P2 known? It is obvious that
they are not known precisely; nevertheless, in the authors' defense,
they don't have to be known precisely. Are they known well enough
to be confident in saying P1 > P2? Still, regardless of this, the
mathematics shows that a random mutation is likely to increase the
information content.
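
As a rough way of thinking about whether P1 and P2 are known well
enough, here is a little sketch (purely hypothetical frequencies,
and a crude model that treats the two codon counts as independent,
which is a simplification): pretend the codon table was built from
a sample of M codons and ask how often sampling noise alone would
reverse the ordering P1 > P2.

----------------------------------------------------------------
import random

def flip_probability(p1, p2, sample_size, trials=2000, seed=0):
    """Fraction of simulated codon tables in which the estimated
    counts come out with N1 <= N2 even though the 'true' frequencies
    satisfy p1 > p2."""
    rng = random.Random(seed)
    flips = 0
    for _ in range(trials):
        n1 = sum(rng.random() < p1 for _ in range(sample_size))
        n2 = sum(rng.random() < p2 for _ in range(sample_size))
        if n1 <= n2:
            flips += 1
    return flips / trials

# Hypothetical frequencies, roughly the size of real codon
# frequencies (~1/64), and a made-up sample size.
print(flip_probability(0.020, 0.010, sample_size=2000))  # rarely reversed
print(flip_probability(0.020, 0.019, sample_size=2000))  # reversed quite often
----------------------------------------------------------------

So the comparison is probably trustworthy when P1 and P2 differ by
a good margin, but not when they are nearly equal, which seems
about what one would expect.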

Brian Harper | "If you don't understand
Associate Professor | something and want to
Applied Mechanics | sound profound, use the
The Ohio State University | word 'entropy'"
| -- Morowitz
Bastion for the naturalistic |
rulers of science |