===========================================================================
From: hpyockey@aol.com (HPYockey)
Newsgroups: bionet.info-theory
Subject: Re: Mutual Information Mutual entropy
Date: 31 Oct 1995 10:38:21 -0500
Organization: America Online, Inc. (1-800-827-6364)
Lines: 193
Sender: root@newsbf02.news.aol.com
Message-ID: <475ftd$alg@newsbf02.news.aol.com>
Reply-To: hpyockey@aol.com (HPYockey)
NNTP-Posting-Host: newsbf02.mail.aol.com
Subject: Mutual information, mutual entropy, uncertainty, order, and
orderliness are names for functions in information theory
Much of the difficulty in understanding information theory is due to the
anthropomorphisms associated with such words as "information",
"uncertainty", "entropy", "complexity", "self-organization" and so forth.
They must be regarded as only names of mathematical functions. [Nature
v344 p823 (1990)]
To establish the principal purpose of communication systems let us go to
the founders, Hartley [Bell System Technical Journal v7 p535-563 (1928)]
and Shannon [BSTJ v27 p379-423 (1948)]. First let us go to Hartley: "What
I hope to accomplish in this direction is to set up a quantitative measure
whereby the capacities of various systems to transmit information may be
compared." Shannon: "The fundamental problem of communication is that of
reproducing at one point either exactly or approximately a message
selected at another point." ... "The significant aspect is that the actual
message is one selected from a set of possible messages. The system must
be designed to operate for each possible selection, not just the one which
will actually be chosen since this is unknown at the time of design."
It should be emphasized at first that messages are not necessarily
generated by human beings or other organisms and consequently do not
always reflect the intuitive anthropomorphic notion of "uncertainty". The
pictures of the outer planets received from spacecraft are not generated
by man. All that is needed is that the signals be digitized and selected
from a set of possible messages. This applies to the genetic messages
recorded in DNA. Notice that the word "uncertainty" has a different
meaning in the Heisenberg Uncertainty Principle than in communication
technology.
Shannon goes on to point out that: "If the number of messages is finite
then this number or any monotonic function of this number can be regarded
as a measure of the information produced when one message is chosen from
the set, all choices being equally likely. As was pointed out by Hartley
the most natural choice is the logarithmic function." Shannon then showed
that only the expectation value of the logarithm of the probabilities of
the letters of the alphabet in which the message is expressed has the
properties needed for a measure of the information in an ensemble of
messages chosen with a given probability distribution. This function looks
like the formula for entropy in classical statistical mechanics. If that
bothers you, remember that entropy is only a name; call it Shadrach. The
mathematical symbol used by Shannon for the entropy at the source of the
ensemble of messages is H(x).
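As a small illustration (mine, not Shannon's), this entropy function can be
computed in a few lines of Python; the four-letter alphabet and its
probabilities below are made up purely for the example.

  # Shannon entropy of a source alphabet, in bits per letter.
  # The letter probabilities are hypothetical, chosen only for illustration.
  from math import log2

  def entropy(probs):
      # H = -sum of p(i) * log2 p(i), over letters with p(i) > 0
      return -sum(p * log2(p) for p in probs if p > 0)

  p = [0.5, 0.25, 0.15, 0.10]   # a made-up four-letter alphabet
  print(entropy(p))             # about 1.74 bits; equiprobable letters give 2.0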
Notice also that one must establish that the phenomenon under
consideration is stochastic before applying probability theory. The rising
of the Sun is not a stochastic event; discussing that in terms of
probability is nonsense.
How does one go about relating the entropy of a set of messages to the
number of messages so that one can get a measure of the information in the
ensemble, as Shannon requires? The total number of possible messages is
given by the following expression: A^N where A is the number of letters in
the alphabet and N is the number of letters in the chain of letters in
each member of the ensemble of messages. For example, it is often said
that the total possible number of proteins of 100 amino acids is 20^100 =
1.26 x 10^130. This is technically true but it ignores the fact that all
amino acids are not equally probable. Does that make a difference?
Let us approach the problem from another point of view. Suppose we
calculate the probability, p(C), of a typical member C of the ensemble of
messages:
p(C) = Product over i of p(i)^[N p(i)], where p(i) is the probability of
each letter in the alphabet. Taking the logarithm of p(C) to the base 2:
log p(C) = N Sum p(i) log p(i) = -NH, where H = -Sum p(i) log p(i) is
recognized as the expectation value, with a minus sign, of the logarithm to
the base 2 of the letter probabilities: the entropy of the ensemble.
Therefore p(C) = 2^-NH. The number of such sequences is then 1/p(C) =
2^NH. 2^NH is equal to A^N only when all members of the alphabet are
equally probable and is otherwise often very much smaller. Notice that the
expression for H was not introduced ad hoc but comes out of the woodwork,
so to speak.
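To see how much smaller 2^NH can be than A^N, here is a short Python
sketch using the same kind of made-up letter probabilities as above; the
numbers are illustrative only.

  # Compare the total number of sequences A^N with 2^(N*H) for an alphabet
  # whose letters are not equally probable (hypothetical probabilities).
  from math import log2

  probs = [0.5, 0.25, 0.15, 0.10]     # made-up four-letter alphabet
  A, N = len(probs), 100
  H = -sum(p * log2(p) for p in probs)

  total = A ** N                      # 4^100, about 1.6 x 10^60
  high_prob = 2 ** (N * H)            # about 2^174, roughly 3 x 10^52
  print(H, total, high_prob)          # 2^NH is smaller by a factor of ~10^8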
Since 2^NH is very much smaller than A^N, except when all members of the
alphabet are equally probable, where did all the other sequences go? This
is given by the Shannon-McMillan-Breiman theorem that is briefly as
follows:
Given epsilon greater than zero and eta greater than zero, no matter how
small, then for sequences of length N sufficiently large, all sequences
being chosen from an alphabet of A symbols, the ensemble of sequences can
be divided into two groups such that:
(1) the probability p(C) of any sequence in the first group satisfies:
|[log(1/p(C))]/N - H| < eta
(2) The sum of the probabilities of all sequences in the second group is
less than epsilon.
The Shannon-McMillan-Breiman theorem tells us that the number of sequences
in the high probability group is 2^NH and they are all nearly equally
probable. We can ignore all others because, if N is large, their total
probability is very small.
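The theorem can also be checked numerically: draw long sequences letter by
letter and watch [log(1/p(C))]/N settle down near H. The sketch below,
using the same hypothetical letter probabilities as before, does just that.

  # Empirical look at the Shannon-McMillan-Breiman theorem: for long
  # sequences, [log2(1/p(C))]/N clusters around H (hypothetical alphabet).
  import random
  from math import log2

  probs = {'a': 0.5, 'b': 0.25, 'c': 0.15, 'd': 0.10}
  H = -sum(p * log2(p) for p in probs.values())
  N = 10000

  letters, weights = zip(*probs.items())
  for _ in range(5):
      seq = random.choices(letters, weights=weights, k=N)
      log_inv_p = -sum(log2(probs[x]) for x in seq)   # log2(1/p(C))
      print(log_inv_p / N, "vs H =", H)               # the two agree closely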
Now we can refer to Shannon's condition: "If the number of messages is
finite then this number or any monotonic function of this number can be
regarded as a measure of the information produced when one message is
chosen from the set, all choices being equally likely." Thus we see that
the number of sequences we need to be concerned about is controlled by the
value of H, the entropy of the ensemble. The larger H is the more messages
can be assembled at the source. That is why H is called the "information"
sent: the more messages there are, the more "information" and "knowledge"
can be sent from the source.
This is presuming that the sender incorporates knowledge or "meaning" in
each message in the high probability set. How he does this, or whether he
does so at all, is his business. Remember that the telephone company is not
interested in what you say nor its value to you. It is interested only in
how much you can say, and will charge you accordingly. By the same token,
your hard drive measures information in bytes without regard to "meaning".
H is also a measure of the "complexity" since the more messages in the
high probability group the more "complex" it is. H is also a measure of
"uncertainty" because the more messages there are the less certain we are
that any given one will appear in the receiver. If H = 0 we are quite
certain that only one message can be assembled. We should try to avoid
these anthropomorphisms, however.
I showed in Journal of Theoretical Biology v91, 13-31 (1981) that given an
estimate of H for reasonable probabilities of the amino acids one finds
1.04 x 10^125 sequences in a series of 100 amino acids, in the high
probability group. That is only 10^-5 of the total possible. An
interesting exercise to illustrate the Shannon-McMillan-Breiman theorem is
to apply it to a dice game where the letters of the alphabet are the
numbers 2 through 12, not all of equal probability. How many sequences are
there in the total possible and how many are in the high probability
group?
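For what it is worth, here is one way to set up that dice exercise in
Python, taking the letters to be the totals 2 through 12 of two fair dice
with probabilities 1/36 through 6/36; the message length N = 100 is my
choice for the illustration, not part of the exercise.

  # The dice alphabet: totals 2..12 of two fair dice, probabilities
  # 1/36, 2/36, ..., 6/36, ..., 1/36 -- not all equally probable.
  from math import log2

  probs = [(6 - abs(total - 7)) / 36 for total in range(2, 13)]
  A = len(probs)                      # 11 letters
  H = -sum(p * log2(p) for p in probs)

  N = 100                             # length of each message of throws
  print(H)                            # about 3.27 bits vs log2(11) = 3.46
  print(A ** N)                       # total possible sequences, 11^100
  print(2 ** (N * H))                 # size of the high-probability group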
We are now in a position to consider the effect of noise and the role of
mutual information. The function for the entropy of the ensemble of
messages received at the destination has the same mathematical form as
H(x). Shannon used the symbol H(y). Shannon chose the Markov process as
the mathematical model for communication of the message from the sender to
the receiver. The alphabet in the sender need not be the same as the
alphabet in the receiver. For example, in molecular biology, the alphabet
of the sender in DNA and RNA is not the same alphabet in the receiver,
protein. In DNA it is the third extension of the primary T, G, C, A
alphabet, the 4^3 = 64 codon triplets. In order for the communication
system to operate, a code must exist that relates the letters of the
sending alphabet to those of the receiving alphabet. The number of letters
in the sending alphabet may be equal to or larger than, but not less than, the
number of letters in the receiving alphabet. Thus in the genetic code
there are 64 letters in the DNA and mRNA alphabets and 20 in the protein
alphabet.
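As a toy sketch (not the actual genetic code), a code of this kind is just
a many-to-one mapping from the larger sending alphabet onto the smaller
receiving alphabet; the letters below are invented for the example.

  # A toy many-to-one code: four sending letters map onto three receiving
  # letters, the way the 64 codons map onto the 20 amino acids of protein.
  code = {'aa': 'X', 'ab': 'X', 'ba': 'Y', 'bb': 'Z'}   # hypothetical code

  sending = set(code)                     # 4 letters
  receiving = set(code.values())          # 3 letters
  assert len(sending) >= len(receiving)   # the sender never has fewer letters

  message = ['aa', 'ba', 'bb', 'ab']
  received = [code[letter] for letter in message]
  print(received)     # ['X', 'Y', 'Z', 'X']; the forward map is well defined,
                      # but 'X' came from 'aa' or 'ab', so it cannot be inverted.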
The matrix elements of the Markov process reflect the fact that, in any
practical transmission, a letter sent is not always received according to
the code. This effect is called noise.
Shannon showed that the amount of information lost due to noise is the
conditional entropy of x(i) given that y(j) has been received. That is, as
he put it, given a correction channel leading back to a correction device,
it is possible to encode the correction data so that all but an
arbitrarily small fraction epsilon of the errors would be corrected. The
mutual entropy is the entropy of the source less the conditional entropy
of X given that Y has been received. That is, mutual entropy is the
entropy of the messages common to sender and receiver. It is sometimes
called the mutual information. There are Venn diagrams on page 20 of Cover
and Thomas and on page 84 of Roman's "Coding and Information Theory" that
show the relation between these functions. See also Quastler, discussed
below.
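A minimal numerical sketch of this bookkeeping, with an invented two-letter
source and an invented channel matrix (the probabilities are assumptions,
chosen only to show the arithmetic):

  # Mutual entropy I(X;Y) = H(X) - H(X|Y) for a noisy two-letter channel.
  # Source distribution and channel matrix are hypothetical.
  from math import log2

  def H(probs):
      return -sum(p * log2(p) for p in probs if p > 0)

  p_x = [0.6, 0.4]                    # sender's letter probabilities
  channel = [[0.9, 0.1],              # rows: x sent; columns: p(y received | x)
             [0.2, 0.8]]

  p_xy = [[p_x[i] * channel[i][j] for j in range(2)] for i in range(2)]
  p_y = [sum(p_xy[i][j] for i in range(2)) for j in range(2)]

  H_x = H(p_x)
  H_y = H(p_y)
  H_joint = H([p for row in p_xy for p in row])    # H(X,Y)
  H_x_given_y = H_joint - H_y                      # H(X|Y) = H(X,Y) - H(Y)
  mutual = H_x - H_x_given_y                       # entropy common to both ends
  print(H_x, H_x_given_y, mutual)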
I showed how mutual entropy plays a role in the transmission of
information from mRNA to protein in Journal of Theoretical Biology v46
369-406, (1974) and the application to calculating the information content
of cytochrome c in Journal of Theoretical Biology v67 345-376 (1977).
Among the books those who use mutual information in their work should be
familiar with are: Mathematical Foundations of Information Theory, A. I.
Khinchin, translated by R. A. Silverman and M. D. Friedman, Dover (1957);
Coding and Information Theory, Richard W. Hamming, Prentice-Hall (1986);
Elements of Information Theory, Cover and Thomas, John Wiley (1991); Coding
and Information Theory, Steven Roman, Springer-Verlag (1992). These books
are directed toward mathematicians or electrical engineers. Information
Theory and Molecular Biology, Cambridge University Press (1992), shows the
applications to molecular biology.
Henry Quastler was the first to realize the importance of information
theory and coding theory in molecular biology. He gives a good discussion
of the basic ideas in information theory and their application to problems
in molecular biology and psychology in "A Primer on Information Theory" in
Symposium on Information Theory in Biology, edited by Hubert P. Yockey,
Robert L. Platzman and Henry Quastler; Pergamon Press 1958. In
particular, he shows a diagram of the relation of mutual entropy to other
entropy functions. This book is out of print but many university libraries
have it and it can be borrowed by Interlibrary loan.
For mathematicians interested in probability theory there is a good deal
of material on entropy in probability theory in: Ergodic Theory by Karl
Peterson, Cambridge University Press (1983).
It is important to stick to the mathematics and not depend on one's
intuition or attempt to extract mathematics from the meaning of words. It
goes the other way.
"Although I am fully convinced of the truth of the views given in this
volume under the form of an abstract, I by no means expect to convince
experienced naturalists whose minds are stocked with a multitude of facts,
all viewed, during a long course of years, from a point of view directly
opposite to mine. ... Any one whose disposition leads him to attach more
weight to unexplained difficulties than to the explanation of a certain
number of facts will certainly reject the theory."
Charles Robert Darwin, Origin of Species, Chapter XV
Best regards, Hubert P. Yockey