RE: Design detection and minimum description length

From: Glenn Morton (glenn.morton@btinternet.com)
Date: Sat Dec 14 2002 - 15:25:22 EST


    Iain wrote:
    >-----Original Message-----
    >From: Iain Strachan [mailto:iain.strachan@eudoramail.com]
    >Sent: Monday, December 09, 2002 10:00 PM

    >(1) The effectiveness of the Fourier transform analysis arose from
    >your observation that you had a sequence with periodicity, and
    >your personal knowledge that a periodic sequence can be analysed
    >as a Fourier series. Personally I don't see the difference
    >conceptually between spotting a Fourier series (i.e. a periodic
    >function) and spotting a sequence derived from primes. Both are
    >bits of maths that you had to know in order to make the deduction.

    Agreed, but when it comes to DNA, exactly WHAT is the math that Dembski has
    to know in order to conclude that DNA is designed? There is none. The only
    math he cites is that of a low probability, but any sequence of similar
    length has precisely the same low probability of occurrence. Thus, what
    Dembski does with DNA is to say he knows it is designed. There is no math
    to indicate design, only math to indicate low probability, and that isn't
    design at all.
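
The equal-probability point can be illustrated with a short Python sketch. It assumes a deliberately simple model that the post does not spell out: each base is drawn independently and uniformly from a, c, t, g. The function name `sequence_probability` and the all-a comparison string are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

BASES = "actg"

def sequence_probability(seq):
    """Probability of one exact sequence, ASSUMING independent, uniform bases.

    This model is an illustrative assumption, not a claim about real genomes.
    """
    if any(base not in BASES for base in seq):
        raise ValueError("sequence may only contain a, c, t, g")
    # Every position contributes a factor of 1/4, whatever the letter is.
    return Fraction(1, len(BASES)) ** len(seq)

irregular = "acctttgcaaca"   # a 12-letter sequence with no obvious pattern
regular = "aaaaaaaaaaaa"     # a 12-letter sequence with an obvious pattern

p_irregular = sequence_probability(irregular)
p_regular = sequence_probability(regular)

print(p_irregular, p_regular, p_irregular == p_regular)
```

Under that assumed model the probability depends only on the length of the sequence, so a low probability by itself does not separate one sequence from another.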

    I would like to note that before Fourier came up with that methodology of
    detecting periodicities, he knew of no math which would perform that trick.
    Where is the Dembski transform? Where is the design transform? It doesn't
    exist. There isn't one because, with DNA, Dembski only claims design based
    upon personal bias, not knowledge of how or when the Creator designed the
    DNA. There is no math for the positive side of his claim--i.e. a coefficient
    of design.
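
For contrast, here is a minimal sketch of what a Fourier-style transform does for periodicity: it turns "is there a period?" into a number that can be read off. The test signal (a sine wave with period 8 sampled 64 times) and the helper name `dft_magnitudes` are assumptions made for illustration; a plain discrete Fourier transform is used rather than any particular library routine.

```python
import cmath
import math

def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform for bins 0..n/2."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]

# Hypothetical test signal: 64 samples of a sine wave that repeats every 8 samples.
signal = [math.sin(2 * math.pi * t / 8) for t in range(64)]

mags = dft_magnitudes(signal)

# Skip bin 0 (the mean) and pick the strongest frequency bin.
peak_bin = max(range(1, len(mags)), key=mags.__getitem__)
period = len(signal) / peak_bin

print(peak_bin, period)
```

The transform produces a coefficient per frequency, and the largest one identifies the period. Nothing comparable is on offer for design.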

    >
    >(2) The point about the correlation coefficient is more
    >interesting, because it precisely illustrates the point that you
    >do need inside knowledge, and that you can't rely on some
    >"objective math" formula that allows you to crank the handle and
    >churn out meaningful results. First, let me quote from a standard
    >text-book on numerical analysis (Numerical Recipes in C), talking
    >about the correlation coefficient:
    >
    >"When a correlation is _known to be significant_ [emphasis mine],
    >R is one conventional way of summarizing its strength. In fact,
    >the value of R can be translated into a statement about what
    >residuals (root mean square deviations) are to be expected if the
    >data are fitted to a straight line by the least-squares method
    >[ref to equations skipped] ... Unfortunately, R is a rather poor
    >statistic for deciding _whether_ [emphasis in the original] an
    >observed correlation is significant, and/or whether one observed
    >correlation is significantly stronger than another. The reason is
    >that R is ignorant of the individual distributions of x and y, so
    >there is no universal way to compute its distribution in the case
    >of a null hypothesis" [Press, Teukolsky, Vetterling & Flannery:
    >"Numerical Recipes in C", Second Edition, Cambridge University
    >Press, 1992, p636].
    >
    >So, what are Press et al saying here? That the correlation
    >coefficient R is pretty meaningless unless you know that the data
    >are correlated already. This is precisely your objection to
    >Dembski (he can't detect design unless you tell him it's
    >designed, by the "side information"). Does this mean that the
    >Correlation coefficient is a totally useless statistic? Not at
    >all. They go on to discuss the general shape of the distributions
    >of x and y (concerning the fall-off rate of the tails of the
    >distributions), that allow one to derive meaningful results and a
    >distribution for R. What it comes down to is that if your data
    >when plotted on an X/Y scatter plot looks a bit like a long thin
    >ellipsoid, then you've good reason to suspect that they are
    >correlated, and from that, you can get meaningful results by
    >comparing values of R. So, you have to use your intelligence and
    >prior knowledge of what correlated variables look like, in order
    >to use the correlation coefficient.
    >
    >To see just how meaningless the results get if you just put the
    >numbers into the formula and crank out the result, consider the
    >following experiment that you can easily perform in Microsoft Excel.
    >
    >Generate 100 pairs of (x,y) points from random numbers in the
    >range 0-1 (this can be done with the Excel RAND() function). Add a
    >101st (x,y) pair and make it equal to (100,100). Now compute the
    >correlation coefficient between the two sequences (using the Excel
    >CORREL() function). You will get an answer for R that is close to
    >0.999. So your "objective math" is telling you that the sequences
    >are highly correlated.

    So???? They are highly correlated. To shorten your example, the correlation
    between:
    1,4,3,5,9

    and
    1,4,3,5,9,2

    is very high and R shows it. I see nothing here to support your contention
    that R is not a good measure of correlation at all.

    >
    >But something tells me that these sequences are not highly
    >correlated. What do you think it is? It's my inside knowledge of
    >what correlated data ought to look like. That tells me that the
    >(100,100) point is a massive outlier, and should be discarded.
    >(When R drops to around 0.01).
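
The spreadsheet experiment described in the quoted text can also be sketched in Python, for readers without Excel. This is an illustrative reconstruction, not the original worksheet: the seed, the variable names, and the hand-written `pearson_r` helper (standing in for Excel's CORREL()) are all assumptions.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed (an arbitrary choice) so the run is repeatable

# 100 independent (x, y) pairs drawn uniformly from the range 0-1.
xs = [rng.random() for _ in range(100)]
ys = [rng.random() for _ in range(100)]

r_without = pearson_r(xs, ys)                      # the 100 random pairs only
r_with = pearson_r(xs + [100.0], ys + [100.0])     # plus the 101st pair (100, 100)

print(f"R without (100,100): {r_without:.3f}")
print(f"R with    (100,100): {r_with:.3f}")
```

With the extra point included, R comes out very close to 1. Without it, R stays near zero, which is the contrast the two replies are arguing over.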

    The main issue here is whether or not design can be detected in biological
    systems. Your example above won't occur in biological systems. You are
    treating the biological sequence as if it were an experimental measurement.
    It is a series of relationally fixed objects. One can't simply throw out 'the
    outlier' because there aren't any. All values are either a, c, t, or g. Put
    into math terms, all values are integers (0-3) or (1-4). You don't have 100's
    in DNA sequences! You have gone off into an area which is irrelevant to the
    question of design in the biological polymers.

    ID proponents claim to be able to determine that the information in DNA is
    designed because the exact order of these sequences is required for
    protein function. One doesn't have 'massive outliers' unless one finds the
    letter X in the sequence.

    >
    >Is this a silly example that wouldn't occur in real life? I've
    >seen a lot worse than that.

    Not in a sequence of a,c,t, and g's you haven't. Tell me what the outlier
    is there? Pray tell?

    >In the first Neural Nets application
    >I worked on (that ended up as a successfully deployed analysis
    >tool), I was using a neural net to predict plasma electron density
    >profiles inside a fusion experiment (the JET vacuum vessel). The
    >electron densities were of the order of 10^20 per cubic metre.
    >However, the data file I received had a few electron densities
    >that were of the order of 10^76 per cubic metre. My background
    >knowledge of Physics told me that you just don't get electron
    >densities of 10^76 per cubic metre in a vacuum vessel (or anywhere
    >else for that matter ;-). I therefore concluded that these would
    >be down to a processing error in the computer program that gave me
    >the file of data, and discarded the offending items. If I'd
    >naively shoved it all in to the neural net, it would have ended up
    >predicting everything in the region of 10^75 - 10^76, and the
    >results would have been completely useless.

    This doesn't seem to make a point here.

    >
    >The moral of the story is that you can't make any statistical
    >inference (whether it's correlation, pattern detection, or
    >"design") just by blindly plugging your data into some formula,
    >and relying on the maths to tell you the answer. You have to use
    >your background knowledge if it's not to be "Lies, damned lies and
    >statistics".

    Biological systems don't have outliers in the sense you are using that term.
    You are equivocating between how you treat an experimental measurement and
    how you treat a series. That is getting you into trouble. The exact DNA
    sequence in many of these protein genes is a sequence which has been verified
    by multiple workers. It can no longer be treated as an outlier to be tossed
    in the trash. The question is, given the sequence

    a,c,c,t,t,t,g,c,a,a,c,a... is it designed?

    One doesn't go through the DNA and say, 'Gee, that 2nd t doesn't belong
    because I don't like it,' and thus by getting rid of it turn an obviously
    undesigned sequence into one which is designed. DNA is much more like a time
    series than an experimental measurement.

    >
    >That is why I don't believe your objection to Dembski's use of
    >"side information" is a valid one. There may be other reasons for
    >criticizing Dembski, but this isn't one of them.

    We disagree. Dembski has no coefficient of design. How do we tell that God
    would design DNA in the fashion we see? Maybe God works in other ways?
    Maybe he doesn't design things like we do. Dembski is anthropomorphising
    God, making God behave like a human.
    >
    >Apologies for the long delay in responding to this. Other things
    >intervened.

    No problem, I was in London this week giving an invited paper on some
    geophysical work we did in the North Sea last year and chairing a session of
    the conference. I too had other things to do.

    glenn

    see http://www.glenn.morton.btinternet.co.uk/dmd.htm
    for lots of creation/evolution information
    anthropology/geology/paleontology/theology
    personal stories of struggle


