correlation, Regression

deutsch english français Print

8.6 CORRELATION, REGRESSION

INTRODUCTION

In both the natural sciences and in daily life, the collection of data in the form of measurements plays an important role. However, you do not always need to use a measuring instrument. The data can also consist in poll values, for example, in connection with a statistical investigation. After any data acquisition, it is most important to interpret the measured data. You may look for qualitative statements where the measured values rise or fall over time, or you might want experimentally verify a law of nature.

A major problem is that measurements are almost always subject to fluctuations and are not exactly reproducible. Despite these "measurement errors" people would still like to make scientifically correct statements. Measurement errors do not always emerge from flawed measuring instruments, but they may also lie in the nature of the experiment. For instance, the "measurement" of the numbers on a rolled die basically results in dispersed values between 1 and 6. Therefore, with any type of measurement, statistical considerations play a central role.

Often two variables x and y are measured together in a measurement series and questions arise as to whether these are related and if they are subject to regularity. Plotting the (x, y) values as measuring points in a coordinate system is called data visualization. You can almost always recognize whether the data are dependent of each other by simply looking at the distribution of the measured values.

PROGRAMMING CONCEPTS:

Data visualization, measured value distribution, cloud diagram, noise, covariance, correlation coefficient, regression, best fit

VISUALIZING INDEPENDENT AND DEPENDENT DATA

You can easily simulate independent data in a x-y diagram by using uniformly distributed random numbers for x and y. Although it is not necessary in this case, you copy the measured values into the data lists xval and yval and only then display them as data points.

from random import random
from gpanel import *

z = 10000
    
makeGPanel(-1, 11, -1, 11)
title("Uuniformly distributed value pairs")
drawGrid(10, 10, "gray")

xval = [0] * z
yval = [0] * z

for i in range(z):
    xval[i] = 10 * random()
    yval[i] = 10 * random()
    move(xval[i], yval[i])
    fillCircle(0.03)

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

You get galaxy-like graphics if the x and y values are still independent of each other, but normally distributed around an average value. In this example, 5 is used as the average and 1 is used as the dispersion. This is also called a scatter plot or a cloud diagram.

You can easily generate normally distributed random numbers by using random.gauss(average value, dispersion).

from random import gauss
from gpanel import *

z = 10000
    
makeGPanel(-1, 11, -1, 11)
title("Normally distributed value pairs")
drawGrid(10, 10, "gray")

xval = [0] * z
yval = [0] * z

for i in range(z):
    xval[i] = gauss(5, 1)
    yval[i] = gauss(5, 1)
    move(xval[i], yval[i])
    fillCircle(0.03)

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

You can also simulate dependencies between x and y values by assuming that x increases in equidistant steps in a certain range, and y is a function of x that is, however, subject to statistical fluctuations. In physics, such fluctuations are often called noise.

You draw the cloud diagram for a parabola y = -x * (0.2 * x - 2) and normally distributed noise in the area x = 0..10 with an increment size of 0.01.

from random import gauss
from gpanel import *
import math

z = 1000
a = 0.2
b = 2

def f(x):
    y = -x * (a * x - b)
    return y

makeGPanel(-1, 11, -1, 11)
title("y = -x * (0.2 * x - 2) with normally distributed noise")
drawGrid(0, 10, 0, 10, "gray")

xval = [0] * z
yval = [0] * z

for i in range(z):
    x = i / 100
    xval[i] = x
    yval[i] = f(x) + gauss(0, 0.5)
    move(xval[i], yval[i])
    fillCircle(0.03)

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

MEMO

You can recognize interdependencies of measured variables immediately through the use of data visualization [more... This example is characterized famous because correlation does not recognize the dependence].

With uniformly distributed random numbers in an interval [a, b], the numbers appear with the same frequency in each subinterval of equal length. Normally distributed numbers have a bell-shape distribution, where 68% of the numbers lie inside the interval (average - dispersion) and (average + dispersion).

COVARIANCE AS A MEASURE FOR DEPENDENCY

Here the goal is to not only make interdependencies visible in a diagram, but also to express them by a number. As often, you start from a concrete example and consider the double rolling of a die, where x is the number you get from the first roll and y is the number from the second roll. You conduct the rolling experiment many times and draw the value pairs as a cloud diagram. Since both measurement values of x and y are independent, you obtain a regular point cloud. If you calculate the expected value of x and y, it results in the generally known 3.5.

Since x and y are independent, the probability p_xy, to get the pair x, y from a double roll is p_xy = p_x * p_y.

It is an obvious assumption that in the general case the following product rule applies.

If x and y are independent, the expected value of x * y equals the product of the expected values of x and y [more... The theoretical proof is simple:

] .

This assumption is confirmed in the simulation.

from random import randint
from gpanel import *

z = 10000 # number of double rolls

def dec2(x):
    return str(round(x, 2))

def mean(xval):
    n = len(xval)
    count = 0
    for i in range(n):
        count += xval[i]
    return count / n

makeGPanel(-1, 8, -1, 8)
title("Double rolls. Independent random variables")
addStatusBar(30)
drawGrid(0, 7, 0, 7, 7, 7, "gray")

xval =  [0] * z
yval =  [0] * z
xyval = [0] * z

for i in range(z):
    a = randint(1, 6)
    b = randint(1, 6)
    xval[i] = a
    yval[i] = b
    xyval[i] = xval[i] * yval[i]
    move(xval[i], yval[i])
    fillCircle(0.2)

xm = mean(xval)
ym = mean(yval)
xym = mean(xyval)
setStatusText("E(x) = " + dec2(xm) + \
          ",  E(y) = " + dec2(ym) + \
          ",  E(x, y) = " + dec2(xym))

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

MEMO

It is advisable to use a status bar where the results can be written out. The function dec2() rounds the value to 2 digits and returns it as a string.

The values are of course subject to a statistical fluctuation.

In the next simulation you will no longer analyze the rolled pair of numbers, but rather the numbers x of the first die and the sum y of the two rolled numbers. It is now evident that y is dependent on x, for example if x = 1, the probability of y = 4 is not the same as when x = 2.

Your simulation confirms that the product rule indeed no longer applies if x and y are interdependent. It is therefore reasonable to introduce the deviation from the product rule, the difference

c = E(x*y) - E(x)*E(y)

called covariance, as a measure of the dependency between x and y. Simultaneously, you see in your program that the covariance can also be calculated as the sum of all square deviations from the average.

from random import randint
from gpanel import *

z = 10000 # number of double rolls

def dec2(x):
    return str(round(x, 2))

def mean(xval):
    n = len(xval)
    count = 0
    for i in range(n):
        count += xval[i]
    return count / n

def covariance(xval, yval):
    n = len(xval)
    xm = mean(xval)
    ym = mean(yval)
    cxy = 0
    for i in range(n):
       cxy += (xval[i] - xm) * (yval[i] - ym)
    return cxy / n
    
makeGPanel(-1, 11, -2, 14)
title("Double rolls. Independent random variables")
addStatusBar(30)
drawGrid(0, 10, 0, 13, 10, 13, "gray")

xval =  [0] * z
yval =  [0] * z
xyval = [0] * z

for i in range(z):
    a = randint(1, 6)
    b = randint(1, 6)
    xval[i] = a
    yval[i] = a + b
    xyval[i] = xval[i] * yval[i]
    move(xval[i], yval[i])
    fillCircle(0.2)

xm = mean(xval)
ym = mean(yval)
xym = mean(xyval)
c = xym - xm * ym
setStatusText("E(x) = " + dec2(xm) + \
          ",  E(y) = " + dec2(ym) + \
          ",  E(x, y) = " + dec2(xym) + \
          ",  c = " + dec2(c) + \
          ",  covariance = " + dec2(covariance(xval, yval)))

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

MEMO

You get a value of about 2.9 for the covariance in the simulation. The covariance is therefore well qualified as a measure for the dependence of random variables.

THE CORRELATION COEFFICIENT

The just introduced covariance also has a disadvantage. Depending on how the measured values x and y are scaled, different values occur, even if the values are apparently equally dependent on each other. This is easy to see. If you take, for example, two dice that have sides with numbers going from 10 to 60 instead of from 1 to 6, the covariance changes greatly even though the dependence is the same. This is why you introduce a normalized covariance, called a correlation coefficient, by dividing the covariance by the dispersion of the x and y values

correlation coefficient(x, y) = ( covariance(x, y) )/ dispersion(x) * dispersion(y)

The correlation coefficient is always in the range from -1 to 1. A value close to 0 corresponds to a small dependence and a value close to 1 represents a large dependence, whereas increasing values of x correspond to increasing values of y, and a value close to -1 corresponds to increasing values of x and decreasing values of y.

You again use the double roll in the program and analyze the dependence of the sum of the numbers from the number of the first die.

from random import randint
from gpanel import *
import math

z = 10000 # number of double rolls
k = 10 # scalefactor

def dec2(x):
    return str(round(x, 2))

def mean(xval):
    n = len(xval)
    count = 0
    for i in range(n):
        count += xval[i]
    return count / n

def covariance(xval, yval):
    n = len(xval)
    xm = mean(xval)
    ym = mean(yval)
    cxy = 0
    for i in range(n):
       cxy += (xval[i] - xm) * (yval[i] - ym)
    return cxy / n

def deviation(xval):
    n = len(xval)
    xm = mean(xval)
    sx = 0
    for i in range(n):
        sx += (xval[i] - xm) * (xval[i] - xm)
    sx = math.sqrt(sx / n)
    return sx

def correlation(xval, yval):
    return covariance(xval, yval) / (deviation(xval) * deviation(yval))
    
makeGPanel(-1 * k, 11 * k, -2 * k, 14 * k)
title("Double rolls. Independent random variables.")
addStatusBar(30)
drawGrid(0, 10 * k, 0, 13 * k, 10, 13, "gray")

xval =  [0] * z
yval =  [0] * z
xyval = [0] * z

for i in range(z):
    a = k * randint(1, 6)
    b = k * randint(1, 6)
    xval[i] = a
    yval[i] = a + b
    xyval[i] = xval[i] * yval[i]
    move(xval[i], yval[i])
    fillCircle(0.2 * k)

xm = mean(xval)
ym = mean(yval)
xym = mean(xyval)
c = xym - xm * ym
setStatusText("E(x) = " + dec2(xm) + \
          ",  E(y) = " + dec2(ym) + \
          ",  E(x, y) = " + dec2(xym) + \
          ",  covariance = " + dec2(covariance(xval, yval)) + \
          ",  correlation = " + dec2(correlation(xval, yval)))

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

MEMO

You can change the scaling factor k and the correlation will stay around 0.71, as opposed to the covariance which greatly changes [more... The correlation coefficient checks only linear dependencies.
You turn it, for example, to the above parabolic cloud map,
so there is a value close to 0, although the values associated obvious].

Instead of always calling it the correlation coefficient, you can simply call it correlation.

MEDICAL RESEARCH PUBLICATION

With the knowledge you have gained, you are already capable of understanding and evaluating a scientific publication, published in the year 2012 in the prestigious journal "New England Journal of Medicine" [more... FH Messerli, Chocolate Comsumption, Cognitive Function, and Nobel Laureates,
The New England Journal of Medicine 367; 16 (Oct. 2012)]. The connection between consumers of chocolate and the number of Nobel Prize winners is investigated in different industrialized countries, or in other words, the question is considered whether there is a correlation between eating chocolate and intelligence. In the article the author uses the following data sources which you can also find on the Internet

Nobel Prizes:
http://en.wikipedia.org/wiki/List_of_countries_by_Nobel_laureates_per_capita

Chocolate consumption:
http://www.chocosuisse.ch/web/chocosuisse/en/documentation/facts_figures.html
http://www.theobroma-cacao.de/wissen/wirtschaft/international/konsum

You want to reconstruct the investigation yourself. In your program, you use a list data with sub-lists of three elements: the name of the country, the amount of chocolate consumption in kg per year and per inhabitant, and the number of Nobel Prize winners per 10 million inhabitants. You display the data graphically and determine the correlation coefficient.

from gpanel import *
import math

data = [["Australia", 4.8, 5.141],
        ["Austria", 8.7, 24.720],
        ["Belgium", 5.7, 9.005],
        ["Canada", 3.9, 6.253],
        ["Denmark", 8.2, 24.915],
        ["Finland", 6.8, 7.371],
        ["France", 6.6, 9.177],
        ["Germany", 11.6, 12.572],
        ["Greece", 2.5, 1.797],
        ["Italy", 4.1, 3.279],
        ["Ireland", 8.8, 12.967],
        ["Netherlands", 4.5, 11.337],
        ["Norway", 9.2, 21.614],
        ["Poland", 2.7, 3.140],
        ["Portugal", 2.7, 1.885],
        ["Spain", 3.2, 1.705],
        ["Sweden", 6.2, 30.300],
        ["Switzerland", 11.9, 30.949],
        ["United Kingdom", 9.8, 19.165],
        ["United States", 5.3, 10.811]]

def dec2(x):
    return str(round(x, 2))

def mean(xval):
    n = len(xval)
    count = 0
    for i in range(n):
        count += xval[i]
    return count / n

def covariance(xval, yval):
    n = len(xval)
    xm = mean(xval)
    ym = mean(yval)
    cxy = 0
    for i in range(n):
       cxy += (xval[i] - xm) * (yval[i] - ym)
    return cxy / n

def deviation(xval):
    n = len(xval)
    xm = mean(xval)
    sx = 0
    for i in range(n):
        sx += (xval[i] - xm) * (xval[i] - xm)
    sx = math.sqrt(sx / n)
    return sx

def correlation(xval, yval):
    return covariance(xval, yval) / (deviation(xval) * deviation(yval))

makeGPanel(-2, 17, -5, 55)
drawGrid(0, 15, 0, 50, "lightgray")

xval = []
yval = []

for country in data:
    d = country[1]
    v = country[2]
    xval.append(d)
    yval.append(v)   
    move(d, v)
    fillCircle(0.2)
    text(country[0])

title("Chocolate-Brainpower-Correlation: " + dec2(correlation(xval, yval)))

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

MEMO

This results in a high correlation of approximately 0.71, which can be interpreted in different ways. It is right to say that there is a correlation between the two sets of data, but we can only speculate about the reasons why. In particular a causal relationship between the consumption of chocolate and intelligence can by no means be proven this way. As a general rule: When there is a high correlation between x and y, x can be the cause for the behavior of y. Likewise, y can be the cause for the behavior of x, or x and y can be associated with other perhaps unknown causes.

Discuss the meaning and purpose of this investigation with other people in your community and ask them for their opinion.

MEASUREMENT ERRORS AND NOISE

Even with an exact relationship between x and y, measurement errors or other influences can result in fluctuating measured values. In this simulation you assume a linear relationship between x and y, whereby the function values y are subjected to normally distributed fluctuations.

The program calculates the correlation coefficient and draws the point with the coordinates (xm, ym) where xm and ym are the expected values of x and y, respectively.

from random import gauss
from gpanel import *
import math

z = 1000
a = 0.6
b = 2
sigma = 1

def f(x):
    y = a * x + b
    return y

def dec2(x):
    return str(round(x, 2))

def mean(xval):
    n = len(xval)
    count = 0
    for i in range(n):
        count += xval[i]
    return count / n

def covariance(xval, yval):
    n = len(xval)
    xm = mean(xval)
    ym = mean(yval)
    cxy = 0
    for i in range(n):
       cxy += (xval[i] - xm) * (yval[i] - ym)
    return cxy / n

def deviation(xval):
    n = len(xval)
    xm = mean(xval)
    sx = 0
    for i in range(n):
        sx += (xval[i] - xm) * (xval[i] - xm)
    sx = math.sqrt(sx / n)
    return sx

def correlation(xval, yval):
    return covariance(xval, yval) / (deviation(xval) * deviation(yval))
    
makeGPanel(-1, 11, -1, 11)
title("y = 0.6 * x + 2 normally distributed measurement errors")
addStatusBar(30)
drawGrid(0, 10, 0, 10, "gray")
setColor("blue")
lineWidth(3)
line(0, f(0), 10, f(10))

xval = [0] * z
yval = [0] * z

setColor("black")
for i in range(z):
    x = i / 100
    xval[i] = x
    yval[i] = f(x) + gauss(0, sigma)
    move(xval[i], yval[i])
    fillCircle(0.03)

xm = mean(xval)
ym = mean(yval)
move(xm, ym)
circle(0.5)

setStatusText("E(x) = " + dec2(xm) + \
          ",  E(y) = " + dec2(ym) + \
          ",  correlation = " + dec2(correlation(xval, yval)))

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

MEMO

This results in a high correlation as expected, which becomes even greater the smaller the selected dispersion sigma is. The correlation is exactly 1 for sigma = 0.
Also, the point with the expected values P(0.5, 0.5) lies on the straight line.

REGRESSION LINE, BEST FIT

Previously you started with a straight line, that you made "noisy" yourself. Here, you ask the opposite question: How do you detect the straight line again if you only have the two series of measurements? The desired line is called the regression line.

At least you already know that the regression line passes through the point P with the expected values. To find it, you can do the following:

You place any line through P and determine the squares of the deviations from the measured values for all x from the straight line (vertical distance). The deviations must be as small as possible for the regression line. For this, you determine an error amount by adding all individual deviations together.

You can find the best line in the simulation by gradually turning your placed line around the point P and always calculating the error sum, which will decline. You will have found the best fit just before the error sum begins to rise again.

from random import gauss
from gpanel import *
import math

z = 1000
a = 0.6
b = 2

def f(x):
    y = a * x + b
    return y

def dec2(x):
    return str(round(x, 2))

def mean(xval):
    n = len(xval)
    count = 0
    for i in range(n):
        count += xval[i]
    return count / n

def covariance(xval, yval):
    n = len(xval)
    xm = mean(xval)
    ym = mean(yval)
    cxy = 0
    for i in range(n):
       cxy += (xval[i] - xm) * (yval[i] - ym)
    return cxy / n

def deviation(xval):
    n = len(xval)
    xm = mean(xval)
    sx = 0
    for i in range(n):
        sx += (xval[i] - xm) * (xval[i] - xm)
    sx = math.sqrt(sx / n)
    return sx

sigma = 1
    
makeGPanel(-1, 11, -1, 11)
title("Simulate data points. Press a key...")
addStatusBar(30)
drawGrid(0, 10, 0, 10, "gray")
setStatusText("Press any key")

xval = [0] * z
yval = [0] * z

for i in range(z):
    x = i / 100
    xval[i] = x 
    yval[i] = f(x) + gauss(0, sigma)
    move(xval[i], yval[i])
    fillCircle(0.03)

getKeyWait()
xm = mean(xval)
ym = mean(yval)
move(xm, ym)
lineWidth(3)
circle(0.5)

def g(x):
    y = m * (x - xm) + ym
    return y

def errorSum():
    count = 0
    for i in range(z):
        x = i / 100
        count += (yval[i] - g(x)) * (yval[i] - g(x))
    return count

m = 0
setColor("red")
lineWidth(1)
error_min = 0
while m < 5:
    line(0, g(0), 10, g(10))
    if m == 0:
        error_min = errorSum()
    else:
        if errorSum() < error_min:
            error_min = errorSum()
        else:
            break
    m += 0.01

title("Regression line found")
setColor("blue")
lineWidth(3)
line(0, g(0), 10, g(10))
setStatusText("Found slope: " +  dec2(m) + \
               ", Theory: " + dec2(covariance(xval, yval) 
              /(deviation(xval) * deviation(xval))))

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

MEMO

Instead of using a computer simulation to find the best fit, you can also directly calculate the slope of the regression line.

m =

(covariance(x, y))/ dispersion(x)²

Since the line passes through the point P with the expected values E(x) and E(y), it thus has the linear equation:

y - E(y) = m * (x - E(x))

EXERCISES

The well-known Engel's law of economics states that the actual amount of money spent on food rises in households with a rising income, yet the proportion of income spent on food decreases. Show the accuracy of this using the data material.

a. Visualize the relationship between income and total food expenditure (amount of money spent)

b. Visualize the relationship between income and relative food expenditure

c. Determine the correlation between income and absolute food expenditure

d. Determine the regression line between income and absolute food expenditure

Data:

Monthly income	4000	4100	4200	4300	4400	4500	4600	4700	4800	4900
Expenditure%	64	63.25	62.55	61.90	61.30	60.75	60.25	59.79	59.37	58.99

5000	5100	5200	5300	5400	5500	5600	5700	5800	5900	6000
58.65	58.35	58.08	57.84	57.63	57.45	57.30	57.17	57.06	56.97	56.90

The currently generally accepted evolutionary history of the universe assumes that there was a Big Bang a long time ago and since then, the universe expands. The main question is when the Big Bang dates back to, or what is called the age of the universe. The astronomer Hubble published his world famous studies in 1929, where he found that there is a linear relationship between the distance d of the galaxies and their escape velocity v. The Hubble law is:

v = H * d

(where H is the Hubble constant)

You can comprehend the astrophysical thought process by starting with the following experimental data obtained from the Hubble Space Telescope:

Galaxy	Distance [Mpc)	Velocity [km/s]
NGC0300	2	133
NGC095	9.16	664
NGC1326A	16.14	1794
NGC1365	17.95	1594
NGC1425	21.88	1473
NGC2403	3.22	278
NGC2541	11.22	714
NGC2090	11.75	882
NGC3031	3.63	80
NGC3198	13.8	772
NGC3351	10	642
NGC3368	10.52	768
NGC3621	6.64	609
NGC4321	15.21	1433
NGC4414	17.7	619
NGC4496A	14.86	1424
NGC4548	16.22	1384
NGC4535	15.78	1444
NGC4536	14.93	1423
NGC4639	21.98	1403
NGC4725	12.36	1103
IC4182	4.49	318
NGC5253	3.15	232
NGC7331	14.72	999

1 Megaparsec (Mpc) = 3.09 * 10¹⁹km

Plot the data in a scatter plot. Copy it to the data list into your program.

data = [
["NGC0300", 2.00, 133],
["NGC095", 9.16, 664],
["NGC1326A", 16.14, 1794],
["NGC1365", 17.95, 1594],
["NGC1425", 21.88, 1473],
["NGC2403", 3.22, 278],
["NGC2541", 11.22, 714],
["NGC2090", 11.75, 882],
["NGC3031", 3.63, 80],
["NGC3198", 13.80, 772],
["NGC3351", 10.0, 642],
["NGC3368", 10.52, 768],
["NGC3621", 6.64, 609],
["NGC4321", 15.21, 1433],
["NGC4414", 17.70, 619],
["NGC4496A", 14.86, 1424],
["NGC4548", 16.22, 1384],
["NGC4535", 15.78, 1444],
["NGC4536", 14.93, 1423],
["NGC4639", 21.98, 1403],
["NGC4725", 12.36, 1103],
["IC4182", 4.49, 318],
["NGC5253", 3.15, 232],
["NGC7331", 14.72, 999]]

from Freedman et al, The Astrophysical Journal, 553 (2001)

Show that the values correlate well and determine the slope H of the regression line.

If you assume that the velocity v of a certain galaxy remains constant, its distance is d = v * T, where T is the age of the universe. Following the Hubble law (v = H * d) we can deduce T = 1 / H . Determine T.

ADDITIONAL MATERIAL

FINDING CACHED INFORMATION WITH AUTOCORRELATION

Intelligent beings on a distant planet want to contact other living beings. To do this, they send out radio signals that present a certain regularity (you can imagine a kind of Morse code). The signal becomes weaker and weaker on the long transmission path and gets masked by statistically fluctuating noise. We receive this signal on Earth with a radio telescope, and so for the time being we only hear the noise.

The statistics and a computer program can help us retrieve the original radio signal again. If you calculate the correlation coefficient of the signal with its own time-shifted signal, you decrease the statistical noise components. (A correlation with itself is called an autocorrelation).

You can simulate this property important for signal analysis with a Python program. The original useful signal is a sine wave and you superimpose it on the noise by adding a random number to each sample value (with a normal distribution). Show the noisy signal in the upper part of the graph and wait for a key press. The useful signal is no longer recognizable.

Subsequently, you build the autocorrelation of the signal and draw the course of the correlation coefficient in the lower part of the graph. The useful signal will be clearly recognizable again.

from random import gauss
from gpanel import *
import math

def mean(xval):
    n = len(xval)
    count = 0
    for i in range(n):
        count += xval[i]
    return count / n

def covariance(xval, yval):
    n = len(xval)
    xm = mean(xval)
    ym = mean(yval)
    cxy = 0
    for i in range(n):
       cxy += (xval[i] - xm) * (yval[i] - ym)
    return cxy / n

def deviation(xval):
    n = len(xval)
    xm = mean(xval)
    sx = 0
    for i in range(n):
        sx += (xval[i] - xm) * (xval[i] - xm)
    sx = math.sqrt(sx / n)
    return sx

def correlation(xval, yval):
    return covariance(xval, yval)/(deviation(xval)*deviation(yval))
def shift(offset):
    signal1 = [0] * 1000
    for i in range(1000):
       signal1[i] = signal[(i + offset) % 1000]
    return signal1

makeGPanel(-10, 110, -2.4, 2.4)
title("Noisy signal. Press a key...")
drawGrid(0, 100, -2, 2.0, "lightgray")

t = 0
dt = 0.1
signal = [0] * 1000
while t < 100:
    y = 0.1 * math.sin(t) # Pure signal
#    noise = 0
    noise = gauss(0, 0.2)
    z = y + noise
    if t == 0:
        move(t, z + 1)
    else:
        draw(t, z + 1)
    signal[int(10 * t)] = z
    t += dt

getKeyWait()
title("Signal after autocorrelation")
for di in range(1, 1000):
    y = correlation(signal, shift(di))
    if di == 1:
        move(di / 10, y - 1)
    else:
        draw(di / 10, y - 1)

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

To make it a bit more exciting, listen to the noisy signal first and then the extracted useful signal. To do this, tap into your knowledge from the chapter on Sound.

from soundsystem import *
import math
from random import gauss
from gpanel import *

n = 5000

def mean(xval):
    count = 0
    for i in range(n):
        count += xval[i]
    return count / n

def covariance(xval, k):
    cxy = 0
    for i in range(n):
       cxy += (xval[i] - xm) * (xval[(i + k) % n] - xm)
    return cxy / n

def deviation(xval):
    xm = mean(xval)
    sx = 0
    for i in range(n):
        sx += (xval[i] - xm) * (xval[i] - xm)
    sx = math.sqrt(sx / n)
    return sx

makeGPanel(-100, 1100, -11000, 11000)
drawGrid(0, 1000, -10000, 10000)
title("Press <SPACE> to repeat. Any other key to continue.")

signal = []
for i in range(5000):
    value = int(200 *  (math.sin(6.28 / 20 * i) + gauss(0, 4)))
    signal.append(value)
    if i == 0:
        move(i, value + 5000)
    elif i <= 1000:
        draw(i, value + 5000)

ch = 32
while ch == 32:
    openMonoPlayer(signal, 5000)
    play()
    ch = getKeyCodeWait()

title("Autocorrelation running. Please wait...")
signal1 = []
xm = mean(signal)
sigma = deviation(signal)
q = 20000 / (sigma * sigma)
for di in range(1, 5000):
    value = int(q *  covariance(signal, di))
    signal1.append(value)
title("Autocorrelation Done. Press any key to repeat.")
for i in range(1, 1000):
    if i == 1:
        move(i, signal1[i] - 5000)
    else:
        draw(i, signal1[i] - 5000)

while True:
    openMonoPlayer(signal1, 5000)
    play()
    getKeyCodeWait()

Highlight program code (Ctrl+C to copy, Ctrl+V to paste)

MEMO

Instead of executing the program, you can listen to the two signals as WAV files here

Noisy signal (click here)

Desired signal (click here)