Friday, 25 October 2013

2010 Project Maths with IPython Notebook (and NBConvert)

In [10]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
%load_ext autoreload
%autoreload 1
A car rental company has been using Evertread tyres on their fleet of economy cars. All cars in this fleet are identical. The company manages the tyres on each car in such a way that the four tyres all wear out at the same time. The company keeps a record of the lifespan of each set of tyres. The records show that the lifespan of these sets of tyres is normally distributed with mean 45 000 km and standard deviation 8000 km.
The company is considering switching brands from Evertread tyres to SafeRun tyres, because they are cheaper. The distributors of SafeRun tyres claim that these tyres have the same mean lifespan as Evertread tyres. The car rental company wants to check this claim before they switch brands. They have enough data on Evertread tyres to regard these as a known population. They want to test a sample of SafeRun tyres against it.
The company selects 25 economy cars at random from the fleet and fits them with the new tyres. For these cars, it is found that the mean life span of the tyres is 43 850 km.
Test, at the 5% level of significance, the hypothesis that the mean lifespan of SafeRun tyres is the same as the known mean of Evertread tyres. State clearly what the company can conclude about the tyres.

State the Null and Alternative Hypotheses

\[H_0: \mu = 45000\]
\[H_1: \mu \neq 45000\]
Every outcome in life is a mixture of true talent and luck. If a footballer scores fifteen goals, there is a small chance that he is a true fifteen-goal player who had even luck. There is a greater chance that he is a player who should have scored more goals and was unlucky, or a player who should have scored fewer goals and was lucky. We can never know for sure, but if the standard deviation for goals is 2, we would be much happier rejecting the null hypothesis that a player is a fifteen-goal player if he went out and scored 25 goals (5 standard deviations away) than if he scored 17 goals (1 standard deviation away).
One thing to note with standard deviations is how quickly things escalate:
In [11]:
numbers = [0.5 ,1.0 , 1.5 ,2.0 , 2.5, 3.0,4.0]

for x in numbers:
    # get the p-value
    print x, "standard deviation would be due to chance one out of", int(1.0/(norm.sf(x)*2)), "times"
0.5 standard deviation would be due to chance one out of 1 times
1.0 standard deviation would be due to chance one out of 3 times
1.5 standard deviation would be due to chance one out of 7 times
2.0 standard deviation would be due to chance one out of 21 times
2.5 standard deviation would be due to chance one out of 80 times
3.0 standard deviation would be due to chance one out of 370 times
4.0 standard deviation would be due to chance one out of 15787 times

Probability Density Function

We can see why by drawing the probability density curve: it is highest near the mean and falls away quickly, so values close to the mean are far more likely than values far from it.
In [12]:
# Question Variables
mean = 45000
hypothetical_population_mean = mean
sd = 8000
sample_size = 25
confidence_interval = 1*sd  # Cut-off distance from the mean (1 sd here; the log-table value is used later)

curve_width = 4*sd # the number of standard deviations to show

# Get the range of numbers
# Use numpy's arange(start, finish, increment)
# Note: the name `range` shadows Python's built-in range() for the rest of the notebook
range = np.arange(mean - curve_width, mean + curve_width , curve_width * 0.002)

# pdf is the probability density function
y = norm.pdf(range,mean,sd)

# Show the graph
plt.plot(range, y, color="black")
[<matplotlib.lines.Line2D at 0x52a5fb0>]
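To put numbers on the shape of the curve, here is a minimal sketch that works out the height of the normal density by hand at the mean and at 1, 2 and 3 standard deviations above it. The helper `normal_pdf` is a hypothetical name introduced for illustration; it uses only the standard `math` module and is meant to mirror what `norm.pdf` returns, not to replace it.

```python
import math

def normal_pdf(x, mean, sd):
    """Height of the normal density curve at x (hypothetical helper)."""
    z = (x - mean) / float(sd)
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

mean = 45000
sd = 8000

# Heights at the mean and at 1, 2 and 3 standard deviations above it
steps = [0, 1, 2, 3]
heights = [normal_pdf(mean + k * sd, mean, sd) for k in steps]

for k, h in zip(steps, heights):
    print("{} sd from the mean: height {:.3e}".format(k, h))
```

Each extra standard deviation lowers the curve by a larger factor than the one before, which is the escalation seen in the table of chances earlier.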

How far is far enough away

We are interested in numbers in the red regions, as those are far enough away that we can reject the idea that the difference was due to chance.
In [13]:
plt.plot(range,y,color="black", ls='-')

left = mean - confidence_interval
right = mean + confidence_interval

# Left tail: x values below the lower cut-off
subx1 = [n for n in range if n < left] 

suby1 = y[:len(subx1)]

plt.fill_betweenx(suby1,subx1, left, color = "red")

# Right tail: x values above the upper cut-off
subx2 = [n for n in range if n > right] 

suby2 = y[-len(subx2):]

plt.fill_betweenx(suby2,subx2, right, color="red")
<matplotlib.collections.PolyCollection at 0x5ca5f10>
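The plot above uses a cut-off of one standard deviation (`confidence_interval = 1*sd`), so it is worth checking how much probability the red regions actually hold. The sketch below is an illustration only: `two_tailed_area` is a hypothetical helper built on the standard library's `math.erfc`, and it compares the 1 sd cut-off with the 1.96 sd cut-off that goes with a 5% two-tailed test.

```python
import math

def two_tailed_area(z):
    """P(|Z| > z) for a standard normal variable Z (hypothetical helper)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

area_at_1_sd = two_tailed_area(1.0)      # cut-off used in the plot above
area_at_1_96_sd = two_tailed_area(1.96)  # cut-off for a 5% two-tailed test

print("Red area with a 1 sd cut-off:    {:.4f}".format(area_at_1_sd))
print("Red area with a 1.96 sd cut-off: {:.4f}".format(area_at_1_96_sd))
```

With a 1 sd cut-off roughly a third of the curve is shaded, so the picture shows the idea of a rejection region and not the 5% region itself.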

Transformation and Normalisation

When you ask someone where something is, thankfully they don't take you literally, for then the answer would be Ordnance Survey map co-ordinates or longitude and latitude. What we really mean is where it is from here. This is the transformation. We care that the mean of the SafeRun sample is 43,850 only so that we can figure out how far it is from the proposed mean.
In maths speak this is:
\[\bar{x} - \mu\]
In [14]:
sample_mean = 43850
distance_from_mean = sample_mean - hypothetical_population_mean
print "The sample mean is", distance_from_mean, "away from the population mean"
The sample mean is -1150 away from the population mean

Next we have the normalisation, which is fancy speak for measuring something. When we measure something we pick a unit. Here the unit is the standard deviation divided by the square root of the sample size (the standard error of the mean).
Dividing by the square root of the sample size gives us a smaller measuring unit for large samples. This is fair because the mean of a large sample varies less from sample to sample, so freak results become increasingly unlikely as the sample size grows.
In maths speak:
\[\frac{\sigma}{\sqrt{n}}\]
In [15]:
import math

normalisation_unit = sd/math.sqrt(sample_size)
factor = 2
nu_for_doubled_sample_size = sd/math.sqrt(sample_size*factor)

print "The normalisation unit is", normalisation_unit, "for sample size",sample_size
print "If you multiplied the sample size by", factor, "the unit would be", nu_for_doubled_sample_size
The normalisation unit is 1600.0 for sample size 25
If you multiplied the sample size by 2 the unit would be 1131.3708499

Standard Score (or Z-Score)

Finally, to measure, we divide our distance into units. Semi-formally this is:
\[\frac{\text{distance from mean}}{\text{normalisation unit}}\]
In maths speak:
\[\frac{\bar{x} - \mu}{\sigma/\sqrt{n}}\]
In [16]:
standard_score = float(distance_from_mean)/normalisation_unit
print "The standard score is", standard_score
The standard score is -0.71875
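To see how the measuring unit changes the result, here is a small sketch that recomputes the standard score for the sample mean of 43 850 km given in the problem statement, first with the real sample size of 25 and then with two made-up larger sample sizes. The function name `standard_score_for` and the sample sizes 100 and 400 are assumptions for illustration; only the 25-car sample is in the question.

```python
import math

def standard_score_for(sample_mean, population_mean, sd, sample_size):
    """(x-bar - mu) / (sigma / sqrt(n)), as in the formula above."""
    standard_error = sd / math.sqrt(sample_size)
    return (sample_mean - population_mean) / standard_error

# 25 is the real sample size; 100 and 400 are hypothetical
scores = {}
for n in [25, 100, 400]:
    scores[n] = standard_score_for(43850, 45000, 8000, n)
    print("n = {:3d}: standard score {:.5f}".format(n, scores[n]))
```

The same 1150 km shortfall counts for more standard-score units as the sample grows, because the unit shrinks with the square root of the sample size.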

Is it in the Rejection Region

Now we check our log tables for the value 0.025 (dividing 0.05 in half because we are using both tails). This gives a critical value of 1.96.
In [17]:
%aimport graph

ci = 1.96
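One way the comparison could be finished is sketched below. It is self-contained and uses only the `math` module, taking the sample mean of 43 850 km from the problem statement and the critical value of 1.96 from above. The names `p_value` and `reject_null` are introduced here for illustration, and the p-value is an extra that the log-table method does not need.

```python
import math

population_mean = 45000
sd = 8000
sample_size = 25
sample_mean = 43850     # from the problem statement
critical_value = 1.96   # two-tailed test at the 5% level

standard_score = (sample_mean - population_mean) / (sd / math.sqrt(sample_size))

# Two-tailed p-value: P(|Z| > |standard_score|) for a standard normal Z
p_value = math.erfc(abs(standard_score) / math.sqrt(2.0))

# Reject only if the score falls outside the interval [-1.96, 1.96]
reject_null = abs(standard_score) > critical_value

if reject_null:
    print("Reject the null hypothesis at the 5% level")
else:
    print("Do not reject the null hypothesis at the 5% level")
```

Under these figures the standard score lies inside the interval from -1.96 to 1.96, so the sample gives the company no evidence at the 5% level against the distributors' claim that SafeRun tyres have the same mean lifespan as Evertread tyres.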
