Hypothesis Testing in Python

This web-post aims to provide a practical guide to Hypothesis Testing in Python – testing for statistically significant differences between two situations or scenarios (it is more complicated if there are more than two).

This blog assumes little math or statistics background and doesn’t attempt to prove or use the math theorems directly – it is just a guide to using the pre-coded algorithms and functions to achieve standard day-to-day practical tasks.

Background and Theory

If we want to determine whether two independent outcomes have statistically significantly different mean results (allowing for some natural random variation) we can perform a T-Test (also referred to as Student's T-Test).  This tests the Null Hypothesis that there is no significant difference versus the Alternative Hypothesis that there is a difference in the means of the two sets of data.

  • The Null Hypothesis represents the status quo. This hypothesis is usually labeled, H_{0}. This is what we assume by default. In a practical sense, this would mean that there is no significant difference between the data-set being assessed and the general population.
  • The Alternative Hypothesis (or research hypothesis) is the claim we need evidence for in order to conclude that there is a significant difference between our comparison data-sets. This hypothesis is usually labeled H_{a} or sometimes H_{1} (or some other number other than 0).

There is a great video from Khan Academy that introduces Hypothesis Testing.

The convention for Hypothesis testing is to define a value alpha or \alpha which is the probability threshold for which we accept or reject the Null Hypothesis. This is also known as the “Significance Level”.  By convention a value of 0.05 (5%) is chosen for \alpha, although sometimes this can be set to \alpha = 0.01 if we want more certainty.

We then need to calculate the p-value, which is the probability of getting a sample at least as extreme as ours by chance (from the population we are comparing against), assuming the Null Hypothesis is true.

  • If the p-value is less than \alpha then we reject the Null Hypothesis, H_{0}, and assume that it is unlikely that the difference in the mean of our sample occurred by chance. There is a statistically significant difference.
  • If the p-value is greater than \alpha then we accept the Null Hypothesis (or at least, we cannot reject it) and assume there is a reasonable chance the variation in our test result could have occurred by chance.

There is also a related T-statistic value, which is not considered in this blog post.  This is a relative measure of difference between the two groups of data being assessed.

The background for this blog is mostly taken from Codecademy with some updates to the code and data-sets – https://www.codecademy.com/courses/learn-scipy-hypothesis-testing/lessons/hypothesis-testing/exercises/one-sample-t-test-ii

Part 1 – Use Python SciPy to test for Statistical Significance

The scipy.stats function ttest_1samp can be used to calculate a T-statistic and a p-value for a sample of data compared against a given population mean.

If we supply a mean value and a reference sample set of data, and the p-value is less than our chosen \alpha (e.g. 0.05), then we infer that our mean value is not likely to have occurred by chance for this sample population. There is a statistically significant difference.

SciPy ttest_1samp Description

ttest_1samp calculates a one-sample T-test.
This is a two-sided test for the null hypothesis that the expected value (mean) of a sample of independent observations a is equal to the given population mean.

The “Two Sided” aspect means that we consider both the upper and lower tails of the data-population distribution (i.e. not just the upper threshold). (https://www.itl.nist.gov/div898/handbook/eda/section3/eda353.htm)
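
If only one tail is of interest, more recent SciPy versions (1.6 and later) also accept an alternative keyword argument for one-sided tests. A minimal sketch, using a small made-up sample of ages:

from scipy.stats import ttest_1samp
import numpy as np

# small made-up sample of visitor ages (illustrative only)
sample = np.array([34.0, 41.0, 38.0, 45.0, 39.0, 42.0, 36.0, 40.0])

# default two-sided test: is the sample mean different from 30 in either direction?
print(ttest_1samp(sample, 30))
# one-sided test: is the sample mean greater than 30? (requires SciPy >= 1.6)
print(ttest_1samp(sample, 30, alternative='greater'))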

 

Example

from scipy.stats import ttest_1samp
import numpy as np   

#Source an "age of visitors" CSV - just a list of numbers separated by new-line
import urllib.request as urllib2
URL = "https://raw.githubusercontent.com/edbullen/Hypothesis/master/ages.csv"
r = urllib2.urlopen(URL)
ages = r.read().decode('utf-8')
#split the data by newline, convert list to np array float
ages = np.asarray(ages.split()).astype(float)

#Calculate the Mean of the ages
mu = np.mean(ages)
print("mean:", mu.round())
#Use scipy T-Test to find:
# + p-value for age of 30 given the "ages" distribution
# + p-value for age of 3 given the "ages" distribution
print("Age 30", ttest_1samp(ages,30))
print("Age 3", ttest_1samp(ages,3))
mean: 38.0
Age 30 Ttest_1sampResult(statistic=0.5973799001456603, pvalue=0.5605155888171379)
Age 3 Ttest_1sampResult(statistic=16.72663720407849, pvalue=3.574863924479613e-10)
 

The p-value is the second value – this is the estimated probability of getting a sample mean at least this far from the given value by chance, based on the sample set.  Generally, if we get a p-value of less than 0.05, we can reject the null hypothesis and state that there is a significant difference.

For the age of 30 we got a p-value that was much higher than 0.05, so we cannot reject the null hypothesis. This means a mean age of 30 could occur by chance for this population (even though our sample has a mean of 38).

For the other example, the age of 3, we got a very small p-value, so we reject the null hypothesis and accept the alternative hypothesis; this would be a very unusual age based on the original sample – if we had a set of visitors with an average age of 3, something special is going on.

Does this mean that if we wait for more visitors, the average age would definitely be 30 and not 38? Not necessarily. In fact, in this case, we know that the mean of our sample was 38.

P-values give us an idea of how confident we can be in a result.  Just because we don’t have enough data to detect a difference doesn’t mean that there isn’t one.

Part 2 – Two-Sample T-Tests

This section is based on
https://www.codecademy.com/courses/learn-scipy-hypothesis-testing/lessons/hypothesis-testing/exercises/two-sample-t-test

“Suppose that last week the average amount of time spent per visitor to a website was 25 minutes. This week, the average amount of time spent per visitor to a website was 28 minutes. Did the average time spent per visitor change? Or is this part of natural fluctuations?”

Use a Two-Sample T-Test to compare two independent data-sets that are approximately normally distributed.

To decide if the data-sets are roughly normally distributed, often a visual inspection is enough, as shown in the chart below. Alternatively the Shapiro-Wilk test can be used (a quick sketch follows the plot below): https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.shapiro.html


import numpy as np
import urllib.request as urllib2
import matplotlib.pyplot as plt

# Source two sets of data from web - not really CSVs !
# just a list of numbers separated by new-line character
URL1 = "https://raw.githubusercontent.com/edbullen/Hypothesis/master/week1.csv"
URL2 = "https://raw.githubusercontent.com/edbullen/Hypothesis/master/week2.csv"

r = urllib2.urlopen(URL1)
week1 = r.read().decode('utf-8')
r = urllib2.urlopen(URL2)
week2 = r.read().decode('utf-8')

#split the data by newline, convert list to np array float
week1 = np.asarray(week1.split()).astype(float)
week2 = np.asarray(week2.split()).astype(float)

# plot
plt.hist(week1, bins=20, color='b', alpha=0.5)
plt.hist(week2, bins=20, color='r', alpha=0.5)
plt.show()

(plot: overlaid histograms of the week1 and week2 data)
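
To go beyond a visual check, the Shapiro-Wilk test mentioned above can be applied to each sample. A quick sketch, re-using the week1 and week2 arrays loaded above (the null hypothesis is that the data is normally distributed, so a p-value above 0.05 gives no reason to doubt normality):

from scipy.stats import shapiro

# Shapiro-Wilk test for each sample: returns (W statistic, p-value)
for name, data in [("week1", week1), ("week2", week2)]:
    stat, p = shapiro(data)
    print(name, "W =", round(stat, 4), "p-value =", round(p, 4))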

Use SciPy’s  ttest_ind to obtain the p-value for two independent sets of data and determine the likelihood of them having the same mean:


from scipy.stats import ttest_ind  

week1_mean = week1.mean()
week2_mean = week2.mean()

print("Week 1 Sample Mean:", week1_mean)
print("Week 2 Sample Mean:", week2_mean)

# ttest_ind returns a result tuple: (t-statistic, p-value)
pval = ttest_ind(week1, week2)
print(pval)

Week 1 Sample Mean: 25.4480593952
Week 2 Sample Mean: 29.0215681076
Ttest_indResult(statistic=-3.510875818698744, pvalue=0.000676767690454633)

If we assume \alpha is 0.05 and we get pvalue=0.0006767..., this means we reject the null hypothesis and conclude that the two samples come from distributions with significantly different means.


if pval[1] < 0.05:
    print("pval is ", pval[1], "Null Hypothesis Rejected")
    print("samples are statistically different")
else:
    print("pval is ", pval[1], "Null Hypothesis Accepted")
    print("samples are NOT statistically different")


pval is 0.000676767690454633 Null Hypothesis Rejected
samples are statistically different

Dangers of Multiple T-Tests

(taken from Codecademy) Suppose that we own a chain of stores that sell ants, called VeryAnts. There are three different locations: A, B, and C. We want to know if the average ant sales over the past year are significantly different between the three locations.

At first, it seems that we could perform T-tests between each pair of stores.

We know that the significance level (0.05) is the probability that we incorrectly reject the null hypothesis (a false positive, or Type I error) on each t-test. The more t-tests we perform, the more likely we are to get at least one false positive.
For a significance level of 0.05, if the null hypothesis is true then the probability of correctly not obtaining a significant result is 1 – 0.05 = 0.95

When we run another t-test, the probability of still getting no false positives is

0.95 * 0.95 = 0.9025

That means our probability of making at least one error is now close to 10%! This error probability only gets bigger the more t-tests we do.
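
A quick illustration of how the chance of at least one false positive grows with the number of (independent) tests at \alpha = 0.05:

alpha = 0.05
for num_tests in [1, 2, 3, 5, 10]:
    # probability of at least one Type I error across num_tests independent tests
    p_error = 1 - (1 - alpha) ** num_tests
    print(num_tests, "tests -> P(at least one false positive) =", round(p_error, 3))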

Part 3 – Anova Testing for Multiple Data Sets

When comparing more than two numerical datasets, the best way to preserve a Type I  error probability of 0.05 is to use ANOVA. ANOVA (Analysis of Variance) tests the null hypothesis that all of the datasets have the same mean. If we reject the null hypothesis with ANOVA, we’re saying that at least one of the sets has a different mean (from a statistical Significance perspective); however, it does not tell us which datasets are different.

Use the SciPy function f_oneway to perform ANOVA on multiple datasets. It takes in each dataset as a separate input and returns the F-statistic and the p-value.
E.g.:
fstat, pval = f_oneway(scores_mathematicians, scores_writers, scores_psychologists)
returns the p-value for the null hypothesis that all of the data-sets have the same mean (a small p-value indicating that at least one of them is statistically different).
However, after using only ANOVA, we can’t make any conclusions on which two populations have a significant difference.

“The one-way ANOVA tests the null hypothesis that two or more groups have the same population mean.”

A one-way test identifies whether at least two groups are different from each other, but doesn't identify which groups. It is used when there is one independent variable and one dependent variable.

A two-way test can be used when there are two independent variables (factors) – e.g. Store and Hour-of-Day; a rough sketch follows below.
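
The two-way case is not covered further in this post, but as a rough sketch it could be run with the statsmodels library; the sales/store/hour values below are made up purely for illustration:

import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# hypothetical sales observations with two factors: store and hour-of-day
sales_df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B', 'C', 'C'] * 4,
    'hour':  [9, 17] * 12,
    'sales': [55, 60, 52, 58, 61, 66, 54, 59, 50, 57, 60, 68,
              53, 62, 51, 56, 63, 65, 56, 61, 49, 55, 62, 67],
})

# fit a linear model with both categorical factors, then build the two-way ANOVA table
model = ols('sales ~ C(store) + C(hour)', data=sales_df).fit()
print(anova_lm(model, typ=2))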

Anova Test Assumptions

  • The samples are independent.
  • Each sample is from a normally distributed population.
  • The population standard deviations of the groups are all equal. This property is known as homoscedasticity.

Consider the “Kruskal-Wallis H-test” if these assumptions are not valid – this is a non-parametric alternative to the one-way ANOVA (a sketch follows below).
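
A minimal sketch of how that might look, using three made-up (non-normal) samples:

from scipy.stats import kruskal
import numpy as np

# three made-up, skewed (non-normal) samples
np.random.seed(0)
a = np.random.exponential(scale=2.0, size=100)
b = np.random.exponential(scale=2.0, size=100)
c = np.random.exponential(scale=2.5, size=100)

# Kruskal-Wallis H-test: non-parametric alternative to the one-way ANOVA
stat, p = kruskal(a, b, c)
print("H =", round(stat, 3), "p-value =", round(p, 5))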

3.1 Example – One Way Anova Test

Perform an ANOVA test for Daily Sales on Stores a, b, and c and store the p-value in a variable called pval.

First, load some test data-sets for three different stores and visualise them:


import numpy as np
import seaborn as sns # Seaborn has some nice looking plots
import pandas as pd
import urllib.request as urllib2

# Source data from web - not really CSVs - just a list separated by new-line
URLA = "https://raw.githubusercontent.com/edbullen/Hypothesis/master/storeA.csv"
URLB = "https://raw.githubusercontent.com/edbullen/Hypothesis/master/storeB.csv"
URLC = "https://raw.githubusercontent.com/edbullen/Hypothesis/master/storeC.csv"

A = urllib2.urlopen(URLA).read().decode('utf-8')
B = urllib2.urlopen(URLB).read().decode('utf-8')
C = urllib2.urlopen(URLC).read().decode('utf-8')

#split the data by newline, convert list to np array float
A = np.asarray(A.split()).astype(float)
B = np.asarray(B.split()).astype(float)
C = np.asarray(C.split()).astype(float)

# Create a Pandas Data Frame with Store A,B,C as the columns
df = pd.DataFrame({'StoreA':A, 'StoreB':B, 'StoreC':C})
# Have to "melt" the data-frame for Seaborn
sns.boxplot(x="variable", y="value", data=pd.melt(df))

(plot: box-plot of daily sales for StoreA, StoreB and StoreC)


print("Store A Mean:", A.mean().round(3))
print("Store B Mean:", B.mean().round(3))
print("Store C Mean:", C.mean().round(3))


Store A Mean: 58.35
Store B Mean: 55.754
Store C Mean: 56.767

Next, run a one-way ANOVA test and get the p-value:


from scipy.stats import f_oneway

pval = f_oneway(A,B,C)[1]
print(pval.round(5))


0.32132

As the p-value is greater than 0.05 we cannot reject the Null Hypothesis, so we assume there is no significant difference in sales between stores, even though the mean values are slightly different.

Although visually on the box-plot it looks like there is a difference, this is probably just natural random variation.

Now – load some new data after a Sales Campaign is run for Store C.


URLC_NEW = "https://raw.githubusercontent.com/edbullen/Hypothesis/master/storeC_NEW.csv"
C_NEW = urllib2.urlopen(URLC_NEW).read().decode('utf-8')
C_NEW = np.asarray(C_NEW.split()).astype(float)
# Create a Pandas Data Frame with Store A,B,C as the columns
df = pd.DataFrame({'StoreA':A, 'StoreB':B, 'StoreC':C_NEW})

#plot
sns.boxplot(x="variable", y="value", data=pd.melt(df))

(plot: box-plot of daily sales for StoreA, StoreB and the new StoreC data)


from scipy.stats import f_oneway

pval = f_oneway(A,B,C_NEW)[1]

print(pval.round(5))


0.01859

So if our \alpha is 0.05 then we reject the Null Hypothesis and say that it is unlikely this variation in the mean occurred by chance, as 0.01859 is less than 0.05.

If, however, we had chosen our \alpha threshold to be 0.01 (1%) then we would still accept the null hypothesis and say there is still a reasonable statistical likelihood that this variation in the mean occurred by chance. This is because 0.01859 is greater than 0.01.

This illustrates the limitation of using the p-value test to determine statistical significance.

3.2 Sample Size Effect for Determining Significant Variation

Broadly, the more samples we have from a population, the more accurately we can determine the true mean.

This is because the Sample Mean, \bar{X}, is an approximation to the Actual (population) Mean, \mu, with a margin of error given approximately (at roughly the 95% confidence level) by the following formula:

\mu = \bar{X} \pm \frac { 2\sigma } {\sqrt{n}}

\sigma  is the Standard Deviation (square root of the variance).

This means as n  increases (the sample size) the \pm  variation size decreases and we are closer to the actual mean – which intuitively makes sense.
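
For example, taking an arbitrary \sigma of 10, the \pm margin shrinks quickly as n grows:

import numpy as np

sigma = 10.0   # illustrative population standard deviation
for n in [50, 500, 5000]:
    # approximate 95% margin of error for the sample mean
    margin = 2 * sigma / np.sqrt(n)
    print("n =", n, " margin = +/-", round(margin, 2))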

This can be demonstrated with a simple simulation from our existing data-set. In the example below we decrease n  to 50 by taking a random sample of 50 items from each of the store data sets.


np.random.seed(1)
storeAsample = np.random.choice(A, size=50, replace=False)
storeBsample = np.random.choice(B, size=50, replace=False)
storeCsample = np.random.choice(C_NEW, size=50, replace=False)  # sample from the post-campaign StoreC data

Visually compare the distribution of the original StoreA sales and the random sample of 50 sales from the StoreA data-set:


import matplotlib.pyplot as plt
plt.hist(A, bins=20, alpha=0.5, color='b')
plt.hist(storeAsample, bins=20,alpha=0.5, color='r')
leg = ['StoreA', 'StoreA Sample']
plt.legend(leg)
plt.show()

(plot: histograms of the full StoreA data and the 50-item StoreA sample)


# Create a Pandas df with Store A,B,C as cols
df = pd.DataFrame({'StoreAsample':storeAsample, 'StoreBsample':storeBsample, 'StoreCsample':storeCsample})

#plot
sns.boxplot(x="variable", y="value", data=pd.melt(df))

(plot: box-plot of the 50-item samples from StoreA, StoreB and StoreC)


pval = f_oneway(storeAsample,storeBsample,storeCsample)[1]
print(pval.round(5))


0.47443

Conclusion

With the smaller sample of data, although we can see that StoreC still has the highest mean sales, we can no longer detect a statistically significant difference (the p-value is greater than 0.05).

Part 4 – Tukey Range Tests

A Tukey Range Test can be used to find out which data-set is different if we have determined that one of our data-sets has a significant difference (i.e. by using ANOVA to test for a significant difference in mean).

If we feed in a set of data-sets that are independent and normally distributed, Tukey’s test can identify which pairs of sets are different from each other.

The function to perform Tukey’s Range Test is pairwise_tukeyhsd, which is found in statsmodels, not SciPy.

Provide the function with a single combined list of data values and a matching list of labels identifying which sample each value belongs to.

Also, provide the significance level \alpha that is required, e.g. 0.05.


#re-use the Store Data from the previous example
#Store C prepared so that it is significantly different
storeA = A
storeB = B
storeC = C_NEW

from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Concatenate the data into a single list / vector
vec = np.concatenate((storeA, storeB, storeC),axis=0)
# Create a vector of labels aligned to our data
labels = ['A'] * len(storeA) + ['B'] * len(storeB) + ['C'] * len(storeC)

tukey_results = pairwise_tukeyhsd(vec, labels, 0.05)
print(tukey_results)


Multiple Comparison of Means - Tukey HSD, FWER=0.05
===================================================
group1 group2 meandiff   lower   upper  reject
---------------------------------------------------
     A      B  -2.5955 -6.7976  1.6066   False
     A      C   2.4712  -1.731  6.6733   False
     B      C   5.0667  0.8645  9.2688    True
---------------------------------------------------

This tells us we can reject the Null Hypothesis for Store C vs Store B – there is a significant difference between those two stores.

Part 5 – Binomial Test

If we need to decide whether a count of binary outcomes shows a significant variation from the expected outcome, we can use a Binomial Test.

A Binomial Test compares a categorical dataset to some expectation – for example

  • Comparing the actual number of heads from 1000 coin flips of a weighted coin to the expected number of heads
  • Comparing the actual percentage of respondents who gave a certain survey response to the expected survey response

The null hypothesis is that there is no (statistically significant) difference between the observed behavior and the expected behavior. If we get a p-value of less than 0.05, we can reject that hypothesis and determine that there is a difference between the observation and expectation.

The SciPy function for Binomial Testing is binom_test. This requires 3 inputs:

  • the number of observed successes / positive outcomes
  • the total number of trials / events
  • the expected probability of success / positive outcome

Binomial Test Example 1 – is the coin weighted?

Throw a coin 1000 times; we expect to get a head 50% of the time, i.e. 500 positive outcomes.

Do we think the coin is weighted if we get 650 heads?


from scipy.stats import binom_test

pval_exactly_50pct = binom_test(500,n=1000,p=0.5)
pval_650_heads = binom_test(650,n=1000,p=0.5)
print(pval_exactly_50pct)
print(pval_650_heads)


1.0
1.6156310386976815e-21

…so that is a resounding YES – we reject the NULL hypothesis for 650 heads out of 1000 … we think there is something wrong with the coin!

Now, reduce the number of coin-tosses:


pval = binom_test(7,n=10,p=0.5)

print(pval)

0.3437499999999999

…so even if we get 7 heads out of 10 throws, there is nothing to be suspicious about.
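
Note that in recent SciPy releases binom_test has been deprecated in favour of scipy.stats.binomtest, which takes the same inputs but returns a result object. A minimal sketch of the equivalent call (requires SciPy 1.7 or later):

from scipy.stats import binomtest

# same question as above: 650 heads from 1000 tosses of a supposedly fair coin
result = binomtest(650, n=1000, p=0.5)
print(result.pvalue)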

Binomial Test Example 2 – Customer Purchases after visiting Web-Site

This is taken from CodeCademy:

Imagine that we are analyzing the percentage of customers who make a purchase after visiting a website. We have a set of 1000 customers from this month, 58 of whom made a purchase.

Over the past year, the number of visitors per every 1000 who make a purchase hovers consistently at around 72. Thus, our marketing department has set our target number of purchases per 1000 visits to be 72.

We would like to know if this month’s number, 58, is a significant difference from that target or a result of natural fluctuations.

n = 1000
expected_pct = 72/1000
actual = 58

pval = binom_test(actual,n=n, p=expected_pct)
print(pval.round(5))
0.09813

As there is a greater than 5% likelihood of this variation occurring by chance, we accept the Null Hypothesis and assume no significant difference.

Part 6 – Chi-Square Test for Differences Between Groups with Multiple Factors

The Pearson Chi-Square test can be used to test for significant differences between groups where multiple factors are involved in the outcome – e.g. a comparison of survey responses (“A” vs “B”) for people who have each been given one of two sets of information (such as “true” info and “false” info).

The Chi-Square test takes a contingency table as input, similar to the following:

      | A   |  B
-----------------
true  | 30  | 10
false | 20  | 20

… so it looks like giving people false information skews the result by 10/40 or 25% in this example.

The Contingency Table can have as many columns and rows as necessary. In the examples in this post:

  • Rows = Conditions (the groups being compared)
  • Columns = Outcomes (the measured responses)

(The orientation doesn’t affect the result of the test.)
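
If the raw observations are in a DataFrame rather than already counted, a contingency table like this can be built with pandas. A sketch, with made-up survey data:

import pandas as pd

# made-up raw survey data: one row per respondent
survey = pd.DataFrame({
    'info_given': ['true', 'true', 'false', 'false', 'true', 'false'],
    'response':   ['A',    'B',    'A',     'B',     'A',    'B'],
})

# rows = the condition (information given), columns = the outcome (response)
table = pd.crosstab(survey['info_given'], survey['response'])
print(table)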

Use the SciPy Stats chi2_contingency function to perform a Chi Square test:


from scipy.stats import chi2_contingency

X = [[30, 10],
     [20, 20]]

chi2, pval, dof, expected = chi2_contingency(X)
print(pval.round(5))


0.03767

As the p-value (0.03767) is just below 0.05, we can reject the Null Hypothesis at the 5% significance level – although the evidence is marginal, and at a stricter \alpha of 0.01 we could not reject it and would say this variation could have occurred by chance.

What if we got the same ratio difference for a larger sample?


from scipy.stats import chi2_contingency

X = [[300, 100],
     [200, 200]]

chi2, pval, dof, expected = chi2_contingency(X)
print(pval)


4.832154459214226e-13

…we can be pretty sure that this is a significant difference and we definitely Reject the Null Hypothesis.

Chi-Square Example from CodeCademy

The management at the VeryAnts ant store wants to know if their two most popular species of ants, the Leaf Cutter and the Harvester, vary in popularity between 1st, 2nd, and 3rd graders.

We have created a table representing the different ants bought by the children in grades 1, 2, and 3 after the last big field trip to VeryAnts. Run the code to see what happens when we enter this table into SciPy’s chi-square test.

Does the resulting p-value mean that we should reject or accept the null hypothesis?


from scipy.stats import chi2_contingency

# Contingency table
#        | harvester | leaf cutter
# -------+-----------+------------
# 1st gr |    30     |     10
# 2nd gr |    35     |      5
# 3rd gr |    28     |     12

X = [[30, 10],
     [35, 5],
     [28, 12]]
chi2, pval, dof, expected = chi2_contingency(X)
print(pval)


0.15508230807673704

On the basis of this we accept the Null Hypothesis – we tell management that there is no significant difference in the popularity of these ants between 1st, 2nd and 3rd grade pupils.
