How to practically compare aspects of two distributions

Basic steps for the investigation

  1. state the hypothesis of interest: e.g. the coin of interest is biased ($p \ne 0.5$)
  2. define the null hypothesis: e.g. if the coin were unbiased / fair we would have $p = 0.5$
  3. choose a test statistic and compute the p-value

The core idea behind the p-value is to estimate how probable an observed aspect of a dataset is. That aspect is the test statistic, e.g. the share of heads in a series of coin tosses. To obtain the p-value we compute the test statistic for the observed dataset and for many perturbed versions of the dataset. We can then quantify how rare the test statistic computed on the observed, unperturbed dataset really is.

What I find takes a bit of getting used to is figuring out how to perturb the dataset properly; applying this to various practical situations can require quite a bit of experience.

In the first example below we want to check whether two coins deviate from one another in their fairness, so our test statistic is the difference of their shares of heads. Our null hypothesis is: the two coins are equally fair. To test this we perturb the dataset of tosses from both coins by repeatedly (n_tests times) mixing up their results and recomputing the test statistic. We do this because, if both coins are equally fair, the mixing up shouldn't matter. This way we can gauge where the test statistic of the original dataset lies within the distribution of test statistics from the shuffled datasets. If we want to know how much more biased towards heads coin 1 is than coin 0, we only need to compute the cumulative distribution function over the test statistics from the shuffles and read off the (left-tail) probability of the original test statistic. If we want to know about a general deviation in fairness, we need to consider both tails of the distribution of shuffled test statistics.
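The import preamble of the notebook is not shown in this section; the cells below assume roughly the following setup (a minimal sketch, the aliases np, pd, plt and stats are assumptions consistent with how they are used later):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
```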

Synthetic data

Discrete - Bernoulli $p$

Defining generative distributions for populations 0 and 1

pop0_gen = stats.bernoulli(p=.5)
pop1_gen = stats.bernoulli(p=.7)

Number of samples to generate for each population

n0 = 100
n1 = 120

Generating the populations

samples0 = pop0_gen.rvs(size=n0)
samples1 = pop1_gen.rvs(size=n1)
samples0[:5], samples1[:5]
(array([0, 1, 1, 0, 1]), array([0, 1, 1, 0, 1]))

We claim (null hypothesis) that the two coins pop0_gen and pop1_gen do not deviate from each other in their fairness → the test statistic is the difference of their sample means

get_mean_test_statistic[source]

get_mean_test_statistic(a:ndarray, b:ndarray)

test_statistic = get_mean_test_statistic(samples0, samples1); test_statistic
-0.22999999999999998
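The body of get_mean_test_statistic is only linked via [source]; a minimal sketch consistent with the output above (difference of the sample means, a minus b) could look like this:

```python
import numpy as np

def get_mean_test_statistic(a: np.ndarray, b: np.ndarray) -> float:
    # Difference of the sample means; a negative value means b has the larger mean
    return a.mean() - b.mean()
```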

To compute the p-value we need the distribution of test statistics over many bootstrapped (shuffled) versions of the dataset. n_tests defines the number of bootstrap iterations to perform

n_tests = 1000

get_boot_test_statistics[source]

get_boot_test_statistics(a:ndarray, b:ndarray, n_tests:int=100, test_statistic_fun:callable='get_mean_test_statistic', mode='merge+shuffle')

Computes bootstrapped test statistics using shuffling

%%time
test_statistic_boot = get_boot_test_statistics(samples0, samples1, n_tests=n_tests)
Wall time: 29 ms
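get_boot_test_statistics is also only linked; under the 'merge+shuffle' mode described above it presumably pools both samples, reshuffles the pool, splits it back into groups of the original sizes and recomputes the test statistic each time. A sketch along those lines (an assumption, not the actual implementation):

```python
import numpy as np

def get_boot_test_statistics(a, b, n_tests=100,
                             test_statistic_fun=get_mean_test_statistic,
                             mode='merge+shuffle'):
    # Pool both samples; each iteration shuffles the pool, splits it back into
    # groups of the original sizes and recomputes the test statistic.
    merged = np.concatenate([a, b])
    boot = np.empty(n_tests)
    for i in range(n_tests):
        np.random.shuffle(merged)
        boot[i] = test_statistic_fun(merged[:len(a)], merged[len(a):])
    return boot
```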

get_p_value[source]

get_p_value(x:ndarray, test_statistic:float)

Computes the p-value using x as generated from get_boot_test_statistics and the original test_statistic

%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
Wall time: 975 µs
0.00013503295330145114
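The body of get_p_value is not reproduced here either. Since the reported p-values are not multiples of 1/n_tests, it presumably evaluates a smooth fit to the bootstrapped statistics rather than counting exceedances; one possible sketch, assuming a two-sided normal approximation:

```python
import numpy as np
from scipy import stats

def get_p_value(x: np.ndarray, test_statistic: float) -> float:
    # Fit a normal distribution to the bootstrapped test statistics and return
    # the two-sided tail probability of the observed test statistic.
    mu, sigma = x.mean(), x.std()
    z = abs(test_statistic - mu) / sigma
    return 2 * stats.norm.sf(z)
```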

Visualizing where the observed test statistic lies relative to the bootstrapped distribution in order to judge its significance

plot_hist_with_kde_for_hypothesis_test[source]

plot_hist_with_kde_for_hypothesis_test(test_statistic_boot:ndarray, p_val:float, left_ext:float=0.5, right_ext:float=0.5)

ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()
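plot_hist_with_kde_for_hypothesis_test is likewise only linked; a rough sketch of such a plot (histogram of the bootstrapped statistics with a Gaussian KDE overlay, left_ext/right_ext widening the plotted range) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val,
                                           left_ext=.5, right_ext=.5):
    # Histogram of the bootstrapped statistics with a KDE overlay; the p-value
    # is shown in the title and left_ext/right_ext extend the x-range.
    lo = test_statistic_boot.min() - left_ext
    hi = test_statistic_boot.max() + right_ext
    xs = np.linspace(lo, hi, 200)
    kde = stats.gaussian_kde(test_statistic_boot)
    fig, ax = plt.subplots()
    ax.hist(test_statistic_boot, bins=30, density=True, alpha=.5, label="bootstraps")
    ax.plot(xs, kde(xs), label="KDE")
    ax.set(xlim=(lo, hi), title=f"p-value = {p_val:.3g}")
    ax.legend()
    return ax
```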

Discrete observables - $n$-sided dice $p$s

Comparing fairness of dice using the $\chi^2$ test statistic

pop0_gen = stats.rv_discrete(values=(np.arange(5), [.2, .2, .2, .2, .2]))
pop1_gen = stats.rv_discrete(values=(np.arange(5), [.2, .25, .15, .2, .2]))

n0 = 1000
n1 = 120

samples0 = pop0_gen.rvs(size=n0)
samples1 = pop1_gen.rvs(size=n1)
samples0[:5], samples1[:5]
(array([4, 3, 0, 3, 0]), array([0, 0, 0, 3, 3]))
pd.Series(samples0).value_counts().sort_index()
0    220
1    215
2    183
3    209
4    173
dtype: int64

get_chi2_test_statistic[source]

get_chi2_test_statistic(a:ndarray, b:ndarray)

b is the reference

test_statistic = get_chi2_test_statistic(samples1, samples0); test_statistic
0.007944368251113451
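The linked get_chi2_test_statistic is not reproduced here; given the small value above it likely compares category proportions rather than raw counts, with b providing the expected (reference) proportions. A hedged sketch under that assumption:

```python
import numpy as np

def get_chi2_test_statistic(a: np.ndarray, b: np.ndarray) -> float:
    # b is the reference: its category proportions serve as the expected values
    # against which the observed proportions in a are compared.
    cats = np.union1d(a, b)
    observed = np.array([(a == c).mean() for c in cats])
    expected = np.array([(b == c).mean() for c in cats])
    return np.sum((observed - expected) ** 2 / expected)
```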
%%time
test_statistic_boot = get_boot_test_statistics(samples1, samples0, n_tests=n_tests, 
                                               test_statistic_fun=get_chi2_test_statistic)
Wall time: 2.75 s
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
Wall time: 0 ns
0.4166074247279987
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()

Continuous - correlation

Testing how significant the Spearman rank correlation of two variables is

x_noise = stats.norm(loc=0, scale=1.)
y_noise = stats.norm(loc=0, scale=1.)

n = 100

samples0 = np.linspace(-1,1,n) + x_noise.rvs(size=n)
samples1 = samples0 + y_noise.rvs(size=n)
samples0[:5], samples1[:5]
(array([-1.44618272, -2.17776027, -0.63076426, -0.05349711, -2.01605591]),
 array([-3.42547504, -2.42894734, -0.85560307, -0.04791731, -1.3396359 ]))
plt.scatter(samples0, samples1)
plt.show()

get_spearmanr_test_statistic[source]

get_spearmanr_test_statistic(a:ndarray, b:ndarray)

Computes the Spearman rank correlation

test_statistic = get_spearmanr_test_statistic(samples1, samples0); test_statistic
0.773105310531053
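get_spearmanr_test_statistic presumably just wraps scipy; a minimal version (an assumption based on the name and docstring) would be:

```python
import numpy as np
from scipy import stats

def get_spearmanr_test_statistic(a: np.ndarray, b: np.ndarray) -> float:
    # Spearman rank correlation coefficient between a and b
    return stats.spearmanr(a, b).correlation
```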
%%time
test_statistic_boot = get_boot_test_statistics(samples1, samples0, n_tests=n_tests, 
                                               test_statistic_fun=get_spearmanr_test_statistic)
Wall time: 551 ms
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
Wall time: 0 ns
-4.440892098500626e-16
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()

NSFG data

Testing whether the weight of newborns of first-time mothers differs significantly from that of newborns of mothers with one or more previous children, using get_mean_test_statistic.

df = pd.read_csv(nsfg.clean_file_path)
df.head()
   caseid  prglngth  outcome  pregordr  birthord  birthwgt_lb  birthwgt_oz  agepreg      finalwgt  totalwgt_lb  totalwgt_kg
0       1        39        1         1       1.0          8.0         13.0    33.16   6448.271112       8.8125     3.997279
1       1        39        1         2       2.0          7.0         14.0    39.25   6448.271112       7.8750     3.572037
2       2        39        1         1       1.0          9.0          2.0    14.33  12999.542264       9.1250     4.139027
3       2        39        1         2       2.0          7.0          0.0    17.83  12999.542264       7.0000     3.175144
4       2        39        1         3       3.0          6.0          3.0    18.33  12999.542264       6.1875     2.806601

Grouping the samples into first babies (first) and second-or-later babies (other)

live = df.loc[df["outcome"]==1, :]
first, other = live.loc[live["pregordr"]==1, :], live.loc[live["pregordr"]!=1, :]
len(first), len(other)
(3368, 5780)

Collecting the weights and cleaning

col = "totalwgt_kg"  # prglngth = pregnancy length, totalwgt_kg = total weight of the newborn
samples0, samples1 = first[col].dropna().values, other[col].dropna().values  
[3.9972795 4.139027  3.8838815 3.4302895 3.5436875] [3.572037  3.175144  2.8066005 4.3374735 3.798833 ]
-0.047628365605402845

Computing the test statistic over the original dataset

test_statistic = get_mean_test_statistic(samples0, samples1); test_statistic
-0.047628365605402845

Shuffling the dataset and computing the bootstrapped test statistics

%%time
test_statistic_boot = get_boot_test_statistics(samples0, samples1, n_tests=n_tests)
Wall time: 26 ms
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
Wall time: 1.03 ms
1.0136336214827679e-13
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()