Basic steps for the investigation
- state the claim to investigate: e.g. the coin of interest is biased ($p \ne 0.5$)
- define the null hypothesis: e.g. if the coin were unbiased / fair we would have $p = 0.5$
- choose a test statistic and compute its p-value (a minimal worked example follows below)
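To make these steps concrete, here is a minimal worked example (my addition, assuming scipy is available): suppose we observe 60 heads in 100 tosses. scipy's exact binomial test gives the two-sided p-value directly; the rest of this notebook computes such p-values by resampling instead.
from scipy import stats
# observed: 60 heads in 100 tosses; null hypothesis: the coin is fair (p = 0.5)
result = stats.binomtest(k=60, n=100, p=0.5, alternative="two-sided")
result.pvalue  # ~0.057: not quite enough evidence at the 5% level to call the coin biased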
The core of the p-value is to estimate how probable an aspect of an observed dataset is. This aspect is the test statistic, e.g. the mean number of heads in a series of coin tosses. To obtain the p-value we compute the test statistic for the observed dataset and for many perturbed versions of the dataset. Then we can quantify how rare the value of the test statistic computed for the observed, unperturbed dataset really is.
What I find takes a bit of getting used to is figuring out how to perturb the dataset properly; applying this to various practical situations can require quite a bit of experience. In the first example below we want to verify if two coins deviate from one another in their fairness, so our test statistic is the difference in their shares of obtained heads. We define our null hypothesis as: the two coins are equally fair. To verify this we perturb our dataset of the coin tosses of both coins by repeatedly, n_tests times, mixing their results up and computing the test statistic. We do this because if both coins are equally fair, the mixing up shouldn't matter. This way we can gauge where the test statistic of the original dataset lies in the distribution of test statistics from the mixed-up datasets. If we want to know how much more biased towards heads coin 1 is than coin 0, we only need to compute the cumulative distribution function over the test statistics from the mix-ups and find the probability (left tail only) for the test statistic of the original dataset. If we want to know about a general deviation in fairness, we need to consider both sides of the distribution of the test statistics from the mix-ups.
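The helpers get_mean_test_statistic, get_boot_test_statistics and get_p_value are used below but defined elsewhere in the notebook. A minimal sketch of a permutation-test implementation consistent with their usage here (signatures inferred; the actual implementation may differ) could look like:
import numpy as np

def get_mean_test_statistic(samples0, samples1):
    # difference of the sample means, e.g. difference in the shares of heads
    return np.mean(samples1) - np.mean(samples0)

def get_boot_test_statistics(samples0, samples1, n_tests=1000,
                             test_statistic_fun=get_mean_test_statistic):
    # pool both samples, then repeatedly shuffle the pool and re-split it at the
    # original sizes; under the null hypothesis the assignment to a coin is arbitrary
    rng = np.random.default_rng()
    pooled = np.concatenate([samples0, samples1])
    n0 = len(samples0)
    boot = np.empty(n_tests)
    for i in range(n_tests):
        shuffled = rng.permutation(pooled)
        boot[i] = test_statistic_fun(shuffled[:n0], shuffled[n0:])
    return boot

def get_p_value(test_statistic_boot, test_statistic):
    # one common two-sided variant: the share of mixed-up statistics
    # at least as extreme as the observed one
    return np.mean(np.abs(test_statistic_boot) >= np.abs(test_statistic))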
Synthetic data
Discrete - Bernoulli $p$
Defining generative distributions for population 0 and 1
pop0_gen = stats.bernoulli(p=.5)
pop1_gen = stats.bernoulli(p=.7)
Number of samples to generate for each population
n0 = 100
n1 = 120
Generating the populations
samples0 = pop0_gen.rvs(size=n0)
samples1 = pop1_gen.rvs(size=n1)
samples0[:5], samples1[:5]
We claim (null hypothesis) that the two coins pop0_gen and pop1_gen do not deviate from each other in their fairness → the test statistic is the difference of their means
test_statistic = get_mean_test_statistic(samples0, samples1); test_statistic
To compute the p-value we need the distribution of test statistics over the bootstrapped (mixed-up) datasets. n_tests defines the number of bootstrap iterations to perform
n_tests = 1000
%%time
test_statistic_boot = get_boot_test_statistics(samples0, samples1, n_tests=n_tests)
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
Visualizing where the observed test statistic lies relative to the bootstrap distribution, to evaluate its significance
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()
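As a sanity check (my addition, not part of the original analysis), the same question can be put to scipy's exact test on the 2x2 contingency table of heads/tails counts; its p-value should be in the same ballpark as the permutation-based one.
# 2x2 contingency table: rows = coins, columns = (heads, tails) counts
table = [[samples0.sum(), n0 - samples0.sum()],
         [samples1.sum(), n1 - samples1.sum()]]
_, p_val_fisher = stats.fisher_exact(table)
p_val_fisher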
Discrete observables - $n$-sided dice $p$s
Comparing fairness of dice using the $\chi^2$ test statistic
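get_chi2_test_statistic is defined elsewhere in the notebook; one plausible sketch (my assumption: the face counts of the first sample as observed, the rescaled face counts of the second sample as expected, and every face appearing at least once in the reference sample) is:
def get_chi2_test_statistic(samples_obs, samples_exp):
    # hypothetical reconstruction: chi^2 distance between the face counts of
    # the observed sample and the rescaled counts of the reference sample
    faces = np.union1d(samples_obs, samples_exp)
    obs = pd.Series(samples_obs).value_counts().reindex(faces, fill_value=0)
    exp = pd.Series(samples_exp).value_counts().reindex(faces, fill_value=0)
    exp = exp / exp.sum() * obs.sum()  # rescale to the observed sample size
    return (((obs - exp) ** 2) / exp).sum()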
pop0_gen = stats.rv_discrete(values=(np.arange(5), [.2, .2, .2, .2, .2]))
pop1_gen = stats.rv_discrete(values=(np.arange(5), [.2, .25, .15, .2, .2]))
n0 = 1000
n1 = 120
samples0 = pop0_gen.rvs(size=n0)
samples1 = pop1_gen.rvs(size=n1)
samples0[:5], samples1[:5]
pd.Series(samples0).value_counts().sort_index()
test_statistic = get_chi2_test_statistic(samples1, samples0); test_statistic
%%time
test_statistic_boot = get_boot_test_statistics(samples1, samples0, n_tests=n_tests,
test_statistic_fun=get_chi2_test_statistic)
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()
Continuous - correlation
Testing how significant the Spearman rank correlation of two variables is
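get_spearmanr_test_statistic is again defined elsewhere in the notebook; a minimal sketch (assumed) simply returns the Spearman rank correlation coefficient from scipy:
def get_spearmanr_test_statistic(samples0, samples1):
    # hypothetical reconstruction: Spearman rank correlation coefficient only
    corr, _ = stats.spearmanr(samples0, samples1)
    return corr
Note that the pool-and-split shuffling destroys the pairing between the two variables, which is exactly what makes the correlation vanish under the null hypothesis.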
x_noise = stats.norm(loc=0, scale=1.)
y_noise = stats.norm(loc=0, scale=1.)
n = 100
samples0 = np.linspace(-1, 1, n) + x_noise.rvs(size=n)
samples1 = samples0 + y_noise.rvs(size=n)
samples0[:5], samples1[:5]
plt.scatter(samples0, samples1)
plt.show()
test_statistic = get_spearmanr_test_statistic(samples1, samples0); test_statistic
%%time
test_statistic_boot = get_boot_test_statistics(samples1, samples0, n_tests=n_tests,
test_statistic_fun=get_spearmanr_test_statistic)
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()
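For comparison (my addition, not in the original notebook), scipy's spearmanr also reports an analytic p-value that should roughly agree with the permutation-based one:
corr, p_val_scipy = stats.spearmanr(samples0, samples1)
corr, p_val_scipy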
NSFG data
Computing whether the weight of newborns to first-time mothers is significantly different from that of newborns to mothers with 1+ children, using get_mean_test_statistic.
df = pd.read_csv(nsfg.clean_file_path)
df.head()
Grouping samples into first newborn (first) and 2nd-or-higher baby count (other)
live = df.loc[df["outcome"]==1, :]
first, other = live.loc[live["pregordr"]==1, :], live.loc[live["pregordr"]!=1, :]
len(first), len(other)
Collecting the weights and cleaning
col = "totalwgt_kg" # prglngth = pregnancy length, totalwgt_kg = total weight of the newborn
samples0, samples1 = first[col].dropna().values, other[col].dropna().values
Computing the test statistic over the original dataset
test_statistic = get_mean_test_statistic(samples0, samples1); test_statistic
Mixing up the dataset and computing the test statistics
%%time
test_statistic_boot = get_boot_test_statistics(samples0, samples1, n_tests=n_tests)
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()
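As a final sanity check (my addition, not part of the original analysis), a Welch t-test on the two weight samples should point in the same direction as the permutation p-value:
t_stat, p_val_ttest = stats.ttest_ind(samples0, samples1, equal_var=False)
t_stat, p_val_ttest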