How to practically compare aspects of two distributions

Basic steps for the investigation

  1. state the hypothesis of interest: e.g. the coin of interest is biased ($p \ne 0.5$)
  2. define the null hypothesis: e.g. if the coin were unbiased / fair we would have $p = 0.5$
  3. choose a test statistic and compute the p-value

The core idea behind the p-value is to estimate how probable an observed aspect of a dataset is. That aspect is the test statistic, e.g. the share of heads in a series of coin tosses. To obtain the p-value we compute the test statistic for the observed dataset and for many perturbed versions of the dataset. We can then quantify how rare the test statistic computed on the observed, unperturbed dataset really is.

What I find takes a bit of getting used to is figuring out how to perturb the dataset properly; applying this to various practical situations can require quite a bit of experience.

In the first example below we want to check whether two coins deviate from one another in their fairness, so our test statistic is the difference of their shares of heads. Our null hypothesis is: the two coins are equally fair. To test this we perturb the dataset of tosses from both coins by repeatedly (n_tests times) mixing up their results and recomputing the test statistic. We do this because, if both coins are equally fair, the mixing up shouldn't matter. This way we can gauge where the test statistic of the original dataset lies within the distribution of test statistics from the shuffled datasets. If we want to know how much more biased towards heads coin 1 is than coin 0, we only need to compute the cumulative distribution function over the test statistics from the shuffles and read off the (left-tail) probability of the original test statistic. If we want to know about a general deviation in fairness, we need to consider both tails of the distribution of shuffled test statistics.
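The import preamble of the notebook is not shown in this section; the cells below assume roughly the following setup (a minimal sketch, the aliases np, pd, plt and stats are assumptions consistent with how they are used later):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
```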

Synthetic data

Discrete - Bernoulli $p$

Defining generative distributions for populations 0 and 1

pop0_gen = stats.bernoulli(p=.5)
pop1_gen = stats.bernoulli(p=.7)

Number of samples to generate for each population

n0 = 100
n1 = 120

Generating the populations

samples0 = pop0_gen.rvs(size=n0)
samples1 = pop1_gen.rvs(size=n1)
samples0[:5], samples1[:5]
(array([0, 1, 1, 0, 1]), array([0, 1, 1, 0, 1]))

We claim (null hypothesis) that the two coins pop0_gen and pop1_gen do not deviate from each other in their fairness → the test statistic is the difference of their sample means

get_mean_test_statistic[source]

get_mean_test_statistic(a:ndarray, b:ndarray)

test_statistic = get_mean_test_statistic(samples0, samples1); test_statistic
-0.22999999999999998
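The body of get_mean_test_statistic is only linked via [source]; a minimal sketch consistent with the output above (difference of the sample means, a minus b) could look like this:

```python
import numpy as np

def get_mean_test_statistic(a: np.ndarray, b: np.ndarray) -> float:
    # Difference of the sample means; a negative value means b has the larger mean
    return a.mean() - b.mean()
```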

To compute the p-value we need the distribution of test statistics over many bootstrapped (shuffled) versions of the dataset. n_tests defines the number of bootstrap iterations to perform

n_tests = 1000

get_boot_test_statistics[source]

get_boot_test_statistics(a:ndarray, b:ndarray, n_tests:int=100, test_statistic_fun:callable='get_mean_test_statistic', mode='merge+shuffle')

Computes bootstrapped test statistics using shuffling

%%time
test_statistic_boot = get_boot_test_statistics(samples0, samples1, n_tests=n_tests)
Wall time: 29 ms
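get_boot_test_statistics is also only linked; under the 'merge+shuffle' mode described above it presumably pools both samples, reshuffles the pool, splits it back into groups of the original sizes and recomputes the test statistic each time. A sketch along those lines (an assumption, not the actual implementation):

```python
import numpy as np

def get_boot_test_statistics(a, b, n_tests=100,
                             test_statistic_fun=get_mean_test_statistic,
                             mode='merge+shuffle'):
    # Pool both samples; each iteration shuffles the pool, splits it back into
    # groups of the original sizes and recomputes the test statistic.
    merged = np.concatenate([a, b])
    boot = np.empty(n_tests)
    for i in range(n_tests):
        np.random.shuffle(merged)
        boot[i] = test_statistic_fun(merged[:len(a)], merged[len(a):])
    return boot
```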

get_p_value[source]

get_p_value(x:ndarray, test_statistic:float)

Computes the p-value using x as generated from get_boot_test_statistics and the original test_statistic

%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
Wall time: 975 µs
0.00013503295330145114
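The body of get_p_value is not reproduced here either. Since the reported p-values are not multiples of 1/n_tests, it presumably evaluates a smooth fit to the bootstrapped statistics rather than counting exceedances; one possible sketch, assuming a two-sided normal approximation:

```python
import numpy as np
from scipy import stats

def get_p_value(x: np.ndarray, test_statistic: float) -> float:
    # Fit a normal distribution to the bootstrapped test statistics and return
    # the two-sided tail probability of the observed test statistic.
    mu, sigma = x.mean(), x.std()
    z = abs(test_statistic - mu) / sigma
    return 2 * stats.norm.sf(z)
```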

Visualizing where the observed test statistic lies relative to the bootstrapped distribution in order to judge its significance

plot_hist_with_kde_for_hypothesis_test[source]

plot_hist_with_kde_for_hypothesis_test(test_statistic_boot:ndarray, p_val:float, left_ext:float=0.5, right_ext:float=0.5)

ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()
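plot_hist_with_kde_for_hypothesis_test is likewise only linked; a rough sketch of such a plot (histogram of the bootstrapped statistics with a Gaussian KDE overlay, left_ext/right_ext widening the plotted range) might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val,
                                           left_ext=.5, right_ext=.5):
    # Histogram of the bootstrapped statistics with a KDE overlay; the p-value
    # is shown in the title and left_ext/right_ext extend the x-range.
    lo = test_statistic_boot.min() - left_ext
    hi = test_statistic_boot.max() + right_ext
    xs = np.linspace(lo, hi, 200)
    kde = stats.gaussian_kde(test_statistic_boot)
    fig, ax = plt.subplots()
    ax.hist(test_statistic_boot, bins=30, density=True, alpha=.5, label="bootstraps")
    ax.plot(xs, kde(xs), label="KDE")
    ax.set(xlim=(lo, hi), title=f"p-value = {p_val:.3g}")
    ax.legend()
    return ax
```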

Discrete observables - $n$-sided dice $p$s

Comparing fairness of dice using the $\chi^2$ test statistic

pop0_gen = stats.rv_discrete(values=(np.arange(5), [.2, .2, .2, .2, .2]))
pop1_gen = stats.rv_discrete(values=(np.arange(5), [.2, .25, .15, .2, .2]))

n0 = 1000
n1 = 120

samples0 = pop0_gen.rvs(size=n0)
samples1 = pop1_gen.rvs(size=n1)
samples0[:5], samples1[:5]
(array([4, 3, 0, 3, 0]), array([0, 0, 0, 3, 3]))
pd.Series(samples0).value_counts().sort_index()
0    220
1    215
2    183
3    209
4    173
dtype: int64

get_chi2_test_statistic[source]

get_chi2_test_statistic(a:ndarray, b:ndarray)

b is the reference

test_statistic = get_chi2_test_statistic(samples1, samples0); test_statistic
0.007944368251113451
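The linked get_chi2_test_statistic is not reproduced here; given the small value above it likely compares category proportions rather than raw counts, with b providing the expected (reference) proportions. A hedged sketch under that assumption:

```python
import numpy as np

def get_chi2_test_statistic(a: np.ndarray, b: np.ndarray) -> float:
    # b is the reference: its category proportions serve as the expected values
    # against which the observed proportions in a are compared.
    cats = np.union1d(a, b)
    observed = np.array([(a == c).mean() for c in cats])
    expected = np.array([(b == c).mean() for c in cats])
    return np.sum((observed - expected) ** 2 / expected)
```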
%%time
test_statistic_boot = get_boot_test_statistics(samples1, samples0, n_tests=n_tests, 
                                               test_statistic_fun=get_chi2_test_statistic)
Wall time: 2.75 s
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
Wall time: 0 ns
0.4166074247279987
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()

Continuous - correlation

Testing how significant the Spearman rank correlation of two variables is

x_noise = stats.norm(loc=0, scale=1.)
y_noise = stats.norm(loc=0, scale=1.)

n = 100

samples0 = np.linspace(-1,1,n) + x_noise.rvs(size=n)
samples1 = samples0 + y_noise.rvs(size=n)
samples0[:5], samples1[:5]
(array([-1.44618272, -2.17776027, -0.63076426, -0.05349711, -2.01605591]),
 array([-3.42547504, -2.42894734, -0.85560307, -0.04791731, -1.3396359 ]))
plt.scatter(samples0, samples1)
plt.show()

get_spearmanr_test_statistic[source]

get_spearmanr_test_statistic(a:ndarray, b:ndarray)

Computes the Spearman rank correlation

test_statistic = get_spearmanr_test_statistic(samples1, samples0); test_statistic
0.773105310531053
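get_spearmanr_test_statistic presumably just wraps scipy; a minimal version (an assumption based on the name and docstring) would be:

```python
import numpy as np
from scipy import stats

def get_spearmanr_test_statistic(a: np.ndarray, b: np.ndarray) -> float:
    # Spearman rank correlation coefficient between a and b
    return stats.spearmanr(a, b).correlation
```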
%%time
test_statistic_boot = get_boot_test_statistics(samples1, samples0, n_tests=n_tests, 
                                               test_statistic_fun=get_spearmanr_test_statistic)
Wall time: 551 ms
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
Wall time: 0 ns
-4.440892098500626e-16
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()

NSFG data

Testing whether the weight of newborns of first-time mothers differs significantly from that of newborns of mothers with one or more previous children, using get_mean_test_statistic.

df = pd.read_csv(nsfg.clean_file_path)
df.head()
   caseid  prglngth  outcome  pregordr  birthord  birthwgt_lb  birthwgt_oz  agepreg      finalwgt  totalwgt_lb  totalwgt_kg
0       1        39        1         1       1.0          8.0         13.0    33.16   6448.271112       8.8125     3.997279
1       1        39        1         2       2.0          7.0         14.0    39.25   6448.271112       7.8750     3.572037
2       2        39        1         1       1.0          9.0          2.0    14.33  12999.542264       9.1250     4.139027
3       2        39        1         2       2.0          7.0          0.0    17.83  12999.542264       7.0000     3.175144
4       2        39        1         3       3.0          6.0          3.0    18.33  12999.542264       6.1875     2.806601

Grouping the samples into first babies (first) and second-or-later babies (other)

live = df.loc[df["outcome"]==1, :]
first, other = live.loc[live["pregordr"]==1, :], live.loc[live["pregordr"]!=1, :]
len(first), len(other)
(3368, 5780)

Collecting the weights and cleaning

col = "totalwgt_kg"  # prglngth = pregnancy length, totalwgt_kg = total weight of the newborn
samples0, samples1 = first[col].dropna().values, other[col].dropna().values  
[3.9972795 4.139027  3.8838815 3.4302895 3.5436875] [3.572037  3.175144  2.8066005 4.3374735 3.798833 ]
-0.047628365605402845

Computing the test statistic over the original dataset

test_statistic = get_mean_test_statistic(samples0, samples1); test_statistic
-0.047628365605402845

Shuffling the dataset and computing the bootstrapped test statistics

%%time
test_statistic_boot = get_boot_test_statistics(samples0, samples1, n_tests=n_tests)
Wall time: 26 ms
%%time
p_val = get_p_value(test_statistic_boot, test_statistic); p_val
Wall time: 1.03 ms
1.0136336214827679e-13
ax = plot_hist_with_kde_for_hypothesis_test(test_statistic_boot, p_val, left_ext=.5, right_ext=.5)
plt.show()