Synthesizing data¶
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x_noise = stats.norm(loc=0, scale=1.)
y_noise = stats.norm(loc=0, scale=1.)
n = 100
samples0 = np.linspace(-1, 1, n) + x_noise.rvs(size=n)
samples1 = samples0 + y_noise.rvs(size=n)
samples0[:5], samples1[:5]
plt.scatter(samples0, samples1)
plt.show()
Computing coefficients¶
From the Wikipedia article on simple linear regression:
The offset, $\widehat{\alpha}$, can be calculated using: $$ \widehat{\alpha } = {\bar{y}}-{\widehat{\beta}}\bar{x} $$
And the slope, $\widehat{\beta}$, can be calculated using: $$ \widehat{\beta} = {\frac {\sum _{i=1}^{n}(x_{i}-{\bar{x}})(y_{i}-{\bar{y}})}{\sum_{i=1}^{n}(x_{i}-{\bar{x}})^{2}}} = \frac {\text{cov}\left(x,y\right)} {\text{var}(x)}$$
$\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively.
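The helper used below is not defined in this notebook; here is a minimal sketch of what regression_coeffs_with_covar could look like given the formulas above (the return_offset flag is inferred from its use in the hypothesis-testing section):

import numpy as np

def regression_coeffs_with_covar(x, y, return_offset=True):
    # slope = cov(x, y) / var(x); matching ddof in both gives the OLS slope
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    if not return_offset:
        return slope
    # offset = mean(y) - slope * mean(x)
    offset = y.mean() - slope * x.mean()
    return slope, offset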
%%time
slope, offset = regression_coeffs_with_covar(samples0, samples1); slope, offset
%%time
res = stats.linregress(samples0, samples1)
Visualizing scipy's stats.linregress against the regression implemented here.
fig, ax = plt.subplots(figsize=(7,4))
ax.scatter(samples0, samples1, label="data")
ax.scatter(samples0, slope*samples0+offset, label="fit")
ax.scatter(samples0, res[0]*samples0+res[1], marker="x", label="fit - linregress")
ax.legend(loc="best")
plt.show()
Confidence calculation¶
%%time
res = bootstrap_regression(samples0, samples1, regress_func=regression_coeffs_with_covar, max_pct=1., n_iter=100)
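bootstrap_regression is a project helper that is not shown here; a plausible sketch, assuming max_pct sets the resample size as a fraction of the data and each iteration refits on a resample drawn with replacement:

import numpy as np

def bootstrap_regression(x, y, regress_func, max_pct=1., n_iter=100):
    # resample (x, y) pairs with replacement and refit n_iter times;
    # returns an (n_iter, 2) array of (slope, offset) pairs
    rng = np.random.default_rng()
    size = int(len(x) * max_pct)
    coeffs = np.empty((n_iter, 2))
    for i in range(n_iter):
        idx = rng.integers(0, len(x), size=size)
        coeffs[i] = regress_func(x[idx], y[idx])
    return coeffs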
Visualizing the bootstrapped fits by overlaying them on the data, and plotting the distributions of the calculated slopes and offsets.
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10,7), constrained_layout=True)
# reshaping the top of the figure
gs = axs[0,0].get_gridspec()
for ax in axs[0,:]: ax.remove()
axb = fig.add_subplot(gs[0,:])
dots = axb.scatter(samples0, samples1, alpha=.7, label="data")
x = np.linspace(samples0.min(), samples0.max(), 100)
for (slope, offset) in res:
    line_bs, = axb.plot(x, slope*x+offset, "-k", alpha=.1, lw=1, label="bootstrapped fit")
axb.legend(handles=[dots, line_bs], loc="best")
axb.set_title("Fits - bootstrapped")
ax = axs[1,0]
ax = basic_stats.plot_hist_with_kde(res[:,0], left_ext=.5, right_ext=.5, ax=ax, bins=20,
plot_params={"title": "Slope - bootstrapped"})
ax = axs[1,1]
ax = basic_stats.plot_hist_with_kde(res[:,1], left_ext=.5, right_ext=.5, ax=ax, bins=20,
plot_params={"title": "Offset - bootstrapped"})
plt.show()
Hypothesis testing the slope¶
Above we found confidence regions for the slope and offset parameters. Here we compute the significance of the slope value: we randomly re-pair the $x$ and $y$ values, which destroys any real association between them and therefore samples the slope's distribution under the null hypothesis of no relationship.
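The hypothesis_testing helpers are also project code not defined in this notebook; a minimal sketch of the permutation scheme just described, with names and signatures inferred from the calls below:

import numpy as np

def get_boot_test_statistics(x, y, n_tests, test_statistic_fun, mode="shuffle"):
    # under the null hypothesis the pairing of x and y is arbitrary, so
    # shuffling y and recomputing the statistic samples its null distribution
    rng = np.random.default_rng()
    null_stats = np.empty(n_tests)
    for i in range(n_tests):
        y_perm = rng.permutation(y) if mode == "shuffle" else y
        null_stats[i] = test_statistic_fun(x, y_perm)
    return null_stats

def get_p_value(test_statistic_boot, test_statistic):
    # two-sided p-value: fraction of null statistics at least as extreme
    # as the observed one
    return np.mean(np.abs(test_statistic_boot) >= abs(test_statistic))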
from functools import partial

n_tests = 100
test_statistic_fun = partial(regression_coeffs_with_covar, return_offset=False)
%%time
test_statistic_boot = hypothesis_testing.get_boot_test_statistics(samples0, samples1, n_tests=n_tests,
test_statistic_fun=test_statistic_fun,
mode="shuffle")
test_statistic = res[:,0].mean(); test_statistic
p_val = hypothesis_testing.get_p_value(test_statistic_boot, test_statistic); p_val
The same regression, using statsmodels.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"x": samples0, "y": samples1})
%%time
model = smf.ols("y ~ x", data=df)
res = model.fit(); res.summary()
res.params
res.pvalues
res.rsquared
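As a cross-check against the bootstrapped distributions above, statsmodels can also report parametric confidence intervals for the fitted coefficients:

res.conf_int(alpha=0.05)  # 95% confidence intervals for the intercept and slope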