Synthesizing data¶
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x_noise = stats.norm(loc=0, scale=1.)
y_noise = stats.norm(loc=0, scale=1.)
n = 100
samples0 = np.linspace(-1, 1, n) + x_noise.rvs(size=n)
samples1 = samples0 + y_noise.rvs(size=n)
samples0[:5], samples1[:5]
plt.scatter(samples0, samples1)
plt.show()
Computing coefficients¶
From the Wikipedia article on simple linear regression:
The offset, $\widehat{\alpha}$, can be calculated using: $$ \widehat{\alpha } = {\bar{y}}-{\widehat{\beta}}\bar{x} $$
And the slope, $\widehat{\beta}$, can be calculated using: $$ \widehat{\beta} = {\frac {\sum _{i=1}^{n}(x_{i}-{\bar{x}})(y_{i}-{\bar{y}})}{\sum_{i=1}^{n}(x_{i}-{\bar{x}})^{2}}} = \frac {\text{cov}\left(x,y\right)} {\text{var}(x)}$$
$\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively.
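The helper used below is not defined in this notebook; here is a minimal sketch of what regression_coeffs_with_covar could look like given the formulas above (the return_offset flag is inferred from its use in the hypothesis-testing section):

import numpy as np

def regression_coeffs_with_covar(x, y, return_offset=True):
    # slope = cov(x, y) / var(x); matching ddof in both gives the OLS slope
    slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    if not return_offset:
        return slope
    # offset = mean(y) - slope * mean(x)
    offset = y.mean() - slope * x.mean()
    return slope, offset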
%%time
slope, offset = regression_coeffs_with_covar(samples0, samples1); slope, offset
%%time
res = stats.linregress(samples0, samples1)
Visualizing scipy's stats.linregress against the regression implemented here.
fig, ax = plt.subplots(figsize=(7,4))
ax.scatter(samples0, samples1, label="data")
ax.scatter(samples0, slope*samples0+offset, label="fit")
ax.scatter(samples0, res[0]*samples0+res[1], marker="x", label="fit - linregress")
ax.legend(loc="best")
plt.show()
Confidence calculation¶
%%time
res = bootstrap_regression(samples0, samples1, regress_func=regression_coeffs_with_covar, max_pct=1., n_iter=100)
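bootstrap_regression is a project helper that is not shown here; a plausible sketch, assuming max_pct sets the resample size as a fraction of the data and each iteration refits on a resample drawn with replacement:

import numpy as np

def bootstrap_regression(x, y, regress_func, max_pct=1., n_iter=100):
    # resample (x, y) pairs with replacement and refit n_iter times;
    # returns an (n_iter, 2) array of (slope, offset) pairs
    rng = np.random.default_rng()
    size = int(len(x) * max_pct)
    coeffs = np.empty((n_iter, 2))
    for i in range(n_iter):
        idx = rng.integers(0, len(x), size=size)
        coeffs[i] = regress_func(x[idx], y[idx])
    return coeffs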
Visualizing the bootstrapped fits by overlaying them on the data, and plotting the distributions of the calculated slopes and offsets.
fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10,7), constrained_layout=True)
# reshaping the top of the figure
gs = axs[0,0].get_gridspec()
for ax in axs[0,:]: ax.remove()
axb = fig.add_subplot(gs[0,:])
dots = axb.scatter(samples0, samples1, alpha=.7, label="data")
x = np.linspace(samples0.min(), samples0.max(), 100)
for (slope, offset) in res:
    line_bs, = axb.plot(x, slope*x+offset, "-k", alpha=.1, lw=1, label="bootstrapped fit")
axb.legend(handles=[dots, line_bs], loc="best")
axb.set_title("Fits - bootstrapped")
ax = axs[1,0]
ax = basic_stats.plot_hist_with_kde(res[:,0], left_ext=.5, right_ext=.5, ax=ax, bins=20,
plot_params={"title": "Slope - bootstrapped"})
ax = axs[1,1]
ax = basic_stats.plot_hist_with_kde(res[:,1], left_ext=.5, right_ext=.5, ax=ax, bins=20,
plot_params={"title": "Offset - bootstrapped"})
plt.show()
Hypothesis testing the slope¶
Above we found confidence regions for the slope and offset parameters. Here we compute the significance of the slope value: we randomly re-pair the $x$ and $y$ values, which destroys any real association between them and therefore samples the slope's distribution under the null hypothesis of no relationship.
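The hypothesis_testing helpers are also project code not defined in this notebook; a minimal sketch of the permutation scheme just described, with names and signatures inferred from the calls below:

import numpy as np

def get_boot_test_statistics(x, y, n_tests, test_statistic_fun, mode="shuffle"):
    # under the null hypothesis the pairing of x and y is arbitrary, so
    # shuffling y and recomputing the statistic samples its null distribution
    rng = np.random.default_rng()
    null_stats = np.empty(n_tests)
    for i in range(n_tests):
        y_perm = rng.permutation(y) if mode == "shuffle" else y
        null_stats[i] = test_statistic_fun(x, y_perm)
    return null_stats

def get_p_value(test_statistic_boot, test_statistic):
    # two-sided p-value: fraction of null statistics at least as extreme
    # as the observed one
    return np.mean(np.abs(test_statistic_boot) >= abs(test_statistic))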
from functools import partial

n_tests = 100
test_statistic_fun = partial(regression_coeffs_with_covar, return_offset=False)
%%time
test_statistic_boot = hypothesis_testing.get_boot_test_statistics(samples0, samples1, n_tests=n_tests,
test_statistic_fun=test_statistic_fun,
mode="shuffle")
test_statistic = res[:,0].mean(); test_statistic
p_val = hypothesis_testing.get_p_value(test_statistic_boot, test_statistic); p_val
The same regression, using statsmodels.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"x": samples0, "y": samples1})
%%time
model = smf.ols("y ~ x", data=df)
res = model.fit(); res.summary()
res.params
res.pvalues
res.rsquared
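As a cross-check against the bootstrapped distributions above, statsmodels can also report parametric confidence intervals for the fitted coefficients:

res.conf_int(alpha=0.05)  # 95% confidence intervals for the intercept and slope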