Performing linear least squares regression using covariance and statsmodels, plus confidence estimation and hypothesis testing

Synthesizing data

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

x_noise = stats.norm(loc=0, scale=1.)
y_noise = stats.norm(loc=0, scale=1.)

n = 100

samples0 = np.linspace(-1,1,n) + x_noise.rvs(size=n)
samples1 = samples0 + y_noise.rvs(size=n)
samples0[:5], samples1[:5]
(array([-2.35969528, -0.74308751, -1.12853045, -0.79721491,  0.30391563]),
 array([-3.69093851, -1.10378229, -2.16618519, -1.74904238,  0.41284136]))
plt.scatter(samples0, samples1)
plt.show()

Computing coefficients

From the Wikipedia article on simple linear regression:

The offset, $\widehat{\alpha}$, can be calculated using: $$ \widehat{\alpha } = {\bar{y}}-{\widehat{\beta}}\bar{x} $$

And the slope, $\widehat{\beta}$, can be calculated using: $$ \widehat{\beta} = {\frac {\sum _{i=1}^{n}(x_{i}-{\bar{x}})(y_{i}-{\bar{y}})}{\sum_{i=1}^{n}(x_{i}-{\bar{x}})^{2}}} = \frac {\text{cov}\left(x,y\right)} {\text{var}(x)}$$

$\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$, respectively.
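The two equations above translate directly into code. The following is an illustrative sketch (not the notebook's actual source for `regression_coeffs_with_covar`), assuming `np.cov` conventions:

```python
import numpy as np

def covar_regression_sketch(xs: np.ndarray, ys: np.ndarray, return_offset: bool = True):
    """Slope via cov(x, y) / var(x); offset via y-mean minus slope times x-mean."""
    cov = np.cov(xs, ys)           # 2x2 sample covariance matrix
    slope = cov[0, 1] / cov[0, 0]  # cov(x, y) / var(x); the ddof factor cancels in the ratio
    if not return_offset:
        return slope
    offset = ys.mean() - slope * xs.mean()
    return slope, offset
```

On the synthetic data above this should agree with `scipy.stats.linregress` up to floating-point error.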

regression_coeffs_with_covar[source]

regression_coeffs_with_covar(xs:ndarray, ys:ndarray, return_offset:bool=True)

Computing slope and offset

%%time
slope, offset = regression_coeffs_with_covar(samples0, samples1); slope, offset
Wall time: 988 µs
(0.9915053349133488, -0.051078530663675535)
%%time
res = stats.linregress(samples0, samples1)
Wall time: 1.01 ms

Visualizing scipy's stats.linregress against the regression implemented here

fig, ax = plt.subplots(figsize=(7,4))
ax.scatter(samples0, samples1, label="data")
ax.scatter(samples0, slope*samples0+offset, label="fit")
ax.scatter(samples0, res[0]*samples0+res[1], marker="x", label="fit - linregress")
ax.legend(loc="best")
plt.show()

Confidence calculation

bootstrap_regression[source]

bootstrap_regression(xs, ys, regress_func:callable='regression_coeffs_with_covar', max_pct:float=1.0, n_iter:int=100)

Bootstraps xs and ys to repeatedly perform regression using regress_func
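A minimal sketch of such a bootstrap, assuming `max_pct` sets the resample size as a fraction of the data (the actual `bootstrap_regression` may differ in details):

```python
import numpy as np

def bootstrap_regression_sketch(xs, ys, regress_func, max_pct: float = 1.0,
                                n_iter: int = 100, seed=None):
    """Resample (x, y) pairs with replacement and re-fit n_iter times."""
    rng = np.random.default_rng(seed)
    n = int(len(xs) * max_pct)  # assumption: max_pct controls the resample size
    results = []
    for _ in range(n_iter):
        idx = rng.integers(0, len(xs), size=n)  # indices drawn with replacement
        results.append(regress_func(xs[idx], ys[idx]))
    return np.array(results)  # shape (n_iter, 2): one (slope, offset) per iteration
```

The spread of the returned slopes and offsets gives the confidence regions visualized below.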

%%time
res = bootstrap_regression(samples0, samples1, regress_func=regression_coeffs_with_covar, max_pct=1., n_iter=100)
Wall time: 17 ms

Visualizing the bootstrapped fits by overlaying them, and plotting the distributions of the estimated slopes and offsets

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(10,7), constrained_layout=True)
# reshaping the top of the figure
gs = axs[0,0].get_gridspec()
for ax in axs[0,:]: ax.remove()
axb = fig.add_subplot(gs[0,:])

dots = axb.scatter(samples0, samples1, alpha=.7, label="data")
x = np.linspace(samples0.min(), samples0.max(), 100)
for (slope, offset) in res:
    line_bs, = axb.plot(x, slope*x+offset, "-k", alpha=.1, lw=1, label="bootstrapped fit")
axb.legend(handles=[dots, line_bs], loc="best")
axb.set_title("Fits - bootstrapped")

ax = axs[1,0]
ax = basic_stats.plot_hist_with_kde(res[:,0], left_ext=.5, right_ext=.5, ax=ax, bins=20,
                                    plot_params={"title": "Slope - bootstrapped"})
ax = axs[1,1]
ax = basic_stats.plot_hist_with_kde(res[:,1], left_ext=.5, right_ext=.5, ax=ax, bins=20,
                                    plot_params={"title": "Offset - bootstrapped"})
plt.show()

Hypothesis testing the slope

Above we found confidence regions for the slope and offset parameters. Here we compute the significance of the slope value. For this we randomly re-assign $x$ and $y$ pairs: shuffling destroys any association between the two variables, so the slopes fitted to the shuffled data approximate the null distribution of the test statistic.
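The shuffle-based null distribution can be sketched as follows. This is an illustrative version, not the notebook's `hypothesis_testing` module; the empirical p-value here simply counts null draws at least as extreme as the observed statistic, whereas `get_p_value` may instead use a fitted null distribution (its result above is far smaller than 1/`n_tests` could produce empirically):

```python
import numpy as np

def shuffle_null_statistics_sketch(xs, ys, test_statistic_fun, n_tests: int = 100, seed=None):
    """Null distribution of a statistic under random re-pairing of x and y."""
    rng = np.random.default_rng(seed)
    return np.array([test_statistic_fun(xs, rng.permutation(ys))
                     for _ in range(n_tests)])

def empirical_p_value_sketch(null_stats, observed):
    """Two-sided empirical p-value: fraction of null draws at least as extreme."""
    return np.mean(np.abs(null_stats) >= abs(observed))
```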

from functools import partial

n_tests = 100
test_statistic_fun = partial(regression_coeffs_with_covar, return_offset=False)
%%time
test_statistic_boot = hypothesis_testing.get_boot_test_statistics(samples0, samples1, n_tests=n_tests, 
                                                                  test_statistic_fun=test_statistic_fun,
                                                                  mode="shuffle")
test_statistic = res[:,0].mean(); test_statistic
0.972975974304457
p_val = hypothesis_testing.get_p_value(test_statistic_boot, test_statistic); p_val
1.1102230246251565e-16

Same but using statsmodels

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({"x": samples0, "y": samples1})
%%time
model = smf.ols("y ~ x", data=df)
res = model.fit(); res
Wall time: 10 ms
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x21190365308>
res.params
Intercept   -0.051272
x            0.981590
dtype: float64
res.pvalues
Intercept    5.716606e-01
x            2.806342e-21
dtype: float64
res.rsquared
0.601232317515692
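The statsmodels results object also exposes parametric confidence intervals directly via `conf_int`, which can be compared against the bootstrapped distributions above. A minimal sketch, using synthetic data standing in for `samples0`/`samples1`:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)  # illustrative data, generated like the notebook's samples
x = np.linspace(-1, 1, 100) + rng.normal(size=100)
df = pd.DataFrame({"x": x, "y": x + rng.normal(size=100)})

res = smf.ols("y ~ x", data=df).fit()
ci = res.conf_int(alpha=0.05)  # 95% intervals, one row per parameter
print(ci.loc["x"])             # lower and upper bound for the slope
```

These intervals come from the usual t-distribution of the OLS estimates, so on well-behaved data they should roughly match the spread of the bootstrapped slopes and offsets.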