Content in this notebook
- generate synthetic data → obs
- do some basic analysis of that data using a random forest → model
- create a hypothesis (causal graph) about the data using the networkx package → causal_model
- compute the causal estimand, the causal effect and potential refutations using the dowhy package
References
References for the dowhy package:
Material that helped my introductory reading so far:
- degeneratestate.org: Pt I - potential outcomes (contains definitions of ATE, ATT and ATC, among other things)
- degeneratestate.org: Pt II - causal graphs & the backdoor criterion
- this YouTube video on backdoor paths
- Rubin et al. 2005 on causal inference and potential outcomes
- wiki article on the average treatment effect (ATE)
- wiki article on propensity score matching
- wiki article on regression discontinuity design
Open questions
- How is matching or grouping done in categorical vs continuous treatment variable cases?
- How can ATT, ATC and ATE be visualized, and what is the reasoning behind their potentially different applications?
- Could the code easily be adjusted to extend the regression using arbitrary scikit-learn methods?
- Why is logistic regression used to compute the propensity score and not other methods? (A sketch of the standard approach follows below.)
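On the last question: a propensity score is just an estimate of $P(\text{treated} \mid \text{confounders})$, and logistic regression happens to be the most common estimator for it. A minimal sketch with made-up data (W and t are hypothetical here; any classifier exposing predict_proba could be swapped in):
# minimal sketch, not dowhy's internals: estimate the propensity score
# P(treatment = 1 | confounders) with logistic regression
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 2))  # hypothetical confounders
t = (W @ np.array([1., -.5]) + rng.normal(size=1000) > 0).astype(int)  # hypothetical treatment

clf = LogisticRegression().fit(W, t)
propensity = clf.predict_proba(W)[:, 1]  # estimated P(t=1 | W) per unit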
import pandas as pd
from scipy import stats

n = 1000
a, b, c = 1.5, 1., 0
target = 'Y'
obs = pd.DataFrame(columns=['X0', 'X1', target])
obs['X0'] = stats.uniform.rvs(loc=-1, scale=2, size=n)
# obs['X1'] = stats.norm.rvs(loc=0, scale=.1, size=n)
obs['X1'] = stats.uniform.rvs(loc=-.5, scale=1, size=n)
obs[target] = a * obs.X0 ** b + c + obs.X1
obs.head()
n = 1000
a, b, c, d = 1.5, 1., 1., 0.
target = 'Y'
obs = pd.DataFrame(columns=['X0', 'X1', 'X2', target])
obs['X0'] = stats.uniform.rvs(loc=-1, scale=2, size=n)
obs['X1'] = stats.norm.rvs(loc=0, scale=.1, size=n)
obs['X2'] = stats.norm.rvs(loc=0, scale=.1, size=n)
obs[target] = a * (obs.X0 * obs.X1) ** b + obs.X1 ** c + d
obs.head()
Data generating model: $Y = V \cdot \beta + W \cdot c_2 + X \cdot c_e \cdot V$
with the treatment $V \equiv \varepsilon + W \cdot c_1 + Z \cdot c_z$
- $V =$ treatment
- $Y =$ outcome
- $W =$ common causes (link treatment and outcome)
- $X =$ effect modifiers (only affect the outcome)
- $Z =$ instrumental variables (only affect the outcome through the treatment) → examples
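The helper classes below build such a dataset step by step; for reference, dowhy ships a ready-made generator for essentially this model. A minimal sketch using dowhy.datasets.linear_dataset (the parameter values here are arbitrary):
# sketch: dowhy's built-in generator for the same kind of linear model
import dowhy.datasets

data = dowhy.datasets.linear_dataset(beta=1,
                                     num_common_causes=2,     # W
                                     num_instruments=3,       # Z
                                     num_effect_modifiers=2,  # X
                                     num_samples=1000,
                                     treatment_is_binary=True)
data['df'].head()  # data['gml_graph'] holds the corresponding causal graph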
n = 1000
n_common_causes = 2
n_instruments = 3
n_eff_mods = 2
n_treatments = 1
beta = 1 # max random value
target = 'Y'
%%time
cc = CommonCauses.get_obs(n, n_common_causes)
cc.obs.head()
%%time
ins = Instruments.get_obs(n, n_instruments)
ins.obs.head()
%%time
em = EffectModifiers.get_obs(n, n_eff_mods)
em.obs.head()
stochastically_convert_to_binary(.5)  # dowhy.datasets helper: samples 0 or 1 with a probability derived from the input
%%time
treat = Treatments.get_obs(n, n_treatments, cc, ins, beta, treatment_is_binary=True)
treat.obs.head()
outcome_is_binary = True
out = Outcomes.get_obs(treat, cc, em, outcome_is_binary=outcome_is_binary)
out.obs.head()
obs = pd.concat((treat.obs, cc.obs, em.obs, ins.obs, out.obs), axis=1)
obs.head()
obs.describe()
Covariation of the target variable with each of the other variables individually, as observed (i.e. still confounded)
plot_target_vs_rest(obs)
Distribution of each individual variable's observed values
plot_var_hists(obs)
Looking at correlations to get an idea of which interactions might be important
show_correlations(obs)
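show_correlations is a notebook helper defined elsewhere; a minimal sketch of what such an overview boils down to, using only pandas and matplotlib (the helper's actual implementation may differ):
# sketch of a correlation overview with plain pandas/matplotlib
import matplotlib.pyplot as plt

corr = obs.corr()  # pairwise Pearson correlations
fig, ax = plt.subplots()
im = ax.matshow(corr, cmap='coolwarm', vmin=-1, vmax=1)
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
plt.show()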
X, y, not_target = get_Xy(obs, target=target)
from sklearn import ensemble, metrics

model = ensemble.RandomForestRegressor(n_estimators=100, max_features='sqrt')
model.fit(X, y)
Getting a feeling for the quality of the model
get_model_feel(model, obs, target=target)
Looking at which features are particularly important for the prediction
fi_scores = get_feature_importance(model, obs, target=target,
                                   metric=metrics.mean_squared_error)
fi_scores.head()
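get_feature_importance is a notebook helper as well; one standard way to obtain such scores is permutation importance, sketched here with scikit-learn (not necessarily what the helper does; this assumes not_target holds the feature names):
# sketch: permutation feature importance via scikit-learn
from sklearn.inspection import permutation_importance

pi = permutation_importance(model, X, y, n_repeats=10,
                            scoring='neg_mean_squared_error')
for name, score in sorted(zip(not_target, pi.importances_mean),
                          key=lambda pair: -pair[1]):
    print(f'{name}: {score:.4f}')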
Looking at partial dependencies
%%time
part_deps = get_partial_dependencies(model, obs, target=target,
                                     max_num_obs=100,
                                     max_num_ys=10)
%%time
plot_partial_dependencies(part_deps, target=target)
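For comparison, scikit-learn also ships partial dependence plotting out of the box; a sketch (assumes scikit-learn >= 1.0 and that not_target holds the feature names):
# sketch: scikit-learn's built-in partial dependence display
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(model, X,
                                        features=list(range(X.shape[1])),
                                        feature_names=list(not_target))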
gg = GraphGenerator(obs)
print(f'target var: {gg.target}, not target vars: {", ".join(gg.not_targets)}')
g = gg.get_only_Xi_to_Y()
g = gg.get_Xi_to_Y_with_ccs_and_such()
layout = nx.random_layout(g)
g.edges
gg.vis_g(g, kind='spiral')
First let's cast the graph into a form consumable by dowhy.CausalModel
gml = gg.get_gml(g); gml
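get_gml is a GraphGenerator helper; with plain networkx the same serialization can be sketched by hand:
# sketch: serializing the networkx graph to a GML string by hand
import networkx as nx

gml_by_hand = '\n'.join(nx.generate_gml(g))
print(gml_by_hand[:200])  # the same text format dowhy.CausalModel consumes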
Instantiating the causal_model
import dowhy as dw

treatment = ['V0', ]  # 'X1'
causal_model = dw.CausalModel(data=obs, treatment=treatment,
                              outcome=target, graph=gml)
Sanity checking that the model actually looks like what we had in mind with g, noticing that a node, U, was added between treatment and outcome, which is an unobserved confounder.
causal_model.view_model()
Generating the expression for the estimand
identified_estimand = causal_model.identify_effect(proceed_when_unidentifiable=True)
print(identified_estimand)
Estimating the effect between treatment and control. For this we need to define what the treatment_value and control_value are, respectively, i.e. what the treatment (e.g. X0) will be set to for each group. If the treatment is binary, this can be translated into "received a drug" vs "received a placebo" by, for example, using 1 and 0 respectively.
How dowhy.CausalModel handles the treatment input is specified by the second part of the method_name argument, the part after the dot. For example, when the treatment is of a categorical nature you can use 'propensity_score_stratification', and when it is of a continuous nature you can use 'linear_regression'.
Still unclear:
- Difference between 'regression_discontinuity' and 'linear_regression'?
- Difference between 'instrumental_variable' and 'backdoor'?
# method_name = 'backdoor.propensity_score_stratification'
# method_name = 'backdoor.propensity_score_matching'
# method_name = 'backdoor.propensity_score_weighting'
# method_name = 'iv.instrumental_variable'
# method_name = 'iv.regression_discontinuity'
method_name = 'backdoor.linear_regression'
effect_kwargs = dict(
    method_name=method_name,
    control_value=0,
    treatment_value=1,
    target_units='ate',
    test_significance=True,
)
causal_estimate = causal_model.estimate_effect(identified_estimand,
                                               **effect_kwargs)
print(causal_estimate)
For the simple linear case, using method_name = 'backdoor.linear_regression' with control_value = 0 and treatment_value = 1, we get the slope of the (linear) relationship between treatment and outcome, which is ~1.5, just as the value used to generate the data.
If however the exponent $b$ of $X_0$ in the data generating model is changed from 1 → 2, we see that the estimated value suddenly goes bonkers.
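A quick by-hand cross-check of that ~1.5 (a sketch for the continuous-outcome case; it assumes the column names produced above, i.e. treatment 'V0' and common causes 'W0', 'W1'):
# sketch: by-hand OLS including the common causes, to compare with the causal estimate
import numpy as np

A = np.column_stack([obs['V0'], obs['W0'], obs['W1'], np.ones(len(obs))])
coef, *_ = np.linalg.lstsq(A, obs[target], rcond=None)
print(f'by-hand treatment coefficient: {coef[0]:.3f}')  # should land near causal_estimate.value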
Trying to refute the observed causal effect. I see this as sanity checking the causal graph created to capture the observed data by trying to poke holes into it in various ways, like adding potentially spurious correlations with a new variable or replacing the treatment with noise.
Available method_name values:
- "random_common_cause": Adding a randomly-generated confounder
- "add_unobserved_common_cause": Adding a confounder that is associated with both treatment and outcome
- "placebo_treatment_refuter": Replacing the treatment with a placebo (random) variable)
- "data_subset_refuter": Removing a random subset of the data
# method_name = 'random_common_cause'
# method_name = 'add_unobserved_common_cause'
method_name = 'placebo_treatment_refuter'
# method_name = 'data_subset_refuter'
refute_kwargs = dict(
    method_name=method_name,
    placebo_type="permute",  # relevant for placebo refutation
    subset_fraction=.9,      # relevant for subset refutation
)
refute_res = causal_model.refute_estimate(identified_estimand,
                                          causal_estimate,
                                          **refute_kwargs)
print(refute_res)