Analysing censored data [1].

To play with survival analysis let's imagining the following case: we survey a randomly selected group of women / girls of various ages and binary marital status (random in age and legal status married / not married). What we want know is what is the expected time in years until they, on average, get married. For this we need to compute the hazard and survival function.

[1], Think Stats, ch. 13

Synthesize survey data

Age range for survey participants

age = np.arange(start=100, step=1)

Probability for being married, given the age

married = stats.norm.cdf(age, loc=25, scale=2) * .5  \
    + stats.norm.cdf(age, loc=35, scale=2) * .2

Probability for age

population_share = stats.expon.sf(age, scale=1000)
population_share = population_share / population_share.sum()

Storing in a dataframe plus ensuring societal constraint

df = pd.DataFrame({"age": age, "married": married, "population_share": population_share})
df.loc[df["age"]<18, "married"] = 0

Plotting the underlying distributions

fig, axs = plt.subplots(figsize=(10,4), ncols=2, constrained_layout=True)
ax = axs[0]
ax.plot("age", "married", data=df)
ax.set_ylim((0,1))
ax.set(ylabel="Fraction of married women given age", xlabel="Age")

ax = axs[1]
ax.plot("age", "population_share", data=df)
ax.set_ylim((0,None))
ax.set(ylabel="Fraction of female population", xlabel="Age")

plt.show()

Simulating a survey of n participants

n = 500  # number of participants

Generating the data

%%time
rv_age = stats.rv_discrete(values=(df.age, df.population_share))
survey = pd.DataFrame({"age": rv_age.rvs(size=n)})
get_p = lambda _age: df.loc[df["age"]==_age, "married"].values[0]
survey["married"] = [np.random.choice([0, 1], p=[1-get_p(_age), get_p(_age)]) for _age in survey.age]
survey.head()
Wall time: 391 ms
age married
0 9 0
1 31 0
2 59 1
3 48 1
4 18 0

Computing the hazard function, survival function and expected survival life time

Hazard function: $${\displaystyle \lambda (t)=\lim _{dt\rightarrow 0}{\frac {\Pr(t\leq T<t+dt)}{dt\cdot S(t)}}={\frac {f(t)}{S(t)}}=-{\frac {S'(t)}{S(t)}}.}$$

Expected survival life time at time $t_0$: $${\frac {1}{S(t_{0})}}\int _{0}^{{\infty }}t\,f(t_{0}+t)\,dt={\frac {1}{S(t_{0})}}\int _{{t_{0}}}^{{\infty }}S(t)\,dt,$$

wiki

estimate_hazard_fun[source]

estimate_hazard_fun(df:DataFrame, dep_var:str, idp_var:str, fill:int=0)

Estimates the hazard function using the dependent variable dep_var and the independent variable idp_var, assuming the former is binary

%%time
hf = estimate_hazard_fun(survey, "married", "age"); display(hf.head()), display(hf.tail())
dep_var 'married' shares: 0 = 0.52, 1 = 0.48
age count_0 count_1 survivors hazard_function survival_function cdf event_frequency sf_int expected_survival
0.0 0 9.0 0.0 500.0 0.0 1.0 0.0 0.0 67.881915 67.881915
1.0 1 5.0 0.0 491.0 0.0 1.0 0.0 0.0 66.881915 66.881915
2.0 2 4.0 0.0 486.0 0.0 1.0 0.0 0.0 65.881915 65.881915
3.0 3 6.0 0.0 482.0 0.0 1.0 0.0 0.0 64.881915 64.881915
4.0 4 4.0 0.0 476.0 0.0 1.0 0.0 0.0 63.881915 63.881915
age count_0 count_1 survivors hazard_function survival_function cdf event_frequency sf_int expected_survival
86.0 95 3.0 2.0 22.0 0.090909 0.127450 0.872550 0.4 0.322374 2.529412
87.0 96 1.0 4.0 17.0 0.235294 0.097462 0.902538 0.8 0.194924 2.000000
NaN 97 0.0 6.0 12.0 0.500000 0.048731 0.951269 1.0 0.097462 2.000000
NaN 98 0.0 2.0 6.0 0.333333 0.032487 0.967513 1.0 0.048731 1.500000
88.0 99 2.0 2.0 4.0 0.500000 0.016244 0.983756 0.5 0.016244 1.000000
Wall time: 41.6 ms
(None, None)

Visualizing

%%time
fig, axs = plt.subplots(figsize=(10,9), constrained_layout=True, nrows=3, sharex=True)

ax = axs[0]
ax.plot("age", "married", data=df,label="married - ground truth", color="black")
ax.plot("age", "event_frequency", data=hf, color="g", linestyle="--", label="married - naive estimate")
ax.set(ylabel="Fraction of married women", ylim=(0,1))
ax.legend(loc="best")

ax = axs[1]
ax.plot("age", "hazard_function", data=hf, color="b", linestyle="--", label="hazard function - estimated")
ax.plot("age", "cdf", data=hf, color="g", linestyle="--", label="prob. married - estimated") # , marker="o"
ax.set(ylabel="Survival function value")
ax.legend(loc="best")

ax = axs[2]
ax.plot("age", "expected_survival", data=hf)
ax.set(ylabel="Expected unmarried years | age", xlabel="Age")
plt.show()
Wall time: 844 ms