To play with survival analysis let's imagining the following case: we survey a randomly selected group of women / girls of various ages and binary marital status (random in age and legal status married / not married). What we want know is what is the expected time in years until they, on average, get married. For this we need to compute the hazard and survival function.
[1], Think Stats, ch. 13
Synthesize survey data¶
Age range for survey participants
age = np.arange(start=100, step=1)
Probability for being married, given the age
married = stats.norm.cdf(age, loc=25, scale=2) * .5 \
+ stats.norm.cdf(age, loc=35, scale=2) * .2
Probability for age
population_share = stats.expon.sf(age, scale=1000)
population_share = population_share / population_share.sum()
Storing in a dataframe plus ensuring societal constraint
df = pd.DataFrame({"age": age, "married": married, "population_share": population_share})
df.loc[df["age"]<18, "married"] = 0
Plotting the underlying distributions
fig, axs = plt.subplots(figsize=(10,4), ncols=2, constrained_layout=True)
ax = axs[0]
ax.plot("age", "married", data=df)
ax.set_ylim((0,1))
ax.set(ylabel="Fraction of married women given age", xlabel="Age")
ax = axs[1]
ax.plot("age", "population_share", data=df)
ax.set_ylim((0,None))
ax.set(ylabel="Fraction of female population", xlabel="Age")
plt.show()
Simulating a survey of n
participants
n = 500 # number of participants
Generating the data
%%time
rv_age = stats.rv_discrete(values=(df.age, df.population_share))
survey = pd.DataFrame({"age": rv_age.rvs(size=n)})
get_p = lambda _age: df.loc[df["age"]==_age, "married"].values[0]
survey["married"] = [np.random.choice([0, 1], p=[1-get_p(_age), get_p(_age)]) for _age in survey.age]
survey.head()
Computing the hazard function, survival function and expected survival life time¶
Hazard function: $${\displaystyle \lambda (t)=\lim _{dt\rightarrow 0}{\frac {\Pr(t\leq T<t+dt)}{dt\cdot S(t)}}={\frac {f(t)}{S(t)}}=-{\frac {S'(t)}{S(t)}}.}$$
Expected survival life time at time $t_0$: $${\frac {1}{S(t_{0})}}\int _{0}^{{\infty }}t\,f(t_{0}+t)\,dt={\frac {1}{S(t_{0})}}\int _{{t_{0}}}^{{\infty }}S(t)\,dt,$$
%%time
hf = estimate_hazard_fun(survey, "married", "age"); display(hf.head()), display(hf.tail())
Visualizing
%%time
fig, axs = plt.subplots(figsize=(10,9), constrained_layout=True, nrows=3, sharex=True)
ax = axs[0]
ax.plot("age", "married", data=df,label="married - ground truth", color="black")
ax.plot("age", "event_frequency", data=hf, color="g", linestyle="--", label="married - naive estimate")
ax.set(ylabel="Fraction of married women", ylim=(0,1))
ax.legend(loc="best")
ax = axs[1]
ax.plot("age", "hazard_function", data=hf, color="b", linestyle="--", label="hazard function - estimated")
ax.plot("age", "cdf", data=hf, color="g", linestyle="--", label="prob. married - estimated") # , marker="o"
ax.set(ylabel="Survival function value")
ax.legend(loc="best")
ax = axs[2]
ax.plot("age", "expected_survival", data=hf)
ax.set(ylabel="Expected unmarried years | age", xlabel="Age")
plt.show()