Analyzing time to an event can answer many questions about a population. Medicine, epidemiology, and actuarial science have historically analyzed lifespans. For example, actuaries use life tables to assess the probability of someone living to a certain age. Researchers may analyze the likelihood of a new treatment improving survival rates.
One technique historically used by these fields is survival analysis. Survival analysis estimates the probability that an event has not yet occurred as a function of time. Recently we used survival analysis to analyze the likelihood of different user segments “surviving” to activation. However, survival analysis has many uses, so we will show a general example here.
Structuring Data for Censorship
One of the main strengths of survival analysis is its ability to handle censorship: an observation for which the event has not occurred yet. In the classic medical case, accounting for censorship enables researchers to correctly estimate a lifespan while including patients who have not yet died or who have left a study for reasons other than death. In our case, we were able to use censorship to handle very recently registered users who have not had time to abandon the product.
Survival estimators like the commonly used Kaplan-Meier estimator depend on at least two values:
- Time to an observation or duration in the study (e.g. months to death or months in the study)
- Observation of an event (e.g. death)
time,age,death,Stage II,Stage III,Stage IV
0.6,77,1,0,0,0
1.3,53,1,0,0,0
2.4,45,1,0,0,0
2.5,57,0,0,0,0
3.2,58,1,0,0,0
3.2,51,0,0,0,0
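As a rough illustration of how the two values above can be derived from raw records, here is a minimal sketch using only the standard library. The user records, field names, and the `OBSERVATION_END` cutoff are all hypothetical and are not taken from the original analysis. A record whose event has not happened by the cutoff is treated as censored: its duration runs up to the cutoff and its event flag is 0.

```python
from datetime import date

# Hypothetical user records: registration date and, if it happened,
# the date of the event of interest (None means not yet observed).
users = [
    {"user": "a", "registered": date(2023, 1, 10), "event_date": date(2023, 3, 1)},
    {"user": "b", "registered": date(2023, 2, 5), "event_date": None},
    {"user": "c", "registered": date(2023, 5, 20), "event_date": None},
]

# Hypothetical date on which the data was pulled.
OBSERVATION_END = date(2023, 6, 1)


def to_survival_row(record, observation_end):
    """Return (duration_in_days, event_observed) for one record."""
    event_date = record["event_date"]
    observed = event_date is not None and event_date <= observation_end
    # Censored records contribute the time they were under observation.
    end = event_date if observed else observation_end
    return (end - record["registered"]).days, int(observed)


rows = [to_survival_row(u, OBSERVATION_END) for u in users]
for user, (duration, observed) in zip(users, rows):
    print(user["user"], duration, observed)
```

User "c" registered only days before the cutoff, so the row carries a short duration and an event flag of 0 instead of being dropped or counted as an event.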
Applying Survival Analysis
With the data formatted for survival analysis, we can now apply an estimator to it, generate a survival curve, and interpret the results. We will be using Python and the lifelines package. A complete Jupyter notebook for recreating the following examples is here.
The following method will allow us to print and plot survival analysis output for different segments. lifelines has a variety of estimators, but we will be using the KaplanMeierFitter, which implements the Kaplan-Meier estimator. The run_survival method takes in a data frame with labeled time and observation columns.
In addition to plotting survival curves, run_survival will also output the initial survival probabilities over time and the median time to death. Survival times in this laryngeal (throat) cancer data set are measured in months since diagnosis.
Interpreting Survival Curves
Survival curves display the estimated probability of surviving, that is, of the event not yet having occurred (vertical axis), over time (horizontal axis). When we run the estimator across the entire laryngeal cancer sample, we can see that the median time to death is approximately six months. In other words, someone diagnosed with laryngeal cancer has roughly a 50% chance of surviving beyond six months.
However, when we segment the sample by the stage of the cancer at diagnosis, we can clearly see that Stage III and Stage IV laryngeal cancer result in especially accelerated rates of death.
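To make the reading of a curve more concrete, the product-limit calculation behind the Kaplan-Meier estimator can be sketched by hand: at each time where at least one death occurs, the running survival probability is multiplied by (1 - deaths / number still at risk), and censored subjects simply leave the risk set. This is a simplified standard-library sketch, not the lifelines implementation, and the function names are hypothetical. It uses only the six sample rows shown earlier, so its median will not match the six-month figure for the full data set.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each time where at least one event occurred."""
    deaths = Counter(t for t, e in zip(times, events) if e)
    removed = Counter(times)  # events and censored subjects leaving at each time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(removed):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= removed[t]
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5 + 1e-12:  # small tolerance for floating-point rounding
            return t
    return float("inf")  # the curve never reaches 0.5


times = [0.6, 1.3, 2.4, 2.5, 3.2, 3.2]
events = [1, 1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"{t:>4} months: S(t) = {s:.3f}")
print("Median survival time:", median_survival(curve))
```

On these six rows the curve steps down at 0.6, 1.3, 2.4, and 3.2 months. The censored patient at 2.5 months does not cause a step but does shrink the risk set, which is why the final drop at 3.2 months is so large.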
[Figures: survival curves by cancer stage, with panels for Stage III laryngeal cancer and Stage IV laryngeal cancer]
As shown here, even the most optimistic cases of Stage IV laryngeal cancer fail to do better than the baseline survival curve.
Survival analysis is just one tool to help assess the likelihood of an event over time. While we’ve shown a classic case of survival analysis, many product questions can be turned into survival questions:
- What’s the likelihood over time of product abandonment after creating an account?
- On average, how long did it take for someone to cancel a product?
- Is a segment more likely to churn than another one?
If you’d like to learn more about lifelines or survival analysis, Cameron Davidson-Pilon provides a more in-depth overview of survival analysis, its applications, and the library in this talk.
While we provided a brief overview of survival analysis in Python, other languages like R have mature survival analysis tools. For example, the laryngeal cancer data set used here comes from R’s KMsurv package, which collects the data sets from Klein and Moeschberger (1997), “Survival Analysis: Techniques for Censored and Truncated Data”.