Seeking survivors: introduction to survival analysis

Analyzing time to an event can answer many questions about a population. Medicine, epidemiology, and actuarial science have historically analyzed lifespans. For example, actuaries use life tables to assess the probability of someone living to a certain age. Researchers may analyze the likelihood of a new treatment improving survival rates.

One technique these fields have long relied on is survival analysis, which estimates the probability that an event has not yet occurred as a function of time. Recently we used survival analysis to analyze the likelihood of different user segments "surviving" to activation. Survival analysis has many other uses, however, so we show a general example here.

Structuring data for censorship

One of the main strengths of survival analysis is its ability to handle censorship (usually called censoring in the statistical literature): observations for which the event has not yet occurred. In the classic medical case, accounting for censorship lets researchers correctly estimate a lifespan while still including patients who have not yet died or who left the study for reasons other than death. In our case, we used censorship to handle very recently registered users who had not yet had time to abandon the product.

Survival estimators like the commonly used Kaplan-Meier estimator depend on at least two values for each subject:

Time to an observation or duration in the study (e.g. months to death or months in the study)
Observation of an event (e.g. death)
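As a minimal sketch of how those two values can be derived, the snippet below turns raw dates into a duration and an event flag. The user records, the `STUDY_END` cutoff, and the `to_survival_row` helper are all hypothetical and exist only for illustration; they are not taken from the analysis described in this post.

```python
from datetime import date

# Hypothetical records: (user_id, registered_on, abandoned_on or None).
users = [
    ("u1", date(2024, 1, 1), date(2024, 1, 31)),
    ("u2", date(2024, 2, 1), None),   # still active: censored
    ("u3", date(2024, 3, 20), None),  # registered recently: censored
]

# Hypothetical cutoff: the last day on which we could have observed an event.
STUDY_END = date(2024, 4, 1)


def to_survival_row(registered_on, event_on, study_end):
    """Return (duration_in_days, observed) for one subject.

    observed is 1 if the event happened on or before study_end,
    and 0 if the subject is censored at study_end.
    """
    observed = event_on is not None and event_on <= study_end
    end = event_on if observed else study_end
    return (end - registered_on).days, int(observed)


rows = [(uid, *to_survival_row(reg, ev, STUDY_END)) for uid, reg, ev in users]
for uid, duration, observed in rows:
    print(uid, duration, observed)
```

Censored subjects still contribute their time in the study; they simply carry an event flag of 0 rather than being dropped.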

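For example, the following survival data for patients with larynx cancer records the time in months, the patient's age, whether death was observed (1) or the observation was censored (0), and indicator columns for the stage of disease: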

time,age,death,Stage II,Stage III,Stage IV
0.6,77,1,0,0,0
1.3,53,1,0,0,0
2.4,45,1,0,0,0
2.5,57,0,0,0,0
3.2,58,1,0,0,0
3.2,51,0,0,0,0

Applying survival analysis

With the data formatted for survival analysis, we can now apply an estimator to it, generate a survival curve, and interpret the results. We will be using Python and the lifelines package. To recreate the following examples, a complete Jupyter Notebook is here.

The following function will allow us to print and plot survival analysis output for different segments. Lifelines has a variety of estimators, but we will be using the KaplanMeierFitter, which implements the Kaplan-Meier estimator. The run_survival function takes in a data frame with time and observation columns labeled time and death respectively.

In addition to plotting survival curves, run_survival also outputs the first few survival probabilities over time and the median time to death. Times are measured in months since diagnosis in this laryngeal (throat) cancer data set.

Interpreting survival curves

Survival curves display the estimated probability of not yet having experienced the event (vertical axis) over time (horizontal axis). When we run the estimator across the entire laryngeal cancer sample, we can see that the median time to death is approximately six months. In other words, someone diagnosed with laryngeal cancer has an estimated 50% probability of surviving beyond six months.
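To make the link between the curve and the median concrete, here is a minimal pure-Python sketch of the Kaplan-Meier product-limit calculation. It runs only on the six rows excerpted earlier, so its numbers will not match the full-sample results quoted above. The helper names are hypothetical, and the median convention used (the first time the curve reaches 0.5 or below) is an assumption about how to read the curve.

```python
from fractions import Fraction
from itertools import groupby

# (time in months, death indicator) for the six rows excerpted earlier.
rows = [(0.6, 1), (1.3, 1), (2.4, 1), (2.5, 0), (3.2, 1), (3.2, 0)]


def kaplan_meier(rows):
    """Return [(time, S(time))] for each distinct time in rows.

    At each time, S is multiplied by (1 - deaths / at_risk).
    Exact fractions avoid floating-point noise near 0.5.
    """
    ordered = sorted(rows)
    at_risk = len(ordered)
    survival = Fraction(1)
    curve = []
    for time, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        deaths = sum(event for _, event in group)
        survival *= 1 - Fraction(deaths, at_risk)
        curve.append((time, survival))
        # Deaths and censored subjects both leave the risk set.
        at_risk -= len(group)
    return curve


def median_survival_time(curve):
    """First time at which S(t) <= 1/2, or None if never reached."""
    for time, survival in curve:
        if survival <= Fraction(1, 2):
            return time
    return None


curve = kaplan_meier(rows)
for time, survival in curve:
    print(f"{time:>4} months: S = {float(survival):.3f}")
print("median:", median_survival_time(curve))
```

Note how the censored observation at 2.5 months leaves the curve flat but still shrinks the risk set, which makes the next death count for more.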

However, when we segment the sample by the stage of the cancer at diagnosis, we can clearly see that Stage III and Stage IV laryngeal cancer result in especially accelerated rates of death.

[Figures: survival curves by cancer stage, with panels for Stage III and Stage IV laryngeal cancer]

As shown here, even the most optimistic cases of Stage IV laryngeal cancer fail to do better than the baseline survival curve.

Conclusion

Survival analysis is just one tool for assessing the likelihood of an event over time. While we've shown a classic case of survival analysis, many product questions can be turned into survival questions:

What's the likelihood over time of product abandonment after creating an account?
On average, how long did it take for someone to cancel a product?
Is a segment more likely to churn than another one?

If you'd like to learn more about lifelines or survival analysis, Cameron Davidson-Pilon provides a more in-depth overview of survival analysis, its applications, and the library in this talk.

While we provided a brief overview of survival analysis in Python, other languages like R have mature survival analysis tools. For example, the laryngeal cancer data set used here comes from R's KMsurv package and was originally published in Klein and Moeschberger (1997), "Survival Analysis: Techniques for Censored and Truncated Data".