Survival analysis
From Wikipedia, the free encyclopedia
Survival analysis is a branch of statistics which deals with death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, and duration analysis or duration modeling in economics or sociology. More generally, survival analysis involves the modeling of time-to-event data; in this context, death or failure is considered an "event" in the survival analysis literature. Another example of time-to-event modeling is the time until former convicts commit a crime again after their release. In this case, the event of interest is committing a crime. Many concepts in survival analysis have been explained by counting process theory, which has emerged more recently. The advantage of a counting process is that it allows multiple (or recurrent) events to be modeled. This type of modeling fits many situations very well (e.g. people can go to jail multiple times, alcoholics can start and stop drinking multiple times, people can marry and divorce many times).
Survival analysis attempts to answer questions such as: what is the fraction of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the odds of survival?
To answer such questions, it is necessary to define "lifetime". In the case of biological survival, death is unambiguous, but for mechanical reliability, failure may not be well-defined, for there may well be mechanical systems in which failure is partial, a matter of degree, or not otherwise localized in time. Even in biological problems, some events (for example, heart attack or other organ failure) may have the same ambiguity. The theory outlined below assumes well-defined events at specific times; other cases may be better treated by models which explicitly account for ambiguous events.
The theory of survival presented here also assumes that death or failure happens just once for each subject. Recurring event or repeated event models relax that assumption. The study of recurring events is relevant in systems reliability, and in many areas of social sciences and medical research.
This article is phrased primarily in terms of biological survival, but this is just a convenience. An equivalent formulation in terms of mechanical failure can be made by replacing every occurrence of death with failure.
General formulation
Survival function
The object of primary interest is the survival function, conventionally denoted S, which is defined as

$$S(t) = \Pr(T > t)$$

where t is some time, T is a random variable denoting the time of death, and "Pr" stands for probability. That is: the survival function is the probability that the time of death is later than some specified time. The survival function is also called the survivor function or survivorship function in problems of biological survival, and the reliability function in mechanical survival problems. In the latter case, the reliability function is denoted R(t).
Usually one assumes S(0) = 1, although it could be less than 1 if there is the possibility of immediate death or failure.
The survival function must be non-increasing: S(u) ≤ S(t) if u > t. This property follows directly from S(t) being the integral of a non-negative function. This reflects the notion that survival at a later age is only possible if surviving all younger ages. Given this property, the lifetime distribution function and event density (F and f below) are well-defined.
The survival function is usually assumed to approach zero as age increases without bound, i.e., S(t) → 0 as t → ∞, although the limit could be greater than zero if eternal life is possible.
Lifetime distribution function and event density
Related quantities are defined in terms of the survival function. The lifetime distribution function, conventionally denoted F, is defined as the complement of the survival function,

$$F(t) = \Pr(T \le t) = 1 - S(t),$$

and the derivative of F (i.e., the density function of the lifetime distribution) is conventionally denoted f,

$$f(t) = \frac{d}{dt} F(t).$$

f is sometimes called the event density; it is the rate of death or failure events per unit time.
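The relations between S, F and f can be checked numerically. The following is a minimal sketch, assuming an exponential lifetime with a hypothetical rate of 0.5; the distribution, the rate and all function names are chosen only for illustration and do not come from the text.

```python
import math

RATE = 0.5  # hypothetical constant event rate, chosen only for illustration


def survival(t, rate=RATE):
    """S(t) = Pr(T > t) for an exponential lifetime (assumed model)."""
    return math.exp(-rate * t)


def lifetime_cdf(t, rate=RATE):
    """F(t) = Pr(T <= t) = 1 - S(t)."""
    return 1.0 - survival(t, rate)


def event_density(t, rate=RATE):
    """f(t) = dF/dt, written in closed form for the exponential case."""
    return rate * math.exp(-rate * t)


def numerical_derivative(func, t, h=1e-6):
    """Central-difference approximation of func'(t)."""
    return (func(t + h) - func(t - h)) / (2.0 * h)


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0, 5.0):
        # The last two columns should agree closely: f(t) and dF/dt.
        print(t, survival(t), lifetime_cdf(t), event_density(t),
              numerical_derivative(lifetime_cdf, t))
```

The printed columns illustrate that S is non-increasing, that F is its complement, and that a finite-difference derivative of F agrees with the closed-form density.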
Hazard function and cumulative hazard function
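The hazard function, conventionally denoted λ, is defined as the event rate at time t conditional on survival until time t or later,

$$\lambda(t)\,dt = \Pr(t \le T < t + dt \mid T \ge t) = \frac{f(t)\,dt}{S(t)} = -\frac{S'(t)\,dt}{S(t)}.$$
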
Force of mortality is a synonym of hazard function which is used particularly in demography and actuarial science. The term hazard rate is another synonym.
The hazard function must be nonnegative, λ(t) ≥ 0, and its integral over [0, ∞) must be infinite, but it is not otherwise constrained; the hazard function may be increasing or decreasing, nonmonotonic, or discontinuous. An example is the bathtub curve hazard function, which is large for small values of t, decreases to some minimum, and thereafter increases again; this can model the tendency of some mechanical systems to fail either soon after they enter operation or much later, as the system ages.
The hazard function can alternatively be represented in terms of the cumulative hazard function, conventionally denoted Λ:

$$\Lambda(t) = -\log S(t)$$

so

$$\frac{d}{dt} \Lambda(t) = -\frac{S'(t)}{S(t)} = \lambda(t).$$

Λ is called the cumulative hazard function because the preceding definitions together imply

$$\Lambda(t) = \int_0^t \lambda(u)\,du,$$

which is the "accumulation" of the hazard over time.
From Λ(t) = −log S(t) we see that Λ(t) increases without bound as t tends to infinity (assuming S(t) tends to zero). This implies that λ(t) must not decrease too quickly, since the cumulative hazard has to diverge. For example, exp(−t) is not the hazard function of any survival distribution, because its integral converges (to 1).
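The link between the hazard and the cumulative hazard can be illustrated numerically. The sketch below assumes a Weibull lifetime with S(t) = exp(−(rate·t)^shape); this parametrization, the parameter values and the helper names are all hypothetical. It integrates the hazard with a simple midpoint rule, compares the result with −log S(t), and shows that the integral of exp(−t) stays near 1 instead of diverging.

```python
import math

RATE = 0.5   # hypothetical Weibull rate parameter
SHAPE = 1.5  # hypothetical Weibull shape parameter


def survival(t):
    """S(t) = exp(-(RATE * t) ** SHAPE), an assumed Weibull model."""
    return math.exp(-((RATE * t) ** SHAPE))


def hazard(t):
    """lambda(t) = f(t) / S(t) = SHAPE * RATE * (RATE * t) ** (SHAPE - 1)."""
    return SHAPE * RATE * (RATE * t) ** (SHAPE - 1.0)


def integrate(func, a, b, steps=20000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return width * sum(func(a + (i + 0.5) * width) for i in range(steps))


def cumulative_hazard(t):
    """Lambda(t), obtained by accumulating the hazard from 0 to t."""
    return integrate(hazard, 0.0, t)


if __name__ == "__main__":
    t = 3.0
    # These two numbers should agree up to the integration error.
    print(cumulative_hazard(t), -math.log(survival(t)))
    # exp(-u) integrates to about 1 over a long range, so it cannot be a hazard.
    print(integrate(lambda u: math.exp(-u), 0.0, 50.0))
```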
Quantities derived from the survival distribution
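The future lifetime at a given time t0 is the time remaining until death; thus the future lifetime is T − t0 in the present notation. The expected future lifetime is the expected value of the future lifetime. The probability of death at or before t0 + t, given survival until t0, is just

$$\Pr(T \le t_0 + t \mid T > t_0) = \frac{\Pr(t_0 < T \le t_0 + t)}{\Pr(T > t_0)} = \frac{F(t_0 + t) - F(t_0)}{S(t_0)}.$$

Therefore the probability density of future lifetime is

$$\frac{d}{dt} \frac{F(t_0 + t) - F(t_0)}{S(t_0)} = \frac{f(t_0 + t)}{S(t_0)}$$

and the expected future lifetime is

$$\frac{1}{S(t_0)} \int_0^\infty t\, f(t_0 + t)\,dt = \frac{1}{S(t_0)} \int_{t_0}^\infty S(t)\,dt,$$

where the second expression follows from integration by parts.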
For t0 = 0, i.e., at birth, this reduces to the expected lifetime.
In reliability problems, the expected lifetime is called the mean time to failure, and the expected future lifetime is called the mean residual lifetime.
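As an illustration of the expected future lifetime (mean residual lifetime), the sketch below integrates t times the density of future lifetime, f(t0 + t)/S(t0), numerically. It assumes an exponential lifetime with a hypothetical rate of 0.5, for which the result should be close to 1/0.5 = 2 at every t0; the truncation point and step count are arbitrary choices.

```python
import math

RATE = 0.5  # hypothetical constant event rate


def survival(t):
    """S(t) for an exponential lifetime (assumed model)."""
    return math.exp(-RATE * t)


def event_density(t):
    """f(t) for the same exponential lifetime."""
    return RATE * math.exp(-RATE * t)


def future_lifetime_density(t, t0):
    """Density of the future lifetime T - t0, given survival until t0."""
    return event_density(t0 + t) / survival(t0)


def expected_future_lifetime(t0, upper=80.0, steps=40000):
    """Midpoint-rule approximation of the integral of t * density over [0, upper].

    The infinite upper limit is truncated at `upper`, which is adequate here
    because the exponential tail beyond it is negligible.
    """
    width = upper / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * width
        total += t * future_lifetime_density(t, t0)
    return total * width


if __name__ == "__main__":
    for t0 in (0.0, 1.0, 3.0):
        print(t0, expected_future_lifetime(t0))
```

With t0 = 0 the function returns the expected lifetime (the mean time to failure in reliability terms). For distributions other than the exponential, the value generally depends on t0.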
The probability of individual survival until t or later is S(t), by definition. The expected number of survivors, in a population of n individuals, is n × S(t), assuming the same survival function for all. Thus the expected proportion of survivors is S(t), and, if the individual lifetimes are independent, the variance of the proportion of survivors is S(t) × (1 − S(t))/n.
The age at which a specified proportion of survivors remain can be found by solving the equation S(t) = q for t, where q is the quantile in question. Typically one is interested in the median lifetime, for which q = 1/2, or other quantiles such as q = 0.90 or q = 0.99.
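Solving S(t) = q rarely needs a closed form; since S is non-increasing, bisection is enough. The sketch below assumes a Weibull survival function with hypothetical parameters and S(0) = 1; the function names are illustrative. It also includes the variance of the proportion of survivors, which holds for independent lifetimes.

```python
import math


def weibull_survival(t, rate=0.5, shape=1.5):
    """Assumed model: S(t) = exp(-(rate * t) ** shape), hypothetical parameters."""
    return math.exp(-((rate * t) ** shape))


def quantile_time(survival, q, tol=1e-10):
    """Find t with S(t) = q by bisection, assuming S is non-increasing and S(0) >= q."""
    lo, hi = 0.0, 1.0
    # Expand the bracket until the survival function has dropped to q or below.
    while survival(hi) > q:
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("S(t) does not appear to reach q")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if survival(mid) > q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def proportion_variance(s_t, n):
    """Variance of the proportion of survivors among n independent subjects."""
    return s_t * (1.0 - s_t) / n


if __name__ == "__main__":
    median = quantile_time(weibull_survival, 0.5)
    print("median lifetime:", median)
    print("time at which 90% remain:", quantile_time(weibull_survival, 0.90))
    print("variance of proportion at the median, n = 100:",
          proportion_variance(weibull_survival(median), 100))
```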
One can also make more complex inferences from the survival distribution. In mechanical reliability problems, one can bring cost (or utility, more generally) into consideration and solve problems concerning repair or replacement. See age-replacement problem, durability, renewal theory, and reliability theory of aging and longevity for further discussion of this topic.
Some survival distributions
Parametric survival models are constructed by choosing a specific probability distribution for the survival function. It is straightforward to phrase model fitting and analysis in general terms, using the concepts outlined below under Fitting parameters to data. Thus it is relatively easy to substitute one distribution for another, in order to study the consequences of different choices.
The choice of survival distribution expresses some particular information about the relation of time and any exogenous variables to survival. It is natural to choose a statistical distribution which has non-negative support since survival times are non-negative. There are several distributions commonly used in survival analysis, which are listed in the table below. Additional distributions can be found in the references.
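Distribution | Survival function S(t) |
---|---|
exponential (special case of Weibull, k = 1) | $\exp(-\lambda t)$ |
Weibull | $\exp(-(\lambda t)^{k})$ |
Gompertz | $\exp\left(-\frac{a}{b}\left(e^{bt} - 1\right)\right)$ |
Log-normal | $1 - \Phi\left(\frac{\ln t - \mu}{\sigma}\right)$ |
Log-logistic | $\frac{1}{1 + (\lambda t)^{k}}$ |
where Φ is the cumulative distribution function of the standard normal distribution. The formulas are given in one common parametrization, with λ, k, a, b, σ > 0; other references parametrize these distributions differently. In this table λ is a constant rate parameter, not the hazard function λ(t).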
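The survival functions of the distributions listed above can be written in a few lines each. The sketch below uses one common parametrization, which is an assumption and may differ from the one used in the references; all function and parameter names are illustrative. Φ is computed from the error function in Python's standard library.

```python
import math


def standard_normal_cdf(x):
    """Phi(x), the standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def exponential_survival(t, rate):
    """S(t) = exp(-rate * t); the Weibull case with shape = 1."""
    return math.exp(-rate * t)


def weibull_survival(t, rate, shape):
    """S(t) = exp(-(rate * t) ** shape)."""
    return math.exp(-((rate * t) ** shape))


def gompertz_survival(t, a, b):
    """S(t) = exp(-(a / b) * (exp(b * t) - 1)), i.e. hazard a * exp(b * t)."""
    return math.exp(-(a / b) * (math.exp(b * t) - 1.0))


def lognormal_survival(t, mu, sigma):
    """S(t) = 1 - Phi((ln t - mu) / sigma); S(t) = 1 for t <= 0."""
    if t <= 0.0:
        return 1.0
    return 1.0 - standard_normal_cdf((math.log(t) - mu) / sigma)


def loglogistic_survival(t, rate, shape):
    """S(t) = 1 / (1 + (rate * t) ** shape)."""
    return 1.0 / (1.0 + (rate * t) ** shape)


if __name__ == "__main__":
    for t in (0.0, 1.0, 2.0, 4.0):
        print(t,
              exponential_survival(t, 0.5),
              weibull_survival(t, 0.5, 1.5),
              gompertz_survival(t, 0.1, 0.3),
              lognormal_survival(t, 0.3, 0.8),
              loglogistic_survival(t, 0.5, 2.0))
```

Because every function takes time as its first argument, one distribution can be swapped for another when studying how the choice affects an analysis.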
Censoring
Censoring is a form of missing data problem which is common in survival analysis. Ideally, both the birth and death dates of a subject are known, in which case the lifetime is known. If it is known only that the date of death is after some date, this is called right censoring. Right censoring will occur for those subjects whose birth date is known but who are still alive when they are lost to follow-up or when the study ends. If a subject's lifetime is known to be less than a certain duration, the lifetime is said to be left-censored. It may also happen that subjects with a lifetime less than some threshold may not be observed at all: this is called truncation. Note that truncation is different from left censoring, since for a left censored datum, we know the subject exists, but for a truncated datum, we may be completely unaware of the subject. Truncation is also common. In a so-called delayed entry study, subjects are not observed at all until they have reached a certain age. For example, people may not be observed until they have reached the age to enter school. Any deceased subjects in the pre-school age group would be unknown.
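The difference between right censoring and truncation can be made concrete with a small simulation. The sketch below is illustrative only: the exponential lifetimes, the study end, the entry age and the function names are hypothetical. A right-censored subject stays in the data with a flag, whereas a truncated subject never appears at all.

```python
import random


def simulate_lifetimes(n, rate, seed=0):
    """Draw n exponential lifetimes (assumed model) with a fixed seed."""
    rng = random.Random(seed)
    return [rng.expovariate(rate) for _ in range(n)]


def right_censor(lifetimes, study_end):
    """Return (observed_time, event_observed) pairs.

    Subjects still alive at study_end are recorded at study_end with
    event_observed = False: we know only that death occurs later.
    """
    observations = []
    for lifetime in lifetimes:
        event_observed = lifetime <= study_end
        observed_time = lifetime if event_observed else study_end
        observations.append((observed_time, event_observed))
    return observations


def left_truncate(lifetimes, entry_age):
    """Drop subjects whose lifetime is below entry_age; they are never seen."""
    return [lifetime for lifetime in lifetimes if lifetime >= entry_age]


if __name__ == "__main__":
    lifetimes = simulate_lifetimes(10, rate=0.2)
    entered = left_truncate(lifetimes, entry_age=1.0)
    for observed_time, event_observed in right_censor(entered, study_end=5.0):
        print(round(observed_time, 3), event_observed)
```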
Fitting parameters to data
Survival models can be usefully viewed as ordinary regression models in which the response variable is time. However, computing the likelihood function (needed for fitting parameters or making other kinds of inferences) is complicated by the censoring. The likelihood function for a survival model, in the presence of censored data, is formulated as follows. By definition the likelihood function is the joint probability of the data given the parameters of the model. It is customary to assume that the data are independent given the parameters. Then the likelihood function is the product of the likelihood of each datum. It is convenient to partition the data into four categories: uncensored, left censored, right censored, and interval censored. These are denoted "unc.", "l.c.", "r.c.", and "i.c." in the equation below.

$$L(\theta) = \prod_{i \in \text{unc.}} \Pr(T = T_i \mid \theta) \prod_{i \in \text{l.c.}} \Pr(T < T_i \mid \theta) \prod_{i \in \text{r.c.}} \Pr(T > T_i \mid \theta) \prod_{i \in \text{i.c.}} \Pr(T_{i,r} < T < T_{i,l} \mid \theta)$$

For an uncensored datum, with Ti equal to the age at death, we have

$$\Pr(T = T_i \mid \theta) = f(T_i \mid \theta).$$

For a left censored datum, such that the age at death is known to be less than Ti, we have

$$\Pr(T < T_i \mid \theta) = F(T_i \mid \theta) = 1 - S(T_i \mid \theta).$$

For a right censored datum, such that the age at death is known to be greater than Ti, we have

$$\Pr(T > T_i \mid \theta) = 1 - F(T_i \mid \theta) = S(T_i \mid \theta).$$

For an interval censored datum, such that the age at death is known to be greater than Ti,r and less than Ti,l, we have

$$\Pr(T_{i,r} < T < T_{i,l} \mid \theta) = S(T_{i,r} \mid \theta) - S(T_{i,l} \mid \theta).$$
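The four contributions above can be combined into a log-likelihood. The sketch below is a minimal illustration, not a fitting routine: the record layout, the tags "unc", "lc", "rc" and "ic", the sample data and the exponential model are all assumptions made for the example. For an exponential model with only uncensored and right-censored data, the maximum-likelihood rate has the closed form (number of observed events) / (total observed time), which the code uses as a check.

```python
import math


def log_likelihood(data, density, survival):
    """Sum the log-likelihood contributions of independent data.

    Each record is ("unc", t), ("lc", t), ("rc", t) or ("ic", lower, upper).
    """
    total = 0.0
    for record in data:
        kind = record[0]
        if kind == "unc":    # death observed exactly at t: f(t)
            total += math.log(density(record[1]))
        elif kind == "lc":   # death known to occur before t: 1 - S(t)
            total += math.log(1.0 - survival(record[1]))
        elif kind == "rc":   # death known to occur after t: S(t)
            total += math.log(survival(record[1]))
        elif kind == "ic":   # death between lower and upper: S(lower) - S(upper)
            total += math.log(survival(record[1]) - survival(record[2]))
        else:
            raise ValueError("unknown record type: %r" % (kind,))
    return total


def exponential_log_likelihood(rate, data):
    """Log-likelihood under an assumed exponential model with the given rate."""
    return log_likelihood(data,
                          lambda t: rate * math.exp(-rate * t),
                          lambda t: math.exp(-rate * t))


def exponential_mle_right_censored(data):
    """Closed-form MLE of the rate; valid only for "unc" and "rc" records."""
    if any(record[0] not in ("unc", "rc") for record in data):
        raise ValueError("closed form requires only 'unc' and 'rc' records")
    events = sum(1 for record in data if record[0] == "unc")
    exposure = sum(record[1] for record in data)
    return events / exposure


if __name__ == "__main__":
    # Hypothetical sample: three observed deaths, two subjects censored at 5.0.
    sample = [("unc", 2.0), ("unc", 3.5), ("rc", 5.0), ("unc", 1.2), ("rc", 5.0)]
    rate_hat = exponential_mle_right_censored(sample)
    print("estimated rate:", rate_hat)
    print("log-likelihood at estimate:", exponential_log_likelihood(rate_hat, sample))
```

For models without a closed-form maximum, the same log_likelihood function could be passed to a numerical optimizer.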
See also
- Kaplan-Meier estimator
- Reliability theory
- Proportional hazards models
- Accelerated failure time model
- Failure rate
- Logrank test
- Survival function
- MTBF
- Censoring (statistics)
- Maximum likelihood
References
- David Collett. Modelling Survival Data in Medical Research, Second Edition. Boca Raton: Chapman & Hall/CRC. 2003.
- Regina Elandt-Johnson and Norman Johnson. Survival Models and Data Analysis. New York: John Wiley & Sons. 1980/1999.
- Jerald F. Lawless. Statistical Models and Methods for Lifetime Data, 2nd edition. John Wiley and Sons, Hoboken. 2003.
- Terry Therneau. "A Package for Survival Analysis in S". http://www.mayo.edu/hsr/people/therneau/survival.ps, at: http://mayoresearch.mayo.edu/mayo/research/biostat/therneau.cfm
- "Engineering Statistics Handbook", NIST/SEMATECH.