Poisson regression

In statistics, Poisson regression is a form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modelled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.

In the simplest case with a single independent variable x, the model takes the form:

\log (\operatorname{E}(Y))=a+bx.\,

If Y_i are independent observations with corresponding values x_i of the predictor variable, then a and b can be estimated by maximum likelihood if the number of distinct x values is at least 2. The maximum-likelihood estimates lack a closed-form expression and must be found by numerical methods.
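One numerical method that can be used is Newton's method applied to the log-likelihood. The following is a minimal sketch for the single-predictor model only, not a reference implementation; the function name fit_poisson, the starting values, the stopping rule and the example counts are all illustrative assumptions.

```python
import math

def fit_poisson(x, y, tol=1e-10, max_iter=100):
    """Maximum-likelihood estimates (a, b) for log E(Y) = a + b*x."""
    if len(set(x)) < 2:
        raise ValueError("need at least two distinct x values")
    # Start from the intercept-only fit: a = log(mean of y), b = 0.
    # (Assumes the counts are not all zero.)
    a, b = math.log(sum(y) / len(y)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(a + b * xi) for xi in x]
        # Score vector: gradient of the log-likelihood in (a, b).
        g_a = sum(yi - mi for yi, mi in zip(y, mu))
        g_b = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian of the log-likelihood (2x2, symmetric).
        h_aa = sum(mu)
        h_ab = sum(xi * mi for xi, mi in zip(x, mu))
        h_bb = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 linear system by Cramer's rule.
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

# Made-up counts: two observations at x = 0 and two at x = 1.
a_hat, b_hat = fit_poisson([0, 0, 1, 1], [2, 4, 6, 12])
print(a_hat, b_hat, math.exp(b_hat))
```

Because this toy data set has only two distinct x values, the fitted means reproduce the two group means (3 and 9), so exp(b) estimates a threefold increase in the expected count per unit increase in x.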

Poisson regression models are generalized linear models with the logarithm as the (canonical) link function and the Poisson distribution as the assumed probability distribution of the response.
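One way to see why the logarithm is the canonical link is to write the Poisson probability mass function in exponential-family form. Here \mu is shorthand, introduced only for this illustration, for \operatorname{E}(Y):

\Pr(Y=y)=\frac{\mu^{y}e^{-\mu}}{y!}=\exp\left(y\log\mu-\mu-\log y!\right)

The natural parameter multiplying y is \log\mu, which is the quantity the model sets equal to a+bx.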

Poisson regression in practice

Poisson regression is appropriate when the dependent variable is a count, for instance of events such as the arrival of a telephone call at a call centre. The events must be independent in the sense that the arrival of one call will not make another more or less likely, but the probability per unit time of events is understood to be related to covariates such as time of day.

"Exposure" and offset

Poisson regression is also appropriate for rate data, where the rate is a count of events occurring to a particular unit of observation, divided by some measure of that unit's exposure. For example, biologists may count the number of tree species in a forest, and the rate would be the number of species per square kilometre. Demographers may model death rates in geographic areas as the count of deaths divided by person-years. More generally, event rates can be calculated as events per unit time, which allows the observation window to vary for each unit. In these examples, exposure is respectively unit area, person-years and unit time. In Poisson regression this is handled as an offset: the logarithm of the exposure variable enters on the right-hand side of the equation, but with its coefficient constrained to 1 rather than estimated.

\log{(\operatorname{E}(Y))} = \log{(\mbox{exposure})} + a+bx

which implies

\log{(\operatorname{E}(Y))} - \log{(\mbox{exposure})} = 
       \log{\left(\frac{\operatorname{E}(Y)}{\mbox{exposure}}\right)} = a+bx
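The offset can be illustrated with a short sketch that adds a fixed log(exposure) term to the linear predictor and estimates only a and b by Newton's method. The function name fit_poisson_rate, the starting values and the data are assumptions made for illustration, not part of any particular package.

```python
import math

def fit_poisson_rate(x, y, exposure, tol=1e-10, max_iter=100):
    """Estimates (a, b) for log E(Y) = log(exposure) + a + b*x."""
    # The offset is known in advance; its coefficient is fixed at 1.
    offset = [math.log(e) for e in exposure]
    # Start from the overall rate: a = log(total events / total exposure).
    a, b = math.log(sum(y) / sum(exposure)), 0.0
    for _ in range(max_iter):
        mu = [math.exp(o + a + b * xi) for o, xi in zip(offset, x)]
        # Score vector in (a, b); the offset contributes no parameter.
        g_a = sum(yi - mi for yi, mi in zip(y, mu))
        g_b = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Negative Hessian of the log-likelihood (2x2, symmetric).
        h_aa = sum(mu)
        h_ab = sum(xi * mi for xi, mi in zip(x, mu))
        h_bb = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h_aa * h_bb - h_ab * h_ab
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < tol:
            break
    return a, b

# Made-up data: event counts observed over differing exposures.
x = [0, 0, 1, 1]
y = [2, 4, 6, 12]
exposure = [10.0, 20.0, 10.0, 20.0]
a_hat, b_hat = fit_poisson_rate(x, y, exposure)
print(math.exp(a_hat), math.exp(b_hat))  # baseline rate, rate ratio
```

In this toy data set the pooled rates are 6/30 = 0.2 events per unit of exposure at x = 0 and 18/30 = 0.6 at x = 1, so exp(a) and exp(b) recover a baseline rate of 0.2 and a rate ratio of 3.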

Overdispersion

A characteristic of the Poisson distribution is that its mean is equal to its variance. In practice the observed variance is sometimes greater than the mean; this is known as overdispersion and indicates that the model is not appropriate. A common reason is the omission of relevant explanatory variables.
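A common informal check for overdispersion is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom. It should be close to 1 if the Poisson variance assumption holds, and values well above 1 suggest overdispersion. The sketch below assumes the fitted means are already available from some fitting routine; the function name and the numbers are illustrative.

```python
def pearson_dispersion(y, mu, n_params):
    """Pearson chi-square statistic divided by residual degrees of freedom.

    y        observed counts
    mu       fitted means from a Poisson regression (assumed given)
    n_params number of estimated regression parameters
    """
    # Under the Poisson model Var(Y_i) = mu_i, so each term has
    # expectation close to 1.
    chi2 = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return chi2 / (len(y) - n_params)

# Made-up example: an intercept-only fit gives mu = 5 for every
# observation, but the counts are far more spread out than that implies.
y = [0, 10, 0, 10]
mu = [5.0, 5.0, 5.0, 5.0]
print(pearson_dispersion(y, mu, n_params=1))
```

For these numbers the statistic is 20/3, roughly 6.7, far above the value of about 1 expected under a Poisson model.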

Another common problem with Poisson regression is excess zeros: if there are two processes at work, one determining whether there are zero events or any events, and a Poisson process determining how many events there are, there will be more zeros than a Poisson regression would predict. An example would be the distribution of cigarettes smoked in an hour by members of a group where some individuals are non-smokers.
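Excess zeros can be checked informally by comparing the number of observed zeros with the number a fitted Poisson model would predict, using the fact that a Poisson variable with mean mu equals zero with probability exp(-mu). The following sketch assumes the fitted means are already available; the function name and the data are illustrative.

```python
import math

def zero_counts(y, mu):
    """Return (observed zeros, zeros expected under the fitted Poisson model)."""
    observed = sum(1 for yi in y if yi == 0)
    # P(Y_i = 0) = exp(-mu_i) for a Poisson variable with mean mu_i.
    expected = sum(math.exp(-mi) for mi in mu)
    return observed, expected

# Made-up example in the spirit of the smoking illustration: three
# non-smokers (zeros) and two smokers, with an intercept-only fit
# giving mu equal to the sample mean of 1.6.
y = [0, 0, 0, 4, 4]
mu = [1.6] * len(y)
observed, expected = zero_counts(y, mu)
print(observed, expected)
```

Here three zeros are observed while the fitted model predicts about one (5 × exp(−1.6)), the kind of gap that points to a separate zero-generating process.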

Other generalized linear models such as the negative binomial model may function better in these cases.

Use in survival analysis

Algorithms and software for Poisson regression are sometimes used as a computational shortcut in survival analysis: see proportional hazards models.
