Statistical model

A statistical model is a mathematical model that embodies a set of assumptions about how some sample data, and similar data from a larger population, are generated. A statistical model represents, often in considerably idealized form, the data-generating process.

The assumptions embodied by a statistical model describe a set of probability distributions, some of which are assumed to adequately approximate the distribution from which a particular data set is sampled. These probability distributions are what distinguish statistical models from other, non-statistical, mathematical models.

A statistical model is usually specified by mathematical equations that relate one or more random variables and possibly other non-random variables. As such, a statistical model is "a formal representation of a theory" (Herman Adèr quoting Kenneth Bollen).^[1]

All statistical hypothesis tests and all statistical estimators are derived from statistical models. More generally, statistical models are part of the foundation of statistical inference.

Formal definition

In mathematical terms, a statistical model is usually thought of as a pair ( $S,{\mathcal {P}}$ ), where $S$ is the set of possible observations, i.e. the sample space, and ${\mathcal {P}}$ is a set of probability distributions on $S$ .^[2]

The intuition behind this definition is as follows. It is assumed that there is a "true" probability distribution induced by the process that generates the observed data. We choose ${\mathcal {P}}$ to represent a set (of distributions) which contains a distribution that adequately approximates the true distribution. Note that we do not require that ${\mathcal {P}}$ contains the true distribution, and in practice that is rarely the case. Indeed, as Burnham & Anderson state, "A model is a simplification or approximation of reality and hence will not reflect all of reality"^[3]—whence the saying "all models are wrong".

The set ${\mathcal {P}}$ is almost always parameterized: ${\mathcal {P}}=\{P_{{\theta }}:\theta \in \Theta \}$ . The set $\Theta$ defines the parameters of the model. A parameterization is generally required to have distinct parameter values give rise to distinct distributions, i.e. $P_{{\theta _{1}}}=P_{{\theta _{2}}}\Rightarrow \theta _{1}=\theta _{2}$ must hold (in other words, the map $\theta \mapsto P_{{\theta }}$ must be injective). A parameterization that meets the condition is said to be identifiable.^[2]
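The identifiability condition can be illustrated with a small numerical sketch. The two parameterizations below, and all function names, are illustrative assumptions rather than part of the cited definition: one indexes Gaussians by (mean, standard deviation), the other by a pair (a, b) of which only the sum a + b enters the distribution. A check on a finite grid can expose a failure of injectivity, but it cannot prove identifiability.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def density_identifiable(x, theta):
    """theta = (mu, sigma): distinct values of theta give distinct densities."""
    mu, sigma = theta
    return normal_pdf(x, mu, sigma)

def density_non_identifiable(x, theta):
    """theta = (a, b): only a + b matters, so (a, b) cannot be recovered."""
    a, b = theta
    return normal_pdf(x, a + b, 1.0)

# Evaluation points used for the (purely numerical) comparison.
grid = [i / 10.0 for i in range(-50, 51)]

def same_on_grid(density, theta_1, theta_2):
    """True if the two parameter values give matching densities on the grid."""
    return all(
        math.isclose(density(x, theta_1), density(x, theta_2), rel_tol=1e-12)
        for x in grid
    )

# Different parameter values, same distribution: injectivity fails.
collides = same_on_grid(density_non_identifiable, (1.0, 2.0), (2.0, 1.0))

# Different parameter values, visibly different distributions.
distinct = not same_on_grid(density_identifiable, (1.0, 2.0), (2.0, 1.0))
```

Here `collides` is true because (1, 2) and (2, 1) both yield a Gaussian with mean 3, so the second parameterization is not identifiable.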

An example

Height and age are each probabilistically distributed over humans. They are stochastically related: when we know that a person is of age 10, this influences the chance of the person being 5 feet tall. We could formalize that relationship in a linear regression model with the following form: height_i = b₀ + b₁age_i + ε_i, where b₀ is the intercept, b₁ is a parameter that age is multiplied by to get a prediction of height, ε is the error term, and i identifies the person. This implies that height is predicted by age, with some error.

An admissible model must be consistent with all the data points. Thus, the straight line (height_i = b₀ + b₁age_i) is not a model of the data. The line cannot be a model, unless it exactly fits all the data points—i.e. all the data points lie perfectly on a straight line. The error term, ε_i, must be included in the model, so that the model is consistent with all the data points.

To do statistical inference, we would first need to assume some probability distributions for the ε_i. For instance, we might assume that the ε_i are i.i.d. Gaussian, with zero mean. In this instance, the model would have 3 parameters: b₀, b₁, and the variance of the Gaussian distribution.
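As a minimal sketch of this model, the code below simulates heights from height_i = b₀ + b₁age_i + ε_i with i.i.d. zero-mean Gaussian errors and then estimates the three parameters. The numerical values (heights in centimetres, the age range, the sample size) and the function names are assumptions chosen only for illustration. Under the Gaussian assumption, the least-squares estimates of b₀ and b₁ coincide with the maximum-likelihood estimates.

```python
import random

def simulate_heights(b0, b1, sigma, ages, rng):
    """Draw height_i = b0 + b1 * age_i + eps_i with eps_i ~ N(0, sigma^2) i.i.d."""
    return [b0 + b1 * age + rng.gauss(0.0, sigma) for age in ages]

def fit_linear_model(xs, ys):
    """Estimate intercept, slope, and error variance for a simple linear model."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1_hat = sxy / sxx                      # slope
    b0_hat = y_bar - b1_hat * x_bar         # intercept
    residuals = [y - (b0_hat + b1_hat * x) for x, y in zip(xs, ys)]
    sigma2_hat = sum(r * r for r in residuals) / n  # ML estimate of the variance
    return b0_hat, b1_hat, sigma2_hat

rng = random.Random(0)                      # fixed seed for reproducibility
ages = [rng.uniform(2.0, 18.0) for _ in range(2000)]

# Hypothetical "true" parameters, used only to generate example data.
heights = simulate_heights(b0=80.0, b1=5.0, sigma=6.0, ages=ages, rng=rng)

b0_hat, b1_hat, sigma2_hat = fit_linear_model(ages, heights)
```

With a sample of this size, the estimates should land close to the generating values (80, 5, and 36 for the variance), although any single run carries sampling error.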

We can formally specify the model in the form ( $S,{\mathcal {P}}$ ) as follows. The sample space, $S$ , of our model comprises the set of all possible pairs (age, height). Each possible value of $\theta$ = (b₀, b₁, σ²) determines a distribution on $S$ ; denote that distribution by $P_{{\theta }}$ . If $\Theta$ is the set of all possible values of $\theta$ , then ${\mathcal {P}}=\{P_{{\theta }}:\theta \in \Theta \}$ . (The parameterization is identifiable, and this is easy to check.)

In this example, the model is determined by (1) specifying $S$ and (2) making some assumptions relevant to ${\mathcal {P}}$ . There are two assumptions: that height can be approximated by a linear function of age; that errors in the approximation are distributed as i.i.d. Gaussian. The assumptions are sufficient to specify ${\mathcal {P}}$ —as they are required to do.

General remarks

A statistical model is a special class of mathematical model. What distinguishes a statistical model from other mathematical models is that a statistical model is non-deterministic. Thus, in a statistical model specified via mathematical equations, some of the variables do not have specific values, but instead have probability distributions; i.e. some of the variables are stochastic. In the example above, ε is a stochastic variable; without that variable, the model would be deterministic.

Statistical models are often used even when the physical process being modeled is deterministic. For instance, coin tossing is, in principle, a deterministic process; yet it is commonly modeled as stochastic (via a Bernoulli process).
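A Bernoulli-process model of coin tossing can be sketched in a few lines. The probability of heads `p`, the number of tosses, and the function name are illustrative assumptions; the proportion of heads is the maximum-likelihood estimate of `p` under this model.

```python
import random

def toss_coins(p, n, rng):
    """Simulate n independent Bernoulli(p) tosses; 1 stands for heads."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(1)                  # fixed seed for reproducibility
tosses = toss_coins(p=0.5, n=10_000, rng=rng)

# Under the Bernoulli model, the sample proportion estimates p.
p_hat = sum(tosses) / len(tosses)
```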

There are three purposes for a statistical model, according to Konishi & Kitagawa.^[4]

Predictions
Extraction of information
Description of stochastic structures

Dimension of a model

Suppose that we have a statistical model ( $S,{\mathcal {P}}$ ) with ${\mathcal {P}}=\{P_{{\theta }}:\theta \in \Theta \}$ . The model is said to be parametric if $\Theta$ has a finite dimension. In notation, we write that $\Theta \subseteq {\mathbb {R}}^{d}$ where $d$ is a positive integer ( $\mathbb {R}$ denotes the real numbers; other sets can be used, in principle). Here, $d$ is called the dimension of the model.

As an example, if we assume that data arise from a univariate Gaussian distribution, then we are assuming that

$${\mathcal {P}}=\left\{P_{\mu ,\sigma }(x)\equiv {\frac {1}{{\sqrt {2\pi }}\,\sigma }}\exp \left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right):\mu \in {\mathbb {R}},\ \sigma >0\right\}$$

In this example, the dimension, $d$ , equals 2.
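To make the dimension concrete, the sketch below fits the univariate Gaussian model by maximum likelihood and returns the estimate as a vector with one entry per parameter. The data-generating values and the function name are assumptions for illustration only.

```python
import math
import random

def gaussian_mle(xs):
    """Maximum-likelihood estimate of theta = (mu, sigma) for i.i.d. Gaussian data."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / n)
    return (mu_hat, sigma_hat)

rng = random.Random(4)                  # fixed seed for reproducibility
sample = [rng.gauss(170.0, 8.0) for _ in range(5000)]  # hypothetical data

theta_hat = gaussian_mle(sample)
d = len(theta_hat)                      # number of free parameters: 2
```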

As another example, suppose that the data consists of points ( $x$ , $y$ ) that we assume are distributed according to a straight line with i.i.d. Gaussian residuals (with zero mean). Then the dimension of the statistical model is 3: the intercept of the line, the slope of the line, and the variance of the distribution of the residuals. (Note that in geometry, a straight line has dimension 1.)

A statistical model is nonparametric if the parameter set $\Theta$ is infinite dimensional. A statistical model is semiparametric if it has both finite-dimensional and infinite-dimensional parameters. Formally, if $d$ is the dimension of $\Theta$ and $n$ is the number of samples, both semiparametric and nonparametric models have $d \rightarrow \infty$ as $n\rightarrow \infty$ . If $d/n\rightarrow 0$ as $n\rightarrow \infty$ , then the model is semiparametric; otherwise, the model is nonparametric.

Parametric models are by far the most commonly used statistical models. Regarding semiparametric and nonparametric models, Sir David Cox has said, "These typically involve fewer assumptions of structure and distributional form but usually contain strong assumptions about independencies".^[5]

Nested models

Two statistical models are nested if the first model can be transformed into the second model by imposing constraints on the parameters of the first model. For example, the set of all Gaussian distributions has, nested within it, the set of zero-mean Gaussian distributions: we constrain the mean in the set of all Gaussian distributions to get the zero-mean distributions.

In that example, the first model has a higher dimension than the second model (the zero-mean model has dimension 1). Such is usually, but not always, the case. As a different example, the set of positive-mean Gaussian distributions, which has dimension 2, is nested within the set of all Gaussian distributions.
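The nesting of the zero-mean Gaussian model inside the set of all Gaussian distributions can be illustrated by fitting both to the same data. This is a sketch under assumed data and hypothetical function names. Because the restricted model is a subset of the full one, its maximized log-likelihood can never exceed that of the full model.

```python
import math
import random

def gaussian_log_likelihood(xs, mu, sigma2):
    """Log-likelihood of i.i.d. Gaussian data with mean mu and variance sigma2."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - squared_error / (2.0 * sigma2)

def fit_full(xs):
    """ML fit over all Gaussians: mean and variance both free (dimension 2)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma2 = sum((x - mu) ** 2 for x in xs) / n
    return mu, sigma2

def fit_zero_mean(xs):
    """ML fit over zero-mean Gaussians: mean constrained to 0 (dimension 1)."""
    return 0.0, sum(x * x for x in xs) / len(xs)

rng = random.Random(2)                  # fixed seed for reproducibility
data = [rng.gauss(0.4, 1.0) for _ in range(500)]  # hypothetical data, mean not 0

ll_full = gaussian_log_likelihood(data, *fit_full(data))
ll_restricted = gaussian_log_likelihood(data, *fit_zero_mean(data))

# Likelihood-ratio statistic comparing the nested pair; never negative.
lr_statistic = 2.0 * (ll_full - ll_restricted)
```

Under standard regularity conditions, and when the constraint actually holds, this statistic is approximately chi-squared distributed with one degree of freedom, the difference between the two dimensions.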

Comparing models

It is assumed that there is a "true" probability distribution underlying the observed data, induced by the process that generated the data. The main goal of model selection is to make statements about which elements of ${\mathcal {P}}$ are most likely to adequately approximate the true distribution.

Models can be compared to each other by exploratory data analysis or confirmatory data analysis. In exploratory analysis, a variety of models are formulated and an assessment is performed of how well each one describes the data. In confirmatory analysis, a previously formulated model or models are compared to the data. Common criteria for comparing models include R², Bayes factor, and the likelihood-ratio test together with its generalization relative likelihood.
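As one hedged example of such a comparison, the sketch below computes R² for two candidate models of the same simulated data: a constant (intercept-only) model and a straight line. The data, parameter values, and function names are assumptions for illustration. R² is only one of the criteria listed, and it does not penalize model dimension.

```python
import random

def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def fit_line(xs, ys):
    """Least-squares intercept and slope for a straight-line model."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope

rng = random.Random(3)                  # fixed seed for reproducibility
xs = [rng.uniform(0.0, 10.0) for _ in range(300)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.5) for x in xs]  # hypothetical data

# Model 1: predict the sample mean everywhere (R^2 is 0 by construction).
y_bar = sum(ys) / len(ys)
r2_constant = r_squared(ys, [y_bar] * len(ys))

# Model 2: straight line fitted by least squares.
b0, b1 = fit_line(xs, ys)
r2_line = r_squared(ys, [b0 + b1 * x for x in xs])
```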

Konishi & Kitagawa state: "The majority of the problems in statistical inference can be considered to be problems related to statistical modeling. They are typically formulated as comparisons of several statistical models."^[6] Relatedly, Sir David Cox has said, "How [the] translation from subject-matter problem to statistical model is done is often the most critical part of an analysis".^[7]

References

Adèr, H.J. (2008), "Modelling", in Adèr, H.J.; Mellenbergh, G.J., Advising on Research Methods: a consultant's companion, Huizen, The Netherlands: Johannes van Kessel Publishing, pp. 271–304.
Burnham, K. P.; Anderson, D. R. (2002), Model Selection and Multimodel Inference (2nd ed.), Springer-Verlag, ISBN 0-387-95364-7.
Cox, D.R. (2006), Principles of Statistical Inference, Cambridge University Press.
Konishi, S.; Kitagawa, G. (2008), Information Criteria and Statistical Modeling, Springer.
McCullagh, P. (2002), "What is a statistical model?", Annals of Statistics, 30: 1225–1310, doi:10.1214/aos/1035844977.

Further reading

Davison A.C. (2008), Statistical Models, Cambridge University Press.
Freedman D.A. (2009), Statistical Models, Cambridge University Press.
Helland I.S. (2010), Steps Towards a Unified Basis for Scientific Models and Methods, World Scientific.
Kroese D.P., Chan J.C.C. (2014), Statistical Modeling and Computation, Springer.
Stapleton J.H. (2007), Models for Probability and Statistical Inference, Wiley-Interscience.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.