Bayesian information criterion

To describe a particular dataset, one can use non-parametric or parametric methods. With parametric methods there may be several candidate models, each with a different number of parameters. The number of parameters plays an important role: the likelihood of the training data increases as parameters are added, but too many parameters can lead to overfitting. The Bayesian information criterion (BIC) is a statistical criterion for model selection that addresses this trade-off by penalizing the number of parameters.

The BIC is sometimes also called the Schwarz criterion or Schwarz information criterion (SIC). It is so named because Gideon E. Schwarz (1978) gave a Bayesian argument for adopting it.

Mathematically

The BIC is an asymptotic result derived under the assumption that the data distribution is in the exponential family. Let:

  • n = the number of observations, or equivalently, the sample size;
  • k = the number of free parameters to be estimated. If the estimated model is a linear regression, k is the number of regressors, including the constant;
  • L = the maximized value of the likelihood function for the estimated model.

The formula for the BIC is:

\mathrm{BIC} = -2 \ln L + k \ln(n).
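
A minimal sketch of the computation (Python; the log-likelihood value, parameter count, and sample size below are hypothetical):

  import math

  def bic(log_likelihood, k, n):
      # BIC = -2*ln(L) + k*ln(n), where log_likelihood is ln(L).
      return -2.0 * log_likelihood + k * math.log(n)

  # Hypothetical fitted model: ln(L) = -120.5, 3 free parameters, 50 observations.
  print(bic(-120.5, k=3, n=50))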

Under the assumption that the model errors or disturbances are normally distributed, this becomes (up to an additive constant, which depends only on n and not on the model):

\mathrm{BIC} = n \ln\left(\frac{\mathrm{RSS}}{n}\right) + k \ln(n),

where RSS is the residual sum of squares from the estimated model.
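
As a sketch with synthetic data (Python with NumPy; the data-generating line and noise level are made up for illustration), the RSS form can be computed for an ordinary least-squares fit and compared with the -2 ln(L) form; with Gaussian errors and the error variance profiled out, the two differ only by n(ln(2π) + 1), which depends on n alone:

  import numpy as np

  rng = np.random.default_rng(0)
  n = 100
  x = rng.uniform(-3, 3, n)
  y = 1.5 + 2.0 * x + rng.normal(0.0, 1.0, n)   # straight line plus Gaussian noise

  # Ordinary least-squares fit of y = b0 + b1*x; k counts both coefficients.
  X = np.column_stack([np.ones(n), x])
  coef, *_ = np.linalg.lstsq(X, y, rcond=None)
  rss = np.sum((y - X @ coef) ** 2)
  k = X.shape[1]

  bic_rss = n * np.log(rss / n) + k * np.log(n)

  # Gaussian log-likelihood at the MLE, with the error variance estimated as RSS/n.
  log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1.0)
  bic_full = -2.0 * log_lik + k * np.log(n)

  # The difference is n*(ln(2*pi) + 1): a constant that does not affect model ranking.
  print(bic_rss, bic_full, bic_full - bic_rss, n * (np.log(2 * np.pi) + 1.0))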

Given any two estimated models, the model with the lower value of BIC is the one to be preferred. The BIC is an increasing function of RSS and an increasing function of k. That is, unexplained variation in the dependent variable and the number of explanatory variables increase the value of BIC. Hence, lower BIC implies either fewer explanatory variables, better fit, or both. The BIC penalizes free parameters more strongly than does the Akaike information criterion.

It is important to keep in mind that the BIC can be used to compare estimated models only when the numerical values of the dependent variable are identical for all estimates being compared. The models being compared need not be nested, unlike the case when models are being compared using an F or likelihood ratio test.
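
As an illustration of such a comparison (Python with NumPy; the data and candidate models are synthetic and chosen only for the example), two fits of the same response can be ranked by their RSS-form BIC values:

  import numpy as np

  def rss_bic(X, y):
      # RSS-form BIC for an ordinary least-squares fit with design matrix X.
      n, k = X.shape
      coef, *_ = np.linalg.lstsq(X, y, rcond=None)
      rss = np.sum((y - X @ coef) ** 2)
      return n * np.log(rss / n) + k * np.log(n)

  rng = np.random.default_rng(1)
  x = rng.uniform(-3, 3, 100)
  y = 1.5 + 2.0 * x + rng.normal(0.0, 1.0, 100)   # data generated by a straight line

  ones = np.ones_like(x)
  bic_linear = rss_bic(np.column_stack([ones, x]), y)
  bic_quadratic = rss_bic(np.column_stack([ones, x, x ** 2]), y)

  # The x**2 term barely reduces RSS, so its extra ln(n) penalty typically
  # leaves the simpler linear model with the lower (preferred) BIC.
  print(bic_linear, bic_quadratic)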

Characteristics of the Bayesian information criterion

  1. It is independent of the prior.
  2. It can measure the efficiency of the parameterized model in terms of predicting the data.
  3. It penalizes the complexity of the model, where complexity refers to the number of parameters in the model.
  4. In a common formulation it is equal, up to sign, to the minimum description length (MDL) criterion.
  5. It can be used to choose the number of clusters according to the intrinsic complexity present in a particular dataset (see the sketch after this list).
  6. It is closely related to other penalized likelihood criteria such as the RIC and the AIC.
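
As a sketch of point 5 (Python; this assumes scikit-learn is available and uses synthetic one-dimensional data drawn from two clusters), the number of mixture components can be chosen by minimizing the BIC:

  import numpy as np
  from sklearn.mixture import GaussianMixture

  # Synthetic data: two well-separated Gaussian clusters.
  rng = np.random.default_rng(0)
  X = np.concatenate([rng.normal(-4.0, 1.0, 200),
                      rng.normal(4.0, 1.0, 200)]).reshape(-1, 1)

  # Fit mixtures with 1..5 components and keep the one with the lowest BIC.
  bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
          for k in range(1, 6)}
  print(bics, min(bics, key=bics.get))   # the minimizer is expected to be 2 here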

Applications

BIC has been widely used for model identification in time series and linear regression.
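
As a sketch of BIC-based order selection for an autoregressive time series model (Python with NumPy; the simulated AR(2) process and the range of candidate orders are made up for illustration):

  import numpy as np

  rng = np.random.default_rng(0)
  T = 400
  y = np.zeros(T)
  for t in range(2, T):                        # simulate an AR(2) process
      y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

  def ar_bic(y, p, max_p):
      # RSS-form BIC of an AR(p) least-squares fit on a common effective sample.
      Y = y[max_p:]
      n = len(Y)
      X = np.column_stack([np.ones(n)] + [y[max_p - i:-i] for i in range(1, p + 1)])
      coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
      rss = np.sum((Y - X @ coef) ** 2)
      return n * np.log(rss / n) + X.shape[1] * np.log(n)

  max_p = 6
  bics = {p: ar_bic(y, p, max_p) for p in range(1, max_p + 1)}
  print(min(bics, key=bics.get))               # typically selects order 2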

References

  • McQuarrie, A. D. R., and Tsai, C.-L., 1998. Regression and Time Series Model Selection. World Scientific.
  • Schwarz, G., 1978. "Estimating the dimension of a model". Annals of Statistics 6(2):461–464.
