Multivariate statistics

From Wikipedia, the free encyclopedia

Multivariate statistics is a form of statistics encompassing the simultaneous observation and analysis of more than one outcome variable. The application of multivariate statistics is multivariate analysis.

Multivariate statistics concerns understanding the aims and background of each form of multivariate analysis, and how these forms relate to one another. Applying multivariate statistics to a particular problem may involve several types of univariate and multivariate analysis in order to understand the relationships between variables and their relevance to the problem being studied.

In addition, multivariate statistics is concerned with multivariate probability distributions, in terms of both:

how these can be used to represent the distributions of observed data;
how they can be used as part of statistical inference, particularly where several different quantities are of interest to the same analysis.

Certain types of problems involving multivariate data, for example simple linear regression and multiple regression, are not usually considered special cases of multivariate statistics, because the analysis is dealt with by considering the (univariate) conditional distribution of a single outcome variable given the other variables.
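
As an illustrative sketch (not taken from the article), the distinction can be seen with NumPy on synthetic data: a multiple regression fits a single outcome column, whereas a multivariate regression fits a matrix of outcomes at once. The data, the coefficient matrix B_true and all variable names are assumptions made up for this example, and ordinary least squares is assumed as the fitting method.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100

# Design matrix: an intercept column plus two predictors (synthetic).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

# Hypothetical true coefficients: 3 rows (intercept + 2 slopes), 2 outcomes.
B_true = np.array([[1.0, -1.0],
                   [2.0, 0.5],
                   [0.0, 3.0]])
Y = X @ B_true + 0.1 * rng.normal(size=(n, 2))

# Multiple regression: one outcome variable, several predictors.
b_single, *_ = np.linalg.lstsq(X, Y[:, 0], rcond=None)

# Multivariate regression: a vector of outcomes modelled simultaneously.
B_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Covariance of the residuals across outcomes, which only exists
# when there is more than one outcome variable.
residuals = Y - X @ B_joint
resid_cov = residuals.T @ residuals / (n - X.shape[1])
```

Under ordinary least squares the jointly fitted coefficients coincide, column by column, with the single-outcome fits. What the multivariate treatment adds in this sketch is the covariance between the outcomes' residuals (resid_cov), which is the kind of quantity that joint tests such as MANOVA draw on.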

Types of analysis

There are many different models, each with its own type of analysis:

Multivariate analysis of variance (MANOVA) extends the analysis of variance to cover cases where there is more than one dependent variable to be analyzed simultaneously: see also MANCOVA.
Multivariate regression analysis attempts to determine a formula that can describe how elements in a vector of variables respond simultaneously to changes in others. For linear relations, regression analyses here are based on forms of the general linear model.
Principal components analysis (PCA) creates a new set of orthogonal variables that contain the same information as the original set. It rotates the axes of variation to give a new set of orthogonal axes, ordered so that they summarize decreasing proportions of the variation.
Factor analysis is similar to PCA but allows the user to extract a specified number of synthetic variables, fewer than the original set, leaving the remaining unexplained variation as error. The extracted variables are known as latent variables or factors; each one may be supposed to account for covariation in a group of observed variables.
Canonical correlation analysis finds linear relationships between two sets of variables; it is the generalised (i.e. canonical) version of bivariate correlation.
Redundancy analysis (RDA) is similar to canonical correlation analysis but allows the user to derive a specified number of synthetic variables from one set of (independent) variables that explain as much variance as possible in another (dependent) set. It is a multivariate analogue of regression.
Correspondence analysis (CA), or reciprocal averaging, finds (like PCA) a set of synthetic variables that summarise the original set. The underlying model assumes chi-squared dissimilarities among records (cases).
Canonical (or "constrained") correspondence analysis (CCA) summarises the joint variation in two sets of variables (like redundancy analysis); it combines correspondence analysis and multivariate regression analysis. The underlying model assumes chi-squared dissimilarities among records (cases).
Multidimensional scaling comprises various algorithms to determine a set of synthetic variables that best represent the pairwise distances between records. The original method is principal coordinates analysis (PCoA; based on PCA).
Discriminant analysis, or canonical variate analysis, attempts to establish whether a set of variables can be used to distinguish between two or more groups of cases.
Linear discriminant analysis (LDA) computes a linear predictor from two sets of normally distributed data to allow for classification of new observations.
Clustering systems assign objects into groups (called clusters) so that objects (cases) from the same cluster are more similar to each other than objects from different clusters.
Recursive partitioning creates a decision tree that attempts to correctly classify members of the population based on a dichotomous dependent variable.
Artificial neural networks extend regression and clustering methods to non-linear multivariate models.
Statistical graphics such as tours, parallel coordinate plots, and scatterplot matrices can be used to explore multivariate data.
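
To make one of the methods listed above concrete, the following is a minimal sketch of principal components analysis, assuming the common approach of an eigendecomposition of the sample covariance matrix. The function name pca, the synthetic data and the choice of two components are assumptions for illustration only; statistical packages typically offer more complete implementations.

```python
import numpy as np

def pca(data, n_components):
    """Return component scores, component axes, and explained-variance ratios."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)

    # eigh is intended for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Reorder so the axes summarize decreasing proportions of the variation.
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]

    explained = eigvals / eigvals.sum()
    axes = eigvecs[:, :n_components]
    scores = centered @ axes  # rotate the data onto the new orthogonal axes
    return scores, axes, explained

# Synthetic data: 200 records of 3 variables driven by one hidden variable.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
data = np.hstack([latent, 2 * latent, -latent]) + 0.1 * rng.normal(size=(200, 3))

scores, axes, explained = pca(data, n_components=2)
```

Because the three synthetic variables share a single source of variation, the first entry of explained should be close to 1, and the columns of scores are uncorrelated by construction.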

Important probability distributions

A set of probability distributions used in multivariate analyses plays a role similar to that of the corresponding set of distributions used in univariate analysis when the normal distribution is appropriate to a dataset. These multivariate distributions are:

Multivariate normal distribution
Wishart distribution
Multivariate Student-t distribution

The Inverse-Wishart distribution is important in Bayesian inference, for example in Bayesian multivariate linear regression. Additionally, Hotelling's T-squared distribution is a univariate distribution, generalising Student's t-distribution, that is used in multivariate hypothesis testing.
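
As a hedged illustration of how Hotelling's T-squared statistic generalises the t statistic, the sketch below computes the one-sample form, T² = n (x̄ − μ₀)ᵀ S⁻¹ (x̄ − μ₀), together with its standard rescaling (n − p) / (p (n − 1)) · T², which follows an F distribution with (p, n − p) degrees of freedom when the data are multivariate normal. The function name and the example data are assumptions for illustration; no p-value is computed, so only NumPy is needed.

```python
import numpy as np

def hotelling_t2_one_sample(sample, mu0):
    """One-sample Hotelling T-squared statistic and its F rescaling.

    sample: array of shape (n, p), one row per observation.
    mu0:    hypothesised mean vector of length p.
    """
    sample = np.asarray(sample, dtype=float)
    n, p = sample.shape
    diff = sample.mean(axis=0) - np.asarray(mu0, dtype=float)

    # Unbiased sample covariance; atleast_2d keeps the p = 1 case a matrix.
    cov = np.atleast_2d(np.cov(sample, rowvar=False))

    # Solve S z = diff rather than inverting S explicitly.
    t2 = float(n * diff @ np.linalg.solve(cov, diff))
    f_stat = (n - p) / (p * (n - 1)) * t2
    return t2, f_stat, (p, n - p)

# With a single variable, T-squared reduces to the square of Student's t.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
t2, f_stat, dof = hotelling_t2_one_sample(x, mu0=[0.0])
```

In the single-variable example the sample mean is 3 and the sample variance is 2.5, so the t statistic is 3 / sqrt(2.5 / 5) and its square equals the T² value returned by the function.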

History

Anderson's 1958 textbook, An Introduction to Multivariate Statistical Analysis,^[1] educated a generation of theorists and applied statisticians; Anderson's book emphasizes hypothesis testing via likelihood ratio tests and the properties of power functions: admissibility, unbiasedness and monotonicity.^[2]^[3]

Software & Tools

A large number of software packages and other tools are available for multivariate analysis, including:

High-D
JMP (statistical software)
Minitab
Calc
PSPP
R: http://cran.r-project.org/web/views/Multivariate.html has details on the packages available for multivariate data analysis
SAS (software)
SciPy for Python
SPSS
Stata
STATISTICA
TMVA: Toolkit for Multivariate Data Analysis in ROOT (http://tmva.sourceforge.net)
The Unscrambler
SmartPLS: Partial Least Squares (http://www.smartpls.de)
MATLAB (http://matlab.com)

References

[1] T. W. Anderson (1958) An Introduction to Multivariate Statistical Analysis, New York: Wiley ISBN 0471026409; 2e (1984) ISBN 0471889873; 3e (2003) ISBN 0471360910
[2] Sen, Pranab Kumar; Anderson, T. W.; Arnold, S. F.; Eaton, M. L.; Giri, N. C.; Gnanadesikan, R.; Kendall, M. G.; Kshirsagar, A. M. et al. (June 1986). "Review: Contemporary Textbooks on Multivariate Statistical Analysis: A Panoramic Appraisal and Critique". Journal of the American Statistical Association 81 (394): 560–564. doi:10.2307/2289251. ISSN 0162-1459. JSTOR 2289251. (Pages 560–561)
[3] Schervish, Mark J. (November 1987). "A Review of Multivariate Analysis". Statistical Science 2 (4): 396–413. doi:10.1214/ss/1177013111. ISSN 0883-4237. JSTOR 2245530.


This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.
