Sample mean and sample covariance

From Wikipedia, the free encyclopedia

The sample mean and the sample covariance are statistics computed from a collection of data, where the data are regarded as realizations of random variables.

1 Sample mean and covariance
2 Weighted samples
3 References
4 See also

Sample mean and covariance

Given a random sample $\textstyle \mathbf{x}_{1},\ldots,\mathbf{x}_{N}$ from an $\textstyle n$-dimensional random variable $\textstyle \mathbf{X}$ (i.e., realizations of $\textstyle N$ independent random variables with the same distribution as $\textstyle \mathbf{X}$), the sample mean is

$\mathbf{\bar{x}}=\frac{1}{N}\sum_{k=1}^{N}\mathbf{x}_{k}.$

In coordinates, writing the vectors as columns,

$\mathbf{x}_{k}=\left[ \begin{array} [c]{c}x_{1k}\\ \vdots\\ x_{nk}\end{array} \right] ,\quad\mathbf{\bar{x}}=\left[ \begin{array} [c]{c}\bar{x}_{1}\\ \vdots\\ \bar{x}_{n}\end{array} \right] ,$

the entries of the sample mean are

$\bar{x}_{i}=\frac{1}{N}\sum_{k=1}^{N}x_{ik},\quad i=1,\ldots,n.$

The sample covariance of $\textstyle \mathbf{x}_{1},\ldots,\mathbf{x}_{N}$ is the $\textstyle n$ by $\textstyle n$ matrix $\textstyle \mathbf{Q}=\left[ q_{ij}\right]$ with the entries given by

$q_{ij}=\frac{1}{N-1}\sum_{k=1}^{N}\left( x_{ik}-\bar{x}_{i}\right) \left( x_{jk}-\bar{x}_{j}\right)$
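The two formulas above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names `sample_mean` and `sample_covariance` and the three data vectors are assumptions made for the example, not part of any library. Each sample vector is stored as a list of its $n$ entries.

```python
def sample_mean(xs):
    """Entrywise mean of a list of N vectors, each of length n."""
    N = len(xs)
    n = len(xs[0])
    return [sum(x[i] for x in xs) / N for i in range(n)]


def sample_covariance(xs):
    """n-by-n sample covariance matrix with N - 1 in the denominator."""
    N = len(xs)
    n = len(xs[0])
    m = sample_mean(xs)
    return [
        [sum((x[i] - m[i]) * (x[j] - m[j]) for x in xs) / (N - 1)
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical data: N = 3 observations of an n = 2 dimensional variable.
xs = [[1.0, 2.0], [3.0, 6.0], [5.0, 7.0]]
mean = sample_mean(xs)       # [3.0, 5.0]
Q = sample_covariance(xs)    # [[4.0, 5.0], [5.0, 7.0]]
```

The resulting matrix is symmetric, since $q_{ij}=q_{ji}$ by the definition, and the computation requires $N\geq 2$ because of the division by $N-1$.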

The sample mean and the sample covariance matrix are unbiased estimates of the mean and the covariance matrix of the random variable $\textstyle \mathbf{X}$. The sample covariance matrix has $\textstyle N-1$ in the denominator rather than $\textstyle N$ essentially because the population mean $\textstyle E(\mathbf{X})$ is not known and is replaced by the sample mean $\textstyle \mathbf{\bar{x}}$, which is itself computed from the same data. If the population mean $\textstyle E(\mathbf{X})$ is known, the analogous unbiased estimate uses the population mean in place of the sample mean and has $\textstyle N$ in the denominator:

$q_{ij}=\frac{1}{N}\sum_{k=1}^{N}\left( x_{ik}-E(X_i)\right) \left( x_{jk}-E(X_j)\right).$

This is an example of why in probability and statistics it is essential to distinguish between upper case letters (random variables) and lower case letters (realizations of the random variables).
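The role of the denominator can be checked exactly on a very small case. The sketch below assumes, purely for illustration, a scalar variable that is uniform on {0, 1} (mean 0.5, variance 0.25) and samples of size $N=2$. It enumerates all equally likely samples and averages each estimate over them. The helper names are hypothetical.

```python
from itertools import product

values = [0.0, 1.0]   # assumed distribution: uniform on {0, 1}
true_mean = 0.5       # population mean of that distribution
N = 2                 # sample size

# All possible samples of size N; each occurs with equal probability.
samples = list(product(values, repeat=N))


def var_estimate(sample, center, denom):
    """Sum of squared deviations from `center`, divided by `denom`."""
    return sum((x - center) ** 2 for x in sample) / denom


def expectation(estimator):
    """Exact expected value of an estimator over all samples."""
    return sum(estimator(s) for s in samples) / len(samples)


# Sample mean as center, N - 1 in the denominator.
e_sample_mean_n_minus_1 = expectation(
    lambda s: var_estimate(s, sum(s) / N, N - 1))
# Sample mean as center, N in the denominator.
e_sample_mean_n = expectation(
    lambda s: var_estimate(s, sum(s) / N, N))
# Known population mean as center, N in the denominator.
e_known_mean_n = expectation(
    lambda s: var_estimate(s, true_mean, N))
```

In this toy case the first and third averages equal the true variance 0.25, while the second equals 0.125, so dividing by $N$ is unbiased only when the deviations are taken from the known population mean.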

The maximum likelihood estimate of the covariance

$q_{ij}=\frac{1}{N}\sum_{k=1}^{N}\left( x_{ik}-\bar{x}_{i}\right) \left( x_{jk}-\bar{x}_{j}\right)$

for the Gaussian distribution case has $\textstyle N$ in the denominator as well. The difference between this estimate and the unbiased one diminishes for large $\textstyle N$.
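To make the size of that difference concrete, compare the two estimates that use the sample mean. They share the same sum and differ only in the denominator, so one is a constant multiple of the other. Writing $\textstyle q_{ij}^{\mathrm{ML}}$ for the maximum likelihood estimate (a label introduced here only for this comparison) and $\textstyle q_{ij}$ for the estimate with $\textstyle N-1$ in the denominator,

$q_{ij}^{\mathrm{ML}}=\frac{N-1}{N}\,q_{ij},\quad q_{ij}-q_{ij}^{\mathrm{ML}}=\frac{1}{N}\,q_{ij}.$

For example, with $\textstyle N=10$ the maximum likelihood estimate is 0.9 times the unbiased one, and with $\textstyle N=1000$ it is 0.999 times the unbiased one.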

Weighted samples

In a weighted sample, each vector $\textstyle \mathbf{x}_{k}$ is assigned a weight $\textstyle w_{k}\geq0$. Without loss of generality, assume that the weights are normalized:

$\sum_{k=1}^{N}w_{k}=1.$

(If they are not, divide the weights by their sum.) Then the weighted mean $\textstyle \mathbf{\bar{x}}$ and the weighted covariance matrix $\textstyle \mathbf{Q}=\left[ q_{ij}\right]$ are given by