Mixture model

From Wikipedia, the free encyclopedia

In mathematical statistics, the term mixture model has two different meanings.

First definition

A mixture model is a model in which the independent variables are measured as fractions of a total. For example, suppose researchers are trying to find the optimal mixture of ingredients for a fruit punch consisting of grape juice, mango juice, and pineapple juice. A mixture model is suitable here because the results of the taste tests will not depend on the amount of ingredients used to make the batch but rather on the fraction of each ingredient present in the punch. The sum of the mixture components is always 100%, and a mixture model takes this restriction into account.
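A minimal sketch of this idea, with made-up ingredient fractions and taste scores and a simple first-order blending model fitted by least squares (all names and numbers below are purely illustrative):

    import numpy as np

    # Hypothetical taste-test data: each row gives the fractions of grape, mango,
    # and pineapple juice in one batch (the fractions always sum to 1), and y is
    # the average taste score for that blend.
    X = np.array([
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.5, 0.5, 0.0],
        [0.5, 0.0, 0.5],
        [0.0, 0.5, 0.5],
    ])
    y = np.array([6.0, 7.5, 5.0, 8.0, 6.5, 7.0])

    # Because the predictors sum to 1, the model is fitted without a separate
    # intercept; each coefficient is then the predicted score of a pure ingredient.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(dict(zip(["grape", "mango", "pineapple"], np.round(beta, 2))))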

Second definition

A mixture model can also be a formalism for modeling a probability density function as a sum of parameterized functions. In mathematical terms,

p_{X}(x) = \sum_{k = 1}^{K} a_{k} h(x|\lambda_{k})

where p_X(x) is the modeled probability density function, K is the number of components in the mixture model, and a_k is the mixture proportion of component k. By definition, 0 < a_k < 1 for all k = 1,\dots,K, and a_1 + \cdots + a_K = 1.

h(x | λ_k) is a probability distribution parameterized by λ_k.

Mixture models are often used when we know the form of h(x|λ) and we can sample from p_X(x), but we would like to determine the a_k and λ_k values. Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.
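As a concrete illustration (with made-up Gaussian components and proportions), the following sketch evaluates such a mixture density and samples from it:

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative two-component mixture in which each h(x | lambda_k) is a
    # Gaussian; the proportions a_k and parameters below are made up.
    a = np.array([0.3, 0.7])                 # mixture proportions, sum to 1
    mu = np.array([-1.0, 2.0])
    sigma = np.array([0.5, 1.0])

    def gaussian_pdf(x, m, s):
        return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

    def mixture_pdf(x):
        # p_X(x) = sum_k a_k h(x | lambda_k)
        return sum(a[k] * gaussian_pdf(x, mu[k], sigma[k]) for k in range(len(a)))

    def sample_mixture(n):
        # Sampling from p_X(x): choose a component with probability a_k,
        # then draw from that component.
        ks = rng.choice(len(a), size=n, p=a)
        return rng.normal(mu[ks], sigma[ks])

    x = sample_mixture(1000)
    print(mixture_pdf(0.0), x.mean())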

Common approaches for estimation in mixture models

It's common to think of mixture modeling (under the second definition) as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.

Expectation maximization

The expectation-maximization (EM) algorithm is one way to compute the missing memberships of data points in our chosen distribution model. It is an iterative procedure, in which we start with initial parameters for our model distribution (the a_k's and λ_k's of the model above). The estimation then alternates between two steps: the expectation step and the maximization step.

The expectation step

With initial guesses for the parameters of our mixture model, we compute the "partial membership" of each data point in each constituent distribution. This is done by calculating expectation values for the membership variables of each data point. A simple example will provide some clarity: we have a collection of data points x_i that can be modeled as coming from a sum of two Gaussian distributions. The probability expression for our model is

P(x_{i}) = (1 - f) \mathcal{N}(x_{i};\mu_{1},\sigma) + f \mathcal{N}(x_{i};\mu_{2},\sigma)

where f is the mixing coefficient in [0,1] (a_{k=2} = f and a_{k=1} = 1 − f), and we assume σ is known and constant. For each of our data points x_i, we can compute a membership value for each of the two Gaussians as follows:

y_{1,i} = \frac{(1 - f) \mathcal{N}(x_{i};\mu_{1},\sigma)}{(1 - f) \mathcal{N}(x_{i};\mu_{1},\sigma) + f \mathcal{N}(x_{i};\mu_{2},\sigma)}

and similarly for y_{2,i}.

In the case of a Gaussian mixture model,

\mathcal{N}(x;\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}
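A minimal sketch of this step for the two-Gaussian example (x is a NumPy array of the data points; the function and variable names are just for illustration):

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        # N(x; mu, sigma) as defined above
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    def expectation_step(x, f, mu1, mu2, sigma):
        # Partial membership of each data point in each of the two Gaussians.
        p1 = (1 - f) * gaussian_pdf(x, mu1, sigma)
        p2 = f * gaussian_pdf(x, mu2, sigma)
        y1 = p1 / (p1 + p2)   # membership in the first Gaussian
        y2 = p2 / (p1 + p2)   # membership in the second Gaussian (y1 + y2 = 1)
        return y1, y2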

The maximization step

With these expectation values for group membership in hand, we can recompute plug-in estimates of the distribution parameters. For the mixing coefficient f, this is simply the fractional membership of all data points in the second Gaussian:

f = \frac{\sum_{i} y_{2,i}}{N}

where N is the total number of data points. For μ_1,

\mu_{1} = \frac{\sum_{i} y_{1,i}x_{i}}{\sum_{i} y_{1,i}}

With new estimates for f and the μ's, we return to the expectation step to recompute the membership values. The procedure is repeated until the change in the mixture model parameters falls below some chosen tolerance, i.e., until they have effectively converged.
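Continuing the sketch from the expectation step above (again with illustrative names, and a tolerance-based stopping rule standing in for "no further change"):

    def maximization_step(x, y1, y2):
        # Plug-in estimates from the current partial memberships.
        f = y2.sum() / len(x)            # fractional membership in Gaussian 2
        mu1 = (y1 * x).sum() / y1.sum()  # membership-weighted means
        mu2 = (y2 * x).sum() / y2.sum()
        return f, mu1, mu2

    def em(x, f, mu1, mu2, sigma, tol=1e-8, max_iter=1000):
        # Alternate E and M steps until the parameters stop changing appreciably.
        for _ in range(max_iter):
            y1, y2 = expectation_step(x, f, mu1, mu2, sigma)
            new_f, new_mu1, new_mu2 = maximization_step(x, y1, y2)
            if max(abs(new_f - f), abs(new_mu1 - mu1), abs(new_mu2 - mu2)) < tol:
                return new_f, new_mu1, new_mu2
            f, mu1, mu2 = new_f, new_mu1, new_mu2
        return f, mu1, mu2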

Markov chain Monte Carlo

As an alternative to the EM algorithm, we can use posterior sampling, as indicated by Bayes' theorem, to deduce the parameters of our mixture model. Once again we regard this as an incomplete-data problem in which the membership of data points is the missing data. We use a method called Gibbs sampling, which is again a two-step iterative procedure.

We'll use the example from the previous section to demonstrate how the method works. We start again with initial guessed parameters for our mixture model. Instead of computing partial memberships for each constituent distribution, we draw a membership value for each data point from a Bernoulli distribution (that is, the point is assigned to either the first or the second Gaussian). The Bernoulli parameter θ is determined for each data point from the relative weight of the two constituent densities at that point. Draws from this distribution generate a hard membership assignment for each data point. We can then use plug-in estimators, as in the M step of EM, to generate a new set of mixture model parameters, and return to the membership draw step.
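A simplified sketch of one such iteration for the two-Gaussian example (it reuses the gaussian_pdf helper sketched above; a full Bayesian treatment would also draw the parameters from their posteriors rather than using plug-in estimates):

    import numpy as np

    rng = np.random.default_rng(0)

    def gibbs_step(x, f, mu1, mu2, sigma):
        # Draw a hard membership z_i for each point from a Bernoulli distribution
        # whose parameter theta_i is the posterior weight of the second Gaussian.
        p1 = (1 - f) * gaussian_pdf(x, mu1, sigma)
        p2 = f * gaussian_pdf(x, mu2, sigma)
        theta = p2 / (p1 + p2)
        z = rng.random(len(x)) < theta   # True -> assigned to the second Gaussian
        # Plug-in estimates from the sampled memberships, as in the M step of EM.
        f = z.mean()
        mu1 = x[~z].mean() if (~z).any() else mu1
        mu2 = x[z].mean() if z.any() else mu2
        return f, mu1, mu2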

Spectral method

Some problems in mixture model estimation can be solved using spectral techniques. These are particularly useful when the data points x_i are points in a high-dimensional Euclidean space and the hidden distributions are known to be log-concave (such as the Gaussian or exponential distribution).

Spectral methods of learning mixture models are based on the singular value decomposition of a matrix that contains the data points. The idea is to consider the top k singular vectors, where k is the number of distributions we are trying to learn. Projecting each data point onto the linear subspace spanned by those vectors groups points originating from the same distribution close together, while points from different distributions stay far apart.
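A minimal sketch of this projection step (the centering and the suggested follow-up clustering are assumptions for illustration, not part of any particular published algorithm):

    import numpy as np

    def spectral_embed(X, k):
        # X: (n, d) matrix whose rows are the data points; k: number of components.
        # Project each point onto the span of the top-k right singular vectors.
        Xc = X - X.mean(axis=0)          # centering is an assumed preprocessing step
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T             # (n, k) embedding

    # Points from the same component tend to land close together in this
    # k-dimensional subspace, so a simple clustering of the embedded points
    # (e.g. k-means) can recover the memberships.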

One distinctive feature of the spectral method is that it allows one to prove that if the component distributions satisfy a certain separation condition (e.g., they are not too close together), then the estimated mixture will be very close to the true one with high probability.

Other methods

Other methods that guarantee accurate estimation have emerged in recent years. Some of them can even provably learn mixtures of heavy-tailed distributions, including those with infinite moments (see the papers listed below). In this setting, EM-based methods would not work, since the expectation step would diverge due to the presence of outliers.

Further reading

Books on Mixture Models

  1. Titterington, D., Smith, A. and Makov, U. Statistical Analysis of Finite Mixture Distributions. John Wiley & Sons (1985).
  2. McLachlan, G. and Peel, D. Finite Mixture Models. Wiley (2000).
  3. Marin, J.-M., Mengersen, K. and Robert, C.P. "Bayesian modelling and inference on mixtures of distributions". In Handbook of Statistics 25, D. Dey and C.R. Rao (eds). Elsevier Sciences (to appear). Available as PDF.

Recent papers

  1. S. Dasgupta (1999). "Learning Mixtures of Gaussians". Proc. of Symposium on Foundations of Computer Science (FOCS).
  2. S. Dasgupta and L. Schulman (2000). "A Two-Round Variant of EM for Gaussian Mixtures". Conference in Uncertainty in Artificial Intelligence (UAI).
  3. S. Arora and R. Kannan (2001). "Learning Mixtures of Arbitrary Gaussians". Symposium on Theory of Computing (STOC).
  4. S. Vempala and G. Wang (2004). "A spectral algorithm for learning mixture models". Journal of Computer and System Sciences.
  5. T. Batu, S. Guha, and S. Kannan (2004). "Inferring Mixtures of Markov Chains". Conference on Learning Theory (COLT).
  6. R. Kannan, H. Salmasian, and Santosh Vempala (2005). "The Spectral Method for General Mixture Models". Conference on Learning Theory (COLT).
  7. D. Achlioptas and F. McSherry (2005). "On Spectral Learning of Mixtures of Distributions". Conference on Learning Theory (COLT).
  8. A. Dasgupta, J. Kleinberg, J. Hopcroft, and M. Sandler (2005). "On Learning Mixtures of Heavy-Tailed Distributions". Symposium on Theory of Computing (STOC).
  9. J. Feldman, R. O'Donnell and R. Servedio (2005). "Learning Mixtures of Product Distributions over Discrete Domains". Symposium on Theory of Computing (STOC).

External links

  • Mixture modelling page (and the Snob program for Minimum Message Length (MML) applied to finite mixture models), maintained by D.L. Dowe.
  • PyMix - Python Mixture Package, algorithms and data structures for a broad variety of mixture model based data mining applications in Python