Non-negative matrix factorization (NMF) is a group of algorithms in multivariate analysis and linear algebra where a matrix, X, is factorized into (usually) two matrices, W and H: X ≈ WH.
Factorization of matrices is generally non-unique, and a number of different methods of doing so have been developed (e.g. principal component analysis and singular value decomposition) by incorporating different constraints; non-negative matrix factorization differs from these methods in that it enforces the constraint that the factors W and H must be non-negative, i.e., all elements must be equal to or greater than zero.
In chemometrics, non-negative matrix factorization has a long history under the name "self modeling curve resolution".[1] In this framework the vectors in the right matrix are continuous curves rather than discrete vectors. Early work on non-negative matrix factorizations was also performed by a Finnish group of researchers in the mid-1990s under the name positive matrix factorization.[2][3] It became more widely known as non-negative matrix factorization after Lee and Seung investigated its properties and published some simple and useful algorithms for two types of factorizations.[4][5]
Usually the number of columns of W and the number of rows of H in NMF are selected so the product WH will become an approximation to X (it has been suggested that the NMF model should be called nonnegative matrix approximation instead). The full decomposition of X then amounts to the two non-negative matrices W and H as well as a residual U, such that: X = WH + U. The elements of the residual matrix can either be negative or positive.
When W and H are smaller than X they become easier to store and manipulate.
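As a rough illustration of the decomposition X = WH + U and of the storage saving, the following sketch builds a small synthetic example with NumPy. The sizes n, m and k, the random data and the noise level are assumptions chosen only for the demonstration, and W and H here are the generating factors rather than the output of an NMF algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: X is n x m, and the inner dimension k is much smaller.
n, m, k = 100, 80, 5

# Non-negative factors (the 0.1 offset keeps every entry strictly positive).
W = 0.1 + rng.random((n, k))
H = 0.1 + rng.random((k, m))

# Data that is approximately, but not exactly, the product WH.
noise = 0.01 * (rng.random((n, m)) - 0.5)
X = W @ H + noise

# Residual of the approximation; its entries can be negative or positive.
U = X - W @ H

print("entries stored for X:      ", X.size)           # n * m
print("entries stored for W and H:", W.size + H.size)  # k * (n + m)
print("residual has both signs:   ", bool((U < 0).any() and (U > 0).any()))
```

With these sizes the two factors hold k(n + m) numbers in place of the nm numbers of X, which is where the saving comes from when k is small.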
There are different types of non-negative matrix factorizations. The different types arise from using different cost functions for measuring the divergence between X and WH and possibly by regularization of the W and/or H matrices.[6]
Two simple divergence functions studied by Lee and Seung are the squared error (or Frobenius norm) and an extension of the Kullback-Leibler divergence to positive matrices (the original Kullback-Leibler divergence is defined on probability distributions). Each divergence leads to a different NMF algorithm, usually minimizing the divergence using iterative update rules.
The factorization problem in the squared error version of NMF may be stated as: Given a matrix X, find nonnegative matrices W and H that minimize the function F(W, H) = ‖X - WH‖²_F, where ‖·‖_F denotes the Frobenius norm.
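The two cost functions studied by Lee and Seung can be written out in a few lines. The sketch below is an illustration only: the function names and the small constant eps (added to avoid taking the logarithm of zero) are assumptions, not part of any cited algorithm.

```python
import numpy as np

def squared_error(X, W, H):
    """Squared Frobenius norm of the residual X - WH."""
    return np.linalg.norm(X - W @ H, "fro") ** 2

def generalized_kl(X, W, H, eps=1e-12):
    """Extension of the Kullback-Leibler divergence to non-negative matrices:
    sum over all entries of X * log(X / WH) - X + WH.
    eps is a hypothetical safeguard against log(0) and division by zero."""
    WH = W @ H
    return np.sum(X * np.log((X + eps) / (WH + eps)) - X + WH)

# Small check: both costs vanish when X equals WH and grow otherwise.
rng = np.random.default_rng(0)
W = rng.random((4, 2))
H = rng.random((2, 3))
X = W @ H

print(squared_error(X, W, H), generalized_kl(X, W, H))
print(squared_error(X, W + 1.0, H), generalized_kl(X, W + 1.0, H))
```

Unlike the original Kullback-Leibler divergence, the extended form includes the terms - X + WH, so it remains meaningful when the entries of X and WH do not sum to one.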
Another type of NMF for images is based on the total variation norm.[7]
There are several ways in which W and H may be found: Lee and Seung's updates are usually referred to as the multiplicative update method, while others have suggested gradient descent algorithms, so-called alternating non-negative least squares, and "projected gradient" methods.[8][9]
The currently available algorithms are sub-optimal, as they can only guarantee finding a local minimum, rather than a global minimum, of the cost function. A provably optimal algorithm is unlikely in the near future, as the problem has been shown to generalize the k-means clustering problem, which is known to be computationally difficult (NP-complete).[10] However, as in many other data mining applications, a local minimum may still prove to be useful.
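A minimal sketch of the multiplicative update method for the squared error cost is given below. It follows the commonly stated form of the Lee and Seung rules, in which H is multiplied elementwise by (WᵀX)/(WᵀWH) and W by (XHᵀ)/(WHHᵀ). The function name, the random initialization, the fixed iteration count and the constant eps in the denominators are assumptions made for the example.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a non-negative matrix X (n x m) by W (n x k) times H (k x m)
    using multiplicative updates for the squared error cost."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Strictly positive random start; the updates never change the sign.
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # eps in the denominators is a hypothetical guard against division by zero.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Usage on synthetic data that has an exact non-negative rank-3 factorization.
rng = np.random.default_rng(1)
X = rng.random((20, 3)) @ rng.random((3, 15))
W, H = nmf_multiplicative(X, k=3)
print("relative error:", np.linalg.norm(X - W @ H) / np.linalg.norm(X))
```

Because each step only multiplies by non-negative ratios, no projection onto the non-negative orthant is needed. The result depends on the starting point, so different seeds may end in different local minima.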
In "Learning the parts of objects by non-negative matrix factorization", Lee and Seung proposed NMF mainly for parts-based decomposition of images. The paper compares NMF to vector quantization and principal component analysis, and shows that although the three techniques may be written as factorizations, they implement different constraints and therefore produce different results.
It was later shown that some types of NMF are an instance of a more general probabilistic model called "multinomial PCA".[11] When NMF is obtained by minimizing the Kullback–Leibler divergence, it is in fact equivalent to another instance of multinomial PCA, probabilistic latent semantic analysis,[12] trained by maximum likelihood estimation. That method is commonly used for analyzing and clustering textual data and is also related to the latent class model.
It has been shown[13][14] that NMF is equivalent to a relaxed form of K-means clustering when the least-squares error is used as the NMF objective: the matrix factor W contains the cluster centroids and H contains the cluster membership indicators. This provides a theoretical foundation for using NMF for data clustering.
When the KL divergence is used as the objective function, it has been shown[15] that NMF has a Chi-square interpretation and is equivalent to probabilistic latent semantic analysis.
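The clustering interpretation can be sketched as follows, assuming the columns of X are the data points: each column of W plays the role of a centroid, and each data point is assigned to the feature with the largest coefficient in its column of H. The helper nmf, the toy data and the column normalization (used so that coefficients in different rows of H are comparable) are assumptions made for this illustration, not the procedure of the cited papers.

```python
import numpy as np

def nmf(X, k, n_iter=500, eps=1e-10, seed=0):
    """Multiplicative-update NMF for the least-squares objective (sketch)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data: six points (columns) scattered around two hypothetical prototypes.
rng = np.random.default_rng(0)
a = np.array([1.0, 1.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 1.0, 1.0])
X = np.column_stack([a, a, a, b, b, b]) + 0.05 * rng.random((4, 6))

W, H = nmf(X, k=2)

# Rescale so every centroid has unit length; the product WH is unchanged.
scale = np.linalg.norm(W, axis=0)
W = W / scale
H = H * scale[:, None]

# Hard cluster label per data point: the row of H with the largest weight.
labels = H.argmax(axis=0)
print("centroids (columns of W):\n", np.round(W, 2))
print("cluster labels:", labels)
```

Since H is real-valued rather than a strict 0/1 indicator matrix, the argmax step is what turns the relaxed solution into a hard assignment.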
NMF extends beyond matrices to tensors of arbitrary order.[16][17] This extension may be viewed as a non-negative version of, e.g., the PARAFAC model.
NMF is an instance of nonnegative quadratic programming (NQP), as are many other important problems, including the support vector machine (SVM). However, SVM and NMF are related at a more intimate level than that of NQP, which allows direct application of the solution algorithms developed for either of the two methods to problems in both domains.[18]
The factorization is not unique: a matrix B and its inverse can be used to transform the two factorization matrices by, e.g.,[19] WH = WBB⁻¹H.
If the two new matrices W' = WB and H' = B⁻¹H are non-negative, they form another parametrization of the factorization.
The non-negativity of W' and H' applies at least if B is a non-negative monomial matrix. In this simple case it will just correspond to a scaling and a permutation.
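The non-uniqueness can be checked numerically. In the sketch below, B is a non-negative monomial matrix (exactly one positive entry in every row and column), so both WB and B⁻¹H stay non-negative while their product is unchanged. The particular entries of B and the random factors are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 3))
H = rng.random((3, 5))

# Non-negative monomial matrix: a permutation combined with positive scalings.
B = np.array([[0.0, 2.0, 0.0],
              [0.0, 0.0, 0.5],
              [3.0, 0.0, 0.0]])
B_inv = np.linalg.inv(B)   # the inverse is again a non-negative monomial matrix

W_new = W @ B              # columns of W permuted and rescaled
H_new = B_inv @ H          # rows of H permuted and rescaled inversely

print("same product:      ", np.allclose(W @ H, W_new @ H_new))
print("still non-negative:", bool(W_new.min() >= 0 and H_new.min() >= -1e-12))
```

For a general invertible B the transformed factors may contain negative entries, in which case they are not a valid NMF of the same matrix.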
More control over the non-uniqueness of NMF is obtained with sparsity constraints.[20]
NMF can be used for text mining applications. In this process, a document-term matrix is constructed with the weights of various terms (typically weighted word frequency information) from a set of documents. This matrix is factored into a term-feature and a feature-document matrix. The features are derived from the contents of the documents, and the feature-document matrix describes data clusters of related documents.
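A toy version of this pipeline is sketched below. The four documents, the raw term counts used as weights, the choice of two features and the helper nmf are all assumptions made for the illustration; real applications typically use weighted frequencies and far larger vocabularies.

```python
import numpy as np

def nmf(X, k, n_iter=500, eps=1e-10, seed=0):
    """Multiplicative-update NMF for the squared error cost (sketch)."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical mini-corpus.
docs = [
    "matrix factorization algorithm",
    "matrix algorithm convergence",
    "email spam filter",
    "spam email message",
]

# Matrix of raw counts: one row per term, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}
X = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        X[index[word], j] += 1.0

# W is the term-feature matrix, H the feature-document matrix.
W, H = nmf(X, k=2)

for f in range(W.shape[1]):
    top = np.argsort(W[:, f])[::-1][:3]
    print(f"feature {f}: top terms = {[vocab[i] for i in top]}")
print("dominant feature per document:", H.argmax(axis=0))
```

Reading off the largest entries in each column of W gives the terms that characterize a feature, and the largest entry in each column of H indicates which feature a document is most strongly associated with.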
One specific application used hierarchical NMF on a small subset of scientific abstracts from PubMed.[21] Another research group clustered parts of the Enron email dataset[22] with 65,033 messages and 91,133 terms into 50 clusters.[23] NMF has also been applied to citations data, with one example clustering Wikipedia articles and scientific journals based on the outbound scientific citations in Wikipedia.[24]
NMF is also used to analyze spectral data; one such use is in the classification of space objects and debris.[25]
NMF is applied in scalable Internet distance (round-trip time) prediction. For a network with N hosts, with the help of NMF, the distances of all the N² end-to-end links can be predicted after conducting only O(N) measurements. This kind of method was first introduced in the Internet Distance Estimation Service (IDES).[26] Afterwards, the Phoenix network coordinate system[27] was proposed as a fully decentralized approach. It achieves better overall prediction accuracy by introducing the concept of weight.
Speech denoising has been a long-standing problem in the audio processing community. Many denoising algorithms exist for the case where the noise is stationary. For example, the Wiener filter is suitable for additive Gaussian noise. However, if the noise is non-stationary, the classical denoising algorithms usually perform poorly because the statistical information of the non-stationary noise is difficult to estimate. Schmidt[28] uses NMF to denoise speech under non-stationary noise, an approach that is completely different from the classical statistical approaches. The key idea is that a clean speech signal can be sparsely represented by a speech dictionary, but non-stationary noise cannot. Similarly, non-stationary noise can be sparsely represented by a noise dictionary, but speech cannot.
The algorithm for NMF denoising goes as follows. Two dictionaries, one for speech and one for noise, need to be trained offline. Once a noisy speech signal is given, the magnitude of its short-time Fourier transform is calculated first. Second, the magnitude is separated into two parts via NMF: one part that can be sparsely represented by the speech dictionary, and another part that can be sparsely represented by the noise dictionary. Third, the part that is represented by the speech dictionary is taken as the estimated clean speech.
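The separation step can be sketched as follows, under strong simplifying assumptions: the two dictionaries are taken as given (here they are random placeholders rather than trained dictionaries), the input V stands for a magnitude spectrogram that has already been computed, the sparsity penalty is omitted, and the cost is the plain squared error. The function name separate_speech and all parameter names are hypothetical.

```python
import numpy as np

def separate_speech(V, W_speech, W_noise, n_iter=200, eps=1e-10, seed=0):
    """Split a magnitude spectrogram V (frequencies x frames) into a speech
    part and a noise part using fixed, pre-trained dictionaries.
    Only the activations H are updated; the dictionaries stay unchanged."""
    W = np.hstack([W_speech, W_noise])          # joint dictionary
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        # Multiplicative update for H with W held fixed (no sparsity term).
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    k_speech = W_speech.shape[1]
    speech_est = W_speech @ H[:k_speech]        # part explained by speech atoms
    noise_est = W_noise @ H[k_speech:]          # part explained by noise atoms
    return speech_est, noise_est

# Synthetic stand-in for real data: 16 frequency bins, 10 time frames.
rng = np.random.default_rng(1)
W_speech = rng.random((16, 4))   # placeholder for a trained speech dictionary
W_noise = rng.random((16, 3))    # placeholder for a trained noise dictionary
V = W_speech @ rng.random((4, 10)) + W_noise @ rng.random((3, 10))

speech_est, noise_est = separate_speech(V, W_speech, W_noise)
print("relative fit error:",
      np.linalg.norm(V - speech_est - noise_est) / np.linalg.norm(V))
```

To obtain an audio signal, the estimated speech magnitude would still have to be combined with a phase, for instance that of the noisy input, and inverted; that step is outside the scope of this sketch.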
Current research in nonnegative matrix factorization includes, but is not limited to:
(1) Algorithmic: searching for global minima of the factors and factor initialization.[29]
(2) Scalability: how to factorize million-by-billion matrices, which are commonplace in Web-scale data mining, e.g., see Distributed Nonnegative Matrix Factorization (DNMF)[30]
(3) Online: how to update the factorization when new data comes in without recomputing from scratch.