Matrix completion

Matrix completion of a partially revealed 5 by 5 matrix with rank-1. Left: observed incomplete matrix; Right: matrix completion result.

Matrix Completion is the task of filling in the missing entries of a partially observed matrix. A wide range of datasets are naturally organized in matrix form. One example is the movie-ratings matrix, as appears in the Netflix problem: Given a ratings matrix in which each entry (i,j) represents the rating of movie j by customer i if customer i has watched movie j and is otherwise missing, we would like to predict the remaining entries in order to make good recommendations to customers on what to watch next. Another example is the term-document matrix: The frequencies of words used in a collection of documents can be represented as a matrix, where each entry corresponds to the number of times the associated term appears in the indicated document.

Without any restrictions on the number of degrees of freedom in the completed matrix this problem is underdetermined since the hidden entries could be assigned arbitrary values. Thus matrix completion often seeks to find the lowest rank matrix or, if the rank of the completed matrix is known, a matrix of rank r that matches the known entries. The illustration shows that a partially revealed rank-1 matrix (on the left) can be completed with zero-error (on the right) since all the rows with missing entries should be the same as the third row. In the case of the Netflix problem the ratings matrix is expected to be low-rank since user preferences can often be described by a few factors, such as the movie genre and time of release. Other applications include computer vision, where missing pixels in images need to be reconstructed, detecting the global positioning of sensors in a network from partial distance information, and multiclass learning. The matrix completion problem is in general NP-hard, but there are tractable algorithms that achieve exact reconstruction with high probability.

In statistical learning point of view, the matrix completion problem is an application of matrix regularization which is a generalization of vector regularization. For example, in the low-rank matrix completion problem one may apply the regularization penalty taking the form of a nuclear norm R(X) = \lambda\|X\|_*

Low rank matrix completion

One of the variants of the matrix completion problem is to find the lowest rank matrix X which matches the matrix M, which we wish to recover, for all entries in the set E of observed entries.The mathematical formulation of this problem is as follows:

\begin{align}
& \underset{X}{\text{min}} & \text{rank} (X) \\
& \text{subject to} & X_{ij} = M_{ij} & \;\; \forall i,j \in E\\
\end{align}

Candes and Recht[1] proved that with assumptions on the sampling of the observed entries and sufficiently many sampled entries this problem has a unique solution with high probability.

An equivalent formulation, given that the matrix M to be recovered is known to be of rank r, is to solve for X where X_{ij} = M_{ij} \;\; \forall i,j \in E

Assumptions

A number of assumptions on the sampling of the observed entries and the number of sampled entries are frequently made to simplify the analysis and to ensure the problem is not underdetermined.

Uniform sampling of observed entries

To make the analysis tractable, it is often assumed that the set E of observed entries and fixed cardinality is sampled uniformly at random from the collection of all subsets of entries of cardinality |E|. To further simplify the analysis, it is instead assumed that E is constructed by Bernoulli sampling, i.e. that each entry is observed with probability  p . If p is set to \frac{N}{mn} where N is the desired expected cardinality of E, and m,\;n are the dimensions of the matrix (let m < n without loss of generality), |E| is within O(n \log n) of N with high probability, thus Bernoulli sampling is a good approximation for uniform sampling.[1] Another simplification is to assume that entries are sampled independently and with replacement.[2]

Lower bound on number of observed entries

Suppose the m by n matrix M (with m < n) we are trying to recover has rank r. There is an information theoretic lower bound on how many entries must be observed before M can be uniquely reconstructed. Firstly, the number of degrees of freedom of a matrix of rank r is 2nr - r^2. This can be shown by looking at the Singular Value Decomposition of the matrix and counting the degrees of freedom. Then at least 2nr - r^2 entries must be observed for matrix completion to have a unique solution.

Secondly, there must be at least one observed entry per row and column of M. The Singular Value Decomposition of M is given by U\Sigma V^\dagger. If row i is unobserved, it is easy to see the i^{\text{th}} right singular vector of M, v_i, can be changed to some arbitrary value and still yield a matrix matching M over the set of observed entries. Similarly, if column j is unobserved, the j^{\text{th}} left singular vector of M, u_i can be arbitrary. If we assume Bernoulli sampling of the set of observed entries, the Coupon collector effect implies that entries on the order of O(n\log n) must be observed to ensure that there is an observation from each row and column with high probability.[3]

Combining the necessary conditions and assuming that r \ll m, n (a valid assumption for many practical applications), the lower bound on the number of observed entries required to prevent the problem of matrix completion from being underdetermined is on the order of nr\log n .

Incoherence

The concept of incoherence arose in compressed sensing. It is introduced in the context of matrix completion to ensure the singular vectors of M are not too "sparse" in the sense that all coordinates of each singular vector are of comparable magnitude instead of just a few coordinates having significantly larger magnitudes.[4] The standard basis vectors are then undesirable as singular vectors, and the vector  \frac{1}{\sqrt{n}} \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} in \mathbb{R}^n is desirable. As an example of what could go wrong if the singular vectors are sufficiently "sparse", consider the m by n matrix \begin{bmatrix} 1 & 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & 0 & 0 & 0 \end{bmatrix} with singular value decomposition I_m \begin{bmatrix} 1 & 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & 0 & 0 & 0 \end{bmatrix} I_n. Almost all the entries of M must be sampled before it can be reconstructed.

Candes and Recht[1] define the coherence of a matrix U with column space an r-dimensional subspace of \mathbb{R}^n as \mu (U) = \frac{n}{r} \max_{i < n} \|P_U e_i\|^2 , where P_U is the orthogonal projection onto  U . Incoherence then asserts that given the singular value decomposition U\Sigma V^\dagger of the m by n matrix M,

  1. \mu (U), \; \mu (V) \leq \mu_0
  2. The entries of  \sum_k u_k v_k ^\dagger have magnitudes upper bounded by  \mu_1 \sqrt{\frac{r}{mn}}

for some \mu_0, \; \mu_1.

Low rank matrix completion with Noise

In real world application, one often observe only a few entries corrupted at least by a small amount of noise. For example, in the Netflix problem, the ratings are uncertain. Candes and Plan [5] showed that it is possible to fill in the many missing entries of large low-rank matrices from just a few noisy samples by nuclear norm minimization. The noisy model assumes that we observe

 Y_{ij} = M_{ij} + Z_{ij}, (i,j) \in \Omega,

where {Z_{ij}:(i,j) \in \Omega} is a noise term. Note that the noise can be either stochastic or deterministic. Alternatively the model can be expressed as

 P_\Omega(Y) = P_\Omega(M) + P_\Omega(Z),

where Z is an n \times n matrix with entries Z_{ij} for (i,j) \in \Omega assuming that \|P_\Omega(Z)\|_F\leq\delta for some \delta > 0 .To recover the incomplete matrix, we try to solve the following optimization problem:


\begin{align}
& \underset{X}{\text{min}} & \|X\|_* \\
& \text{subject to} & \|P_\Omega(X-Y)\|_F \leq \delta\\
\end{align}

Among all matrices consistent with the data, find the one with minimum nuclear norm. Candes and Plan [5] have shown that this reconstruction is accurate. They have proved that when perfect noiseless recovery occurs, then matrix completion is stable vis a vis perturbations. The error is proportional to the noise level \delta. Therefore when the noise level is small, the error is small. Here the matrix completion problem does not obey the restricted isometry property (RIP). For matrices, the RIP would assume that the sampling operator obeys


(1-\delta)\|X\|^2_F \leq \frac{1}{p}\|P_\Omega(X)\|^2_F \leq (1+\delta)\|X\|^2_F

for all matrices X with sufficiently small rank and \delta<1 sufficiently small. The methods are also applicable to sparse signal recovery problems in which the RIP does not hold.

High rank matrix completion

The high rank matrix completion in general is NP-Hard. However, with certain assumptions, some incomplete high rank even full rank matrix can be completed.

Eriksson, Balzano and Nowak [6] have considered the problem of completing a matrix with the assumption that the columns of the matrix belong to a union of multiple low-rank subspaces. Since the columns belong to a union of subspaces, the problem may be viewed as a missing-data version of the subspace clustering problem. Let X be an n \times N matrix whose (complete) columns lie in a union of at most k subspaces, each of rank \leq r < n, and assume N \gg kn. Eriksson, Balzano and Nowak [6] showed that under mild assumptions each column of X can be perfectly recovered with high probability from an incomplete version so long as at least CrN\log^2(n) entries of X are observed uniformly at random, with C>1 a constant depending on the usual incoherence conditions, the geometrical arrangement of subspaces, and the distribution of columns over the subspaces.

The algorithm involves several steps : (1) local neighborhoods; (2) local subspaces; (3) subspace refinement; (4) full matrix completion. This method can be applied to Internet distance matrix completion and topology identification.

Algorithms

Convex Relaxation

The rank minimization problem is NP-hard. One approach, proposed by Candes and Recht, is to form a convex relaxation of the problem and minimize the nuclear norm \|M\|_* (which gives the sum of the singular values of M instead of \text{rank}(M) (which counts the number of non zero singular values of M).[1] This is analogous to minimizing the L1-norm rather than the L0-norm for vectors. The convex relaxation can be solved using semidefinite programming (SDP) by noticing that the optimization problem is equivalent to

\begin{align}
& \underset{W_1, W_2}{\text{min}} & & \text{trace} (W_1) + \text{trace} (W_2) \\
& \text{subject to} & & X_{ij} = M_{ij}  \;\; \forall i,j \in E\\
& & &  \begin{bmatrix} W_1 & X \\ X^\dagger & W_2 \end{bmatrix} \succeq 0 
\end{align}

The complexity of using SDP to solve the convex relaxation is O(\text{max}(m,n)^4). State of the art solvers like SDP3 can only handle matrices of size up to 100 by 100 [7] An alternative first order method that approximately solves the convex relaxation is the Singular Value Thresholding Algorithm introduced by Cai, Candes and Shen.[7]

Candes and Recht show, using the study of random variables on Banach spaces, that if the number of observed entries is on the order of \max{\{\mu_1^2, \sqrt{\mu_0}\mu_1, \mu_0 n^{0.25}\}}nr \log n (assume without loss of generality m < n), the rank minimization problem has a unique solution which also happens to be the solution of its convex relaxation with probability 1-\frac{c}{n^3} for some constant c. If the rank of M is small ( r \leq \frac{n^{0.2}}{\mu_0}), the size of the set of observations reduces to the order of \mu_0 n^{1.2} r \log n. These results are near optimal, since the minimum number of entries that must be observed for the matrix completion problem to not be underdetermined is on the order of nr \log n.

This result has been improved by Candes and Tao.[3] They achieve bounds that differ from the optimal bounds only by polylogarithmic factors by strengthening the assumptions. Instead of the incoherence property, they assume the strong incoherence property with parameter \mu_3. This property states that:

  1. | \langle e_a, P_U e_{a'} \rangle - \frac{r}{m} 1_{a = a'} | \leq \mu_3 \frac{\sqrt{r}}{m}  for a, a' \leq m and | \langle e_b, P_U e_{b'} \rangle - \frac{r}{n} 1_{b = b'} | \leq \mu_3 \frac{\sqrt{r}}{n}  for b, b' \leq n
  2. The entries of \sum_i u_i v_i^\dagger are bounded in magnitude by \mu_3 \sqrt{\frac{r}{mn}}

Intuitively, strong incoherence of a matrix U asserts that the orthogonal projections of standard basis vectors to U has magnitudes that have high likelihood if the singular vectors were distributed randomly.[4]

Candes and Tao find that when r is O(1) and the number of observed entries is on the order of \mu_3^4 n(\log n)^2, the rank minimization problem has a unique solution which also happens to be the solution of its convex relaxation with probability 1-\frac{c}{n^3} for some constant c. For arbitrary r, the number of observed entries sufficient for this assertion hold true is on the order of \mu_3^2 nr (\log n)^6

Gradient Descent [8]

Keshavan, Montanari and Oh[8] consider a variant of matrix completion where the rank of the m by n matrix M, which is to be recovered, is known to be r. They assume Bernoulli sampling of entries, constant aspect ratio \frac{m}{n}, bounded magnitude of entries of M (let the upper bound be M_{\text{max}}), and constant condition number \frac{\sigma_1}{\sigma_r} (where \sigma_1 and \sigma_r are the largest and smallest singular values of M respectively). Further, they assume the two incoherence conditions are satisfied with \mu_0 and \mu_1 \frac{\sigma_1}{\sigma_r} where \mu_0 and \mu_1 are constants. Let M^E be a matrix that matches M on the set E of observed entries and is 0 elsewhere. They then propose the following algorithm:

  1. Trim M^E by removing all observations from columns with degree larger than \frac{2|E|}{n} by setting the entries in the columns to 0. Similarly remove all observations from rows with degree larger than \frac{2|E|}{n}.
  2. Project M^E onto its first r principal components. Call the resulting matrix \text{Tr}(M^E).
  3. Solve  \min_{X,Y} \min_{S \in \mathbb{R}^{r \times r}} \frac{1}{2} \sum_{i,j \in E} (M_{ij} - (XSY^\dagger)_{ij})^2 + \rho G(X,Y) where G(X,Y) is some regularization function by gradient descent with line search. Initialize X,\;Y at X_0,\;Y_0 where \text{Tr}(M_E) = X_0 S_0 Y_0^\dagger. Set G(X,Y) as some function forcing X, \; Y to remain incoherent throughout gradient descent if X_0 and Y_0 are incoherent.
  4. Return the matrix XSY^\dagger.

Steps 1 and 2 of the algorithm yield a matrix \text{Tr}(M^E) very close to the true matrix M (as measured by the root mean square error (RMSE) with high probability. In particular, with probability 1-\frac{1}{n^3}, \frac{1}{mnM_{\text{max}}^2} \| M - \text{Tr}(M^E) \|_F^2 \leq C \frac{r}{m|E|} \sqrt{\frac{m}{n}} for some constant C. \| \cdot \|_F denotes the Frobenius norm. Note that the full suite of assumptions is not needed for this result to hold. The incoherence condition, for example, only comes into play in exact reconstruction. Finally, although trimming may seem counter intuitive as it involves throwing out information, it ensures projecting M^E onto its first r principal components gives more information about the underlying matrix M than about the observed entries.

In Step 3, the space of candidate matrices X,\;Y can be reduced by noticing that the inner minimization problem has the same solution for (X,Y) as for (XQ,YR) where Q and R are orthonormal r by r matrices. Then gradient descent can be performed over the cross product of two Grassman manifolds. If r \ll m,\;n and the observed entry set is in the order of nr\log n, the matrix returned by Step 3 is exactly M. Then the algorithm is order optimal, since we know that for the matrix completion problem to not be underdetermined the number of entries must be in the order of nr\log n.

Alternating Least Squares Minimization

Alternating minimization represents a widely applicable and empirically successful approach for finding low-rank matrices that best fit the given data. For example, for the problem of low-rank matrix completion, this method is believed to be one of the most accurate and efficient, and formed a major component of the winning entry in the Netflix problem. In the alternating minimization approach, the low-rank target matrix is written in a bilinear form:

X= UV^\dagger;

the algorithm then alternates between finding the best U and the best V. While the overall problem is non-convex, each sub-problem is typically convex and can be solved efficiently. Jain, Netrapalli and Sanghavi [9] have given one of the first guarantees for performance of alternating minimization for both matrix completion and matrix sensing.

The alternating minimization algorithm can be viewed as an approximate way to solve the following non-convex problem:


\begin{align}
& \underset{U, V \in \mathbb{R}^{n\times k}}{\text{min}} & \|P_\Omega(UV^\dagger)-P_\Omega(M)\|^2_F \\
\end{align}

The AltMinComplete Algorithm proposed by Jain, Netrapalli and Sanghavi is listed here:[9]

  1. Input: observed set \Omega, values P_\Omega(M)
  2. Partition \Omega into 2T+1 subsets \Omega_0,\cdots,\Omega_{2T} with each element of \Omega belonging to one of the \Omega_t with equal probability (sampling with replacement)
  3. \hat{U}^0 = SVD(\frac{1}{p}P_{\Omega_0}(M), k) i.e., top-k left singular vectors of \frac{1}{p}P_{\Omega_0}(M)
  4. Clipping: Set all elements of \hat{U}^0 that have magnitude greater than \frac{2\mu\sqrt{k}}{\sqrt{n}} to zero and orthonormalize the columns of \hat{U}^0
  5. for t = 0, \cdots, T-1 do
  6. \quad \hat{V}^{t+1}\leftarrow \text{argmin}_{V\in \mathbb{R}^{n\times k}}\|P_{\Omega_{t+1}}(\hat{U}V^\dagger-M)\|^2_F
  7. \quad \hat{U}^{t+1}\leftarrow \text{argmin}_{U\in \mathbb{R}^{m\times k}}\|P_{\Omega_{T+t+1}}(U(\hat{V}^{t+1})^\dagger-M)\|^2_F
  8. end for
  9. Return X= \hat{U}^T(\hat{V}^T)^\dagger

They showed that by observing |\Omega| = O((\frac{\sigma_1^*}{\sigma_k^*})^6k^7\log n \log (k \|M\|_F/\epsilon)) random entries of an incoherent matrix M, AltMinComplete algorithm can recover M in O(\log(1/\epsilon)) steps. In terms of sample complexity (|\Omega|), theoretically, Alternating Minimization may require a bigger \Omega than Convex Relaxation. However empirically it seems not the case which implies that the sample complexity bounds can be further tightened. In terms of time complexity, they showed that AltMinComplete needs time

O(|\Omega|k^2\log(1/\epsilon)).

It is worth noting that, although convex relaxation based methods have rigorous analysis, alternating minimization based algorithms are more successful in practice.

Applications

Several applications of matrix completions is summarized by Candes and Plan[5] as follows:

Collaborative filtering

Collaborative filtering is the task of making automatic predictions about the interests of a user by collecting taste information from many users. Companies like Apple, Amazon, Barnes and Noble and Netflix are trying predict their user preferences from partial complete knowledge. In these kind of matrix completion problem, the unknown full matrix is often considered low rank because only a few factors typically contribute to an individual's tastes or preference.

System identification

In control, one would like to fit a discrete-time linear time-invariant state-space model

\begin{align}
x(t+1)&=Ax(t)+Bu(t)\\
y(t)&=Cx(t)+Du(t)
\end{align}

to a sequence of inputs u(t) \in \mathbb{R}^m and outputs y(t) \in \mathbb{R}^p, t = 0, \ldots, N. The vector x(t) \in \mathbb{R}^n is the state of the system at time t and n is the order of the system model. From the input/output pair, one would like to recover the matrices A,B,C,D and the initial state x(0). This problem can also be view as a low-rank matrix completion problem.

Global positioning

The global positioning problem emerges naturally in sensor networks. The problem is to recover the global positioning of points in Euclidean space from a local or partial set of pairwise distances. Thus it is a matrix completion problem with rank two if the sensors are located in a 2-D plane and three if they are in a 3-D space.

See also

References

  1. 1 2 3 4 E. J. Candès and B. Recht, "Exact matrix completion via convex optimization", Found. of Comput. Math., 2008
  2. B. Recht, "A simpler approach to matrix completion", Journal of Machine Learning Research, Vol 12, pp. 3413--3430, 2011.
  3. 1 2 E. J. Candès and T. Tao, "The power of convex relaxation: Near-optimal matrix completion",arXiv:0903.1476, 2009.
  4. 1 2 http://terrytao.wordpress.com/2009/03/10/the-power-of-convex-relaxation-near-optimal-matrix-completion/, T. Tao, "The power of convex relaxation: Near-optimal matrix completion", What's New, 2009
  5. 1 2 3 E. J. Candès and Yaniv Plan, "Matrix Completion with Noise", arxiv:0903.3131, 2009.
  6. 1 2 B. Eriksson, L. Balzano and R. Nowak, "High-Rank Matrix Completion and Subspace Clustering with Missing Data", arxiv:1112.5629, 2011.
  7. 1 2 J. F. Cai, E. J. Candès and Z. Shen, "A singular value thresholding algorithm for matrix completion", SIAM J. Optim., Vol. 20, pp. 1956-1982, 2010.
  8. 1 2 R. H. Keshavan, A. Montanari and S. Oh, "Matrix completion from a few entries", arXiv:0901.3150, 2009.
  9. 1 2 P. Jain, P. Netrapalli and Sanghavi, "Low-rank Matrix Completion using Alternating Minimization", arxiv:1212.0467, 2012
This article is issued from Wikipedia - version of the Wednesday, February 03, 2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.