K-SVD

Machine learning and data mining

Problems
Classification Clustering Regression Anomaly detection Association rules Reinforcement learning Structured prediction Feature learning Online learning Semi-supervised learning Grammar induction
Supervised learning (classification • regression)
Decision trees Ensembles (Bagging, Boosting, Random forest) k-NN Linear regression Naive Bayes Neural networks Logistic regression Perceptron Support vector machine (SVM) Relevance vector machine (RVM)
Clustering
BIRCH Hierarchical k-means Expectation-maximization (EM) DBSCAN OPTICS Mean-shift
Dimensionality reduction
Factor analysis CCA ICA LDA NMF PCA t-SNE
Structured prediction
Graphical models (Bayes net, CRF, HMM)
Anomaly detection
k-NN Local outlier factor
Neural nets
Autoencoder Deep learning Multilayer perceptron RNN Restricted Boltzmann machine SOM Convolutional neural network
Theory
Bias-variance dilemma Computational learning theory Empirical risk minimization PAC learning Statistical learning VC theory
Machine learning portal Computer science portal Statistics portal

In applied mathematics, K-SVD is a dictionary learning algorithm for creating a dictionary for sparse representations, via a singular value decomposition approach. K-SVD is a generalization of the k-means clustering method, and it works by iteratively alternating between sparse coding the input data based on the current dictionary, and updating the atoms in the dictionary to better fit the data.^[1]^[2] K-SVD can be found widely in use in applications such as image processing, audio processing, biology, and document analysis.

Problem description

Main article: Sparse approximation

The goal of dictionary learning is to learn an overcomplete dictionary matrix $D \in \mathbb{R}^{n \times K}$ that contains $K$ signal-atoms (in this notation, columns of $D$ ). A signal vector $y \in \mathbb{R}^{n}$ can be represented, sparsely, as a linear combination of these atoms; to represent $y$ , the representation vector $x$ should satisfy the exact condition $y = Dx$ , or the approximate condition $y \approx Dx$ , made precise by requiring that $\|y - Dx\|_p \le \epsilon$ for some small value $ε$ and some $Lₚ$ norm. The vector $x \in \mathbb{R}^{K}$ contains the representation coefficients of the signal $y$ . Typically, the norm $p$ is selected as $L 1$ , $L 2$ , or $L \infty$ .

If $n < K$ and D is a full-rank matrix, an infinite number of solutions are available for the representation problem, Hence, constraints should be set on the solution. Also, to ensure sparsity, the solution with the fewest number of nonzero coefficients is preferred. Thus, the sparsity representation is the solution of either

(P_0) \quad \min \limits _x \|x\|_0 \qquad \text{subject to } y = Dx

(P_{0, \epsilon}) \quad \min \limits _x \|x\|_0 \qquad \text{subject to } \|y - Dx\|_2 \le \epsilon

where the $\|x\|_0$ counts the nonzero entries in the vector $x$ . (See the zero "norm".)

K-SVD algorithm

K-SVD is a kind of generalization of K-means, as follows. The k-means clustering can be also regarded as a method of sparse representation. That is, finding the best possible codebook to represent the data samples $\{y_i\}^M_{i=1}$ by nearest neighbor, by solving

\quad \min \limits _{D, X} \{ \|Y - DX\|^2_F\} \qquad \text{subject to } \forall i, x_i = e_k \text{ for some } k.

which is quite similar to

\quad \min \limits _{D, X} \{ \|Y - DX\|^2_F\} \qquad \text{subject to }\quad \forall i , \|x_i\|_0 = 1.

The sparse representation term $x_i = e_k$ enforces K-means algorithm to use only one atom (column) in dictionary D. To relax this constraint, the target of the K-SVD algorithm is to represent signal as a linear combination of atoms in D.

The K-SVD algorithm follows the construction flow of K-means algorithm. However, In contrary to K-means, in order to achieve linear combination of atoms in D, sparsity term of the constrain is relaxed so that nonzero entries of each column $x_i$ can be more than 1, but less than a number $T_0$ .

So, the objective function becomes

\quad \min \limits _{D, X} \{ \|Y - DX\|^2_F \} \qquad \text{subject to } \quad \forall i \;, \|x_i\|_0 \le T_0.

or in another objective form

\quad \min \limits _{D, X} \sum_{i} \|x_i\|_0 \qquad \text{subject to } \quad \forall i \;, \|Y - DX\|^2_F \le \epsilon.

In the K-SVD algorithm, the $D$ is first to be fixed and the best coefficient matrix $X$ . As finding the truly optimal $X$ is impossible, we use an approximation pursuit method. Any such algorithm as OMP, the orthogonal matching pursuit in can be used for the calculation of the coefficients, as long as it can supply a solution with a fixed and predetermined number of nonzero entries $T_0$ .

After the sparse coding task, the next is to search for a better dictionary $D$ . However, finding the whole dictionary all at a time is impossible, so the process then update only one column of the dictionary $D$ each time while fix $X$ . The update of $k-th$ is done by rewriting the penalty term as

\|Y - DX\|^2_F = \left| Y - \sum_{j = 1}^K d_j x^j_T\right|^2_F = \left| \left(Y - \sum_{j \ne k} d_j x^j_T \right) - d_k x^k_T \right|^2_F = \| E_k - d_k x^k_T\|^2_F

where $x^k_T$ denotes the k-th row of X.

By decomposing the multiplication $DX$ into sum of $K$ rank 1 matrices, we can assume the other $K-1$ terms are assumed fixed, and the $k-th$ remains unknown. After this step, we can solve the minimization problem by approximate the $E_k$ term with a $rank -1$ matrix using singular value decomposition, then update $d_k$ with it. However, the new solution of vector $x^k_T$ is very likely to be filled, because the sparsity constrain is not enforced.

To cure this problem, Define $\omega_k$ as

\omega_k = \{i \mid 1 \le i \le N , x^k_T(i) \ne 0\}.

Which points to examples $\{ y_i \}$ that use atom $d_k$ (also the entries of $x_i$ that is nonzero). Then, define $\Omega_k$ as a matrix of size $N\times|\omega_k|$ , with ones on the $( i\text{-th}, \omega_k(i))$ entries and zeros otherwise. When multiplying $x_R^k = x^k_T\Omega_k$ , this shrinks the row vector $x_T^k$ by discarding the nonzero entries. Similarly, the multiplication $Y^R_k = Y\Omega_k$ is the subset of the examples that are current using the $d_k$ atom. The same effect can be seen on $E^R_k = E_k\Omega_k$ .

So the minimization problem as mentioned before becomes

\| E_k\Omega_k - d_k x^k_T\Omega_k\|^2_F = \| E^R_k - d_k x^k_R\|^2_F

and can be done by directly using SVD. SVD decomposes $E^R_k$ into $U\Delta V^T$ . The solution for $d_k$ is the first column of U, the coefficient vector $x^k_R$ as the first column of $V \times \Delta (1, 1)$ . After updated the whole dictionary, the process then turns to iteratively solve X, then iteratively solve D.

Limitations

Choosing an appropriate "dictionary" for a dataset is a non-convex problem, and K-SVD operates by an iterative update which does not guarantee to find the global optimum.^[2] However, this is common to other algorithms for this purpose, and K-SVD works fairly well in practice.^[2]

References

↑ Michal Aharon, Michael Elad, and Alfred Bruckstein (2006), "K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation", IEEE Transactions on Signal Processing 54 (11): 4311–4322, doi:10.1109/TSP.2006.881199
↑ 2.0 2.1 2.2 Rubinstein, R., Bruckstein, A.M., and Elad, M. (2010), "Dictionaries for Sparse Representation Modeling", Proceedings of the IEEE 98 (6): 1045–1057, doi:10.1109/JPROC.2010.2040551

K-SVD

Problem description

K-SVD algorithm

Limitations

See also

References