Coordinate descent

Coordinate descent is a derivative-free optimization algorithm. To find a local minimum of a function, it performs a line search along one coordinate direction at the current point in each iteration, cycling through the coordinate directions over the course of the procedure. On non-separable functions the algorithm may fail to find the optimum in a reasonable number of function evaluations. Convergence can be improved by gradually learning an appropriate coordinate system, such that new search coordinates obtained using PCA are as decorrelated as possible with respect to the objective function (see Adaptive coordinate descent for more details).

Description

Coordinate descent is based on the idea that the minimization of a multivariable function F(\mathbf{x}) can be achieved by minimizing it along one direction at a time. Instead of varying the descent direction according to the gradient, one fixes the set of search directions at the outset; for instance, one can choose the standard basis vectors \mathbf{e}_1, \mathbf{e}_2, \dots, \mathbf{e}_n as the search directions. One then cyclically iterates through the directions, minimizing the objective function with respect to one coordinate at a time. It follows that, if \mathbf{x}^k is given, the ith coordinate of \mathbf{x}^{k+1} is given by

 x^{k+1}_i = \underset{y\in\mathbb{R}}{\operatorname{arg\,min}}\; F(x^{k+1}_1,\dots,x^{k+1}_{i-1},\,y,\,x^k_{i+1},\dots,x^k_n).

Thus, one begins with an initial guess \mathbf{x}^0 for a local minimum of F and obtains a sequence \mathbf{x}^0, \mathbf{x}^1, \mathbf{x}^2, \dots iteratively.

By doing a line search in each iteration, one automatically has

F(\mathbf{x}^0)\ge F(\mathbf{x}^1)\ge F(\mathbf{x}^2)\ge \cdots.

It can be shown that this sequence has similar convergence properties to steepest descent. No improvement after one full cycle of line searches along the coordinate directions implies that a stationary point has been reached.

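As a minimal sketch of the procedure (the quadratic objective, the helper coordinate_descent, and the use of SciPy's minimize_scalar for the one-dimensional searches are assumptions made for illustration, not part of any standard implementation), cyclic coordinate descent on a smooth convex function might look as follows in Python:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(F, x0, n_cycles=10):
    """Cyclic coordinate descent: minimize F along one coordinate at a time."""
    x = np.asarray(x0, dtype=float).copy()
    for cycle in range(n_cycles):
        for i in range(x.size):
            # Exact line search along coordinate direction e_i:
            # x_i^{k+1} = argmin_y F(x_1^{k+1}, ..., x_{i-1}^{k+1}, y, x_{i+1}^k, ..., x_n^k)
            def F_along_i(y, i=i):
                z = x.copy()
                z[i] = y
                return F(z)
            x[i] = minimize_scalar(F_along_i).x
        print(f"cycle {cycle + 1}: F(x) = {F(x):.6f}")   # values are non-increasing
    return x

# Example objective (an assumption for this sketch): a strictly convex quadratic
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
F = lambda x: 0.5 * x @ A @ x - b @ x

x_star = coordinate_descent(F, x0=[5.0, 5.0])
print("approximate minimizer:", x_star)   # converges towards A^{-1} b = (0.6, -0.8)
```

The printed values of F never increase from one cycle to the next, reflecting the monotone decrease guaranteed by the line searches.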

Examples

Coordinate descent can have problems with non-smooth objective functions: the iteration may get stuck at a non-stationary point if the level curves of the function are not smooth.
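A concrete illustration (this particular function is an assumption chosen for the example, not taken from a reference) is F(x_1, x_2) = |x_1 + x_2| + 2|x_1 - x_2|, whose unique minimizer is the origin. Starting from a point with x_1 = x_2 \ne 0, neither single-coordinate line search can decrease F, so the iteration stalls at a point that is not stationary. A minimal Python sketch, again assuming SciPy's minimize_scalar for the one-dimensional searches:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Non-smooth convex function whose only minimizer is the origin.
F = lambda x: abs(x[0] + x[1]) + 2.0 * abs(x[0] - x[1])

x = np.array([1.0, 1.0])          # start on the "ridge" x_1 = x_2
for cycle in range(5):
    for i in range(2):
        def F_along_i(y, i=i):
            z = x.copy()
            z[i] = y
            return F(z)
        x[i] = minimize_scalar(F_along_i).x

print("final point:", x)           # remains (close to) (1, 1)
print("F at final point:", F(x))   # about 2, but the true minimum is F(0, 0) = 0
```

Along each coordinate direction the current point is already a one-dimensional minimizer, so no single-coordinate move can make progress, even though moving along the diagonal direction (-1, -1) would decrease F.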

Applications

Coordinate descent algorithms are used in machine learning, e.g. for training linear support vector machines[1] (see LIBLINEAR) and non-negative matrix factorization.[2]
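As an illustration of the support vector machine case, the following Python sketch performs coordinate descent on the dual of an L1-loss linear SVM. It follows the general idea of dual coordinate descent but is only loosely modelled on the method of [1]: the shrinking and random-permutation heuristics of that paper are omitted, and the function name, toy data set, and parameter values are assumptions made for this example.

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, n_epochs=50):
    """Rough sketch of coordinate descent on the dual of an L1-loss linear SVM.

    Dual problem:  min_a  0.5 * a^T Qbar a - sum(a)   s.t.  0 <= a_i <= C,
    where Qbar_ij = y_i y_j x_i . x_j.  Maintaining w = sum_i a_i y_i x_i makes
    each coordinate update cost O(number of features).
    """
    n, d = X.shape
    a = np.zeros(n)
    w = np.zeros(d)
    q = np.einsum('ij,ij->i', X, X)           # Qbar_ii = ||x_i||^2
    for _ in range(n_epochs):
        for i in range(n):
            # Gradient of the dual objective with respect to a_i
            g = y[i] * (w @ X[i]) - 1.0
            # One-variable minimization, projected onto the box [0, C]
            a_new = min(max(a[i] - g / q[i], 0.0), C)
            w += (a_new - a[i]) * y[i] * X[i]
            a[i] = a_new
    return w

# Toy, linearly separable data (an assumption for this sketch)
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = dual_cd_linear_svm(X, y)
print("learned weights:", w)
print("predictions:", np.sign(X @ w))   # matches y on this toy data
```

Each inner step solves a one-dimensional box-constrained quadratic exactly, which is why coordinate descent is attractive for this dual formulation.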

See also

  * Adaptive coordinate descent

References

  1. Hsieh, C. J.; Chang, K. W.; Lin, C. J.; Keerthi, S. S.; Sundararajan, S. (2008). "A dual coordinate descent method for large-scale linear SVM". Proceedings of the 25th international conference on Machine learning - ICML '08 (PDF). p. 408. doi:10.1145/1390156.1390208. ISBN 9781605582054.
  2. Hsieh, C. J.; Dhillon, I. S. (2011). "Fast coordinate descent methods with variable selection for non-negative matrix factorization". Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '11 (PDF). p. 1064. doi:10.1145/2020408.2020577. ISBN 9781450308137.