Kernel methods for vector output

Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function. Kernels encapsulate the properties of functions in a computationally efficient way and allow algorithms to easily swap functions of varying complexity.

In typical machine learning algorithms, these functions produce a scalar output. Recent development of kernel methods for functions with vector-valued output is due, at least in part, to interest in simultaneously solving related problems. Kernels which capture the relationship between the problems allow them to borrow strength from each other. Algorithms of this type include multi-task learning (also called multi-output learning or vector-valued learning), transfer learning, and co-kriging. Multi-label classification can be interpreted as mapping inputs to (binary) coding vectors with length equal to the number of classes.

In Gaussian processes, kernels are called covariance functions. Multiple-output functions correspond to considering multiple processes. See Bayesian interpretation of regularization for the connection between the two perspectives.

History

The history of learning vector-valued functions is closely linked to transfer learning, a broad term that refers to systems that learn by transferring knowledge between different domains. The fundamental motivation for transfer learning in the field of machine learning was discussed in a NIPS-95 workshop on “Learning to Learn,” which focused on the need for lifelong machine learning methods that retain and reuse previously learned knowledge. Research on transfer learning has attracted much attention since 1995 in different names: learning to learn, lifelong learning, knowledge transfer, inductive transfer, multitask learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, metalearning, and incremental/cumulative learning.[1] Interest in learning vector-valued functions was particularly sparked by multitask learning, a framework which tries to learn multiple, possibly different tasks simultaneously.

Much of the initial research in multitask learning in the machine learning community was algorithmic in nature, and applied to methods such as neural networks, decision trees and k-nearest neighbors in the 1990s.[2] The use of probabilistic models and Gaussian processes was pioneered and largely developed in the context of geostatistics, where prediction over vector-valued output data is known as cokriging.[3][4][5] Geostatistical approaches to multivariate modeling are mostly formulated around the linear model of coregionalization (LMC), a generative approach for developing valid covariance functions that has been used for multivariate regression and in statistics for computer emulation of expensive multivariate computer codes. The regularization and kernel theory literature for vector-valued functions followed in the 2000s.[6][7] While the Bayesian and regularization perspectives were developed independently, they are in fact closely related.[8]

Notation

In this context, the supervised learning problem is to learn the function f which best predicts vector-valued outputs \mathbf{y_i} given inputs (data) \mathbf{x_i}.

f(\mathbf{x_i}) = \mathbf{y_i} for i=1, \ldots ,N
\mathbf{x_i} \in \mathcal{X}, an input space (e.g. \mathcal{X} = \mathbb{R}^p)
\mathbf{y_i} \in \mathbb{R}^D

In general, each component of (\mathbf{y_i}), could have different input data (\mathbf{x_{d,i}}) with different cardinality (p) and even different input spaces (\mathcal{X}).[8] Geostatistics literature calls this case heterotopic, and uses isotopic to indicate that the each component of the output vector has the same set of inputs.[9]

Here, for simplicity in the notation, we assume the number and sample space of the data for each output are the same.

Regularization perspective[8][10][11]

From the regularization perspective, the problem is to learn f_* belonging to a reproducing kernel Hilbert space of vector-valued functions (\mathcal{H}). This is similar to the scalar case of Tikhonov regularization, with some extra care in the notation.

Vector-valued case Scalar case
Reproducing kernel \mathbf{K}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}^{D \times D} k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}
Learning problem f_* = \operatorname{argmin} \sum\limits_{j=1}^D \frac{1}{N} \sum\limits_{i=1}^N (f_j(\mathbf{x_i}) - y_{j,i})^2 + \lambda \Vert \mathbf{f} \Vert_\mathbf{K}^2 f_* = \operatorname{argmin} \frac{1}{N} \sum\limits_{i=1}^N (f(\mathbf{x_i}) - y_{i})^2 + \lambda \Vert \mathbf{f} \Vert_k^2
Solution

(derived via the representer theorem^{\dagger})
f_*(\mathbf{x}) =  \sum\limits_{i=1}^N \mathbf{K}(\mathbf{x_i},\mathbf{x})c_i

with \bar{\mathbf{c}} = (\mathbf{K}(\mathbf{X},\mathbf{X}) + \lambda N\mathbf(I))^{-1}\bar{\mathbf{y}} ,
where \bar{\mathbf{c}} \text{ and } \bar{\mathbf{y}} are the coefficients and output vectors concatenated to form ND vectors and \mathbf{K}(\mathbf{X},\mathbf{X}) \text{ is an } ND \times ND matrix of N \times N blocks: (\mathbf{K}(\mathbf{x_i},\mathbf{x_j}))_{d,d'}

f_*(\mathbf{x}) =  \sum\limits_{i=1}^N k(\mathbf{x_i},\mathbf{x})c_i = \mathbf{k}_\mathbf{x}^\intercal \mathbf{c}

Solve for \mathbf{c} by taking the derivative of the learning problem, setting it equal to zero, and substituting in the above expression for f_*:

\mathbf{c} = (\mathbf{K} + \lambda I)^{-1}\mathbf{y}

where \mathbf{K}_{ij} = k(\mathbf{x_i},\mathbf{x_j}) = i^{\text{th}} \text{ element of }\mathbf{k}_\mathbf{x_j}

^{\dagger}It is possible, though non-trivial, to show that a representer theorem also holds for Tikhonov regularization in the vector-valued setting.[8]

Note, the matrix-valued kernel \mathbf{K} can also be defined by a scalar kernel R on the space  \mathcal{X} \times \{1, \ldots ,D\}. An isometry exists between the Hilbert spaces associated with these two kernels:

(\mathbf{K}(x,x'))_{d,d'} = R((x,d),(x',d'))

Gaussian process perspective

The estimator of the vector-valued regularization framework can also be derived from a Bayesian viewpoint using Gaussian process methods in the case of a finite dimensional Reproducing kernel Hilbert space. The derivation is similar to the scalar-valued case Bayesian interpretation of regularization. The vector-valued function \textbf{f}, consisting of D outputs \left\{f_d\right\}_{d=1}^D, is assumed to follow a Gaussian process:

\textbf{f} \sim \mathcal{GP}(\textbf{m},\textbf{K})

where \textbf{m}: \mathcal{X} \to \textbf{R}^D is now a vector of the mean functions \left\{m_d(\textbf{x})\right\}_{d=1}^D for the outputs and \textbf{K} is a positive definite matrix-valued function with entry (\textbf{K}(\textbf{x},\textbf{x}'))_{d,d'} corresponding to the covariance between the outputs f_d(\textbf{x}) and f_{d'}(\textbf{x}').

For a set of inputs \textbf{X}, the prior distribution over the vector \textbf{f}(\textbf{X}) is given by \mathcal{N}(\textbf{m}(\textbf{X}),\textbf{K}(\textbf{X},\textbf{X})), where \textbf{m}(\textbf{X}) is a vector that concatenates the mean vectors associated to the outputs and \textbf{K}(\textbf{X},\textbf{X}) is a block-partitioned matrix. The distribution of the outputs is taken to be Gaussian:

p(\textbf{y}\mid \textbf{f},\textbf{x}, \Sigma) = \mathcal{N}(\textbf{f}(\textbf{x}),\Sigma)

where \Sigma \in \mathcal{\textbf{R}}^{D \times D} is a diagonal matrix with elements \left\{\sigma_d^2\right\}_{d=1}^{D} specifying the noise for each output. Using this form for the likelihood, the predictive distribution for a new vector \textbf{x}_* is:

p(\textbf{f}(\textbf{x}_*)\mid\textbf{S},\textbf{f},\textbf{x}_*,\phi) = \mathcal{N}(\textbf{f}_*(\textbf{x}_*),\textbf{K}_*(\textbf{x}_*,\textbf{x}_*))

where \textbf{S} is the training data, and \phi is a set of hyperparameters for \textbf{K}(\textbf{x},\textbf{x}') and \Sigma.

Equations for \textbf{f}_* and \textbf{K}_* can then be obtained:

\textbf{f}_*(\textbf{x}_*) = \textbf{K}_{\textbf{x}_*}^T(\textbf{K}(\textbf{X},\textbf{X}) + \boldsymbol\Sigma)^{-1}\bar{\textbf{y}}
\textbf{K}_*(\textbf{x}_*,\textbf{x}_*) = \textbf{K}(\textbf{x}_*,\textbf{x}_*) - \textbf{K}_{\textbf{x}_*}(\textbf{K}(\textbf{X},\textbf{X}) + \boldsymbol\Sigma)^{-1}\textbf{K}_{\textbf{x}_*}^T

where \boldsymbol\Sigma = \Sigma \otimes \textbf{I}_N, \textbf{K}_{\textbf{x}_*} \in \mathcal{\textbf{R}}^{D \times ND} has entries (\textbf{K}(\textbf{x}_*,\textbf{x}_j))_{d,d'} for j = 1,\cdots,N and d,d' = 1,\cdots,D. Note that the predictor \textbf{f}^* is identical to the predictor derived in the regularization framework. For non-Gaussian likelihoods different methods such as Laplace approximation and variational methods are needed to approximate the estimators.

Example kernels

Separable

A simple, but broadly applicable, class of multi-output kernels can be separated into the product of a kernel on the input-space and a kernel representing the correlations among the outputs:[8]

(\mathbf{K}(\mathbf{x},\mathbf{x'}))_{d,d'} = k(\mathbf{x},\mathbf{x'})k_T(d,d')
k: scalar kernel on \mathcal{X} \times \mathcal{X}
k_T: scalar kernel on \{1, \ldots ,D\} \times \{1, \ldots ,D\}

In matrix form: \mathbf{K}(\mathbf{x},\mathbf{x'}) = k(\mathbf{x},\mathbf{x'})\mathbf{B}    where \mathbf{B} is a D \times D symmetric and positive semi-definite matrix. Note, setting \mathbf{B} to the identity matrix treats the outputs as unrelated and is equivalent to solving the scalar-output problems separately.

For a slightly more general form, adding several of these kernels yields sum of separable kernels (SoS kernels).

From regularization literature[8][10][12][13][14]

Derived from regularizer

One way of obtaining k_T is to specify a regularizer which limits the complexity of f in a desirable way, and then derive the corresponding kernel. For certain regularizers, this kernel will turn out to be separable.

Mixed-effect regularizer

R(\mathbf{f}) = A_\omega(C_\omega \sum\limits_{l=1}^D \| f_l \|_k^2 + \omega D \sum\limits_{l=1}^D \| f_l - \bar{f} \|_k^2)

where:

where \mathbf{1} \text{ is a } D \times D matrix with all entries equal to 1.

This regularizer is a combination of limiting the complexity of each component of the estimator (f_l) and forcing each component of the estimator to be close to the mean of all the components. Setting \omega = 0 treats all the components as independent and is the same as solving the scalar problems separately. Setting \omega = 1 assumes all the components are explained by the same function.

Cluster-based regularizer

R(\mathbf{f}) = \varepsilon_1 \sum_{c=1}^r \sum_{l \in I(c)} \| f_l - \bar{f_c}\|_k^2 + \varepsilon_2 \sum\limits_{c=1}^r m_c \| \bar{f_c}\|_k^2

where:

where  \mathbf{G}_{l,q} = \varepsilon_1 \delta_{lq} + (\varepsilon_2 - \varepsilon_1)\mathbf{M}_{l,q}

This regularizer divides the components into r clusters and forces the components in each cluster to be similar.

Graph regularizer

R(\mathbf{f}) = \frac{1}{2} \sum\limits_{l,q=1}^D \Vert f_l - f_q \Vert_k^2 \mathbf{M}_{lq} + \sum\limits_{l=1}^D \Vert f_l \Vert_k^2 \mathbf{M}_{l,l}

where \mathbf{M} \text{ is a } D \times D matrix of weights encoding the similarities between the components

K(x,x') = k(x,x') \mathbf{L}^\dagger

where \mathbf{L} = \mathbf{D} - \mathbf{M},   \mathbf{D}_{l,q} = \delta_{l,q}(\sum\limits_{h=1}^D \mathbf{M}_{l,h} + \mathbf{M}_{l,q})

Note, \mathbf{L} is the graph laplacian. See also: graph kernel.

Learned from data

Several approaches to learning \mathbf{B} from data have been proposed.[8] These include: performing a preliminary inference step to estimate \mathbf{B} from the training data,[9] a proposal to learn \mathbf{B} and \mathbf{f} together based on the cluster regularizer,[15] and sparsity-based approaches which assume only a few of the features are needed.[16] [17]

From Bayesian literature

Linear model of coregionalization (LMC)

In LMC, outputs are expressed as linear combinations of independent random functions such that the resulting covariance function (over all inputs and outputs) is a valid positive semidefinite function. Assuming D outputs \left\{f_d(\textbf{x})\right\}_{d=1}^D with \textbf{x} \in \mathcal{\textbf{R}}^p, each f_d is expressed as:

f_d(\textbf{x}) = \sum_{q=1}^Q{a_{d,q}u_q(\textbf{x})}

where a_{d,q} are scalar coefficients and the independent functions u_q(\textbf{x}) have zero mean and covariance cov[u_q(\textbf{x}),u_{q'}(\textbf{x}')] = k_q(\textbf{x},\textbf{x}') if q=q' and 0 otherwise. The cross covariance between any two functions f_d(\textbf{x}) and f_{d'}(\textbf{x}) can then be written as:

\operatorname{cov}[f_d(\textbf{x}),f_{d'}(\textbf{x}')] = \sum_{q=1}^Q{\sum_{i=1}^{R_q}{a_{d,q}^ia_{d',q}^{i}k_q(\textbf{x},\textbf{x}')}} = \sum_{q=1}^Q{b_{d,d'}^qk_q(\textbf{x},\textbf{x}')}

where the functions u_q^i(\textbf{x}), with q=1,\cdots,Q and i=1,\cdots,R_q have zero mean and covariance cov[u_q^i(\textbf{x}),u_{q'}^{i'}(\textbf{x})'] = k_q(\textbf{x},\textbf{x}') if i=i' and q=q'. But \operatorname{cov}[f_d(\textbf{x}),f_{d'}(\textbf{x}')] is given by (\textbf{K}(\textbf{x},\textbf{x}'))_{d,d'}. Thus the kernel \textbf{K}(\textbf{x},\textbf{x}') can now be expressed as

\textbf{K}(\textbf{x},\textbf{x}') = \sum_{q=1}^Q{\textbf{B}_qk_q(\textbf{x},\textbf{x}')}

where each \textbf{B}_q \in \mathcal{\textbf{R}}^{D \times D} is known as a coregionalization matrix. Therefore, the kernel derived from LMC is a sum of the products of two covariance functions, one that models the dependence between the outputs, independently of the input vector \textbf{x} (the coregionalization matrix \textbf{B}_q), and one that models the input dependence, independently of \left\{f_d(\textbf{x})\right\}_{d=1}^D(the covariance function k_q(\textbf{x},\textbf{x}')).

Intrinsic coregionalization model (ICM)

The ICM is a simplified version of the LMC, with Q=1. ICM assumes that the elements b_{d,d'}^q of the coregionalization matrix \textbf{B}_q can be written as b_{d,d'}^q = v_{d,d'}b_q, for some suitable coefficients v_{d,d'}. With this form for b_{d,d'}^q:

\operatorname{cov}[f_d(\textbf{x}),f_{d'}(\textbf{x}')] =  \sum_{q=1}^Q{v_{d,d'}b_qk_q(\textbf{x},\textbf{x}')} = v_{d,d'}\sum_{q=1}^Q{b_qk_q(\textbf{x},\textbf{x}')} =  v_{d,d'}k(\textbf{x},\textbf{x}')

where k(\textbf{x},\textbf{x}') = \sum_{q=1}^Q{b_qk_q(\textbf{x},\textbf{x}')}. In this case, the coefficients v_{d,d'} = \sum_{i=1}^{R_1}{a_{d,1}^ia_{d',1}^i} = b_{d,d'}^1 and the kernel matrix for multiple outputs becomes \textbf{K}(\textbf{x},\textbf{x}') = k(\textbf{x},\textbf{x}')\textbf{B}. ICM is much more restrictive than the LMC since it assumes that each basic covariance k_q(\textbf{x},\textbf{x}') contributes equally to the construction of the autocovariances and cross covariances for the outputs. However, the computations required for the inference are greatly simplified.

Semiparametric latent factor model (SLFM)

Another simplified version of the LMC is the semiparametric latent factor model (SLFM), which corresponds to setting R_q = 1 (instead of Q = 1 as in ICM). Thus each latent function u_q has its own covariance.

Non-separable

While simple, the structure of separable kernels can be too limiting for some problems.

Notable examples of non-separable kernels in the regularization literature include:

In the Bayesian perspective, LMC produces a separable kernel because the output functions evaluated at a point \textbf{x} only depend on the values of the latent functions at \textbf{x}. A non-trivial way to mix the latent functions is by convolving a base process with a smoothing kernel. If the base process is a Gaussian process, the convolved process is Gaussian as well. We can therefore exploit convolutions to construct covariance functions.[20] This method of producing non-separable kernels is known as process convolution. Process convolutions were introduced for multiple outputs in the machine learning community as "dependent Gaussian processes".[21]

Implementation

When implementing an algorithm using any of the kernels above, practical considerations of tuning the parameters and ensuring reasonable computation time must be considered.

Regularization perspective

Approached from the regularization perspective, parameter tuning is similar to the scalar-valued case and can generally be accomplished with cross validation. Solving the required linear system is typically expensive in memory and time. If the kernel is separable, a coordinate transform can convert \mathbf{K}(\mathbf{X},\mathbf{X}) to a block-diagonal matrix, greatly reducing the computational burden by solving D independent subproblems (plus the eigendecomposition of \mathbf{B}). In particular, for a least squares loss function (Tikhonov regularization), there exists a closed form solution for \bar{\mathbf{c}}:[8][14]

\bar{\mathbf{c}}^d = (k(\mathbf{X},\mathbf{X}) + \frac{\lambda_N}{\sigma_d} \mathbf{I})^{-1}\frac{\bar{\mathbf{y}}^d}{\sigma_d}

Bayesian perspective

There are many works related to parameter estimation for Gaussian processes. Some methods such as maximization of the marginal likelihood (also known as evidence approximation, type II maximum likelihood, empirical Bayes), and least squares give point estimates of the parameter vector \phi. There are also works employing a full Bayesian inference by assigning priors to \phi and computing the posterior distribution through a sampling procedure. For non-Gaussian likelihoods, there is no closed form solution for the posterior distribution or for the marginal likelihood. However, the marginal likelihood can be approximated under a Laplace, variational Bayes or expectation propagation (EP) approximation frameworks for multiple output classification and used to find estimates for the hyperparameters.

The main computational problem in the Bayesian viewpoint is the same as the one appearing in regularization theory of inverting the matrix \overline{\textbf{K}(\textbf{X},\textbf{X})} =  \textbf{K}(\textbf{X},\textbf{X})  + \boldsymbol\Sigma. This step is necessary for computing the marginal likelihood and the predictive distribution. For most proposed approximation methods to reduce computation, the computational efficiency gained is independent of the particular method employed (e.g. LMC, process convolution) used to compute the multi-output covariance matrix. A summary of different methods for reducing computational complexity in multi-output Gaussian processes is presented in.[8]

References

  1. S.J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, 22, 2010
  2. Rich Caruana, "Multitask Learning," Machine Learning, 41–76, 1997
  3. J. Ver Hoef and R. Barry, "Constructing and fitting models for cokriging and multivariable spatial prediction," Journal of Statistical Planning and Inference, 69:275–294, 1998
  4. P. Goovaerts, "Geostatistics for Natural Resources Evaluation," Oxford University Press, USA, 1997
  5. N. Cressie "Statistics for Spatial Data," John Wiley & Sons Inc. (Revised Edition), USA, 1993
  6. C.A. Micchelli and M. Pontil, "On learning vector-valued functions," Neural Computation, 17:177–204, 2005
  7. C. Carmeli et al., "Vector valued reproducing kernel hilbert spaces of integrable functions and mercer theorem," Anal. Appl. (Singap.), 4
  8. 8.0 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 8.10 Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence, "Kernels for Vector-Valued Functions: A Review," Foundations and Trends® in Machine Learning 4, no. 3 (2012): 195–266. doi: 10.1561/2200000036 arXiv:1106.6251
  9. 9.0 9.1 Hans Wackernagel. Multivariate Geostatistics. Springer-Verlag Heidelberg New york, 2003.
  10. 10.0 10.1 C.A. Micchelli and M. Pontil. On learning vector–valued functions. Neural Computation, 17:177–204, 2005.
  11. C.Carmeli, E.DeVito, and A.Toigo. Vector valued reproducing kernel Hilbert spaces of integrable functions and Mercer theorem. Anal. Appl. (Singap.), 4(4):377–408, 2006.
  12. C. A. Micchelli and M. Pontil. Kernels for multi-task learning. In Advances in Neural Information Processing Systems (NIPS). MIT Press, 2004.
  13. T.Evgeniou, C.A.Micchelli, and M.Pontil. Learning multiple tasks with kernel methods. Journal of Machine Learning Research, 6:615–637, 2005.
  14. 14.0 14.1 L. Baldassarre, L. Rosasco, A. Barla, and A. Verri. Multi-output learning via spectral filtering. Technical report, Massachusetts Institute of Technology, 2011. MIT-CSAIL-TR-2011-004, CBCL-296.
  15. Laurent Jacob, Francis Bach, and Jean-Philippe Vert. Clustered multi-task learning: A convex formulation. In NIPS 21, pages 745–752, 2008.
  16. Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.
  17. Andreas Argyriou, Andreas Maurer, and Massimiliano Pontil. An algorithm for transfer learning in a heterogeneous environment. In ECML/PKDD (1), pages 71–85, 2008.
  18. I. Maceˆdo and R. Castro. Learning divergence-free and curl-free vector fields with matrix-valued kernels. Technical report, Instituto Nacional de Matematica Pura e Aplicada, 2008.
  19. A. Caponnetto, C.A. Micchelli, M. Pontil, and Y. Ying. Universal kernels for multi-task learning. Journal of Machine Learning Research, 9:1615–1646, 2008.
  20. D. Higdon, "Space and space-time modeling using process convolutions, Quantitative methods for current environmental issues, 37–56, 2002
  21. P. Boyle and M. Frean, "Dependent gaussian processes, Advances in Neural Information Processing Systems, 17:217–224, MIT Press, 2005