Talk:Kernel trick


[edit] Mercer's condition

I find this statement of Mercer's theorem very vague. Specifically, what is the space that the kernel operates in, and what is the higher-dimensional space? I would think it would be easy for anyone who knows this theorem to state the answers. Michael Hardy 22:29, 24 Aug 2003 (UTC)


After the last round of edits, the article is now factually incorrect. It is not that the kernel is decomposed into eigenfunctions. It is that the kernel is decomposed into a single dot product K(x,y) = φ(x)·φ(y). The new feature space F is defined by the map φ: x → F. φ may be a non-linear function of x, and F may have a much higher (or infinite) dimension than x.


The whole point is that linear algorithms can operate in φ space without actually computing any values in F, since F may be very high-dimensional (or even infinite-dimensional).
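To make this concrete, here is a minimal sketch (not from the discussion above; the kernel and feature map are illustrative choices) showing that the degree-2 polynomial kernel K(x,y) = (x·y)² on R² equals an ordinary dot product in a 3-dimensional feature space F, even though we never need to compute φ when using K:

```python
import math

# Homogeneous polynomial kernel of degree 2 on R^2:
# K(x, y) = (x . y)^2, evaluated entirely in the original space.
def K(x, y):
    return (x[0]*y[0] + x[1]*y[1]) ** 2

# One explicit feature map phi: R^2 -> R^3 whose dot product reproduces K.
def phi(x):
    return (x[0]**2, math.sqrt(2)*x[0]*x[1], x[1]**2)

x, y = (1.0, 2.0), (3.0, 4.0)
lhs = K(x, y)                                    # kernel in the original space
rhs = sum(a*b for a, b in zip(phi(x), phi(y)))   # dot product in feature space F
print(lhs, rhs)  # both equal (1*3 + 2*4)^2 = 121.0
```

The point of the trick is that a linear algorithm only ever needs the values K(x,y), so φ (which may map into a much higher- or infinite-dimensional space) never has to be evaluated.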


Please see Chapter 4 of Chris Burges' tutorial [1] for a more complete explanation, especially Section 4.1.


In any event, the article is now incorrect and must be changed. I can edit the article, but I'll give you a first chance to correct it. -- hike395 04:05, 26 Aug 2003 (UTC)


P.S. Notice in Burges' tutorial that Support Vector Machine is capitalized. -- hike395


Went ahead and fixed article -- hike395 04:26, 27 Aug 2003 (UTC)




I edited the article on Mercer's theorem. As stated originally, it was not correct: not every positive definite function is a product of functions. What is true is that a positive definite function, viewed as an integral operator, is the product of two non-negative operators.


This should be cleared up. Good chance for collaboration here between mathematicians and AI people. CSTAR 19:55, 11 May 2004 (UTC)



Well, I think the issue about what the kernel trick actually is should be resolvable: semi-definite functions K(x,y) are of the form ⟨φ(x), φ(y)⟩ for some function φ with values in some Hilbert space H. Could we agree to do this? To reiterate the point made on the other page, this fact, at this level of generality, has nothing to do with Mercer's theorem. CSTAR 19:32, 13 May 2004 (UTC)

I went back and checked Burges' tutorial on SVMs ([2]). Burges is very well-respected and very exacting. His tutorial says that it is Mercer's condition that guarantees that a semi-definite kernel is a dot product in a Hilbert space, not Mercer's theorem. He cites Courant and Hilbert, 1953.
I think things were mixed up between the two terms: condition and theorem. So, let's fix this article to say Mercer's condition, not Mercer's theorem, and then everything will be OK. Right? -- hike395 05:16, 14 May 2004 (UTC)

One more comment: As stated in the main article [[Kernel trick]], Mercer's condition needs a measure μ on the space X in order for the inequality stated there to make sense. A simpler condition, which does not involve integrals (and in some cases is equivalent to the integral form of Mercer's condition), is

\sum_{i,j} K(x_i,x_j) c_i c_j \geq 0

for any finite sequence x_1, ..., x_n in X and any sequence c_1, ..., c_n of real numbers.
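This finite-sum form can be checked numerically: for any finite sample, the condition \sum_{i,j} K(x_i,x_j) c_i c_j ≥ 0 for all c is exactly the statement that the Gram matrix G_{ij} = K(x_i, x_j) is positive semi-definite, i.e. has no negative eigenvalues. A quick sanity-check sketch (the Gaussian/RBF kernel here is just an example choice, not something specified above):

```python
import numpy as np

# Gaussian (RBF) kernel, a standard positive semi-definite kernel on R^d.
def K(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))   # a finite sequence x_1, ..., x_n in X = R^2

# Gram matrix G_ij = K(x_i, x_j).
G = np.array([[K(a, b) for b in xs] for a in xs])

# sum_ij K(x_i, x_j) c_i c_j >= 0 for all c  <=>  all eigenvalues of G are >= 0
# (up to floating-point tolerance).
eigs = np.linalg.eigvalsh(G)
print(eigs.min() >= -1e-10)  # True
```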

Do you really need the integral form of Mercer's condition? Also I think it would be a lot clearer to mathematicians if you said:

  • φ is a function with values in H

instead of

  • φ is a function with image in H
I agree. Also changed image to range, because the word linked to the article function range. -- hike395 12:56, 14 May 2004 (UTC)

In my opinion the whole statement of Mercer's condition here could be replaced with a simpler and more germane statement of what it means for a kernel function to be symmetric positive semidefinite:

K(x_i, x_i) \ge 0 for all nonzero x_i \in X
K(x_i, x_j) = K(x_j, x_i) for all x_i, x_j \in X
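As an illustration, these two properties are easy to verify directly on a finite sample (a sketch using the linear kernel K(x,y) = x·y as an example; the kernel and sample points are illustrative, not from the discussion):

```python
# Linear kernel K(x, y) = x . y on R^2, used only as a sample kernel.
def K(x, y):
    return x[0]*y[0] + x[1]*y[1]

points = [(1.0, 2.0), (-3.0, 0.5), (0.0, -1.0)]

# K(x_i, x_i) >= 0 for every sample point.
diag_ok = all(K(x, x) >= 0 for x in points)

# K(x_i, x_j) = K(x_j, x_i) for every pair of sample points.
sym_ok = all(K(x, y) == K(y, x) for x in points for y in points)

print(diag_ok, sym_ok)  # True True
```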


Kenahoo 18:53, 15 September 2006 (UTC)

[edit] External link

Please put the new link in an external link section. CSTAR 12:42, 19 Aug 2004 (UTC)

[edit] Intro typo?

This sounds wrong:

by mapping the original observations into a higher-dimensional non-linear space so that linear classification in the new space is equivalent to non-linear classification in the original space.

Should it say this:

by mapping the original observations into a higher-dimensional linear space so that linear classification in the new space is equivalent to non-linear classification in the original space.

or is it correct as is?

— BenFrantzDale

Yes, I believe you are correct. Alternatively we could just remove the word non-linear altogether:

by mapping the original observations into a higher-dimensional space so that linear classification in the new space is equivalent to non-linear classification in the original space.

Kenahoo 18:41, 15 September 2006 (UTC)