Talk:Support vector machine
Relation to Kernel Machines
Can someone explain the relation of SVM to KM?--Adoniscik (talk) 06:18, 2 January 2008 (UTC)
References
The second reference (van der Walt & Barnard) is a bad paper. This is a conference paper that was not peer reviewed. Maybe remove the reference (or replace it with a decent reference). —Preceding unsigned comment added by 196.207.47.60 (talk) 22:59, 6 November 2007 (UTC)
There is not enough information about solving SVMs. There is only the mention that it's a quadratic optimization problem. A discussion of solving and a brief example would be very helpful. 216.145.54.158 21:59, 4 August 2006 (UTC)
Hey - I don't think kernel PCA should redirect here. It's sort of like redirecting "dogs" to "animals". It probably needs its own article.--137.215.9.20 09:07, 26 July 2006 (UTC)
- Indeed, kernel PCA should not redirect here - KPCA and SVM have only the kernel trick in common; one is a feature extractor, the other is a classifier. KPCA is in fact closer to spectral clustering than it is to the SVM. I'll try to write something about KPCA some time soon.
Hello! I have a comment about the first sentence, which says "A SVM is a <blank>", and <blank> has been either "statistical classification model" or "supervised learning method" lately. I'm in favor of the former because it situates SVM in a very large group of related methods from both conventional statistics and machine learning. "Supervised learning" is deficient on two counts -- supervised learning includes regression as well as classification, and it suggests only a link to machine learning and not conventional statistics. So I'd like to hear what other people have to say. Happy editing, Wile E. Heresiarch 02:35, 29 Apr 2004 (UTC)
- Yes, indeed, classification is more specific than supervised learning. However, there are both classification and regression forms of a support vector machine. The latter is often called Support Vector Regression (SVR). I added a discussion of SVR to this article, although it is difficult for laypeople to understand. Anyway, I think that supervised learning is a more accurate description. The article supervised learning is much less stubby than classification, too.
- To me, machine learning is statistics, so I don't have a preference for "supervised learning" over "classification" on that basis. -- hike395 04:46, 29 Apr 2004 (UTC)
Hmm. Not quite the way I'd put it. However, I haven't got my references on me at the moment, so I can't come up with a precise description. -- 213.253.39.90
Needs to flesh out mathematics more
As someone who is taking Calc III (and has studied a bit of linear algebra), I feel very lost in some parts of the article. For example, what does w · x + b mean in terms of the parameters? How do Lagrange multipliers play a role in this? (I know what they are and how they are used in calculus, but NOT in terms of SVMs.)
What's REALLY unclear for me, though, is how to actually FIND the hyperplane separating the points. Perhaps I haven't done enough reading on machine learning, but I'd personally be very grateful to anyone who could flesh this out enough that someone with a good foundation in calculus and a minimal foundation in linear algebra could understand how to solve it. Nothing I've seen covers this subject well enough that I am able to comprehend it conceptually. This should be fleshed out to the point where it's easily possible to write pseudocode from the ideas presented. —Preceding unsigned comment added by 69.120.16.79 (talk) 03:08, 26 November 2007 (UTC)
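As a starting point for the pseudocode request above, here is one minimal sketch of how a separating hyperplane can actually be found: sub-gradient descent on the soft-margin hinge loss. This is only an illustration, not the Lagrangian/quadratic-programming route the article describes, and the names X, y, lam, lr and epochs are assumptions chosen for the example.
<source lang="python">
# Minimal sketch: train a linear SVM by sub-gradient descent on the
# soft-margin hinge loss.  Illustrative only; not the standard
# QP / Lagrange-multiplier formulation discussed in the article.
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=100):
    """X: (n, p) array of points, y: array of labels in {-1, +1}."""
    n, p = X.shape
    w = np.zeros(p)     # normal vector of the hyperplane
    b = 0.0             # offset, so the hyperplane is w.x + b = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if y[i] * (np.dot(w, X[i]) + b) < 1:   # margin violated
                w = (1 - lr * lam) * w + lr * y[i] * X[i]
                b += lr * y[i]
            else:                                  # only shrink w (margin term)
                w = (1 - lr * lam) * w
    return w, b

# usage: w, b = train_linear_svm(X, y); predict labels with np.sign(X @ w + b)
</source>
The regularisation weight lam plays the role of the soft-margin constant; the classical approach instead solves the dual QP for the Lagrange multipliers, which is where the support vectors come from.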
Non-linear? and origin of machine term?
Do SVMs have to be non-linear? I thought they could be either linear or non-linear. -- Oliver PEREIRA 13:18 Jan 26, 2003 (UTC)
No. See if my new edit makes you happier. By the way, does anyone know why they're called "machines"? Nobody seems to call older machine learning techniques (e.g. the perceptron, feed-forward networks, etc.) "machines". --Ryguasu 00:13 Apr 2, 2003 (UTC)
- According to Vapnik (The Nature of Statistical Learning Theory, p. 133) they are non-linear: "The Support Vector (SV) machine implements the following idea: it maps the input vectors x into a high-dimensional feature space Z through some nonlinear mapping, chosen a priori. In this space, an Optimal separating hyperplane is constructed." To be strict, a so-called linear SVM is an optimal hyperplane (it has support vectors, but is not a support vector machine), although many authors ignore this. --knl 15:57, 28 Aug 2004 (UTC)
- The mapping of the feature vectors into the feature space (possibly one of higher dimension) only makes sense if it is nonlinear. If it were linear, no additional information could be gained by this step, because the feature vectors would keep the same relations, only modified by some scalar factor. So choosing a linear mapping would make no sense and would only add to the computational expense. Hence, any but the most basic support vector machines, which are in fact called linear support vector machines, are nonlinear at a very fundamental level. —The preceding unsigned comment was added by 206.106.168.10 (talk) 01:58, 12 March 2007 (UTC).
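To make the point above concrete, the standard textbook example (not currently in the article; notation chosen for the example) for a two-dimensional input <math>\mathbf{x} = (x_1, x_2)</math> is
<math>
\phi(x_1, x_2) = \left(x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2\right), \qquad
\phi(\mathbf{x}) \cdot \phi(\mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2 .
</math>
A hyperplane in this three-dimensional feature space corresponds to a quadratic, i.e. nonlinear, decision boundary in the original input space, whereas a linear map would leave every linear boundary linear and so add nothing.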
- I learned that the reasoning behind the name support vector machine is the following: the support vectors, that is, the samples lying on the margin hyperplanes, are used to solve for the maximum-margin hyperplane between the two sets of support vectors (i.e. the positive-example support vectors and the negative-example support vectors). This maximum-margin hyperplane (characterized by w and b) is the machine, which returns 1 or -1 (or 0 if the test point/vector lies exactly on the separating hyperplane) when given positive and negative points/vectors. So, machine in this case means "something that returns 1 (true) or -1 (false)" for the purposes of classification. Not sure how to find a reference for this though. --Rajah 16:27, 20 February 2007 (UTC)
- Another useful interpretation of the name is the analogy in Burges' paper "A Tutorial on Support Vector Machines for Pattern Recognition". If you treat the decision surface in feature space as a rigid sheet, and suppose each support vector exerts a force alpha (the value of the dual variable associated with it) perpendicularly against it (say by a rod from the decision sheet to the support vector (in feature space)) then at optimality the forces and torques associated all cancel out. So basically the margin is "supported" by these forces exerted by the "support vectors" and subsequently lies at a position of mechanical stability. This is a nice justification for the name support vector and also, being rather mechanistic, makes machine seem less of a stretch. Svm slave 12:25, 24 February 2007 (UTC)
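For readers who want the equations behind that analogy (roughly as Burges presents it, with <math>\hat{\mathbf{w}} = \mathbf{w}/\|\mathbf{w}\|</math> the unit normal of the sheet, <math>\mathbf{s}_i</math> the support vectors and <math>\alpha_i</math> the corresponding dual variables): each support vector exerts the force <math>\mathbf{F}_i = \alpha_i y_i \hat{\mathbf{w}}</math>, and the optimality conditions of the dual give
<math>
\sum_i \mathbf{F}_i = \Big(\sum_i \alpha_i y_i\Big)\hat{\mathbf{w}} = 0, \qquad
\sum_i \mathbf{s}_i \wedge \mathbf{F}_i = \Big(\sum_i \alpha_i y_i \mathbf{s}_i\Big) \wedge \hat{\mathbf{w}} = \mathbf{w} \wedge \hat{\mathbf{w}} = 0,
</math>
so both the net force and the net torque on the sheet vanish at the solution.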
Fast and lightweight? As compared to what? (I'm sitting in front of a machine that has spent 3 days on a linearly separable dataset that C4.5 takes a few minutes to chomp through.) User:Iwnbap
Don't forget Sequential Minimal Optimization (SMO) [1]
As someone who doesn't know SVM very well yet, I say this article could really use a picture. The article is technically correct, but not enlightening to a newbie. Someone drew me a picture with two classes (+ and -), some separating planes (i.e., a 2-d example) and explained that many cases not near the planes are often thrown away. I thought it helped a lot. If I knew SVM well enough, I would upload and label the picture. Maybe I will if I learn more. dfrankow 03:54, 3 March 2006 (UTC)
- Hope the picture I added makes it more clear AnAj 19:23, 15 June 2006 (UTC)
- I can not see the picture in the article. Is the link wrong? —The preceding unsigned comment was added by 84.216.75.78 (talk • contribs) .
How to add a Loss-Matrix to SVM
Maybe somebody can point out how to incorporate a loss matrix into the SVM framework. I've been looking for this information in various places and I think this would add some great value to the article.
- I'm not quite sure in what sense you refer to the loss matrix. Loss matrices are usually used in Bayesian learning as far as I know, a field I wouldn't put the SVM in. The basic SVM is defined for linearly separable data, so there is no misclassification on the training data. Here are some suggestions for how to "prefer" one class over the other, but I'm not sure whether they are correct: you could move the separating hyperplane somewhat closer toward the less preferred class by adjusting the threshold value b in the classifier, which should be pretty simple; the orientation of the hyperplane would stay the same. I suppose you would get the same result by using different thresholds c and d in the inequality constraints that ensure correct classification of the training data; the ratio of c to d then describes the ratio of the margins on either side of the hyperplane. This way you could incorporate the preference into the soft-margin SVM as well: the slack variables would have to be greater for the preferred class to be misclassified, and since the sum of the slack variables is part of the minimization problem, the "preferred" class should end up with less slack, and the orientation of the hyperplane might indeed change. Anybody willing to check if this is true? --87.123.135.112 14:04, 28 October 2007 (UTC)
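To make the class-weighting idea above explicit, one common way to encode asymmetric misclassification costs (notation chosen here, not something currently in the article) is to give each class its own penalty in the soft-margin objective,
<math>
\min_{\mathbf{w},\, b,\, \xi}\ \tfrac{1}{2}\|\mathbf{w}\|^2
+ C_{+} \sum_{i :\ y_i = +1} \xi_i
+ C_{-} \sum_{i :\ y_i = -1} \xi_i
\quad \text{s.t.} \quad
y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,
</math>
so for binary classification a 2x2 loss matrix essentially reduces to the ratio <math>C_{+}/C_{-}</math>, and the orientation of the hyperplane can indeed change, not just the offset.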
Need more practically USEFUL info.
The article is too academically minded. There should be a paragraph on real-world applications; for example, SVMs and the logistic regression method are being used nowadays to filter e-mail spam messages. How and why? Thanks in advance! —The preceding unsigned comment was added by 195.70.32.136 (talk) 15:54, 18 January 2007 (UTC).
- Hmm, I think it would be nice to have some real-world examples where SVM beats every competitor (should be easy to find). Maybe a good idea would be to create some images with the different cases that arise on real-world data (e.g. linearly separable, linearly separable with soft margin, separable with RBF kernel, and separable with RBF kernel and soft margin...) to clarify why there are different extensions of the original linear formulation.--Cyc 12:25, 5 February 2007 (UTC)
I'm not going to excise comments from a talk page, and I agree that examples are useful. But the comment above is totally wrong... The law of conservation of generalisation means that for every case where SVMs are better at classification than another method there is a corresponding case where they are worse. It's a fundamental principle of machine learning (however counter-intuitive it seems). Outside the world of the theoretical, there are many applications where SVMs are not the best approach (and I use SVMs in industry). It all depends on the data and the application. I'd agree it's worth including the RBF kernel, as in my experience it's probably used more frequently in the real world than linear SVMs, since data is rarely linearly separable in n dimensions. A brief explanation and a link to Kernel methods is probably enough. Maybe I'll do it myself if I get time... 212.84.127.134 09:13, 28 February 2007 (UTC)
- Totally wrong? I didn't claim anything that contradicts the no-free-lunch theorem. Cyc 23:58, 4 April 2007 (UTC)
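For reference, the RBF (Gaussian) kernel mentioned above is usually written
<math>
K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\right),
</math>
with the width <math>\sigma</math> (or equivalently <math>\gamma = 1/(2\sigma^2)</math>) chosen by the user, typically by cross-validation rather than by the SVM optimization itself.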
This should tie into least-squares
SVMs and SVRs are a simple application of least squares, which has been around a lot longer than either of these two. People from many disciplines have never heard of machine learning or SVM/SVRs, but they have heard of least squares. Even the non-linear adjustment is standard for least-squares polynomial fitting. Standard least squares is just SVR, and SVM does a similar trick while trying to fit a hyperplane BETWEEN the data rather than fit a line ON the data.
- I'm not sure this is strictly true (though admittedly while SVMs are my specialty I know relatively little about advanced least-squares algorithms). I agree that least-squares SVMs (LS-SVRs and LS-SVMs as per Suykens' book) implement a regularised least-squares technique, but standard SVRs (and SVM classifiers) implement a cost function which is linear, not quadratic (specifically, SVRs usually use Vapnik's epsilon-insensitive cost). Actually, it might be good to cover least-squares SVMs and maybe mention (briefly, the details might be a bit excessive for now) very general extensions like Smola, Scholkopf and Muller's paper "General Cost Functions for Support Vector Regression". Svm slave 12:34, 24 February 2007 (UTC)
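To spell out the distinction being made (notation chosen for the example): both approaches fit <math>f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b</math>, but ordinary least squares minimizes the squared loss while standard SVR uses Vapnik's epsilon-insensitive loss together with the usual regularizer,
<math>
L_{\text{LS}}\big(y, f(\mathbf{x})\big) = \big(y - f(\mathbf{x})\big)^2, \qquad
L_{\varepsilon}\big(y, f(\mathbf{x})\big) = \max\big(0,\ |y - f(\mathbf{x})| - \varepsilon\big),
</math>
so residuals smaller than <math>\varepsilon</math> cost nothing and larger residuals grow linearly rather than quadratically. That is why standard SVR is not simply least squares; the least-squares SVM variants mentioned above are the ones that recover the squared loss.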
Secure Virtual Machine (SVM) Technology?
Is there a wiki article for this topic? I typed in SVM and got here. I also typed in all the other obvious things and think that this term does not have a wiki article! TonyMartin 19:36, 06 March 2007 (UTC)
As far as I know, there is no such article on the English-language Wikipedia. There is, however, a short article on the German-language Wikipedia. Feel free to add Secure Virtual Machine to the list of requested articles for computer science, computing, and the Internet. (Or even better, start that article yourself if you feel knowledgeable about this topic.) — Tobias Bergemann 08:37, 7 March 2007 (UTC)
Addendum: Your signature appears to be broken, as the wiki link leads to a user page for a non-existing user "TonyMartin" instead of User:Tonymartin234567. — Tobias Bergemann 08:40, 7 March 2007 (UTC)
The claim "they simultaneously minimize the empirical classification error and maximize the geometric margin" is imprecise, to say the least. More like: maximize the margin, hoping that the error is minimized.
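Concretely, what the soft-margin SVM actually minimizes is a trade-off between the two goals rather than both at once (unconstrained form, notation chosen for the example):
<math>
\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_i \max\big(0,\ 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\big),
</math>
where the hinge-loss term is only an upper bound on the number of training errors and the constant C sets the balance, so "maximize the margin while penalizing a surrogate for the error" would be a more precise phrasing.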
minor cleaning up
I did some cleaning up of the page. 1. In the introduction I added the word 'separating' to the sentence "The hyperplane is the hyperplane that maximises the distance between the two parallel hyperplanes.", which I thought needed some clarification. If anyone disagrees with that change, feel free to find a better way of saying it. 2. I removed the "Generalization" section, which seemed unneeded. 3. I replaced a sentence in "Motivation" with a simpler sentence including information that had been in the "Generalization" section. 4. In "Formalization" I removed the distinction between statistics and computer science notation for the number of dimensions (p and n, respectively). Since n is used elsewhere as the number of points in the data sample, it seems better to refer to a p-dimensional input space. Hope I wasn't stepping on anyone's toes by making those changes! —The preceding unsigned comment was added by Digfarenough (talk • contribs) 17:46, 8 April 2007 (UTC).
- That's a fast bot! Clicked "save page", realized I forgot to sign, came right back, and the bot had already added that. Impressive :) digfarenough (talk) 17:47, 8 April 2007 (UTC)
Suggestions for clarity
I think the article loses its clarity, particularly for someone who isn't already well-versed on SVMs, when you get to the section "Non-linear classification". Up to that point, I felt it was pretty good, and in fact, even the first paragraph of the section is fine. But I can't imagine anyone who isn't already highly knowledgeable about SVMs getting anything out of that second paragraph of that section.
First, there seems to have been a sudden notation change without explanation. Above, a training vector is denoted <math>\mathbf{x}_i</math>, but then in the kernel examples suddenly we have <math>\mathbf{x}</math> and <math>\mathbf{x}'</math>. It looks like a vector transpose, which doesn't make any sense, and adds to the confusion. Why not <math>\mathbf{x}_i</math> and <math>\mathbf{x}_j</math>? That would be more consistent with the earlier sections.
Next, the relationship between a kernel and a feature space hasn't been introduced. The earlier part of the article leads the reader to focus on the primal form, so it needs to be explicitly explained that the dot product in the dual form is being replaced by the kernel function <math>k(\mathbf{x}_i, \mathbf{x}_j)</math>, and that the foundation for solving these problems isn't destroyed by this non-linear substitution.
Since it is such a key conceptual part of SVMs, I think the decomposition of the kernel function as a dot product <math>k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)</math> in a higher-dimensional space needs to be explained here. The part about the infinite-dimensional Hilbert space is uninterpretable until this key concept is understood. Also, the idea that the solution can be found without even knowing what <math>\phi</math> is, or what its dimensionality is, is a really hard thing to grasp and a key concept that is not mentioned here. Transforming data through a pre-defined <math>\phi</math> before applying the algorithm is not particularly interesting, but the idea that we're doing this with the kernel function in the dual space, possibly on an implied super-high-dimensional primal space, without a huge performance hit, seems like the defining characteristic of an SVM, yet it is not at all clear from this article.
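A sketch of what that explicit explanation could look like (my notation, to be adapted to the article's): the dual problem
<math>
\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{s.t.} \quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0,
</math>
is obtained from the linear dual simply by replacing every dot product <math>\mathbf{x}_i \cdot \mathbf{x}_j</math> with <math>k(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)</math>, and the classifier becomes <math>f(\mathbf{x}) = \operatorname{sgn}\big(\sum_i \alpha_i y_i\, k(\mathbf{x}_i, \mathbf{x}) + b\big)</math>, so <math>\phi</math> itself never has to be evaluated or even written down.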
Next, the article never tells the reader whether the selection of the non-linear mapping (kernel function) is part of the algorithm or is made before running it. In addition, one is left wondering whether the parameters of the kernel function (such as "d" in the polynomial kernels) are selected in advance or optimized by the algorithm. (The form and the parameters of the kernel are always fixed in advance, right?)
Anyway, I hope these suggestions are useful for cleaning this up. I'm a bit reluctant to make relevant changes to the article until they've been aired here. Thanks.
I do like the absence of convergence theory (e.g., VC dimensions) and details on techniques for solving these. I don't think those belong in the wikipedia introduction to SVMs, so that is a good attribute of the article as it exists.
- These seem like spot-on observations to me. I encourage you to make the relevant changes. 86.139.15.170 (talk) 01:25, 25 February 2008 (UTC)
What does the "Support" in "Support Vector" mean?
I know what a support vector is, but I don't know why it is called a "support vector". Does the vector "support" something? The hyperplane? If so, what does it "support"? Does anyone know? You're welcome to discuss this with me. E-mail: Kankaqi@gmail.com
- The physical interpretation is explained in Burges' tutorial, section 3.6, page 15.
They are called this because they "hold up" the separating plane
—AIMA/2E, pg.751
--Adoniscik (talk) 06:43, 2 January 2008 (UTC)
Thanks all for your help! I have found the answer right in Burges' tutorial.
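In equations, for anyone who wants the concrete statement behind "holding up" the plane (standard notation, not a quote from the article): the solution can be written as
<math>
\mathbf{w} = \sum_{i:\ \alpha_i > 0} \alpha_i y_i \mathbf{x}_i, \qquad
f(\mathbf{x}) = \operatorname{sgn}(\mathbf{w} \cdot \mathbf{x} - b),
</math>
so only the points with <math>\alpha_i > 0</math>, the support vectors, appear in the solution at all; deleting any other training point and retraining leaves the hyperplane unchanged, which is the sense in which the support vectors "support" it.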
History of SVM begins in the 1960s with Tom Cover
Tom Cover developed the theory of support vectors in his 1964 thesis, which was published as "Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition," IEEE Trans. Elec. Comp., EC-14, 326-334. This paper has many deep insights on the role of dimension. Spmeyn 12:23, 20 October 2007 (UTC)
Suggestions
- add explanation why ||w||² is used (done)
- remove scaling as it is a common technique in machine learning (done)
- link primal to dual. Emphasize role of dot product!
--Cyc (talk) 15:17, 27 January 2008 (UTC)
They are really good suggestions! —Preceding unsigned comment added by Visame (talk • contribs) 04:05, 30 January 2008 (UTC)
delete erroneous image
The image showing three hyperplanes suggests that L2 is the optimal one. This is clearly wrong. It should therefore be replaced by a corrected version.
I suggest the following improvements:
- SVG as the image format
- smaller circles, filled with white and black
- one line (L1) that doesn't separate the data points, one that does but with a small margin, and the maximum-margin one
What do you guys think? --Cyc (talk) 09:13, 8 February 2008 (UTC)
Ok, did that. Unfortunately I didn't manage to get my Inkscape SVG file to display properly. —Preceding unsigned comment added by Cyc (talk • contribs) 18:06, 16 February 2008 (UTC)
Vector w in second diagram
I think the vector w in the second diagram is misleading. It is merely a vector that is orthogonal to the separating hyperplane, and not a vector that starts from the w.x - b = 0 hyperplane and ends at the w.x - b = 1 hyperplane. Otherwise, readers will be left wondering why the margin of separation is 2 / ||w|| instead of 2 ||w||. --unkx80 (talk) 15:59, 8 March 2008 (UTC)
Yes, you are right. Will fix this. Cyc (talk) 10:05, 11 March 2008 (UTC) Fixed. Thanks for pointing this out. --Cyc (talk) 12:51, 11 March 2008 (UTC)
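For reference, a short derivation of the 2/||w|| figure in the diagram's notation: the two margin hyperplanes are <math>\mathbf{w} \cdot \mathbf{x} - b = 1</math> and <math>\mathbf{w} \cdot \mathbf{x} - b = -1</math>. Taking a point <math>\mathbf{x}_0</math> on the second one, the nearest point of the first is <math>\mathbf{x}_0 + t\,\mathbf{w}</math> with <math>t = 2/\|\mathbf{w}\|^2</math>, so the distance between the two planes is <math>\|t\,\mathbf{w}\| = 2/\|\mathbf{w}\|</math>: the separation scales with the inverse of <math>\|\mathbf{w}\|</math>, not with <math>\|\mathbf{w}\|</math> itself.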