User talk:Crispy1989


Welcome!

Hello, Crispy1989, and welcome to Wikipedia! Thank you for your contributions. I hope you like the place and decide to stay. Here are some pages that you might find helpful:

I hope you enjoy editing here and being a Wikipedian! Please sign your name on talk pages using four tildes (~~~~); this will automatically produce your name and the date. If you need help, check out Wikipedia:Questions, ask me on my talk page, or ask your question and then place {{helpme}} before the question on your talk page. Again, welcome! Acalamari 16:42, 4 October 2007 (UTC)


My recent RfA

Thank you for supporting my RfA, which unfortunately didn't succeed. The majority of the opposes stated that I needed more experience in the main namespace and Wikipedia namespace, so that is what I will do. I will go for another RfA in two months' time and I hope you will be able to support me then as well. If you have any other comments for me or wish to be notified when I go for another RfA, please leave them on my talk page. If you wish to nominate me for my next RfA, please wait until it has been two months. Thanks again for participating in my RfA! -- Cobi(t|c|b|cn) 01:42, 12 October 2007 (UTC)

Constructive training

Hi, I have put some entries on the Constructive and vandalism pages. One idea that you may (or may not) have considered is to identify reliable editors known not to vandalise. All changes made by these users would be constructive. This would give a large population for training the bot. A large set of reliable vandalism is harder to obtain, but it might be possible to identify some by using the reverse of edits made by these same reliable editors rather than manually identifying the original vandalism. --Brian R Hunter (talk) 23:14, 1 April 2008 (UTC)

The problem with this is that a huge difference in the number of vandalism vs nonvandalism edits could cause unreliable operation of the neural network. It's a good idea, and if we can find a way to reliably identify vandalism en masse as well, it could work. Crispy1989 (talk) 20:32, 10 April 2008 (UTC)
I agree that it would be good to balance the numbers. There are two approaches which combined might give useful results in mass identifying vandalism.
  1. As above, look for undos self-identified as vandalism performed by our reliable editors. Take the reverse of this edit to be vandalism.
  2. Identify editors whose only contribution is vandalism, possibly by following the contribs link of the originator of the vandalism identified in 1 above.
--Brian R Hunter (talk) 20:46, 10 April 2008 (UTC)
I've been considering this too; I think a major problem with using trusted editors as a source for good edits is that it'll bias it hugely. Trusted editors tend to type coherent sentences, have good spelling, not make so many markup errors because they're experienced and because they preview, use edit summaries, and all the other things that tend to be good practice, but the absence of which is not vandalism - we don't want an automatic-newbie-biting-machine ;)
I like the idea of going through reversions, possibly reversions accompanied by suitably strong warnings, to get a vandalism set, though. Pseudomonas(talk) 10:25, 19 May 2008 (UTC)
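The class-imbalance concern raised above is commonly handled by downsampling the larger class before training. A minimal sketch in Python (the function and the example lists are illustrative, not taken from any actual bot code):

```python
import random

def balance_classes(vandalism, constructive, seed=0):
    """Downsample the larger class so both classes have equal size.

    A heavily skewed ratio between the two classes can bias a
    classifier toward always predicting the majority class.
    """
    rng = random.Random(seed)
    n = min(len(vandalism), len(constructive))
    return rng.sample(vandalism, n), rng.sample(constructive, n)

# Hypothetical diff identifiers, just to show the shape of the call.
vandal_diffs = ["diff1", "diff2"]
good_diffs = ["diffA", "diffB", "diffC", "diffD"]
v, g = balance_classes(vandal_diffs, good_diffs)  # both length 2
```

Downsampling throws data away; an alternative under the same assumption would be to weight the minority class more heavily during training instead.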

Finding vandalism

How about contacting someone like Gurch and asking him if huggle can compile a list for you of all the reverts? That would be quite accurate. Tiddly-Tom 13:01, 3 April 2008 (UTC)

A possible idea, but do not blindly follow all reverts, as many are done in error and are themselves reverted. Even manual reverts by trusted editors can be wrong - people and bots make mistakes. Brian R Hunter (talk) 13:36, 3 April 2008 (UTC)

Compiling Dataset

Hello Crispy1989

I read your request for examples of vandalism. This part of the request has me puzzled:

"Part of the preprocessor groups words into categories (based on wiktionary categories) for processing by the neural network. If there are any additional wiktionary categories that you think might be pertinent to vandalism (ie, vandalism will show a marked difference in words from those categories that normal edits in a reasonable number of cases), add it to the bottom of the list:"

Is this saying you are looking for categories of articles that are likely to be vandalized?

Or are you looking for something more specific? Wanderer57 (talk) 17:28, 10 April 2008 (UTC)

No, basically there are three things we want.
  1. Thousands of examples of vandalism (diffs).
  2. Thousands of examples of good edits (diffs).
  3. Possibly more Wiktionary (not Wikipedia) categories of words likely to be used in vandalism.
Thanks for your interest in helping, and I hope I've answered your questions. -- Cobi(t|c|b) 17:54, 10 April 2008 (UTC)
To expand on that, the list of categories isn't only a list where words in the categories make it more likely to be vandalism, but also words that make the edit less likely to be vandalism. The neural network will "figure out" which it is, and by how much. Even if a category is completely arbitrary, it will figure that out as well. Ideally, we would use all categories, but the more categories that we use, the more computing power is needed to process the neural network, and we only have enough computing power for a few select categories. Crispy1989 (talk) 20:32, 10 April 2008 (UTC)
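As an illustration of the word-category inputs described above, the preprocessor's grouping step might look roughly like this. The category names, vocabularies, and length normalisation here are assumptions for the sketch, not the actual preprocessor:

```python
def category_features(words, categories):
    """Map an edit's words onto per-category proportions.

    `categories` maps a category name to a set of member words; the
    output is one input value per category, normalised by edit length
    so that long edits don't dominate short ones.
    """
    total = max(len(words), 1)
    return {name: sum(w.lower() in vocab for w in words) / total
            for name, vocab in categories.items()}

# Illustrative categories; real ones would come from Wiktionary.
cats = {"profanity": {"darn"}, "pronouns": {"i", "you"}}
feats = category_features("You darn page".split(), cats)
# feats["profanity"] and feats["pronouns"] are both 1/3
```

As noted above, the network itself decides whether a high proportion in a given category pushes toward or away from "vandalism"; the categories only need to be potentially informative.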

What about sneaky vandals?

Is it possible to train a neural network to detect vandalism such as this one? If I understand correctly, your approach would fail outright because the vandal just changed a number. Fireice (talk) 02:08, 11 April 2008 (UTC)

I guess it depends on the factors included for consideration by the neural network. In this case, a number of factors combine to aid identification. In themselves they do not indicate vandalism, but the combination of the changed text and these factors may be sufficient to allow a bot to revert with a 'possible vandalism' justification.
  1. The user is not registered.
  2. There is no edit summary.
  3. The user has made few other edits.
  4. Other edits by the user have been identified as vandalism.
  5. Other recent edits by the user have been reverted.
  6. The replacement number format (a low number with two decimals) compared with the previous number.
  7. The number value combined with the words 'film' and 'budget'.
A neural network will typically discover a lot of hidden factors given sufficient examples.
--Brian R Hunter (talk) 22:12, 12 April 2008 (UTC)
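The metadata factors listed above could be encoded as bounded numeric inputs roughly like this. The field names and the caps on the counts are illustrative assumptions, not actual bot code:

```python
def metadata_features(edit):
    """Encode edit metadata as inputs scaled into [0, 1].

    Counts are capped before scaling so one extreme value cannot
    dominate the rest of the input layer.
    """
    return [
        0.0 if edit["registered"] else 1.0,     # anonymous editor?
        0.0 if edit["summary"] else 1.0,        # missing edit summary?
        min(edit["prior_edits"], 50) / 50.0,    # how established the user is
        min(edit["prior_reverts"], 10) / 10.0,  # how often reverted before
    ]

edit = {"registered": False, "summary": "",
        "prior_edits": 25, "prior_reverts": 5}
inputs = metadata_features(edit)  # [1.0, 1.0, 0.5, 0.5]
```

Content-based factors like the number-format change in points 6 and 7 would need their own encoding alongside these.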

Checking

Hey. On User:Crispy1989/Dataset/Vandalism, do you want me to go through the list, as someone has started doing, and check the diffs to make sure they are vandalism? And have a discussion or something about any that some people would say no to? Please let me know what you think. ·Add§hore· Talk/Cont 21:40, 12 April 2008 (UTC)

I think they should be checked; the bot should only work with blatant vandalism. More subtle cases of what is generally thought to be vandalism, like removing deletion tags or discussions on talk pages, can be easily handled by humans. Fireice (talk) 23:57, 13 April 2008 (UTC)
Sure, go through both lists and remove any wrongly added entries, however, please do not remove subtle vandalism. -- Cobi(t|c|b) 07:56, 14 April 2008 (UTC)
What if many people disagree over whether it is vandalism or not? ·Add§hore· Talk/Cont 15:34, 14 April 2008 (UTC)
The Wikipedia vandalism policy and "assume good faith" should be followed. In the cases where it's ambiguous from that standpoint, it shouldn't be listed in either list. Crispy1989 (talk) 04:33, 18 April 2008 (UTC)

Randomness of samples

Hmm, shouldn't the sampling be random? I mean, I started to collect links and realized that if I reported every (obvious) piece of vandalism to the list, I could report every revert of that vandalism as constructive. But that wouldn't be a random sampling of constructive edits; it would only be vandalism reverts. Thinking a bit more on this: it's probably reasonable to assume that neither the constructive nor the vandalism links reported will represent a random sample at all?

I don't really know how this works, but I would suspect that is a problem.

Suggestion: Have the bot collect random edit samples, and let editors wanting to help classify them as vandalism/constructive/uncertain or something. Then you could also have more than one editor check every link. I mean, to prevent a vandal from marking valid links as vandalism, and so on. Then, if you identified someone making bad-faith reports, you could remove their input from the dataset. And if different users made different classifications, that would mark the edit as uncertain...

I hope this makes sense? :) Apis (talk) 04:19, 18 April 2008 (UTC)
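The multi-reviewer idea sketched above amounts to majority voting with an "uncertain" fallback. A minimal sketch (the quorum threshold is an illustrative assumption):

```python
from collections import Counter

def adjudicate(labels, quorum=2):
    """Resolve several reviewers' labels for one edit.

    Returns the majority label if it has at least `quorum` votes and
    strictly beats the runner-up; otherwise returns 'uncertain'.
    """
    counts = Counter(labels).most_common()
    if counts and counts[0][1] >= quorum and \
            (len(counts) == 1 or counts[0][1] > counts[1][1]):
        return counts[0][0]
    return "uncertain"

print(adjudicate(["vandalism", "vandalism", "constructive"]))  # vandalism
print(adjudicate(["vandalism", "constructive"]))               # uncertain
```

Dropping a bad-faith reviewer would then just mean filtering their votes out of `labels` before adjudicating.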


Maybe it didn't sound like one, but it was actually more of a question than a suggestion (that part was just an idea). So is it OK to add a lot of samples that are not in the slightest random? --Apis (talk) 03:13, 20 April 2008 (UTC)

Yes, nonrandom samples are OK, as long as it's not too outrageously nonrandom. Crispy1989 (talk) 03:43, 20 April 2008 (UTC)

Bayesian filtering

Hi Crispy1989 - I'm happy that someone is finally working on an ANN vandalism solution. I've long since thought about undertaking such a project, but the time requirements were just too much for me. Have you considered adding a Naive Bayes classifier also? This is how some spam filters work and it was another idea I had for fighting vandalism. Once you guys collect a large dataset, I might be able to use it to create a Bayes vandalism detector to run in addition to your ANN one. Oh, and I've got experience with neural nets, adaptive algorithms, and Bayes nets/classifiers, so I would be glad to help in any capacity. --CapitalR (talk) 17:09, 23 April 2008 (UTC)

Hi - I was also thinking about undertaking a similar project, not with neural networks probably but with support vector machines, but wasn't sufficiently ambitious. It seems like once a database is compiled, it would be useful to mess around with all kinds of classification techniques, see what works best. I'll go do a little manual adding to the edit lists now. I don't know PHP, but I do have some experience with machine learning, so if there's any way I can help further, let me know. Kalkin (talk) 21:12, 11 May 2008 (UTC)
Before I considered an ANN, I did consider a Bayesian Classifier, but decided against it for many reasons. One is that it only takes content into account. It also doesn't consider relationships between words, only proportions of words, which would leave a lot of error when detecting vandalism. The neural network as it is does have a component which considers content (although less specifically than a Bayesian Classifier). Crispy1989 (talk) 21:43, 11 May 2008 (UTC)
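For comparison, the kind of naive Bayes classifier discussed above can be sketched in a few lines. It scores a document purely from per-class word frequencies with add-one smoothing, which is exactly the "proportions of words, not relationships between words" limitation noted; this is a generic illustration, not CapitalR's planned implementation:

```python
import math
from collections import Counter

class NaiveBayes:
    """Word-frequency classifier with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.counts = {label: Counter() for label in set(labels)}
        self.priors = Counter(labels)
        for words, label in zip(docs, labels):
            self.counts[label].update(words)
        self.vocab = {w for c in self.counts.values() for w in c}
        return self

    def predict(self, words):
        # Pick the class maximising log P(class) + sum log P(word|class).
        best, best_lp = None, float("-inf")
        for label, c in self.counts.items():
            total = sum(c.values()) + len(self.vocab)
            lp = math.log(self.priors[label]) + sum(
                math.log((c[w] + 1) / total) for w in words)
            if lp > best_lp:
                best, best_lp = label, lp
        return best

nb = NaiveBayes().fit([["poop", "lol"], ["fixed", "typo"]],
                      ["vandal", "good"])
print(nb.predict(["poop"]))  # vandal
```

Note that word order is never consulted, so "replaced X with Y" and "replaced Y with X" score identically; that is the gap the neural network's other inputs are meant to cover.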

Spread the word

Most editors are not aware of the rewrite. I was thinking that we should post a message to the village pump looking for ways to spread the word around to experienced editors, perhaps on a notice board. --209.244.31.53 (talk) 20:29, 2 May 2008 (UTC)

It won't work

Seriously, I tried this years ago. Neural networks are great in theory but useless for anything other than trivial classification problems in the real world. You're better off going after specific characteristics as is done with ClueBot at the moment -- Gurchzilla (talk) 08:02, 11 May 2008 (UTC)

Actually, neural networks are a lot more flexible than that. In this specific instance, the neural network performs so well that it's even identifying errors in its own training dataset. The key to getting a neural network to work in instances like this is to effectively convert the input data into a format that will work with the neural network, and that part forms the majority of the code. I'm not sure what you used to convert edits into the neural network's input layer, but if you didn't carefully consider that, it may be the reason why your attempt failed. If you give more specifics on the approach you used, I might be able to shed some light on why it didn't work. But like I said, my approach is already outperforming the existing ClueBot to the extent that it's identifying classification errors in its own training dataset, so it would appear that this is indeed a superior method. Crispy1989 (talk) 21:51, 11 May 2008 (UTC)
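To illustrate what "converting edits into the input layer" means in practice, here is one plausible encoding of a diff into a fixed-size, bounded vector. Every feature and cap here is an assumption for the sketch; the actual preprocessor is not shown on this page:

```python
def diff_to_inputs(old_text, new_text):
    """Encode an edit as a constant-length vector of values in [0, 1].

    A neural network needs a fixed number of bounded inputs, so raw
    counts are capped and ratios are clamped before scaling.
    """
    old_words = old_text.lower().split()
    new_words = new_text.lower().split()
    added = [w for w in new_words if w not in old_words]
    removed = [w for w in old_words if w not in new_words]
    size_ratio = len(new_words) / max(len(old_words), 1)
    caps_frac = sum(c.isupper() for c in new_text) / max(len(new_text), 1)
    return [
        min(len(added), 100) / 100.0,    # words introduced
        min(len(removed), 100) / 100.0,  # words deleted
        min(size_ratio, 10.0) / 10.0,    # growth/shrinkage of the text
        caps_frac,                       # SHOUTING tends to score high
    ]

inputs = diff_to_inputs("the budget was 130 million",
                        "the budget was 1.37 million")
```

The design question Crispy1989 raises is exactly which such features to compute; a poor choice here starves the network of signal no matter how well it is trained.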

Features

Beyond wiktionary categories, what else is likely to get into the feature vectors for a given edit? I ask mainly from curiosity - I'd be interested to see, for instance, whether using POS tagging + WordNet synset groupings gives any interesting results, since one could get useful category information from previously unseen words, and potentially use it to build custom features like "replacement with antonym". I do realise that this isn't a collaborative effort, so you may save yourself work by not letting the likes of me interject... Pseudomonas(talk) 10:00, 19 May 2008 (UTC)

It's definitely a collaborative effort, and thanks for the suggestions; I'll see what I can do. The problem with adding too many inputs is that the precision limitations on floating-point values may limit the effect of the more important values when in the input layer alongside many less important ones. The key is finding which values are the most important. Currently, the limited set of Wiktionary categories was hand-picked because I thought those categories might be important. After I figure out the accuracy of the current network (after there's enough manually picked training data), I can experiment with what might make it more accurate. Thanks. Crispy1989 (talk) 19:54, 26 May 2008 (UTC)
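The floating-point concern above is essentially about keeping every input on a comparable scale so small but important signals are not numerically swamped. A simple min-max clamp illustrates the idea (the bounds are illustrative):

```python
def scale_inputs(raw, bounds):
    """Clamp each raw input into its (lo, hi) range, then map to [0, 1].

    With all inputs in [0, 1], no single large-valued feature can
    dominate the weighted sums in the network's first layer.
    """
    scaled = []
    for x, (lo, hi) in zip(raw, bounds):
        x = min(max(x, lo), hi)
        scaled.append((x - lo) / (hi - lo))
    return scaled

print(scale_inputs([5, 2000], [(0, 10), (0, 1000)]))  # [0.5, 1.0]
```

Even with scaling, each extra input dilutes the others slightly, which is why pruning to the most informative features still matters.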

How are we doing?

Well, quite a few of us have been adding our contributions for a while now. It would be nice to know how useful they've been. Are you able, on the basis of what's been submitted so far, to sharpen up your requirements? What do you need more of?

One thing that occurs to me is that it isn't really easy to visualise whether a diff will add any value or not - I tend to add ones which the current bots missed but which I hope a really, really good bot would catch. Might a better way be to publish a list of (potentially, thousands of) edits which the current network rates as 'borderline', and have a team of volunteers adjudicate? I can see some disadvantages with that, but wouldn't it make better use of the available brainpower? Philip Trueman (talk) 11:36, 22 May 2008 (UTC)

Good idea - I'll work on a way of doing that. My eventual plan is to train the base neural network with the output of the current ClueBot (still running the old code) to get a sort-of average baseline training. Then, I'll retrain it with all of the manually selected edits, to fine tune it. After the training is finished, I'll run the entire training set through the network again, and group the incorrectly classified or borderline edits into a category for further review, and if remaining, more intensive training. At that point, I'll run a set of random (untrained) edits through the bot and post them here for manual review for correctness. Thanks for the suggestion. Crispy1989 (talk) 19:51, 26 May 2008 (UTC)
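The "group incorrectly classified or borderline edits for further review" step described above can be sketched as a simple partition over the trained network's own scores. The thresholds here are illustrative assumptions:

```python
def partition_by_confidence(scores, labels, low=0.35, high=0.65):
    """Split training edits using the trained network's output scores.

    `scores` are network outputs in [0, 1] (1 = vandalism); `labels`
    are the training labels (1 = vandalism). Edits scored in the
    borderline band, or on the wrong side of 0.5 from their label,
    go to a review pile for human checking.
    """
    confident, review = [], []
    for i, (s, y) in enumerate(zip(scores, labels)):
        wrong = (s >= 0.5) != (y == 1)
        if wrong or low <= s <= high:
            review.append(i)
        else:
            confident.append(i)
    return confident, review

conf, rev = partition_by_confidence([0.9, 0.5, 0.1, 0.8], [1, 1, 1, 0])
# conf == [0]; rev == [1, 2, 3]
```

The review pile is then exactly the list of edits worth publishing for volunteer adjudication, since those are where either the labels or the network are most likely wrong.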
I'm also looking at building a tool for curating a corpus of unclassified edits for this - I'll let you know how I get on. Pseudomonas(talk) 13:13, 27 May 2008 (UTC)

General comment

Edits that are suicide threats should probably not be reverted by bot, even if they are/contain vandalism (although from the definition of vandalism, they aren't vandalism). 69.140.152.55 (talk) 21:23, 29 May 2008 (UTC)

A user's edits to their own user page should mostly not count as good or bad unless they are a revert of trolling and bad behavior, vandalism of a sock/ban tag, et cetera. 209.244.31.53 (talk) 04:29, 31 May 2008 (UTC)


You have more messages

User talk:Crispy1989/Dataset

All subpages have talk pages. Maybe it was a bad idea to start it, but that page may be used for dataset discussion, and this main talk page for neural net/engine discussion. 209.244.31.53 (talk) 04:29, 31 May 2008 (UTC)

A few more questions

What aspects will the engine consider, such as the edit window contents, user rights, edit summary, referenced websites, page type or a special page (something like WP:AIV needs to be treated differently than other pages, if it is opted in), and user patterns? Should edit wars or trivial good-faith edits be on the constructive list? What about accumulated vandalism by one user, where each edit by itself is not enough to be reverted? Diffs for nominating a page for deletion, when the page is afterwards deleted? --209.244.31.53 (talk) 19:09, 14 June 2008 (UTC)