User:Franamax/wpW5
wpW5 - Wikipedia - who, what, when, where, why
The wpW5 application is the centrepiece of the wpW5tools project. wpW5 will eventually answer everything you want to know about an article: how did it get this way? Who said the world is flat, when did they say that, and did they discuss it first? All those questions will be answered. Also, how did this article get to the shape it's in: when were sections added, who made major contributions, is it vandalized/reverted a lot, is it constantly worked on or changed in bursts of activity?
This is an ambitious effort - for now wpW5 is at the "advanced proof-of-concept" stage. wpW5 Version 0.3 (PxC) lets me load the article diff history, retrieve individual prior pages from the history, and perform various searches for text that is or was in the article, in order to find out exactly when that piece of text first appeared, when it was last seen there, who put the text in and who took it out again.
If you are interested in trying out this software, send me an e-mail to get a copy. It only runs on Windows and is an .exe file that you install and run on your own system.
wpW5 v0.3 is described a little more completely here:
wpW5 v0.3 (PxC)
(PxC means proof-of-concept or pretty-xxxing-crude)
Operation:
1. Enter the article name and click "Get Article" - this loads the diff history (just like "History" does - 5000 changes for now)
2. Click on "Get Diffs" if you want to pre-load all the diff pages - then go change your spark plugs, it may take a while! Or you can use Quick Search or Full Search and let it load the pages as it goes.
3. Pick your search options, insert the text you want to find, click on either Quick or Full search, sit back and watch the fun.
Search Options:
1. The most basic search option is "Verify" - this makes sure that your search text is actually in either the current article or some version you specify and won't go on if it's not there. Very useful to catch typos - otherwise you will sit and watch as all 7000 versions are fetched and checked.
2. Next you have to pick either the right or left-hand search box - "Edit text" or "Browser text". Edit text is for text you are pasting in from the wiki "edit this page" function: it is the actual wiki markup text and wpW5 WILL find it, guaranteed. Browser text is for when you are looking at the actual display in your browser and want to search for something. This is a lot more complicated - in the browser a wiki-link is just blue text but in wiki markup it really looks like [[Page_Name | this page ]]; and when you see italics what you really want to search for is ''italics''. But that's no problem - use the box on the left and ask for a translation! I can't handle everything of course, so there is also:
2a. "Find Parts" - this is one of the semi-cool things :) Just paste in the browser text you see and click on Find Parts, wpW5 will break it all down, look for the pieces of text, find the most compact version in the edit text (because "the" could be in there a few different places), show it to you, then search for whatever you pick. AND verify it just like option 1. It's not foolproof, but I am a fool too, I made it to survive myself!
3. Use Cache - sticking this in here, there is an option to locally cache files. If you are doing a one-time search for something, turn it off. If you are going to be re-visiting this article, switch it on. The diff history gets saved to disk so when you come back, wpW5 can just splice in the most recent 20 (or whatever) changes - be nice to the servers! Also the individual diff pages get cached locally as they are examined. Net result is that everything just gets faster and faster as you go. The other result is that you will eventually end up with all of Wikipedia and all its history sitting on your hard drive! We'll manage that better in the next version.
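The "Find Parts" idea from option 2a can be illustrated with a short Python sketch. This is not wpW5's actual code, just one plausible way to do it: the name find_parts, the whitespace splitting, and the plain case-sensitive substring matching are all assumptions made for the example. It looks for the pieces of browser text in order inside the edit text and keeps the most compact stretch that contains them all.

```python
def find_parts(browser_text, edit_text):
    """Return the shortest stretch of edit_text that contains every
    whitespace-separated piece of browser_text, in order, or None.

    Hypothetical helper for illustration only; not taken from wpW5.
    """
    parts = browser_text.split()
    if not parts:
        return None

    best = None  # (start, end) of the most compact match so far
    start = edit_text.find(parts[0])
    while start != -1:
        # From this occurrence of the first piece, find each later
        # piece at its earliest position after the previous one.
        pos = start + len(parts[0])
        for part in parts[1:]:
            hit = edit_text.find(part, pos)
            if hit == -1:
                pos = -1
                break
            pos = hit + len(part)
        if pos == -1:
            break  # a later starting point cannot succeed either
        if best is None or pos - start < best[1] - best[0]:
            best = (start, pos)
        # Try the next occurrence of the first piece ("the" can be
        # in there a few different places).
        start = edit_text.find(parts[0], start + 1)

    return None if best is None else edit_text[best[0]:best[1]]


markup = "In the past, the [[Earth|world]] is ''flat'' they said."
print(find_parts("the world is flat", markup))
```

The sketch skips over the link and italics markup between the pieces, which is the point of the exercise. The result is a candidate to show the user, and it should still be verified against the article the same way as option 1.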
Search Methods:
1. Full Search - this is kind of boring, it just fetches every page from the article history and looks for the text you are searching for. However it is definitely accurate and will find what you're looking for.
2. Quick Search - this is actually semi-ultra-cool, works well when some piece of text has been in the article for a continuous defined period, works fantastic when that text is in the current article. Quick search is a binary search algorithm that is vandal-tolerant, it can find the first and last occurrence of anything in a matter of seconds - less than a minute is seconds, right?? The hardest test I've found for it takes almost five minutes, 20 occurrences in a 4900-diff x 100K article that existed 3 years ago. Normally you are looking at 20-30 seconds to find exactly when and who, i.e. how it got there. Of course, if "the world is flat" was in the article for a year, then gone for a year, back for a year, gone for the last two months, binary search ain't gonna cut it - but full search will kill it for sure! Next version will flail around the binary search points to look for gaps - wpW5 knows all!
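The two search methods can be compared in a small Python sketch. How wpW5 actually makes its binary search vandal-tolerant is not described here, so the approach below is an assumption: when a probed revision lacks the text, it peeks at a few neighbouring revisions and treats a short gap with the text on both sides as vandalism rather than a real absence. The names full_search, robust_present, quick_search_first and the tolerance parameter are hypothetical, and the revisions are a plain in-memory list (oldest first) instead of pages fetched from the servers.

```python
def full_search(revs, text):
    """Boring but accurate: check every revision, oldest first."""
    return [i for i, rev in enumerate(revs) if text in rev]


def robust_present(revs, i, text, tolerance):
    """True if the text is in revision i, or is missing only briefly:
    found within `tolerance` revisions both before and after i."""
    if text in revs[i]:
        return True
    before = any(text in r for r in revs[max(0, i - tolerance):i])
    after = any(text in r for r in revs[i + 1:i + 1 + tolerance])
    return before and after


def quick_search_first(revs, text, tolerance=2):
    """Binary search for the revision where the text first appeared.

    Assumes the text is in the newest revision and has been there
    continuously since it was added, apart from gaps of at most
    `tolerance` revisions. Returns an index into revs, or None.
    """
    if not revs or text not in revs[-1]:
        return None
    lo, hi = 0, len(revs) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if robust_present(revs, mid, text, tolerance):
            hi = mid      # first appearance is at mid or earlier
        else:
            lo = mid + 1  # first appearance is after mid
    return lo


history = [
    "stub",
    "stub, expanded",
    "the world is flat",
    "PAGE BLANKED BY VANDAL",
    "the world is flat",
    "the world is flat, they said",
    "the world is flat, they said [citation needed]",
]
print(full_search(history, "world is flat"))
print(quick_search_first(history, "world is flat"))
print(quick_search_first(history, "world is flat", tolerance=0))
```

With tolerance=0 the sketch behaves like a naive binary search and is fooled by the blanked revision, reporting the revert instead of the original insertion. It also shows why a year-long gap defeats the quick method: no small tolerance can bridge it, so the full scan is the fallback.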