Wikipedia:Bots/Requests for approval/BJBot
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
BJBot
Operator: Bjweeks
Automatic or Manually Assisted: Automatic
Programming Language(s): Python (pywikipedia)
Function Summary: Add orfud to orphaned fair use images.
Edit period(s) (e.g. Continuous, daily, one time run): daily
Edit rate requested: 1 edit per minute
Already has a bot flag (Y/N): N
Function Details: The bot will go through a list provided by the bot operator (for now). It will then check that A) the link is in fact an image, B) no other pages link to it (i.e. it is orphaned), and C) it is tagged with a fair use template. The idea for the bot came from Mecu, so any questions on what it should be doing or why will most likely be answered by him.
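For illustration only (this is not BJBot's actual source), a minimal sketch of such a check-and-tag pass under the old pywikipedia (compat) framework might look like the following. The worklist file name, the fair use template pattern, and the ImagePage.usingPages() call are assumptions about that framework rather than confirmed details:

    # Rough sketch only -- assumes the old pywikipedia (compat) framework.
    # ImagePage.usingPages(), the template pattern and the worklist file
    # name are illustrative assumptions, not BJBot's actual code.
    import re
    import wikipedia  # pywikipedia core module

    FAIR_USE_RE = re.compile(r'\{\{\s*(Fairuse|Fair use|Logo|Albumcover)',
                             re.IGNORECASE)  # illustrative subset of fair use tags

    def tag_if_orphaned_fair_use(site, title):
        """Tag one image with {{subst:orfud}} if it is an orphaned fair use image."""
        if not title.startswith('Image:'):        # A) must be an image page
            return False
        page = wikipedia.ImagePage(site, title)
        if not page.exists():
            return False
        text = page.get()
        if not FAIR_USE_RE.search(text):          # C) must carry a fair use tag
            return False
        if list(page.usingPages()):               # B) must still be orphaned
            return False
        page.put(u'{{subst:orfud}}\n' + text,
                 comment=u'Tagging orphaned fair use image (CSD I5)')
        return True

    if __name__ == '__main__':
        site = wikipedia.getSite()
        for line in open('worklist.txt'):         # list supplied by the operator
            tag_if_orphaned_fair_use(site, line.strip())
        wikipedia.stopme()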
Discussion
- Sounds OK: assuming this is trialed and approved, will you keep BRFA posted about changes to the "for now" basis? Alai 07:04, 21 January 2007 (UTC)
- Can I add a MECUBot proposal to this? It would be a clone of BJBot. The theory is that there are over 100,000 orphaned images; going through them all at 1 a minute will take 69+ days. By having clones of this bot, the time needed to go through that list would be reduced, so there could be quite a few clone bots working on it. The plan is to use a downloaded/created list of all the orphaned images, run through that list, and once done, grab another list, compare the entries, and run through whatever is new, repeating this several times. We could then switch over to a more constant cycling mode, with 1-day intervals on the list perhaps. Some administration and coordination between the clone bots would be needed to accomplish this. --MECU≈talk 15:04, 21 January 2007 (UTC)
- This seems to raise several issues beyond the original proposal. Firstly, what's the likely number of orphaned fair use images? The main Cat:fair use images category currently contains < 4200 images (there are more lurking in subcats, but none of those I've noticed seem especially huge). Fetching every orphaned image page seems a very inefficient way of doing this: why not do so only with those in both the orphaned list and the fair use categories? (Either from a database dump -- if we ever see one of those again -- or from category listing pages from the live wiki.) Secondly, why such repeated and frequent "cyclings"? While there might be something of a backlog, what's the likely level of throughput on this? Thirdly, why would it be better to have multiple bots doing the same thing, and thus causing the co-ordination issues you mention, as against just having a single bot working at twice (say) the rate? The server hit is certainly no better, and is possibly worse, if they end up duplicating large numbers of page fetches. In short, I'd like more assurance on the resource implications of how you plan on tackling this. Alai 02:34, 22 January 2007 (UTC)
- I would say that 40% of the orphaned images are fair use. I have personally gone through 1500 images and can give that estimate. So that's 40,000 images, and most of them are likely among the more recent uploads (i.e., images from 2006 are more likely to be fair use orphans than those from 2003, 2004 or 2005). The main category is for use of the {{fairuse}} tag (which is deprecated), so most images should be in the subcategories. I went through the first 20 pages of the {{logo}} subcat (4000 images) and was only on the B's. Fetching every orphaned image seems the better route, as fetching the orphan list and comparing it to the fair use list doesn't seem useful: the bot can determine on its own what is a fair use image, so cross-checking against the fair use category as well seems redundant. The bot can also determine whether the image is still orphaned by the time it checks it; since it is working from an out-of-date list, it must do this anyway. A database dump is out of date (it's currently 11-30-2006 for en.wiki). I have sought permission for a toolserver account to use a MySQL database, and the en.wiki replica there is only ~13 days out of date. But just generating the lists manually may be better (and less time consuming: I've been running someone else's toolserver script to produce all en.wiki orphaned images for 8.5 hours and it's given me nothing, so producing the image list manually would be more efficient). As for multiple bots, I agree it may not be the best method. Increasing the speed of a single bot would accomplish the same purpose and reduce the overhead of maintaining this project. So, I'd like to switch from 1/minute to 2 images processed per minute in "peak" times of en.wiki and a higher rate (TBD) during off-periods, which would be self-controlled by the bot. If by "cyclings" you mean download a list, process it (the 100,000 images), download a new list and re-process: it's because the orphaned list goes by when the image was uploaded to Wikipedia. So if an image was uploaded in 2004 and is orphaned tomorrow, the orphaned list will show it in upload-date order alongside the other 2004 uploads (somewhere in the low thousands, 1000-10000). "Starting" from the 100,000 end of the list will miss these images. Also, if we get 30,000 of the orphaned fair use images deleted in the first round, the orphaned list would decrease to 70,000, so new images could appear anywhere in between. Bjweeks had the idea to monitor new fair use uploads to see whether they were still unused 7 days after uploading, but even this will miss images that were uploaded years ago, get orphaned later, and appear on the orphaned list. --MECU≈talk 03:50, 22 January 2007 (UTC)
- I'm a bit skeptical of your 40,000 estimate: if your 1500 images are relatively recent, then for the very reason you make reference to, they're likely to have a higher rate than the main body. I'd admittedly missed some of the largest subcats, as they're in categories that don't use the term "fair use", just to add to the confusion (Cat:album covers is bigger still). Looks like there were something on the order of 300,000 fair use images in toto last November. But my whole point is we shouldn't need to deal in guesstimates: you should be able to calculate how many candidates there are in advance. Even the last db dump would give you a pretty good idea of that, and as I say, you can fetch the fair use categories from the live wiki to get a more up-to-date list of candidates. (Admittedly it's likely to take over 1,000 category listing page fetches for all the "fair use" categories, but that seems to me a clear winner over 100,000 image description page fetches being done repeatedly and speculatively. 100,000 pages is an appreciable fraction of the whole db, and seems to me to be sailing pretty close to the wind on the "web spider" clause of the bot policy.) I don't see how you can characterise that as "not useful": it's precisely what you've said you want; the only question is whether you get it by intersection or by exhaustive search of a single list. I'm not suggesting you don't fetch the image description pages at all (which wouldn't make a lot of sense, given that you have to fetch a page before you re-write it), just that you do so only for a smaller list of (very) likely candidates, double-checking rather than testing on spec. (A rough sketch of this intersection approach is given after this comment.)
- "Cycling" was your term: you tell me what you mean by it. If you fetch 100,000 pages, tag 40,000 and delete 30,000 (again, I have concerns about the accuracy of the estimate), then simply re-running the same bot on the remaining 70,000 is going to be extremely resource-extravagant: you'll find a handful that have had fair use tags added, and do nothing with the vast majority (since the vast majority will not have changed their copyright-status tag since the last time). And even that handful you could have found much more readily by monitoring changes to the fair use categories. As about 87,000 orphaned images were uploaded before the last database dump, and as those are what you're basing the claim of the size of the problem on, at the very least I'd like to see that tackled with a targeted, non-speculative one-off run before any open-ended approval is considered for any process that's going to be a long-running mini-spider. Alai 06:21, 22 January 2007 (UTC)
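For illustration only (not code from this discussion), and assuming the orphan listing and the fair use category member titles have already been saved to flat files with one title per line, the intersection Alai describes could be built along these lines; the file names are hypothetical:

    # Sketch of the "intersect two lists" idea; file names are placeholders.
    def load_titles(path):
        return set(line.strip() for line in open(path) if line.strip())

    orphans = load_titles('orphaned_images.txt')      # e.g. from Special:Unusedimages
    fair_use = load_titles('fair_use_members.txt')    # e.g. from category listings

    candidates = orphans & fair_use                   # only these pages get fetched
    print '%d candidate pages instead of %d orphan pages' % (len(candidates), len(orphans))

    open('worklist.txt', 'w').write('\n'.join(sorted(candidates)))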
- I'm tending to agree with the above; it shouldn't be very hard to get a list of orphans, a list of fair use images and a list of images to be deleted, and cross-reference them (toolserver or database dump). What I'm thinking is a script on the toolserver that queries the DB and produces a current list of images to be checked (once); the bot would then request part of the list and begin working. The toolserver script then removes the chunk that the bot requested, until the list is exhausted. The toolserver script would be run x (replication lag) days after the first list was totally finished. Wash, rinse, repeat. BJTalk 06:47, 22 January 2007 (UTC)
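A very rough sketch, for illustration only, of the chunk hand-off BJ describes: the list-holding side keeps a flat file of pending titles and hands out (and removes) a slice per request. The file name and chunk size are assumptions:

    CHUNK_SIZE = 500  # arbitrary illustrative chunk size

    def claim_chunk(list_path='pending_images.txt'):
        """Return the next chunk of titles and remove it from the pending list."""
        titles = [t for t in open(list_path).read().splitlines() if t]
        chunk, remainder = titles[:CHUNK_SIZE], titles[CHUNK_SIZE:]
        open(list_path, 'w').write('\n'.join(remainder))   # mark the chunk as taken
        return chunk

Once the pending file is empty, the list-building query would be re-run after the replication lag has passed, as described.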
- That sounds sensible to me. For the orphans alone, I don't see any need for the toolserver: the special page is actually more up-to-date (completely live?) (you lucky things, look at how limited the uncategorised pages special is...). However, it would be a convenient way of getting an integrated list of tagged-as-fair-use-and-orphaned, which otherwise you'd need a middling-large number of category listing queries to the wiki to get. Alai 07:10, 22 January 2007 (UTC)
- You're assuming that during cycling we'll double-check the previous 70,000 images. That doesn't need to be done, since we "just" checked them, and I agree that they likely won't change license status. By keeping the list of images we know we've checked, getting a new list, and comparing the two, we can eliminate the images we've already checked. Thus, if a new image appears at #35958, we'll see it as new and check it.
- I still don't see how grabbing all 500,000 (guess) fair use images in whatever category they are in will help speed up the process. Pulling the 100,000 image pages will give us 100% accuracy on what images are fair use (ignoring the bot's own error rate). Comparing the two lists and going for "likely" images would have some kind of error rate (above the error rate of the bot). Processing and comparing the two lists would also slow us down, as we would need to repeat this for each cycle: if a new image shows up, we have to grab all the categories again and then re-compare that one image to the entire 500,000 (guess) list. I think I see what you're driving at: you want to know better, before we begin, how many of these orphans are fair use. Comparing the lists would give us knowledge that there are 47,839 (example) images that are orphaned and fair use, and we'd have cut the bot's running time in half.
- There are 452 fair use image categories; see User:Mecu/AllFairUseImageCats. I used [1]; there are some duplicates, as some categories sit under several locations, and the Royal Air Force structure loops and is redundant. I believe I cleaned all the duplicates out, but I may have missed a few. --MECU≈talk 15:22, 22 January 2007 (UTC)
- The special page is completely live, which is the problem. If a new orphan shows up as #1, every other image drops down by one number; this is even more likely when images get deleted. We don't really need (or want) this accuracy. You can only get 5000 images listed at one time, so it would take 20+ page fetches to get the list, with possible (though few) duplicates. This method might be faster than using a toolserver script (which I gave up on after 8.5 hours of processing! There is a warning that en.wiki is too large for such tools). --MECU≈talk 13:56, 22 January 2007 (UTC)
- I don't think I'm making "assumptions" about the nature of the cycling, so much as making such inferences as I'm able on the basis of your incomplete description thereof. If you can make this more precise, then so much the better.
- I'm not talking about loading all the fair use pages (which would indeed be horrific): I'm talking about obtaining a list of the contents of (i.e., page titles in) the fair use categories, which is about 200 times fewer page fetches even if you do it from the live wiki, and none at all if you do it via the other methods I've suggested. Yes, this could mean that fewer page reads are required by a factor of two... or by any other factor, depending on how many images are actually in said intersection: it could be almost any number.
- Number of FU categories: if you're counting all subcats of the main one, it was just less than 400 in the November dump, so your current total is pretty plausible.
- Why is the special page being too live a problem? You had the opposite objection to the db dump! If you're going to keep a log of all pages previously inspected, it's just a matter of looking for new entries (in either the fair use cats, or the orphaned list).
- Given especially your very valid point about possible backlogging at IFD, wouldn't it be sensible to defer discussion of a continuously running bot until such time as things are clearer with regard to the (hidden) backlog? If the backlog is anything like as large as you suggest, most of it will show up in the last database dump, and by the time it's tagged and cleared we ought to have a fresh one, which would further quantify the rate at which they're accreting. Alai 06:38, 23 January 2007 (UTC)
- BJ brought up another issue: the bot will create a HUGE backlog of images in the ORFU categories for admins to delete. Processing 2 images/min at a "success" rate of 40% (which may not be accurate if we cross-check the lists), that's 2,880 images processed per day and 1,152 images added to the ORFU category each day. At the 40,000 images I guessed were ORFU, that's 34+ straight days of a category of 1,000+ images to be deleted. If we cross-check the lists, a likely 100% rate could put 2,880 images into the categories each day for 13+ days. The delay will likely be months before this hump gets finished off. I personally don't have a problem with the categories sitting around with the images waiting for deletion, but some may. The good thing is that these images are at least tagged and known, and will eventually get deleted. Perhaps we should warn the CSD admins, or create a separate category for images tagged ORFU by the bot so that the "regular" ORFU categories don't get clogged, and the bot's category can be worked through on the side from CSD; it would certainly still be applicable under CSD, just held off to the side to prevent clogging. --MECU≈talk 13:56, 22 January 2007 (UTC)
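For reference, the arithmetic behind those backlog figures (added here purely for illustration; the 40% hit rate and the 40,000 total are MECU's estimates from above):

    rate_per_min = 2
    processed_per_day = rate_per_min * 60 * 24       # 2880 images looked at per day
    tagged_per_day = int(processed_per_day * 0.40)   # ~1152 tagged at a 40% hit rate
    backlog = 40000                                  # estimated orphaned fair use images
    print backlog / float(tagged_per_day)            # ~34.7 days of tagging at 40%
    print backlog / float(processed_per_day)         # ~13.9 days if every image checked is a hit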
- FYI, the bot is code complete except for the way it acquires the list to check. If I should upload the source somewhere, please advise, and if so, whether I should comment it first. BJTalk 07:28, 22 January 2007 (UTC)
- On that note, would I be allowed to trial the bot on 10 or so images to test the non-list getting parts of the code? BJTalk 09:49, 22 January 2007 (UTC)
- I'm inclined to note that it appears to be "accepted practice" to make a small number of test edits prior to formal approval of a trial per se, so unofficially, I'd just go ahead and make those. Though easy on the "or so": if you want to make more than 10, it might be better to wait on the latter, from the BAG. (BTW, I have no objection to such a trial, where you're working from some relatively "clean" list of predominantly true positives, and working on the order of small hundreds of images.) On the backlog: I'm not too worried, as moving a hidden backlog to a visible one is generally a good thing. But it would be both prudent and courteous to give IFD "regulars" a heads-up about this discussion, and the possible consequences, yes. Alai 06:03, 23 January 2007 (UTC)
- Test finished with no errors, had to fix two bugs but it all worked. BJTalk 07:22, 23 January 2007 (UTC)
- Great! These images won't go through WP:IFD, though. They will go through WP:CSD (they qualify under I5) via the CAT:ORFU categories, unless we do something special. There currently is no backlog, and the existing days have between 100-300 images (with the higher ones probably being days when I had been manually doing the work this bot will do).
- Also, another note to say that the plan is now to just use the dump that should occur today (and then weekly) and process the lists offline (on BJ's box) to determine the working list. This method both reduces load on the toolserver and increases the likelihood of finding genuine candidates to seek out. --MECU≈talk 14:23, 23 January 2007 (UTC)
- Nice job, BJ. I see no reason a larger trial shouldn't be approved. My mistake, Mecu; perhaps at WT:CSD, then, or wherever it is the image speedy-deleters hang out. As much as anything to satisfy my curiosity, I've run a query on the November db dump to try to identify what are (or at least were) candidates. The result was a suspiciously exact 19000 in categories in the "fair use" tree, of which something over 4000 were already tagged as orphans. I've summarised the results here, and uploaded a list of the untagged images here. Obviously things have moved on from there, with about 10,000 more orphaned images now existing, and several thousand deletions (hence the copious redlinks), but if it's of any use... Alai 01:35, 24 January 2007 (UTC)
- Two things. First, I would like to run the bot on that list, unless you feel it is too outdated, as it would give a large head start on the backlog. Also, could you post the queries you used? (I must admit my SQL knowledge is almost nonexistent.) BJTalk 01:50, 24 January 2007 (UTC)
- I'd support running the 'bot on that list; it's possible that it's now uselessly outdated, but my guess would be that it won't be too bad. (Assuming it gets the OK, maybe you could let me know what the false-positive rates of various kinds actually are, whether on the whole list or some portion thereof.) The SQL queries are fairly straightforward, but I have to reconstruct parts of them, and there's currently a somewhat ugly dependency on a table I build anyway for other purposes, which would be rather painful to run just for this. I'll get back to you on this shortly. Alai 04:03, 24 January 2007 (UTC)
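The queries themselves aren't reproduced in this archive. Purely as an illustration, a query with the same intent against the standard MediaWiki tables (page, categorylinks, imagelinks) might look roughly like the following, run from Python with MySQLdb; the connection details and category names are placeholders, and this is not Alai's actual query:

    import os
    import MySQLdb

    # Hypothetical connection; a real toolserver setup would name the host
    # explicitly and read credentials from the user's .my.cnf.
    conn = MySQLdb.connect(db='enwiki_p',
                           read_default_file=os.path.expanduser('~/.my.cnf'))
    cur = conn.cursor()

    fair_use_cats = ['Fair_use_images', 'Album_covers', 'Logos']  # illustrative subset

    query = """
        SELECT DISTINCT p.page_title
        FROM page p
        JOIN categorylinks cl ON cl.cl_from = p.page_id
        LEFT JOIN imagelinks il ON il.il_to = p.page_title
        WHERE p.page_namespace = 6        -- Image: namespace
          AND il.il_to IS NULL            -- nothing uses the image (orphaned)
          AND cl.cl_to = %s               -- one fair use category at a time
    """
    for cat in fair_use_cats:
        cur.execute(query, (cat,))
        for (title,) in cur.fetchall():
            print 'Image:' + title
    conn.close()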
- I've made a request at [2] for comments on the potential large backlog created by the bot. --MECU≈talk 02:08, 24 January 2007 (UTC)
- It appears User:Roomba Bot, which does this, may be resurrected soon (User talk:Gmaxwell#Roomba Bot), though that bot doesn't notify users. --MECU≈talk 17:11, 22 January 2007 (UTC)
OK, just to sum things up and be perfectly clear before approval: this bot will be run with no clones, will get the list of images to process from User:Alai/Fairuse-orphans-untagged for now, and later database dumps or the toolserver, and will place the tagged images into the regular "orphaned fair use" categories. Correct? —Mets501 (talk) 18:11, 25 January 2007 (UTC)
- All correct. :) BJTalk 18:13, 25 January 2007 (UTC)
Approved for trial. Please tag/warn uploader of 100 images or so from User:Alai/Fairuse-orphans-untagged and report back here with a few diffs, and preferably the success rate of the part of the list that was processed. —Mets501 (talk) 18:25, 25 January 2007 (UTC)
- Done! The bot tagged roughly 3/4 (71/100) of the images in the list. For this run I had only the pywikipedia rate limiting on, and it made about 3-6 EPM; I was wondering if I could keep it this way, as it will take forever at 1 EPM. Diffs: ([3], [4]) ([5], [6]) ([7], [8]). Also, I found this funny: User talk:Iluvchineselit :). BJTalk 02:29, 26 January 2007 (UTC)
Approved. Looks great. This bot shall run with a flag. Feel free to make up to six edits per minute. —Mets501 (talk) 02:43, 26 January 2007 (UTC)
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.