Wikipedia talk:Bots/Control proposals
Control proposals
The following proposals are split to make it a bit easier to consider them individually. Their intent is to provide me and the other technical team members with ways to limit bots other than using the site firewall. For those unfamiliar with me, I'm the member of the technical team who does most of the watching of the database servers. The limits are because of the major load that write bots can impose on the system and are intended to provide a way for those operating the systems to limit the bots to levels which do not unduly harm the response time of other users. These are not en Wikipedia proposals. They are intended for discussion here as proposals applying to the Wikimedia servers and are here to make you aware of them and provide a way for you to discuss them and the reasons for them.
Please don't suggest things like adding more database servers as a way of raising the limits: that won't make disks seek any faster. Even if it did, bots are unlikely to be sufficient reason for the expenditure.
Please don't hesitate to discuss the proposals with me in IRC - I'm not an ogre but I do have to deal with load issues and these are the tools required. Jamesday 04:51, 13 Dec 2004 (UTC)
- This is an interesting addition considering all the prior discussion about the "negligible" load bots have on the servers. I have questioned the load that bots place on the servers and people have replied that developers have told them there is negligible impact. Now we have someone who monitors the database load and wishes to place tighter controls on bots due to resource issues, which seems to back up my earlier concerns about the impact of bots on server load. As for the policies stated below, I think some will require changes to the core component of the pywikipediabot program. Alternatively, the server could do its own monitoring of bots (as bot accounts will be marked with the bot flag) and could conceivably give lower priority to bot requests. There are of course issues with doing that as well. Alternatively, a system load factor could be returned on requests made by bots, which could trigger the bots to increase or decrease their request rates accordingly. RedWolf 05:56, Dec 13, 2004 (UTC)
- Bots already get the best practical measure of load: how long it took their last request to complete. Even 10-second reporting of load isn't good enough to keep up with real-time fluctuations, so the response time you see is the best guide available. Waiting as long between requests as the last request took would be fairly good for real-time load adjustment. The default might be 3-5 times that, since that's about once every ten seconds much of the time. Lower priority for bot requests is doable but would take programming which isn't likely to happen soon. It's not that easy with PHP to set up a queue in which a bot could be placed at the back. It would work badly anyway - bots would never get any work done with that method because there are too many humans. Jamesday 15:24, 17 Dec 2004 (UTC)
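For illustration only, a minimal Python sketch of the response-time-proportional throttling described above; the fetch_page() helper, the URL handling and the multiplier are assumptions, not part of any existing bot framework:

    import time
    import urllib.request

    DELAY_FACTOR = 3  # wait roughly 3-5 times the last response time between requests

    def fetch_page(url):
        """Fetch a page and return (body, seconds the request took)."""
        start = time.time()
        with urllib.request.urlopen(url) as response:
            body = response.read()
        return body, time.time() - start

    def crawl(urls):
        for url in urls:
            body, elapsed = fetch_page(url)
            # The slower the site responds, the longer the bot waits before its next request.
            time.sleep(DELAY_FACTOR * elapsed)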
- FWIW, the comment that bot performance was "negligible" was based on a comment made by a developer (LDC) a long time ago, when the rambot first ran almost two years ago. The rambot has never increased the amount of editing that it performs. Since then, this is the first time someone has ever mentioned anything different, which is great because developers have been extremely silent on this issue. What I don't understand is how, with all the new hardware and newer software, the bot is suddenly slowing down Wikipedia. At the time it represented a tiny fraction of the server load; how can it suddenly be taking up so much? – Ram-Man (comment) (talk) 15:26, Dec 13, 2004 (UTC)
- How can it suddenly be taking so much is the wrong question. The right one is how close are the systems to their limits? For disks, it's easy at low load - any disk can handle a lot of load, since drives doing 150-200 seeks per second are readily available. If you ask while the site is well below using that many seeks, you'll get the answer that it's not a problem. Ask now and you get a different answer, because a 5+spare drive RAID5 array on a machine with a gigabyte of RAM has a limit of about 30 write queries per second (however many seeks it takes). That's Albert, the terabyte machine which just became the main image/download server for the site. Today I get the decidedly dubious pleasure of contemplating exponential growth with doubling periods in the 8-12 week range, while the systems are today at or exceeding the capacity of a single drive to keep up. That means that I'm asking the developers to start planning now, with a deadline of summer to this time next year, to have support in the software for splitting a single project across a farm of database servers, similar in concept to the way Google and LiveJournal split their databases into chunks. Please also review the recent medium term (one year or less threshold) capacity planning discussion at OpenFacts Wikipedia Status. Jamesday 05:23, 17 Dec 2004 (UTC)
- Is it possible to put the ext3 journal (or whatever filesystem you're using, if it has a journal) on a separate disk? Alternatively, maybe it's possible to put the innodb journal on a separate disk. If so, and you're not doing this already, this should increase performance many many times over, as writes to the journal are going to take very few seeks. I find it strange you'd have so many seeks anyway, as I would think the vast majority of the reads would be in the article space which easily fits in memory. I suppose compressing the article space would save even more memory, though. Splitting the database is surely going to be necessary eventually (well, I assume Wikipedia growth will outpace the declining costs of random access memory), but I would think there's some time for that, as anything important can fit in memory. anthony 警告 13:59, 17 Dec 2004 (UTC)
It's possible to put journals and logs on different disks. It has to be RAID 1 so InnoDB can reconstruct without data loss if there's a failure, though. We currently have 2U 6-bay boxes and I'd rather have 6 drives doing the full work load. With more bays I've certainly considered it and may well go that way. The article space doesn't fit in memory. En cur is about 2.673GB in InnoDB database form. About 4GB of 8GB total on the master is available for InnoDB caching. Slaves have 4GB but only have to cache half of the data. With indexes and other tables it would take several times the RAM to hold it all, and that's not really doable in the few-months-ahead timeframe even in boxes with a 16GB (using 2GB modules) RAM capacity. Getting the titles split from the article text is doable and being worked on. Compressing cur text is an option - it would save perhaps 30% of the space, but at the cost of complicating quite a few other things and adding some Apache load. Probably not worth doing. Might be worth it when InnoDB supports compression within the database engine, though. MySQL Cluster is interesting because it's spread around many machines, but the problems are the growth rate and it currently being too immature to trust, as well as not yet having a properly segmented data set so "cold" data can go somewhere other than RAM. We use memcached to cache rendered pages and we can eventually cache a pretty high proportion of parsed pages in that, saving significant database load. Jamesday 15:24, 17 Dec 2004 (UTC)
Proposal: Automated content policy
unlike the others, this is en Wikipedia only
- Bot operators are reminded that the de facto English language policy, in response to complaints about the rambot-created city articles, is that mass automated adding of content is prohibited. Any bot intending such action must obtain broad community consensus, involving, at a minimum, 100 votes, none of which can be made by bots. Jamesday 04:51, 13 Dec 2004 (UTC)
- Having done the rambot articles, I can't remember such a "de facto" policy ever being made. Of course, mass automatic adding on the scale of the rambot articles has not been done since, but I was unaware of any substantial policy discussion of this. Many of the complaints had nothing to do with the content, but when there were votes on the issue (before 100 was a reasonable number) they were consistently supported. Here is my problem with this: if the content is acceptable on the small scale, it should be acceptable on the large scale. We don't require voting for small-scale content. Despite the popular view, the fact is that there is no functional difference between a bot doing the work and me doing the work. The only difference is the amount of time it takes! I can't support this proposal because it assumes that if a user performs a large number of edits, the content must suddenly be given a stronger standard of review than a small number of edits. It violates the spirit of Wikipedia's openness and that anyone can edit. – Ram-Man (comment) (talk) 14:35, Dec 13, 2004 (UTC)
- The philosophy has been that if a human is willing to do the work, it's more likely to be worth having, coupled with widespread dislike for the work of the first mass bot, rambot: the US city articles. As a result of repeated complaints from many quarters, the "random page" feature was made non-random earlier this year. It now retries, trying to dodge any article where rambot is the last editor. That seems to have muted the complaints, and it is possible that a mass-adding bot will receive wider support now. Also, there are now a sufficiently large number of other articles that relatively large data dumps won't be as high profile as the cities were, so that should help. The quantity of the content seemed to be a bigger problem than the nature of the content, because it caused the articles to turn up very regularly, though there has been pretty regularly expressed dislike for the small city articles as well. Speaking personally only, they didn't bother me. Jamesday 10:33, 14 Dec 2004 (UTC)
- I agree with Ram-Man. Adding three articles is no different than adding 3,000 articles. The only difference is the time that it takes. Kevin Rector 17:36, Dec 13, 2004 (UTC)
- The quantity, both of articles and complaints, is what prompted the special handling to dodge rambot-created articles. Jamesday 10:33, 14 Dec 2004 (UTC)
- I agree with the theory, but 100 seems a little high. anthony 警告 05:43, 15 Dec 2004 (UTC)
Proposal: DBA discussion mandatory
- No bot may operate at more than 10 requests (read and write added together) per minute without discussing the bot's operations with, and getting approval from, the people who look after the Wikimedia database servers. I'm the person who does most of that.
- This proposal is intended to make sure that the operators of potentially disruptive bots have discussed their operation with those who will see and have to deal with that disruption. As time allows it also provides the opportunity for test runs and other tuning. Jamesday 04:51, 13 Dec 2004 (UTC)
- The speed of the requests should be tied directly to physical performance or other database concerns, not set at an arbitrary value. Wikipedia bot "policy" already dictates that approval for a bot should be obtained, and it has been working splendidly for us so far. – Ram-Man (comment) (talk) 14:38, Dec 13, 2004 (UTC)
- This one isn't a limit. It's how to get the capacity limits on a bot changed. We've seen that bots generally aren't harmful to system performance at current rates. There's now a proposal to abandon those effective limits. We don't know the limiting rate. Discussion and real time monitoring will let the limiting rates be determined for each high speed bot, based on its properties (those include things like the rate at which it touches seldom-edited articles). Peak time limiting rates are likely to be quite low - the systems are sized (and money raised) based on what is needed to handle those peak rates, with some margin for growth. If a bot operator is unwilling to work with the systems people to do that, why would anyone other than the bot operator want the bot able to run at high speed?
- If a bot is going to do high speed adding of articles, there are far better ways than using a bot. Things like bulk loads into the database (after broad community approval of such bulk loads). Discuss the task and how to get it done efficiently - that's something the people who can do such things can help with. If the broad community approves, you might find it more convenient to hand me 20,000 records in a CSV file and have me load them in half an hour one night than to run a bot for weeks. It would certainly be a lot nicer for the database servers, which have optimisations for such bulk data loading.
- Asking on bot pages on this project (or any other single project) isn't the right place to get load-related approvals. Nobody dealing with load would normally notice such discussions. System load discussions generally happen in #mediawiki, where the people who operate the systems discuss and react in real time to whatever is happening and discuss back and forth how to adjust designs and what to ask for. It's also the place where the people who understand the system load issues and the real-time response of the systems are to be found. The proposed and actual purchases which result from those discussions are placed on meta. I've written most of those this year and it was me who suggested raising the last fund-raising target from $25,000 to $50,000. If you'd like to understand the systems better, please read my contributions on meta, then follow up with questions on the appropriate meta talk pages. m:Hardware capacity growth planning is one to start with. Jamesday 03:38, 14 Dec 2004 (UTC)
I assume this only applies to read/write bots, and read-only bots would be limited to 60/minute, as before? anthony 警告 05:51, 15 Dec 2004 (UTC)
- Bots have been limited to 10 per minute by the bot policy, with crawlers subject to the robots.txt limit of one per second. Reads of normal article pages (not category pages, image description pages for much-linked images, or allpages pages) are generally sufficiently inexpensive that one bot at 60 per minute probably wouldn't be so bad. Ideally it would be reading while not logged in, so the Squid cache servers can serve the pages - that's the assumption behind the one per second limit in robots.txt. Hitting the database server at one page (many requests) per second would be unpleasant if done by many bots at once. As always, doing these things off peak is much preferred. Jamesday 05:23, 17 Dec 2004 (UTC)
- My understanding is actually that crawlers have to wait one second between hits, which is actually somewhat less than 1 per second (to much less when things are slow). That's how I implement it, anyway, after every read I just sleep(1) (or sleep(5) if there was an error, but I plan to tweak that to use more of an exponential backoff based on the number of errors in a row). But my write bot (which I currently use only about once an hour) uses my crawler to do the reading (just adds the read request to the top of the crawler's queue). So I'm really not sure how I'd even implement something like 10/minute across reads and writes, without slowing the crawler way down. anthony 警告 14:17, 17 Dec 2004 (UTC)
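A rough Python sketch of the exponential backoff anthony describes planning; the delay values and the wait_after_request() helper are illustrative assumptions, not his actual code:

    import time

    BASE_DELAY = 1     # seconds after a successful read
    ERROR_DELAY = 5    # starting delay after the first error
    MAX_DELAY = 600    # cap so a long outage doesn't turn into an hours-long sleep

    consecutive_errors = 0

    def wait_after_request(succeeded):
        """Sleep 1s normally; on errors, double the sleep for each error in a row."""
        global consecutive_errors
        if succeeded:
            consecutive_errors = 0
            time.sleep(BASE_DELAY)
        else:
            consecutive_errors += 1
            time.sleep(min(ERROR_DELAY * 2 ** (consecutive_errors - 1), MAX_DELAY))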
- Discuss with the DBA in advance so the results can be watched during a run. Then I can give you a personal, task-based limit, based on what I see. It's the get-out-of-jail-almost-free card. :) Jamesday 14:57, 17 Dec 2004 (UTC)
- Discussing with the DBA in advance would defeat the whole purpose of using the bot, which is to make the changes faster and with less effort. If I've gotta discuss things with a DBA before I can make the edits, I'll just use an interactive script (call it an "alternative browser", not a "bot") and run it under my own name. Can you imagine if Britannica required its writers to check with the DBA every time they want to add 50 see-alsos? Most of them would probably quit, and they're getting paid for it. There are already a huge number of useful edits I don't make because of the overrestrictive bot policy. Let's not add these to the list. anthony 警告 15:28, 17 Dec 2004 (UTC)
I'd rather see 600/hour than 10/minute. Since a bot pretty much has to do a read before it can do a write, 10 requests/minute would be 5 edits/minute. That means if I want to add a category to 50 articles, I have to wait 10 minutes before I can check the last of the edits to make sure everything went smoothly. This is even more problematic if the limit is intended to be binding when I run the bot manually as a script. I've got a script which allows me to type on the command line to create a redirect; requiring me to wait 5 seconds between each command line argument is not reasonable. In fact, if this is required I'll just change my command line scripts to use my regular username and no one will know the difference. anthony 警告 05:51, 15 Dec 2004 (UTC)
- Depends on the times. 600 in one second would always be a denial of service attack; it would be noticed and you'd be firewalled. Off peak, faster transient rates, throttled by running single-threaded and not making a request until the last one is fully complete, would probably not be unduly problematic, even though that may be faster than robots.txt permits. The key is to always delay quickly once you see responses slowing, so you adjust at the very rapid rate at which the site response time can change. Within seconds, a link from a place like Yahoo can add several hundred requests per second and make all talk of "this is a good time" obsolete. Yahoo isn't hypothetical - Yahoo Japan has done it quite a few times now - it makes for an interesting, nearly instant, spike on the squid charts. Fortunately the squids do most of the required work in that case. Stick to normal pages, wait as long between requests as your last request took and do it off peak (based on the real time Ganglia data) and you should be fine for reads. Jamesday
- "600 in one second would always be a denial of service attack, it would be noticed and you'd be firewalled." I'm assuming 600 in a second would be impossible without multithreading. "Off peak, faster transient rates, throttled by running single-threaded and not making a request until the last one is fully complete, would probably not be unduly problematic, even though that may be faster than robots.txt permits." That's basically what I figured, and why I'm hesitant to apply such legalistic limits. anthony 警告 14:17, 17 Dec 2004 (UTC)
- The discussions should be giving everyone a fair idea of what's likely to be a problem for high speed bots. :) Jamesday 14:54, 17 Dec 2004 (UTC)
Now, for something somewhat off topic, would you suggest using "Accept-Encoding: gzip" or not? If it's equal either way, I'll probably turn it on, because it'll at least save bandwidth on my end, but if this is going to have a negative impact, I won't use it in general. anthony 警告 14:17, 17 Dec 2004 (UTC)
- Please accept gzip. If not logged in (ideal for checking if a page has been changed or not) it's particularly good because the Squids will have that already. Jamesday 14:54, 17 Dec 2004 (UTC)
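A minimal Python sketch of requesting gzip-compressed pages as suggested above; the fetch_gzipped() helper is hypothetical:

    import gzip
    import urllib.request

    def fetch_gzipped(url):
        """Request a gzip-compressed copy of a page and decompress it locally."""
        request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
        with urllib.request.urlopen(request) as response:
            data = response.read()
            if response.headers.get("Content-Encoding") == "gzip":
                data = gzip.decompress(data)
        return data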
Proposal: bot identification
All bots operating faster than one edit per minute must uniquely, correctly and consistently identify themselves in their browser ID string, including contact information in that string, and must register a unique ID from that string and at least some contact information in a central location on the site shared by all wikis. The purpose of this is to allow the technical team to identify the bot, reliably block that bot and contact the owner of the bot to discuss the matter. Jamesday 04:51, 13 Dec 2004 (UTC)
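As a sketch, one way a bot might set the kind of identifying browser ID (User-Agent) string this proposal asks for; the bot name, URL and contact details are placeholders:

    import urllib.request

    # A unique ID plus contact details, so the technical team can tell the bot apart
    # from ordinary traffic and reach its operator quickly in an emergency.
    BOT_USER_AGENT = ("ExampleLinkFixBot/1.2 (run by User:ExampleUser; "
                      "https://en.wikipedia.org/wiki/User:ExampleUser; example@example.org)")

    def open_url(url):
        request = urllib.request.Request(url, headers={"User-Agent": BOT_USER_AGENT})
        return urllib.request.urlopen(request)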
- This sounds like a great idea, but blocking the bot's username and/or password seems like a simpler solution. Why is this new restriction needed? For instance, most bot user pages should have that kind of information and "email this user" links. – Ram-Man (comment) (talk) 14:40, Dec 13, 2004 (UTC)
- It's not close to fast enough and doesn't fit with the way we identify crawlers to stop them and keep the site online in an emergency. The proposal to lift the limits moves disruptive bots from a "slow enough to take some time" to a "get that stopped immediately" response timeframe. So, I have to tell you what we need to know, and how, and how we react, when operating in that emergency mode. It's very different from the way the wiki normally works, because there is minimal time available for discussion and if you're not in the room at the time, or available nearly instantly, you can't participate, because the site may be offline while we wait. Some of my other proposals decrease the chance of a bot being sufficiently disruptive for this sort of emergency mode action to be needed. I'd rather discuss for months. I just don't have that luxury in real time operations. Jamesday 06:34, 14 Dec 2004 (UTC)
- Would there be any method to automatically throttle a bot's responses on the server side dynamically based on server load, or does this have to be done exclusively with the bot itself? – Ram-Man (comment) (talk) 18:20, Dec 13, 2004 (UTC)
- Single-threaded operation, waiting for a complete response, and backing off on errors are the currently available methods. Those provide some degree of automatic rate limiting and might be enough to keep the rate low enough to prevent emergency measures like firewalling. Maybe - I haven't watched a high speed editing bot in real time yet. I'd like to. In #mediawiki we also have bots posting periodic server load statements, aborted query notices and emergency denial of service load limit exceeded warnings. For myself, I also operate all the time I'm around with the "mytop" monitoring tool running on all five of the main production database servers, refreshing every 30 seconds (though I'm adjusting to 15, because that's not fast enough to reliably catch the causes of some transient events). There's no reason someone couldn't write something to help bots limit their rate, but it's not around yet. I asked Ask Jeeves to wait doing nothing for as long as the last operation took (we had to firewall them for a while). At the moment, wait times aren't a reliable indicator of database load because Apache web server CPU is the limit (we have 4 out of service and five ordered, and another order being thought about but not yet placed; there's also the substantially faster code benchmarked for MediaWiki 1.4, being rolled out now, countered by using more compression of old article revisions, which might help or hurt). A modified servmon/querybane reporting on a private IRC channel might be one approach. Jamesday 06:34, 14 Dec 2004 (UTC)
- Object: I'm happy to include a browser string like this that a handful of server admins can see: "LinkBot, by Nickj, Contact email: something@someplace.com, in emergencies please SMS +610423232323". However, I'm definitely not happy for everyone to see it "in a central location". I don't get to see their email addresses, or their mobile phone numbers, so why should they see mine? Secondly, each bot is already registered to a user, so you can already see who runs the bot anyway, and therefore email them. Would you actually want to phone or SMS people (which is the only extra information)? If so, you're welcome to have my cell number, provided it's only visible to server admins (i.e. the people with root access on the Wikipedia servers). -- Nickj 22:53, 13 Dec 2004 (UTC)
- Exactly what is up to you. It determines how likely we are to decide we can accept people getting connection refused errors or other forms of site unavailability. If we can reach you fast we can take a few more seconds. If we aren't sure, and hundreds of queries are being killed by the load limiter every 20 seconds, we'll firewall instead. If there's an ID and a request to call you in the browser ID string, and a member of the tech team is on the same continent, one might call and ask you to stop the bot instead of firewalling it. More likely after firewalling it, to let you know and make it possible to more quickly end the firewalling. At the current rates, it's very unlikely that this will be necessary - it's the proposed new rates which change the picture and mean I have to talk about emergency realtime activities. I hope that no bot will be affected - but I want all high speed bot operators to know how we think and react in those situations, so nobody is surprised by what happens if rates are too high. Jamesday 06:34, 14 Dec 2004 (UTC)
- Do note that I didn't ask for an email address or telephone number. The harder it is to find and quickly contact the bot operator, the more likely it is that we'll have to firewall the IP address to block it. When there is a crawler-type load issue, the first stop is usually the database servers rising from 5-20 to hundreds of queries outstanding and then mass query kills by the load limiter. Except that for safety it only kills reads, while the bots would be doing writes and wouldn't be stopped. The next step after noticing that is a search of the logs to get the IP address and browser ID string. If that provides enough information to identify and contact the bot operator predictably, in real time, within seconds or minutes, then we may be able to risk waiting instead of firewalling the IP address to keep the site online. Email won't be used until after the fact - a wiki account would be as good. An IM address could be useful. To dramatically reduce your chance of being firewalled when crawling at high speed, monitor #mediawiki in real time while your bot is running, because that's where we'll be asking questions, asking others in the team to do something or telling others what we're doing. You might be fast enough at noticing to stop the firewalling by stopping the bot. Fast matters here: crawling and some other nasty things have taken servers from under ten to 800 or even up to 2,000 outstanding queries and denial of service mode emergency load shedding (mass cancelling of hundreds of queries, continuing until load drops) in less than 30 seconds. Such events are usually started, noticed and dealt with inside five minutes, so that's the effective upper bound on whether your contact information will do any immediate good. Just be sure to include a unique ID of some sort in the central spot - no need to include phone or IM or whatever, but if you want those used, the browser ID string is the best place for them. The logs are kept private - we aren't going to do something like post a phone number to a mailing list. Jamesday 06:34, 14 Dec 2004 (UTC)
- We can divide all bots into two groups: slow and quick. Slow ones work by themselves. As no operator is around these bots, the query rate can be lowered (it is better that the bots are waiting, not the real people).
- Quick ones work in real time, and they work with and for their operator. So it would be logical if the operator were required to provide some kind of e-contact (a nick on a special IRC channel, a special multi-user chat room for IM, or anything else).
- Summary: for "slow" bots the query rate can be lowered (to reduce server load), and any registered bot can run in this mode; if an operator wants to run his bot in "quick" mode, then his e-presence can be required.
- --DIG 06:33, 8 Jan 2005 (UTC)
Proposal: robots.txt
All bots operating faster than one edit per minute must follow any robots.txt instructions and limits, whether they are specifically directed at that bot or generic. No more than one operation (read or write) per second is the most significant of the current limits. The robots.txt file must be checked no less often than once per 100 operations or once per hour of operation. The purpose of this limit is to provide a way for the technical team to limit the activities of the bot without firewalling it and blocking all read and write access to any Wikimedia computer (which includes all sites and the mailing lists). Jamesday 04:51, 13 Dec 2004 (UTC)
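A minimal Python sketch, using the standard library's robots.txt parser, of honouring robots.txt and re-checking it every 100 operations as proposed; the bot name and recheck interval are placeholders:

    import urllib.robotparser

    BOT_NAME = "ExampleBot"
    RECHECK_EVERY = 100  # operations between robots.txt re-reads

    parser = urllib.robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
    parser.read()
    operations = 0

    def allowed(url):
        """Re-read robots.txt periodically and honour its rules for this bot."""
        global operations
        operations += 1
        if operations % RECHECK_EVERY == 0:
            parser.read()  # pick up new limits or bot-specific rules without a restart
        return parser.can_fetch(BOT_NAME, url)

    def min_interval():
        """Seconds to wait between operations; fall back to one per second."""
        return parser.crawl_delay(BOT_NAME) or 1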
- After reading up on this issue, I agree with this proposal in principle, assuming the robots.txt file accurately represents current server load. In the ideal solution the bots would be able to use just enough of the remaining server load to maximize server usage, getting the maximum speed without hitting a critical slowdown. Over the last week or so I have not been running a bot, but Wikipedia's read and write response has seemed quite slow. Don't we have a bigger problem than bots if this is the case? Maybe that's a bit of a red herring, but I wonder if there is an easy, more dynamic way for bots to check current server load and to be able to use as much of it as possible without causing problems. – Ram-Man (comment) (talk) 18:33, Dec 13, 2004 (UTC)
- We're short of apaches right now, so it's slow for that reason, even though the database servers are OK speed-wise. On saving, there's an unfortunate bit of software design which causes writes to sometimes be delayed several minutes or fail, triggered only by edits to an article not changed within the last week. Those sometimes cause a lock conflict on the recentchanges table. A quick hack to possibly deal with this is in MediaWiki 1.4; a nicer solution is on the to-do list. This is one reason why I commented about write rates to infrequently changed articles.
- Beyond site down, there's the "how slow is too slow" question. That's a tough one and people differ significantly - I tend to be more accepting of slowness than most, but it varies. Bots doing bulk updates which can be done in other ways aren't something I'd be particularly keen on, just because it's inefficient. Those who want to rapidly do very large numbers of edits might consider working together to develop things like server-side agents with vetted code verifying job files with carefully vetted allowed transaction types. Those could do things like apply batches of 100 edits to one table at the same time, exploiting the database server optimisations for a single transaction or statement doing multiple updates, then moving on to the next table. More efficient ways of operating like that are perhaps better than running client-side bots at higher speeds. Jamesday 06:51, 14 Dec 2004 (UTC)
Checking robots.txt every 100 edits seems like overkill to me. If there's a problem I'd rather you just firewalled me and we can talk about why when I figure it out. I have considered setting up a page on the wiki that anyone can edit to automatically stop my bot. Whether or not this would be abused is my only concern. Maybe if the page were protected. Then any admin could stop my bot by just editing the page. But I suppose checking that page every edit would cause a bit too much load in itself. anthony 警告 06:02, 15 Dec 2004 (UTC)
- If you were to check while not logged in it would be fairly cheap. One database query for the IP-blocked-or-not check per visit (might be more, but none come to mind right now), then the squids would serve the rest. That would only change if someone edited the page and flushed it from the Squid cache. Every 10-50 edits might be OK. Jamesday 14:51, 17 Dec 2004 (UTC)
Proposal: write robots.txt limits
- Robots.txt limits are intended for readers. The minimum interval between writes is ten times the read interval at peak times, five times the read interval off peak. This is because writes are on average far more costly than reads. At the time of writing this would impose a limit on write bots of one update per five seconds, plus the time for a read.
- Any bot must also follow any bot-specific value specified in robots.txt (including by a comment directed at the operator) and must check the limit no less often than once per hour of operation or once every 100 writes.
- Peak hours are subject to change at any time but are currently the 7 hours either side of the typical busiest time of the day. Jamesday 04:51, 13 Dec 2004 (UTC)
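Read as a minimum spacing between writes, the proposal might look roughly like the Python sketch below; the peak-hour window shown is a placeholder, not the actual schedule:

    import time
    from datetime import datetime, timezone

    READ_INTERVAL = 1.0            # seconds, from the one-operation-per-second robots.txt limit
    PEAK_HOURS_UTC = range(8, 23)  # placeholder: roughly 7 hours either side of a hypothetical 15:00 UTC peak

    def write_interval():
        """Minimum seconds between writes: 10x the read interval at peak, 5x off peak."""
        factor = 10 if datetime.now(timezone.utc).hour in PEAK_HOURS_UTC else 5
        return factor * READ_INTERVAL

    def pace_write(last_read_seconds):
        # One update per five (or ten) seconds, plus the time the preceding read took.
        time.sleep(write_interval() + last_read_seconds)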
- No attack intended, but I'm curious how these limits were created. I would imagine that the number of currently running bots would be a more important factor. 10 bots running at that speed are worse than 1 bot running at 5 times the speed. In practice I don't think many bots are running at the same time, and I'd like to know what percentage of total load is actually done by bots vs regular users. Might it be more important to coordinate various bot runs to optimize server usage? – Ram-Man (comment) (talk) 14:52, Dec 13, 2004 (UTC)
- They were set based on painful experience with bots crawling too fast and overloading the site, and were originally set before my time. Today, we see about one too-fast crawler every week or two. As the capability grows, both the instantaneous capacity to handle things and the number of crawlers and pages to be crawled rise, so it's tough to get and stay ahead. One second is really too fast but we gamble on bots/crawlers not hitting too many costly pages at once. That generally works tolerably well. Yes, currently running bots are a factor, but there are few enough bots and enough human convenience variations that per-bot limits and throttling based on response time seem likely to be reasonably sufficient for now. Figures for total load by bots vs. humans aren't around. Editing is substantially more costly than viewing but I've no really good data for you. Bots are likely to hit uncached and seldom-visited pages more often than humans, so a bot edit is more costly than a human edit overall. Jamesday 14:47, 17 Dec 2004 (UTC)
Proposal: no multithreading
Multithreaded bots are prohibited. All bots must carry out an operation and wait for a complete response from the site before proceeding. This is intended to act both as a rate limiter and to prevent a bot from continually adding more and more load at times when the servers are having trouble. A complete response is some close approximation to waiting for the site to return all parts of all of the output it would normally send if a human had been doing the work. It's intended to rule out a send, send, send without waiting for results operating style. Jamesday 04:51, 13 Dec 2004 (UTC)
- I assume they can do http over persistent TCP connections though? IIRC that's specifically allowed in the relevant RFCs, see RFC 2616 section 8.1 in particular. Just thought this should be mentioned to avoid confusion :) Pakaran (ark a pan) 14:54, 13 Dec 2004 (UTC)
- Just to clarify, I just think we should define "complete response." Not trying to be a pedant, but it saves you explaining that to each bot operator etc. Pakaran (ark a pan) 15:01, 13 Dec 2004 (UTC)
- Persistent connections are good. Just don't send a command and move on to the next without waiting for the reply. That is, be sure that if the site is slow, your bot is automatically slowed down in proportion as well. Jamesday 07:18, 14 Dec 2004 (UTC)
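A minimal Python sketch of the single-threaded, wait-for-the-complete-response style described here, over one persistent connection; the host, path handling and User-Agent are placeholders:

    import http.client
    import time

    # One connection, one request at a time: the next request is never sent until the
    # previous response has been read in full, so a slow site automatically slows the bot.
    conn = http.client.HTTPSConnection("en.wikipedia.org")

    def fetch(path):
        start = time.time()
        conn.request("GET", path, headers={"User-Agent": "ExampleBot/0.1 (example@example.org)"})
        body = conn.getresponse().read()   # wait for the complete response
        time.sleep(time.time() - start)    # then wait at least as long again before proceeding
        return body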
- I've never personally performed multi-threading and I don't intend to, but I tend to be a minimalist. Don't add rules that you don't need to. So my question is similar to the one above. What are the actual load numbers of bots on Wikipedia, and if a bot did use multi-threading, would that really cause a significant issue? From your statement on server load, it seems the answer would be yes, but it may not always be that way. A more dynamic policy would be preferred. – Ram-Man (comment) (talk) 14:56, Dec 13, 2004 (UTC)
- That's part of why I want discussions about specific high speed bots, while leaving alone bots which are following the current policy and known to be safe enough. If I and the others dealing with load are in touch with those running high speed bots, we're better able to adjust things. For background, I've typically been adjusting the percentage of search requests which we reject a dozen or more times a day, because installation of the two new database servers we ordered in mid October was delayed. I haven't had to since they entered service; it's been full on. Given viable cross-project communications channels (and central and automatic bot settings for on/off and such) I'll do what I can to help high speed bots run quickly. But those things do need to be arranged. Frankly, as long as one human is considering each edit interactively in real time, I doubt the technical team will notice the bot. Bulk data loads without live human interaction are more of a concern. I do expect/hope that MediaWiki 1.4 will cause many more queries to leave the master server and be made on the slaves. That will hopefully make that system far less sensitive than it is today. We're still in the process of moving from one server to master/slave write/read division and 1.4 should be a major step in that direction. Today the master sees a far higher proportion of total site reads than it should and the stress is showing. Just part of moving from three servers this time last year to 30 this year, and redesigning and reworking to handle the load. Jamesday 07:18, 14 Dec 2004 (UTC)
Let's say I'm running two different scripts. Is that multithreading? I would think as long as I limit each script to half the edits per minute it's fine. anthony 警告 06:06, 15 Dec 2004 (UTC)
Proposal: no use of multiple IPs
Use of more than one IP address by bots is prohibited unless that bot is forced to do so by its network connection. Jamesday 04:51, 13 Dec 2004 (UTC)
- I use multiple IP addresses all the time because I run the bot from different locations, not just while using a dynamic IP with an ISP. This has nothing to do with my network connection but with my personal usage. If I were prohibited from this, I would just run the bot unsupervised, which would be far worse. As with most of these proposals, if they are listed as guidelines, I would be fine with that. If they are listed as strict rules, I will not support them. Bot owners have been fairly upstanding Wikizens, and criminals should be dealt with on a case-by-case basis, much like regular users. If these rules are suggested because of a specific bot causing trouble, then use diplomacy with that user. – Ram-Man (comment) (talk) 15:19, Dec 13, 2004 (UTC)
I assume this is more than one IP address at the same time? While we're at it, what is "a bot". As far as I'm concerned bots are uncountable. I assume all automated tools being run by a single person is considered a single bot for the purpose of all these proposals. anthony 警告 06:08, 15 Dec 2004 (UTC)
Proposal: back off on errors
- Whenever a bot receives any error message from the site twice in succession it must cease operation for at least 10 minutes.
- Whenever a bot receives a message from the site indicating a database issue it must cease operation for at least 30 minutes.
- Whenever a bot receives a message from the site indicating a maintenance period or other outage it must cease operation for at least 60 minutes and at least 30 minutes beyond the end of the outage. This is to back off the load while the site is down and to help deal with the high loads experienced on restarts of the database servers after a crash, when they have cold (unfilled) caches.
- The limits are intended to prevent bots from piling on more and more requests when the site is having load or other problems. Jamesday 04:51, 13 Dec 2004 (UTC)
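A rough Python sketch of this back-off schedule; the error-classification helpers are placeholders, since the actual error message formats aren't specified here:

    import time

    def looks_like_database_error(text):
        # Placeholder classifier; a real bot would match the site's actual error pages.
        return "database" in text.lower()

    def looks_like_outage_notice(text):
        return "maintenance" in text.lower() or "read-only" in text.lower()

    def back_off(error_text, consecutive_errors):
        """Sleep according to the proposed schedule (all figures in minutes)."""
        if looks_like_outage_notice(error_text):
            time.sleep(60 * 60)       # outage/maintenance: 60 minutes (plus 30 past its end)
        elif looks_like_database_error(error_text):
            time.sleep(30 * 60)       # database issue: 30 minutes
        elif consecutive_errors >= 2:
            time.sleep(10 * 60)       # any error twice in succession: 10 minutes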
- Shouldn't all users have to follow this rule? I receive dozens of errors while editing with my main user account every day, but I just try again and it works 95% of the time. I suspect that most obsessed Wikipedians do the same thing. Also detecting errors could be difficult, as they can come from a wide range of sources. – Ram-Man (comment) (talk) 15:08, Dec 13, 2004 (UTC)
- It would be nice if all users backed off at high load. Since bots do more work per person, as a whole, the bot operators are a group which needs to be more aware than the general community. The actual behavior of users is often to keep retrying, piling on more and more load. It's pretty routine for me to see many copies of the same slow search, for example. With load sharing, that can end up slowing down all of the database servers handling that wiki, as it's randomly sent to each of them on successive tries. So, people do need to be told to back off on error or slowness, because they don't normally do that. If you're willing, I'd be interested in a log of the errors you see with timestamps and a note of what you were doing. Jamesday 09:54, 14 Dec 2004 (UTC)
- My preferred backoff routine during normal operation is to take the response time of the last action and add double that amount to the next delay time (response time is time in addition to minimum delay). Before doing that, compare the response time to the delay time and if less then subtract 1 second and 10% of the previous delay time. If smaller than the minimum delay time, reset to minimum. This makes for a quick backoff and a slow return to faster action while only needing to remember the delay value and configured limits. When an error of any kind occurs, double the delay time. Have a maximum delay limit to avoid month-long delay times. Include a random factor to prevent synchronization problems. I've used such methods in other automation; I just started using Wiki bots and haven't implemented this algorithm yet here. (SEWilco 18:09, 21 Jun 2005 (UTC))
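One possible Python reading of the routine SEWilco describes; the minimum, maximum and jitter values are illustrative assumptions:

    import random
    import time

    MIN_DELAY = 5.0     # configured floor, seconds
    MAX_DELAY = 3600.0  # cap to avoid month-long delays

    delay = MIN_DELAY

    def update_delay(response_time, had_error):
        """Quick backoff, slow return to faster action, doubling on any error."""
        global delay
        if had_error:
            delay = min(delay * 2, MAX_DELAY)
            return
        if response_time < delay:
            delay -= 1 + 0.1 * delay          # site keeping up: ease the delay back down
        delay += 2 * response_time            # slow responses push the delay up quickly
        delay = max(MIN_DELAY, min(delay, MAX_DELAY))

    def wait():
        # Random jitter so many bots using the same rule don't synchronize their requests.
        time.sleep(delay + random.uniform(0, 0.25 * delay))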
Can we get some more information on how to determine the type of error? Otherwise I suppose this is reasonable, though there should be an exception for any time that the status is checked by a real live human. anthony 警告 06:13, 15 Dec 2004 (UTC)
Proposal: cease operation on request
When a cessation or delay of operations is requested by a member of the technical team, the bot operator is required to promptly comply. The purpose of this proposal is to allow the technical team to deal with any previously unmentioned issues. Jamesday 04:51, 13 Dec 2004 (UTC)
One example of such an issue is the current space shortage on the primary database server, which may force us to switch to the secondary database server next weekend. While an extra 50,000 old article revisions may normally be OK, they aren't helpful when we're trying to avoid a forced move to a lower power, larger capacity, computer. This is a temporary situation: compression has left 40GB unusable due to fragmentation, and new compression promises to free up a further 60-80GB and allow the existing 40GB to be made available, leaving about 100-120GB free of the 210GB capacity. But today we don't have that space available and won't until the compression has been fully tested and used on all wikis. Jamesday 04:51, 13 Dec 2004 (UTC)
- Is there some page where this kind of information can be found? It should either be placed on the Bots page or a link to it should be placed there. In any case, the spirit of our current policy would suggest that when major bot edits would cause major hardship on Wikipedia, they should not be run. I think if our database is nearing a critical mass, then we don't need to vote on a proposal! Just ask any bot users to cease bot operations until the fix is made. I assume this will only be temporary. I would be more than happy to comply with such a request, and I don't know of any bot owner that wouldn't willingly comply with such a request. Skip the legal process and JUST ASK! – Ram-Man (comment) (talk) 15:08, Dec 13, 2004 (UTC)
- I'm sure everybody here would already stop immediately if asked anyway at the moment, so I really don't see the point of the proposal. As Ram-Man so succinctly put it: "Skip the legal process and JUST ASK!". -- Nickj 22:56, 13 Dec 2004 (UTC)
- I'd probably go so far as to say when a cessation or delay of operations is requested by anyone, the bot operator is required to promptly comply. anthony 警告 06:16, 15 Dec 2004 (UTC)
- What sort of "anyone" are you thinking of? Should random anonymous users be allowed to request a stop? What about a user who has a philosophical objection to any bot edits? --Carnildo 22:58, 16 Dec 2004 (UTC)
- I wouldn't include anonymous users. If a user has a philosophical objection to any bot edits I would hope he or she would be convinced not to object to every bot on those grounds. Does anyone actually feel that way? I'd say we burn that bridge when we get to it. Temporarily stopping the bot while the issue is dealt with doesn't seem to be a very big deal, and I would think we already have a clear consensus that some bots are acceptable. anthony 警告 02:08, 17 Dec 2004 (UTC)
- A protected subpage can be used as a message board to bots. Individual bots could be named, and have a message which controls all cooperating bots. The general message should always be there, with either a STOP or GO text, as inability to access the message can imply to a bot that problems exist. By protecting the page it can be trusted more than an unprotected page. Any unprotected page could be used by a bot as an advisory, and it might stop permanently or only stop for an hour. (SEWilco 18:18, 21 Jun 2005 (UTC))
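A rough Python sketch of a bot polling such a control message; the page title and URL are hypothetical:

    import urllib.request

    # Hypothetical protected control page, fetched raw while not logged in so the
    # Squid caches can serve it cheaply.
    CONTROL_PAGE = "https://en.wikipedia.org/w/index.php?title=Wikipedia:Bots/Run&action=raw"

    def may_continue():
        """Run only if the control page is reachable and currently says GO."""
        try:
            with urllib.request.urlopen(CONTROL_PAGE, timeout=30) as response:
                text = response.read().decode("utf-8", "replace")
        except OSError:
            return False  # can't read the message: assume problems exist and stop
        return "STOP" not in text and "GO" in text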
Proposal: work together, consolidate updates
The greatest space growth in the database servers is the old article revisions. For this reason it's desirable that mass operations like bots seek to reduce the number of new article revisions they create. Bot operators are strongly encouraged to:
- combine multiple operations in one edit.
- work with other bot operators on data interchange formats to facilitate combined edits. Jamesday 04:51, 13 Dec 2004 (UTC)
- This is common sense and you can add it as a note to the bot page if you'd like as recommended guidelines. – Ram-Man (comment) (talk) 15:09, Dec 13, 2004 (UTC)
- Oppose. It's often easier for people to read and check the diffs if logically separate changes are committed separately. Human convenience (keeping separate changes separate) should be more important than disk space (combining multiple logically-separate changes into a single change). If disk space really is an issue, then why not change the implementation to store things more efficiently? —AlanBarrett 19:09, 14 Dec 2004 (UTC)
- Personally I prefer reading combined bot edits - it's easier to check them all at once than individually, and it's less work for the RC patrol people. However, that's personal preference. Onto the technical side. En Wikipedia old uncompressed is about 80GB. Using bzip compression for individual articles takes that to 40GB (though that 40GB saved isn't yet free because of fragmentation). If the new compression which combines multiple revisions together works as well for en as it did for meta, that would fall to 15GB, of which about 25% is database overhead. About 19% of the uncompressed size. In development but not yet available is diff-based compression; ETA whenever it's done. That may be smaller but will probably be faster for the apache web servers than compression and decompression.
This is for 7,673,690 old records, so each one with the best compression we have available now is taking about 2k. Of that 2k, there are fields for the id (8 bytes), namespace (2 bytes), title (I forget, call it about 30 average), comment (call it about 50 average), user and text of user (about 20), two timestamps (28), minor edit flag (1), variable flags (about 4) plus some database record keeping (call it 16). Plus the article text. Eliminate all of the text, which we're planning to do by moving it to different database servers once the programming for that has been done, and we're left with perhaps 160 bytes per revision for metadata. About 1.2GB of that when you have 7,673,690 records. And exponential growth with 8-12 week doubling time (but this may be worse).
How long before that fills the 210GB on 6 of the 15,000 RPM 73GB $750 drives in the current master server? Quite a while - it's actually 28-29 doublings away. About 4 to 7 years away. Or it could be 20 if growth slows. Or longer. By which time we'll have more disk space (though it won't be plain drive arrays, because in 29 doublings we'll either have stopped growing or split en across many different database servers or failed). Of course some things can't double that many times... but editing seems unlikely to stop even if the number of articles becomes stable, which is why old record growth is the one I watch most carefully.
Of course, there are things to do about this and those are being investigated as well - like rewriting the code so en Wikipedia can be diced across multiple servers. Dicing en may be necessary in about 9 months because of the disk write rate, though again I've suggested things which can stretch that a bit. I've also been discussing introducing a feature to optionally automatically merge a minor edit by the same person with their previous edits, within a certain time threshold, keeping two recent changes records and only combining if the comments can be combined. This would get rid of the pretty common major edit/typo fix pattern.
Now, nothing will break because of this. It'd just take more money raised to buy increasingly esoteric storage systems to keep the capacity and performance tolerable if nothing was changed. But since I have to think ahead, I'll take any edge I can get which might give more time for fund raising and coding to keep up with the growth rate. I'm contemplating suggesting $100,000 as the next quarterly fundraising target. We'll be spending the last of the $50,000 from the previous one in a few weeks, though there are some reserves available in a crunch.
Please let me or any of the technical team know if you can think of better ways to do things (and if you can find the human time or money to get them done, that's even better!). Jamesday 14:37, 17 Dec 2004 (UTC)
Proposal: Seek diplomatic remedies to problems
The current four point policy (1. The bot is harmless, 2. The bot is useful, 3. The bot is not a server hog, and 4. The bot has been approved) should be the main bot guidelines/policy. If a user violates these, the bot user account may be temporarily blocked, if necessary, while a diplomatic solution is reached directly with the user. Bot owners have been historically very helpful and should be more than happy to comply with any reasonable request. In the past, the bot policy has worked just fine as is, without problems. (A separate policy on bot write speeds should be decided on as well.) – Ram-Man (comment) (talk) 15:35, Dec 13, 2004 (UTC)
- Support: It concerns me that there seems to be an exponentially growing amount of red-tape to run a bot. You'd think bot authors were hardened criminals or sociopaths, the amount of stuff they have to go through, and the suspicion with which they are regarded - but the simple fact is that they're volunteers trying to constructively improve the Wikipedia, who do listen to feedback, who will stop if asked, and who are already probably the most highly "regulated" contributors to the Wikipedia. -- Nickj 23:03, 13 Dec 2004 (UTC)
- What prompted the other proposals was a proposed change to the bot policy which dramatically increases the scope for bots causing problems. If you want to run more quickly, you need to deal with what is required to do that without breaching point 3. A bot which follows the policies I proposed is unlikely to be disruptive enough to be blocked. That's why I proposed them. Jamesday 10:03, 14 Dec 2004 (UTC)