Wikipedia talk:Bots/Archive 1

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

One presumes that the problem which the following proposals address is Ram-Man's automatic geography-article generator.

Benefits bots can offer

  1. Provides a good template of pre-formatted data for contributors (see how the Newton, Massachusetts entry has been expanded; imagine if the Periodic Table were used to start the 100+ articles for the elements)
  2. Potentially provides a unique resource not directly available elsewhere on the web (the small-town bot is a good example of a well-designed bot--see Ram-Man's description of the data acquisition process - uck!)
  3. Provides full coverage in cases where an a priori undeterminable subset of the data has a high likelihood of being (or becoming) interesting even though a randomly chosen entry has a low probability of being interesting / useful.

Inherent drawbacks of using bots in current system

  1. Adds tens of thousands of entries to Wikipedia that are unlikely to see a human edit any time soon (in fact, we could probably extrapolate the nearly exact rate at which they will get edited by seeing how many have been edited so far)
  2. Artificially inflates the perceived activity of Wikipedia
  3. Can be perceived as tilting (and possibly could tilt) the purpose of Wikipedia away from being an encyclopedia and towards being a gazetteer / Sports Trivia Reference / etc. This is also a problem with hand-generated imports from other resources.
  4. Danger of abuse by "vandal-bots", or just "clueless-bots". A bot running out of control could potentially cause heavy server load or even a denial of service attack.
  5. General complaints about interference with normal contributor operations, esp. Special:RecentChanges.

Specifications

The goal is to end up with one proposal. All proposals should

  1. allow all of the above benefits and
  2. neutralize/eliminate all of the drawbacks.

Other specifications by which to judge proposed changes:

  1. it should be easier for people to create well-behaved bots than it is to create malformed bots
  2. En-masse rollback should be possible in general, not just for known and approved bots.
  3. Automatic grouping of large numbers of page creations on Recentchanges should help on the RC-flooding front.

Proposal #1

Any graceful solution would provide the automatic functionality of the pros without the negative consequences of the cons. Bots would continue to be expected to meet the current criteria of usefulness and harmlessness--a good solution should reduce the potential for harm.

The general rule would continue to be "Avoid using Bots" unless it is the only practical option.

One example of the user experience for how the small-town data would optimally work, for example, would be to have a reasonably limited number of pages listing all the possible towns (perhaps by state), with links. If someone clicks on that link, they have the option of importing the small-town-bot entry.

One implementation is the one that others have mentioned: tag the entries as "imported entries". Bots that add entries without tagging them as such would be banned, unless there is functionality for any Wikipedian to mark the bot as an "importer" (that functionality lends itself to abuse, since people could mark contributors they don't like as importers, so it's probably not a good idea unless that abuse is expressly forbidden and harshly dealt with).

It may be necessary to require/request that bots be registered beforehand, so that bots that run amok can be blocked. At a minimum there should be the encouragement to warn people about importation projects. Registration allows for quick reference by everyone, accountability for large edits, and the ability to block bad things quickly.

Imported entries would

  • be marked as ? pages (or at a minimum, ! pages)
  • not be listed on default RecentChanges
  • be listed on RecentImports (or some such) and/or BotsInProgress (or Currently Running Bots)
  • not appear on default RandomPage
  • show up under searches-by-name normally
  • appear if someone clicks on a link to the entry.

When someone clicks on that link, they would get an entry

  • clearly marked as an automated addition
  • with the choice to untag the entry on the edit page

People who want to hand-import entries from public/GFDL sources would use the same tag (thus "imported entries" rather than "bot entries". Are there any problems with killing two birds with one stone?)

A benefit of bot/import-registration is that users could change their preferences to make the bot's additions show up as normal links, or show/hide from Recent Changes.

It might be necessary to develop a "revert-bot"/"revert-import" functionality.

A scripting-friendly remote interface to encourage well-behaved bots might be a good idea as well.

Proposal #2

Similar to the proposal above, but without the "imported entry" tagging. This is a fundamental difference. There should *not* be a special feature for a user to optionally "import" data into an article. Data from bots or humans should be judged on its own merits and not by who put it there.

Proposal #1 assumes that bot entries are a subset of imported entries.

Proposal #3

No special marking of bot articles, but all bots should "register" their plans in some place. Possibly, this could be technically enforced by having some kind of code which indicates an allowed bot (I'm not sure exactly how this should work).

Needs fleshing out: this proposal does not seem to answer drawbacks 1,2,3,5

Uploadable Bots (server processes)

Only processes that run on the Wikipedia server would be allowed, written in a restricted but useful language.

See m:Uploadable Bots for details.


"No bots" supporters: The Cunctator

"Keep bots" supporters: The Cunctator

"Avoid bots" supporters: The Cunctator Ram-Man (Only if it is better done manually. Humans are better at such things)

"Bots with restrictions" supporters: The Cunctator Ortolan88, Ram-Man (proposal #2), Clutch, user:sjc, --KQ Chas_zzz_brown, fonzy (with restrictions as in Proposal #1), Kpjas, The Anome

"Uploadable bot-scripts" supporters: (explained far below as talk), Rlee0001


What is the Wikipedia policy on automated page creation? I notice that Ram-Man is currently entering statistics for every town in the US using some sort of script (unless he's a very fast typer.) I'm not sure that this is a particularly great idea. In the more general sense, I think that script-generation could get us into a lot of trouble (how do you revert vandalism when it's spread across five thousand pages?) Is there a page that would be more suitable for this discussion? Dachshund

I think it is fine. There have been a couple bumps in the road, but his bot's entries now appear to be correctly named, wikified, NPOVed and also have good and factual information. The only real policy on auto page creation is that you need to be very careful when you are doing it. For example, somebody started importing hundred-year-old Easton's Bible Dictionary entries via bot and that caused an uproar: The entries were highly POV, incorrectly named, written in a pedantic Victorian prose and were incorrectly wikified (self links, multiple links, incorrectly named edit links...). The bot's IP was temporarily blocked and we worked everything out with the bot's creator on the Wikipedia mailing list. The city entries don't have these problems and also have the bare essentials that are needed for any city article: population and geography. And on top of this there is also demographic information. When complete this will be a unique resource on the net. What is better is that whenever somebody in the US looks up their town they will find an entry in Wikipedia (and hopefully they will add some historical info to the article after finding it). If an actual vandal uses a bot then we will block that bot's IP. --mav

Three thoughts on batch page creation:
1) Special:Recentchanges is presently useless due to the town & county bot. This is the source of the irritation which led me to notice that:
2) The main page's count of Wikipedia articles is increasingly inflated -- we've gone from 60k to 70k awfully quickly, but:
3) These thousands of town and county pages are not encyclopedia articles, nor are the bulk of them ever likely to become same. They are atlas or gazetteer entries that have been converted to useless paragraphs rather than useful tables. The data are potentially valuable as such -- perhaps there should be a WikiAtlas? -- but they are no more encyclopedic than would be batch-added dictionary entries. --FOo
They are a bit telephone-directory-ish. I hope in future people will add colour and detail to them. It would be good though if bots like this went a little slower -- that was discussed before with the Easton's bot: only 100 every hour or less, please! Otherwise, as said above, RC is unusable, even with number of edits set to 1,000. We may have added 10k articles, but we haven't really added any value. Hundreds of core topics are still uncovered or amateurishly-written, and here we have a page for every one-horse town across the US. It won't project a terribly good image of wikipedia; that concerns me. -- Tarquin 20:20 Oct 21, 2002 (UTC)
I disagree that these entries are harmful. I just came across Auburn, California which is a small city near where I live. I've been meaning to write an article about Auburn ever since I started the project in January but never did so because finding boring yet vital to have up-to-date population and geographic information isn't fun at all -- this is a perfect thing for a bot to do. So since all the boring to find info was already there I simply added a few external links, a history section and a short line in the intro on why this city is interesting. Granted many small towns won't ever be updated with more than what is there now, but most towns don't have much of any historical significance outside their own counties. So what if they exist in our database? They have correct info, are correctly wikified and named. Having every town, city and village in our database ensures that anybody in the US who is looking up information on their hometown via an external search engine will find that info here - which makes these entries an important reader/contributor recruitment tool. Many of the same people will then update the articles with historical and other information. Yes, the US Census has this info but it isn't very readable or accessible and it can't be added to or its presentation improved. Can you think of another resource like this on the net (with 2000 data)? With that said, I also agree that Recent Changes is useless while Ram-Man's bot is at work. I wish there were a back-end way to import the 20,000 remaining cities/towns/villages/places. --mav

They are not bad pages -- but it's the Mithril argument again: newcomers clicking Random Page, finding pages and pages of middle-earth may think "encyclopedia! tolkienopedia, more like!"; finding hundreds of jargon file pages may think it's just a ton of hacker slang; finding these thousands of pages may think it's largely an encyclopedia of US towns. I am probably overreacting a bit, but we seem to be leaning every which way but toward serious core encyclopedic subjects: Arts, literature, science. There are plenty of minor novelists of the past centuries we don't say anything about, who are more important than these towns. I'm not against these town pages, but we must balance them! -- Tarquin
Obviously I agree with having the articles since I am the one making them. One thing I could do would be to make all the changes minor and then those changes could be filtered out by those who set up the option in their preferences. They are not minor, but maybe no one cares.
Since starting to add the information I have gotten comments from a number of people. One common idea is that without the articles in some form, people don't bother to add one-line descriptions about a town because they want to avoid stub articles. I have had a number of people say that now they can add some information because the articles exist. In fact RC shows that people have been modifying their own town articles and adding some misc information. Unlike Maverick, I think that with an influx of users, if many of them update their own home cities, then we can add quite a bit of new information. Also there is the possibility of adding other information automatically such as latitude and longitude, county seat information, etc.
I would vote to modify the "random" option to give city, state articles a lower priority. -- Ram-Man
You mean "like maveric" right? I was arguing for keeping the entries and allowing you to finish. --mav
Well you think that most of these entries will never be filled up with data. You once thought that these entries would never even be created. I think I did this just because you said it couldn't be done. So while I agree with you on everything else, I don't believe that this wikipedia cannot grow to have those entries become much more complete entries. -- Ram-Man
There's nothing precisely wrong with the articles. In the future, perhaps automated pages could be saved on some other website as a static page, and only a link added from the Wiki page? Dachshund
I certainly don't think we should wipe them. -- Tarquin

Although I have been (and continue to be) a vocal opponent of automatic content creation and editing processes on Wikipedia, I think creating these articles is on balance a good thing. As Ram-Man suggests, they make good "seed" articles for people to add a sentence or two about their own town, and as long as they don't interfere with existing articles, that's good. However, I'd like to see the bot slowed down for two reasons: one, there is a strong presumption against bots here in general; the burden of proof is on the bot-maker to demonstrate that the bot is (1) useful, (2) harmless, and (3) not a server hog. If there's any doubt about any of these, the bot should be slow enough that humans have time to find problems, report them, and get them fixed. Secondly, the "Recent Changes" page is an important part of the Wikipedia user experience, and the fact that it is essentially useless while the bot is running is very annoying. Slowing down to, say, a page a minute would greatly improve the usability of the system.

At any rate, I think it meets the "useful" test, and as far as I can tell from server logs, the bot isn't a major factor in server load, so that's good, but I think "harmless" should include not hogging the recent changes list, so let's keep it running, but at a leisurely pace. --LDC

It should be noted that I made a mistake of invalid data in some 2,000 articles. The bot repaired all of these. That is to say that if I make a stupid mistake, I will do my best to fix it. However going slowly has an important disadvantage, as pointed out by Maverick. The orphans page, which a lot of people apparently use, is full of lots of cities and townships. To fix these, I have to use the bot to update all the entries. At 1 modification per minute, this will mean that the orphan page is going to be unusable for possibly weeks or months. As I have suggested, I can make my changes minor and people can filter some of them out (partial solution). Going slow severely limits the progress I can make at fixing the various quirks that are introduced in all these entries because I simply wait for it to finish. This should be noted!

Let's say for instance that I use one modification every 30 seconds. That would be about 3,000 modifications per day. Essentially it would take me about 2 weeks for any change I decide to make to all the entries, such as adding latitude and longitude or fixing mistakes. -- Ram-Man
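As a rough check of the figures in this exchange, here is a minimal Python sketch of the arithmetic. The total of 35,000 entries is an assumption borrowed from the city count mentioned elsewhere in this discussion, and the function names are purely illustrative.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def edits_per_day(seconds_between_edits: float) -> int:
    """Whole number of edits a bot can make in 24 hours at a fixed interval."""
    return int(SECONDS_PER_DAY // seconds_between_edits)


def days_to_finish(total_entries: int, seconds_between_edits: float) -> int:
    """Days (rounded up) needed to touch every entry exactly once."""
    return math.ceil(total_entries * seconds_between_edits / SECONDS_PER_DAY)


if __name__ == "__main__":
    total = 35_000  # assumed number of entries, see the note above
    for interval in (30, 60):  # seconds between modifications
        print(interval, edits_per_day(interval), days_to_finish(total, interval))
```

Under these assumptions, a 30-second interval gives 2,880 modifications per day and roughly 13 days for a full pass, in line with the "about 3,000 per day" and "about 2 weeks" estimates; a one-minute interval roughly doubles the duration.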

Theoretically speaking, we could set something up where the bot's modification times are fudged back a bit, so they wouldn't cover up the actual most recent changes. I don't know if that's a good idea, it's just a thought. Bring it up on the mailing list. --Brion
Hm. Perhaps there should be another option in our prefs where we can turn off anything submitted by a registered bot? Just give the bot's IP to the developers and then perhaps they could make each entry you submit marked with a B for bot. Displaying bot edits would be turned off by default in user preferences. But it is important that bots get registered somehow before this is allowed. --mav
I like the idea of registering bots, however, when could such a feature be implemented? When might *any* solution be accomplished? -- Ram-Man
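The registered-bot idea above could look something like the following Python sketch. It is not a description of any existing Wikipedia feature: the `Change` record, the `recent_changes` function and the `show_bots` preference are all hypothetical names, and the bot address is a placeholder from the documentation IP range.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Change:
    page: str
    author: str  # user name or IP address of the editor


def recent_changes(changes: Iterable[Change],
                   registered_bots: Set[str],
                   show_bots: bool = False) -> List[str]:
    """Render a recent-changes list, marking registered bots with 'B'.

    Bot edits are hidden unless the reader opts in via show_bots,
    mirroring the "off by default" preference suggested above.
    """
    lines = []
    for change in changes:
        is_bot = change.author in registered_bots
        if is_bot and not show_bots:
            continue
        flag = "B" if is_bot else " "
        lines.append(f"{flag} {change.page} ({change.author})")
    return lines


if __name__ == "__main__":
    bots = {"192.0.2.1"}  # placeholder address of a registered bot
    log = [Change("Auburn, California", "mav"),
           Change("Erick, Oklahoma", "192.0.2.1")]
    print("\n".join(recent_changes(log, bots)))
    print("\n".join(recent_changes(log, bots, show_bots=True)))
```

Keying the filter on a registered list rather than on arbitrary user names keeps it distinct from a general "hide this user" feature.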
Please keep tables as tables, instead of converting them to prose.

Anyway, it would seem that Wikipedia was never designed to handle bots.

And while you're at it, why limit it to the USA? Why not do England, Canada, Australia... why limit it to English-speaking countries? Why not do the whole world?? Clearly there is something absurd about this!

Besides, if you want to know about a town, do you really want just a bunch of numbers? Or do you want to know what is actually IN the town, such as malls, arcades, parks, etc.? User:Juuitchan

I would assume that Ram-Man is doing the US because that's what he's got census data for and that's what he's interested in. As far as additional data, yes, we want all that. But we can't have everything at once, now can we? --Brion 12:03 Oct 22, 2002 (UTC)
If I were to post every baseball score of every Major League baseball game ever played, with all the statistics and all that, it would be roughly analogous to what Ram-Man is doing. --User:Juuitchan
Not at all. Any encyclopedia would have the census information about the population, etc. The "problem" with the rambot is that it doesn't distinguish between the major leagues and the low minor leagues (and it doesn't fill in county seats), but just as I have beefed up his form letter for Newton, Massachusetts, where I live, and Valdosta, Georgia, where I was born, you can do the same for wherever you live and eventually we'll have them all, and, if you don't, we'll still have the basic information about your town. Ortolan88
County seats are on the agenda, along with latitude and longitude. And I agree, if I were the only one to ever work on these articles, *maybe* it would be a terrible idea. But I am naive and hope that other people beef up articles! -- Ram-Man
Speaking of baseball statistics, what is so terrible about adding them? I can understand if the only thing you added was the so-called unimportant ones, but this is supposed to be an all-encompassing (read: never-ending process) encyclopedia. -- Ram-Man
What I would really like is a complete set of football (soccer to you Americans) statistics. I will see about knocking up a bot to do this stuff. user:sjc
See above cited Valdosta, Georgia for some truly amazing football (football to us Americans) statistics. Ortolan88
I suspect that bots will do a lot of work on this encyclopedia as it grows into a bigger thing. I don't think this is all bad and it is to be expected. We have articles on places like Y, Alaska which has about 1,000 people in it. Other encyclopedias would consider such a place worthless. But it has a cool name and 1,000 people care about it! This encyclopedia can grow huge, but without people fleshing out articles, it will never be good enough. That's why I don't always *just* do geographic topics. -- Ram-Man
I for one am ecstatic to have these articles here. Even if they're light on local color, they're something, and the stats do give some general gist of the locality. I recall looking up my home town in Encarta lo some years past and being thrilled at the one or two meagre sentences I got along with a woefully out of date population figure; I'm sure I'm not the only one who looks up local stuff when discovering a new encyclopedia, and having a tantalizing beginning is both heartening (Wikipedia cares enough about my hometown to put in stats!) and encourages to direct action (and I can add more info!) Basic info on other subjects can I'm sure be similarly useful. --Brion 11:06 Oct 26, 2002 (UTC)
I just added to the Erick, Oklahoma (pop. 1023) entry that two wacked-out country-music stars came from there. Ortolan88

We need a policy on bots. It was grand and fun running amuck with my bot, but it did inconvenience some people. I'd suggest that if some bots are allowed (like mine for the geographic articles), that they be more controlled, for instance, have a section on the subject page which lists the currently running bots, the IP addresses (so blocks can easily be made), and what the bot is doing (explanation). -- Ram-Man


Ram-Man, I have little interest in your articles, but I wouldn't call them inappropriate. I actually find your effort comforting and reassuring. My only complaint is that the Recent-Changes page is pretty useless. Maybe we could petition the developers for a feature where we can filter out the entries by certain users? That is, if I don't want to see entries by Ram-Man, I could enable that in my Preferences.

But I also like the idea of having "bot" accounts identified; then we could have a Recent Changes page, and a Recent Changes by Bots page. That would clear up all my beefs; I don't really feel comfortable with adding a feature for users to filter out other users in the list of changes that they see. --Clutch 03:14 Oct 26, 2002 (UTC)

I think we have to wait and hope one of the developers finds time to do such a thing (It has been suggested above). -- Ram-Man

To address the issue of "random" pages and page count, for statistically scraped pages such as baseball games or towns: Could we set a flag indicating that (1) the page was generated by a bot; (2) the page has never been edited by a non-bot (henceforth called "somebody" ;) ).

Then, allow random pages and a page count only for pages that are not so flagged.

This would still allow people to look up and edit their hometowns, in order to add the information which makes it truly an article, and not "just" a row in database (albeit, a very nicely formatted database display).

It also allows us to consider a set of bot-generated pages as a single article; essentially equivalent to an article consisting of a (huge) table of data entries. We have the convenience of viewing each row in the table in a nicely formatted fashion.

Once a page is edited, we have distinct information supplied; as well as a smidgin of evidence that at least somebody gives a hoot about Ice Worms, Alaska (or game 3 of the 1907 hockey playoffs). Chas zzz brown 03:21 Oct 26, 2002 (UTC)
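The two-part flag described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of the wiki software: the `Page` record and its field names are assumptions, and the flag values given to the example titles are made up.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    title: str
    bot_generated: bool = False    # (1) the page was created by a bot
    edited_by_human: bool = False  # (2) "somebody" has edited it since

    @property
    def untouched_bot_page(self) -> bool:
        """True while the page is still only a formatted database row."""
        return self.bot_generated and not self.edited_by_human


def countable(pages: List[Page]) -> List[Page]:
    """Pages eligible for the article count and for Random Page."""
    return [p for p in pages if not p.untouched_bot_page]


def article_count(pages: List[Page]) -> int:
    return len(countable(pages))


def random_page(pages: List[Page]) -> Optional[Page]:
    candidates = countable(pages)
    return random.choice(candidates) if candidates else None


if __name__ == "__main__":
    pages = [
        Page("Newton, Massachusetts", bot_generated=True, edited_by_human=True),
        Page("Ice Worms, Alaska", bot_generated=True),  # illustrative flags
        Page("Periodic table"),
    ]
    print(article_count(pages))
    print(random_page(pages).title)
```

Every page stays reachable by link or search; only the count and the random selection skip entries that no human has touched.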


I changed the voting categories above. "Supporters of 'avoid bots'" and "Opponents" weren't going to be clear categories. I also added a third vote, "Support with restrictions", by which I mean things like, require labelling of bot-produced articles, hold bot-produced articles back until requested by reader, as suggested passim, such as just above here. Ortolan88

Maybe this whole naming thing is going to be messy anyway! The categories are not necessarily mutually exclusive. I just copied the format from other policy pages! -- Ram-Man

I meant that "I support avoiding" and "I oppose avoiding" read like conceptual "double negatives" and so I restated positively to make them clearer. The residual category I added so there would be something I could vote for. Ortolan88


"Keep bots with restrictions" is obviously a difficult category. Some bots -- the "automatic spell checker" -- I would oppose entirely. Others -- "the ancient Bible dictionary" -- I would make into some kind of request-filling engine, if someone wanted an article on Hagar or Haman she could request it without our accepting the entire musty content of that old dictionary. On the towns and counties, all I would expect would be a note "This entry is derived from census data." with a suggestion that users are invited to extend it. In other words, the restrictions would be imposed on a bot-by-bot basis. Ortolan88

When I was running the bot I posted my IP address on my talk page in case anyone needed to quickly block the bot. I don't know if anyone saw it, but someone in the discussion mentioned registering a bot before using it. Now whether that means programming a special feature or merely good-faith posting of the IP address on the Bot page is not for me to decide, but it was one thing that was requested in case things went wrong. -- Ram-Man
That seems fair to me (simply posting the IP in a good faith effort). Other bots will be apparent from Recent Changes, unless it's randomized somehow. Anyway, if one is quick enough to be noticeable, it's quick enough to see and block, and if it's not, then it's slow enough we can revert. --KQ

I should probably note my methodology: I downloaded all the mass amounts of information from the United States Census Bureau. It was all in multiple files and a mess. I had to combine it all together, clean it up, etc. I did all this in a spreadsheet. I also created a number of new categories like % water which I could easily calculate from the data. In doing so there were a number of naming problems. I still have not worked all those problems out, but I was *aware* of them. Then I exported the data and moved it into a MySQL database. From there I created all the information for U.S. Counties. I did all of them by hand (some 3,000) and it took a long time. I decided to write a bot to do what I would do anyway but on a larger scale. Nevertheless, I added the cities for Alaska and Alabama by hand to make sure before I ran the bot that it wasn't doing stupid things. The rest of the errors were caught by others watching (or not yet caught by anyone?). I like this place too much to not be careful, but I probably could have been even more careful! It should be noted that someone noted an error that around 2,000 articles were bad. *ouch*. It was actually easily corrected, but it did raise a red flag! -- Ram-Man


Haven't read all of the above (yet), but some remarks:

  • If we run bots without a user name, every admin can block them whenever necessary. This is (right now) not possible for usernames.
  • There might be a note on this page that a certain bot is currently active. This way we can still see who is operating the bot
  • Let all edits by a bot be done using "minor edit" on - that way we can at least read RC, by filtering out the minor edits (use your user preferences for that).

Jeronimo


I think there's more to be won by regulating the kind of bot entries rather than dealing with them. I think the "ShitHole, SomeState" articles outrank many other articles in quality, and I see no reason to mark them as different. It would be completely different if somebody were to add articles with "ShitHole is a truckstop in SomeState, USA." as the contents.
So, I'd say, don't do anything with them at all, as long as their content is reasonable and encyclopedic. I don't even care about the random page feature. The only thing I use it for is to find stubs. These articles are also stubs. Granted, I might not be able to tell anything about most of them, but that's not a problem to me. So if the Random function were to change, it should be coupled to articles being (or not being) stubs, and not to being uploaded by a bot. Jeronimo

There definitely should be pressure for imported entries to be of high quality. Bot-registration would allow individual users to pick and choose which entries they consider ready for prime time.

Note that the proposal would have the imported entry show up on a search, so the content wouldn't be hidden from someone just looking for specific information on that subject. --The Cunctator

My point is: why separate perfectly good-quality stub articles from other articles, when there are loads of other crap on Wikipedia added by normal persons? We might just as well mark articles not viewed or edited by a second person. Jeronimo
The short answer is that bots don't sleep. A major part of the reason that Wikipedia's quality doesn't devolve is that all parties have essentially equivalent resources--even the most prolific individual can't do more work than, say, 10 people. But bots are a completely different story.
Note that the proposal doesn't call for a primary distinction of bot/non-bot--it calls for a distinction between imported/non-imported. --The Cunctator
Once again, the fact that bots add articles quicker than non-bots has no implications on the contents of the articles. We should be just as vigilant for normal persons to add crappy, bad, POV, copyrighted, etc. contents as for bots, or importers, or whatever. Therefore, there's no reason to treat such articles differently. Jeronimo
I don't know that I agree that there are no implications on the contents of the articles. I think that the small-town bot, and bots in general, are most useful (and most likely to be implemented) when essentially importing/reformatting tabular data (such as the census data entered by Ram-Man).
If a bot were able to actually write articles even as bad as the worst stub, that would actually be an achievement, since the stubs rarely have a common theme or format (apart from being short).
In a "conceptual space", stubs are generally far apart in terms of the actual content they cover; whereas bots by their nature are going to focus the added content in one area. To my mind, bot generated articles are just tabular data, reformatted in a pretty way; but they might as well be a single, extremely long article which presents a table. That's not a general characterization of stubs - 100 stubs couldn't easily be combined into a single article (let alone 20,000). Chas zzz brown 02:07 Oct 28, 2002 (UTC)
First of all, stubs combined into single articles are not useful. Someone looking for an article on one of the small towns would not find an article to satisfy themselves. It would also be a mess to create full articles. People wouldn't even bother trying to find the stub.
Secondly, the tabular data *is* interesting and useful in various cases (like non-tabular data!) and as such they should be formulated in a way that is approachable. When I first started adding statistical data (to U.S. states) I had Zoe tell me that to her the numbers are not really meaningful because she is not particularly number oriented. For people like this, tabular data is nearly useless, so putting it in an accessible format is important. Very few people would look at tables of data (Trust me, 35,000 city entries make for very large tables!)
Thirdly, and I should have stated this earlier if it was not clear, all the entries created would have been done eventually by hand had I not done it with a bot. I just employed my programming skills to save me a *lot* of time. Don't believe me? I did 3,000 counties and all the cities in Alabama and Alaska by hand. I'm just obsessive that way, that and I did it just because I could.
Fourthly, a good point was raised. These "stubs" are larger than the current median and average article size. Now even though they are stubs, they are more useful than a lot of other articles that are created by hand. The big issue is that people are biased against bots. No one has complained one bit about the county articles but I hear a lot of complaint about the bot-added cities. I bet no one even knew that the Alabama and Alaska entries were entered by hand! The articles are almost equivalent, but people don't like one because a bot did it.
Fifthly and Lastly, a bot creating *bad* articles can be far worse than any vandal destroying pages. -- Ram-Man

I agree that the problem of bots is not necessarily that they are mechanical. I see the core issue as that of imported material. (There is the secondary but real concern of bots that run amok--but that's not an issue with the entries themselves.)

It seems to me that the proposed policy above does a reasonable job of answering (nearly) all concerns without placing a burden on people who want to add these kinds of entries. --The Cunctator

Thoughts on the proposal: I don't like the idea of flagging entries as bot entries or import entries. Data is data and it should be treated as such as I mentioned above. There should be no choice to import or not. Pages need to be static and able to be easily changed at will. On the other hand, having some way of marking entries would be useful so that large edits can be reverted in the event of a large-scale mistake because they are already pre-filtered. One thing I do think is mandatory is some way to clean up the Recent Changes. This means having some way to register a bot (just a list maybe?) and some way to either filter out the bot entries or have a separate recent changes for bots. Let's also remember the spirit of Wikipedia. We are not trying to make hard and fast rules. Easy guidelines should be good. That is why I am in favor of a system that is based on good faith. I will post a more complete proposal that I would approve of soon. -- Ram-Man
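The registration-and-filter idea above can be sketched roughly as follows. This is only an illustration under assumptions: the `Change` record, the `REGISTERED_BOTS` list, and the account name "Rambot" are hypothetical, and nothing here reflects how the Wikipedia software actually stores or filters Recent Changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One Recent Changes row (hypothetical structure)."""
    page: str
    user: str


# Good-faith, self-maintained list of bot accounts (assumed; just a plain list).
REGISTERED_BOTS = {"Rambot"}


def split_recent_changes(changes, bots=REGISTERED_BOTS):
    """Return (human_changes, bot_changes) so each can be shown separately."""
    human = [c for c in changes if c.user not in bots]
    bot = [c for c in changes if c.user in bots]
    return human, bot


def pages_touched_by(changes, bot_name):
    """List pages edited by one bot, e.g. to review or revert a bad run."""
    return [c.page for c in changes if c.user == bot_name]


changes = [
    Change("Union, Mississippi", "Rambot"),
    Change("Newton, Massachusetts", "Ortolan88"),
    Change("Supai, Arizona", "Rambot"),
]
human, bot = split_recent_changes(changes)
```

Keeping the registry as a simple set matches the "just a list maybe?" suggestion: the same list drives both the filtered Recent Changes view and the lookup of pages to revert after a large-scale mistake.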

1. I think there may be some confusion of meaning here--I too think that all entries should stand on their own merits. The proposal #1 wouldn't distinguish between bot entries and import entries.
It assumes that bot entries are a subset of imported-data entries.
Do you believe this is an invalid assumption?
2. I moved the proposals off the main page because there's a fundamental point of contention. I also tried to merge your points with the first proposal respectfully. There only seemed to be the one major point of contention. Of course, you should certainly revert the merging if you believe it was done improperly. --The Cunctator

A stylistic note: One of my objections to Ram-Man's town-and-county bot above is that it took what was obviously originally tabular data in its data source (for instance, the population figures) and converted it into boilerplate paragraphs with the numbers filled in. (E.g., "As of 2000, there are n people, m households, and x families....") In my view, this lessens the data's usefulness, by making it more difficult to read and compare.

As far as I can tell, the argument for using sentences rather than tables is that it looks more like ordinary writing to a first glance. However, it is not ordinary writing. It is just tabular data that has been stripped of a more understandable layout and variable-interpolated into boilerplate sentences.

Usually, when a reference work is presenting the same pieces of numeric information for a large number of entries, it will use tables rather than boilerplate sentences. Tables are more quickly read, and more rapidly recognizable as relating to one another and containing corresponding data from page to page. A Wikipedia example of this practice would be the entries for the chemical elements. Many of these were originally written with all their numerical data in paragraphs. They are now being rewritten with tables, and it is a great improvement.

I'd like to suggest that this be considered a precedent (indeed, a strongly encouraged standard) for any bot creation of large numbers of pages containing the same fields. --FOo

Ordinary writing is easier to read. Ortolan88
No it's not, tables are. That's why they were invented. Want an example: Look at Portland, Maine. Robert Lee
At least it's a fixed-width-font table and not one of those uneditable HTML horror shows. But it doesn't have as much information as the ordinary writing below and it requires "programmed looking". There's a reason ordinary writing got ordinary. Ortolan88
The opening paragraph is far more interesting to read than the table. Wikipedia is not a database -- Tarquin 21:27 Nov 5, 2002 (UTC)
Actually it is. And so are paper encyclopedias. Dictionaries too. And showing tabulated data in tabulated form is not only the de facto standard way of showing tabular data, it's also the method used by every other outfit out there...Encarta, Britannica, etc... And using tables where appropriate is not what makes Wikipedia a database. And lastly, while a paragraph might be more interesting for you to casually read, a user searching Google for the number of households in Portland, Maine will find it easier to spot the data in a table than they would having to read a text consisting of 5 or 6 paragraphs which shows the same information padded with lots of fluff. JMO. Robert Lee

For what it's worth, I like tables a lot and tend to overuse them. However, I didn't use tables with the bot because I received feedback from some people that tables were not a very accessible format. One reason that I would raise is that the amount of data is quite large and there would be too many entries in the table for the table to be useful. In the articles, the data is grouped into different paragraphs representing similar data for easier scanning. Many people don't like tables, especially large ones. I for one do not like large tables, and so some of this data would have to either be left out (which I think is not an option) or put in prose form. If you are going to put the minor stuff into prose, you should also put the important stuff. -- Ram-Man

Tables are as accessible as they need to be. They are precisely appropriate when thousands (or even tens) of pages need to refer to the same fields -- particularly numerical fields -- and those fields need to be obviously comparable with one another. Comparing figures in a column, or corresponding entries in similar tables on two pages, is easier and less error-prone than visually extracting the correct figure from the middle of a paragraph.
Layout niceties (including color and subdivisions) can make them more approachable. Again, the example of the chemical elements pages: a table is used for the numerical and formal information, while paragraphs are used for the unique information such as the practical uses of the respective elements.
Bot paragraphs are not bad per se. It would be valuable, IMHO, to have a bot-written introductory line or two in each town and county page. An example might be: "Great Barrington is a small incorporated town in Berkshire County, Massachusetts, USA." This is a simple, readable, true sentence. It can also be automatically generated from data (given a predetermined judgment call as to what size of town counts as "small"); it does not overwhelm the reader with figures; and it cannot be more presentably represented as a table.
In contrast, the sentence "In 1994, Great Barrington had 3898 registered voters, with 1204 (30.8%) being Democrats, 587 (15.1%) being Republicans, 4 (0.1%) being Other parties, and 2103 (54.0%) being Unenrolled Voters." [1] is also a true sentence, but much less simple and readable. Moreover, since it is natural for people to want to compare these figures one with another and against those from other towns, this information would be more usefully conveyed in a table layout, as below:
Registered voters in 1994
Party affiliation    Voters    Pct.
Democrats              1204   30.8%
Republicans             587   15.1%
Other parties             4    0.1%
Unenrolled Voters      2103   54.0%
Total                  3898
The point has been made that tables like this might dominate the pages for most towns, since there are no human-written and informative paragraphs for most towns. So too do tables dominate the pages for less fully described chemical elements. (They would not, presumably, dominate the page for New York City any more than they do for Iron.) This is not a bad thing. If most towns are really so obscure that no sentient being wants to write about them, perhaps they don't need their own Wikipedia pages. I don't imagine any of the major encyclopedias have entries for Great Barrington. :) --FOo
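The two-part layout suggested above (one generated introductory sentence, then the figures in a table) could be sketched like this. It is a rough illustration, not the rambot's actual code: the record layout, the function names, and the `SMALL_TOWN_MAX_POPULATION` cutoff are all assumptions, and the `population` value is a placeholder rather than census data. The voter figures and percentages are copied from the 1994 example above, not recomputed.

```python
# Assumed cutoff for calling a town "small"; a judgment call, not from any source.
SMALL_TOWN_MAX_POPULATION = 10_000

town = {
    "name": "Great Barrington",
    "kind": "incorporated town",
    "county": "Berkshire County",
    "state": "Massachusetts",
    "population": 5_000,  # placeholder for illustration only, not a census figure
    "voters_1994": [
        ("Democrats", 1204, "30.8%"),
        ("Republicans", 587, "15.1%"),
        ("Other parties", 4, "0.1%"),
        ("Unenrolled Voters", 2103, "54.0%"),
    ],
}


def intro_sentence(town):
    """Build the one-line introduction from the record's fields."""
    size = "small " if town["population"] <= SMALL_TOWN_MAX_POPULATION else ""
    description = f"{size}{town['kind']}"
    article = "an" if description[0].lower() in "aeiou" else "a"
    return (f"{town['name']} is {article} {description} in "
            f"{town['county']}, {town['state']}, USA.")


def voter_table(title, rows):
    """Render (party, count, pct) rows as a fixed-width text table."""
    total = sum(count for _, count, _ in rows)
    lines = [title, f"{'Party affiliation':<20}{'Voters':>8}{'Pct.':>8}"]
    for party, count, pct in rows:
        lines.append(f"{party:<20}{count:>8}{pct:>8}")
    lines.append(f"{'Total':<20}{total:>8}")
    return "\n".join(lines)


print(intro_sentence(town))
print(voter_table("Registered voters in 1994", town["voters_1994"]))
```

The sentence carries only the identifying facts, while every comparable number stays in a column that lines up the same way from one town's page to the next.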

Does the rambot skip existing entries? There is no statistical information whatsoever in Cambridge, Massachusetts, arguably a more important place than Chitlinswitch, Tennessee. (The Cambridge article is also kind of lousy, but that's a different issue.) Ortolan88 17:26 Nov 7, 2002 (UTC)

I have the same problem with data on municipalities of the Netherlands; one of the nice things of using a bot could be that you get a blanket coverage. Apparently entries are skipped when an article already exists, but I think it would be better to add the data at the end, perhaps with a remark that it has been added automatically, in case somebody would wonder why it is not nicely merged with the rest. The merging can then be done manually by anyone reading the article. Patrick 20:57 Nov 7, 2002 (UTC)
The rambot skipped those entries. I have around 300 or so articles that were skipped that have not had the data merged in. I have it in a text file on my computer and I am merging the data into the articles as I have time to work on it. A number of cities have already been updated. I could just append the data to the article and say 'please merge this', but I'd rather see the articles finished. There are too many areas of Wikipedia that look incomplete because of things like that. -- Ram-Man
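The skip-and-merge-later behaviour described here might look something like the sketch below. The function name and the dictionary-of-articles representation are hypothetical; the rambot's real logic is not shown in this discussion.

```python
def plan_bot_run(generated, existing_titles):
    """Split generated articles into ones to create and ones to merge by hand.

    generated: dict mapping article title -> generated text
    existing_titles: set of titles that already have an article
    """
    to_create = {}
    to_merge = {}
    for title, text in generated.items():
        if title in existing_titles:
            # Never overwrite a human-written article; hold the data for manual merging.
            to_merge[title] = text
        else:
            to_create[title] = text
    return to_create, to_merge


generated = {
    "Cambridge, Massachusetts": "(generated census text)",
    "Union, Mississippi": "(generated census text)",
}
existing = {"Cambridge, Massachusetts"}
to_create, to_merge = plan_bot_run(generated, existing)
```

Patrick's alternative would replace the `to_merge` branch with an append to the end of the existing article plus a note that the data was added automatically; the trade-off is blanket coverage now versus articles that look unfinished until someone merges the text.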

These bot articles are not articles at all. They are little or nothing other than mere lists of numbers. Therefore, they should not be included in the main article count. But if you want numbers, and you want utility, then why not ZIP codes? --User:Juuitchan

This question has been addressed many times already I think. Things like ZIP codes and other information are not that hard to add later. The fact of the matter is that the information added was the information that was currently available. If you want zip codes, then add them. Otherwise wait until I do it. BTW the U.S. Postal Service definitions of places (zip codes, etc.) do not match with the census bureau's definitions or legal definitions. So such information is not easily made accurate. That is why I have not added it yet. The Wikipedia mailing list had a discussion on what should be included in an article count. If you don't like the "mere list of numbers" and those who disagree with you, how about updating some of the articles to include other information? -- RM
Just linked Bitterroot to Missoula, Montana, then added to the Missoula article the fact that it is the only place that bitterroot (the state flower) grows. Took about as long as it would take to repeat an oft-made complaint against the rambot, and much more interesting, fun, encyclopedic, and productive. These articles are the foundation for the encyclopedia of the future. Use them. Improve them. Ortolan88
I don't particularly like the fact that every other random article is a place in the US, but I recognise that this is a temporary thing. And the articles are generally easy to improve; a good way to do it is look at "what links here" -- information can be acquired from those other pages. --Sam

Does anybody have an explanation for:

They all have a population of zero. If nobody lives there, is there any reason to keep these articles? They seem kinda silly, all this talk about what is apparently a piece of wilderness arbitrarily designated a "community" (I'm referring to Oklahoma here). Thoughts? Tokerboy 07:44 Nov 1, 2002 (UTC)

They're harmless and not worth the effort for removal. One in fact was a census mistake; Belleair Shore, Florida was listed by the US census as having a population of 0 which was a shock to the local residents (51 households). A couple more are probably also census mistakes; Supai, Arizona went from 423 to 0 in 10 years and Sportsmen Acres Community, Oklahoma went from 181 to 0. If these are mistakes then we should find out and if they are true then it would be interesting to find out why these places lost their populations.
Also, whether or not somebody lives at a place shouldn't be a reason for deciding whether we should keep or delete an entry. In fact many central city/downtown areas in the US don't have anybody officially living in them and yet these areas are very important. I'm sure many of these places have interesting history associated with them. Most look like they actually have people and even industries working in them (just like the US downtowns). --mav
http://www.fryeisland.com/
Funny, for a town with no population, you wouldn't expect them to have a domain name, 2 ferries, a 9-hole golf course and I bet those town meetings are a REAL snore. Actually, Frye Island is not only a real town, but it is a world famous vacation hotspot drawing in tourists by the hundreds each year from all over the world. In my opinion, not only should Frye Island have an article, that article should be a whole lot better than the one which exists now. Maybe instead of complaining about it you should do some research and give these towns the credit they deserve! Just my $0.02. Robert Lee 09:48 Nov 1, 2002 (UTC)
I agree - these entries need to be improved not removed. --mav
Just for the information, all these entries had the population data (and other data) marked by the census bureau as "Not Applicable" for whatever reason. I don't know if the census bureau changed the way it counts population or if it is an error. The census bureau publishes a large "errata" so it might be in there (I have not looked). If they are wrong, alas, people who know about the places are going to have to update them. -- Ram-Man
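One way a generator could avoid reporting "Not Applicable" as a population of zero is sketched below. This assumes the raw field arrives as a string and that the marker strings listed are the ones used; neither is confirmed here, and the function names and the fallback wording are hypothetical.

```python
# Assumed spellings of the "no data" marker; the real source format may differ.
NOT_APPLICABLE_MARKERS = {"", "n/a", "not applicable"}


def parse_population(raw):
    """Return the population as an int, or None when the field is marked N/A."""
    value = raw.strip()
    if value.lower() in NOT_APPLICABLE_MARKERS:
        return None
    return int(value.replace(",", ""))


def population_sentence(name, raw):
    """Write a population sentence, distinguishing 'no data' from a true zero."""
    population = parse_population(raw)
    if population is None:
        return f"Census population data for {name} is marked as not applicable."
    return f"As of 2000, there are {population} people in {name}."
```

Keeping `None` separate from `0` lets the article say that the figure is missing instead of asserting that nobody lives there.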
Why is the information (which seems to be from tables) not in tabular form? --User:Juuitchan

There was a leap from 40000 articles to 90000 articles. This is a little bit strange. Perhaps I missed something.

User:Ram-Man created a bot that generated 30,000 US place articles in the same format and wording as Union, Mississippi. --mav
This means that now virtually every other article is about a place in the US. Would it be possible to have a moratorium on these entries for a while lest Wikipedia become little more than a US gazetteer?
Additional content is a good thing. Why impose a moratorium? It would make much more sense for concerned individuals to focus their efforts on creating additional content in neglected areas. -- NetEsq 16:23 Nov 24, 2002 (UTC)
That depends on the granularity and usefulness of the additional content. If we soon have say... 1,000,000 articles and 920,000 of them were about every one-horse town and truck stop in Russia, how much use would it be?
Indeed, I've dressed up a dozen or more of these into more complete articles and the Ram-Man says that more than 500 have been so extended. I, for one, would like to know why a town in Mississippi ended up being named Union. Was it named that before the American Civil War or after? Was it named after the federal union or labor unions? How people building an encyclopedia can complain about factual articles on subjects that people might actually look up is beyond me. See my user page for my list, Ortolan88 16:39 Nov 24, 2002 (UTC)
Additional content is not a priori a good thing. I'm not sure why you asserted such an extreme position so baldly. See Wikipedia talk:Bots for how I believe imported data dumps should be handled. --The Cunctator
You, of all people, should understand the uses of stating an extreme position baldly. The Ram-Man has provided a needed foundation for the expansion of the Wikipedia. I like the idea that the Wikipedia has an article on my home town Valdosta, Georgia and that I can dress that article up with some interesting information. Likewise, Newton, Massachusetts, where I now live, and Macon, Georgia and West Memphis, Arkansas, where some musicians I admire came from, and a bunch of others of varying degrees of interest, and not a word, Mr. Quite! Anonymous (below), about hamburger joints. Ortolan88 17:53 Nov 24, 2002 (UTC)
Quite! I thought this was an International project. Knowing how many burger bars are in some mid-western town is of little use to people in the rest of the world.
Bah! Listings for any city, town, or village I might wish to find some information on is part of what I'd want in an ideal encyclopedia, whether that town is in the USA, UK, Guatemala, Cambodia or anywhere else. It seems to me that the problem is not that there are too many entries concerning the USA, but that there are thus far too few for other places. Let's get to work on that! -- Infrogmation
Indeed listings for everywhere in the world would be ideal. I like the fact that Wikipedia has an article on my home town too. I wrote it. It's not in the US. It doesn't mention what percentage of the population are green and what percentage are blue. It does tell you where you can get a decent meal because the town gets a lot of tourists, so the article is pretty useful and it's not just a cut-and-paste job from a census article located somewhere else on the web. Ok so some 500+ Ram-bot articles have been edited by real users, that means some 29,000+ haven't. My point is that if Wikipedia becomes literally swamped with these articles derived from the US census it turns this site into a mere US gazetteer and mirror of the census site. I'm suggesting that maybe the balance of articles is getting a little out of hand. If I hit Random page I'm getting some small US town most of the time. I'm just suggesting that we pause the progress of the bot for a while so that the balance can be brought back into line. Now shoot me down in flames. It was just an innocent suggestion. This is my last word. Ohh and I'm less anonymous than some random handle and a hotmail address. Bye bye.
My real name and my real e-mail address (which is at the oldest ISP on the Internet and which I have had since 1990) are given on my user page, which is a lot more than your nothing. Ortolan88
80.46.160.59 is much less anonymous than mickey.mouse@disney.com
Maybe that's an issue with the "Random" function. When I get a town entry at random, I edit it. I search the web for related sites and other information. And I search the "What links here" and I do a site-wide search for the name of the city so that all articles mentioning it are linked to it. I am glad that Random page links to these articles. And I'm glad they are included. If we don't create the stub articles, nobody would think to improve them. Stubs ARE good. They are not perfect however. They have problems. One of the problems is that links to stub articles are blue (they should be violet in my opinion with more red tone the smaller the article is or the fewer times it has been edited). This is a software issue however. The articles are fine. PS: Maybe I can import a complete discography of every artist that ever lived complete with lyrics, won't that be fun??? There have got to be at least a half a billion songs out there! :o). Robert Lee
Firstly, sorry I know I said that was my last word, but the discussion got expunged before it had come to a natural close I think, so I'm back. Well I know my view isn't shared by you guys. But I do want to make the point that I think the bot importing all of this information in bulk is unbalancing this project. Robert Lee's tongue-in-cheek suggestion about importing a discography makes the same point, as does my earlier point about a gazetteer of Russia.
I think the only reason it feels unbalanced is because of the "Random Page" feature. If we didn't have that feature, who would know it felt unbalanced? -- Ram-Man

This is probably the 5th or 6th time (at least) that people have started a discussion on the merits of the city articles. It gets old! I think everyone agrees that having one's own hometown is cool. No other encyclopedia has that (well and I guess neither do we... yet). I've added to many city articles that I hit on "Random Page". In fact somewhere under Wikipedia:Utilities I think it says something about the random page feature being used to fix stubs, which is what these (and most other articles) are. I think the answer to this entire discussion is that if you don't like what you see now, stop complaining and start adding more material! -- Ram-Man

5th or 6th time eh? hm... I wonder why that is. Well unfortunately I don't have a bot able to wikify up a detailed entry on every issue of the Superman and Spider-Man comics ever produced, so I won't be able to catch you up.
See also the suggestion that the city articles might well be eventually part of a Wikiteer.

Try as I might, I cannot see the merits of an argument which states that Wikipedia has too many geographical stubs in relation to other articles. The fact that different people continue to bring up the same argument does not make it any more meritorious, nor does the fact that these geographical stubs are for towns in the United States. It's very clear that people see value to these stubs and are working hard to supplement them; it's also very clear that the people complaining about these stubs could put their time to better use by supplementing the areas that they feel are being neglected.-- NetEsq 01:15 Nov 26, 2002 (UTC)