User talk:NicDumZ/Archive 2

From Wikipedia, the free encyclopedia

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.


linkchecker

Hi !

A little bug in your tool :

In Humanitarian response to the 2004 Indian Ocean earthquake, there is the source text

[[Sweden]] || [[Swedish krona|SEK]] 500M (USD 72.2M)<ref>http://www.regeringen.se/sb/d/4823/a/36245</ref> || || [[Swedish krona|SEK]] 1100M (USD 159M)<ref>[http://www.frii.se/index3.shtml Frivilligorganisationernas Insamlingsråd - Aktuellt<!-- Bot generated title -->]</ref> || || 177.2 || 0.5,

and it is converted into

[[Sweden]] || [[Swedish krona|SEK]] 500M (USD 72.2M)<ref>{{cite web |url=http://www.regeringen.se/sb/d/4823/a/36245 |title=]</ref> || || [[Swedish krona|SEK |archiveurl=http://web.archive.org/web/20050101025709/http://www.regeringen.se/sb/d/4823/a/36245 |archivedate=2005-01-01}}] 1100M (USD 159M)<ref>[http://www.frii.se/index3.shtml Frivilligorganisationernas Insamlingsråd - Aktuellt<!-- Bot generated title -->]</ref> || || 177.2 || 0.5

:)

NicDumZ ~ 19:47, 15 February 2008 (UTC)

It's been fixed; the problem was that I didn't want to capture the space between the URL and the title on normal links. — Dispenser 06:21, 16 February 2008 (UTC)

On another note, this is a bug too. The Evil Spartan (talk) 07:13, 17 February 2008 (UTC)

Sure it's not a feature? :) I mean, it looks like a more useful link name than what was there before ...
... or are you confusing DumZiBoT and linkchecker? (The diff is from DumZiBoT, but this section is about a linkchecker bug, so I almost have to ask.) — the Sidhekin (talk) 07:31, 17 February 2008 (UTC)

Bot droppings

Since other human editors and all other bots (that I've seen) do not label their individual contributions, I think it would be a good idea if the <!--Bot generated title--> comment were not inserted in every single place a reference was converted in the article. The edit summary is the place to disclose that the bot has been at work, and anything that increases the amount of dead text we have to wade through while editing is not moving in the right direction. I recognize there may be utility in tagging these changes so that a human being can make sure the change is sensible; from other comments I've read here, that may be a big concern. (Some templates for citations are now amazingly long and complex when you see them in the editor, and it's easy to cut off a tag at the end and scramble half an article. That's the same sort of problem.) --Wtshymanski (talk) 15:47, 19 February 2008 (UTC)

This is fairly standard practice for bots. It isn't a signature, but rather a comment indicating what the content is. This is quite often done with other things: with the introduction of the cite.php system, WP:DOC adds comments about where to put the interwikis, and we have other bots which put images that have been deleted into comments. If you wish to change standard practice, I'd recommend bringing it up at the Bot owners' noticeboard. — Dispenser 20:34, 21 February 2008 (UTC)


Non English sites

The first edit your bot made in this diff [1] is total garbage. I think this is because the bot is reading something in Chinese, or similar. Although the reference is spam anyway (thanks for bringing it to my attention) and I am going to delete it, it does expose a flaw in your bot when the site is not in English. SpinningSpark 20:12, 21 February 2008 (UTC)

The bot correctly handles text in other scripts, but only when the encoding has actually been specified. Reading through the documentation of UnicodeDammit: if the page doesn't specify the encoding, it falls back to statistical methods to determine one. This doesn't always work, and results in the garbage that you see. — Dispenser 20:34, 21 February 2008 (UTC)
If the bot is not guaranteed to get it right then it should have human review. I thought it was a principle of bots that the operator was responsible for the actions of a bot, not the community at large to correct its mistakes. SpinningSpark 08:48, 22 February 2008 (UTC)
You are right: during the approval process, bot owners are required to prove that their bots get it right most of the time. Antivandalism bots sometimes get it wrong, and so does DumZiBoT!
What happened is simple, and pretty rare: the erroneous link does not declare any encoding in its code, so there is no way to know what set of characters should be printed. The fact is that no automated tool can be 100% sure that the character set it prints is meaningful (my Firefox is not even able to display the title or the page properly!): only a human can, and that's why every tool is required, per international standards, to tell users what encoding it is using. Without that information, the behavior of automated tools is unpredictable. DumZiBoT is actually a bit more clever than usual tools, because a lot of pages do not declare any encoding: when no encoding is specified, it tries to guess which encoding it is. And I have to say, most of the time it works (I can give you examples! ;) ). When a non-Western charset is used, DumZiBoT is sometimes mistaken, true. I'm sorry, but keep in mind that these borderline cases are very rare..! NicDumZ ~ 13:44, 22 February 2008 (UTC)
This is what your bot wrote: ªF½åÅ]³N¤è¶ô¸ê°Tºô. I don't see how that could be mistaken for any western language even with the most simple of tests. SpinningSpark 15:56, 22 February 2008 (UTC)
It is not only about Western languages: some linked pages are in Chinese, Vietnamese, or Japanese, and I don't want to ignore them.
As you can see, all the characters are *valid*: each of them could appear in, for example, a French title. It is only the combination of characters that is garbage, and semantic analysis of natural language is very hard, especially when you don't know what the original language was...
But if you have in mind some simple tests, go ahead, the source is online : User:DumZiBoT/reflinks.py :)
NicDumZ ~ 16:02, 22 February 2008 (UTC)
I'm not going to mess with the code. Even if I had the skill, it is not my responsibility. I can think of any number of simple tests which would have rejected that garbage. How about six or more characters in a row that are not simple alphanumerics without diacritics? I have not seen a word in any language which uses diacritics on six successive letters, not even Polish. Anyhow, this particular example would have failed the test even if diacritics were allowed. SpinningSpark 16:36, 22 February 2008 (UTC)

(unindent) You are probably right: with some tweaks, this test would work when applied to the Western alphabet. However, what about "東賢魔術方塊資訊網"? This title is meaningful (it is the title of the same link, when decoded with the Chinese Big5 encoding) but is not made of alphanumeric characters: your test would reject it even though it's a valid title! On the other hand, there are many ways to produce garbage using Chinese or Japanese characters, and such a title could easily result from an erroneous unicode conversion by DumZiBoT, so a test like "six or more characters in a row that are not alphanumeric and not Japanese, Chinese, or Vietnamese characters" would still raise false positives (flagging a valid title as bad) and false negatives (accepting a bad title as valid). Trust me, it's not that easy. NicDumZ ~ 17:07, 22 February 2008 (UTC)
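The trade-off NicDumZ describes here can be sketched in Python. This is purely illustrative code, not part of reflinks.py, and the CJK character range used is an approximation:

```python
import re

# SpinningSpark's proposed test, taken literally: six or more
# characters in a row that are not plain alphanumerics.
naive = re.compile(r"[^0-9A-Za-z\s]{6,}")

# It flags the windows-1252 mojibake...
assert naive.search("ªF½åÅ]³N¤è¶ô¸ê°Tºô") is not None
# ...but it also flags the perfectly valid Big5 title:
assert naive.search("東賢魔術方塊資訊網") is not None

# Whitelisting CJK ideographs removes that false positive, but garbage
# that happens to decode into CJK characters would now pass as valid.
cjk_aware = re.compile(r"[^0-9A-Za-z\s\u4e00-\u9fff]{6,}")
assert cjk_aware.search("ªF½åÅ]³N¤è¶ô¸ê°Tºô") is not None
assert cjk_aware.search("東賢魔術方塊資訊網") is None
```

Either way the heuristic trades one class of errors for another, which is the point being made.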

I cannot understand your reasoning here. If the bot had been capable of recognising the chinese encoding there would have been no problem (putting aside the issue that my browser will not display it). The fact is, it did not recognise the encoding. In those circumstances, doing nothing is safer than what it did do, presumably assume a western encoding. SpinningSpark 17:30, 22 February 2008 (UTC)
"I cannot understand your reasoning here. If the bot had been capable of recognising the chinese encoding there would have been no problem." You're wrong here. What I'm trying to say is: there is no way, at the end of a decoding process, to know whether the text is meaningful or not. Put another way, to DumZiBoT there is no difference between decoding a byte sequence into "ªF½åÅ]³N¤è¶ô¸ê°Tºô" and decoding it into "東賢魔術方塊資訊網"! Your statement assumes that there is a way to know whether the attempted encoding is correct, but no such test exists! Sure, there is a way to know that some encodings cannot decode some byte sequences. For example, UTF-8 does not work here: the result is "�F���]�N����T��", where the replacement characters indicate that the byte sequence has no equivalent in UTF-8. But there are a lot of charsets that do decode this byte sequence, and there is no way to know which result is meaningful. The first result is windows-1252, the second is Big5. DumZiBoT could also have tried ISO 8859-5 ("ЊFНхХ]ГNЄшЖєИъАTКє"), TCVN ("êFẵồỀ]́NÔốảụáờ̀TẲụ"), ISO 8859-10 ("ŠF―åÅ]ģNĪčķôļę°Tšô"), IBM 864 ("ﺕFﺵﻣﻊ]٣N¤ﻭ٦ﻬ٨ﻳ٠Tﻑﻬ"), HZ ("狥藉臸砃よ遏戈癟呼"), and so on, and so on...
My implementation choice was to fall back to windows-1252 when no encoding is found, because it is a flexible representation of Western alphabets: since most of the links are Western (we're on an English-speaking encyclopedia!), defaulting to a Western charset makes sense. However, before using it:
  1. I try fetching an encoding from the HTTP header
  2. I try fetching an encoding from the HTML source
  3. I try, for domain names that ought to use an exotic charset ( .ru, .zh, .jp, etc...), to use their national charsets
Now, for the very, very small fraction of links that A) are not compatible with windows-1252, B) do not follow international standards, and C) do not use their national charsets, I don't care. I'm not going to change the behavior of my bot for less than 0.1% (0.01%?) of the links. That makes only 60 (6?) wrong edits out of 60K+. Again, I'm sorry for the inconvenience, but it's definitely a won't-fix.
NicDumZ ~ 14:35, 23 February 2008 (UTC)
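The lookup order described above can be sketched as follows. This is an illustrative reconstruction, not DumZiBoT's actual code (which lives in User:DumZiBoT/reflinks.py): the function name, the regex, and the TLD-to-charset table entries are all assumptions:

```python
import re
from urllib.parse import urlparse

DEFAULT_CHARSET = "windows-1252"  # flexible Western fallback

# Hypothetical table of "exotic" top-level domains and a national charset
# for each; the real list is in reflinks.py.
TLD_CHARSETS = {"ru": "koi8-r", "su": "koi8-r", "jp": "shift_jis", "tw": "big5"}

CHARSET_RE = re.compile(r"charset\s*=\s*([-\w:.]+)", re.IGNORECASE)

def guess_charset(url, content_type=None, html_head=None):
    """Pick a charset following the three lookup steps, then the fallback."""
    # 1. Encoding declared in the HTTP Content-Type header
    m = CHARSET_RE.search(content_type or "")
    if m:
        return m.group(1).lower()
    # 2. Encoding declared in a <meta> tag in the HTML source
    m = CHARSET_RE.search(html_head or "")
    if m:
        return m.group(1).lower()
    # 3. National charset for domains likely to use an exotic one
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in TLD_CHARSETS:
        return TLD_CHARSETS[tld]
    # 4. Fall back to windows-1252
    return DEFAULT_CHARSET
```

A page that declares nothing and sits on an unlisted TLD ends up at step 4, which is exactly the failure mode discussed in this thread.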
Keep in mind that bots are tolerated on Wikipedia, not pandered to. It's very important that bots do not become a nuisance. Converting bare references is not so important a task that it's worth frustrating users over. I've read everything you've written in this thread, and I have to say I can't tell you how to improve this bot, but please try to be friendlier with people who point out problems, rather than telling them you don't care and will continue without making any further effort to fix the problem. When trying to think of a solution, keep in mind that it would be better for the bot to give up on 33% of references than to aim for 100% and end up sometimes adding garbage (or what looks like vandalism, as in my latin-tag case below). --Gronky (talk) 20:50, 4 March 2008 (UTC)
Number 3 is a really smart way to deal with things. In the case at hand, the url is a .com.tw address. Is .tw currently in the list of "exotic" encodings? It would probably have produced the correct result then. Martijn Hoekstra (talk) 21:09, 4 March 2008 (UTC)

Report page

It might be a nice idea to have a special page where users can report mistakes the bot has made, something along the lines of what user:clueBot does (see ClueBot's edit history). Martijn Hoekstra (talk) 21:19, 5 March 2008 (UTC)

Bare references change breaks comments

FYI: This change demonstrates that fixing a bare reference that is commented out will produce undesired results. Cburnett (talk) 06:10, 9 March 2008 (UTC)

Raúl Fernández

Your bot keeps adding a link to fr:Raúl Fernández, who is a completely different person than English Wikipedia's Raúl Fernández. Is there any way to stop this? Cheers, CP 07:46, 18 February 2008 (UTC)

This is just one of several bots doing this. The root of the problem is that the French and the Italian wikipedia both claim these are the same. Presumably one editor (mistakenly) did this, and the bots are now copying it. I've now removed the inter-wiki links from and to the French article, so the bots should have nothing left to copy. Unless they have a cache or something ... :) No, I think this should stop it. :) — the Sidhekin (talk) 07:56, 18 February 2008 (UTC)
Well, thanks for monitoring my talkpage, and fixing these little problems. I've been away these day, and it's a pleasure to have all the problems fixed when back :) NicDumZ ~ 08:16, 18 February 2008 (UTC)
Great! Thanks a lot! Cheers, CP 17:47, 18 February 2008 (UTC)


Hi, not sure how to get in touch with you, but you made some changes to a page I look after, changing some links and misdescribing them. I know no one 'owns' Wikipedia, but please only change things you know about. The entry was for British Baseball Federation. I've changed them back, but it's time I'd prefer not to waste doing this. Many thanks. John (PS. I'm not overly familiar with this site, but couldn't see any other way to post you a note, so attached it to this one. Maybe the site should be more user-friendly!! But that's for the Wikipedia management, I guess.) —Preceding unsigned comment added by John Walmsley (talkcontribs) 12:48, 23 March 2008 (UTC)

the bot added some garbage

In this edit: [2] the bot added a tag to a reference wrongly saying the reference was in Latin. Can you look into fixing this? Thanks. --Gronky (talk) 20:36, 4 March 2008 (UTC)

What really needs fixing is that web server: "Content-Language: lang" indeed! :-P I don't blame the bot for thinking this is Latin!
Still, I guess it would be better if the bot were to ignore it as an invalid language code. Can't expect all the world's misconfigured web servers to shape up, now can we? :) — the Sidhekin (talk) 21:12, 4 March 2008 (UTC)
Yup, Sidhekin got it :)
DumZiBoT only takes the first two letters of the language code; that's why it ended up using Latin.
I'm trying to work out whether your suggestion would work, because there are sometimes strange values. For English, you can find en, en-US, en-UK, en_UK, en US, english, or... whatever, because I can't actually think of any tool that uses the Content-Language header.
NicDumZ ~ 14:15, 5 March 2008 (UTC)
Good point, but not necessarily decisive.
I don't think underscores and spaces are permitted, but at least underscores are so common that it may be well worth considering merely the first sequence of alphanumerics, whatever the separator. (The first sequence should always be the language code, except for the grandfathered forms, of which there's a limited number, few of which are very interesting.) What's trickier is english, but on the other hand, on this wikipedia at least, we could well ignore it as invalid as we don't tag English language sources anyways. :) (Alternatively, you could make it an exception, I suppose.)
I think the language subtag registry gives (among other things) all current language codes as well as all grandfathered forms. "Content-Language: i-klingon", anyone? :) — the Sidhekin (talk) 15:23, 5 March 2008 (UTC)
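Sidhekin's suggestion (take the first run of letters, whatever the separator, and reject anything that isn't a plausible code) can be sketched like this. The function name is an assumption for illustration, not anything in reflinks.py:

```python
import re

def primary_language(value):
    """Extract the primary language subtag, or None if invalid."""
    # First run of ASCII letters, whatever the separator (-, _, space).
    m = re.match(r"\s*([A-Za-z]+)", value or "")
    if not m:
        return None
    tag = m.group(1).lower()
    # ISO 639 primary subtags are two or three letters long.
    return tag if 2 <= len(tag) <= 3 else None

# The misconfigured server's "Content-Language: lang" is rejected,
# instead of being truncated to "la" (Latin).
assert primary_language("lang") is None
assert primary_language("en-US") == "en"
assert primary_language("en_UK") == "en"
assert primary_language("english") is None
```

Note this also drops grandfathered forms like "i-klingon" (one-letter first subtag), consistent with ignoring them as uninteresting.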
Whether the webserver is misconfigured or not isn't the question. I'm highlighting another fail case for this bot. Hopefully some constructive thinking can be done with the information in these failure reports. --Gronky (talk) 15:47, 5 March 2008 (UTC)
DumZiBoT failed to produce proper titles in these two edits on bare links made on 6 February 2008. Neither page gives any language hints to the user agent, but one of them is on a .co.jp domain. Not that it matters much, as the content of the <title> element isn't consistent with the actual title and still needs manual correction. --Kakurady (talk) 23:52, 22 March 2008 (UTC)

Request

Hey Nic, would you be able to run your DumZiBoT through the Öser page? Thanks. Khoikhoi 03:55, 23 March 2008 (UTC)

Dispenser just did it ;) NicDumZ ~ 11:10, 24 March 2008 (UTC)

The da Vinci Barnstar

The da Vinci Barnstar
You are awarded this barnstar for enhancing Wikipedia by programming DumZiBoT, a reliable robot that has both greatly improved Reference lists and increased the productivity of Wikipedians. EconomistBR (talk) 21:32, 24 March 2008 (UTC)
Wooha !
Thanks, I appreciate it ;)
NicDumZ ~ 21:53, 24 March 2008 (UTC)

Chile

Please go through the Chile article again. I was forced to revert your edits as they were made on top of a vandalized version. Thank you. ☆ CieloEstrellado 02:04, 25 March 2008 (UTC)

Your edit got reverted.
Anyway, please consider using http://tools.wikimedia.de/~dispenser/view/Pywikipedia, it's great !
NicDumZ ~ 15:52, 25 March 2008 (UTC)

Question

Hello. I have a question: could DumZiBoT take on one fairly simple task? It is quite similar to one it is already fixing so nicely. For details look here. Thank you very much. - Darwinek (talk) 15:39, 25 March 2008 (UTC)

Yes. Adding the references tag is fairly easy: it's a common script from pywikipedia.
I had not even considered running it, since it is really easy to handle, and I thought some other bot would already be doing it.
I'm afraid that running my bot on this would require another BRFA, though.
NicDumZ ~ 15:45, 25 March 2008 (UTC)
Maybe you could create DumZiBoT2 or something like that; it would require another BRFA too, but it is not a matter of time. The work just needs to be done sooner or later. What do you think? - Darwinek (talk) 15:54, 25 March 2008 (UTC)
It might be better implemented in AWB (see WP:AWB/FR), as it does many other similar types of edits. See the revision which implements <references/> appending. — Dispenser 23:07, 25 March 2008 (UTC)
Is there any way to request this feature for the next version of AWB? - Darwinek (talk) 23:57, 25 March 2008 (UTC)

Recognition for <s>a job well done</s> <s>a job well done</s> another job well done

The Working Man's Barnstar
Every time I check my Watchlist, you and DumZiBoT have improved another article. Your continued work deserves recognition! TheRedPenOfDoom (talk) 03:20, 26 March 2008 (UTC)

Thanks! Great idea for a bot. — Omegatron 04:00, 26 March 2008 (UTC)

Great Bot. Yaki-gaijin (talk) 06:00, 26 March 2008 (UTC)

Thank you... ! :)
Really !
NicDumZ ~

Minor Edits

Is there a way you could make the bot not mark its edits as minor when it modifies more than a certain number of bytes? It's a relatively minor thing, but it came to my attention with this edit, where DumZiBoT added over 2000 bytes but marked the edit as minor. Great bot, by the way! The Dominator (talk) 18:28, 25 March 2008 (UTC)

I could do that easily, but... what for ? xD
NicDumZ ~ 20:06, 25 March 2008 (UTC)
lol, good point. Still, it does help a little: for example, when you're calculating the average number of minor edits, or the percentage of edits that are minor, bots marking major edits as minor sort of skew the statistics. The Dominator (talk) 20:19, 25 March 2008 (UTC)
Ah. Would you be offended if I answered that keeping the statistics accurate doesn't really justify counting how many bytes are modified in each article? :D
Besides, the minor/normal edit threshold would be very arbitrary, am I right?
NicDumZ ~ 20:28, 25 March 2008 (UTC)

Grumble Link to Seroquel website

Hi. DumZiBoT recently generated a title for the link to the Seroquel home page on the quetiapine Wikipedia page. This was the change: diff. Prior to the change the link looked like this http://www.seroquel.com/. It was easy to tell the link was to a corporate web page. Now it looks like this Home. The revised link is not so clear. Perhaps you might modify the DumZiBoT parsing algorithm so that it generates a list of untitled links of the form “http://www.token.com/”. Some of these links may then be changed to the form “Token home page”, or the link may be left unchanged, or the old substitution algorithm might be used depending on the characteristics of the destination url. I have modified the link to look like this: Seroquel website. Regards. KBlott (talk) 20:39, 25 March 2008 (UTC)

I understand your concern. On de:, they actually considered that problem so important that it was part of the reason for asking me to stop DumZiBoT.
But actually, I disagree. Surely, and quite sadly, some webmasters don't really get the point of using a descriptive title for their pages. But Domain website as a title really loses information:
  1. More pages have useful titles than undescriptive ones, and defaulting to a standard title for a minority of sites is not the way out
  2. Some domain names are actually pretty uninformative: I'd rather use a vague page title than an even vaguer domain name.
I understand your concern, but I have already thought a lot about this, and as of now I really think DumZiBoT's approach is quite efficient, given how many websites use descriptive titles and how few don't.
NicDumZ ~ 10:54, 26 March 2008 (UTC)
Actually, I agree with your proposition that page titles generally contain more information than domain names. On the other hand, it is easy to find exceptions to this rule. The problem is confounded by the fact that “information content” is essentially subjective. Ultimately any string in a grammar is meaningless, except in relation to the semantics that we as organics (or bots) may happen to assign to it. I think this problem is inherent to all sufficiently large parsing/editing tasks. I agree that web page designers often forget to give their web pages useful titles. In any case, the Seroquel link is now titled, so DumZiBoT will probably leave it alone from now on. Regards. KBlott (talk) 15:53, 26 March 2008 (UTC)

Sinking_of_Prince_of_Wales_and_Repulse

Hi, would you run the bot through Sinking_of_Prince_of_Wales_and_Repulse again please? It picked up that two referenced off-site pages had poor page titles. I have now corrected these off-site pages, and if the bot could be run again, the change would be reflected in the above page's reference list. Apart from that, it looks very interesting indeed. Nice one ;-) --Andy Wade (talk) 19:00, 26 March 2008 (UTC)

I'm glad my bot helped you improve your page titles.
However, as you can read in the [[User:DumZiBoT/refLinks|FAQ]], DumZiBoT only modifies untitled links, so it won't let me update the titles ;) (at least not with the current script). But you can easily =]
Cheers,
NicDumZ ~ 22:58, 26 March 2008 (UTC)
The pages in question are actually part of a frames site so normally they wouldn't show the page title.
But they were still untidy so I'm glad they're sorted out now. Cheers. --Andy Wade (talk) 23:39, 26 March 2008 (UTC)

Bad Title?

As seen in this diff, you might want to add titles consisting solely of "test" to your title blacklist. Spiesr (talk) 20:18, 26 March 2008 (UTC)

Yes, you are right, actually. I have added "test" as a blacklisted title, and DumZiBoT now runs with the updated list.
Thanks a lot for the report !
NicDumZ ~ 22:51, 26 March 2008 (UTC)

DumZiBoT's edit to Template:Infobox Planet

Your bot's edit to Template:Infobox Planet, [3], broke every page where that part of the template is used, by adding a references block to the end of the template. It might be an idea to restrict your bot to the main article space... Thanks. Mike Peel (talk) 20:39, 26 March 2008 (UTC)

Woops.
I'm truly sorry about that. I brought the bot to a stop and changed the references-block code so that the block is added only in the main article namespace.
It still adds titles to references in every namespace, though, because I don't see any problem with that (please tell me if you can think of one).
Thanks for the kind report.
NicDumZ ~ 22:46, 26 March 2008 (UTC)

Good title

Sorry but nice name for your bot. I am not exactly sure why it labelled a particular report as Microsoft Word. How exactly is the bot meant to work? Shouldn't it read the title that is inside the document? here Simply south (talk) 21:32, 26 March 2008 (UTC)

Well, what is the title that is inside the document?
<dc:title><rdf:Alt><rdf:li xml:lang='x-default'>Microsoft Word - LIP Chapter 3 - Boro Policy Statement _2_.doc</rdf:li></rdf:Alt></dc:title>
... I cannot quite fault the bot here either. :) — the Sidhekin (talk) 21:41, 26 March 2008 (UTC)
For one thing, it is not a .doc but a .pdf (and where did that code come from? Maybe I am in way over my head somehow) here Simply south (talk) 21:54, 26 March 2008 (UTC)
That code came straight out of the PDF file. Just read it in a text editor instead of a PDF viewer, and you'll find it. At least parts of it are human-readable. :) (I'm betting the PDF file was created from a .doc file, and no one thought to give it a proper title.) — the Sidhekin (talk) 22:11, 26 March 2008 (UTC)
... or, if you have the right tools:
sidhekin@blackbox[23:13:02]~$ pdfinfo lip_chapter_3_-_boro_policy_statement_.pdf 
Title:          Microsoft Word - LIP Chapter 3 - Boro Policy Statement _2_.doc
Author:         
Creator:        PScript5.dll Version 5.2
Producer:       Acrobat Distiller 6.0 (Windows)
CreationDate:   Tue Oct  3 13:23:58 2006
ModDate:        Tue Oct  3 13:23:58 2006
Tagged:         no
Pages:          22
Encrypted:      no
Page size:      595 x 842 pts (A4)
File size:      189739 bytes
Optimized:      yes
PDF version:    1.4
sidhekin@blackbox[23:13:20]~$ 
(It may even be your PDF viewer displays the title; check the window title bar or document info dialog?) — the Sidhekin (talk) 22:16, 26 March 2008 (UTC)
Thanks a lot, Sidhekin, for your help here, you got it perfectly right ;) NicDumZ ~ 22:55, 26 March 2008 (UTC)
Ah?! How is "DumZiBoT" a nice name? Out of inspiration, I just used the last part of my username, which derives from my real name. Have I unwittingly made it sound... special? :) [I'm not a native English speaker and sometimes just miss these little things] NicDumZ ~ 22:55, 26 March 2008 (UTC)
Erm... sorry... unfortunately "dum" (with a b at the end) means stupid. I didn't mean anything bad; it is just a funny and rather ironic name. It is doing good work, so... never mind.
Computer language is complete gibberish to me, but I have found the part you are referring to. Simply south (talk) 23:04, 26 March 2008 (UTC)
Ah xD I knew that, but didn't quite make the connection. Don't worry, I really don't feel offended by that coincidence ;) NicDumZ ~ 23:09, 26 March 2008 (UTC)

Thanks for DumZbot!

The bot is doing a great job. Having said this, once it gets through Wikipedia, it would be nice if it ran "periodically," and not every night, maybe? My watchlist gets flooded with entries, all of which are improvements, I'll say that. And yes, we should have done it right the first time and this wouldn't be happening.

Would once a month be too infrequent once the whole encyclopedia is processed? Again, a brilliant job, whoever thought of this. It has been needed for a long time! Student7 (talk) 23:16, 26 March 2008 (UTC)

Thanks for your Thanks ! :)
You may imagine that searching for the pages that contain bad references is a heavy process. (If not, I'm telling you! :] ) Since only a few pages, out of two million articles, need fixing, DumZiBoT does not retrieve all the pages from the servers just to alter the ones that need alteration: I use the XML dumps for my work.
And actually, the en: database is only dumped every two months. DumZiBoT is working on the dump of March 15th; the previous available dump (and previous run) was from January 9th.
Yes, as of now, DumZiBoT is only fixing the bare references inserted during that two-month timespan: it takes quite a long time :) (DumZiBoT has been running continuously for nearly 80 hours now, and I don't really know when it'll stop!)
Now, I'm not sure that I can improve that behavior. Given the dumps issue (I only get the list of articles needing modification every two months or so), what would you suggest I do?
Thanks,
NicDumZ ~ 23:53, 26 March 2008 (UTC)
I think you are telling me that, in the future, you will essentially run this "every two months." I appreciate that it has run for several days on January's list and will continue many more days until through. I vote for that! I assume that future runs might be shorter. But either way, assuming I'm looking at an "average sample" I won't see more updates than I'm looking at now. And maybe a lot less once it is through. Yes, it needs to keep going. A real plus for references. Editors are encouraged to update the quality of some references now that they have names. It should help a lot. Student7 (talk) 00:49, 27 March 2008 (UTC)

Japanese characters

In this edit the bot generated a weird title for the ref. I'm guessing it is because the page is in Japanese? Just thought I would let you know. -- Ned Scott 06:54, 27 March 2008 (UTC)

Thanks !
It was not because the title was in Japanese; it was because no encoding is specified in the HTML source. That is not standards-compliant, and there is no reliable way to tell which encoding it is (my Firefox fails to display the page correctly).
However, I improved the rules on .jp domain names a bit, and it's better now.
Thanks again ;)
NicDumZ ~ 07:16, 27 March 2008 (UTC)

DumZiBoT

Hallelujah! It's about time someone with the know-how stepped up to the plate. Next step: Create a version that works in real time after each edit. If not in real time, then "on-demand" for a particular article. The other day I corrected dozens of such links in 2008 NCAA Men's Division I Basketball Tournament. If I could have "submitted" that article to your bot for processing it would've saved me a boatload of time. davidwr/(talk)/(contribs)/(e-mail) 20:35, 24 March 2008 (UTC)

Dispenser ported my tool to http://tools.wikimedia.de/~dispenser/view/Pywikipedia !
Enjoy !
NicDumZ ~ 20:40, 24 March 2008 (UTC)

Awesome bot! Thanks for this. Tb (talk) 21:28, 24 March 2008 (UTC)

Thanks from me too. In the past I've done a lot of grunt work wrapping external links in ref tags, good to see someone following along and furthering that. :) Bryan Derksen (talk) 18:22, 27 March 2008 (UTC)

Request running of bot?

Is there any way we can request running this bot on a page? Like maybe putting a tag at the top or something?--Paul McDonald (talk) 12:55, 27 March 2008 (UTC)

See http://tools.wikimedia.de/~dispenser/view/Pywikipedia which is an online version of the bot. — Dispenser 14:13, 27 March 2008 (UTC)

Great Page!

Really nice job on your page. I like it, so keep up the good work! =) --Cher <3 (talk) 00:24, 28 March 2008 (UTC)

User:DumZiBoT ? :) NicDumZ ~ 07:45, 28 March 2008 (UTC)

conversion from bare ref, causes the reference to be listed multiple times

I see you have edited a couple of references in the article Dada Kondke. These changes have led to the references section below being populated with the same title multiple times. Is there a way to fix this? I appreciate the conversion process, but if the problem persists it's annoying :) --Kedar (talk) 06:53, 28 March 2008 (UTC)

Well, just don't use the same reference three times; give it a name instead :)
See Wikipedia:Footnotes#Naming a ref tag so it can be used more than once
NicDumZ ~ 07:44, 28 March 2008 (UTC)

Bug report

Hi there! I remember pointing out to you once that the bot does not always handle titles in Russian correctly, and I think you've fixed the problem for the most part, but today this edit popped up in my watchlist, and it seems to have the same problem. Don't know if it's something wrong with the site or with the bot, but I think it's worth a closer look. Cheers,—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 14:36, 27 March 2008 (UTC)

Well, yes, again, it's a website that does not declare any encoding. :( I added a rule for .su websites, and now it works.
Thanks for the report ! :)
NicDumZ ~ 07:50, 28 March 2008 (UTC)
Actually no, it doesn't. The link you provided shows gibberish in Cyrillic letters. Looks like wrong encoding was selected, and it looks like KOI8-R was involved at some point in the decoding process. Sorry for the bad news!—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 13:56, 28 March 2008 (UTC)
Better, then. DumZiBoT now uses KOI8-R instead of windows-1251 as its default Cyrillic charset. But actually, since both of these encodings are 8-bit, *any* text can be decoded using KOI8-R. What I'm saying here is that I've just switched priorities: if someone comes and tells me that some title got decoded as garbage because DumZiBoT used KOI8-R where windows-1251 would have been better, I won't be able to do anything more...
NicDumZ ~ 16:38, 30 March 2008 (UTC)
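The priority switch described above can be sketched like this (a minimal illustration, not DumZiBoT's actual code; the helper name is made up). Because KOI8-R and windows-1251 both assign a character to every byte, decoding never fails, so whichever charset is tried first always "wins":

```python
# Minimal sketch: both KOI8-R and windows-1251 are full 8-bit charsets,
# so *any* byte string decodes under either one without error --
# the priority order of the candidate list is the whole policy.

def decode_title(raw, charsets=('koi8_r', 'cp1251')):
    """Try each charset in priority order; for full 8-bit encodings
    the first candidate always succeeds, right or wrong."""
    for cs in charsets:
        try:
            return raw.decode(cs), cs
        except UnicodeDecodeError:
            continue
    return raw.decode('latin-1'), 'latin-1'  # last-resort fallback
```

Note that cp1251-encoded bytes still "decode" under KOI8-R, just to the wrong letters, which is exactly why no rule can fix every page.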

Thanks

Thanks for converting the source links in relations to the PS2 version of Syphon Filter: Logan's Shadow. I usually have trouble with those. Beem2 (talk) 04:01, 31 March 2008 (UTC)

bot spelling error

Your bot seems to have misspelled "Association" quite a few times. I wonder if you could rerun to fix the error. A specific example is here, and a search shows it's fairly common. Regards, Jpmonroe (talk) 07:14, 31 March 2008 (UTC)

Yeah, basically, the title from the ARIA website is misspelled.
The website is wrong, not my bot, which just copies the title from the website.
NicDumZ ~ 11:16, 31 March 2008 (UTC)

Excessive link text

The bot generally does a good job, but look at this diff. Perhaps it should have some maximum text length? JonHarder talk 13:37, 29 March 2008 (UTC)

Ah, thanks a lot for the report !
Actually, I remember being asked about that problem, but can't remember why I did not fix it :(
Titles longer than 250 characters (arbitrary length) are now skipped.
NicDumZ ~ 16:17, 30 March 2008 (UTC)
If you skip, then please chop like this: "A really long line that is truncated" -> "A really long line that is tr..",
rather than "A really long line that is tr", so that readers will know it's truncated. Electron9 (talk) 21:20, 30 March 2008 (UTC)
Ah, yes. When I said "skipping", I meant that the reference was not being changed. However, you're right: simply skipping it is not enough. I now process it, appending "..." at the end. Thanks !!!
NicDumZ ~ 15:50, 31 March 2008 (UTC)
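The truncate-with-ellipsis behaviour could look something like this (a sketch; the 250-character cap is the arbitrary limit mentioned above, and the function name is illustrative):

```python
MAX_TITLE_LEN = 250  # arbitrary cap, as mentioned above

def trim_title(title, limit=MAX_TITLE_LEN):
    """Truncate over-long page titles instead of skipping the ref,
    appending '...' so readers can tell the title was cut."""
    if len(title) <= limit:
        return title
    return title[:limit].rstrip() + '...'
```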

Character encoding issues

From pywikipedia's BeautifulSoup.py

try:
    import chardet
#    import chardet.constants
#    chardet.constants._debug = 1
except:
    chardet = None
chardet = None

Look at the last line. Web version updated; sources at http://tools.wikimedia.de/~dispenser/resources/sources/ — Dispenser 22:13, 30 March 2008 (UTC)

Ah, right
Good catch, it might help a lot ! (Sad that the run on the current dump is over, though!)
I will test how chardet can improve DumZiBoT's behavior.
I will commit that fix to the pywikipedia trunk as soon as I'm able to.
NicDumZ ~ 15:48, 31 March 2008 (UTC)
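For reference, the fix amounts to dropping the stray assignment so the guarded import can take effect. A sketch (`detect_encoding` is an illustrative helper, not a pywikipedia function):

```python
try:
    import chardet  # optional dependency: charset auto-detection
except ImportError:
    chardet = None
# (the stray unconditional "chardet = None" that followed is removed)

def detect_encoding(data):
    """Guess a byte string's encoding with chardet when available,
    falling back to latin-1, which accepts any byte sequence."""
    if chardet is not None:
        guess = chardet.detect(data).get('encoding')
        if guess:
            return guess
    return 'latin-1'
```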

yet another feature request - cite web templating

I'm sure you must've been asked this often; apologies (and just ignore this) if so... but could (perhaps the next revision of) DumZiBot convert the labelled URLs to web cite templates? This would mean that it could put in the retrieved date and would make it easier for other editors to flesh out the refs with richer information later. I appreciate all the work you and your bot have done! Pseudomonas(talk) 10:02, 31 March 2008 (UTC)

Yes, I've been asked that often. However, a few minutes ago, my FAQ page was not answering that FAQ, so don't apologize ;)
Here is your answer.
Thanks ! ;)
NicDumZ ~ 15:43, 31 March 2008 (UTC)

Page not found used as link

Hi, I like the bot idea overall. I just saw that it made this edit; I don't think it should list 404 pages as the title for a link. Perhaps if it gets a 404 response it could just skip trying to label that link? Just an idea. Thanks. MECUtalk 15:53, 31 March 2008 (UTC)

Unfortunately, that's a "200 OK" response. Silly webserver, to give a 200 code with that title.
Though I suppose adding "page you requested could not be" to the exceptions might be an idea? — the Sidhekin (talk) 16:21, 31 March 2008 (UTC)
Yup, and pretty silly webpage which has <div>s *in* the title markup.
However, I improved the "page not found" part from "page.*not *found" to "page.*not( *be)? *found" so that it also matches that wrong title. It should be better.
Thanks,
NicDumZ ~ 18:34, 31 March 2008 (UTC)
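Such a blacklist check might look like the following sketch (the function name is illustrative, and the optional " be" group is what lets one pattern catch both "page not found" and "the page you requested could not be found"):

```python
import re

# Sketch of a bad-title blacklist check (not the bot's exact list).
BAD_TITLE_RE = re.compile(r'(?i)page.*not( *be)? *found')

def is_bad_title(title):
    """True when a fetched <title> is really an error banner."""
    return BAD_TITLE_RE.search(title) is not None
```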

Older screw-up

Please investigate [this] unnoticed screw-up. `'Míkka>t 03:40, 1 April 2008 (UTC)

No, DumZiBoT did not screw up. The page was already screwed up. I just tried adding a simple <references/> tag, and the same problem happens. In fact, a <ref> tag was not properly closed. I fixed it, and ran DumZiBoT on the page again.
NicDumZ ~ 09:11, 1 April 2008 (UTC)

Dumzibot: heaven or hell?

Bot made another error. Can't you people just do the links yourselves?

[4]

Death Valley? izaakb ~talk ~contribs 00:10, 29 March 2008 (UTC)

Well. Try. Open the pdf. Title ?
Being aggressive when submitting an invalid bug report is just... ridiculous.
NicDumZ ~ 04:12, 29 March 2008 (UTC)
How 'bout answering the polite and accurate ones then? Equazcion /C 05:07, 29 Mar 2008 (UTC)
No. I answered that annoying question at 5 am local time, because it was just too much. But I need to think more about serious questions to issue an adapted answer :) NicDumZ ~ 09:49, 29 March 2008 (UTC)

The correct name of the PDF is Lethal Lou's: Profile of a Rogue Gun Dealer not "Death Valley". What PDF is called "Death Valley?" Not at the link below:

http://www.gunlawsuits.org/xshare/pdf/reports/lethal-lous.pdf Death Valley -- Bot generated title --

rgds izaakb ~talk ~contribs 21:35, 29 March 2008 (UTC)

The correct name of the PDF notwithstanding, the actual title of the PDF is "DEATH VALLEY". Observe:
sidhekin@blackbox[22:43:15]~$ pdfinfo lethal-lous.pdf 
Title:          DEATH VALLEY
Author:         vice
Creator:        Acrobat PDFMaker 7.0.5 for Word
Producer:       Acrobat Distiller 7.0.5 (Windows)
CreationDate:   Tue Sep  5 13:44:38 2006
ModDate:        Tue Sep  5 14:02:06 2006
Tagged:         yes
Pages:          25
Encrypted:      no
Page size:      612 x 792 pts (letter)
File size:      3643054 bytes
Optimized:      yes
PDF version:    1.6
sidhekin@blackbox[22:43:43]~$ perl -nle 'print if /title/.../title/' lethal-lous.pdf 
         <dc:title>
            <rdf:Alt>
               <rdf:li xml:lang="x-default">DEATH VALLEY</rdf:li>
            </rdf:Alt>
         </dc:title>
sidhekin@blackbox[22:43:46]~$
Computer programs don't generally go around inventing titles. — the Sidhekin (talk) 21:44, 29 March 2008 (UTC)

I've been looking at the bot as doing work for editors who have been too lazy to use halfway-proper references. Even if the titles aren't so great, they are very valuable in dead link recovery. — Dispenser 06:59, 30 March 2008 (UTC)

Now I understand how the bot got the name, but I think that is problematic, as the writer of the document did not update the file info you posted above with the document name. If I were to go searching elsewhere for a document entitled "Death Valley" it wouldn't do much good, since that's apparently just what the writer called it while it was in progress. And I guess there is no way to double-check that? izaakb ~talk ~contribs 01:37, 1 April 2008 (UTC)
One way to increase confidence in a title extracted from a PDF would be to check to see if that string is present in the text stream of the PDF (I realize some PDFs don't have this). I suppose this would work for HTML pages as well, and might prevent some of these issues where the metadata is not, in fact, a good title. —johndburger 02:46, 1 April 2008 (UTC)
No, it wouldn't work. A title is basically a summary of the content; I see no particular reason why the title would appear in the page / document body. "User talk:NicDumZ - Wikipedia, the free encyclopedia" is not to be seen anywhere in this page, by the way. NicDumZ ~ 09:14, 1 April 2008 (UTC)
What about pages or PDFs which only contain images? — Dispenser 03:27, 2 April 2008 (UTC)

PDF being called Microsoft Word

http://en.wikipedia.org/w/index.php?title=MPEG_transport_stream&diff=201201845&oldid=199427990 In the first reference, it titled a PDF document "Microsoft Word". Can you do something about that? Daniel.Cardenas (talk) 01:44, 27 March 2008 (UTC)

see Good title just above ;)
NicDumZ ~ 07:03, 27 March 2008 (UTC)

Are you going to do something about it? A PDF is not Microsoft Word. I suggest you add a rule to do nothing in this case. Daniel.Cardenas (talk) 11:23, 27 March 2008 (UTC)

A fix may be to check if titles end in .foo, and if they do, remove it. Martijn Hoekstra (talk) 07:59, 28 March 2008 (UTC)
Corrected for now. The regex is pretty simple, (?i) *microsoft (word|excel|visio|powerpoint); it probably can be improved.
NicDumZ ~ 14:41, 2 April 2008 (UTC)
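As a sketch, a filter built on that regex could look like this (`is_office_banner` is an illustrative name; the pattern is the one quoted above):

```python
import re

# Titles that are just an Office application banner, e.g. a PDF whose
# metadata title is "Microsoft Word - report.doc", are not useful.
OFFICE_TITLE_RE = re.compile(r'(?i) *microsoft (word|excel|visio|powerpoint)')

def is_office_banner(title):
    # match() anchors at the start, so "Microsoft Word - foo.doc"
    # is caught while a title merely mentioning Word is not
    return OFFICE_TITLE_RE.match(title) is not None
```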

DumZbot reference conversion style "problem"

Currently 'DumZbot' converts:

<ref>http://www.energybulletin.net/2389.html</ref>

Into:

<ref>[http://www.energybulletin.net/2389.html Europe Worries Over Russian Gas Giant's Influence | EnergyBulletin.net | Peak Oil News Clearinghouse<!-- Bot generated title -->]</ref>

I think the idea is great, but the destination format isn't really that good. I would suggest the following format:

<ref name=wwwenergybulletinnet2389html>{{cite web|title=Europe Worries Over Russian Gas Giant's Influence | EnergyBulletin.net | Peak Oil News Clearinghouse|url=http://www.energybulletin.net/2389.html}} 080327 energybulletin.net</ref>

(dateformat is YYMMDD)

That way one can see 1) which site the information comes from, without clicking on the link, 2) the date when the link was retrieved, 3) any updates to the template can utilise the collected information structurally, 4) the reference can be used several times.

Otherwise all is well :-)

Electron9 (talk) 18:49, 27 March 2008 (UTC)

Here is your answer.
Thanks ! :)
NicDumZ ~ 15:44, 31 March 2008 (UTC)
Also - there is a very good reason NOT to include the retrieved date - this needs to be done by a human who actually checks that the source backs up the statement it is a citation for. --Random832 (contribs) 14:55, 2 April 2008 (UTC)

Half completed

Hello, I think your bot was fixing the page Fiona Sit. I see it converted a lot of the references, but it looks half completed? Any reason why the bot left, lunch break? Benjwong (talk) 01:51, 2 April 2008 (UTC)

Yes, there seems to be a bug for three of the links, as the script matches <ref> but not <Ref> tags. Some of the references have two links; this won't be fixed. The rest are 404s or aren't HTML files. — Dispenser 02:42, 2 April 2008 (UTC)
Ok that would explain it thanks. Benjwong (talk) 02:48, 2 April 2008 (UTC)
Wow, I suck. Actually I will fix that for the next run.
NicDumZ ~ 10:48, 2 April 2008 (UTC)
fixed ! Thanks !
NicDumZ ~ 14:24, 2 April 2008 (UTC)

Mastek

If there are sufficient references available in the Mastek article, can you remove the unreferenced tag from the page, or shall I do it myself? KuwarOnline (talk) 11:40, 9 April 2008 (UTC)

Suggestion

I think this bot is a great idea and it works perfectly as far as I've seen. I just have one suggestion: I find that the link text this bot generates is actually less descriptive than the URL. The bot just uses the title of the page, which usually doesn't distinguish the link all that well from others in the reflist. For example, if there's an article on Joe Smith, sourced with a few different biographies on that person, the links would all read "Joe Smith bio", "the life of Joe Smith", or something similar. There isn't much to distinguish one from the other, especially as far as which are from reliable sources.

The most important thing about references isn't really the title of the page, but the root site they're located on. I wonder if you'd consider modifying your bot to include the root site address in addition to the page title -- for instance, something like "Title at Site.com" (Joe Smith bio at timemagazine.com). This would allow a casual glance of the reflist to reveal any unreliable sources, any glaring omissions of sources that should be there, etc.

Thanks and please let me know your thoughts. Equazcion /C 23:00, 28 Mar 2008 (UTC)

"The most important thing about references isn't really the title of the page, but the root site they're located on." I don't agree with that at all; I think your bot is doing great work by adding titles. But it would be even better if, for example with this diff, the bot were to include a domain name for the source, so that the added text would look something like this: "University of Illinois at Chicago (UIC) - College of Engineering<!-- Bot generated title -->], www.uic.edu"
If you were interested in this, I'd suggest considering yet one more enhancement - a page of standard sources (nytimes.com, washingtonpost.com, etc.), with matching names (New York Times, Washington Post), which the bot could use rather than posting the domain. By doing this, your bot would be adding two missing elements of citations, not just one. -- John Broughton (♫♫) 20:06, 5 April 2008 (UTC)
I'd like to third this; pagetitle, domain is far more useful than either alone.--Father Goose (talk) 22:52, 5 April 2008 (UTC)
I just can't find what to answer you guys. I understand what you want to do. But... to me, formatting references in this way or in another (using cite templates for example) is just an editorial choice that I can't make as a bot owner. I try to be as neutral as I can. If I actually add the web address of the link in some way, others will come and ask me not to include it, with arguments as good as yours... :)
My point of view is simple: references should have titles, it's a strong style recommendation. I can simply add titles using the webpage title, so I do it, and no one can reasonably complain about this, because it's a well-accepted recommendation/policy. But if I actually format these titles in a way that is not widely accepted, I'll run into trouble :)
See below, someone thinks I should even remove the simple "bot generated title" comment ! :)
NicDumZ ~ 15:45, 12 April 2008 (UTC)
I don't think anyone would object to simply adding the domain name. When web refs are formatted manually, they always contain some indication of the site they came from. Equazcion /C 15:06, 13 Apr 2008 (UTC)

Maybe there's just some confusion here. Here's a ref your bot did:

Here's how we'd like it:

This is consistent with {{cite web}}'s formatting; in fact, I used "cite web" to generate the example above, putting the domain name of the website in the work field. (Per {{cite web}}'s documentation: "work: If this item is part of a larger "work", such as a book, periodical or website, write the name of that work.") DumZiBoT does not need to use "cite web" itself but can just copy its formatting: [URL pagetitle]. domain name.

I cannot imagine that our suggestion to add the domain name to the reference that DumZiBoT generates would be in any way controversial.--Father Goose (talk) 21:17, 13 April 2008 (UTC)

Laundry list

Done

  • Add check for PDF so Microsoft Word - ... .doc isn't a valid title (including the rest of the Office family: PowerPoint, Excel, Visio)
  • Fix language Icon issues (lang -> latin (la), assumed character encoding != language)
    • I'm not supporting the language icon anymore. I believe it's too much work for too few pages.
  • Fix issues with <Ref> and <REF>

Won't Do

  • Add checks for DE: where the <h1> element needs to partially match the title
    • That's absolutely not the way I coded my script, I do not intend to change it this way. NicDumZ ~ 15:35, 12 April 2008 (UTC)
  • Fix issues with title which are less than 6 characters
  • Maybe merge identical references
    • Maybe in another script, but not in reflinks. It's a rather complex task. NicDumZ ~ 15:35, 12 April 2008 (UTC)
  • Maybe identify unbalanced <ref> tags
    • No ? I'm not fixing syntax errors ;) NicDumZ ~ 15:35, 12 April 2008 (UTC)
  • Reject title with HTML tags (look for </...>)

 ??

  • Add the Safari-like algorithm that shows only the differences between titles
    • If multiple titles on a page are the same but come from different links, then skip the title
      You mean... only showing what titles got appended? The current diff scheme only shows the lines that got changed; it's not a lot of text, is it? NicDumZ ~ 15:35, 12 April 2008 (UTC)
  • Add optional support to convert numbered external links into references where there are three or more <ref>s

To Do

  • Add optional support for bullet external links
  • Post the link to the source somewhere other than the BRfA
  • Get rid of/merge the meta-data section in the FAQ

Well, that's the stuff that I've been able to come up with. You might want to look at the toolserver source, as I've tried implementing (poorly) a new encoding scheme. — Dispenser 03:38, 2 April 2008 (UTC)

DumZiBoT on long pages

Hey Nic - love the bot, great work. One request: any way to modify it so it does not add the commented text (Bot generated title) when it converts bare references on long pages (>32k)? We are trying to fight long articles per WP:AS, and on those pages every character we save, even non-visible ones like commented text, helps. UnitedStatesian (talk) 14:07, 10 April 2008 (UTC)

Well, no, I won't remove the comment. The idea is to let the user know that a robot (non-human) inserted the title, so that, in case of garbage, he can easily know what happened. I also think that I get more bug reports with the comment: editors don't just correct a wrong title, they report it, so I can improve the bot. :)
NicDumZ ~ 15:23, 12 April 2008 (UTC)
Well, thanks anyway; still love the bot. UnitedStatesian (talk) 04:52, 15 April 2008 (UTC)

Sweet work

Thanks again for this bot. It's doing great work. Do you have an estimate on the percentage of articles it has hit? Timneu22 (talk) 21:20, 13 April 2008 (UTC)

Well, yes, it's easy. DumZiBoT has 100K contribs, all on this task. Out of 2,300K articles, that makes something like 4% of the articles. (Over 100K contribs, I think that the number of articles that got processed several times can be ignored.) NicDumZ ~ 21:29, 13 April 2008 (UTC)

Avoid long "titles"?

This page demonstrates the longest bot-generated title I've seen to date (citation 10 is a very detailed error message). To avoid this kind of thing, can the bot be programmed to ignore or truncate "titles" longer than some pre-set length? --Orlady (talk) 03:22, 14 April 2008 (UTC)

Thanks ! I actually fixed this on March 30th. The maximum length is 250 characters. For example, on this website, it gives this, which is not really better, due to the HTML tags in the title. NicDumZ ~ 08:16, 14 April 2008 (UTC)

Nature:access

This bot generated titles for references which appeared as "nature:access". This may be because it is operating from a computer without access to nature.com. I noticed this on the edits for Monotreme in January. Can it be fixed? Hectorguinness (talk) 12:39, 30 April 2008 (UTC)

Poke :)

Hi! I'm User:Bdamokos from the Hungarian Wikipedia and I would like to let you know that I am going to test your bot on the Hungarian Wikipedia, if it's okay with you. I think it's a great tool. Regards, --Dami (talk) 01:35, 3 May 2008 (UTC)

Hey! I've run a couple of test edits on hu.wikipedia and there are three issues: sometimes it gives a title to a link that already has a title (the third one is an example of this), repeats the title twice (2nd and 3rd link), and it freezes when trying to get the title of a PDF file (giving an error that another process is using the same file; should I install some extra program that the bot relies on?). If you could help me get to the root of these problems, I think this could be a great tool on huwiki also. Bye, --Dami (talk) 02:23, 3 May 2008 (UTC)
You probably want to try the Feb 12 version, as I have been editing it since then. The only improvement I made besides portability to my web framework is that it uses (a modified) chardet correctly. — Dispenser 02:43, 3 May 2008 (UTC)
Hello !
I'm glad you're trying it. Thanks for telling me, too :)
I've just added my up-to-date script to the official pywikipedia SVN, synchronizing it with noreferences.py for more coherence.
You will need :
  • if not already done, to configure noreferences.py for hu:
  • for PDF handling, the unix command pdfinfo. It can output some garbage about a badly formatted or truncated PDF; that's pretty normal: to avoid downloading big PDFs, if a file is bigger than 2 MB, I only get its first 2 MB, and pdfinfo does not like that. However, the title is located in the headers, and it should work. If you do not use Unix / cannot get pdfinfo, let me know, and I'll add an option to skip PDF files.
About your first bug: I tried adding the ref to my test page, and running the bot on it. As you can see, the hu: reference is not being modified. This is likely a bug that has been fixed by now.
About your second bug, well.... Honestly for now I really don't know what it might be. Let me know if it happens again, I'll try to reproduce and to fix it.
Cheers ! :)
NicDumZ ~ 08:11, 3 May 2008 (UTC)
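The pdfinfo handling described above can be sketched as follows (assumptions flagged: the helper names are made up and the real script's invocation may differ; `pdfinfo` is the Unix command from xpdf/poppler):

```python
import os
import subprocess
import tempfile

MAX_PDF_BYTES = 2 * 1024 * 1024  # only the first 2 MB are fetched

def parse_pdfinfo_title(output):
    """Pull the Title: line out of pdfinfo's key/value output."""
    for line in output.splitlines():
        if line.startswith('Title:'):
            return line[len('Title:'):].strip()
    return None

def pdf_title(first_bytes):
    """Write the (possibly truncated) leading bytes of a PDF to a
    temp file and ask pdfinfo for its metadata. pdfinfo may complain
    about the truncated body, but the title lives in the header, so
    it is usually recovered anyway."""
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as f:
        f.write(first_bytes[:MAX_PDF_BYTES])
        path = f.name
    try:
        proc = subprocess.run(['pdfinfo', path],
                              capture_output=True, text=True)
        return parse_pdfinfo_title(proc.stdout)
    finally:
        os.unlink(path)
```

Writing to a temporary file sidesteps the "another process is using the same file" error reported above on Windows, where an open temp file cannot be re-opened by a child process.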
Hey! Thanks for the help! I am using it on Windows, so if the pdf part could be switched off or made optional (if I can get a Linux that works with my laptop in the future) that would be nice.
I have translated into Hungarian what I understood (= everything except the badtitles part; I guess it's ok if it's the same in hu as in en):

Maybe if you could also put the Hungarian into the original source code, it would be easier to update. Best regards, --Dami (talk) 11:42, 3 May 2008 (UTC)

Thanks, I added your hu: translations to both noreferences.py and reflinks.py on the SVN NicDumZ ~ 23:31, 4 May 2008 (UTC)
Hey! I tried the Feb 12 version, but it doesn't work... I think the problem is basically with Hungarian accented characters, and I think the issue might be specific to Windows (Unix handling encodings better and such). I will try to test with a live CD.--Dami (talk) 12:05, 3 May 2008 (UTC)
Character encoding? Do some titles print badly? NicDumZ ~ 23:31, 4 May 2008 (UTC)
I was thinking that the bot is trying to find titles for links that already have them because the titles contain accented characters like éáőúűóüöí; but the error also happened on an Ubuntu I tested it on, with the same pages, and it doesn't always happen, so I don't know what the pattern is. I will try again with the SVN version and report back. --Dami (talk) 11:05, 5 May 2008 (UTC)
Hi again! It has the same bug on Linux as well. For now I have disabled the PDF part and set it to manual mode. This way it's quite usable, but not as fast as automatic mode. Regards, --Dami (talk) 17:38, 3 May 2008 (UTC)
How have you disabled it? Using -ignorepdf from the latest SVN version, or by modifying the outdated code found at User:DumZiBoT/reflinks.py? NicDumZ ~ 23:31, 4 May 2008 (UTC)
This was before the SVN version, by commenting out the relevant class. --Dami (talk) 11:05, 5 May 2008 (UTC)
It seems that the number of errors, apart from the 5 or 6 articles it always gets wrong, is zero. So don't worry about good old huwiki, and keep developing such great tools!--Dami (talk) 17:52, 3 May 2008 (UTC)

Reproducing bug#2

Sorry for flooding your talk page. Could you run a test to check what the bot does if the same link appears more than once on a page, like here [5] or [6]? Ideally it would merge these refs into one named ref, but instead it inserts the title as many times as there are identical links. --Dami (talk) 18:06, 3 May 2008 (UTC)

Are you actually using the SVN version ?
I just tried with the SVN version, here and here, and no wrong behavior occurred. ?! NicDumZ ~ 23:17, 4 May 2008 (UTC)
This was before the SVN version, because the translation was not yet included. I'll try with the SVN version and report back. --Dami (talk) 11:02, 5 May 2008 (UTC)

Svn version

Almost all problems solved, at the price of introducing a new one: it doesn't ignore titles such as "Untitled Document" or just "Untitled" (for example at hu:Marosvásárhely, if I remember correctly). The ignoring should be enabled for Hungarian (there is no need for extra blockings for Hungarian, apart from maybe "Névtelen" [meaning untitled]). If this problem could be solved, it would be wonderful. Thanks again, --Dami (talk) 19:02, 5 May 2008 (UTC)
I also made a small change in the translation, that would be nice to have in the SVN version:

msg = { 'fr':u'Bot: Correction des refs. mal formatées (cf. explications)',
        'de':u'Bot: Korrektes Referenzformat (siehe en:User:DumZiBoT/refLinks)',
        'en':u'Bot: Converting bare references, see FAQ',
        'hu':u'Robot: Forráshivatkozások kibővítése a hivatkozott oldal címével'
      }

--Dami (talk) 19:09, 5 May 2008 (UTC)

Nice !
I'm updating the SVN...
About [7], I don't think that you were using the SVN version at that time, were you? :) [Why? It is inserting {{en icon}}, and I removed this feature in the SVN version, because most HTTP servers don't give out proper language codes, resulting in wrong language icons being inserted...]
NicDumZ ~ 20:57, 5 May 2008 (UTC)
I was actually using the SVN version, but just clicked on NO when it asked, whether to commit the change. Now I saved it, just to show you [8]. --Dami (talk) 08:19, 6 May 2008 (UTC)

This error is still present in the latest SVN version [9], I don't know what's causing it as changing the English badtitles list to Hungarian doesn't help. Any ideas?--Dami (talk) 13:12, 9 May 2008 (UTC)

Well, thanks for your report. It seems that when I added my script to the repository, one space slipped into the badtitles regular expression. (?! I feel confused about that...)
Thanks again, a lot of blacklisted titles were NOT detected as bad titles because of my mistake. It is now fixed.
NicDumZ ~ 18:42, 10 May 2008 (UTC)
Thank you!--Dami (talk) 20:03, 10 May 2008 (UTC)

Bot producing mojibake

Your bot sometimes produces mojibake. See for example this edit: [10]. If you've already fixed this, or if it's not really your bot's fault, please ignore. —Keenan Pepper 01:51, 8 May 2008 (UTC)

Hello !
I just tested, and yes, apparently it has been fixed.
Thanks however for the message, I didn't know about the mojibake word :þ
NicDumZ ~ 06:49, 8 May 2008 (UTC)

Consolidate duplicate refs

Could this bot consolidate duplicate references? Some pages have multiple references all to the same article, and it would be nice if they could be condensed to give a more accurate picture of the number of articles truly referenced. Novasource (talk) 17:59, 12 May 2008 (UTC)

Looks like it is now consolidating references! Yay! Novasource (talk) 17:19, 14 May 2008 (UTC)
Is it ?
Checking for duplicates was definitely something I intended to code "when I have time". But unfortunately that check does not seem that easy, and I don't have much time :)
NicDumZ ~ 17:21, 14 May 2008 (UTC)
Yup. It worked at Lupe Valdez and Buffalo Speedway. Are you running the same thing that's at http://tools.wikimedia.de/%7Edispenser/cgi-bin/reflinks.py? That's what I used, and it appears to reference you in the edit summary (which I modified to say "Consolidate links" instead of the usual edit summary), which links the word FAQ to User:DumZiBoT/refLinks. Novasource (talk) 20:33, 14 May 2008 (UTC)
Well. I first wrote reflinks.py for DumZiBoT. At some point Dispenser used the source to convert the script to be web-based and started adding new features. Now I have continued developing the shell-based reflinks.py in my own way, and we now have scripts that tend to differ. No, DumZiBoT is not running the same script :þ
Dispenser, I know you're reading this: your script is still buggy (see [11] for example), but you apparently did a good job implementing that duplicate check :þ If you're okay with that, I'd like to take the "good part" and include it in the pywikipedia SVN :)
NicDumZ ~ 20:48, 14 May 2008 (UTC)
I actually implemented the duplicate check because of that bug. I changed the script to search for all free links, but when it double-replaced duplicate links on a page, that was causing the problem. I implemented an extremely simple check which looks for duplicates. A better one would compare the URLs in the references.
# Convert autonumbered references into cite.php format
# Regex from AWB user
if autonum2ref:
        new_text = re.sub(r'(?m)(?!^[*#:=].*?)(?<!<ref>)(?<!\*)(\s*)\[(https?://[^\] ]*)\](?!.*</ref>)', r'<ref>\2</ref>\1', new_text)
# Merge duplicate refs
for m in re.finditer(r'(?si)(<ref>)(.*?)(</ref>)', new_text):
    # Skip references that appear only once
    if new_text.count(m.group()) <= 1:
        continue

    # Use the longest "word" in the ref body as a readable base name
    refname = 'autoref'
    for g in re.split(r'\W', m.group(2)):
        if len(g) > len(refname):
            refname = g
    # Append the first free number to make the name unique
    i = 1
    while refname + str(i) in new_text:
        i += 1
    refname += str(i)
    # Keep the body in the first occurrence, self-close the others
    new_text = new_text.replace(m.group(), '<ref name="%s">%s</ref>' % (refname, m.group(2)), 1)
    new_text = new_text.replace(m.group(), '<ref name="%s"/>' % refname)
And there was a bug in the version that I was using where it would automatically insert the references section when there were no references. It could happen in shell operation when not using the database as the source. I would like to remove the web hack and start using switches; autonum2ref is an example of one. — Dispenser 03:03, 15 May 2008 (UTC)
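The merge loop quoted above can be exercised as a self-contained function; this is a lightly repackaged sketch of the same logic (illustrative function name), handy for trying it out on sample wikitext:

```python
import re

def merge_duplicate_refs(text):
    """Name the first occurrence of each duplicated <ref> body and
    turn the later identical occurrences into <ref name=".."/>."""
    for m in re.finditer(r'(?si)(<ref>)(.*?)(</ref>)', text):
        if text.count(m.group()) <= 1:
            continue  # references that appear only once are left alone
        # Pick the longest "word" in the ref body as a readable name
        refname = 'autoref'
        for g in re.split(r'\W', m.group(2)):
            if len(g) > len(refname):
                refname = g
        # Append the first free number so the name is unique
        i = 1
        while refname + str(i) in text:
            i += 1
        refname += str(i)
        text = text.replace(
            m.group(), '<ref name="%s">%s</ref>' % (refname, m.group(2)), 1)
        text = text.replace(m.group(), '<ref name="%s"/>' % refname)
    return text
```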

Bot blocked

I've had to block your bot as it was apparently malfunctioning. See WP:ANI#BOT out of control and needs temporary blocking/shutting off for the discussion, and [12] for the malfunction diff. Mangojuicetalk 17:53, 17 May 2008 (UTC)

You shouldn't take it personally: it's a bot. When bots malfunction there's always a calculation about whether the block or leaving the bot unblocked is the correct action. In this case, I judged that since all the bot does is put up interwiki links, which several other bots do, there would be zero harm in blocking it, and potentially non-zero harm in not blocking it. Mangojuicetalk 13:26, 19 May 2008 (UTC)

ANI concern

Bot: User:DumZiBoT or [67]

Confirmation that it is a bot: I'm a bot, I am not able to understand by myself what is the aim of all these basic binary operations that I'm performing

Diff that bot is deleting material, not just adding a link at the bottom: http://en.wikipedia.org/w/index.php?title=Barack_Obama&diff=213072670&oldid=213069913

Please disable bot until repairs can be made. DianeFinn (talk) 17:42, 17 May 2008 (UTC)

Recent edits seem OK. It's not editing that fast either. Do you have time to try asking at User talk:NicDumZ? I'll keep an eye on it for the next few minutes. If you get no response on the talk page, come back here and leave a note. Or simpler still, edit User:DumZiBoT/EditThisPageToStopMe. :-) Carcharoth (talk) 17:49, 17 May 2008 (UTC)

A bot shouldn't be deleting anything. What's another possibility? Sneaky edit summaries and human editing using a bot user? I'm not going to accuse someone of that. So the neutral observation is that the bot is not functioning. DianeFinn (talk) 17:53, 17 May 2008 (UTC)

Copied from ANI DianeFinn (talk) 17:54, 17 May 2008 (UTC)

Peque

Hello! Look at this: [13]; es:Peque (Antioquia) is a town in Colombia, not in Spain. Thank you, XalD (talk) 15:25, 19 May 2008 (UTC)

A little misspelling in replace.py

In replace.py the nn comment reads "blabla teksterstatting". It should read "blabla teksterstatning" (per [14] [15] (nr. 2 reads 'not found')).

To be more precise, it should go like this:

'nn':u'robot: automatisk teksterstatning: %s',

Yeah, I didn't want to upload a whole patch just for this, I hope you understand. :-P

Thanks in advance. --Harald Khan Ճ 17:22, 20 May 2008 (UTC)

Thanks, I just fixed this ;)
NicDumZ ~ 20:22, 20 May 2008 (UTC)

Quote handling

I just saw, on this location, the bot mis-handle a title. I've gone back and fixed it by hand. The issue seems to be a single-quote in the retrieved title. - Denimadept (talk) 22:05, 28 May 2008 (UTC)

Thanks for the kind report ;)
However, the HTML source of the page is <title>World</title> : firefox prints "World" as a title. Surely, in the page, you can find "World's Longest Bridge Spans", but the HTML title is "World". The HTML page is wrong, not my bot ;)
NicDumZ ~ 22:09, 28 May 2008 (UTC)
Ah hah! Report went to the wrong person, then!  :-D - Denimadept (talk) 01:20, 29 May 2008 (UTC)

Reflink updates

I've noticed in the changelog that you've removed the HTTP error logging. I've been meaning to ask you for a while now if you could send the logs over for use on the Toolserver. I have also add the svn version of reflink with modification only for enabling HTML output.

I'd like to discuss rev 5374 of reflinks.py. You changed the syntax of the dead-link tagging to a bracket-less format. This is a problem for my tool, as its regex depends on those brackets.

The regex implementation in checklinks.py is as follows:

# Name of the dead link templates
dead_templates = r'[Dd]ead[ _]*link|Dl|dl|[Dd]l-s|404|[Bb]roken[ _]+link'
...
    # Label dead links
    text = re.sub(ur'\[(\w+://[^][<>\s]+) *([^][]*?)\](\W*?|\W*?<[^<>]*?>\W*?)\{\{(%s)[^{}]*?\}\}' % dead_templates, ur"[\1 '''&#123;&#123;dead link&#125;&#125;''' \2]\3", text)

I would like to come up with a more formal definition, possibly:

text = re.sub(ur'\[(http[s]?://[^\[\]<>"\s]+) *([^\]\n]*)\](?:</ref>)?\{\{[Dd]ead link[^}]*\}\}' , ur"[\1 **dead** \2]", text)
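As a quick sanity check, the proposed pattern can be exercised against a sample reference (written for Python 3, so without the `ur''` prefix; the sample wikitext and the `**dead**` marker are just illustrative, not the real template output):

```python
import re

# Proposed pattern: a bracketed external link, an optional </ref>,
# then a {{Dead link}} template with any parameters.
dead_re = re.compile(
    r'\[(http[s]?://[^\[\]<>"\s]+) *([^\]\n]*)\](?:</ref>)?\{\{[Dd]ead link[^}]*\}\}'
)

sample = '<ref>[http://example.com/page Some title]{{Dead link|date=May 2008}}</ref>'

# Move the dead-link marker inside the brackets, before the title.
result = dead_re.sub(r"[\1 **dead** \2]", sample)
# result == '<ref>[http://example.com/page **dead** Some title]</ref>'
```

Note that the URL group stops at the first whitespace, so the title is captured separately; the `(?:</ref>)?` alternative only matters when the template is placed outside the closing tag.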

It is probably best to bring the discussion somewhere else to get more input on this matter. — Dispenser 04:06, 29 May 2008 (UTC)

Okay, quick answer in the morning
  • I'm truly sorry about the HTTP logs. I ended up not using them, and I thought it was best to delete them, as no other tool was able to use them at the time.
  • The bracket-less syntax was actually not intended. r5374 introduced es: enlace roto, which has a different syntax from the other languages, and I had to adapt refDead. The deadLinkTag fix was wrong because I forgot about the brackets. It is now fixed by r5460.
  • With that last revision, I believe that the latter regex would work, wouldn't it? You just have to adjust it, since {{dead link}} is placed inside <ref></ref>. Let me know...
  • I have also add the svn version of reflink with modification only for enabling HTML output. <- sorry, I don't understand that sentence :(
  • You are welcome to bring the matter elsewhere. You can post on pywikipedia-l@lists.wikimedia.org or... anywhere else :)
NicDumZ ~ 06:38, 29 May 2008 (UTC)

Interwiki position of sk

The bot appears to think that sk (for Slovenčina) comes all the way at the end in the Interwiki sorting order, rather than between szl and cu.[16][17][18][19]  --Lambiam 08:19, 29 May 2008 (UTC)

Hmm... I believe this was fixed during the day. Is that bug still present? NicDumZ ~ 20:39, 29 May 2008 (UTC)

Bug?

Did something go wrong here?—Ëzhiki (Igels Hérissonovich Ïzhakoff-Amursky) • (yo?); 13:23, 29 May 2008 (UTC)

Mmm... not really. In fact, my regular expression searches for references made exclusively of links, ignoring the spaces inside the references. Then, for every such reference, it tries to fetch a title. When no title is found, the reference is rewritten in the format <ref(name ?)>(link)</ref>, without spaces.
That's what happened here. It looks as if only a space was removed, but it actually means that no title (or a bad title) was found, and that the ref has been normalized. Here, the title of http://www-ns.iaea.org/downloads/rw/waste-safety/north-test-site-final.pdf, "Microsoft Word - [...] .doc", is blacklisted and was ignored.
So it looks like a bug, but isn't. Not editing the page at all would have been better, yes, but that's not that easy... or so I think.
NicDumZ ~ 20:45, 29 May 2008 (UTC)
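The normalization described above can be sketched roughly as follows. This is a simplified illustration, not the actual reflinks.py code; the `TITLE_BLACKLIST` pattern and `format_ref` helper are assumptions made for the example:

```python
import re

# Hypothetical blacklist of useless auto-generated titles; the real
# script uses a longer, configurable list.
TITLE_BLACKLIST = re.compile(r'^(Microsoft Word|Microsoft PowerPoint|404|untitled)', re.I)

def format_ref(link, title=None, name=None):
    """Normalize a bare-link reference; drop blacklisted or missing titles."""
    open_tag = '<ref name="%s">' % name if name else '<ref>'
    if title and not TITLE_BLACKLIST.match(title):
        return '%s[%s %s]</ref>' % (open_tag, link, title)
    # No usable title: emit the normalized, space-less bare form.
    return '%s%s</ref>' % (open_tag, link)
```

With a blacklisted title, the output differs from the input only by whitespace, which is why the edit looks like a no-op:

```python
format_ref('http://x.org/a.pdf', 'Microsoft Word - report.doc')
# -> '<ref>http://x.org/a.pdf</ref>'
```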
WP:AWB/FRDispenser 00:30, 30 May 2008 (UTC)

Bot - eswiki

Hello! You are now flagged on the Spanish wiki. We expect your bot's edits to fix external links. Muro de Aguas (write me) 14:37, 29 May 2008 (UTC)

Cool bot!

Cool bot, nice work, pretty clever. PhycoFalcon (talk) 22:51, 30 May 2008 (UTC)

Russian letters in references

Hello! When you convert references in Russian (thanks for this), please pay attention to the correct character encoding; otherwise Cyrillic letters look like strange, senseless symbols after your work!

Regards, Vladimir--Vladimir Historian (talk) 12:24, 31 May 2008 (UTC)
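The garbling described above typically happens when a page's bytes are decoded with the wrong charset (e.g. Latin-1 instead of Windows-1251 or UTF-8). A minimal sketch of charset-aware decoding, assuming nothing about the bot's actual implementation (`decode_title` is a hypothetical helper):

```python
import re

def decode_title(raw_bytes, http_charset=None):
    """Decode HTML bytes using the declared charset instead of a fixed
    default, so Cyrillic titles survive intact."""
    # Prefer the charset from the HTTP Content-Type header, then the
    # <meta> declaration in the document, then UTF-8 as a last resort.
    m = re.search(rb'charset=["\']?([\w-]+)', raw_bytes)
    enc = http_charset or (m.group(1).decode('ascii') if m else 'utf-8')
    try:
        return raw_bytes.decode(enc)
    except (LookupError, UnicodeDecodeError):
        return raw_bytes.decode('utf-8', errors='replace')

# A Windows-1251 page with a declared charset decodes correctly:
page = '<meta charset="windows-1251"><title>Привет</title>'.encode('windows-1251')
title_html = decode_title(page)  # contains 'Привет', not mojibake
```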

Outhouse article

Dear NicDumZ: This article is chock full of bare links, and actually needs to be converted to footnotes. It was one of the first articles I worked on, and I didn't know what I was doing at the time. I've noticed some of your good work, and you seem to be doing a lot of this with a bot. If you could help, it would be greatly appreciated. Thanks. Merci! 7&6=thirteen (talk) 19:01, 31 May 2008 (UTC) Stan

Inadvertent advertising

The HTML titles of web pages belonging to magazines, which may be perfectly appropriate links in articles, can make highly inappropriate reference titles, because they are used to carry promotional messages about the magazine. For example, there are few more comprehensive and well-updated English-language sources for following competitive cycling than Cycling Weekly, so it was a good source for somebody to use to show that two teams had been offered a late entry into the 2008 Giro d'Italia. But it is not the place of Wikipedia to declare, as DumZiBoT did, this publication to be "Britain's biggest-selling cycling magazine, delivers an exciting mix of fitness advice, bike tests, product reviews, news and ride guides for every cyclist". Maybe the bot needs a filter that changes its behaviour when it comes across boastful superlatives, or maybe the automatic edit summary should more explicitly invite editors to check the suitability of the results. Just a thought. Kevin McE (talk) 11:28, 1 June 2008 (UTC)
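The superlative filter suggested above could look something like this. This is purely a sketch of the idea, not anything DumZiBoT actually does; the word list and the `looks_promotional` helper are assumptions for illustration:

```python
import re

# Hypothetical marketing-speak detector: titles containing boastful
# superlatives are probably advertising copy, not page descriptions.
PROMO_RE = re.compile(r'\b(biggest|best|leading|#1|no\.\s*1)\b', re.I)

def looks_promotional(title):
    """Return True if a fetched HTML title reads like an advert,
    suggesting the bot should skip it or flag it for human review."""
    return bool(PROMO_RE.search(title))

looks_promotional("Britain's biggest-selling cycling magazine")  # True
looks_promotional("Giro d'Italia: two teams offered late entry")  # False
```

A real filter would need a much broader vocabulary (and would still miss plenty), so pairing it with a more explicit "please check this title" edit summary is probably the safer half of the suggestion.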