Wikipedia:Bots/Requests for approval/DOI bot

The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.

DOI bot

Automatic or Manually Assisted: Automatic

Programming Language(s): PHP w/ Snoopy & BasicBot

Function Summary: Adds DOIs to citations provided using {{cite journal}}

Edit period(s) (e.g. Continuous, daily, one time run): Will do a thorough job every few months; will be available to be used on specific articles whenever requested.

Edit rate requested: 6 edits per minute. In reality, querying other websites will be the rate-limiting step.

Function Details: Adds a permanent link to any article cited using {{cite journal}}.

The bot uses a variety of methods to locate the DOI, in the order stated:

1. Check whether the DOI is already encoded in the url parameter.
2. Check the metadata of the web page linked to by the url parameter for a DOI.
3. Scour the page for a DOI.
4. If there is no URL, perform an "advanced search" on Google Scholar using, for example, the title and author parameters to form a precise query. The precise search guarantees that the URL is of relevance to the article. If this URL contains a DOI, this is added to the citation; if not, the URL is added.
5. Do a Google web search for "article title" + each + author + name. If the first result contains only one DOI, it is a safe assumption that this belongs to the article cited, so it is used; otherwise, the DOI is ignored.
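
As an illustration of the first step, here is a minimal Python sketch of spotting a DOI inside a URL. It is not the bot's actual code (the bot is written in PHP); the function name `find_doi_in_url` and the regular expression are assumptions made for this example, and the pattern is a heuristic rather than a full DOI validator.

```python
import re
from urllib.parse import unquote

# Assumed heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix.
# Real DOI suffixes are very permissive, so this will miss or over-match some cases.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>&?#]+')


def find_doi_in_url(url):
    """Return the first DOI-like substring in a URL, or None if there is none."""
    # Decode percent-escapes first, since the "/" in a DOI is often written as %2F.
    match = DOI_PATTERN.search(unquote(url))
    if match is None:
        return None
    # Drop trailing punctuation that is unlikely to belong to the DOI itself.
    return match.group(0).rstrip(".,;)")


print(find_doi_in_url("http://example.com/view=article&id=10.1001/doi/ishere"))
```

A citation whose url already yields a DOI this way would not need any of the later, more expensive lookups.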

Operator: Verisimilus T

Discussion

That sounds really cool, who's operating it? SQL ^{Query me!} 20:57, 12 March 2008 (UTC)

Whoops. That would be me, sorry. Verisimilus T 21:02, 12 March 2008 (UTC)

A few questions:

For #3, how do you propose to know that the DOI you found belongs to the article you are looking for and does not come from, for example, a list of references or a "See also" section on the page? Are you going to do some kind of secondary check (i.e. resolve the DOI you found and see if it matches the author and title)? Also, what happens if the titles (or authors, for that matter) are not exact matches? (I know they should be, but let's consider reality.)

For #4 and #5, are we going to run into the same problem as CorenSearchBot (see section 5.3 of the Google Terms of Service)? Quoted here for convenience:

5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.

For the record, I asked Google for permission; here's their reply:

Thank you for your interest in Google Scholar. At this time, we don't
allow automated queries on our index. This is for a number of reasons,
both technical and due to agreements with content providers. While we
don't have an API available for Google Scholar, I appreciate this feature
request and have passed your email and comment on to the relevant
engineers on our team.

Verisimilus T 19:57, 13 March 2008 (UTC)

Thanks. And this bot does sound really interesting. - AWeenieMan (talk) 23:29, 12 March 2008 (UTC)

Hi, thanks for your input. Re #3, in my long experience of collecting DOIs, if an article's references contain DOIs, the article always has a DOI, stated earlier in the page code, and this isn't a problem. I'd be interested if you could find any examples to the contrary.

DOIs could be resolved to confirm that they are the correct page, but this may be difficult - and is probably unnecessary.

Inexact matching is a difficulty I'm not sure how to deal with. Problems are bound to arise, for example with italic text in titles, and accented characters in names. I'm sure it's somewhat trivial to check that a string is 95% similar (say), but it's outwith my coding experience. Help gratefully received!

Thanks for pointing out the Google T&C. That's a pity. It looks like CorenSearchBot used Yahoo to get round this - is this a possible alternative? Verisimilus T 11:22, 13 March 2008 (UTC)

Adjusted logic

I've checked Yahoo!'s terms and conditions and they do not prohibit their use by bots. Hurrah!

So I've tweaked the DOI searching algorithms so they act as follows in the absence of a URL:

1. Does the URL contain a DOI? (e.g. http://example.com/view=article&id=10.1001/doi/ishere)
   1. If so, does the page contain data telling us we've got the right title?

   The only sites I've seen with DOIs in the URL are BioOne and Blackwell Publishing. The former of these encodes the title in an invisible span.

2. Do the <meta> tags contain a dc.Identifier or citation_doi?
   1. If so, check that the dc.title or citation_title matches the title we want.
3. Is there a DOI in the page, anywhere?
   1. Are there lots of DOIs?
      1. Do any occur in association with the title? If there are any <br>, <p>, <li> or <td> tags between the title and a DOI, the DOI could refer to a different reference, and we'll have to ignore it.
   2. Is there a unique DOI?
      1. Does the DOI appear in the first 5000 characters of the document? If so, it is probably part of the document description. Any later, and it's more likely to be a reference.
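
The metadata and page-body checks above could be sketched roughly as follows. This is an illustrative Python version, not the bot's PHP implementation: the names `MetaCollector`, `doi_from_meta` and `doi_from_body` are hypothetical, the DOI pattern is a heuristic, the title comparison is a plain case-insensitive match, and the "lots of DOIs" branch is simplified to look only at the first DOI that follows the title.

```python
import re
from html.parser import HTMLParser

DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>&]+')   # assumed heuristic pattern
BLOCK_TAG = re.compile(r'<\s*(br|p|li|td)\b', re.IGNORECASE)


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs, keyed by lower-cased name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content:
            self.meta.setdefault(name.lower(), content.strip())


def doi_from_meta(html, wanted_title):
    """Use citation_doi / dc.Identifier only if the page's own title matches."""
    parser = MetaCollector()
    parser.feed(html)
    doi = parser.meta.get("citation_doi") or parser.meta.get("dc.identifier")
    title = parser.meta.get("citation_title") or parser.meta.get("dc.title")
    if not doi or not title or title.casefold() != wanted_title.casefold():
        return None
    match = DOI_PATTERN.search(doi)
    return match.group(0) if match else None


def doi_from_body(html, wanted_title, head_chars=5000):
    """Apply the unique-DOI and DOI-near-title heuristics to the raw page."""
    dois = set(DOI_PATTERN.findall(html))
    if not dois:
        return None
    if len(dois) == 1:
        # A unique DOI is trusted only if it appears early in the document.
        doi = next(iter(dois))
        return doi if html.find(doi) < head_chars else None
    # Several DOIs: accept the first one after the title, unless a
    # <br>, <p>, <li> or <td> tag separates it from the title.
    start = html.lower().find(wanted_title.lower())
    if start == -1:
        return None
    after_title = html[start + len(wanted_title):]
    match = DOI_PATTERN.search(after_title)
    if match is None or BLOCK_TAG.search(after_title[:match.start()]):
        return None
    return match.group(0)
```

A caller would presumably try `doi_from_meta` first and fall back to `doi_from_body` only when the metadata is missing or the titles disagree.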

This is giving promising results so far... Verisimilus T 18:33, 13 March 2008 (UTC)

Have you tested this in your userspace at all? SQL ^{Query me!} 19:18, 13 March 2008 (UTC)

Yes (as User:Botodo until I get my bot flag) - it's working surprisingly well! Just ironing out a few bugs now. Verisimilus T 19:49, 13 March 2008 (UTC)

I am liking your new logic, myself. It is too bad about Google, but not much can be done. Inexact title matching shouldn't be too hard to do (in a simple way), if you are interested. I usually use a Levenshtein distance algorithm (php.net) for such things. - AWeenieMan (talk) 20:39, 13 March 2008 (UTC)
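
For anyone following along, a rough sketch of that idea in Python (PHP has a built-in `levenshtein()`; the version below is written out by hand). The helper names `normalise` and `titles_match`, the handling of wiki italic markup and accents, and the 0.95 threshold are assumptions for illustration, taken from the "95% similar" figure mentioned earlier in the discussion.

```python
import unicodedata


def normalise(title):
    """Strip accents, wiki bold/italic quote marks, case and extra whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_markup = no_accents.replace("'''", "").replace("''", "")
    return " ".join(no_markup.casefold().split())


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def titles_match(a, b, threshold=0.95):
    """True if the normalised titles are at least `threshold` similar."""
    a, b = normalise(a), normalise(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return False  # nothing to compare, so do not claim a match
    return 1 - levenshtein(a, b) / longest >= threshold


print(titles_match("''Tyrannosaurus rex'' and the évolution of birds",
                   "Tyrannosaurus rex and the evolution of birds"))
```

Normalising before measuring the distance means italic markup and accented characters do not count against the similarity score, so the threshold only has to absorb genuine wording differences.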

Thanks for that. Touch wood, I think I've ironed out all the problems and am ready to go. Any suggestions for pages to test are welcome! Verisimilus T 21:52, 13 March 2008 (UTC)

Alright then, let's see what she's capable of :) Approved for trial (200 edits). SQL ^{Query me!} 00:02, 14 March 2008 (UTC)

Unfortunately, that's "nothing" at the moment. I'm being thwarted by captchas. Does getting a "bot" flag turn these off? I'm guessing it's alright to operate under the alias of User:Botodo for now? Verisimilus T 08:08, 14 March 2008 (UTC)

Your bot user account is not autoconfirmed yet. That happens 4 days after creation, I believe. At that point, the captchas will disappear. - AWeenieMan (talk) 15:00, 14 March 2008 (UTC)

Your bot account should be autoconfirmed now, so you should be good to go with your trial :) SQL ^{Query me!} 06:48, 20 March 2008 (UTC)

What's the status of this request? — Werdna talk 13:55, 4 April 2008 (UTC)

The bot is running quite smoothly. Only a few minor issues appeared, and they were easily fixed. I'm just waiting to get it set up on my toolserver account, and will then be ready to roll out into full scale operation. Verisimilus T 14:14, 4 April 2008 (UTC)

You're at about 170 edits on this task; I'd go ahead and say trial complete :) I looked over a lot of the edits, and they looked very good to me. I would suggest approval. SQL ^{Query me!} 01:16, 12 April 2008 (UTC)

Approved. MaxSem^{(Han shot first!)} 10:02, 12 April 2008 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.