User talk:Fictional tool/Details

From Wikipedia, the free encyclopedia

Contents

[edit] Interface/code?

This is a cool thing to have, but do you think you could make it available here. Best of all would be giving users a web interface to process pages; a second best would be to at least give a link to downloadable code. But since many editors won't necessarily feel comfortable running code on their home systems (especially if it requires a working Perl/Python/whatever system installed), the web interface is ideal. LotLE×talk 20:22, 23 July 2006 (UTC)

Thanks for the comment. I realise this is a problem, but the code is quite immature (i.e. you have to know what bugs to expect to really use it efficiently) and is really in need of a rewrite. It also contains a bit of code that I'm not that happy to release, as it represents quite a lot of work on my part -> valuable. I might consider crippling the code a bit so it works most of the time (the special code is to do with evaluating server responses and making appropriate choices). A web interface is probably not the best solution, as this still requires manual intervention and, unless replicating much of the wikipedia interface, does not give the same amount of control. I'm curious to learn more about what JavaScript can do, e.g. monobook.js, but I don't really have the time, as I'm writing up my PhD. The ideal solution would be a locally run bot that knows how to login and edit pages. I think I saw a piece of perl code that does this (the script is indeed written in perl), but don't have it at my fingertips. As for the web interface, I've never written anything like that before. I assume one would need to use mod_perl? The next problem is that I don't have an always-on server, and I need my laptop for other stuff. So for the moment, the only way to do the reference conversion is to let me know which articles need it (should I set up a page for this?) so I can do it sort-of-manually. - Samsara (talkcontribs) 21:06, 23 July 2006 (UTC)
I wrote my tool using PyWikipedia, which is precisely what you mention: a library API to interface with Wikipedia. I know there is a Perl library that does the same thing. Actually, from my brief look, I tend to think the Perl API is a bit better designed (I'm a big Python guy, but wasn't so happy with the design of that particular library). As you'd expect, these libaries (I think ones for a few other programming langauges too), provide facilities to retrieve pages, update them, etc.
You probably noticed what I did with Citation Tool (I assume so, since you added a link there to your tool): the web interface lets you check a page, then provides suggestions for how to modify it. The particular modification made is different in your tool, but the concept could be almost exactly the same. I.e. fetch a page, touch up raw URL links, propose a modified version of the page text to users, optionally edit WP itself using the proposed changes. I'd be happy to host your tool at the same place as Citation Tool; in fact, it would be ideal if a button or two let you perform both sets of improvements on a given article. If you want that, we'd have to work out how you could upload improvements to the code, since my shell account is... well, my secret password. I have a little web interface to let people upload files (i.e. data or archives they want to share); I could tweak that to let you upload actual code to a live location.
Btw. I tend to think that if you think your code is valuable in a way that argues for keeping it secret, you're thinking about things in the wrong way. That's not anything about your specific code, it's just how I perceive the world to be. You gain a lot more by publicizing your coding skills than by keeping them under wraps: what your conversion does is worthwhile, but hardly any huge conceptual novelty (I know perfectly well how to write the same thing, for example, it's just a matter of spending the time). LotLE×talk 22:43, 23 July 2006 (UTC)
Oh, it's not the front end that I'm saying is valuable. I'm quite sure you could write that in a jiffy, especially since you had that on your road map. It's the http retrieval backend that took hundreds of hours of testing. And I doubt anybody will hug me for that one because all it does is just work. Microsoft understands very well that you first have to write buggy software in order for people to appreciate the improvement when it happens. - Samsara (talkcontribs) 22:43, 20 August 2006 (UTC)

[edit] Private ability

This edit comment sounded rather snippy: yes, you're right. following your attempts to pressurise me into releasing the code, it will not be released. self-fulfilling prophecy..

But in any case, I didn't feel it too relevant to "advertise" on User:CitationTool that you might have some private code on your own computer that other users can neither obtain nor utilize through a public interface. It has that sort of Fermat "I can't fit it in this margin" feel to it. Likewise, I have some wonderful code on my computer that remove all the POV language from an article, but it's not something I want to share... :-) (OK, not really, but the claim of privately having code is not very productive). LotLE×talk 13:37, 21 August 2006 (UTC)

I'm happy for you to think what you want. - Samsara (talkcontribs) 14:52, 21 August 2006 (UTC)

[edit] Pages needing attention

If you'd like to help me out on a page I watch that has way too many bare refs that could use some web-crawling mojo, I'd love it if you'd give a try at Ward Churchill, Ward Churchill 9/11 essay controversy and Ward Churchill misconduct allegations (the main bio I've gone through manually). LotLE×talk 02:22, 26 August 2006 (UTC)

Looks like you fixed those already. - Samsara (talkcontribs) 09:23, 26 August 2006 (UTC)

Sorry, I should have followed up more quickly. After I commented, I wasn't sure your schedule, or whether you had seen my note, so I went ahead and wrote the code to do the fixes. My quick bang-out (a bit over an hour, with testing) just looks for "raw links" of the form <ref>http://example.com/</ref>. But adding a look for the [http://example.com] style would be simple enough. I also have not yet integrated it with User:CitationTool. For first pass, it just works on an article that I have saved (as Wikitext) to my local harddrive; that would be easy to change too, of course. I'd also like to do something with PDFs references rather than only with HTML. PDF should have the needed meta-data in most cases.

FWIW, I also threw in a bit of extra capability: specifically, I automatically remove duplicate "raw links" that point to the exact same place (and use the same footnote multiple times; but this requires random ref names, not meaningful ones). Anyway, here's what I did:

[edit] Quick URL lookup/titlification code

#!/usr/bin/env python

import re, sys, urllib, random
from exceptions import *

exts = 'html htm HTML htm php jsp asp'.split()

fname = sys.argv[1]

content = open(fname).read()
# Fix any <ref>[http://example.com]</ref> semi-bare URLs
content = re.sub('(<ref>\[)(http://.*?)(]</ref>)', r'<ref>\2</ref>', content)
urls = re.findall('(?<=<ref>)http://.*?(?=</ref>)', content)

dups, nodups, last = [], [], ''
urls.sort()
for url in urls:
    if url != last:
        nodups.append(url)
    else:
        dups.append(url)
    last = url
    
subs = []    
for url in nodups:
    parsable = False
    for ext in exts:
        if url.endswith(ext):
            parsable = True
            break
    if not parsable:
        continue
    try:
        link_html = urllib.urlopen(url).read()
    except IOError:
        link_html = "<title>DEAD LINK</title>"
    titles = re.findall('(?<=<title>).*?(?=</title>)', link_html)
    if len(titles) == 0:
        print url, "Title not found"
    elif len(titles) == 1:
        print url, titles[0]
        subs.append((url, titles[0]))
    else:
        print url, "Multiple titles ?!"

# Give named refs to any dups
print 'DUPS:'
for url in dups:
    randid = random.randrange(1000,9999)
    content = content.replace('<ref>'+url,'<ref name="X%s">%s'%(randid, url))
    print url
   
# Substitute any located titles
print 'SUBS:'
for (url, title) in subs:
    content = content.replace(url, '[%s %s]' % (url, title))
    print url, title

# Write enhanced content
open(fname,'w').write(content)