Wikipedia talk:Dead external links

Suggestion

This project is a really cool idea! Just one suggestion - the listings would be more useful for collaboration if they were simply posted to the wiki. That way, people could "click to test" and have handy links right to the broken pages. (Of course, the bot would need to exclude these listings from its next run.) There are too many listings to post all at once (or at least all on one page), but we can certainly start working on a chunk of them. I'm sure we could clear out some of the less-populated categories entirely. By the way, anyone can do this; you don't have to be Marumari. (I would do it myself if I weren't busy updating other such reports.) -- Beland 03:23, 5 October 2005 (UTC)

Thanks for the compliment! It shouldn't be hard (a very easy regex) to make the first field (article title) into a Wiki link. I'm gone for the next couple weeks, but I should be able to do that afterwards if nobody has taken up the call to do so. -- Marumari 19:45, 5 October 2005 (UTC)
I've done this for the 404 errors. Here's my script:
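#!/bin/bash
# Turn each tab-separated row into a numbered wiki list item; rows with only 4 fields have no link description.
AWKPROG='{desc=" "$3;code=$4;rpt=$5;if(NF==4){desc="";code=$3;rpt=$4;} printf "#[[%s]], [%s%s], %s %s\n",$1,$2,desc,code,rpt}'
mkdir -p 404s
# One output file per initial letter of the article title.
for i in {a..z}; do zgrep -i "^$i" 404-links.txt.gz | awk -F '\t' "$AWKPROG" > "404s/$i"; done
# Titles that do not start with a letter go into "misc".
zgrep -i -v '^[a-zA-Z]' 404-links.txt.gz | awk -F '\t' "$AWKPROG" > 404s/misc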
Lupin|talk|popups 12:50, 13 October 2005 (UTC)

301 redirects

I was trying to clean up 301 redirects and noticed that a number of pages link to the Google cache, and those links sometimes fail with a 404.

In some cases the cache link is used to get an HTML rendering of a DOC or PDF file.

Does anyone know the official wiki policy on using Google cache links? I feel that since Google cache links are often not valid, they should be discouraged. Pointing directly to the web page is a better idea.

301 redirects bot

I am planning to ask for permission to fix some of the 301 redirects. Obtaining the list of pages to be changed will be a manual process; after that, the bot will perform the changes.

One example is the 114 instances of http://www.ex.ac.uk/trol/scol/ccleng.htm .

Comments? Khivi 09:13, 13 October 2005 (UTC)

Fine with me. Would make it easier to see if the 301 errors point at pages that are 404. -- Marumari 11:47, 17 October 2005 (UTC)
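
A minimal sketch of the replacement step, assuming the manual pass has already produced an old-URL/new-URL pair. The function name replace_url and the target http://example.org/ccleng.htm are hypothetical; only the old URL comes from the example above. It performs a literal substring replacement, so it would also rewrite longer URLs that merely begin with the old one.

```bash
#!/bin/bash
# Replace every literal occurrence of one URL with another in text read from stdin.
# awk's index()/substr() are used instead of sed so that no URL character is
# treated as a regex metacharacter.
replace_url() {
  OLD_URL="$1" NEW_URL="$2" awk '
    BEGIN { old = ENVIRON["OLD_URL"]; new = ENVIRON["NEW_URL"] }
    old == "" { print; next }          # nothing to search for: pass the line through
    {
      out = ""
      while ((i = index($0, old)) > 0) {
        out = out substr($0, 1, i - 1) new
        $0 = substr($0, i + length(old))
      }
      print out $0
    }'
}

# Hypothetical new location; the real target has to be confirmed by hand first.
printf '%s\n' 'See [http://www.ex.ac.uk/trol/scol/ccleng.htm currency list].' |
  replace_url 'http://www.ex.ac.uk/trol/scol/ccleng.htm' 'http://example.org/ccleng.htm'
```

Keeping the URL pair as explicit arguments means the hand-checked list stays the only source of truth for what gets changed.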

Refreshes

How often do folks think I should update the listings? Perhaps I should just wait until Khivi creates his 301 bot?

Jumping about

Khivi, I appreciate your edits (to the 301 section). Can you please try to work in a "range" of links, instead of jumping about like that? It does make it much more difficult to read. Thanks!

404 Fixing Policy

What is the policy on fixing the 404 links? Should they just be deleted? Also, what is the policy on linking to the Internet Archive? Should one link to a specific version? For example, which of these is better:

  1. http://web.archive.org/web/*/http://jove.prohosting.com/~skripty/
  2. http://web.archive.org/web/20030214182904/http://jove.prohosting.com/~skripty/

Probably a lot of the linking to the Internet Archive could be (semi-)automated with some kind of bot?

I'm not sure there's a policy exactly. I usually make an attempt to correct the link or replace it with another link giving the same information (which can occasionally lead to quite a bit of digging). If that's not possible then deleting the link is still usually an improvement. Some of the links are fairly unnecessary so can just be deleted anyway. As long as you leave the article better than it was before then it's a good edit.
With regard to the web archive question, my feeling is that it would be better to link directly to a specific version. I'm basing that both on simplicity for the potential reader of the article and stability in that some of the different versions available might not contain the relevant information. --Spondoolicks 14:58, 28 October 2005 (UTC)
I guess you are right. Each case needs to be judged on its merits. I remember fixing one in which linking to the archive was a good idea, but for some others it seems not useful. Juliusross 00:00, 29 October 2005 (UTC)
I've changed the IA Wayback Machine URL in the dlw and dlw-inline templates to offer the most recent version of the page instead of a menu of all the page versions in the archive (change "/*/" to "/2/"). So far, the most recent version has been the only one I have needed to link to for every dead link I have found in the archive. --James S. 20:27, 18 January 2006 (UTC)
I would say that, at least half the time, a link to the newest version of a website in the Wayback Machine is entirely broken. So, definitely double-check, and don't just blindly use the template. -- Marumari 14:59, 19 January 2006 (UTC)
Oh, great. I've not seen that at all. Give me a few examples, please. --James S. 20:42, 22 January 2006 (UTC)
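
The URL forms discussed in this thread differ only in the path component between /web/ and the original address, which a small helper can make explicit. This is a sketch only; the function name wayback_url is hypothetical, and the URL layout is taken from the two examples above rather than from any Internet Archive documentation.

```bash
#!/bin/bash
# Build a Wayback Machine URL for an original URL.
# The optional second argument is the component placed between /web/ and the URL:
#   a 14-digit timestamp -> one specific capture
#   "*" (the default)    -> the menu of all captures
wayback_url() {
  local url="$1" stamp="${2:-*}"
  printf 'http://web.archive.org/web/%s/%s\n' "$stamp" "$url"
}

wayback_url 'http://jove.prohosting.com/~skripty/' 20030214182904
wayback_url 'http://jove.prohosting.com/~skripty/'
```

Passing 2 as the second argument gives the "/2/" form mentioned above; as noted in the replies, whichever form is used, the resulting page still needs to be checked by hand.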

404 cleanup question

Should we update the individual pages where the 404 linkrot entries are listed? With strikeout text? I am doing so, and see a couple of others have as well, but don't see where there are specific directions on 404 errors, unless I missed it somewhere. SailorfromNH 23:36, 26 November 2005 (UTC)

Personally I'd say just keep doing what you're doing as long as it's clear what's been checked so no-one duplicates the work. I'm updating the main project page to say where I've got to with the 404s beginning with P but striking out on the individual pages works too. --Spondoolicks 18:05, 29 November 2005 (UTC)
Updating with strikeouts is a good idea. I just got finished rechecking five that were already done, because someone didn't do the strikeouts. Oh well. --Coro 01:56, 13 December 2005 (UTC)
I think some get done independent of the project, when somebody happens by them. SailorfromNH 00:03, 15 December 2005 (UTC)

Time for an update? Again?

We're working on the list of dead links generated from the September 13th database dump here. A lot of the ones in this list will have been repaired by someone by now and also new dead links will have appeared. --Spondoolicks 18:05, 29 November 2005 (UTC)

The link checker is running now. With over a million links to check, it goes slowly. As soon as it is finished, I will clear the status fields, and upload the new files.
Done. -- Marumari 01:13, 4 January 2006 (UTC)

I am not clear where we are in the database dump cycle, but these lists are feeling a little stale. A lot of the ones in this list will have been repaired by someone by now and also new dead links will have appeared. Is it time for a new list? Open2universe 13:07, 24 February 2006 (UTC)

Question re link checker

I've noticed that quite a few of the links listed as 404 errors are of the form that goes to a section of the target page - e.g. http://www.hostkingdom.net/Holyland.html#Samaria (I'm sure there's some technical term for this type of link which I ought to know). I've tried a dozen or so of these and they all seem to be working fine so I was wondering if the link checker had not managed to process these correctly. --Spondoolicks 14:18, 11 January 2006 (UTC)

I'll take a look at it, and re-run the link checker if it's broken. For now, maybe just ignore links with anchors? -- Marumari 16:44, 11 January 2006 (UTC)
Okay, you're right - there is a bug where it requests the URL with the fragment included (i.e., #blah). As soon as there is a new database dump, I'll re-run the link checker, and get updated files up. Like I said, just ignore links with anchors for now.
This accounts for a lot of code 400s, too. Another possible bug: if there's an ampersand ('&') in the URL, it is stored in the database as an HTML entity (&amp;). Could it be that the bot does not translate entities back into characters? This would explain a number of not-an-errors I've seen. sendmoreinfo 20:33, 22 January 2006 (UTC)
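
Both suspected problems can be handled by normalizing each URL before it is requested. The sketch below is not the link checker's actual code; normalize_url is a hypothetical name, and only the &amp; entity is decoded (other entities are left untouched).

```bash
#!/bin/bash
# Prepare a URL taken from the database for an HTTP request:
#   1. turn the HTML entity &amp; back into a plain ampersand
#   2. drop the fragment (everything from the first '#'), which is never sent to the server
normalize_url() {
  local url
  url=$(printf '%s\n' "$1" | sed 's/&amp;/\&/g')   # \& is a literal ampersand in sed
  printf '%s\n' "${url%%#*}"
}

normalize_url 'http://www.hostkingdom.net/Holyland.html#Samaria'
```

With this in place, a link such as the Holyland.html#Samaria example would be checked as the plain page URL.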

Bot request to fix malformed links

I've just put a request on Wikipedia:Bot requests for a bot to fix those links which are not working because someone used the pipe symbol (|) thinking it was used the same as for internal links - e.g. [http://www.bbc.co.uk|BBC website]. At a rough estimate based on a small sample I'd say about 2% of the 404 errors are due to this mistake. --Spondoolicks 17:26, 16 January 2006 (UTC)
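
The pipe mistake described above is regular enough to be fixed mechanically. As a rough sketch (not the bot that was requested; fix_pipe_links is a hypothetical name, and only http, https and ftp links are assumed), a single sed substitution replaces the pipe that directly follows a bracketed URL with a space:

```bash
#!/bin/bash
# Rewrite [http://host/path|caption] as [http://host/path caption] in text read from stdin.
# The match must start with a single '[' followed by a URL scheme, so internal
# links such as [[Article|label]] are left alone.
fix_pipe_links() {
  sed -E 's#\[((https?|ftp)://[^] |]+)\|#[\1 #g'
}

printf '%s\n' '[http://www.bbc.co.uk|BBC website] and [[BBC|the BBC]]' | fix_pipe_links
```

Any line that changes would still deserve a quick look, since a URL can legitimately contain a pipe character.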

Great idea.
How about a bot which checks to see if the Wayback Machine has anything and, if so, inserts {{dlw-inline}}? I can see why that wouldn't be appropriate for {{dlw}} (because it's ugly) but I just made dlw-inline which I think would be appropriate for almost any situation where the Wayback Machine has something, except captionless links, of course; see below. James S. 06:40, 17 January 2006 (UTC)
Preventing all dead links: Would it be possible to have a bot which looks for external links within a wiki article, and then interfaces with the Wayback Machine to make sure that a copy of every referenced webpage is indeed archived there (or in a similar archive)? One example of this type of service (automatically archiving all referenced websites in an article) is WebCite, although that service does not guarantee maintaining the archived website if the publisher wishes to remove it. Another example is DSpace. I wish for a way to automatically and permanently archive the referenced online material. Then, as James S. mentions above, perhaps a bot could convert the dead links to access the archived version. KHatcher 19:59, 24 January 2006 (UTC)

What should we do with [captionless url] links?

Is there any guideline, tradition, or advice for dealing with dead URLs in square brackets by themselves without any caption text? I'm just replacing them with their Internet Archive link, when it yields results, without any further commentary such as is produced by {{dlw}} and {{dlw-inline}}. --James S. 06:23, 17 January 2006 (UTC)

Not that I know of. I'd say to just try to figure it out by the context, and barring that, either remove the link (if it is superfluous) or change the link to the internet archive. I've been adding captions and then a (Internet Archive) after the link, but there's no official procedure that I know of. -- Marumari 17:40, 18 January 2006 (UTC)

403 code

Am I blind...or does the page mention nothing about code 403 (Forbidden)? Bloodshedder 05:44, 4 February 2006 (UTC)

Uh Oh. It doesn't indeed, and as far as I can see, never did. Weird. sendmoreinfo 12:06, 5 February 2006 (UTC)
I'm sure there must be a reason for it. Lemme look tonight for something. -- Marumari 18:09, 28 February 2006 (UTC)
I can't think of a good reason myself...any updates on this? Bloodshedder 05:40, 15 March 2006 (UTC)

How should I repair dead news article links?

A lot of news sites don't keep their articles available for long and don't allow the Internet Archive to capture them, which results in a lot of non-repairable dead links. However, if the article was from a news agency like AP or Reuters then it will be available from many other places, some of which might have either permanently accessible articles or will be available from the Internet Archive. I've just replaced a dead link to an article in the Washington Post with a link to the same article which is still available at ABC News but I don't know how permanent that is. Does anyone know of a major news site which uses AP/Reuters reports and has permanent links? --Spondoolicks 17:43, 6 February 2006 (UTC)

If there is one, I've never been able to find it. -- Marumari 18:05, 28 February 2006 (UTC)

301 code count

The main page indicates that there are >30,000 of the 301-type errors. However, on opening the page I only get a count of slightly less than 12,000. Is this because the page is partitioned? The first one (/301) runs through the H's. Thanks. User:Ceyockey (talk to me)

I just downloaded it, and got over 30k entries. Perhaps your download got truncated accidentally? -- Marumari 18:07, 28 February 2006 (UTC)

Tips from experienced user

Hi, I was just hopping around looking for a project to work on and found this Link rot page. I am not sure if it is supposed to be too simple, but would it be possible for experienced contributors to list efficiency tips for newbies for this project? For example, this guide from Wikipedia:Disambiguation pages with links was quite useful when I started that (until I got bored). Ashish G 00:47, 1 March 2006 (UTC)

Re:Dead external links

Hello there, Do you think you would have time to regenerate the files for the dead external links project? Not that we finished them all, but they are feeling a little stale. Thanks so much Open2universe 12:21, 1 March 2006 (UTC)

I'd love to re-generate the external link list, but I can't do that until Wikipedia does another database dump. They haven't done a database dump since December 14th. Don't ask me why. -- Marumari 15:12, 1 March 2006 (UTC)
Never mind, it just looks like they moved the database dump page, and didn't tell me. Bah. I'll re-run the process soon. -- Marumari 15:14, 1 March 2006 (UTC)

Pages too large

I came here to remove a link to an AFD deleted page, but the page was over a megabyte long. I couldn't find the entry I wanted when I went into edit mode (Firefox doesn't search within text boxes) so I left it. == way too long. --kingboyk 10:28, 21 March 2006 (UTC)

Which page are you referring to? Many of the 404 pages are large so we have tried to break them down into sections for editing. When you select a section for editing it only brings up that section. I don't believe any of the sections are that large, but if so I will fix it. Open2universe 13:29, 21 March 2006 (UTC)
Wikipedia:Dead external links/301 is an incredible 1,212 kilobytes long. There is a post on VPT by a User:0plusminus0 having problems editing that page. --Ligulem 18:36, 20 September 2006 (UTC)
Ah, okay. The page doesn't link to that one. I will try to break it up. Open2universe 01:15, 21 September 2006 (UTC)

Dead links cited as sources

I was asked not to delete dead links to pages that listed population figures. I am wondering what folks think is the best way to handle this when the website cannot be found in the archive. I understand that the website was the original source, but I am reluctant to leave broken links. Should I leave the URL but not as a link? Or should I simply state that the link is unavailable? Any guidance is appreciated. Open2universe 14:57, 27 March 2006 (UTC)

Internet Archive mentioned but not WebCite

On Dec 7th I made some edits mentioning WebCite alongside the Internet Archive as a means of recovering broken links, in particular links that were prospectively archived with WebCite; unfortunately, these changes were reverted by another user as "spam/self-promotion". To avoid an edit war I will not re-revert, but I do ask that this matter be given serious consideration, and I am seeking support from the Wiki community. The Internet Archive and WebCite are not competitors; they complement each other, and both are non-profit. I do think that WebCite could help Wikipedia a lot in avoiding broken links in the first place (or in caching cited material so that it is recoverable).

BEFORE:

The 404 error is the most common symptom of link rot, and it indicates that the page has not been found. The 410 status code is similar, but indicates that the file has permanently gone. Such links are required by policy to be repaired, perhaps with a link to the Internet Archive. Wikipedia currently contains 31,913 status 404 links and 42 status 410 links.

MY SUGGESTED EDITED VERSION

The 404 error is the most common symptom of link rot, and it indicates that the page has not been found. The 410 status code is similar, but indicates that the file has permanently gone. Such links are required by policy to be repaired, perhaps with a link to WebCite or the Internet Archive. To link to WebCite, try http://www.webcitation.org/query?url=URL&date=DATE (replace URL with the url, the DATE is the cited date, and is optional. Chances are of course best to recover an archived version if the URL has been explicitly WebCited by the editor before it went dead). Wikipedia currently contains 31,913 status 404 links and 42 status 410 links.

There were other edits I made (which can be seen in the history) to include hints to the effect of avoiding 404s in the first place if all cited links would be cached prospectively using WebCite, for which somebody could write a bot (see Wikipedia:Bot_requests). I hope wikipedians will support the proposal to include hints to WebCite as well, or help in rephrasing how this should be done, and hopefully put these edits back in. I will withdraw myself from further discussions on this (except perhaps correcting factual errors in the subsequent discussion), but just want to throw the suggestion out there. --Eysen 14:39, 8 December 2006 (UTC)

The internet archive is years old with broad recognition and support. WebCite is one of a number of options at archiving specific versions of a page. If you would like to reference Web archiving here, that would likely be fine. However, I don't feel it appropriate for you to be adding numerous links to a service you are involved in operating. If the service is found to be valuable to the wikipedia community, there are plenty of editors uninvolved with the company who will find and add it. See WP:COI, Gunther Eysenbach, and Special:Contributions/Eysen for more context on my actions. I wish your well-intentioned service the best of luck! here 22:51, 8 December 2006 (UTC)