Wikipedia talk:Link intersection

This proposal is still under construction. Feel free to jump in. -- Samuel Wantman 10:03, 18 March 2007 (UTC)

Two things to think about... first, probably not all links are meant as tags; and second, this would likely incur the same performance hit that prevents category intersection from working on enwiki. But yes, better-searchable metadata or tags would be an improvement. >Radiant< 07:48, 20 March 2007 (UTC)

A question on technology

About the problem of boolean functions in categories (Canadian films + Drama films = Canadian drama films): Is it really a software limitation, or a problem of server load? Has anyone researched the consequences in terms of server load? Hoverfish Talk 05:06, 7 April 2007 (UTC)

There has been some research on category intersection in regard to the server load, and there would be a problem if the largest categories are intersected. I think there are ways around this problem, and in discussions with User:Rick Block and User:Radiant! we identified some ways to limit the impact on the servers. As for the wikilinks, there are some articles that have a truly huge number of pages linking to them. I believe the record is held by United States, which has hundreds of thousands. The resulting server load problem could be easily dealt with by the software. One possibility is to just gray out any link that is too big. If you can't choose "United States", you might be able to choose Albany, New York. A better way would be to analyze the request and reject those that might cause too much server load. There shouldn't be a problem intersecting links if at least one of them is small. This speeds up the process since only the small set has to be checked for membership in the very large set. This, I would think, would normally be the case with link intersection. -- Samuel Wantman 06:34, 7 April 2007 (UTC)
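
As a rough illustration of the point about small sets, here is a minimal Python sketch. It assumes the backlink lists are already available as in-memory sets, which is not how MediaWiki stores them. The function name and the max_smallest cap are hypothetical, invented only for this example.

```python
def intersect_backlinks(backlink_sets, max_smallest=5000):
    """Intersect backlink sets by walking only the smallest one.

    backlink_sets: list of sets of page titles (hypothetical in-memory data).
    max_smallest:  hypothetical cap; the request is rejected when even the
                   smallest selected set is larger than this.
    """
    if not backlink_sets:
        return set()
    # Order by size so the smallest set drives the loop.
    ordered = sorted(backlink_sets, key=len)
    smallest, others = ordered[0], ordered[1:]
    if len(smallest) > max_smallest:
        raise ValueError("request rejected: every selected link is too large")
    # Only members of the small set are tested against the larger sets.
    return {title for title in smallest if all(title in s for s in others)}
```

With set lookups, the work grows with the size of the smallest set rather than the largest, so a link such as United States would only be costly when every other selected link is also huge.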

Great idea!

I use "What links here" all the time, but get the impression many people don't. Link intersection would be a great idea. Would it involve filtering sets of "what links here" lists? That would be a really great idea. I've manually done this in the past, and it does produce some interesting results. Carcharoth 18:31, 10 April 2007 (UTC)

interesting

I think this would be a great improvement over "what links here", which for popular articles is simply unusable as a practical matter (especially as I can't seem to find a search/sort feature for it). I think this would also be useful to track down articles with duplicate/near-duplicate content for merging. Wl219 07:48, 8 May 2007 (UTC)

The size of what-links-here

Using the examples in the project page itself: bridge has over 500 mainspace backlinks (so many that I couldn't get them all in one api.php database query), and suspension bridge has 412, just below the limit for a query (bots are allowed larger queries, but that isn't really relevant here because this feature would be used by people who don't really know what they're doing from a technical point of view, rather than bots which are assumed to take server load into consideration). This is likely to lead to higher server load than predicted if implemented in the most obvious manner. There's been some discussion on Wikipedia talk:Category intersection about using a fulltext index (which would work in a similar way to the 'search' function); the website given there as an example doesn't seem to be working at the moment, but it apparently did work once. --ais523 08:27, 25 June 2007 (UTC)
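
For anyone wanting to reproduce the counts, here is a hedged Python sketch of paging through backlinks. The parameter names (list=backlinks, bltitle, blnamespace, bllimit, blcontinue) are those of the current MediaWiki API; the continuation format in 2007 was different, so the "continue" handling is an assumption about a modern server. The fetch argument is a hypothetical callable that takes the request parameters and returns the decoded JSON response; no HTTP code is shown.

```python
def all_backlinks(title, fetch, limit=500):
    """Collect every mainspace backlink to `title`, one batch at a time.

    fetch: hypothetical callable(params) -> decoded JSON dict from api.php.
    limit: batch size per request; 500 is the usual cap for non-bot users.
    """
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "blnamespace": 0,   # main (article) namespace only
        "bllimit": limit,
        "format": "json",
    }
    titles = []
    while True:
        data = fetch(params)
        titles.extend(page["title"] for page in data["query"]["backlinks"])
        cont = data.get("continue")
        if not cont:
            return titles
        # Merge the continuation values into the next request.
        params = {**params, **cont}
```

Each loop iteration is one request, so a page with more backlinks than the batch limit costs several queries before any intersection can start.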

I'm not understanding this concern. Let's say that a page has 1000 backlinks. I can't imagine that someone will seriously want to find similar pages with anything close to that number. The links in the list all start out unchecked. You have to check off the ones you are interested in. (Is this clear in the description?) I was thinking that typically, there'd be just a small handful of links checked. Perhaps an extreme case would be 20 or 30. The more links that are checked off, the smaller the intersection set becomes. At some point the search becomes pointless since nothing will be returned. Having more links checked off could also speed up the process. I don't know if there is a count of backlinks that is being maintained for each page. I'm assuming there is. If you sort the checked-off links by the number of backlinks to each page, it would be quicker to do the query, because the set of pages common to all the checked links cannot be larger than the number of backlinks to the page that has the fewest. So the more links that are checked, the faster the resulting intersection shrinks to nothing. Even if this were a problem, the interface could set a limit on the number of links checked. Something on the order of 20 links seems sufficient. -- SamuelWantman 22:55, 25 June 2007 (UTC)
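
The ordering argument above can be sketched in Python as follows. This is only an illustration under stated assumptions: the backlinks dictionary stands in for whatever the database would return, and MAX_CHECKED_LINKS is a hypothetical constant based on the "order of 20" figure.

```python
MAX_CHECKED_LINKS = 20  # hypothetical interface limit

def intersect_checked(checked, backlinks):
    """Return the pages that link to every checked title.

    checked:   list of titles the reader ticked.
    backlinks: dict mapping a title to the set of pages linking to it
               (hypothetical in-memory stand-in for the link table).
    """
    if len(checked) > MAX_CHECKED_LINKS:
        raise ValueError("too many links checked")
    if not checked:
        return set()
    # Fewest backlinks first: the result can never be larger than this set.
    ordered = sorted(checked, key=lambda title: len(backlinks[title]))
    result = set(backlinks[ordered[0]])
    for title in ordered[1:]:
        result &= backlinks[title]
        if not result:
            break  # nothing left in common, so the remaining links are skipped
    return result
```

The early break is where checking more links helps: once the running intersection is empty, the larger sets are never touched.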

A previous, perhaps related discussion

Note: The following is reposted from User talk:John Broughton/Editor's Index to Wikipedia, with some minor deletions that aren't on point. -- John Broughton (♫♫) 01:44, 26 June 2007 (UTC)

Original question from User:Teratornis

Speaking of the index, which I also like, this makes me realize there is no built-in indexing feature in MediaWiki. Such a feature would be nice to have. For example, I would like to automatically generate index pages for all the (main namespace) articles categorized under some particular category. As I'm sure you know, this sort of thing is taken for granted in technical publishing software such as DocBook (see Making an Index). I did a cursory Google search on Meta without finding much. This might be semi-related: m:Help-style indexing. --Teratornis 20:56, 6 February 2007 (UTC)

Reply from User:John Broughton

Meta and automatic keyword generation

What you found at m:Help-style indexing (and I'd never known about) was a built-in index (of sorts) for meta help pages, using keywords. For example, for m:Help:DPL, if you look at the source, you'll find the following:

<meta name="keywords" content="Help:DPL,Administration,Advanced templates,Array,Calculation,Cascading style sheets,Category,Common words, searching for which is not possible,Contents,Deleting a page,Diff" />

Looking at Help:Edit summary, this is in the source:

<meta name="keywords" content="Help:Edit summary,Contents,A quick guide to templates,Calculation,Category,Diff,Dummy edit,Edit conflict,Edit toolbar,Editing,Editing shortcuts" />

And looking at a very recent policy, Wikipedia:Canvassing, which has no antecedent at meta, this is in the source:

<meta name="keywords" content="Wikipedia:Canvassing,Canvassing,WP:CANVAS,WP:CANVASS,Administrators' noticeboard,Consensus,Ignore all rules,Multiposting,Policies and guidelines,Requests for arbitration/Guanaco, MarkSweep, et al,Requests for arbitration/IZAK" />

Where do these keywords come from? From wikilinks; they are automatically created by stripping off "Wikipedia:". (The software is smart enough to also strip off the front of a full URL when a URL is used rather than a wikilink in the text.) In fact, keywords are generated for every page in this wiki, I believe, based on my looking at a regular article and at my user talk page, though the rules appear to be different for different types of pages.
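
To make the described behaviour concrete, here is a small Python sketch of deriving keywords from wikilinks. It is a guess at the observable effect, not MediaWiki's actual code: the function names, the regular expression, and the limit are all hypothetical, and the URL-stripping case is not handled.

```python
import re
from html import escape

# Captures the link target of [[Target]] or [[Target|label]], stopping at | or #.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def keywords_from_wikitext(page_title, wikitext, strip_prefix="Wikipedia:", limit=11):
    """Return the page title followed by de-duplicated link targets.

    strip_prefix and limit are assumptions made for this illustration.
    """
    keywords = [page_title]
    for match in WIKILINK.finditer(wikitext):
        if len(keywords) >= limit:
            break
        target = match.group(1).strip()
        if target.startswith(strip_prefix):
            target = target[len(strip_prefix):]
        if target and target not in keywords:
            keywords.append(target)
    return keywords

def meta_tag(keywords):
    """Render the keyword list in the form seen in the page source."""
    content = escape(",".join(keywords), quote=True)
    return '<meta name="keywords" content="%s" />' % content
```

For example, passing the title "Wikipedia:Canvassing" and text containing [[Wikipedia:Consensus|consensus]] and [[Wikipedia:Ignore all rules]] gives a list that starts with the page title and continues with "Consensus" and "Ignore all rules".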

So, what next?

But are keywords used for anything that a normal editor might encounter? I can't find any indication that they are. A search of Wikipedia namespace found only one thing vaguely related to keywords, this very unusual WikiProject, which survived two deletion attempts (in the first, no one voted; in the second, a couple of users said - essentially - "I have no idea what this is, but it could be useful.") (Related page: User:Tractor.) And while the founder and sole member of that WikiProject is aware that source pages include keywords, he apparently isn't aware of their potential power (or has a totally different focus).

So, to summarize, we have (a) automated keyword generation; (b) an existing feature in meta that I'm guessing was designed for programmers looking through "m:Help" files, which uses keywords found on a subset of meta pages; and (c) nothing else, apparently, that takes advantage of these (except, possibly, outside search engines?). -- John Broughton (☎☎) 23:21, 6 February 2007 (UTC)

Listed at bugzilla

This feature request has been posted on bugzilla as bug 10497. -- SamuelWantman 21:54, 7 July 2007 (UTC)

Maybe someone should prod a friendly developer to actually look at the bugzilla request? Carcharoth (talk) 23:43, 21 February 2008 (UTC)