User talk:Brent Gulanowski/Categorization

From Wikipedia, the free encyclopedia

OK, so, here are some specific questions:

  • What's the Colon? (Colon classification)
  • What's the UDC? (Universal Decimal Classification)
  • Is "LC" the "Library of Congress" categorization system?
  • Is "DDS" the Dewey Decimal System?
  • When you say "category: all" page, do you mean that there's a literal page? Or is that a symbolic name?
  • What do you propose to do for the 200K articles already existing?

I'm sorry if this sounds hostile or dismissive, but I've been reading categorization proposals till my face turns blue. Most propose the following features:

  • Articles are tagged with a category somehow. How to do this varies.
  • It's possible to do multiple categories per article.
  • Categories can have sub-categories.
  • Category pages have links to articles in that category.
  • Category pages have links to sub-categories of that category.
  • Articles have a special link to the category or categories that they belong to.

Some also have these features:

  • Articles in a sub-category don't get linked in that sub-category's "super-category".
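The feature list above amounts to a small data model, and a rough sketch may make it easier to compare proposals against it. Everything named here (`Category`, `page_links`, `categories_of`) is hypothetical and invented for illustration; it is not taken from any of the proposals or from the wiki software.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """Hypothetical category record; all field names are illustrative."""
    name: str
    # Articles tagged directly with this category.
    articles: set[str] = field(default_factory=set)
    # Sub-categories, keyed by name.
    subcategories: dict[str, "Category"] = field(default_factory=dict)

    def page_links(self) -> tuple[list[str], list[str]]:
        """Links a category page would show: (articles, sub-categories)."""
        return sorted(self.articles), sorted(self.subcategories)


def categories_of(article: str, root: Category) -> list[str]:
    """Names of every category, at any depth, that directly tags `article`."""
    found = [root.name] if article in root.articles else []
    for sub in root.subcategories.values():
        found.extend(categories_of(article, sub))
    return found
```

In this sketch an article tagged only in a sub-category does not show up in the links of the super-category page, which corresponds to the optional feature in the second list.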

I guess I'm wondering if you could pull this together into a meta:Categorization requirements article or something. I think we could use that to determine if concrete implementation proposals actually meet the requirements. --ESP 21:49, 17 Dec 2003 (UTC)

I will add links for the articles which describe the terms I have mentioned. I should move the related links to the top and make them required reading, perhaps. I am still not used to writing hyper-text. It ruins my train of thought to think about linking, and I forgot to re-read it. Please see the Library classification page and related pages, though. You did get two out of four! :-)
I am frustrated to hear that there are very many categorization proposals when all I have seen is a bunch of me-too requests to use Dewey Decimal or Library of Congress or whatever weak book-sorting method and a bunch of hyper-specific examples of desired categories on Wikipedia talk:Category schemes. I don't count those as proposals. The difficulty in finding pages about categorization is particularly ironic, but maybe I'm just insufficiently 1337.
Category:all -- before getting into this, which is maybe too fine-grained an issue, consider the wider question of whether a category should have one page or two pages. What I mean is that each category has two descriptions: its actual members (thus, a list), and a natural language summary, in other words an article. The two are logically distinct, but whether to make two pages (or three, see below) is an ergonomic question, if I can use that word. Anyway, here follows some more stuff, which I can either put into the article itself as is, or it can be debated and dissected here first.

Addenda to the Proposal

Category:All

All articles belong to the "all" category by definition. The main page, if it is the "category:all" page (which I implicitly assume in the article), is the summary article for the "all" category, as opposed to the member list for the "all" category. A third aspect of a category could be its meta-data, which is variable and not related to the categorization scheme. The purpose of meta-data is ergonomic. The meta-data is everything else you'd want on a page for a category -- I don't know how else to define it -- such as the side bar and control panel (search, user links, etc.). The main page is possibly unique in light of the additional content (including what could be called meta-meta-data if that term were not so awful and nearly meaningless), so I deliberately did not dwell on it.

Separation of Concept from Implementation

How pages are tagged should be irrelevant at the conceptual and design stages. I personally have no stake in that. What is more important is the logical process which leads to the definition of a category, which is why I am emphasizing an algorithm, and why I cannot over-emphasize it (from my position).

Whether the wiki has two or 200K articles should not matter to the algorithm, although I accept that it matters to a real implementation. It is possible to add constraints which make the algorithm more complicated but more flexible and more conducive to a real implementation, without changing the essential nature of the algorithm.

Default Categories

For example, one could define a set of default categories that are easy to generate automatically. The most obvious one is alphabetical ordering, but there could be hundreds of similar ones based on sorting (by date created, length, whatever). You don't have to use sorting and partitioning, though; that is just the easiest approach to implement or think about. (Partitioning, by the way, leads to a nice uniform tree of categories.)
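As a minimal sketch of the sort-and-partition idea, assuming article titles are plain strings and that each default category holds a fixed number of pages: the function name `partition_alphabetically` and the `size` parameter are made up for this example.

```python
def partition_alphabetically(titles, size):
    """Sort titles case-insensitively and cut them into chunks of `size`.

    Each chunk stands for one auto-generated default category.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    ordered = sorted(titles, key=str.casefold)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]
```

Sorting by another key (date created, length) would only change the `key` argument; the partitioning step stays the same.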

(It would be possible to define a set of default categories using links, and simplify by discarding any link which creates a loop. Either mark each page as categorized (if you want a tree), or not (if you want a DAG). A category could be defined as the set of pages reached from the first page in n links, or it could be the first n pages (assuming some kind of ordering of links). Link (or page) n+1 would be taken as the first in a new category. This is a much more complicated means of auto-generating a default set of categories. It would be interesting to try. Chances are it would not be much more useful than an alphabetical sorting for human use.)
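One possible reading of the link-based variant, sketched under several assumptions: links are followed breadth-first in the order they appear on each page, every page is marked as categorized the first time it is reached (the tree option), and a new category starts after every n pages. The name `link_categories` and the dictionary representation of links are hypothetical.

```python
from collections import deque


def link_categories(links, start, n):
    """Group pages reachable from `start` into categories of at most n pages.

    `links` maps a page title to the ordered list of titles it links to.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            # Ignore links to pages already categorized; this also
            # discards any link that would create a loop.
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # Page n+1 in traversal order becomes the first page of a new category.
    return [order[i:i + n] for i in range(0, len(order), n)]
```

Pages that cannot be reached from the starting page are left uncategorized in this sketch, which is one more reason it may be less useful than plain alphabetical sorting.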

Creation of New Categories

Whether one starts with all articles not categorized or all articles in system-generated default categories, the next step is to start producing useful categories. This happens one page at a time. Ideally you only allow one page at a time, although letting users loose on the system might prove problematic -- I leave that to the synchronization specialists. But say you have no default categories and all 200K articles are just sitting in "all" or even "none". By the algorithm as it stands, we immediately have to define a special "category" category. Whether this is safe by axiomatic set theory we might need a specialist to help us with. I don't know the implications of self reference in set theory, i.e.: can a set contain itself? Probably unimportant here. It might, however, be important to decide if a category is a purely logical entity or if it is merely a page in the "category" category. I'd say the former, since pages are in some sense generated as needed, and are distinct from articles themselves, correct?

Example of Creating a New Category

Given the "all" category, the "category" category, and two articles, one about each of these categories (being named, for now, "category:all" and "category:category"), both articles are immediately members of both categories. Thus, a member list page generated for either will include links to both. Category member list pages are not ur-entities (that is, not eligible for category membership themselves), only snapshots of the system at a moment in time. (Set theorists might take issue with this.) We are finally ready to add a new page to the category system. Let's say we have an article entitled "set theory" (or "backgammon" if you prefer). Article (set theory: ur-element) "set theory" ("backgammon") is added first to category "all". At this point the algorithm provides two options:

  • do nothing
  • move "set theory" ("backgammon") into a sub-category (set theory: sub-set) of "all" entitled, say "mathematics" ("games")

The algorithm only lets you move one article at a time into the new sub-category. You could choose to define a language for category management tasks which allowed the movement of lists of pages, and a user interface which allowed the selection of pages in order to construct a list interactively. Whatever, that's the implementation. Regardless, the article(s) so moved are still logically part of the parent category, although it is better to call the new sub-category a partition and to represent it with a token in any list of the members of the parent category (this is distinct from saying that the sub-category is itself a member of the parent category -- we're not allowing sets of sets here, for the moment).
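The move step can be sketched as follows, assuming a category keeps a set of directly listed articles plus a table of partitions (the tokens for its sub-categories). The class and method names are hypothetical, and this is only one way to model what the text describes.

```python
class Category:
    """Illustrative model: directly listed articles plus sub-category partitions."""

    def __init__(self, name):
        self.name = name
        self.direct = set()      # articles listed directly in this category
        self.partitions = {}     # sub-category name -> Category (the token)

    def add(self, article):
        self.direct.add(article)

    def move_to_subcategory(self, article, sub_name):
        """Move one directly listed article into a sub-category, creating it if needed."""
        if article not in self.direct:
            raise KeyError(f"{article!r} is not listed directly in {self.name!r}")
        self.direct.remove(article)
        if sub_name not in self.partitions:
            self.partitions[sub_name] = Category(sub_name)
        self.partitions[sub_name].add(article)
        return self.partitions[sub_name]

    def members(self):
        """All articles logically in this category, including those in partitions."""
        result = set(self.direct)
        for sub in self.partitions.values():
            result |= sub.members()
        return result
```

After `move_to_subcategory("set theory", "mathematics")` on the "all" category, "set theory" no longer appears in the direct list of "all" but is still returned by `members()`, because the "mathematics" token stands in for it.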

Implications of Default Categories (Autogenerated)

The system should probably auto-generate an empty category summary article or at least a token or reference. This might differ from how non-category articles that do not exist yet are handled.

If the articles are initially placed in auto-generated categories, it is immaterial whether those categories are "special" (say, permanent where others are not, or temporary where others are permanent). If there are ten pages per category, then you'll have 20K default categories for the 200K articles, so it's clear that the system had better be able to handle them efficiently, but I don't see why that would be a problem.

A more annoying problem of combining default category generation with category summary articles is that you end up with countless stub pages that will never be filled. It would make sense that auto-generated categories do not have auto-generated stub pages.

Prevention of Orphan Categories

What is a problem, in the case of multiple contributors defining categories, is the creation of bogus categories, but that's a management issue. The algorithm takes it for granted that contributors only define meaningful categories. What would be required is that an article which begins life in a default category has to be put into an existing real category before it can be moved into a sub-category. So, the first article has to be added to the "all" category (nothing can be added to the "category" category). A stricter variation on the algorithm would involve forcing a new article to be added to sub-categories in sequence, starting at the "all" category, to ensure no orphan categories are created. Alternatively, it is enough to require that every new category has a parent category, which can be proved (by induction) to amount to the same thing. One trusts that contributors will not simply make every new category an immediate child of "all".
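The parent requirement is easy to state in code. This is a sketch only, assuming each category records a single parent and that "all" is the one category without a parent; `CategoryTree`, `create`, and `path_to_root` are invented names.

```python
class CategoryTree:
    """Illustrative registry that refuses to create orphan categories."""

    ROOT = "all"

    def __init__(self):
        # Maps each category name to its parent; only the root has None.
        self.parent = {self.ROOT: None}

    def create(self, name, parent):
        """Create a category under an existing parent, or raise ValueError."""
        if name in self.parent:
            raise ValueError(f"category {name!r} already exists")
        if parent not in self.parent:
            raise ValueError(f"parent category {parent!r} does not exist")
        self.parent[name] = parent

    def path_to_root(self, name):
        """Chain of categories from `name` up to the root."""
        path = [name]
        while self.parent[path[-1]] is not None:
            path.append(self.parent[path[-1]])
        return path
```

Because a category can only be created under one that already exists, and the first existing category is "all", every chain returned by `path_to_root` ends at "all". That is the induction argument in miniature.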

Viewing Membership of a Category

Exactly how a list of members is maintained in the implementation is not especially relevant. Likewise, it is not important what the maximum number of categories per article is set at. What is important is that an article cannot be listed as a member of both a category and one of its sub-categories -- membership in any category implies membership in its parent. If an article somehow made it into the lists of both the parent and the child, it would in fact be in the parent twice, which is a violation of the nature of a set. Keep in mind that we are using a token of some kind to represent the sub-category as a partition of the parent -- that token (a pointer) literally is those members. It is important to distinguish this state of affairs from the idea of a record or structure in a programming language like C -- if a struct contains a struct as a member, the members of the second struct are not members of the first struct.
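To illustrate the token idea, here is a small sketch that expands a category into its full membership and rejects an article listed both in a category and somewhere beneath it. The dictionary layout (`"articles"`, `"subs"`) and the function name `expand` are assumptions made for the example.

```python
def expand(tree, name):
    """Return the set of all articles in category `name`.

    `tree` maps a category name to {"articles": [...], "subs": [...]}.
    Each sub-category token is replaced by the members it stands for.
    """
    node = tree[name]
    direct = set(node["articles"])
    inherited = set()
    for sub in node["subs"]:
        inherited |= expand(tree, sub)
    # An article listed here and in a sub-category would be in this
    # category twice, which a set does not allow.
    clash = direct & inherited
    if clash:
        raise ValueError(f"listed in {name!r} and in a sub-category: {sorted(clash)}")
    return direct | inherited
```

Unlike a nested C struct, whose inner fields belong only to the inner struct, the sub-category here contributes its members directly to the parent's membership.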

-- Brent Gulanowski 07:45, 18 Dec 2003 (UTC)