Overcategorization

Overcategorization, overcategorisation or category clutter is the process of assigning too many categories, classes or index terms to a given document. Wikipedia has developed a set of principles concerning overcategorization (Wikipedia:overcategorization). Interestingly, the concept seems not to appear in the literature of Library and information science (LIS), although it is clearly relevant for all kinds of document classification and indexing. In LIS some related concepts have been developed, for example exhaustivity of indexing and information overload, among others.

Basic principles

If too many categories are assigned to a given document, the implications for the users depend on how informative the links are. If the user is able to distinguish between useful and useless links, the damage is limited: the user only wastes time selecting links. In many cases, however, the user cannot judge whether or not a given link will turn out to be fruitful. In that case the user has to follow the link and read or skim another document. The worst case is, of course, that even after reading the new document the user is unable to decide whether or not it might be useful if its subject matter is thoroughly investigated.

Overcategorization has another unpleasant implication: it makes the system (for example Wikipedia) difficult to maintain in a consistent way. If the system is inconsistent, a user who considers the links in a given category will not find all the documents that are relevant to that category.

Basically, the problem of overcategorization should be understood from the perspective of relevance and the traditional measures of recall and precision. If too few relevant categories are assigned to a document, recall may decrease. If too many non-relevant categories are assigned, precision becomes lower. The hard job is to say which categories are fruitful or relevant for future use of the document.
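
The trade-off can be illustrated with a minimal sketch in Python. It assumes, purely for illustration, that the set of categories truly relevant to a document is known in advance, which is exactly what is hard to establish in practice. The function name precision_recall and the category names are hypothetical and are not taken from any real system.

```python
def precision_recall(assigned, relevant):
    """Return (precision, recall) for the categories assigned to one document.

    precision: share of the assigned categories that are relevant
    recall:    share of the relevant categories that were assigned
    """
    assigned, relevant = set(assigned), set(relevant)
    hits = assigned & relevant
    precision = len(hits) / len(assigned) if assigned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: the categories that really are relevant.
relevant = {"Indexing", "Library science", "Information retrieval"}

# Too few categories: everything assigned is relevant, but much is missing.
sparse = {"Indexing"}

# Too many categories: nothing is missing, but half the links are noise.
cluttered = relevant | {"Concepts", "Wikipedia", "Problems"}

sparse_pr = precision_recall(sparse, relevant)        # precision 1.0, recall 1/3
cluttered_pr = precision_recall(cluttered, relevant)  # precision 0.5, recall 1.0
```

In this sketch the sparsely categorized document keeps perfect precision but loses recall, while the overcategorized document keeps perfect recall but loses precision, because three of its six categories lead the user nowhere useful.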

See also