Talk:Inverted index

From Wikipedia, the free encyclopedia

Maybe the content of Full inverted index should be moved here? This page could also be expanded with some small examples for a full inverted index and an inverted file index, as seen in [1] -- Nils Grimsmo 22:12, 18 November 2005 (UTC)

This has now been done. -- Nils Grimsmo 20:25, 13 December 2005 (UTC)

Contents

[edit] Why is it called "Inverted"

I've been wondering, why is this data structure called an inverted index? After all, it's exactly like a regular index in a book, where you look up a word and find a list of pages that mention this word. It's not "inverted" in any sense from the normal definition of an index...

[2] tries to explain that

The index is "inverted" in the sense that the key value is used to find the record rather than the other way round

but I don't understand this explanation, because this is how all indices work...

I could understand a different name: "inverted text". Whereas a text (treated as a bag of words) is defined by whether or not each word in the dictionary is in it, we can "invert" this information, and define for each word whether or not each file contains it. Instead of a file structure, we have an "inverted file" structure, which allows for quicker searching of words. Could it be that the phrase "inverted index" was coined by starting with the phrase "inverted-text index" and then dropping the word text? Or something of this sort?

Or is there a different explanation for the word "inverted" that appears in the name of the concept "inverted index"?

Nyh 14:04, 26 December 2005 (UTC)

I have never read anything about the naming, but I would guess that the phrase comes from the term "inverted files". As the concept of the term "file" narrowed over time, and the concept of such an index broadened, I guess some new term was needed, and "inverted index" was chosen because it was close to "inverted files". -- Nils Grimsmo 17:31, 26 December 2005 (UTC)
  • Consider the contents of the related article on search engine indexing, where I describe one reason for why it is called inverted. Josh Froelich 16:37, 15 December 2006 (UTC)

[edit] External links

The link to "inverted index for search engine" does not appear to be related or useful. Perhaps it should be removed. --131.107.65.116 23:08, 8 September 2006 (UTC)

[edit] Current state of the article

I think this article is rather unstructured and messy. It looks like to separate articles concatenated. A full rewrite would perhaps be nice? Klem fra Nils Grimsmo 20:55, 18 October 2006 (UTC)

  • Agree, it looks more like an excerpt from a textbook. I was planning to propose a rewrite when time permits (which will be after the weekend). --airborne 20:46, 19 October 2006 (UTC)

[edit] inverted list/file database

This article should also mention, even before the webcrawling stuff, that the technique was used in some database management systems such as ADABAS, DATACOM/DB, and Infodata's Inquire. According to Pratt, Philip J.; Adamski, Joseph J. (1987). DATABASE SYSTEMS: Management and Design. Boston: Boyd & Fraser Publishing Company, p. 465. ISBN 0-87835-227-9. , many Relational database management systems use this structure for physically storing secondary keys.

I've gathered some more stuff:

http://www.pcmag.com/encyclopedia_term/0,2542,t=inverted+list&i=45332,00.asp "inverted list" points to "inverted file", the def of which follows:

« In data management, a file that is indexed on many of the attributes of the data itself. For example, in an employee file, an index could be maintained for all secretaries, another for managers. It is faster to search the indexes than every record. Also known as "inverted lists," inverted file indexes use a lot of disk space; searching is fast, updating is slower. »

And:

  • Zhang, Hong-Chao; Alting, Leo (1994). "Impact of

computer integrated manufacturing", Computerized Manufacturing Process Planning Systems. Springer, p. 55. ISBN 0-412-41300-0. “An inverted list database is similar to a relational database, but a relational database is at a low level of abstraction, at which the stored tables themselves, and also certain access paths to those stored tables, are directly visible to the user. In general, the inverted list database has the following three significant differences from the relational database:

  • The rows of an inverted list table are considered to be ordered in some physical sequence
  • An ordering may also be defined for the total database in the form of a database sequence
  • For a given table, any number of search keys can be defined in an arbitrary field or combination of fields 

Then:

  • Kaufmann, Helmut; Schek, Hans-Jörg, Swiss Federal Institute of Technology (ETH Zürich) Dpt of Computer Sciences (1995). "Text Search Using

Database Systems Revisited -- Some Experiments", in Carole Goble & John Keane (eds.): [http://books.google.com/books?id=s5DzhBnGRTAC Advances in Databases: 13th British National Conference on Databases (BNCOD 13) July 12 - 14, 1995. Proceedings (Lecture Notes in Computer Science)]. Manchester, United Kingdom: Springer, p. 206. ISBN 3540601007. “...The set of all document identifiers which contain a certain descriptor is called an inverted list or synonymously a posting list. The individual identifiers are called postings. The cardinality of a posting list is called the descriptor's cardinal frequency. Documents can be retrieved using a boolean retrieval method or a non-boolean (Salton 1975). While the former delivers a set of matching documents, the latter delivers an ordered ranking of documents, from which some prefix must be selected as the query's final result. In the current study, we only address boolean retrieval.

2.1 Indexing Documents using Inverted Lists

Today, inverted lists are the de facto standard for word-based indexing of full text documents [[[#Knu73|Knuth 1973]];Harman 1992], which support fast search for individual words in textual documents. In principle, an inverted list works as follows: Assume each document is represented as a tuple of a relation

Document(Document-ID1, Content)

Each document is assigned a number of keywords (e.g. by extracting the individual words) describing the contents of the document. Hence, a document is approximated by a tuple of a relation with a set-valued attribute Descriptor

DocumentApproximation(Document-ID, {Descriptor})

When we are searching for documents, we are interested in all documents containing a certain descriptor. Therefore, we introduce a relation with a set-valued attribute Document-ID

InvertedList(Desriptor, {Document-ID})

Here, we store for each descriptor all references to those documents which contain this descriptor, i.e. its posting list.

Notes:

  1. In the following we assume that all documents are identified by a unique number (1, 2, 3, ...)” 

(with references to:


--Jerome Potts (talk) 09:00, 8 January 2008 (UTC)