Search engine indexing

From Wikipedia, the free encyclopedia

Search engine indexing covers how data is collected, parsed, and stored to facilitate fast and accurate retrieval. Search engine design incorporates concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. Within the context of search engines designed to find web pages on the Internet, the most popular incarnation, alternate names for the indexing process are web indexing and Internet indexing.

Indexing

The goal of storing an index is to optimize the speed and performance of finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. Whereas an index of 1,000 documents can be queried within milliseconds, a raw scan of 1,000 documents could take hours (as an example). No search engine user would be comfortable waiting several hours to get search results. The trade-off for the time saved during retrieval is the additional storage required to store the index and the considerable time required to keep it updated.

The Inverted Index

Many search engines incorporate an inverted index when evaluating a search query to quickly locate the documents which contain the words in a query and rank these documents by relevance. The inverted index stores a list of the documents for each word.[1] The search engine can retrieve the matching documents quickly using direct access to find the documents for a word. The following is a simplified illustration of the inverted index:

Inverted Index
Word  | Documents
the   | Document 1, Document 3, Document 4, Document 5
cow   | Document 2, Document 3, Document 4
says  | Document 5
moo   | Document 7

The above figure is a simplified form of a Boolean index. Such an index would only serve to determine whether a document matches a query, but would not contribute to ranking matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of the word in each document.[2] With position, the search algorithm can identify word proximity to support searching for phrases. Frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.

The inverted index is a sparse matrix, given that words are not present in every document. It is stored differently from a two-dimensional array to reduce memory requirements. The index is similar to the term-document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically distributed.[citation needed]
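
As an illustration of the hash table approach, the following is a minimal Python sketch of an in-memory inverted index that also records word positions; the names (`build_inverted_index`, `docs`) are illustrative and not taken from any particular search engine.

    from collections import defaultdict

    def build_inverted_index(documents):
        """Build a hash-table (dict) based inverted index.

        `documents` maps a document identifier to its list of tokens.
        The index maps each word to the documents containing it, along with
        the positions of the word within each document, which supports the
        phrase and proximity searches described above.
        """
        index = defaultdict(lambda: defaultdict(list))
        for doc_id, tokens in documents.items():
            for position, token in enumerate(tokens):
                index[token][doc_id].append(position)
        return index

    # A toy corpus mirroring the simplified illustrations in this article.
    docs = {
        "Document 1": ["the", "cow", "says", "moo"],
        "Document 2": ["the", "cat", "and", "the", "hat"],
    }
    index = build_inverted_index(docs)
    print(dict(index["the"]))  # {'Document 1': [0], 'Document 2': [0, 3]}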

Index Merging

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. A merge involves identifying the document or documents to add to or update in the index and parsing each document into words. For technical accuracy, a merge combines newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.

After parsing, the indexer adds each document to the document list for the words it contains. The process of finding each word in the inverted index in order to denote that it occurred within a document may be too time-consuming when designing a larger search engine, and so this process is commonly split into the development of a forward index and a separate process which sorts the contents of the forward index into the inverted index. The inverted index is so named because it is an inversion of the forward index.

The Forward Index

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index
Document    | Words
Document 1  | the, cow, says, moo
Document 2  | the, cat, and, the, hat
Document 3  | the, dish, ran, away, with, the, spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous processing, which partially circumvents the inverted index update bottleneck.[3] The forward index is sorted to transform it into an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.
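
A minimal sketch of this sorting step, assuming (as a simplification) that the forward index is held in memory as a list of (document, word) pairs; the `invert` function name is illustrative.

    from itertools import groupby

    def invert(forward_index):
        """Turn a forward index, a list of (document, word) pairs collated
        by document, into an inverted index by sorting the pairs by word
        and grouping the documents under each word."""
        pairs = sorted((word, doc) for doc, word in forward_index)
        return {
            word: sorted({doc for _, doc in group})
            for word, group in groupby(pairs, key=lambda pair: pair[0])
        }

    forward_index = [
        ("Document 1", "the"), ("Document 1", "cow"),
        ("Document 1", "says"), ("Document 1", "moo"),
        ("Document 2", "the"), ("Document 2", "cat"),
    ]
    print(invert(forward_index))
    # {'cat': ['Document 2'], 'cow': ['Document 1'], 'moo': ['Document 1'],
    #  'says': ['Document 1'], 'the': ['Document 1', 'Document 2']}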

Index Design Factors

Major factors in designing a search engine's architecture include:

  • How data enters the index - how words are added to the index during corpus traversal, and whether multiple indexers can work asynchronously. Traversal typically correlates to the data collection policy.
  • How to store the index data - whether information should be compressed or filtered
  • How quickly a word can be found in the inverted index. How quickly an entry in a data structure can be found, versus how quickly it can be updated or removed, is a central focus of computer science.
  • How to deal with index corruption, if bad data can be isolated, and the fault tolerance of the architecture

Challenges in Parallelism

A major challenge in the design of search engines is the management of parallel processes. There are many opportunities for race conditions and coherence faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks.

Producer-Consumer Model - consider that authors are producers of information, and a crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information and users are the consumers that need to search.
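
The following Python sketch illustrates the producer-consumer relationship using a thread-safe queue between a toy 'crawler' and a toy 'indexer'; all names and the two-document corpus are illustrative. The queue serializes hand-offs between the two tasks, which is one simple way to avoid the race conditions described above.

    import queue
    import threading

    def crawler(doc_queue, documents):
        # Producer: hands fetched documents off to the indexer via the queue.
        for doc in documents:
            doc_queue.put(doc)
        doc_queue.put(None)  # sentinel value: no more documents

    def indexer(doc_queue, index):
        # Consumer: tokenizes each document as it arrives and updates the index.
        while True:
            doc = doc_queue.get()
            if doc is None:
                break
            doc_id, text = doc
            for word in text.split():
                index.setdefault(word, set()).add(doc_id)

    doc_queue = queue.Queue()
    index = {}
    corpus = [("Document 1", "the cow says moo"), ("Document 2", "the cat and the hat")]
    producer = threading.Thread(target=crawler, args=(doc_queue, corpus))
    consumer = threading.Thread(target=indexer, args=(doc_queue, index))
    producer.start(); consumer.start()
    producer.join(); consumer.join()
    print(sorted(index["the"]))  # ['Document 1', 'Document 2']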

The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.

Compression

Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full-text Internet search engine.

  • An estimated 2,000,000,000 different web pages exist as of the year 2000[4]
  • A fictitious estimate of 250 words per webpage on average, based on the assumption of being similar to the pages of a novel.[5]
  • It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2 bytes per character[6][7]
  • The average number of characters in any given word on a page can be estimated at 5 (Wikipedia:Size_comparisons)
  • The average personal computer comes with about 20 gigabytes of usable space[8]

Given these estimates, generating an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would require storing 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2,500 gigabytes of storage space alone, far more than the free disk space of an average personal computer. This space requirement is further increased in the case of a fault-tolerant distributed storage architecture. Using compression, the index can be reduced to a fraction of this size, depending on which compression techniques are chosen. The trade-off is the time and processing power required to perform compression.
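
The arithmetic behind these figures, using the estimates listed above:

\[ 2\times10^{9}\ \text{pages} \times 250\ \text{words/page} = 5\times10^{11}\ \text{words} \]
\[ 5\times10^{11}\ \text{words} \times 5\ \text{bytes/word} = 2.5\times10^{12}\ \text{bytes} \approx 2{,}500\ \text{gigabytes} \]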

Notably, large scale search engine designs incorporate the cost of storage, and the costs of electricity to power the storage. Compression, in this regard, is a measure of cost as well.

Document Parsing

Document parsing involves breaking apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. For example, if the full contents of a document consisted of the sentence "Hello World", there would typically be two words found: the token "Hello" and the token "World". In the context of search engine design and natural language processing, parsing is more commonly referred to as tokenization, and sometimes word boundary disambiguation, tagging, text segmentation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

Natural language processing, as of 2006, is the subject of continuous research and technological improvement. Tokenization presents a host of challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementations of which are commonly kept as corporate secrets.

Challenges in Natural Language Processing

Word Boundary Ambiguity - native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the text of other languages such as Chinese, Japanese or Arabic represents a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).

Language Ambiguity - to assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as syntax varies among languages. Documents do not always clearly identify their language or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.

Diverse File Formats - in order to correctly identify what bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.

Faulty Storage - the quality of the natural language data cannot always be assumed to be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey the proper file format. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.
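
One simple safeguard, sketched below in Python, is to strip non-printable control characters before tokenization; the function name is illustrative and real indexers may handle such input differently.

    import unicodedata

    def strip_control_characters(text):
        """Remove non-printable control characters that can appear in faulty
        or binary-contaminated documents, while keeping ordinary whitespace
        so that word boundaries survive."""
        return "".join(
            ch for ch in text
            if ch in "\t\n\r " or unicodedata.category(ch)[0] != "C"
        )

    print(strip_control_characters("Hello\x00 World\x08"))  # Hello World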

Tokenization

Unlike literate human adults, computers are not inherently aware of the structure of a natural language document and do not instantly recognize words and sentences. To a computer, a document is only a big sequence of bytes. Computers do not 'know' that a space character between two sequences of characters means that there are two separate words in the document. Instead, humans develop a computer program that instructs the computer how to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly referred to as a tokenizer, parser, or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.

During tokenization, the parser identifies sequences of characters, which typically represent words. Commonly recognized tokens include punctuation, sequences of numerical characters, alphabetical characters, alphanumerical characters, binary characters (backspace, null, print, and other antiquated print commands), whitespace (space, tab, carriage return, line feed), and entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
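
As a minimal sketch of such a program, the following Python tokenizer uses a regular expression to emit each token along with its character position and case; the pattern and the recorded attributes are illustrative, not those of any particular search engine.

    import re

    # Tokens are runs of alphanumeric characters or single punctuation marks.
    TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]", re.UNICODE)

    def tokenize(text):
        """Yield (token, character position, case) triples for a document."""
        for match in TOKEN_PATTERN.finditer(text):
            token = match.group()
            if not any(ch.isalpha() for ch in token):
                case = "none"
            elif token.isupper():
                case = "upper"
            elif token.istitle():
                case = "proper"
            elif token.islower():
                case = "lower"
            else:
                case = "mixed"
            yield token, match.start(), case

    print(list(tokenize("Hello World!")))
    # [('Hello', 0, 'proper'), ('World', 6, 'proper'), ('!', 11, 'none')]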

Language Recognition

If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language, given that many of the later steps are language-dependent (such as stemming and part-of-speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Determining which language the words belong to may involve the use of a language recognition chart.
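
A toy illustration of the idea in Python, scoring a document against a few common function words per language; real systems typically rely on character n-gram statistics trained on much larger samples, and the word lists here are illustrative only.

    # Tiny, hand-picked samples of common function words per language.
    PROFILES = {
        "english": {"the", "and", "of", "to", "is"},
        "spanish": {"el", "la", "de", "que", "y"},
        "german": {"der", "die", "und", "das", "ist"},
    }

    def guess_language(text):
        """Return the language whose profile overlaps most with the text."""
        words = set(text.lower().split())
        scores = {lang: len(words & profile) for lang, profile in PROFILES.items()}
        return max(scores, key=scores.get)

    print(guess_language("the cow says moo and the cat is away"))  # english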

Format Analysis

Depending on whether the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain, in addition to textual content, formatting information. For example, HTML documents contain HTML tags, which specify formatting information such as whether to start a new line, display a word in bold, or change the font size or family. If the search engine were to ignore the difference between content and markup, this extraneous markup would also be included in the index, leading to poor search results. Format analysis involves the identification and handling of the formatting content embedded within documents which controls how the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, or text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary, with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include HTML, plain text, XML, and PDF.

Techniques for dealing with various formats include:

  • Using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format
  • Writing a custom parser (a minimal sketch of this approach follows this list)
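
As a sketch of the custom parser approach for HTML, the following uses Python's standard html.parser module to separate textual content from markup; the class name is illustrative, and a production parser would also need to handle scripts, styles, and malformed markup.

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect only the textual content of an HTML document,
        discarding the tags so that markup is not indexed as content."""

        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

        def text(self):
            return " ".join(chunk.strip() for chunk in self.chunks if chunk.strip())

    extractor = TextExtractor()
    extractor.feed("<html><body><b>Hello</b> <i>World</i></body></html>")
    print(extractor.text())  # Hello World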

Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document, which may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include ZIP and gzip.

Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content creators can manipulate the formatting information of a document to include additional content. Examples of abusing document formatting for spamdexing:

  • Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. hidden "div" tag in HTML, which may incorporate the use of CSS or Javascript to do so).
  • Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.

Section Recognition

Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side sections which do not contain primary material (that which the document is about). For example, this article may display a side menu with words inside links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and of the search results is degraded due to the mixed content and improper word proximity. Two primary problems are noted:

  • Content in different sections is treated as related in the index, when in reality it is not
  • Organizational 'side bar' content is included in the index even though it does not contribute to the meaning of the document, so the index holds a poor representation of its documents, assuming the goal is to capture the meaning of each document as part of providing quality search results.

Section analysis may require the search engine to implement the rendering logic of each document, essentially building an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. Viewers of web pages in web browsers see this content. If the search engine does not render the page and evaluate the JavaScript within it, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the noscript tag to ensure that the web page is indexed properly. At the same time, this fact can also be exploited to cause the search engine indexer to 'see' different content than the viewer.

Meta Tag Indexing

Specific documents offer embedded meta information such as the author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. During the early growth of the Internet, search engine technology (limited in part by the hardware on which it ran) would index only the keywords in the meta tags for the forward index, while still applying techniques such as stemming and stop word removal; the full document would not be parsed. At that time, full-text indexing was not as well established, nor was the hardware to support such technology. The design of the HTML markup language initially included support for meta tags for this very purpose of being properly and easily indexed, without requiring tokenization.[citation needed]

As the Internet grew (the number of users capable of browsing the web and the number of websites increased, and the technology for making and hosting websites improved), many brick-and-mortar corporations went 'online' in the mid-1990s and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive keywords to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified was leading to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Website designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research into full-text indexing technologies.

References

  1. ^ Black, Paul E., inverted index, Dictionary of Algorithms and Data Structures, U.S. National Institute of Standards and Technology Oct 2006. Verified Dec 2006.
  2. ^ Grossman, Frieder, Goharian. IR Basics of Inverted Index. 2002. Verified Dec 2006.
  3. ^ Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Stanford University. 1998. Verified Dec 2006.
  4. ^ Murray, Brian H. Sizing the Internet. Cyveillance, Inc. Pg 2. July 2000. Verified Dec 2006.
  5. ^ Blair Bancroft. Word Count:A Highly Personal-and Probably Controversial-Essay on Counting Words. Personal Website. Verified Dec 2006.
  6. ^ The Unicode Standard - Frequently Asked Questions. Verified Dec 2006.
  7. ^ Storage estimates. Verified Dec 2006.
  8. ^ PC Technology Guide: Guides/Storage/Hard Disks. PC Technology Guide. 2003. Verified Dec 2006.