Talk:Search engine indexing

The goal of this topic

The goal is to provide an authoritative resource on the architecture, behavior, major processes and challenges in search engine indexing. This should be described for the general audience of the web, not tech nerds (such as myself!).

Editors, please refrain from adding commercial references. Many people first learned about search from Google and assume it is the best at everything and the model for how search must be understood. Even if that is so, this article must maintain an NPOV.

Everyone is invited to edit, and I would love the help.

TODO

  • fill out the list of references
  • correctly formatted references
  • add back in some of the content removed on the 9th, but in a correct fashion
  • remove annotational garbage and meet Wikipedia article standards
  • need to harmonize this with facts about other types of search engines. mention other indices like tries. mention other media types like audio, video, image. this is for full text, but mention partial text, nocache, metasearch and other search engine types. it is misleading (IMO) to portray this as the only way in which search engines index
  • come up with the rest of this todo list when there is time.
  • learn about and integrate with Technology template, Technology portal, other relevant templates or portals
  • Get rid of 'weasel words', where the article contains statements including 'generally speaking it is accepted that ...', 'most agree'. Replace these with factual references (I know they exist, just have to cite).
  • Remove all personal or opinionated content, or rephrase it to be neutral and factual.
  • harmonize with information extraction

Search engine sizes

This comes from http://blog.searchenginewatch.com/blog/041111-084221, so I have not included it in the article at the moment, as I do not want to do anything illegal and am not sure this is the best reference. The goal is to show the number of pages indexed, at least at some point in time, to help get a feel for the scale. Knowing the sizes involved in practice is important for understanding the technological challenge and the rationale behind the intense research in compression, forms of indexing and search engine architectures. Josh Froelich 16:44, 15 December 2006 (UTC)

Search Engine   Reported Size             Page Depth
Google          8.1 billion               101K
MSN             5.0 billion               150K
Yahoo           4.2 billion (estimate)    500K
Ask Jeeves      2.5 billion               101K+

Controlled vocab

  • maybe provide link to full text search topic, harmonize with its contents
  • explain controlled vocab style indexing, start lists, weighted lists, other techniques, which are indexed differently in the sense that a specialized inverted index is created that is not data driven. the keywords (in keyword-based controlled vocab searching) are like classes in a classification model, or like an associative array or map from keywords to specific full text terms/articles.
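
To make the associative-array idea in the second bullet concrete, here is a minimal Python sketch of a controlled-vocabulary index. It is only an illustration under assumptions of my own: the vocabulary entries, the sample documents and the names CONTROLLED_VOCAB and build_index are all hypothetical, and the tokenizing is deliberately crude. The point it tries to show is that the set of index keys is fixed in advance by the vocabulary rather than derived from the text.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: each approved keyword lists the
# full-text terms that map to it. These entries are illustrative only.
CONTROLLED_VOCAB = {
    "automobile": {"car", "cars", "auto", "automobile"},
    "aircraft": {"plane", "planes", "airplane", "aircraft"},
}

def build_index(documents, vocab=CONTROLLED_VOCAB):
    """Map each controlled keyword to the ids of documents that mention it."""
    # Invert the vocabulary once: full-text term -> controlled keyword.
    term_to_keyword = {term: kw for kw, terms in vocab.items() for term in terms}
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            # Crude normalization: strip trailing/leading punctuation.
            keyword = term_to_keyword.get(token.strip(".,;:!?"))
            if keyword is not None:
                index[keyword].add(doc_id)
    return dict(index)

# Hypothetical sample documents.
docs = {
    1: "The car was parked near the airplane.",
    2: "Planes and cars share some engine designs.",
    3: "A recipe for bread.",
}
index = build_index(docs)
print(sorted(index["automobile"]))
print(sorted(index["aircraft"]))
```

Words outside the vocabulary (such as "bread" in document 3) never become index keys, which is the sense in which this kind of index is not data driven.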

Notes on Wikipedia as an Example

I am considering adding an example of Wikipedia itself to show the innards of search engine indexing. For the Wikipedia Lucene example, looking at the SVN source code on MediaWiki's website, we can see that:

Josh Froelich 20:21, 7 December 2006 (UTC)