Talk:Search engine indexing
From Wikipedia, the free encyclopedia
The goal of this topic
The goal is to provide an authoritative resource on the architecture, behavior, major processes, and challenges of search engine indexing. This should be written for the general web audience, not tech nerds (such as myself!).
Editors, please refrain from adding commercial references. Everyone has learned about search lately from Google and assumes it is the best of all things and the lens through which everything must be understood, and while it may be, this article must maintain a NPOV.
Everyone is invited to edit, and I would love the help.
TODO
- fill out the list of references
- correctly formatted references
- add back in some of the content removed on the 9th, but in a correct fashion
- remove annotational garbage and meet wikipedia article standards
- need to harmonize this with facts about other types of search engines. mention other indices like tries. mention other media types like audio, video, and images. this article covers full text, but mention partial-text, nocache, metasearch, and other search engine types. it is misleading (IMO) to portray this as the only way in which search engines index
- come up with the rest of this todo list when there is time.
- learn about and integrate with the Technology template, the Technology portal, and other relevant templates or portals
- Get rid of 'weasel words', where the article contains statements including 'generally speaking it is accepted that ...', 'most agree'. Replace these with factual references (I know they exist, just have to cite).
- Remove all personal or opinionated content, or rephrase it to be neutral and factual.
- harmonize with information extraction
Search engine sizes
This comes from http://blog.searchenginewatch.com/blog/041111-084221. I have not included it in the article at the moment, as I do not want to do anything illegal, and I am not sure this is the best reference. The goal is to show the number of pages indexed, at least at some point in time, to give a feel for the scale. Understanding and referencing these sizes in practice is important for grasping the technological challenge and the rationale behind the intense research into compression, forms of indexing, and search engine architectures. Josh Froelich 16:44, 15 December 2006 (UTC)
Search Engine | Reported Size           | Page Depth
Google        | 8.1 billion             | 101K
MSN           | 5.0 billion             | 150K
Yahoo         | 4.2 billion (estimate)  | 500K
Ask Jeeves    | 2.5 billion             | 101K+
- Also see Overture press release Josh Froelich 16:52, 15 December 2006 (UTC)
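To make the rationale concrete, here is a minimal sketch (invented doc IDs, not from any real engine) of the classic posting-list compression idea the article could explain: store gaps between sorted document IDs rather than the IDs themselves, then variable-byte encode the gaps so small numbers take fewer bytes. At billions of pages, this is the kind of technique that makes indexes fit in storage and memory.

```python
# Sketch: gap encoding + variable-byte encoding of a posting list.
# Doc IDs are illustrative; the technique is standard, the data invented.

def vbyte_encode(n):
    """Encode one non-negative integer, 7 bits per byte; the final
    byte has its high bit set as a terminator."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the last byte
    return bytes(out)

def vbyte_decode(data):
    nums, n = [], 0
    for b in data:
        if b < 128:
            n = n * 128 + b          # continuation byte
        else:
            n = n * 128 + (b - 128)  # terminating byte
            nums.append(n)
            n = 0
    return nums

def compress_postings(doc_ids):
    """Gap-encode a sorted posting list, then vbyte-encode the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return b"".join(vbyte_encode(g) for g in gaps)

def decompress_postings(data):
    ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        ids.append(total)
    return ids

postings = [824, 829, 215406]  # sorted doc IDs for one term
blob = compress_postings(postings)
assert decompress_postings(blob) == postings
# These 3 IDs would need 12 bytes as 32-bit ints; here they take 6 bytes.
```

The second ID costs a single byte because its gap from the first is only 5; dense posting lists compress especially well this way.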
Controlled vocab
- maybe provide link to full text search topic, harmonize with its contents
- explain controlled-vocabulary-style indexing, start lists, weighted lists, and other techniques, which are indexed differently in the sense that a specialized inverted index is created that is not data driven. the keywords (in keyword-based controlled vocabulary searching) are like classes in a classification model, or an associative array mapping keywords to specific full-text terms/articles.
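A minimal sketch of the point above (all names and documents invented for illustration): in controlled-vocabulary indexing, a curated keyword set is the only source of index keys, so the inverted index behaves like an associative array over the vocabulary rather than over terms extracted from the text.

```python
# Sketch of keyword-based controlled-vocabulary indexing.
# The vocabulary and documents are invented for illustration.

controlled_vocab = {"information retrieval", "data compression", "web crawling"}

# Specialized inverted index: vocabulary keyword -> set of document IDs.
# Unlike a data-driven full-text index, only curated terms may be keys.
vocab_index = {}

def assign_keywords(doc_id, keywords):
    """Index a document under curated keywords, rejecting anything
    outside the controlled vocabulary."""
    for kw in keywords:
        if kw not in controlled_vocab:
            raise ValueError(f"'{kw}' is not in the controlled vocabulary")
        vocab_index.setdefault(kw, set()).add(doc_id)

assign_keywords("doc1", ["information retrieval", "data compression"])
assign_keywords("doc2", ["information retrieval"])

# Lookup is an exact key match, like an associative array:
print(sorted(vocab_index["information retrieval"]))  # ['doc1', 'doc2']
```

The contrast with full-text indexing is that the index keys are chosen by people (or a classification model) ahead of time, not derived from whatever terms the documents happen to contain.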
Notes on Wikipedia as an Example
I am considering adding Wikipedia itself as an example of the innards of search engine indexing. For the Wikipedia Lucene example, looking at the SVN source code on MediaWiki's website, we can see that:
- In http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/EnglishAnalyzer.java?revision=6911&view=markup, when parsing documents, each token is lowercased and stemmed.
- In http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/MWDaemon.java?revision=8447&view=markup, searchers operate on the index asynchronously in multiple Java threads
- In http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/MWSearch.java?revision=8991&view=markup, using a rebuild command, the Wikipedia article corpus is indexed sequentially, in its entirety.
- Lots of info about SearchState - http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/SearchState.java?revision=8992&view=markup
- Wikipedia uses separate indexers for German, Esperanto, and Russian; everything else uses the English tokenizer, selected based on the language.
- we can see that incremental changes are first written to an in-memory index, which is later written to a file-based index
- the wiki text is first parsed and wiki syntax is removed in the stripWiki function, so that the document is treated like a normal English document
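The flow described above could be illustrated roughly like this (a toy sketch, not the actual Lucene or lsearch code; the stemmer here is a naive suffix-stripper standing in for a real one):

```python
# Toy sketch of the indexing flow observed in the lsearch source:
# tokens are lowercased and stemmed, updates accumulate in an in-memory
# inverted index, and the index is later flushed to a file-based index.
import json
import re

def stem(token):
    """Naive stand-in for a real stemmer: strip common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, lowercase, and stem each token."""
    return [stem(t.lower()) for t in re.findall(r"[a-zA-Z]+", text)]

memory_index = {}  # term -> set of document IDs (the in-memory segment)

def index_document(doc_id, text):
    """Write an incremental change to the in-memory index."""
    for term in analyze(text):
        memory_index.setdefault(term, set()).add(doc_id)

def flush_to_disk(path):
    """Later, write the in-memory index out to a file-based index."""
    with open(path, "w") as f:
        json.dump({t: sorted(ids) for t, ids in memory_index.items()}, f)
    memory_index.clear()

index_document("Search_engine_indexing", "Indexing collects and stores data")
index_document("Web_crawler", "Crawlers collect pages for indexing")
# "Indexing" and "indexing" both land under the stemmed term 'index',
# so a search for any inflection of the word finds both documents.
```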
Josh Froelich 20:21, 7 December 2006 (UTC)