Talk:Search engine indexing
From Wikipedia, the free encyclopedia
The goal of this topic
The goal is to provide an authoritative resource on the architecture, behavior, major processes, and challenges of search engine indexing. This should be written for the general web audience, not tech nerds (such as myself!).
Editors, please refrain from adding commercial references. Everyone has learned about search lately from Google and assumes it is the best of all things and the lens through which everything must be understood, and while it may be, this article must maintain a NPOV.
Everyone is invited to edit, and I would love the help.
TODO
- fill out the list of references
- correctly formatted references
- add back in some of the content removed on the 9th, but in a correct fashion
- remove annotational garbage and meet wikipedia article standards
- need to harmonize this with facts about other types of search engines. mention other indices like tries. mention other media types like audio, video, and images. this article covers full text, but mention partial-text, nocache, metasearch, and other search engine types. it is misleading (IMO) to portray this as the only way in which search engines index
- come up with the rest of this todo list when there is time.
- learn about and integrate with the Technology template, the Technology portal, and other relevant templates or portals
- Get rid of 'weasel words', where the article contains statements including 'generally speaking it is accepted that ...', 'most agree'. Replace these with factual references (I know they exist, just have to cite).
- Remove all personal or opinionated content, or rephrase it to be neutral and factual.
- harmonize with information extraction
Search engine sizes
This comes from http://blog.searchenginewatch.com/blog/041111-084221. I have not included it in the article at the moment, as I do not want to do anything illegal, and I am not sure this is the best reference. The goal is to show the number of pages indexed, at least at some point in time, to give a feel for the scale. Understanding and referencing these sizes in practice is important for grasping the technological challenge and the rationale behind the intense research into compression, forms of indexing, and search engine architectures. Josh Froelich 16:44, 15 December 2006 (UTC)
Search Engine | Reported Size           | Page Depth
Google        | 8.1 billion             | 101K
MSN           | 5.0 billion             | 150K
Yahoo         | 4.2 billion (estimate)  | 500K
Ask Jeeves    | 2.5 billion             | 101K+
- Also see Overture press release Josh Froelich 16:52, 15 December 2006 (UTC)
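To make the rationale concrete, here is a minimal sketch (invented doc IDs, not from any real engine) of the classic posting-list compression idea the article could explain: store gaps between sorted document IDs rather than the IDs themselves, then variable-byte encode the gaps so small numbers take fewer bytes. At billions of pages, this is the kind of technique that makes indexes fit in storage and memory.

```python
# Sketch: gap encoding + variable-byte encoding of a posting list.
# Doc IDs are illustrative; the technique is standard, the data invented.

def vbyte_encode(n):
    """Encode one non-negative integer, 7 bits per byte; the final
    byte has its high bit set as a terminator."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128  # mark the last byte
    return bytes(out)

def vbyte_decode(data):
    nums, n = [], 0
    for b in data:
        if b < 128:
            n = n * 128 + b          # continuation byte
        else:
            n = n * 128 + (b - 128)  # terminating byte
            nums.append(n)
            n = 0
    return nums

def compress_postings(doc_ids):
    """Gap-encode a sorted posting list, then vbyte-encode the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return b"".join(vbyte_encode(g) for g in gaps)

def decompress_postings(data):
    ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        ids.append(total)
    return ids

postings = [824, 829, 215406]  # sorted doc IDs for one term
blob = compress_postings(postings)
assert decompress_postings(blob) == postings
# These 3 IDs would need 12 bytes as 32-bit ints; here they take 6 bytes.
```

The second ID costs a single byte because its gap from the first is only 5; dense posting lists compress especially well this way.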
Controlled vocab
- maybe provide link to full text search topic, harmonize with its contents
- explain controlled-vocabulary-style indexing, start lists, weighted lists, and other techniques, which are indexed differently in the sense that a specialized inverted index is created that is not data driven. the keywords (in keyword-based controlled vocabulary searching) are like classes in a classification model, or an associative array mapping keywords to specific full-text terms/articles.
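A minimal sketch of the point above (all names and documents invented for illustration): in controlled-vocabulary indexing, a curated keyword set is the only source of index keys, so the inverted index behaves like an associative array over the vocabulary rather than over terms extracted from the text.

```python
# Sketch of keyword-based controlled-vocabulary indexing.
# The vocabulary and documents are invented for illustration.

controlled_vocab = {"information retrieval", "data compression", "web crawling"}

# Specialized inverted index: vocabulary keyword -> set of document IDs.
# Unlike a data-driven full-text index, only curated terms may be keys.
vocab_index = {}

def assign_keywords(doc_id, keywords):
    """Index a document under curated keywords, rejecting anything
    outside the controlled vocabulary."""
    for kw in keywords:
        if kw not in controlled_vocab:
            raise ValueError(f"'{kw}' is not in the controlled vocabulary")
        vocab_index.setdefault(kw, set()).add(doc_id)

assign_keywords("doc1", ["information retrieval", "data compression"])
assign_keywords("doc2", ["information retrieval"])

# Lookup is an exact key match, like an associative array:
print(sorted(vocab_index["information retrieval"]))  # ['doc1', 'doc2']
```

The contrast with full-text indexing is that the index keys are chosen by people (or a classification model) ahead of time, not derived from whatever terms the documents happen to contain.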
Notes on Wikipedia as an Example
I am considering adding Wikipedia itself as an example of the innards of search engine indexing. For the Wikipedia Lucene example, looking at the SVN source code on MediaWiki's website, we can see that:
- In http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/EnglishAnalyzer.java?revision=6911&view=markup, when parsing documents, each token is lowercased and stemmed.
- In http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/MWDaemon.java?revision=8447&view=markup, searchers operate on the index asynchronously in multiple Java threads
- In http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/MWSearch.java?revision=8991&view=markup, using a rebuild command, the Wikipedia article corpus is indexed sequentially, in its entirety.
- Lots of info about SearchState - http://svn.wikimedia.org/viewvc/mediawiki/trunk/lucene-search/org/wikimedia/lsearch/SearchState.java?revision=8992&view=markup
- Wikipedia uses separate indexers for German, Esperanto, and Russian; everything else uses the English tokenizer, selected based on the language.
- we can see that incremental changes are first written to an in-memory index, which is later written to a file-based index
- the wiki text is first parsed and wiki syntax is removed in the stripWiki function, so that the document is treated like a normal English document
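The flow described above could be illustrated roughly like this (a toy sketch, not the actual Lucene or lsearch code; the stemmer here is a naive suffix-stripper standing in for a real one):

```python
# Toy sketch of the indexing flow observed in the lsearch source:
# tokens are lowercased and stemmed, updates accumulate in an in-memory
# inverted index, and the index is later flushed to a file-based index.
import json
import re

def stem(token):
    """Naive stand-in for a real stemmer: strip common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Tokenize, lowercase, and stem each token."""
    return [stem(t.lower()) for t in re.findall(r"[a-zA-Z]+", text)]

memory_index = {}  # term -> set of document IDs (the in-memory segment)

def index_document(doc_id, text):
    """Write an incremental change to the in-memory index."""
    for term in analyze(text):
        memory_index.setdefault(term, set()).add(doc_id)

def flush_to_disk(path):
    """Later, write the in-memory index out to a file-based index."""
    with open(path, "w") as f:
        json.dump({t: sorted(ids) for t, ids in memory_index.items()}, f)
    memory_index.clear()

index_document("Search_engine_indexing", "Indexing collects and stores data")
index_document("Web_crawler", "Crawlers collect pages for indexing")
# "Indexing" and "indexing" both land under the stemmed term 'index',
# so a search for any inflection of the word finds both documents.
```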
Josh Froelich 20:21, 7 December 2006 (UTC)