Wikipedia:Search engine test

From Wikipedia, the free encyclopedia

This page is an essay. It is not a policy or guideline; it simply reflects some opinions of its authors. Please update the page as needed, or discuss it on the talk page.
Shortcuts: WP:SET, WP:GOOGLE

This page in a nutshell:
Measuring is easy. What's hard is knowing what it is you're measuring.

This page describes a method used by some editors on Articles for Deletion to approximate notability, and sometimes in discussions over article naming. It discusses the method's proper use, limits, and restrictions, and describes some ways to use Google ([1]), Alexa ([2]) and Yahoo! ([3]) to check articles and other information.


[edit] Types of search engine tests

On Wikipedia, a Google test is any use of Google or another search engine as a reference. Several very distinct kinds of information can be gleaned by this method. It should be stressed that none of these applications are conclusive evidence; each is simply a first-pass heuristic or rule of thumb.

  • Unencyclopedic or spurious topics. Some topics introduced to Wikipedia articles don't belong here. Some people think that these can be detected by searching for a relevant phrase on a search engine (Google, for example) and counting the number of search results (see below for problems with this). This technique is used by its supporters for weeding out hoaxes, fictions, and unpublished personal theories and hypotheses. It is also used to ascertain whether a topic is of sufficiently broad interest to merit inclusion in the wiki, though this application is highly subject to bias (see below). See Wikipedia:What Wikipedia is not for a comprehensive list of unencyclopedic topics (which is entirely unrelated to the number of search engine hits an item has).
  • Copyrighted material. Large pieces of poorly wikified text, submitted to the wiki all at once, particularly by a new or anonymous user, are often copy-and-pasted from outside sources. Some of these are submitted in violation of copyright. (See also Wikipedia:Spotting possible copyright violations, Wikipedia:Copyrights.) A copy-and-paste operation from an online source can often be detected by running searches for excerpts.
  • Idiomatic usage. The English language often has multiple terms for a single concept, particularly given regional dialects. A series of searches for different forms of a name reveals some approximation of their relative popularity. For a quick comparison of relative usage try googlefight, e.g. comparing deoxyribose nucleic acid and deoxyribonucleic acid. Note that there are cases where this googletest can be overruled, such as when an international standard has been set, as in the case of aluminium.
  • Related sites. If an article is of high quality (see Wikipedia:Featured articles), Google may be used to look for sites that might take an interest in it and be convinced to link to it.
  • Research. Of course, search engines are good for finding sources of further information.

[edit] Techniques

The Google Web search is not the only Google search. In performing a Google test, consider searching Groups (USENET newsgroups). This is a significantly different sample and represents, for the most part, conversations in English conducted by people who are not deliberately trying to sell products or reach a mass audience. A "Web" search will typically return 10 to 1000 times more hits than a "Groups" search. Because Groups and Web searches have very different systemic biases, hit numbers are not comparable. Nevertheless, Groups searches are particularly helpful in identifying entities whose Web presence may have been artificially inflated by promotional techniques; it is suspicious if a phrase gets, say, 100,000 Web hits but only 10 Groups hits.

USENET postings are date-stamped and have been archived for over twenty years, making them more useful than Web searches as a record of recent history. Using a Groups "advanced search," it is possible to restrict a search by date, which can help in identifying how recent the widespread use of a term is.

Google News searches can assess whether something is currently newsworthy. In comparison to Web or Groups, Google News used to be less susceptible to manipulation by self-promoters, but with the advent of pseudo-news sites designed to collect ad revenues or to promote specific agendas, this test is now no more reliable than the others in areas of popular interest. Note that Google News indexes many "news" sources that reflect specific points of view, and many news sources that are only of local interest.

Depending on the subject, advanced search functions may be useful. For example, adding "site:gov" or "site:edu" will restrict your search to U.S. government sites or U.S. college and university sites.
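Mechanically, these restricted searches are just string assembly. As an illustrative sketch (the `build_query` helper and its terms are this page's illustration, not any search engine's API):

```python
def build_query(terms, site=None):
    """Assemble a search-engine query string. If `site` is given, append
    the `site:` operator to restrict results to one domain or TLD."""
    query = " ".join(terms)
    if site:
        query += f" site:{site}"
    return query

# Restrict a search to U.S. government sites:
print(build_query(["climate", "policy"], site="gov"))
# climate policy site:gov
```

The same pattern works for `site:edu` or a specific domain such as `site:gutenberg.org`.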

Other tools that may be useful for research include Google Scholar, which searches academic literature. Also see Wikipedia:Notability (academics).

A caution about Google Scholar: it works well for fields that (1) are paper-oriented and (2) have all (or nearly all) respected venues with an online presence. Most papers written by a computer scientist will show up, but for less technologically up-to-date fields it is dicey. Even the journal Science puts articles online only back to 1996. Thus, Google Scholar should rarely be used as proof of non-notability.

Medline, now part of PubMed, is the original broadly based search engine, originating over four decades ago and indexing even earlier papers. Thus, especially in biology and medicine, PubMed's "associated articles" feature is a Google Scholar proxy for older papers with no online presence. For example, the journal Stroke puts papers online back through the 1970s; for this 1978 paper [4], Google Scholar lists 100 citing articles, while PubMed lists 89 associated articles.

Google Book Search can be valuable. As part of the world of print, Google Book Search has a pattern of coverage that is in closer accord with traditional encyclopedia content than the Web, taken as a whole, is; if it has systemic bias, it is a very different systemic bias from Google Web searches. Multiple hits on an exact phrase in Google Book Search provide convincing evidence for the real use of the phrase or concept. Google Book Search can locate print-published testimony to the importance of a person, event, or concept. It can also be used to replace an unsourced "common knowledge" fact with a print-sourced version of the same fact. www.a9.com searches, restricted to "books," can be used in the same way. Its database is apparently the same as that of Amazon's "look inside this book/search inside the book" feature.

Project Gutenberg is especially useful for literary topics and history. Try a Google search for site:gutenberg.org "cock's duty" for example.

[edit] Google bias

When using Google to test for importance or existence, bear in mind that this will be biased in favor of modern subjects of interest to people from developed countries with Internet access, so it should be used with some judgment. For example, a current popular-music group from the United States will probably need many thousands of Google hits before most Wikipedians consider it worthy of inclusion. A similarly important group in a country with less Internet presence will have many fewer hits, if any. An important musician of the 14th century might not show up on Google at all.

Q. What is the minimum number of matches you should see if a term is not made up? (3? 27? 81?)

A. Perhaps a few hundred, but this depends on several things:

  • The article's scope: If narrow, fewer references are required. Try to categorize the point of view (whether it is NPOV or other); e.g., notice the difference between Ontology and Ontology (computer science).
  • The subject: If it's about some historical person, one or two mentions in reliable texts might be enough; if it's some Internet neologism, it may be on 100 pages and might still not be considered 'existing' for Wikipedia's purposes.
  • The type of sites you find: Pay attention to how open the sites are about accepting submissions. The Urban Dictionary, for example, accepts submissions freely. This is especially important if you suspect an author is self-promoting, or is promoting an idiosyncratic viewpoint. A single Internet user can submit the same ideas to message boards and open-submission sites all over the Internet.
  • How long the term has existed on Wikipedia: Sometimes when an article is created, the term initially doesn't exist anywhere else on the internet, and the Google test may help determine that it's not widely used. After an article exists on Wikipedia for an extended time, the term (e.g. the article title, or other jargon in the article) will be copied to many Wikipedia mirror sites and unofficial mirrors, as well as to "scraper" web sites which return results for any search term their web crawler encounters. Over time it may become harder to determine which hits originated from Wikipedia, and which hits reflect independent usage of the term. Wikipedia is one of the most frequently used sources of information on the web, and it's important that Wikipedia should not directly or indirectly rely on itself as a source.

Further judgment: the Google test checks popular usage, not correctness. For example, a search for the incorrect Charles Windsor gives 10 times more results than the correct Charles Mountbatten-Windsor.

Also, some topics may not be on the Web because of low Internet use in certain areas and cultures of the world.

Search results from Google are highly biased towards popular culture. This article, for example, points out that Barry Williams ("Greg Brady" from The Brady Bunch) had (at time of writing) 45% more Google hits than Albert Einstein (2,400,000 vs. 1,660,000). (This is somewhat out of date: a more recent Google search gave 17,500,000 hits for "Albert Einstein" and only 303,000 for "Barry Williams", despite there being multiple celebrities by that name. On the other hand, a query for Britney Spears [5] returned 30,500,000 hits.)

Especially when trying to determine the frequency of use of diacritic vs. non-diacritic versions of a word, the internet (and therefore Google) is extremely biased towards the non-diacritic versions. This is often more an example of laziness and cluelessness of those who created the webpages than a real test of usage. For example, spelling the weather phenomenon El Niño as 'El Nino' is just plain wrong (it doesn't rhyme with keno, vino, or Zeno). When Spanish words that have the ñ letter get naturalized into English the ñ often gets converted to "ny" (as when cañón became canyon), but "El Niño" is rarely spelled "El Ninyo" (and that spelling is more likely than not on an English-language website). Yet despite the fact that the spelling should be El Niño, a Google test shows that there are more web pages with "El Nino" than "El Niño" (8,830,000 vs. 7,970,000 as of September 2005). Much better criteria for deciding upon the use of the diacritic vs. non-diacritic versions of a word would be the entries in dictionaries, other encyclopedias, and style guides.
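The mechanics of losing a diacritic can be sketched in Python: Unicode decomposition (NFD) separates a base letter from its combining mark, and dropping the marks is exactly the careless flattening that turns "El Niño" into "El Nino". This is an illustrative sketch of the phenomenon, not a recommended search step:

```python
import unicodedata

def strip_diacritics(text):
    """Decompose accented characters (NFD) and drop combining marks,
    mimicking how careless transcription flattens 'El Niño' to 'El Nino'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_diacritics("El Niño"))  # El Nino
print(strip_diacritics("cañón"))   # canon
```

Because so many web pages are produced this way, a search for the stripped form will usually dominate the hit counts regardless of which spelling is correct.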

Note that other Google searches, particularly Google Book Search, have a different systemic bias from Google Web searches and give an interesting cross-check and a somewhat independent view.

[edit] Urban legend bias

As was mentioned above, Google checks popular usage, not correctness. The fact that a particular set of facts is repeated hundreds of times in Google results does not make it correct. For example, there is an urban legend about the USS Constitution which has the ship setting sail in 1779, although the ship was not launched until 1797, and there are hundreds of sites which repeat this information[6]. Likewise, the internet is a breeding ground for gossip and "juicy" stories that get repeated almost verbatim from source to source and webpage to webpage, routinely skewing any data that might be obtainable via search engine. Checking Snopes[7] might be a good idea.

[edit] Non-applicable in some cases, such as pornography

The simple Google test by number of hits is not applicable to people or titles within a number of internet-based businesses, notably (but by no means restricted to) pornography. This is because an entire sub-industry has appeared with the sole purpose of increasing the number of Google hits certain subjects receive. They achieve this by a number of techniques, including multiple mirror sites and spamming of notice boards and Wikipedia. Also, pornographic actors tend to appear in production-line quantities of entirely non-notable films. It is therefore necessary, as per Wikipedia:criteria for inclusion of biographies, for the researcher to show that the actor or actress has established notability. This usually requires finding journalistic coverage, independent biographies or extensive fan clubs.

[edit] Validity of the Google test

Given that the results of a Google test are interpreted subjectively, its implementation is not always consistent. This reflects the nature of the test being used on a case by case basis.

In some cases, articles have been kept with Google hit counts as low as 15 and some claim that this undermines the validity of the Google test in its entirety. The Google test has always been and very likely always will remain an extremely inconsistent tool, which does not measure notability. It is not and should never be considered definitive.

Major factors which may affect Google hit count include subjects from countries where the Internet is not prevalent or topics which are of a historical nature but have not yet been well documented on the Internet. In other cases, it is completely speculative as to why a subject merits inclusion with a hitcount below 100 while other such articles are frequently deleted. Examples include articles on minerals and major landmarks (mountain peaks, etc) that may not be documented much on the Internet.

Also note that the number of hits that Google reports is (sometimes or perhaps always; the details are secret) an estimate, not an exact figure. The number of hits reported by Google has little meaning until one navigates to the last page of the results, since it is only then that Google applies all criteria to a query (such as eliminating duplicates and controlling spam). Often the hit count is cut by a factor of 10 (or much more) after doing this. Jumping to the end of the results (or as far as is practical) also reveals whether the hit count is actually related to the intended meaning of the search term. Queries are further improved by setting the results per page to the maximum value (which reduces duplicate results) and excluding any domain belonging to an interested party. For instance, "JoesRockBand.com" should be excluded when searching for references to "Joe's Rock Band". For longer-lasting articles, excluding the term "wikipedia" itself may be needed, to avoid counting all the mirrors and language versions of a Wikipedia article. In fact, the AfD discussion itself, once archived and indexed by Google, may actually add to the Google hit count used the next time the item is discussed. Finally, some human labor has to be involved: a manageable sample of the sites found must be opened individually to actually verify the relevance of the hit count.
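The refinement steps above (quote the exact phrase, exclude the subject's own domain, exclude "wikipedia") amount to assembling a query string. A sketch, with a hypothetical helper and a hypothetical band as the example:

```python
def refine_query(phrase, exclude_sites=(), exclude_terms=()):
    """Build a query that quotes the phrase exactly and excludes
    biased domains (-site:) and mirror-identifying terms (-term)."""
    parts = [f'"{phrase}"']
    parts += [f"-site:{s}" for s in exclude_sites]
    parts += [f"-{t}" for t in exclude_terms]
    return " ".join(parts)

print(refine_query("Joe's Rock Band",
                   exclude_sites=["joesrockband.com"],
                   exclude_terms=["wikipedia"]))
# "Joe's Rock Band" -site:joesrockband.com -wikipedia
```

The `-site:` and `-` exclusion operators are long-standing Google syntax; the helper itself is only for illustration.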

[edit] On "unique" results

For search terms that return many results, Google uses a process that eliminates results which are "very similar" to other results listed, both by disregarding pages with substantially similar content and by limiting the number of pages that can be returned from any given domain. For example, a search on "Taco Bell" will return only a couple of pages from tacobell.com even though many pages in that domain will certainly match. Further, Google's list of unique results is constructed by first selecting the top 1000 results and then eliminating duplicates without replacement. Hence the list of unique results will always contain fewer than 1000 results, regardless of how many webpages actually matched the search terms. For example, from the about 742 million pages related to "Microsoft", Google presently returns 552 "unique" results (as of Jan 9, 2006[8]). Caution must be used in judging the relative importance of websites yielding well over 1000 search results: once the count goes over a few hundred, it becomes difficult or impossible to determine just how high it should be. (At the end of a result list is a link to "search with omitted results included"; however, in no case will Google ever allow you to see more than 1000 results.) On the other hand, an extremely low unique count (say, a few dozen) may be a sign that only a small number of unique hits really exist. Doing a site-specific search may help determine whether most of the hits are coming from the same web site; a single web site can account for hundreds of thousands of hits.

[edit] Search engine limitations

Many, probably most, of the publicly available web pages in existence are not indexed. Each search engine captures a different percentage of the total. Nobody can tell exactly what portion is captured.

The estimated size of the World Wide Web is at least 11.5 billion pages [9], but a much deeper (and larger) Web, estimated at over 3 trillion pages, exists within databases whose contents the search engines do not index. These dynamic web pages are formatted by a Web server when a user requests them and as such cannot be indexed by conventional search engines. The United States Patent and Trademark Office website is an example; although a search engine can find its main page, one can only search its database of individual patents by entering queries into the site itself.

Google follows the robots.txt protocol (as all search engines should) and can be blocked by sites that do not wish their content to be indexed or cached by Google. Sites that contain large amounts of copyrighted content (image galleries, subscription newspapers, webcomics, movies, video, help desks), usually involving membership, will block Google and other search engines. Other sites may also block Google due to stress or bandwidth concerns on the server hosting the content.
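How such blocking looks in practice can be checked with Python's standard-library robots.txt parser. The rules below are a made-up example of a site hiding a members-only area from Googlebot while allowing everyone else:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Googlebot from /members/, allow all others.
robots_txt = """\
User-agent: Googlebot
Disallow: /members/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("Googlebot", "/members/archive.html"))  # False
print(parser.can_fetch("OtherBot", "/members/archive.html"))   # True
```

Any page under a disallowed path simply never enters the index, so it can never contribute to a hit count.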

Google and other popular search engines are also targets of "search result enhancement", also known as search engine optimization, so many results returned may lead to pages that serve only as advertisements. Sometimes pages contain hundreds of keywords designed specifically to attract search engine users and serve them an advertisement instead of content related to the keyword.

Google has also been the victim of redirection exploits that may return more results for a specific search term than actual content pages exist.

Forums, membership-only and subscription-only sites (since Googlebot does not sign up for site access), and sites that cycle their content are not cached or indexed by any search engine. As sites move to AJAX/Web 2.0 designs, this will become more prevalent, because search engines only simulate following the links on a web page; AJAX pages (like Google Maps) dynamically return data based on real-time manipulation of JavaScript.

Search engines also might not be able to read links or metadata that normally require a browser plugin, such as Adobe PDF or Macromedia Flash, or cases where a website is displayed as part of an image. Nor can search engines listen to podcasts or other audio streams, or watch video, in which a search term may have been mentioned.

[edit] Foreign languages, non-Latin scripts, and old names

Claims for the non-notability of a topic are occasionally made based on few Google hits, where a considerably larger number of hits would have resulted from searching in the correct script or for various transcriptions. An Arabic name, for instance, needs to be searched for in the original script, which is easily done with Google, provided one knows what to search for, but one also has to take into account that e.g. English, French and German webpages will likely transcribe the name using different conventions. Even for English only webpages there may be many variants of the same Arabic or Russian name.

In addition, different forms of a name used in the original language must be searched for. A Russian personal name has to be searched for both including and excluding the patronymic, and any search for names and other words in strongly inflected languages should take into account that arriving at the total number of hits may require searching for forms with varying case-endings or other grammatical variations not obvious for someone who does not know the language. Names from many cultures are traditionally given together with titles that are considered part of the name, but may also be omitted (as in Gazi Mustafa Kemal Pasha).
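Enumerating these variants systematically is a small combinatorial exercise. A sketch, using an invented Russian name purely for illustration (real searches would also need transcription and case-ending variants):

```python
from itertools import product

def name_variants(given, family, patronymic=None, titles=()):
    """Enumerate search variants: with and without the patronymic,
    and with and without each honorific title. Names are illustrative."""
    middles = [None, patronymic] if patronymic else [None]
    prefixes = [None, *titles]
    variants = []
    for prefix, middle in product(prefixes, middles):
        parts = [p for p in (prefix, given, middle, family) if p]
        variants.append(" ".join(parts))
    return variants

for v in name_variants("Ivan", "Petrov", patronymic="Sergeyevich"):
    print(v)
# Ivan Petrov
# Ivan Sergeyevich Petrov
```

Summing the hit counts over all variants (while watching for overlap between them) gives a fairer approximation than searching a single form.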

Even for Anglo-Saxon names, the spelling and rendering of older names may allow dozens of variations for the same person. A simplistic search for one particular variant may underrepresent the web presence by an order of magnitude.

Doing a search like this requires a certain linguistic competence which not every individual Wikipedian possesses, but the Wikipedia community as a whole includes many bilingual and multilingual people. It is important for nominators and voters on AfD at least to be aware of their own limitations, and not to cite conclusively a small number of Google hits for, say, a Serbian poet without pointing out the limited validity of a preliminary search using only one particular transcribed form of the name.

[edit] Alexa test

Although Wikipedia is not a web directory, we can have articles about web sites if they meet the same criteria for encyclopedic interest as other articles. See Wikipedia:Notability (websites). (Note that for corporate web sites editors have often argued that there should not be separate articles for the company and its web site except in very unusual circumstances.)

To perform the Alexa test on a particular web site, just go to Alexa (http://www.alexa.com), and type in the URL. Some editors use the Alexa ranking to determine whether Wikipedia should have an article, arguing that we should certainly have articles on top 100 sites, possibly have articles on top 1,000 sites, and usually not have articles for sites not in the top 100,000. However, Alexa rankings are not a part of the notability guidelines for web sites for several reasons:

  • Below a certain level, Alexa rankings are essentially meaningless because of the limited sample size. Alexa itself says rankings worse than 100,000[10] are not reliable, and some critics feel the cutoff is worse than that.
  • Placing cutoffs at 100 and 1000 is arbitrary.
  • Alexa rankings vary over time.
  • Alexa rankings include significant bias. (See below.)
  • Alexa rankings do not reflect whether any source material for constructing an encyclopaedia article actually exists. A highly ranked web site may well have nothing written about it, or a poorly ranked web site may well have a lot written about it.
  • A number of unquestionably notable topics have corresponding web sites with a poor Alexa ranking.

[edit] Bias in the Alexa test

The Alexa rating may include significant bias, due to various factors. For example, the official Alexa toolbar is only available for Microsoft Internet Explorer running on Microsoft Windows, though there are third-party plugins (such as SearchStatus) for other browsers and operating systems. However, it is widely believed that, for example, a website exclusively devoted to an Apple Macintosh related topic might not have an Alexa ranking that accurately represents its true traffic activity because Alexa releases no demographic information about its users. At the same time, some webmasters install the Alexa toolbar for the sole purpose of improving their own rankings, by visiting their own web site with it, though more recently attempts to verify this effect have failed. Though no one knows Alexa's sample size, low traffic sites can be noticeably affected by a single, frequent toolbar user.

[edit] Further reading

  • Joe Meert (2006-04-30). Argumentum ad Googlum. Science, AntiScience and Geology. — Meert observes that "The temptation to find a quick retort means that, many times, people don't bother to check the source carefully." and that "people will look for a specific phrase that may be taken out-of-context to support their argument". He states that it is "dangerous and irresponsible to think that we can Google away a complex discussion" and that he has "learned long ago that there is no substitute for detailed research on a topic".
  • Rich Turner (2004-02-29). Argumentum ad Googlum; Why Getting a Million Hits on Google Doesn't Prove Anything. Grumbles. — Turner points out that "that something gets hits on Google does not make it correct" and gives several examples of things that are incorrect that garner thousands of hits on Google search results.

[edit] See also

  • Meta:Mirror filter, a way to filter sites from Google search to remove sites which mirror Wikimedia content
  • {{find}} a template designed to help with Google books, news archive and scholar searches
