Evaluating internet sources

From Wikipedia, the free encyclopedia

Evaluating internet sources requires a healthy dose of common sense and skepticism about the validity of their claims. From web directories to search engines to metasearch engines, searching the internet can be an overwhelming experience. In some cases, such as the invisible web and sites that prohibit spiders from indexing them, parts of the internet are effectively impenetrable to searchers. While search engines continue to develop, becoming smarter and more effective, they are also increasingly plagued with advertisements and with companies paying for the number-one spot in a search. Internet users must therefore learn to search effectively. The balance between searching effectively and maintaining a reasonable level of skepticism is the user's best tool in looking for information on today's internet.

History of the Internet

In the 19th century, the world was brought together by the telegraph, which allowed for the quick dissemination of information, changing the way people thought and the way business was conducted. It was referred to as the "highway of thought" (Standage VIII). Before this, as Tom Standage points out in his book The Victorian Internet, the fastest way to send information had been the same since the horse was first tamed. The telegraph was eventually replaced by the telephone, but it opened people's minds to a new, faster-paced way of life.

The Internet is a set of networking protocols that allows one computer to connect to and communicate with other computers (Sherman and Price 1). It was developed in the 1960s by the Department of Defense (DOD) to connect universities and laboratories so that they could more easily share research and data, thereby increasing productivity and reducing unnecessary duplication. The result, first successfully demonstrated in 1969, was ARPANET, which evolved into the Internet.

There was no easy way to search for files on the early Internet. Searching really meant sending e-mails to people with access to other computers and their files, and waiting for a reply. Gopher changed all this by creating the first menu-style index of files that could be shared on the Internet. It was initially a closed system, available only at the University of Minnesota, and one could not read the files online but had to download them to one's own computer. Soon, Gopher servers appeared in other places and were subsequently joined together. Archie and, later, Veronica were software programs created to search these resources. With Archie (a play on the word "archives"), one could search for anonymous FTP files, while Veronica (named for Archie's comic-strip girlfriend) could perform keyword searches of Gopher menus. Veronica introduced the practice of Boolean searching: it was the first tool that let users limit a search with "AND", "NOT", "OR", "EITHER", or parentheses (Burke 66), and the first to allow wildcard, or word-truncation, searches (Burke 67). The World Wide Web, a set of software protocols that runs on the Internet, was conceived for the same reasons, allowing users to easily access files (Sherman and Price 1).

In 1989, physicists at CERN in Switzerland wanted a way to share information and data more completely than before. Tim Berners-Lee developed the Hypertext Markup Language, or HTML (the language most websites are written in), along with the Hypertext Transfer Protocol (HTTP); together these made it possible to share not just indexes but entire documents, which could be linked to one another by hypertext.

Subject Directories

Subject directories (or web directories) list the names and addresses of web pages and work like a telephone book. The earliest ones relied on web page authors to submit their websites to the directory (Sherman and Price 13). Each site was handpicked, usually annotated, and classified by subject; staff at these directories reviewed the listed sites and selected them for relevance and quality. The first web directory was "The Project" (Sherman and Price 12).

Yahoo! is an example of a subject directory. In 1994, Jerry Yang and David Filo created "Jerry's Guide to the Internet", which used spiders to search the web for sites and grouped them manually into hierarchical lists. "Jerry's Guide" became Yahoo!, an acronym for "Yet Another Hierarchical Officious Oracle" (Sherman and Price 15); the authors do not say which came first, the name or the acronym. A more recent example of a web directory, Beaucoup.com, categorizes the sites it lists and reviews some new sites; it is useful if the web user has an idea of what kind of sites he is looking for. To search for a specific word or topic, however, the user would turn to a search engine.

Search Engines

In 1994, Brian Pinkerton, a graduate student at the University of Washington, created a web crawler to search for cool and unusual pages. He posted it to the Internet with an interactive interface, creating the first search engine (Sherman and Price 14). A search engine is a site that "allow[s] you to find specific documents through keyword and menu choices" (Sawyer). Search engines allow users to search a far greater number of sites than directories do. Spiders, or web crawlers, go from link to link and index keywords from the header of each page as well as words from its text. Some spiders, such as AltaVista's, index every word on a site; others index only the 100 most frequently used words (Franklin). When a user types in a request, the engine matches the keywords against the meta tags in the page header and against the words indexed from the document; it does not evaluate the sites for content or relevance.
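The crawling and indexing process described above can be sketched in a few lines of code. The following Python fragment is a simplified, hypothetical illustration only: the class and function names are invented, and real search engines use far more sophisticated fetching, parsing, and ranking techniques.

    # Minimal, illustrative crawler/indexer sketch (not a production search engine).
    # Standard library only; all names here are invented for illustration.
    import re
    import urllib.request
    from collections import Counter, deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkAndTextParser(HTMLParser):
        """Collects hyperlinks and visible text from one HTML page."""
        def __init__(self):
            super().__init__()
            self.links = []
            self.text_parts = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)
        def handle_data(self, data):
            self.text_parts.append(data)

    def crawl_and_index(seed_url, max_pages=10):
        """Follow links from seed_url, indexing the 100 most frequent words per page."""
        index = {}                      # word -> set of URLs containing it
        queue = deque([seed_url])
        seen = set()
        while queue and len(seen) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue                # broken or unreachable link: the spider moves on
            parser = LinkAndTextParser()
            parser.feed(html)
            words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
            for word, _count in Counter(words).most_common(100):
                index.setdefault(word, set()).add(url)
            for link in parser.links:
                queue.append(urljoin(url, link))
        return index

A query could then be answered by looking up each keyword in the returned index and intersecting the resulting sets of URLs; this is exactly the kind of keyword matching, with no judgment of quality, that the article describes.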

GoTo.com (now Overture) has an unusual way of ranking sites: not by relevance or popularity, but by payment. It opened in 1998 with the concept that whoever paid the most would be the first company listed in search returns (Weiss). AltaVista also gives top ranking to paying sites, but notes which sites have paid for their rank (Weiss). More recently, search engines have been striving to develop more user-friendly results. At the 12th International World Wide Web Conference in Budapest, breakthroughs in searching technology were showcased, among them improved techniques for specifying the type of data one needs. TimeSearcher, a new search engine, will allow results to "be confined to data created or changed on specific dates" (Delio). According to Wired News's Michelle Delio, we can also "(e)xpect, for example, to be able to sift through search results geographically, or to personalize Google results" (Delio). Sorting results geographically would be a considerable advantage in future web searching, for example when searching for a local florist. Search engines are always changing: they add new pages even as they go back to update sites already indexed, improvements are made to the way spiders search the web, and they can recognize a greater variety of pages. AlltheWeb.com was the first search engine to index Macromedia Flash files; it now also indexes Word and PDF files (Lackie).

Search engines and directories continue to improve and expand, and can now access more documents than ever. With these improvements, there are fewer and fewer differences between search engines and directories. Many search engines have even added directories to their sites. Despite the improvements in search techniques, search engines cannot keep pace with the growing number of sites on the Internet.

Google vs Meta Search Engines

Google is the search engine with the largest number of indexed sites, around 3 billion different pages (Hardy 100). It has become the standard against which other search engines are measured. It has over 10,000 networked computers and can handle seven million queries an hour (Hardy 100). Google has the following capabilities:

  • It ranks pages by popularity and tells the user how many hits the request returned. The more pages that link to a page, the higher it appears in the list of returned hits.
  • It tells the user how long it took to compile the list.
  • It corrects spelling errors.
  • Its interface is available in a huge number of languages (including Klingon), and it offers translation tools.
  • It is a reverse directory, a street map, and can search just for images.
  • It offers a directory of online catalogs.
  • It indexes every word on a page, except for the articles ("a", "an", and "the") (Franklin).
  • It can recognize and index pages that are not HTML. Google can recognize PDF, PostScript, Excel, PowerPoint, and Rich Text documents, so it can offer more returns per request. Not only does Google recognize PDF documents, it can also convert them into HTML (Lackie).

Google is renowned for its relevancy and its simplicity. However, as some of Google's competitors approach the same level of relevancy, it will have to continue to make itself "smarter" and learn what the user tends to want. This is the argument for metasearch engines, which submit a query to several search engines at once and combine the results. As Arnaud Fischer puts it, "Why wouldn't the most relevant results from several of the best engines not be more relevant than the results of a single-, even the best-, crawler-based engine?" (Fischer). Metasearch engines, however, do not store or index pages themselves; they rely on the resources gathered and indexed by the search engines they query. Many metasearch engines, like Dogpile.com and Info.com, have reputations for being more advertisement-driven than regular search engines (Fischer), which is often not what the user wants. Arthur Weiss says of metasearch engines, "Such tools are parasitic, in that they share none of the database and indexing development overhead, and instead take away advertising revenues from the search tools they use" (Weiss). For example, when "Portland, Oregon" was entered into Info.com, the first 15 sites that came up were hotels, airlines, or apartments, followed only then by the city's homepage; on Google, the city's homepage was first on the results list, next to Yahoo!'s city map. Vivisimo.com is a notable metasearch engine; its advantage over other metasearch engines is the way it subdivides, or "clusters", its hits into further categories, making it easier to sort through the hits and reach relevant sites.
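The metasearch idea described above, combining ranked results from several underlying engines, can be illustrated with a short sketch. The Python fragment below is purely hypothetical: the engine callables are stand-ins rather than real search-engine APIs, and real metasearch engines must also handle each engine's query syntax, paid listings, and duplicate detection.

    from collections import defaultdict

    def metasearch(query, engines, top_n=10):
        """Merge ranked URL lists returned by several engines.

        `engines` maps an engine name to a function that takes a query and
        returns URLs in that engine's own ranked order.  A URL scores higher
        the earlier it appears and the more engines return it (a naive form
        of rank fusion)."""
        scores = defaultdict(float)
        for name, search in engines.items():
            for rank, url in enumerate(search(query), start=1):
                scores[url] += 1.0 / rank     # earlier rank -> larger contribution
        return sorted(scores, key=scores.get, reverse=True)[:top_n]

    # Hypothetical usage with stand-in engine functions (not real APIs):
    fake_engines = {
        "engine_a": lambda q: ["http://example.org/a", "http://example.org/b"],
        "engine_b": lambda q: ["http://example.org/b", "http://example.org/c"],
    }
    print(metasearch("Portland, Oregon", fake_engines))

Because the merged list is only as good as what the underlying engines return, a metasearch engine that mixes in many paid listings, as described above for Info.com, passes that bias straight through to the user.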

The Invisible Web

The "Invisible Web" (or "Hidden Web", or "Deep Web", or Dark Matter) and "How to Search It") contains information that spiders or web crawlers cannot access, so are usually excluded from general purpose search engines and directories. The sites are not accessible to the web crawlers for a variety of reasons. Search engine technology, while improving, is limited. Web crawlers work by traveling from one hypertext link to another. If there are no links to that page, or if the links are broken, the web crawlers cannot find them. However, the web designers can still submit their URL to individual directories, or search engines.

Another reason web crawlers miss certain sites is the formatting of the pages. Search engines can index text documents and pages containing images, audio, and video files, but most cannot read PDF, Flash, Shockwave, .EXE, PostScript, or .ZIP files, because these files do not contain HTML text (though, as noted above, Google and AlltheWeb.com have begun indexing some of these formats).

Web crawlers do not handle databases well either; databases are effectively incomprehensible to them (Sherman and Price 59). Databases are commonly found on library, university, business, and professional-association web sites. The Educator's Reference Desk, for example, is the world's largest educational database (Sherman and Price 83), filled with archived journals, citations, and archives from education and library electronic mailing lists. Because of its format, however, it is invisible to web crawlers and is usually missing from standard search engine results.

Websites can also deliberately block spiders or web crawlers from accessing them. Designers can do this by placing exclusion directives in the meta tags in the head of the page, or by requiring a password before the site can be accessed.
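A related, widely used convention, an assumption here since the article mentions only meta-tag blocks, is the robots.txt exclusion file placed at a site's root. The Python sketch below shows how a well-behaved crawler might check it before fetching a page, using the standard library's urllib.robotparser module.

    # Sketch: honoring a site's robots exclusion file before crawling a page.
    from urllib import robotparser
    from urllib.parse import urljoin, urlparse

    def allowed_to_crawl(url, user_agent="ExampleSpider"):
        """Return True if the site's robots.txt permits fetching this URL."""
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        parser = robotparser.RobotFileParser()
        parser.set_url(urljoin(root, "/robots.txt"))
        try:
            parser.read()                   # fetch and parse robots.txt
        except OSError:
            return True                     # no robots.txt reachable: assume allowed
        return parser.can_fetch(user_agent, url)

    # Hypothetical usage:
    # if allowed_to_crawl("http://www.example.com/private/report.html"):
    #     ...fetch and index the page...

A page excluded this way, or hidden behind a password, simply never enters the engine's index, which is why such material falls into the invisible web.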

Better Searching

When a user types "movies" into Google, a googol of different answers will come up. There are ways of narrowing the search so that only relevant sites are selected. The first step is to decide what the user really wants, or, as Mary Ellen Bates puts it, "Who cares?" (Bates). The user should first try to state the request as a clear question, and then select the keywords in that question as the search terms. Words such as "the", "and", "I", and "to" should be omitted, and keywords such as "movies", "French", and "foreign" entered; the sites that come up will be much more relevant. Entering synonyms of the keywords will add more sites to the search: along with "movies", for example, the user could add "films" or "motion pictures" to the entry, and the search engine would bring up yet another, different list.
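The keyword-selection step described above can be illustrated with a small sketch. The word lists in this Python fragment are invented for the example; real search engines maintain their own stop-word lists, and synonym expansion is left to the user or the engine.

    # Sketch: turning a question into search keywords by dropping common
    # "stop words" and adding synonyms.  Word lists are illustrative only.
    STOP_WORDS = {"the", "and", "i", "to", "a", "an", "of", "in",
                  "are", "what", "some", "good"}
    SYNONYMS = {"movies": ["films", "motion pictures"]}

    def keywords_from_question(question):
        """Keep only meaningful words, then append any known synonyms."""
        words = [w.strip("?.,!").lower() for w in question.split()]
        keep = [w for w in words if w and w not in STOP_WORDS]
        for w in list(keep):
            keep.extend(SYNONYMS.get(w, []))
        return keep

    print(keywords_from_question("What are some good French foreign movies?"))
    # -> ['french', 'foreign', 'movies', 'films', 'motion pictures']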

Boolean search terms can be used to refine the search further. The user can enter "+", "-", "AND", "NOT", and "OR" to limit the number of hits. Using "AND" or "+", the search engine will bring up only those sites which contain all of the specified words; some search engines (Google, for example) use the Boolean "AND" by default. Thus, entering "Movies Foreign French" brings up only pages that contain all three of these words. If the user wants French movies that do not feature Gérard Depardieu, he would limit the search with a Boolean "NOT": entering "French AND foreign AND movies NOT Depardieu" weeds out the movies with Depardieu. If he wants films in French but does not care whether they were made in Canada or France, he would use the Boolean "OR" and enter "French foreign films OR France OR Canada". Most search engines also have an "Advanced Search" feature that helps the user refine a search even further; if something specific like a review or history is needed, this feature is the most appropriate one to use.
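The effect of these operators can be mimicked with set operations over a tiny, invented index, as in the Python sketch below; real engines evaluate the same logic against inverted indexes built by their crawlers.

    # Sketch: how AND, NOT, and OR narrow or broaden a result set.
    # The "pages" mapping is invented for demonstration purposes.
    pages = {
        "http://example.org/french-films": "french foreign movies depardieu",
        "http://example.org/canadian-cinema": "french canadian films movies quebec",
        "http://example.org/hollywood": "american movies blockbusters",
    }

    def pages_containing(word):
        """Return the set of URLs whose indexed text contains the word."""
        return {url for url, text in pages.items() if word in text.split()}

    # "french AND movies": only pages containing both terms.
    both = pages_containing("french") & pages_containing("movies")

    # "french AND movies NOT depardieu": drop pages mentioning Depardieu.
    without = both - pages_containing("depardieu")

    # "films OR movies": pages containing either term.
    either = pages_containing("films") | pages_containing("movies")

    print(both, without, either, sep="\n")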

When a user finds a site that he thinks he might use, he should bookmark it under "Favorites", creating a folder for his search. Just because the user has found the site once does not mean he will find it again, so bookmarking is a great way to save time in returning to a site. Another option is to save the entire page to the hard drive and delete it when he is finished with it. The advantage to this is that he does not have to be online to read the document, and he does not tie up his phone line (for those who still use dial-up connections rather than broadband).
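Saving a page for offline reading can be done with the browser's "Save As" command, or, as a minimal sketch under the assumption that only the raw HTML is wanted, with a few lines of Python:

    # Sketch: saving a copy of a page to disk for offline reading.
    import urllib.request

    def save_page(url, filename):
        """Download url and write the raw HTML to a local file."""
        with urllib.request.urlopen(url, timeout=10) as response:
            html = response.read()
        with open(filename, "wb") as f:
            f.write(html)

    # Hypothetical usage:
    # save_page("http://www.example.com/article.html", "article.html")

Note that this saves only the HTML itself, not the images or style sheets the page refers to, which is one reason the bookmarking approach above is usually the simpler choice.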

There are sites specifically set up to help people choose a search engine. Debbie Abilock's "Choose the Best Search Engine" site is a well-known favorite (Sherman), and even typing "choose search engine" into Google will bring up many sites that guide users through their searches. These guides help identify what kind of information the user needs and suggest search engines that will provide it.

Evaluating a web site

Here are some things for the user to consider when evaluating the materials in a site:

  • How did you find the site? Was it the 100th site returned by the search engine? Was it recommended by a colleague, a friend, or a journal you are familiar with? Was it a link from another page? Was it included in the bibliography of another web site?
  • Is it related to your topic?
  • Is the information accurate? One quick way to check for accuracy is to compare the information on the site with that of other sites on the same subject; a better way is to compare it with information you have found in other sources. Inaccurate information, however, will sometimes appear on multiple websites. If information presented on two different websites has identical wording, the repetition should not be taken as confirmation of factuality: it is likely that both pages simply copied the text from the same source, and if the original source was wrong, every website that copied it is just as wrong.
  • Is the site someone's home page or pet project? Or is it affiliated with an organization such as a university, a business, or a government agency? Is there a way to contact the author or the organization, by email or even snail mail? Is there a link to the organization's home page so that you can find out more about it?
  • Does the organization have an agenda or a bias? Does it offer a variety of viewpoints?
  • What is the purpose of the site? Is it for entertainment, or news, or editorial? Is it an advertisement?
  • Is there detailed information on the page? Is it an in-depth discussion on the topic, or is it a "My dog, Fido" project? Or is it primarily a list of links to other sites?
  • Is the information easy to read? Are the grammar and spelling correct? Are there maps, graphs, or charts?
  • How old is the page? Is there a date at the bottom of the page? Some sites will have the date, showing when the page was posted, and the date of the most recent update.

References

  • Bass, Steve. Maximum Google: Google Tips. PC World, Volume 21, Number 6, pages 121–126. June 2003 (online version).
  • Bates, Mary Ellen. Who Cares About Information Quality? SearchEngineWatch.com. June 17, 2003 (online version).
  • Burke, John. The Learning Internet; a Workbook for Beginners. New York, NY: Neal-Schuman Publishing, Inc., 1996.
  • Chankhunthod, Anawat, Peter B. Danzig, and Chuck Neerdaels (Computer Science Department, University of Southern California), and Michael F. Schwartz and Kurt J. Worrell (Computer Science Department, University of Colorado, Boulder). A Hierarchical Internet Object Cache.
  • Delio, Michelle. Big Changes for Search Engines. WiredNews.com. May 27, 2003 (online version).
  • Dragutsky, Paula. Guides to Specialized Search Engines. 1999 (last updated January 1, 2003) (online version).
  • Fischer, Arnaud. What's It Going to Take to Beat Google? SearchEngineWatch.com. June 12, 2003 (online version).
  • Glasner, Joanna. Search Results Clogged by Blogs. WiredNews.com. May 16, 2003 (online version).
  • Hahn, Harley and Rick Stout. The Internet, a Complete Reference. Berkeley, CA: Osborne McGraw-Hill, 1994.
  • Hutchinson, Sarah E. & Stacey C. Sawyer. Computers, Communications & Information; A User's Introduction. Boston: McGraw-Hill, 2000.
  • Kehoe, Brendan. Zen and the Art of the Internet. New Jersey: Prentice Hall, 1994.
  • NCSA Education Group (1993). An Incomplete Guide to the Internet (online version).
  • Sherman, Chris. What's the Best Search Engine? SearchEngineWatch.com. June 3, 2003 (online version).
  • Sherman, Chris and Gary Price. The Invisible Web: Uncovering Information Sources Search Engines Can't See. Medford, NJ: Information Today, Inc., 2001.
  • Standage, Tom. The Victorian Internet. New York, NY: Walker Publishing Company, Inc., 1998.
  • Symons, Ann K. "Sizing Up Sites: How to Judge What You Find on the Web: The Smart Web Primer Part 2." School Library Journal, April 1997: pp. 22–25.
  • Weiss, Arthur. "Searching for Mammon - Search engine business models": FreePint issue 40, June 10, 1999. (online version). This article is an update to a paper entitled "The Evolution Of World-Wide-Web Search Tools" in the Proceedings of the Online Information Conference 1998 Pages 289-295 - which can be requested from http://www.marketing-intelligence.co.uk/pubs/papers.htm

External links

  • A Hierarchical Internet Object Cache
  • http://www.theinvisibleweb.com
  • http://www.magportal.com/c/edu/research/
  • http://www.firstfind.info/help/help.html: A useful site for library media specialists and teachers who are helping their students with their first searches.
  • The Connecticut Digital Library: Searches thousands of popular and scholarly articles, from 1980 to the present, including Spanish-language articles, newspapers, business information on over 450,000 companies, health and wellness information, and much more.
  • The World Bank e-Library: a commercial, subscription-based electronic portal to the bank's full-text collection of books, reports, and other documents, bringing together more than 1,200 titles published by the World Bank in recent years in an indexed, cross-searchable database; all future formally published World Bank titles will be added. Institutions may also subscribe to the bank's electronic databases, "World Development Indicators Online" and "Global Development Finance Online". The e-Library is built and powered by Ingenta, which also provides online journal services via Ingenta.com and IngentaSelect.com. Access to e-Library titles is available via the World Bank site and via IngentaSelect.com, through an institutional subscription to the collection or on a pay-per-view basis.

These URLs and annotations, found in the Invisible Web, are cut and pasted from the sites themselves:

  • http://www.completeplanet.com/: Discover and search over 103,000 searchable databases and specialty search engines.
  • FindArticles.com is a vast archive of published articles that is available for free. Constantly updated, it contains articles from more than 300 magazines and journals, dating back to 1998.
  • Direct Search is a growing compilation of links that contain data, not easily or entirely searchable or accessible from general search tools like Alta Vista, Google, or Hotbot. Although these "general tools" are essential for the retrieval of internet-based data, searchers often fail to realize that a massive amount of information, residing on the Invisible Web, is not easily or entirely searchable or accessible via such tools.
  • InvisibleWeb.com offers the world's largest source of searchable databases; presently over 10,000. Its exclusive, automated discovery technology is designed to detect new sources, and automatically purges bad links, thus reducing users' frustration. IntelliSeek also offers a range of customization options and flexibility of services.
  • http://www.noodletools.com/debbie/literacies/information/5locate/advicedepth.html
  • http://turbo10.com/