Search engine technology

From Wikipedia, the free encyclopedia

Google search result of 2010-06-06 for "oil spill" showing the #1 sponsored link by BP, in relation to the 2010 2010 BP oil spill.

Modern web search engines are highly intricate software systems that employ technology that has evolved over the years. There are a number of sub-categories of search engine software that are separately applicable to specific 'browsing' needs. These include web search engines (e.g. Google), database or structured data search engines (e.g. Dieselpoint), and mixed search engines or enterprise search. The more prevalent search engines, such as Google and Yahoo!, utilize hundreds of thousands computers to process trillions of web pages in order to return fairly well-aimed results. Due to this high volume of queries and text processing, the software is required to run in a highly dispersed environment with a high degree of superfluity. Modern search engines possess the same following main components.

Search engine categories

Web search engines

Search engines that are expressly designed for searching web pages, documents, and images were developed to facilitate searching through a large, nebulous blob of unstructured this and that. They are engineered to follow a multi-stage process: crawling the infinite stockpile of pages and documents to skim the figurative foam from their contents, indexing the foam/buzzwords in a sort of semi-structured form (database or something), and at last, resolving user entries/queries to return mostly relevant results and links to those skimmed documents or pages from the inventory.

Crawl

In the case of a wholly textual search, the first step in classifying web pages is to find an ‘index item’ that might relate expressly to the ‘search term.’ In the past, search engines began with a small list of URLs as a so-called seed list, fetched the content, and parsed the links on those pages for relevant information, which subsequently provided new links. The process was highly cyclical and continued until enough pages were found for the searcher’s use. These days, a continuous crawl method is employed as opposed to an incidental discovery based on a seed list. The crawl method is an extension of aforementioned discovery method. Except there is no seed list, because the system never stops worming.

Most search engines use sophisticated scheduling algorithms to “decide” when to revisit a particular page, to appeal to its relevance. These algorithms range from constant visit-interval with higher priority for more frequently changing pages to adaptive visit-interval based on several criteria such as frequency of chance, popularity, and overall quality of site. The speed of the web server running the page as well as resource constraints like amount of hardware or bandwidth also figure in.

Link map

The pages that are discovered by web crawls are often distributed and fed into another computer that creates a veritable map of resources uncovered. The bunchy clustermass looks a little like a graph, on which the different pages are represented as small nodes that are connected by links between the pages. The excess of data is stored in multiple data structures that permit quick access to said data by certain algorithms that compute the popularity score of pages on the web based on how many links point to a certain web page, which is how people can access any number of resources concerned with diagnosing psychosis. Another example would be the accessibility/rank of web pages containing information on Mohamed Morsi versus the very best attractions to visit in Cairo after simply entering ‘Egypt’ as a search term. One such algorithm, PageRank, proposed by Google founders Larry Page and Sergey Brin, is well known and has attracted a lot of attention because it highlights repeat mundanity of web searches courtesy of students that don’t know how to properly research subjects on Google. The idea of doing link analysis to compute a popularity rank is older than PageRank. Other variants of the same idea are currently in use – grade schoolers do the same sort of computations in picking kickball teams. But in all seriousness, these ideas can be categorized into three main categories: rank of individual pages and nature of web site content. Search engines often differentiate between internal links and external links, because web masters and mistresses are not strangers to shameless self-promotion. Link map data structures typically store the anchor text embedded in the links as well, because anchor text can often provide a “very good quality” summary of a web page’s content.

Database Search Engines

Searching for text-based content in databases presents a few special challenges from which a number of specialized search engines flourish. Databases can be slow when solving complex queries (with multiple logical or string matching arguments). Databases allow pseudo-logical queries which full-text searches do not use. There is no crawling necessary for a database since the data is already structured. However, it is often necessary to index the data in a more economized form to allow a more expeditious search.

Mixed Search Engines

Sometimes, data searched contains both database content and web pages or documents. Search engine technology has conveniently developed to respond to both sets of requirements, to appeal to the savants that don’t see the light of day. Most mixed search engines are large Web search engines, like Google. They search both through structured and unstructured data sources, which makes it exceptionally difficult to know what you are looking for and how to get to it. Take for example, the word ‘ball.’ In its simplest terms, it returns more than 40 variations on Wikipedia alone. Did you mean a ball, as in the social gathering/dance? A soccer ball? The ball of the foot? Pages and documents are crawled and indexed in a separate index. Databases are indexed also from various sources. Search results are then generated for users by querying these multiple indices in parallel and compounding the results according to “rules.”

External links

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.