Proximity search (text)

From Wikipedia, the free encyclopedia

In text processing, a proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters. In addition to proximity, some implementations may also impose a constraint on the word order, in that the order in the searched text must be identical to the order of the search query. Proximity searching goes beyond the simple matching of words by adding the constraint of proximity and is generally regarded as a form of advanced search.

For example, a proximity search for "red brick house" could also match phrases such as "red house of brick" or "house made of red brick". By limiting the proximity, these phrases can be matched while avoiding documents where the words are scattered across a page.
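The matching rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: distance is counted as the number of words inside the matching window that are not query terms, the query terms are distinct, and the names `tokenize` and `within_proximity` are hypothetical rather than taken from any search engine. The brute-force loop is for clarity, not efficiency.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def within_proximity(text, terms, max_gap, ordered=False):
    """Return True if all terms occur in one window with at most
    max_gap intervening (non-query) words.

    If ordered is True, the terms must also appear in query order.
    Assumes the query terms are distinct.
    """
    tokens = tokenize(text)
    positions = []
    for term in terms:
        hits = [i for i, tok in enumerate(tokens) if tok == term.lower()]
        if not hits:
            return False  # a missing term can never match
        positions.append(hits)

    # Try every combination of one occurrence per term (brute force).
    for combo in product(*positions):
        if ordered and list(combo) != sorted(combo):
            continue
        gap = max(combo) - min(combo) + 1 - len(combo)
        if 0 <= gap <= max_gap:
            return True
    return False


query = ["red", "brick", "house"]
print(within_proximity("red house of brick", query, max_gap=2))        # True
print(within_proximity("house made of red brick", query, max_gap=2))   # True
print(within_proximity("red house of brick", query, 2, ordered=True))  # False
```

With `ordered=True` the sketch models implementations that also constrain word order, so "red house of brick" no longer matches the query order red, brick, house.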

Rationale

The basic linguistic assumption is that the proximity of words in a document implies a relationship between them. Because authors try to formulate sentences that contain a single idea, and to cluster related ideas in neighboring sentences or paragraphs, there is an inherently high probability that words used close together are related. By contrast, when two words sit at opposite ends of a book, the probability of a relationship between them is relatively weak. By limiting results to matches in which the words fall within a specified maximum distance, the search results are assumed to be more relevant than matches in which the words are scattered.

Commercial Internet search engines tend to return too many matches for the average search query (high recall, but low precision). Proximity searching is one method to reduce the number of page matches and to improve the relevance of the matched pages by using word proximity to assist in ranking. As an added benefit, proximity searching helps combat spamdexing: it avoids webpages that contain dictionary lists or shotgun lists of thousands of words, which would otherwise rank highly in search engines that rely heavily on word frequency for ranking.
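One way to picture proximity-assisted ranking is to score each document by the smallest window of words that contains every query term, then sort by that window size. The sketch below is an illustrative assumption, not a description of how any particular search engine ranks pages; `smallest_span` and `rank_by_proximity` are hypothetical names.

```python
import re


def smallest_span(text, terms):
    """Return the length (in words) of the shortest window containing
    every term, or None if some term is missing."""
    tokens = re.findall(r"\w+", text.lower())
    wanted = {t.lower() for t in terms}
    last_seen = {}  # term -> index of its most recent occurrence
    best = None
    for i, tok in enumerate(tokens):
        if tok in wanted:
            last_seen[tok] = i
            if len(last_seen) == len(wanted):
                # Shortest window ending at i starts at the oldest
                # "most recent occurrence" among the terms.
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best


def rank_by_proximity(docs, terms):
    """Drop documents missing a term; sort the rest by tightest window."""
    scored = [(smallest_span(d, terms), d) for d in docs]
    scored = [(s, d) for s, d in scored if s is not None]
    return [d for s, d in sorted(scored, key=lambda pair: pair[0])]


docs = [
    "brick dust covered the road and the old house",
    "a red brick house",
    "nothing relevant here",
]
print(rank_by_proximity(docs, ["brick", "house"]))
```

A page that is merely a long word list tends to produce a wide window for most queries, so under this scoring it would sort below pages where the terms appear together.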

Boolean Syntax and Operators

Note that a proximity search can designate that only some keywords must be within a specified distance. Proximity searching can be combined with other search syntax and controls to allow more precise queries. Query operators such as NEAR, NOT NEAR, FOLLOWED BY, NOT FOLLOWED BY, SENTENCE or FAR are sometimes used to indicate a proximity limit between specified keywords, for example "brick NEAR house".

Usage in Commercial Search Engines

Google allows ordered proximity searching with asterisks (*) inside a quoted phrase, each asterisk spanning roughly two intervening words. Because word order is fixed within the quotes, both orders must be listed: "brick *** house" OR "house *** brick" matches up to 7 intervening words (as of October 2006).

Implicit/automatic versus explicit proximity search: As of November 2006, most Internet search engines except Exalead and Yahoo! implement only an implicit proximity search. That is, they automatically rank higher those results in which the user's keywords have a good "overall proximity score". If the query contains only two keywords, this is no different from an explicit proximity search that puts a NEAR operator between them. However, if three or more keywords are present, it is often important for the user to specify which subsets of these keywords should appear close together in the results. This is useful if the user wants to do a prior art search (e.g. finding an existing approach to complete a specific task, or finding a document that discloses a system exhibiting a procedural behavior conducted collaboratively by several components and the links between them).

For example, in a search query in the form of: (keyword1 NEAR keyword2) (keyword1 NEAR keyword3), the query specifies that keyword1 and keyword2 must co-occur closely somewhere in a document, and so must keyword1 and keyword3. However, keyword2 and keyword3 need not occur closely anywhere in the document.
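The query form above can be evaluated as a conjunction of pairwise tests. The following is a minimal sketch, assuming NEAR is unordered and means "at most `max_gap` words in between"; the default of 5 and the names `near` and `matches_all` are assumptions made for illustration, since the text does not fix a distance.

```python
import re


def near(tokens, a, b, max_gap):
    """True if some occurrence of a and some occurrence of b have at
    most max_gap words between them, in either order."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) - 1 <= max_gap
               for i in pos_a for j in pos_b if i != j)


def matches_all(text, pairs, max_gap=5):
    """Every (a, b) pair must satisfy NEAR somewhere in the text."""
    tokens = re.findall(r"\w+", text.lower())
    return all(near(tokens, a.lower(), b.lower(), max_gap)
               for a, b in pairs)


# keyword1 = pump, keyword2 = valve, keyword3 = sensor
text = "the pump opens the valve " + "filler " * 20 + "the sensor checks the pump"

# (pump NEAR valve) (pump NEAR sensor): both pairs co-occur closely.
print(matches_all(text, [("pump", "valve"), ("pump", "sensor")]))  # True

# Adding (valve NEAR sensor) fails: those two are far apart.
print(matches_all(text, [("pump", "valve"), ("pump", "sensor"),
                         ("valve", "sensor")]))                    # False
```

Note that each pair may be satisfied by a different occurrence of keyword1, which is what "somewhere in a document" permits.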

Exalead allows the user to specify the required proximity as the maximum number of words between keywords. The syntax is (keyword1 NEAR/n keyword2), where n is the number of words. In the Walhello search engine, the proximity can instead be defined by the number of characters between the keywords.
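Measuring proximity in characters rather than words changes only the unit of distance. As a rough sketch, assuming the gap is counted from the end of one keyword to the start of the other (how Walhello actually counts is not stated here, and `char_gap_near` is a hypothetical name):

```python
import re


def char_gap_near(text, a, b, max_chars):
    """True if a whole-word occurrence of a and one of b are separated
    by at most max_chars characters, in either order."""
    def spans(word):
        pattern = r"\b" + re.escape(word) + r"\b"
        return [m.span() for m in re.finditer(pattern, text, re.IGNORECASE)]

    for start_a, end_a in spans(a):
        for start_b, end_b in spans(b):
            # Characters strictly between the two keyword occurrences.
            gap = start_b - end_a if start_b >= end_a else start_a - end_b
            if 0 <= gap <= max_chars:
                return True
    return False


print(char_gap_near("red brick house", "brick", "house", 1))           # True
print(char_gap_near("house made of red brick", "brick", "house", 12))  # False
print(char_gap_near("house made of red brick", "brick", "house", 13))  # True
```

A character limit is sensitive to word length, so the same limit admits more short words than long ones between the keywords.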

Proximity search within the Google and Yahoo! search engines is possible using full-word wildcards: the wildcard is an asterisk "*" in Google, and an "a" in Yahoo! Search.

Google Asterisk: Using Google's asterisk-in-quotations approach to emulate a NEAR operator is somewhat cumbersome, but it worked as of October 2006. For example, to require a close co-occurrence (at most 2 words apart) of "house" and "dog", the following search expression could be used:

"house * dog" OR "dog * house" <--Search for house/dog up to 2 words apart.

Note the operator "OR" must be in capital letters. One asterisk allows a proximity of at most two words' distance between two search-words. To span 7 intervening words, use 3 asterisks:

"house *** dog" OR "dog *** house" <--Search for house/dog up to 7 words apart.

To span up to 11 intervening words in a Google search, use 4 asterisks, etc.
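Because both word orders must be written out by hand, such queries are easy to generate programmatically. The helper below is a small sketch that only builds the query string in the format shown above; `asterisk_near_query` is a hypothetical name, and whether a given search engine still interprets asterisks this way is not something the code can guarantee.

```python
def asterisk_near_query(word_a, word_b, asterisks=1):
    """Build an either-order wildcard query such as
    '"house * dog" OR "dog * house"'."""
    if asterisks < 1:
        raise ValueError("at least one asterisk is required")
    stars = "*" * asterisks
    forward = f'"{word_a} {stars} {word_b}"'
    backward = f'"{word_b} {stars} {word_a}"'
    # The OR operator must be written in capital letters.
    return f"{forward} OR {backward}"


print(asterisk_near_query("house", "dog"))      # "house * dog" OR "dog * house"
print(asterisk_near_query("house", "dog", 3))   # "house *** dog" OR "dog *** house"
```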
