SQL Server Full Text Search

From Wikipedia, the free encyclopedia

'SQL Server Full Text Search' is an inexact string matching technology for SQL Server. It is a powerful and fast way of referencing the contents of almost any character-based column on SQL Server 2000.

Full text indexes must be populated and are stored inside full-text catalogue. Full text searching is installed as a separate service called MSSearch. Full-text searching is an optional component of SQL Server 7.0/2000/2005

Contents

[edit] Types of Searches

  • Proximity based searches – cf. Section
  • Inflectional based searches – cf. Section
  • Weight based searches – cf. Section
  • Prefix term searches – cf. Section
  • Binary document search – cf. Section

[edit] A Common Example

The following article contains SQL code and simple worked examples. These examples refer to an imaginary table called "Athletes". This table contains information about various athletes their events and results in the Olympic Games. Inside this table you will find the following column names:

  • id
  • name
  • surname
  • country
  • notes

There exists a prior full-text index on the column "notes". The primary key of the table is on the column "id"

[edit] Full Text Query Syntax

There are 4 main keywords that can be invoked when full text searching is enabled: They are: "CONTAINS", "CONTAINSTABLE", "FREETEXT" and "FREETEXTTABLE". Their functionality and difference are consolidated in the following table:

Keyword Meaning Returns
CONTAINS Supports complex syntax to search for a word or phrase using boolean operators,prefix terms,proximity operators. A simple boolean
CONTAINSTABLE Supports complex syntax to search for a word or phrase using boolean operators,prefix terms,proximity operators A table with rankings of how well the match was
FREETEXT Intuitive syntax to match the meaning and not the exact wording of the words in the search condition.. Has the ability to match plurals and other inflectional forms of words. Multiple words are considered with "OR". A simple boolean
FREETEXTTABLE Intuitive syntax to match the meaning and not the exact wording of the words in the search condition.. Has the ability to match plurals and other inflectional forms of words.Multiple words are considered with "OR". A table with rankings of how well the match was

The rank is just a number from 0 to 1000 that indicates the accuracy of the search result. The higher the value the more accurate the match.

[edit] Proximity Searches

Full-text searching has the ability to find words that are located physically near each other. This is achieved by using the keyword "NEAR" between the 2 search targets.

Microsoft does not explicitly state what exactly it means by the term "NEAR". However it is considered to be around 8 to 10 words in most circumstances. The keyword alludes to the possibility that the algorithm used to calculate proximity may be a Nearest neighbor search (NNS) or possibly a Levenshtein distance algorithm, both of which are distance-based similarity functions. Given a set S of n objects lying in some space U with a similarity so that given a query q ∈ U, one can locate the point that is most similar to q among the points in S.

In T-SQL syntax, the keyword "NEAR" can also be substituted with the "~" operator to achieve the same results.

[edit] Proximity Search Simple Example #1 using the "NEAR" keyword


SELECT name, surname
FROM CONTAINSTABLE(Athletes, *, '("Gold" NEAR "Finalists")') As ct
JOIN Athletes As e
ON ct.[KEY] = e.id


[edit] Proximity Search Simple Example #2 using the "~" operator


SELECT name, surname
FROM CONTAINSTABLE(Athletes, *, '("Gold" ~ "First") OR ("Silver" ~ "Second") OR ("Bronze" ~ "Third")') As ct
JOIN Athletes As e
ON ct.[KEY] = e.id


[edit] Inflectional Searches

Full-text searching has the ability to search for word inflections, that is, for differing forms of the same word that have the same general meaning. For example: "swim", "swam" and "swimming"

Both singular and plural forms of a word can be returned - It can also handle gender and neutral form of words, verbs and adjectives. This is achieved by using the keywords "FORMS OF" and "INFLECTIONAL" before the word of interest.

[edit] Inflectional Searches Simple Example #1


SELECT name, surname
FROM Athletes
WHERE CONTAINS(Notes, 'FORMSOF (INFLECTIONAL, swim)')

[edit] Prefix Term Searches

Full-text searching has the ability to search for word beginning with a certain term. This is achieved by using a clever combination of single and double quotes followed by an * symbol.

[edit] Prefix Term Simple Example #1

SELECT name, surname
FROM Athletes a
WHERE CONTAINS(*, '"swim*"')

[edit] Binary Document Search

Full-text searching has the ability to search through binary data. The supported types are as follows:

  • MS Word (*.doc)
  • MS Excel (*.xls)
  • MS PowerPoint (*.ppt)
  • HTML (*.htm or *.html)

[edit] Weighted Searches

Full-text searching has the ability to associate a higher preference for certain terms over others. This is achieved by using the function: “ISABOUT” and the “WEIGHT” keyword. The keyword is associated with any decimal value between 0.0 and 1.0. The higher the weight the greater preference the match will be given

[edit] Weighted Searches Simple Example #1

SELECT name, surname
FROM Athletes a
JOIN CONTAINSTABLE(Athletes, notes, 'ISABOUT(Bronze WEIGHT(.2), Silver WEIGHT(.4), Gold WEIGHT(.8))') ct
ON a.id = ct.[KEY]
ORDER BY Rank DESC

[edit] Noise Words

These are common words that MS Search feels should be ignored when considering matches.

For example the article words like “the”, “an” and the conjunctions like: “and”, “but” and “or” occur so frequently that to include them in a search would result in too many false positives.

[edit] Hardware Optimization Considerations

By default, MS SQL Server reserves all the virtual memory in its process space, minus 1GB. This leaves little for MSSearch. Consider readjusting to a ratio of 8(SQL Server):2(MSSearch). By default the Windows 2000/NT virtual memory is 1.5 times physical memory. Consider increasing this to 3 times physical memory.