Query optimization

From Wikipedia, the free encyclopedia

This article or section needs to be wikified to meet Wikipedia's quality standards.
Please help improve this article with relevant internal links. (October 2007)

Query optimization is a function of many relational database management systems in which multiple query plans for satisfying a query are examined and a good query plan is identified. This may or not be the absolute best strategy because there are many ways of doing plans. There is a trade off between the amount of time spent figuring out the best plan and the amount running the plan. Different qualities of database management systems have different ways of balancing these two. Cost based query optimizers evaluate the resource footprint of various query plans and use this as the basis for plan selection. Typically the resources which are costed are CPU path length, amount of disk buffer space, disk storage service time, and interconnect usage between units of parallelism. The set of query plans examined is formed by examining possible access paths (e.g., primary index access, secondary index access, full file scan) and various relational table join techniques (e.g, merge join, hash join, product join). The search space can become quite large depending on the complexity of the SQL query. There are two types of optimization. These consist of logical optimization which generates a sequence of relational algebra to solve the query. In addition there is physical optimization which is used to determine the means of carrying out each operation.

[edit] Translating SQL queries into relational algebra

Different optimizer sequences such as the different Greek alphabet characters which are assigned to them. This is explained in the article relational algebra and query optimizer.

[edit] The goal of query optimization

The goal is to eliminate as many unneeded tuples, or rows as possible. The following is a look at relational algebra as it eliminates unneeded tuples. The project operator is straightforward to implement if <attribute list> contains a key to relation R. If it does not include a key of R, it must be eliminated. This must be done by sorting (see sort methods below) and eliminating duplicates. This method can also use hashing to eliminate duplicates Hash table.

SQL command distinct: this does not change the actual data. This just eliminates the duplicates from the results.

Set operations: Database management heavily relies on the mathematical principles of set theory which is key in comprehending these operations.

Union: This displays all that appear in both sets, each listed once. These must be union compatible. This means all sequences of selected columns must designate the same number of columns. The data types of the corresponding columns must thereby comply with the conditions valid for comparability. Each data type can be compared to itself. Columns of data type CHAR with the different ASCII and EBCDIC code attributes can be compared to each other, whereby they are implicitly adapted. Columns with the ASCII code attribute can be compared to date, time, and timestamp specifications. All numbers can be compared to each other. In ANSI SQL mode, it is not enough for the data types and lengths of the specified columns to be compatible: they must match. Moreover, only column specifications or * may be specified in the selected columns of the QUERY specifications linked with UNION. Literals may not be specified (wisc).

Intersection: This only lists items whose keys appear in both lists. This too must be union compatible. See above for definition.

Set difference: This lists all items whose keys appear in the first list but not the second. This too must be union compatible.

Cartesian Product: Cartesian products (R * S) takes a lot of memory because its result contains a record for each combination of records from R and S.

[edit] A common searching algorithm

The following is a quick overview of an algorithm used to search through the different tuples: External sorting methods are those that are suitable for large files of records stored on disk that do not fit in memory, and most database files typically fit this description. The most common example is the sort merge. The following is an explanation for this type of sort.

MergeSort is a recursive sorting procedure that uses O(n log n) comparisons in the worst case. To sort an array of n elements, we perform the following three steps in sequence:

   * If n<2 then the array is already sorted. Stop now.
   * Otherwise, n>1, and we perform the following three steps in sequence:
        1. Sort the left half of the array.
        2. Sort the right half of the array.
        3. Merge the now-sorted left and right halves.