kd-trie

From Wikipedia, the free encyclopedia


A kd-trie is a spatial data structure for indexing points in k-dimensional space. It is a variation on the kd-tree that only stores points in the leaf nodes, sometimes in small collections called bins or buckets. Internal nodes only store the dimension and the value that split. This removes the restriction that the splitting plane pass through one of the points being stored. kd-tries are usually static data structures.

Most papers refer to kd-tries as kd-trees because they're extremely similar. The articles are currently separate to avoid confusion about the algorithms that are applied to them. In fact, the original paper by Friedman and Bentley redefine the structure without using a different name for them. Overmars refers to them as "pseudo k-d tree," after his own "pseudo quad-tree." Samet calls it an "adaptive k-d tree" because it uses "adaptive partitioning." However, the original kd-tree also uses adaptive partitioning—that is, it adapts to the data set, instead of splitting regularly (in the geometric sense).

Contents

[edit] Constructing a kd-trie

Depending on the application, the choice of splitting rule varies. However, the overall method remains the same:

function split (list of points pointList, int depth)
{
    // varies, but should return a discriminator dimension and value
}
function kdtrie (list of points pointList, int depth)
{
    if pointList is empty
        return nil;
    else
    {
        var discriminator = split(pointList, depth)

        // Create node and construct subtrees
        var tree_node node;
        node.discriminator := discriminator;
        node.leftChild := kdtrie(points in pointList less than discriminator, depth+1);
        node.rightChild := kdtrie(points in pointList greater than or equal to discriminator, depth+1);
        return node;
    }
}

[edit] Splitting rules

The original splitting rule is the one presented in Bentley's paper for building a kd-tree. It is optimal in the sense of height, of O \left ( \log n \right )

function split (list of points pointList, int depth)
{
    // Select axis based on depth so that axis cycles through all valid values
    var int dimension := depth mod k;

    // Sort point list and choose median as pivot element
    select median from pointList using predicate: point1[dimension] < point2[dimension];
    return dimension, median
}

In a later paper, Bentley et al. propose a variation on the above rule. Instead of simply rotating through the dimensions, it selects the dimension with the maximum spread (max-min). This is now referred to as the standard splitting rule. It maximizes the probability of needing to explore only one subtree, while ensuring that each tree is of equal size; therefore the depth is still optimal. However, the aspect ratio of the cells may be very high.

The midpoint splitting rule selects on the middle of the longest axis of the space being searched, regardless of the distribution of points. This guarantees that the aspect ratio will be at most 2:1, but the depth is dependent on the distribution of points. A variation, called sliding-midpoint, only splits on the middle if there are points on both sides of the split. Otherwise, it splits on point nearest to the middle. Maneewongvatana and Mount have shown that this offers "good enough" performance on common data sets.

Other splitting rules exist, but are not as commonly used.

[edit] Applications

[edit] Approximate nearest neighbor

Using sliding-midpoint, can be answered in O \left ( \frac{ 1 }{ { \epsilon\ }^d } \log n \right )

[edit] Approximate range counting

Using sliding-midpoint, can be answered in O \left ( \log n + { \left ( \frac{1}{ \epsilon\ } \right ) }^d \right )

[edit] External links

[edit] References