DBSCAN

From Wikipedia, the free encyclopedia

Please help improve this article or section by expanding it.
Further information might be found on the talk page or at requests for expansion. (October 2007)

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaovei Xui in 1996. It is a density based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes. A parallelisation of the DBSCAN algorithm was presented by Domenica Arlia and Massimo Coppola in 2001.

1 Steps
2 Pseudocode
3 Advantages
4 Disadvantages
5 References

[edit] Steps

DBScan requires two parameters: epsilon (eps) and minimum points (minPts). It starts with an arbitrary starting point that has not been visited. It then finds all the neighbour points within distance eps of the starting point.

If the number of neighbors is greater than or equal to minPts, a cluster is formed. The starting point and its neighbors are added to this cluster and the starting point is marked as visited. The algorithm then repeats the evaluation process for all the neighbours recursively.

If the number of neighbors is less than minPts, the point is marked as noise.

If a cluster is fully expanded (all points within reach are visited) then the algorithm proceeds to iterate through the remaining unvisited points in the dataset.

[edit] Pseudocode

  C = 0
  for each unvisited point P in dataset D
     N = getNeighbors (P, epsilon)
     if (sizeof(N) < minPts)
        mark P as NOISE
     else
        ++C
        mark P as visited
        add P to cluster C
        recurse (N)

[edit] Advantages

1. DBScan does not require you to know the number of clusters in the data a priori. Compare this with K Means.

2. DBScan does not have a bias towards a particular cluster shape or size. Compare this with K Means.

3. DBScan is resistant to noise and provides a means of filtering for noise if desired.

[edit] Disadvantages

1. DBScan does not respond well to high dimensional data. As dimensionality increases, so does the relative distance between points making it harder to perform density analysis.

2. DBScan does not respond well to data sets with varying densities.

[edit] References

"A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". Proceedings of 2nd International Conference on KDD, AAAI Press. Retrieved on 2007-10-15.

"Experiments in Parallel Clustering with DBSCAN". Euro-Par 2001: Parallel Processing: 7th International Euro-Par Conference Manchester, UK August 28-31, 2001, Proceedings, Springer Berlin. Retrieved on 2004-02-19.