ELKI
Screenshot of ELKI 0.4 visualizing OPTICS cluster analysis.

Developer(s) | Ludwig Maximilian University of Munich |
---|---|
Stable release | 0.6.0 / January 10, 2014 |
Preview release | 0.6.5~20141030 / October 30, 2014 |
Written in | Java |
Operating system | Microsoft Windows, Linux, Mac OS |
Platform | Java platform |
Type | Data mining |
License | AGPL (since version 0.4.0) |
Website | http://elki.dbs.ifi.lmu.de/ |
ELKI (for Environment for DeveLoping KDD-Applications Supported by Index-Structures) is a knowledge discovery in databases (KDD, "data mining") software framework developed for use in research and teaching by the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.
Description
The ELKI framework is written in Java and built around a modular architecture. Most currently included algorithms belong to clustering, outlier detection[1] and database indexes. A key concept of ELKI is to allow the combination of arbitrary algorithms, data types, distance functions and indexes and evaluate these combinations. When developing new algorithms or index structures, the existing components can be reused and combined.
Objectives
The university project is developed for use in teaching and research. The source code is written with extensibility, readability and reusability in mind, but is also well-optimized for performance. Since the experimental evaluation of algorithms depends on many environmental factors, ELKI aims at providing a shared codebase with comparable implementations of many algorithms.
As a research project, it currently does not offer integration with business intelligence applications or an interface to common database management systems via SQL. The copyleft (AGPL) license may also be a hindrance to commercial usage. Furthermore, applying the algorithms requires knowledge of their usage and parameters, as well as study of the original literature. The intended audience is students, researchers and software engineers.
Architecture
ELKI is modeled around a database core, which uses a vertical data layout that stores data in column groups (similar to column families in NoSQL databases). This database core provides nearest neighbor search, range/radius search, and distance query functionality with index acceleration for a wide range of dissimilarity measures. Algorithms based on such queries (e.g. k-nearest-neighbor algorithm, local outlier factor and DBSCAN) can be implemented easily and benefit from the index acceleration. The database core also provides fast and memory-efficient collections for objects and for associative structures such as nearest neighbor lists.
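The separation between queries and algorithms described above can be illustrated with a minimal sketch. All names here (`Distance`, `NeighborQuery`, `LinearScanQuery`, `QuerySketch`) are hypothetical and are not ELKI's actual classes; the sketch only assumes that an algorithm talks to a query interface, so that a linear scan could later be swapped for an index-backed implementation without changing the algorithm.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical distance abstraction; not ELKI's actual interface. */
interface Distance {
  double between(double[] a, double[] b);
}

/** Hypothetical query facade: algorithms only depend on these two methods. */
interface NeighborQuery {
  /** Indices of the k objects closest to the query object (itself included). */
  List<Integer> knn(int queryIndex, int k);

  /** Indices of all objects within the given radius of the query object. */
  List<Integer> range(int queryIndex, double radius);
}

/** Linear scan without any index; an indexed variant could replace it. */
class LinearScanQuery implements NeighborQuery {
  private final double[][] data;
  private final Distance dist;

  LinearScanQuery(double[][] data, Distance dist) {
    this.data = data;
    this.dist = dist;
  }

  @Override
  public List<Integer> knn(int q, int k) {
    List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < data.length; i++) {
      ids.add(i);
    }
    // Stable sort by distance to the query object; ties keep index order.
    ids.sort(Comparator.comparingDouble((Integer i) -> dist.between(data[q], data[i])));
    return new ArrayList<>(ids.subList(0, Math.min(k, ids.size())));
  }

  @Override
  public List<Integer> range(int q, double radius) {
    List<Integer> hits = new ArrayList<>();
    for (int i = 0; i < data.length; i++) {
      if (dist.between(data[q], data[i]) <= radius) {
        hits.add(i);
      }
    }
    return hits;
  }
}

class QuerySketch {
  /** Euclidean distance, used here only as an example measure. */
  static final Distance EUCLIDEAN = (a, b) -> {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return Math.sqrt(sum);
  };

  /** DBSCAN-style core point test, expressed purely through a range query. */
  static boolean isCorePoint(NeighborQuery query, int index, double eps, int minPts) {
    return query.range(index, eps).size() >= minPts;
  }
}
```

Because `isCorePoint` never touches the data array directly, any acceleration added behind `NeighborQuery` would benefit it automatically, which is the design idea the paragraph describes.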
ELKI makes extensive use of Java interfaces, so that it can be extended easily in many places. For example, custom data types, distance functions, index structures, algorithms, input parsers, and output modules can be added and combined without modifying the existing code. This includes the possibility of defining a custom distance function and using existing indexes for acceleration.
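As a rough illustration of this interface-based extensibility, the following sketch adds a custom distance function for a custom data type and passes it to an existing generic component. The names `DistanceFunction`, `NearestNeighbor` and `PositionalMismatchDistance` are assumptions made for this example; ELKI's real interfaces have different names and signatures.

```java
import java.util.List;

/** Hypothetical extension point; ELKI's real interfaces differ. */
interface DistanceFunction<T> {
  double distance(T a, T b);
}

/** "Existing" component: written once against the interface only. */
final class NearestNeighbor {
  private NearestNeighbor() {}

  /** Returns the candidate closest to the query, or null for an empty list. */
  static <T> T nearest(List<T> candidates, T query, DistanceFunction<? super T> df) {
    T best = null;
    double bestDist = Double.POSITIVE_INFINITY;
    for (T c : candidates) {
      double d = df.distance(query, c);
      if (d < bestDist) {
        bestDist = d;
        best = c;
      }
    }
    return best;
  }
}

/**
 * Extension added later without touching NearestNeighbor: counts positions
 * at which two strings differ, plus the difference in length.
 */
final class PositionalMismatchDistance implements DistanceFunction<String> {
  @Override
  public double distance(String a, String b) {
    int common = Math.min(a.length(), b.length());
    int mismatches = Math.abs(a.length() - b.length());
    for (int i = 0; i < common; i++) {
      if (a.charAt(i) != b.charAt(i)) {
        mismatches++;
      }
    }
    return mismatches;
  }
}
```

A call such as `NearestNeighbor.nearest(names, "karotin", new PositionalMismatchDistance())` then combines the new distance with the unchanged search code.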
ELKI uses a service loader architecture to allow publishing extensions as separate jar files.
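The general idea of a service loader can be sketched with the standard `java.util.ServiceLoader` class. This is an assumption for illustration: ELKI's own loading mechanism may differ in detail, and the `ClusteringPlugin` interface and `PluginRegistry` class below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical extension interface that a separate plug-in jar would implement. */
interface ClusteringPlugin {
  String name();
}

final class PluginRegistry {
  private PluginRegistry() {}

  /**
   * Collects the names of all implementations found on the class path.
   * With the standard mechanism, a jar announces its implementations in a
   * text file META-INF/services/<fully qualified interface name> that lists
   * one implementation class per line.
   */
  static List<String> discoverNames() {
    List<String> names = new ArrayList<>();
    for (ClusteringPlugin plugin : ServiceLoader.load(ClusteringPlugin.class)) {
      names.add(plugin.name());
    }
    return names;
  }
}
```

If no jar on the class path registers an implementation, the list is simply empty; dropping in an additional jar makes its classes available without recompiling the core.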
Visualization
The visualization module uses SVG for scalable graphics output, and Apache Batik for rendering of the user interface as well as lossless export into PostScript and PDF for easy inclusion in scientific publications in LaTeX. Exported files can be edited with SVG editors such as Inkscape. Since cascading style sheets are used, the graphics design can be restyled easily. Unfortunately, Batik is rather slow and memory intensive, so the visualizations are not very scalable to large data sets.
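The benefit of combining SVG with style sheets can be shown with a small sketch that builds an SVG scatter plot as plain text. It does not use Batik and is not ELKI's visualization code; the class `SvgScatterSketch`, the `marker` class name and the fixed 100×100 view box are assumptions chosen for the example.

```java
import java.util.Locale;

final class SvgScatterSketch {
  private SvgScatterSketch() {}

  /**
   * Renders points (coordinates assumed to lie in 0..100) as circles.
   * All visual properties come from the supplied CSS, so the same geometry
   * can be restyled by passing a different style sheet.
   */
  static String render(double[][] points, String css) {
    StringBuilder sb = new StringBuilder();
    sb.append("<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"0 0 100 100\">\n");
    sb.append("<style type=\"text/css\">").append(css).append("</style>\n");
    for (double[] p : points) {
      // SVG's y axis points downwards, so flip it for a conventional plot.
      sb.append(String.format(Locale.ROOT,
          "<circle class=\"marker\" cx=\"%.1f\" cy=\"%.1f\" r=\"1.5\"/>\n",
          p[0], 100 - p[1]));
    }
    sb.append("</svg>\n");
    return sb.toString();
  }
}
```

Calling `render(points, ".marker { fill: red; }")` and `render(points, ".marker { fill: none; stroke: black; }")` yields the same shapes with different appearance, which is the kind of restyling the style-sheet approach allows.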
Awards
ELKI started as an implementation[2] of the doctoral dissertation of Arthur Zimek,[3] which was awarded the "SIGKDD Doctoral Dissertation Award 2009 Runner-up"[4] by the Association for Computing Machinery for its contributions to correlation clustering. The algorithms published as part of the dissertation (4C, COPAC, HiCO, ERiC, CASH) are available in ELKI.[2]
Version 0.4, presented at the "Symposium on Spatial and Temporal Databases" 2011, which included various methods for spatial outlier detection,[5] won the conference's "best demonstration paper award".
Included algorithms
Selected included algorithms:[6]
- Cluster analysis:
  - K-means clustering
  - Expectation-maximization algorithm
  - Hierarchical clustering
  - Single-linkage clustering
  - DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
  - OPTICS (Ordering Points To Identify the Clustering Structure), including the extensions OPTICS-OF, DeLi-Clu, HiSC, HiCO and DiSH
  - SUBCLU (Density-Connected Subspace Clustering for High-Dimensional Data)
  - Canopy clustering algorithm
- Anomaly detection:
  - LOF (Local outlier factor)
  - OPTICS-OF
  - DB-Outlier (Distance-Based Outliers)
  - LOCI (Local Correlation Integral)
  - LDOF (Local Distance-Based Outlier Factor)
  - EM-Outlier
- Spatial index structures:
- Evaluation:
  - Receiver operating characteristic (ROC curve)
  - Scatter plot
  - Histogram
  - Parallel coordinates (also in 3D, using OpenGL)
- Other:
Version history
Version 0.1 (July 2008) contained several algorithms from cluster analysis and anomaly detection, as well as some index structures such as the R*-tree. The focus of the first release was on subspace clustering and correlation clustering algorithms.[7]
Version 0.2 (July 2009) added functionality for time series analysis, in particular distance functions for time series.[8]
Version 0.3 (March 2010) extended the choice of anomaly detection algorithms and visualization modules.[9]
Version 0.4 (September 2011) added algorithms for geo data mining and support for multi-relational database and index structures.[5]
Version 0.5 (April 2012) focuses on the evaluation of cluster analysis results, adding new visualizations and some new algorithms.[10]
Version 0.6 (June 2013) introduces a new 3D adaptation of parallel coordinates for data visualization, apart from the usual additions of algorithms and index structures.[11]
Related applications
- Weka: a similar project by the University of Waikato, with a focus on classification algorithms.
- RapidMiner: an application available both as open source and commercially, with a focus on machine learning.
- Konstanz Information Miner (KNIME): an open source data analytics platform integrated in Eclipse.
External links
- Official web page of ELKI with download and documentation.
References
- [1] Hans-Peter Kriegel, Peer Kröger, Arthur Zimek (2009). "Outlier Detection Techniques (Tutorial)". 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2009) (Bangkok, Thailand). Retrieved 2010-03-26.
- [2] Zimek, A. (2009). "Correlation clustering". ACM SIGKDD Explorations Newsletter 11 (1): 53–54. doi:10.1145/1656274.1656286.
- [3] Zimek, Arthur (2008-06-30), Correlation Clustering, Munich, Germany: Ludwig Maximilian University of Munich, urn:nbn:de:bvb:19-87361
- [4] "SIGKDD Doctoral Dissertation Award". ACM SIGKDD. Retrieved 30 May 2010.
- [5] Elke Achtert, Achmed Hettab, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2011). "Spatial Outlier Detection: Data, Algorithms, Visualizations". 12th International Symposium on Spatial and Temporal Databases (SSTD 2011) (Minneapolis, MN: Springer). doi:10.1007/978-3-642-22922-0_41.
- [6] Excerpt from "Data Mining Algorithms in ELKI 0.4". Retrieved August 17, 2011.
- [7] Elke Achtert, Hans-Peter Kriegel, Arthur Zimek (2008). "ELKI: A Software System for Evaluation of Subspace Clustering Algorithms". Proceedings of the 20th international conference on Scientific and Statistical Database Management (SSDBM 08) (Hong Kong, China: Springer). doi:10.1007/978-3-540-69497-7_41.
- [8] Elke Achtert, Thomas Bernecker, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2009). "ELKI in time: ELKI 0.2 for the performance evaluation of distance measures for time series". Proceedings of the 11th International Symposium on Advances in Spatial and Temporal Databases (SSTD 2009) (Aalborg, Denmark: Springer). doi:10.1007/978-3-642-02982-0_35.
- [9] Elke Achtert, Hans-Peter Kriegel, Lisa Reichert, Erich Schubert, Remigius Wojdanowski, Arthur Zimek (2010). "Visual Evaluation of Outlier Detection Models". 15th International Conference on Database Systems for Advanced Applications (DASFAA 2010) (Tsukuba, Japan: Springer). doi:10.1007/978-3-642-12098-5_34.
- [10] Elke Achtert, Sascha Goldhofer, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2012). "Evaluation of Clusterings – Metrics and Visual Support". 28th International Conference on Data Engineering (ICDE) (Washington, DC). doi:10.1109/ICDE.2012.128.
- [11] Elke Achtert, Hans-Peter Kriegel, Erich Schubert, Arthur Zimek (2013). "Interactive Data Mining with 3D-Parallel-Coordinate-Trees". Proceedings of the ACM International Conference on Management of Data (SIGMOD) (New York City, NY). doi:10.1145/2463676.2463696.