MonetDB

Developer(s): MonetDB Developer Team
Stable release: Oct2014 / November 2014
Written in: C
Operating system: Cross-platform
Type: Column-oriented DBMS, RDBMS
License: MonetDB License (based on the MPL 1.1)
Website: www.monetdb.org

MonetDB is an open-source column-oriented database management system developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It was designed to provide high performance on complex queries against large databases, such as combining tables with hundreds of columns and millions of rows. MonetDB has been applied in high-performance applications for online analytical processing (OLAP), data mining, GIS,[1] RDF,[2] streaming data processing,[3] text retrieval and sequence alignment processing.[4]

History

Data mining projects in the 1990s required improved analytical database support. This resulted in a CWI spin-off called Data Distilleries, which used early MonetDB implementations in its analytical suite. Data Distilleries eventually became a subsidiary of SPSS in 2003, which in turn was acquired by IBM in 2009.[3]

MonetDB in its current form was first created in 2002 by doctoral student Peter Alexander Boncz and professor Martin L. Kersten as part of the 1990s MAGNUM research project at the University of Amsterdam.[5] It was initially called simply Monet, after the French impressionist painter Claude Monet. The first version under an open-source software license (a modified version of the Mozilla Public License) was released on September 30, 2004. When MonetDB version 4 was released as open source, the MonetDB/CWI team added many extensions to the code base, including a new SQL front-end supporting the SQL:2003 standard.[6]

MonetDB introduced innovations in all layers of the DBMS: a storage model based on vertical fragmentation and a modern CPU-tuned query execution architecture that often gave MonetDB a speed advantage over the same algorithm running on a typical interpreter-based RDBMS. It was one of the first database systems to tune query optimization for CPU caches. MonetDB includes automatic and self-tuning indexes, run-time query optimization, and a modular software architecture.[7][8]

By 2008, a follow-on project called X100 (MonetDB/X100) had started, which evolved into the VectorWise technology. VectorWise was acquired by Actian Corporation, integrated with the Ingres database and sold as a commercial product.[9][10]

In 2011, a major effort to renovate the MonetDB codebase was started. As part of it, the code for the MonetDB 4 kernel and its XQuery components was frozen. In MonetDB 5, parts of the SQL layer were pushed into the kernel.[6] The resulting changes altered the internal APIs, as the system transitioned from the MonetDB Instruction Language (MIL) to the MonetDB Assembly Language (MAL). Older top-level query interfaces that were no longer maintained were also removed. The first was XQuery, which relied on MonetDB 4 and was never ported to version 5.[11] Support for the experimental Jaql interface was removed with the October 2014 release.[12]

Architecture

The MonetDB architecture is organized in three layers, each with its own set of optimizers.[13] The front-end is the top layer, providing query interfaces for SQL, SciQL and SPARQL. Queries are parsed into domain-specific representations, such as relational algebra for SQL, and optimized. The generated logical execution plans are then translated into MonetDB Assembly Language (MAL) instructions, which are passed to the next layer. The middle, or back-end, layer provides a number of cost-based optimizers for MAL. The bottom layer is the database kernel, which provides access to the data stored in Binary Association Tables (BATs). Each BAT is a two-column table consisting of an object identifier column and a value column, and represents a single column in the database.[13]
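The BAT idea can be illustrated with a small sketch. This is not MonetDB code: the `BAT` class, the `decompose` helper and the sample rows are hypothetical names chosen for illustration, and the sketch only assumes what the paragraph above states, namely that each column is kept as its own table of object identifiers and values.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class BAT:
    """Toy Binary Association Table: one column stored as (oid, value) pairs."""
    oids: List[int] = field(default_factory=list)
    values: List[Any] = field(default_factory=list)

    def append(self, oid: int, value: Any) -> None:
        self.oids.append(oid)
        self.values.append(value)

    def select(self, predicate: Callable[[Any], bool]) -> List[int]:
        """Return the oids whose value satisfies the predicate."""
        return [o for o, v in zip(self.oids, self.values) if predicate(v)]

    def fetch(self, oids: List[int]) -> List[Any]:
        """Return the values for the given oids, in the order requested."""
        lookup = dict(zip(self.oids, self.values))
        return [lookup[o] for o in oids]


def decompose(rows: List[Tuple], column_names: List[str]) -> Dict[str, BAT]:
    """Vertically fragment row tuples into one BAT per column."""
    bats = {name: BAT() for name in column_names}
    for oid, row in enumerate(rows):
        for name, value in zip(column_names, row):
            bats[name].append(oid, value)
    return bats


rows = [("alice", 31), ("bob", 45), ("carol", 27)]
bats = decompose(rows, ["name", "age"])

# Roughly "SELECT name FROM t WHERE age > 30", evaluated one column at a time:
# the age column yields matching oids, which are then used to fetch names.
matching = bats["age"].select(lambda age: age > 30)
names = bats["name"].fetch(matching)
```

The point of the sketch is that the query touches only the two columns it needs, and that object identifiers are what tie the separate columns back into tuples.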

MonetDB's internal data representation also relies on the memory addressing ranges of contemporary CPUs, using demand paging of memory-mapped files, and thus departs from traditional DBMS designs that involve complex management of large data stores in limited memory.
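As a rough illustration of demand paging over a memory-mapped file, the sketch below stores a column of integers in a file and reads a single value through a mapping. The file name, the 32-bit little-endian layout and the helper names are assumptions made for this example and do not describe MonetDB's actual on-disk format.

```python
import mmap
import os
import struct
import tempfile

INT = struct.Struct("<i")  # assumed layout: 32-bit little-endian integers


def write_column(path, values):
    """Write the values back to back as fixed-width integers."""
    with open(path, "wb") as f:
        for v in values:
            f.write(INT.pack(v))


def read_value(path, position):
    """Read one value by offset; the OS pages in only the region touched."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return INT.unpack_from(mapped, position * INT.size)[0]


with tempfile.TemporaryDirectory() as tmp:
    column_file = os.path.join(tmp, "age.col")  # hypothetical file name
    write_column(column_file, [31, 45, 27, 52])
    third = read_value(column_file, 2)
```

Because the values have a fixed width, the position of a value translates directly into a byte offset, and the operating system rather than the database decides which pages are resident in memory.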

Query Recycling

Query recycling is an architecture for reusing the byproducts of the operator-at-a-time paradigm in a column-store DBMS. Recycling makes use of the generic idea of storing and reusing the results of expensive computations. Unlike low-level instruction caches, query recycling uses an optimizer to pre-select the instructions to cache. The technique is designed to improve query response times and throughput, while working in a self-organizing fashion.[14] The authors from the CWI Database Architectures group, Milena Ivanova, Martin Kersten, Niels Nes and Romulo Goncalves, won the "Best Paper Runner Up" award at the ACM SIGMOD 2009 conference for their work on Query Recycling.[15][16]
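The general idea can be sketched as a pool of intermediate results keyed by the instruction that produced them. The `Recycler` class, its `marked` set (standing in for the optimizer's pre-selection) and the `select_gt` operator are hypothetical, and the sketch leaves out invalidation on updates and eviction, which a real recycler would need.

```python
from typing import Callable, Dict, List, Set, Tuple


class Recycler:
    """Toy recycle pool: keeps results of instructions marked as worth caching."""

    def __init__(self, marked: Set[str]):
        self.marked = marked                      # operators pre-selected for caching
        self.pool: Dict[Tuple, List[int]] = {}    # (op, column, arg) -> result
        self.hits = 0

    def run(self, op: str, column: str, arg: int,
            compute: Callable[[], List[int]]) -> List[int]:
        key = (op, column, arg)
        if key in self.pool:
            self.hits += 1                        # reuse an earlier intermediate
            return self.pool[key]
        result = compute()
        if op in self.marked:
            self.pool[key] = result
        return result


ages = [31, 45, 27, 52]


def select_gt(threshold: int) -> List[int]:
    """Return the positions of values greater than the threshold."""
    return [i for i, v in enumerate(ages) if v > threshold]


recycler = Recycler(marked={"select_gt"})
first = recycler.run("select_gt", "age", 30, lambda: select_gt(30))
second = recycler.run("select_gt", "age", 30, lambda: select_gt(30))
```

The second call returns the stored intermediate without recomputing the selection, which is the reuse that the operator-at-a-time paradigm makes possible, since every operator materializes its full result.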

Database Cracking

MonetDB was one of the first databases to introduce Database Cracking, an incremental partial indexing and/or sorting of the data that directly exploits the columnar nature of MonetDB. Cracking shifts the cost of index maintenance from updates to query processing. The query pipeline optimizers are used to adapt the query plans so that they crack the data and propagate this information. The technique allows for improved access times and self-organized behavior.[17] The dissertation on Database Cracking received the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation Award.[18]
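A minimal sketch of the idea follows, assuming range queries of the form low <= v < high with low <= high. The `CrackedColumn` class is a hypothetical name; it partitions with list copies for readability, whereas an actual implementation would reorganize the column in place and carry object identifiers along with the values.

```python
import bisect
from typing import List


class CrackedColumn:
    """Toy cracker column: each range query partitions the data a bit further."""

    def __init__(self, values: List[int]):
        self.values = list(values)   # copy of the column, reorganized by queries
        self.pivots: List[int] = []  # sorted pivot values seen so far
        self.splits: List[int] = []  # values[:splits[i]] are all < pivots[i]

    def crack(self, pivot: int) -> int:
        """Partition so that values < pivot precede the returned position."""
        i = bisect.bisect_left(self.pivots, pivot)
        if i < len(self.pivots) and self.pivots[i] == pivot:
            return self.splits[i]    # already cracked on this pivot
        # Only the piece that can contain the pivot needs to be touched.
        lo = self.splits[i - 1] if i > 0 else 0
        hi = self.splits[i] if i < len(self.splits) else len(self.values)
        piece = self.values[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.values[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(i, pivot)
        self.splits.insert(i, split)
        return split

    def range_query(self, low: int, high: int) -> List[int]:
        """Return values v with low <= v < high, cracking as a side effect."""
        start = self.crack(low)
        end = self.crack(high)
        return self.values[start:end]


column = CrackedColumn([13, 4, 55, 9, 27, 1, 42, 18])
result = column.range_query(10, 30)
```

After this query the column is split into three pieces (below 10, from 10 up to 30, and 30 or above), so later queries with bounds inside an existing piece only need to reorganize that piece. No index is built up front; the physical order improves as a byproduct of the queries that run.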

Components

A number of extensions exist for MonetDB that extend the functionality of the database engine. Due to the three-layer architecture, top-level query interfaces can benefit from optimizations done in the backend and kernel layers.

SQL

MonetDB/SQL is a top-level extension, which provides complete support for transactions in compliance with the SQL:2003 standard.[13]

GIS

MonetDB/GIS is an extension to MonetDB/SQL with support for the Simple Features Access standard of the Open Geospatial Consortium (OGC).[1]

SciQL

SciQL is an SQL-based query language for science applications with arrays as first-class citizens. SciQL allows MonetDB to effectively function as an array database. SciQL is used in the European Union PlanetData and TELEIOS projects, together with the Data Vault technology, providing transparent access to large scientific data repositories.[19] Data Vaults map the data from the distributed repositories to SciQL arrays, allowing for improved handling of spatio-temporal data in MonetDB.[20] SciQL will be further extended for the Human Brain Project.[21]

Data Vaults

Data Vault is a database-attached external file repository for MonetDB, similar to the SQL/MED standard. The Data Vault technology allows for transparent integration with distributed or remote file repositories. It is designed for scientific data exploration and mining, specifically for remote sensing data.[20] There is support for the GeoTIFF (Earth observation), FITS (astronomy), MiniSEED (seismology) and NetCDF formats.[20][22] The data is stored in the file repository in its original format and loaded into the database lazily, only when needed. The system can also process the data upon ingestion, if the data format requires it.[23] As a result, even very large file repositories can be analyzed efficiently, as only the required data is processed in the database. The data can be accessed through either the MonetDB SQL or SciQL interfaces. The Data Vault technology was used in the European Union's TELEIOS project, which was aimed at building a virtual observatory for Earth observation data.[22]
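The lazy loading behavior described above can be sketched as follows. The `DataVault` class and the file layout (plain text files with one number per line) are assumptions made for illustration; they stand in for the scientific formats listed above and are not MonetDB's interface.

```python
import os
import tempfile
from typing import Dict, List


class DataVault:
    """Toy vault: files are catalogued up front, contents load on first access."""

    def __init__(self, directory: str):
        self.directory = directory
        self.catalog = sorted(os.listdir(directory))  # metadata only, no file contents
        self.loaded: Dict[str, List[float]] = {}

    def column(self, filename: str) -> List[float]:
        """Return the values of a file, reading it from the repository only once."""
        if filename not in self.loaded:
            path = os.path.join(self.directory, filename)
            with open(path) as f:
                self.loaded[filename] = [float(line) for line in f if line.strip()]
        return self.loaded[filename]


with tempfile.TemporaryDirectory() as repo:
    # Hypothetical repository with two small measurement files.
    for name, data in [("a.txt", [1.5, 2.5]), ("b.txt", [10.0, 20.0, 30.0])]:
        with open(os.path.join(repo, name), "w") as f:
            f.write("\n".join(str(x) for x in data))

    vault = DataVault(repo)
    catalogued = list(vault.catalog)
    total = sum(vault.column("a.txt"))      # only a.txt is read here
    loaded_files = sorted(vault.loaded)
```

Both files appear in the catalog, but only the one a query actually touches is read and converted, which is why the size of the repository matters less than the size of the data a query needs.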

DataCell

MonetDB/DataCell adds stream processing facilities on top of the column-store architecture of MonetDB. It provides facilities for on-the-fly data analysis within the database system itself.[3][13]

RDF/SPARQL

MonetDB/RDF is a SPARQL-based extension for working with linked data, which adds support for RDF and allows MonetDB to function as a triplestore. It is under development for the Linked Open Data 2 project.[2]

R integration

The MonetDB/R module allows user-defined functions (UDFs) written in R to be executed in the SQL layer of the system. This is done using R's native support for running embedded in another application, in this case inside the RDBMS. Previously, the MonetDB.R connector allowed users to access MonetDB data sources and process them in an R session. The newer R integration feature of MonetDB does not require data to be transferred between the RDBMS and the R session, reducing overhead and improving performance. The feature is intended to give users access to functions of the R statistical software for in-line analysis of data stored in the RDBMS. It complements the existing support for C UDFs and is intended to be used for in-database processing.[24]

SAM/BAM

MonetDB has a SAM/BAM module for efficient processing of sequence alignment data. Aimed at bioinformatics research, the module has a SAM/BAM data loader and a set of SQL UDFs for working with DNA data.[4] The module uses the popular SAMtools library.[25]

References

  1. "GeoSpatial - MonetDB". 4 March 2014.
  2. "MonetDB - LOD2 - Creating Knowledge out of Interlinked Data". 6 March 2014.
  3. "Streaming - MonetDB". 4 March 2014.
  4. "Life Sciences in MonetDB". 24 November 2014.
  5. Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. Ph.D. Thesis (Universiteit van Amsterdam). May 2002.
  6. MonetDB historic background
  7. Stefan Manegold (June 2006). "An Empirical Evaluation of XQuery Processors". Proceedings of the International Workshop on Performance and Evaluation of Data Management Systems (ExpDB) (ACM). doi:10.1016/j.is.2007.05.004. Retrieved December 11, 2013.
  8. P. A. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, J. Teubner. MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, IL, USA, June 2006.
  9. Marcin Zukowski and Peter Boncz (May 20, 2012). "From x100 to vectorwise: opportunities, challenges and things most researchers do not think about". Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (ACM): 861–862. doi:10.1145/2213836.2213967. ISBN 978-1-4503-1247-9.
  10. Inkster, D. and Zukowski, M. and Boncz, P. A. (September 20, 2011). "Integration of VectorWise with Ingres". ACM SIGMOD Record (ACM).
  11. "XQuery". 12 December 2014.
  12. "MonetDB Oct2014 Release". 12 December 2014.
  13. Idreos, S. and Groffen, F. E. and Nes, N. J. and Manegold, S. and Mullender, K. S. and Kersten, M. L. (March 2012). "MonetDB: Two Decades of Research in Column-oriented Database Architectures". IEEE Data Engineering Bulletin (IEEE): 40–45. Retrieved March 6, 2014.
  14. Ivanova, Milena G and Kersten, Martin L and Nes, Niels J and Goncalves, Romulo AP (2010). "An architecture for recycling intermediates in a column-store". ACM Transactions on Database Systems (TODS) (ACM) 35 (4): 24.
  15. "CWI database team wins Best Paper Runner Up at SIGMOD 2009". CWI Amsterdam. Retrieved 2009-07-01.
  16. "SIGMOD Awards". ACM SIGMOD. Retrieved 2014-07-01.
  17. Idreos, Stratos and Kersten, Martin L and Manegold, Stefan (2007). Database cracking. Proceedings of CIDR.
  18. "SIGMOD Awards". ACM SIGMOD. Retrieved 2014-12-12.
  19. Zhang, Y. and Scheers, L. H. A. and Kersten, M. L. and Ivanova, M. and Nes, N. J. (2011). "Astronomical Data Processing Using SciQL, an SQL Based Query Language for Array Data". Astronomical Data Analysis Software and Systems.
  20. Ivanova, Milena and Kersten, Martin and Manegold, Stefan (2012). Data vaults: a symbiosis between database technology and scientific file repositories. Springer Berlin Heidelberg. pp. 485–494.
  21. "SCIQL.ORG". 4 March 2014.
  22. Ivanova, Milena and Kargin, Yagiz and Kersten, Martin and Manegold, Stefan and Zhang, Ying and Datcu, Mihai and Molina, Daniela Espinoza (2013). "Data Vaults: A Database Welcome to Scientific File Repositories". SSDBM. ACM. doi:10.1145/2484838.2484876. ISBN 978-1-4503-1921-8.
  23. Kargin, Yagiz and Ivanova, Milena and Zhang, Ying and Manegold, Stefan and Kersten, Martin (August 2013). "Lazy ETL in Action: ETL Technology Dates Scientific Data". Proceedings VLDB Endowment 6 (12) (VLDB Endowment). pp. 1286–1289. doi:10.14778/2536274.2536297. ISSN 2150-8097.
  24. "Embedded R in MonetDB". 13 November 2014.
  25. "SAM/BAM installation". 24 November 2014.

Bibliography

  • Boncz, Peter and Manegold, Stefan and Kersten, Martin (1999). Database architecture optimized for the new bottleneck: Memory access. Proceedings of International Conference on Very Large Data Bases: 54–65.
  • Schmidt, Albrecht and Kersten, Martin and Windhouwer, Menzo and Waas, Florian (2001). "Efficient relational storage and retrieval of XML documents". The World Wide Web and Databases (Springer): 137–150.
  • Idreos, Stratos and Kersten, Martin L and Manegold, Stefan (2007). Database cracking. Proceedings of CIDR.
  • Boncz, Peter A and Kersten, Martin L and Manegold, Stefan (2008). "Breaking the memory wall in MonetDB". Communications of the ACM (ACM) 51 (12): 77–85. doi:10.1145/1409360.1409380.
  • Sidirourgos, Lefteris and Goncalves, Romulo and Kersten, Martin and Nes, Niels and Manegold, Stefan (2008). "Column-store support for RDF data management: not all swans are white". Proceedings of the VLDB Endowment (VLDB Endowment) 1 (2): 1553–1563. doi:10.14778/1454159.1454227.
  • Ivanova, Milena G. and Kersten, Martin L. and Nes, Niels J. and Goncalves, Romulo A.P. (2009). "An Architecture for Recycling Intermediates in a Column-store". Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. SIGMOD '09. ACM. pp. 309–320. doi:10.1145/1559845.1559879. ISBN 978-1-60558-551-2.
  • Manegold, Stefan and Boncz, Peter A. and Kersten, Martin L. (Dec 2000). "Optimizing Database Architecture for the New Bottleneck: Memory Access". The VLDB Journal (Springer-Verlag New York, Inc.) 9 (3): 231–246. doi:10.1007/s007780000031. ISSN 1066-8888.
  • Ivanova, Milena G and Kersten, Martin L and Nes, Niels J and Goncalves, Romulo AP (2010). "An architecture for recycling intermediates in a column-store". ACM Transactions on Database Systems (TODS) (ACM) 35 (4): 24.
  • Goncalves, Romulo and Kersten, Martin (2011). "The data cyclotron query processing scheme". ACM Transactions on Database Systems (TODS) (ACM) 36 (4): 27.
  • Kersten, Martin L and Idreos, Stratos and Manegold, Stefan and Liarou, Erietta (2011). "The researcher’s guide to the data deluge: Querying a scientific database in just a few seconds". PVLDB Challenges and Visions.
  • Kersten, M and Zhang, Ying and Ivanova, Milena and Nes, Niels (2011). SciQL, a query language for science applications. ACM. pp. 1–12.
  • Sidirourgos, Lefteris and Kersten, Martin and Boncz, Peter (2011). "SciBORQ: Scientific data management with Bounds On Runtime and Quality". Creative Commons.
  • Liarou, Erietta and Idreos, Stratos and Manegold, Stefan and Kersten, Martin (2012). "MonetDB/DataCell: online analytics in a streaming column-store". Proceedings of the VLDB Endowment (VLDB Endowment) 5 (12): 1910–1913. doi:10.14778/2367502.2367535.
  • Ivanova, Milena and Kersten, Martin and Manegold, Stefan (2012). Data vaults: a symbiosis between database technology and scientific file repositories. Springer Berlin Heidelberg. pp. 485–494.
  • Kargin, Yagiz and Ivanova, Milena and Zhang, Ying and Manegold, Stefan and Kersten, Martin (August 2013). "Lazy ETL in Action: ETL Technology Dates Scientific Data". Proceedings VLDB Endowment 6 (12) (VLDB Endowment). pp. 1286–1289. doi:10.14778/2536274.2536297. ISSN 2150-8097.
  • Sidirourgos, Lefteris and Kersten, Martin (2013). "Column imprints: a secondary index structure". Proceedings of the 2013 international conference on Management of data. ACM. pp. 893–904.
  • Ivanova, Milena and Kargin, Yagiz and Kersten, Martin and Manegold, Stefan and Zhang, Ying and Datcu, Mihai and Molina, Daniela Espinoza (2013). "Data Vaults: A Database Welcome to Scientific File Repositories". SSDBM. ACM. doi:10.1145/2484838.2484876. ISBN 978-1-4503-1921-8.

External links

This article is based on material taken from the Free On-line Dictionary of Computing prior to 1 November 2008 and incorporated under the "relicensing" terms of the GFDL, version 1.3 or later.