Druid (open-source data store)

Druid
Original author(s)	Eric Tschetter, Fangjin Yang
Developer(s)	The Druid Community
Stable release	0.7.0 / 23 February 2015
Development status	Active
Written in	Java
Operating system	Cross-platform
Type	distributed, real-time, column-oriented data store
License	Apache License 2.0
Website	druid.io

Druid is a column-oriented, open-source, distributed data store written in Java. It is designed to quickly ingest massive quantities of time-series data and to make that data available to queries immediately.^[1] Data that can be queried as soon as it is ingested is sometimes referred to as real-time data.

On the developer Q&A site Stack Overflow, Druid is described as "open-source infrastructure for real-time exploratory analytics on large datasets."^[2] As it ingests time-series data, Druid chunks and compresses that data into column-based, queryable segments.^[3]

Architecture^[4]

Fully deployed, Druid runs as a cluster of specialized nodes. The architecture is fault-tolerant: data is stored redundantly, and each node type has multiple members.^[5] In addition, the cluster relies on external dependencies for coordination (Apache ZooKeeper), storage of metadata (MySQL), and a deep storage facility (e.g., HDFS, Amazon S3, or Apache Cassandra).

Data Ingestion

Data is ingested by Druid directly through its real-time nodes, or batch-loaded into historical nodes from a deep storage facility. Real-time nodes accept JSON-formatted data from a streaming data source. Batch-loaded data can be formatted as JSON, CSV, or TSV. Real-time nodes temporarily store and serve data in real time, but eventually push the data to the deep storage facility, from which it is loaded into historical nodes. Historical nodes hold the bulk of the data in the cluster.

Real-time nodes chunk data into segments, and are designed to frequently move these segments out to deep storage. To maintain cluster awareness of the location of data, these nodes must interact with MySQL to update metadata about the segments, and with Apache ZooKeeper to monitor their transfer.
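The chunking step can be illustrated with a minimal Python sketch. It is not Druid's implementation: the function names, the sample events, the hourly bucket size, and the simple row-to-column pivot are all assumptions chosen for illustration. It groups JSON-style events into time-bucketed segments and then lays one segment out column by column.

```python
from collections import defaultdict
from datetime import datetime, timezone


def segment_start(timestamp, granularity_seconds=3600):
    """Return the start of the time bucket containing an ISO 8601 timestamp."""
    ts = datetime.fromisoformat(timestamp)  # expects an explicit UTC offset
    epoch = int(ts.timestamp())
    bucket = epoch - (epoch % granularity_seconds)
    return datetime.fromtimestamp(bucket, tz=timezone.utc).isoformat()


def chunk_into_segments(events, granularity_seconds=3600):
    """Group events into segments keyed by the start of their time bucket."""
    segments = defaultdict(list)
    for event in events:
        key = segment_start(event["timestamp"], granularity_seconds)
        segments[key].append(event)
    return dict(segments)


def to_columns(rows):
    """Pivot a list of row dicts into one list per column (toy columnar layout).

    Assumes every row has the same keys.
    """
    columns = defaultdict(list)
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return dict(columns)


# Hypothetical sample events; field names are illustrative only.
EVENTS = [
    {"timestamp": "2015-02-23T10:15:00+00:00", "page": "Druid", "count": 1},
    {"timestamp": "2015-02-23T10:45:00+00:00", "page": "Java", "count": 2},
    {"timestamp": "2015-02-23T11:05:00+00:00", "page": "Druid", "count": 4},
]

segments = chunk_into_segments(EVENTS)
first_hour = to_columns(segments["2015-02-23T10:00:00+00:00"])
print(sorted(segments))
print(first_hour["page"], first_hour["count"])
```

Bucketing by time keeps each segment self-contained for a fixed interval, which is what allows a finished segment to be handed off to deep storage as a unit.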

Query Management

Client queries first hit broker nodes, which forward them to the appropriate data nodes (either historical or real-time). Since Druid segments may be partitioned, an incoming query can require data from multiple segments and partitions (or shards) stored on different nodes in the cluster. Brokers are able to learn which nodes have the required data, and also merge partial results before returning the aggregated result.
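The scatter-and-merge role of a broker can be sketched as follows. This is a simplified in-memory model rather than Druid code: the cluster dictionary, node names, segment identifiers, and function names are hypothetical. Each segment is read from a single node even when replicas exist, and the per-segment partial sums are merged into one result.

```python
from collections import Counter


def route(cluster, segment_ids):
    """Pick one node for each requested segment (first node found that holds it)."""
    routing = {}
    for seg_id in segment_ids:
        for node, segments in cluster.items():
            if seg_id in segments:
                routing[seg_id] = node
                break
    return routing


def partial_sum(rows, dimension, metric):
    """Work done on a data node: sum a metric per dimension value in one segment."""
    totals = Counter()
    for row in rows:
        totals[row[dimension]] += row[metric]
    return totals


def broker_query(cluster, segment_ids, dimension, metric):
    """Work done on the broker: fan out per segment, then merge partial results."""
    merged = Counter()
    for seg_id, node in route(cluster, segment_ids).items():
        merged.update(partial_sum(cluster[node][seg_id], dimension, metric))
    return dict(merged)


# Hypothetical cluster: node name -> {segment id -> rows}.
# "seg-10h" is replicated on two historical nodes.
CLUSTER = {
    "historical-1": {"seg-10h": [{"page": "a", "count": 2}, {"page": "b", "count": 1}]},
    "historical-2": {"seg-10h": [{"page": "a", "count": 2}, {"page": "b", "count": 1}]},
    "realtime-1": {"seg-11h": [{"page": "a", "count": 3}]},
}

print(broker_query(CLUSTER, ["seg-10h", "seg-11h"], "page", "count"))
```

Routing each segment to exactly one node matters in this model because segments are stored redundantly; querying every replica would count the same rows more than once.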

Cluster Management

Operations relating to data management in historical nodes are overseen by coordinator nodes, which are the prime users of the MySQL metadata tables. Apache ZooKeeper is used to register all nodes, manage certain aspects of internode communications, and provide for leader elections.

Features

Time series queries
TopN queries
GroupBy queries
Join queries are not supported as of this writing
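Druid queries are expressed as JSON objects. The Python sketch below builds request bodies for the three query types listed above. The field names follow the JSON query format in the project documentation as understood here and should be checked against the documentation for the release in use; the data source, dimension, and metric names are hypothetical, and the helper functions are not part of any Druid client library.

```python
import json


def _long_sum(metric):
    """Aggregator spec that sums a numeric column and reports it under the same name."""
    return {"type": "longSum", "name": metric, "fieldName": metric}


def timeseries_query(data_source, interval, metric, granularity="hour"):
    """Aggregate a metric over time buckets."""
    return {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": [interval],
        "aggregations": [_long_sum(metric)],
    }


def topn_query(data_source, interval, dimension, metric, threshold=5):
    """Rank the values of one dimension by a metric and keep the top few."""
    return {
        "queryType": "topN",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "dimension": dimension,
        "metric": metric,
        "threshold": threshold,
        "aggregations": [_long_sum(metric)],
    }


def groupby_query(data_source, interval, dimensions, metric):
    """Aggregate a metric for each combination of several dimensions."""
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": "all",
        "intervals": [interval],
        "dimensions": list(dimensions),
        "aggregations": [_long_sum(metric)],
    }


# Hypothetical data source and column names.
query = topn_query("sample_events", "2015-02-01/2015-02-02", "page", "count", threshold=3)
print(json.dumps(query, indent=2))
```

The sketch only builds and prints the request body; how it is submitted to a broker node is left out, since that depends on the deployment.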

History

The project was started to power the analytics product of Metamarkets. The first line of code was committed by Eric Tschetter to a private GitHub repository in March 2011. He converted the first Metamarkets customer to it by the end of March 2011, and all customers were converted by mid-May. Metamarkets subsequently decided to invest more in the system and hired Fangjin Yang in September 2011. With Yang added to the development team, feature development of Druid accelerated rapidly. The project was open-sourced under the GPL in October 2012.^[6]^[7] Since then, a number of organizations and companies, including Netflix^[8] and Yahoo,^[9] have integrated Druid into their backend technology.

References

[1] Hemsoth, Nicole. "Druid Summons Strength in Real-Time", datanami, 8 November 2012
[2] Stack Overflow shorthand tag description
[3] Monash, Curt. "Metamarkets Druid Overview", DBMS2, 16 June 2012
[4] Druid Project Documentation
[5] Yang, Fangjin; Tschetter, Eric; Léauté, Xavier; Ray, Nelson; Merlino, Gian; Ganguli, Deep. "Druid: A Real-time Analytical Data Store", Metamarkets, retrieved 6 February 2014
[6] Tschetter, Eric. "Introducing Druid", Druid.io, 24 October 2012
[7] Higginbotham, Stacey. "Metamarkets open sources Druid, its in-memory database", GigaOM, 24 October 2012
[8] Bae, Jae Hyeon; Yuan, Danny; Tonse, Sudhir. "Announcing Suro: Backbone of Netflix's Data Pipeline", Netflix, 9 December 2013
[9] Iranmanesh, Reza; Chandrashekar, Srikalyan. "Pushing the limits of Realtime Analytics using Druid", Slideshare, 19 July 2014