Resilient distributed dataset
In computer science, a resilient distributed dataset (RDD) is a logical collection of data partitioned across machines.[1][2]
Background
With the advent of big data, technologies for processing datasets that span terabytes have evolved rapidly in response to the challenges and requirements of their adopters. Hadoop's distributed file system, HDFS, is now ubiquitous in the big data space, with MapReduce as its computation framework.
Apache Spark emerged as an alternative computation framework, one that aims to serve as a general-purpose engine on top of an underlying big data file system. It was initially described in a research paper published at UC Berkeley in July 2011. The paper proposed a new abstraction, the RDD: a collection that would be stored in memory while remaining fault-tolerant.
Characteristics
Some of the key characteristics and advantages of RDDs are listed below.
- RDDs are read-only, partitioned collections of records.
- RDDs are well suited to iterative algorithms and interactive data mining workloads, which reuse intermediate results across computations.
- RDDs are a good fit for many parallel applications, because such applications apply the same operation to many data items.
- RDDs can only be created through deterministic operations on data in stable storage or on other RDDs.
- Each RDD carries information about its derivation (its lineage) from other datasets, which is enough to recompute its partitions at any time. This lineage is the basis of the resilience and fault tolerance of the data.
- RDDs allow developers to keep a particular dataset in memory and pipeline operations over it. In MapReduce, reconstructing such a dataset would require reading it back from disks across the cluster.
- Data scientists can iterate over an RDD many times during interactive querying, taking advantage of its in-memory representation rather than waiting on disk I/O.
- Each RDD exposes a common interface through which the runtime can access its partitions, the preferred locations for its data, its dependencies on other RDDs, and so on.
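The lineage-based recovery described above can be sketched with a toy model in plain Python. This is not Spark's actual API; the class and method names are illustrative. The sketch shows the two ideas the list relies on: transformations are deterministic and lazy (they only extend a lineage graph), so any partition can be rebuilt from the source data at any time.

```python
# Toy model of an RDD: an immutable, partitioned collection that records
# its lineage (how it was derived) rather than mutating data in place.
# Illustrative only; this is not Spark's implementation.

class ToyRDD:
    def __init__(self, partitions=None, parent=None, fn=None):
        self._partitions = partitions  # only set on a source RDD
        self._parent = parent          # lineage: the RDD this one derives from
        self._fn = fn                  # deterministic operation applied to the parent

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage graph.
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        # A lost partition can be recomputed at any time by replaying
        # the lineage back to the source data.
        if self._parent is None:
            return self._partitions
        return [[self._fn(x) for x in part] for part in self._parent.compute()]

# Source data split into two partitions.
source = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)

# Even if doubled's in-memory results were lost, they can be rebuilt
# from lineage without any replicated copy of the derived data.
print(doubled.compute())  # [[2, 4], [6, 8]]
```

Because every transformation is deterministic, storing this small lineage record is cheaper than replicating the derived data itself, which is what makes the in-memory representation fault-tolerant.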
RDDs thus improve performance relative to earlier implementations. RDDs form the core of Spark, which has become one of the most widely used big data computation frameworks.
References
- ↑ Zaharia, Matei; Chowdhury, Mosharaf; Das, Tathagata; Dave, Ankur; Ma, Justin; McCauley, Murphy; Franklin, Michael J.; Shenker, Scott; Stoica, Ion. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. University of California, Berkeley.
- ↑ Ryza, Sandy; Laserson, Uri; Owen, Sean; Wills, Josh. Advanced Analytics with Spark. O'Reilly Media.