Resilient distributed dataset

In computer science, a resilient distributed dataset (RDD) is a read-only collection of data partitioned across the machines of a cluster, which can be rebuilt if a partition is lost.[1][2] It is the fundamental data structure of Apache Spark.
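
As a minimal illustration in Spark's Scala API, the following sketch distributes a local collection into an RDD with an explicit number of partitions; the application name, master URL, and partition count here are illustrative choices, not mandated by the sources above.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddPartitionExample {
      def main(args: Array[String]): Unit = {
        // A local SparkContext for illustration; a real deployment would
        // point at a cluster master instead of "local[*]".
        val conf = new SparkConf()
          .setAppName("rdd-partition-example")
          .setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Distribute a local collection into an RDD split into 4 partitions.
        // On a cluster, these partitions would be spread across worker machines.
        val numbers = sc.parallelize(1 to 1000, numSlices = 4)
        println(s"partitions: ${numbers.partitions.length}")

        sc.stop()
      }
    }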

Background

With the advent of big data, technologies for processing datasets at terabyte scale and beyond have evolved rapidly in response to the challenges and requirements of their adopters. Hadoop's distributed file system, HDFS, is now ubiquitous in the big data space, with MapReduce as its associated computation framework.

Apache Spark was created as a further computation framework, intended to serve as a general-purpose engine on top of an underlying big data file system. It was initially published as a research paper in July 2011 at UC Berkeley.[1] The paper proposed a new abstraction, the RDD, that would be stored in memory while remaining fault-tolerant.

Characteristics

Some of the key characteristics and advantages of RDDs are as follows:

  - Partitioned: the records of an RDD are divided into partitions that can be processed in parallel on different machines.
  - Read-only: an RDD is immutable; transformations such as map or filter derive new RDDs rather than modifying data in place.
  - In-memory: an RDD can be cached in cluster memory, so workloads that reuse data avoid repeated disk reads.
  - Lazily evaluated: transformations are not computed until an action such as count or collect requires a result.
  - Fault-tolerant: rather than replicating data, an RDD records the lineage of transformations used to build it, so any lost partition can be recomputed.

These properties help RDDs improve performance relative to existing disk-based implementations such as MapReduce, particularly for iterative and interactive workloads. RDDs form the core abstraction of Spark and underpin its computation engine.
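
A hedged sketch of these characteristics in Spark's Scala API (the data and transformations are illustrative): transformations only record lineage, cache() requests in-memory storage, and actions trigger evaluation.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCharacteristicsExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-characteristics").setMaster("local[*]"))

        val base = sc.parallelize(1 to 1000000, 8)

        // Transformations are lazy: this line only records the lineage
        // (base -> map -> filter); nothing is computed yet.
        val evens = base.map(_ * 2).filter(_ % 4 == 0)

        // Mark the derived RDD for in-memory caching across actions.
        evens.cache()

        // Actions force evaluation. The first action computes the lineage;
        // later actions reuse the cached partitions. If a cached partition
        // is lost, Spark recomputes it from the recorded lineage.
        println(evens.count())
        println(evens.take(5).mkString(", "))

        sc.stop()
      }
    }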

References

  1. Zaharia, Matei; Chowdhury, Mosharaf; Das, Tathagata; Dave, Ankur; Ma, Justin; McCauley, Murphy; Franklin, Michael J.; Shenker, Scott; Stoica, Ion. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing". University of California, Berkeley.
  2. Ryza, Sandy; Laserson, Uri; Owen, Sean; Wills, Josh. Advanced Analytics with Spark. O'Reilly Media.