Data lake
A data lake is a method of storing data within a system or repository in its natural format,[1] which facilitates the collocation of data in various schemata and structural forms, usually object blobs or files. The idea of a data lake is to have a single store of all data in the enterprise, ranging from raw data (an exact copy of source system data) to transformed data used for tasks including reporting, visualization, analytics and machine learning. The data lake includes structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and even binary data (images, audio, video), thus creating a centralized data store accommodating all forms of data.[2]
A data swamp is a deteriorated data lake that is inaccessible to its intended users and provides little value.[3][4]
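The distinction between raw and transformed data can be illustrated with a minimal sketch. It assumes a local directory standing in for the lake; the `raw` and `transformed` folder names, the function names and the sample payloads are all hypothetical and are not taken from any particular product.

```python
import csv
import io
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source_system: str, filename: str, payload: bytes) -> Path:
    """Store an exact, byte-for-byte copy of source data in the raw zone."""
    target = lake_root / "raw" / source_system / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)
    return target


def transform_orders(lake_root: Path, raw_csv: Path) -> Path:
    """Derive a transformed JSON dataset from a raw CSV file; the raw copy is left untouched."""
    rows = list(csv.DictReader(io.StringIO(raw_csv.read_text(encoding="utf-8"))))
    for row in rows:
        row["amount"] = float(row["amount"])  # apply a type only in the transformed copy
    target = lake_root / "transformed" / "orders.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return target


# Hypothetical sample data: structured, semi-structured and binary payloads.
lake = Path(tempfile.mkdtemp(prefix="lake_"))
csv_bytes = b"order_id,amount\n1,9.50\n2,20.00\n"
log_bytes = b"2017-05-19 12:00:01 INFO login user=42\n"
image_bytes = bytes([0x89, 0x50, 0x4E, 0x47])

raw_orders = ingest_raw(lake, "sales_db", "orders.csv", csv_bytes)
raw_log = ingest_raw(lake, "web", "app.log", log_bytes)
raw_image = ingest_raw(lake, "uploads", "logo.png", image_bytes)

orders_json = transform_orders(lake, raw_orders)
total = sum(r["amount"] for r in json.loads(orders_json.read_text(encoding="utf-8")))
print(total)
```

In this sketch every format lands in the same store unchanged, and structure is imposed only when a transformed dataset is derived for a specific task.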
Background
James Dixon, then chief technology officer at Pentaho, allegedly coined the term[5] to contrast it with the data mart, a smaller repository of interesting attributes extracted from raw data.[6] He argued that data marts have several inherent problems, often referred to as information siloing, and promoted data lakes. PricewaterhouseCoopers said that data lakes could "put an end to data silos."[7] In their study on data lakes they noted that enterprises were "starting to extract and place data for analytics into a single, Hadoop-based repository."
Many companies have since entered this space: Google, Microsoft, Zaloni, Teradata, Cloudera and Amazon, among others, all have data lake offerings.[8]
Examples
One example of a data lake is the distributed file system used in Apache Hadoop.
Many companies also use cloud storage services such as Amazon S3.[9] There is growing academic interest in the concept of data lakes; for instance, the Personal DataLake[10] project at Cardiff University is a new type of data lake that aims to manage the big data of individual users by providing a single point for collecting, organizing, and sharing personal data.[11]
Analytics is a primary reason why this kind of data architecture is becoming popular. It offers substantial advantages when data comes in a variety of structures[12] and is used for "big data" analysis, but the hype surrounding it,[13] whether for or against, warrants caution.
Early data lakes built on Hadoop 1.0 had limited capabilities: batch-oriented processing (MapReduce) was the only processing paradigm available. Interacting with the data lake required expertise in Java with MapReduce, or in higher-level tools such as Pig and Hive (which were themselves batch-oriented). With Hadoop 2.0, resource management was taken over by YARN (Yet Another Resource Negotiator), and this separation of duties made new processing paradigms such as streaming, interactive and online processing available via Hadoop and the data lake.
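The batch-oriented MapReduce paradigm can be sketched in a few lines. The example below is plain Python run in a single process, not the Hadoop Java API; the function names and the sample log lines are hypothetical and chosen only to show the map, shuffle and reduce steps.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (log_level, 1) for each log line; the level is assumed to be the third field."""
    fields = line.split()
    if len(fields) >= 3:
        yield fields[2], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Collapse the values for one key into a single count."""
    return key, sum(values)


def run_batch_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run the whole job over a complete input set; nothing is returned until the batch ends."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, standing in for files stored in the lake.
log_lines = [
    "2017-05-19 12:00:01 INFO login user=42",
    "2017-05-19 12:00:07 ERROR timeout user=42",
    "2017-05-19 12:01:13 INFO logout user=42",
]
level_counts = run_batch_job(log_lines)
print(level_counts)  # {'INFO': 2, 'ERROR': 1}
```

Because the job reads its full input before producing a result, it suits periodic reporting better than interactive or streaming use, which is the limitation the newer paradigms address.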
Criticism
In June 2015, David Needle characterized "so-called data lakes" as "one of the more controversial ways to manage big data".[14] PricewaterhouseCoopers were also careful to note in their research that not all data lake initiatives are successful. They quote Sean Martin, CTO of Cambridge Semantics:
“We see customers creating big data graveyards, dumping everything into HDFS [Hadoop Distributed File System] and hoping to do something with it down the road. But then they just lose track of what’s there. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents.”[7]
They describe companies that build successful data lakes as gradually maturing their lake as they figure out which data and metadata are important to the organization.
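Keeping track of what a lake contains is largely a question of metadata. The following is a minimal sketch of a catalog, assuming an in-memory registry; the `CatalogEntry` fields, the class and method names and the example paths are hypothetical and do not correspond to any specific catalog product.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class CatalogEntry:
    path: str
    source_system: str
    data_format: str
    description: str
    sha256: str  # fingerprint of the stored bytes


class Catalog:
    """In-memory registry describing the datasets held in the lake."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, path: str, source_system: str, data_format: str,
                 description: str, payload: bytes) -> CatalogEntry:
        """Record where a dataset came from, what it is and a checksum of its content."""
        entry = CatalogEntry(path, source_system, data_format, description,
                             hashlib.sha256(payload).hexdigest())
        self._entries[path] = entry
        return entry

    def search(self, keyword: str) -> List[CatalogEntry]:
        """Find datasets whose description or source system mentions the keyword."""
        needle = keyword.lower()
        return [e for e in self._entries.values()
                if needle in e.description.lower() or needle in e.source_system.lower()]

    def undocumented(self, stored_paths: Iterable[str]) -> List[str]:
        """List stored paths that have no catalog entry, i.e. candidates for a 'graveyard'."""
        return sorted(p for p in stored_paths if p not in self._entries)


catalog = Catalog()
catalog.register("raw/sales_db/orders.csv", "sales_db", "csv",
                 "Daily order export", b"order_id,amount\n1,9.50\n")
catalog.register("raw/web/app.log", "web", "log",
                 "Web application access log", b"2017-05-19 12:00:01 INFO login\n")

stored = ["raw/sales_db/orders.csv", "raw/web/app.log", "raw/misc/dump_0042.bin"]
orphans = catalog.undocumented(stored)
matches = [e.path for e in catalog.search("order")]
print(orphans)
print(matches)
```

A catalog of this kind can start small and grow as an organization learns which data and metadata matter to it.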
Another criticism is that the concept of the data lake is fuzzy and arbitrary: it is applied to any tool or data management practice that does not fit into the traditional data warehouse architecture. The data lake has variously been described as a technology such as Hadoop, as a raw data reservoir, as a hub for ETL offload, and as a central hub for self-service analytics. The concept has been overloaded with meanings, which puts the usefulness of the term into question.[15]
References
- The growing importance of big data quality
- Campbell, Chris. "Top Five Differences between DataWarehouses and Data Lakes". Blue-Granite.com. Retrieved May 19, 2017.
- Olavsrud, Thor. "3 keys to keep your data lake from becoming a data swamp". CIO. Retrieved 2017-07-05.
- Newman, Daniel. "6 Steps To Clean Up Your Data Swamp". Forbes. Retrieved 2017-07-05.
- Woods, Dan (21 July 2011). "Big data requires a big architecture". Tech. Forbes.
- Dixon, James. "Pentaho, Hadoop, and Data Lakes". James Dixon’s Blog. James. Retrieved 7 November 2015.
  "If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
- Stein, Brian; Morrison, Alan (2014). Data lakes and the promise of unsiloed data (pdf) (Report). Technology Forecast: Rethinking integration. PricewaterhouseCoopers.
- Weaver, Lance. "Why Companies are Jumping into Data Lakes". blog.equinox.com. Retrieved 19 May 2017.
- Tuulos, Ville (22 September 2015). "Petabyte-Scale Data Pipelines with Docker, Luigi and Elastic Spot Instances".
- http://ieeexplore.ieee.org/xpl/abstractAuthors.jsp?reload=true&arnumber=7310733
- http://www.researchgate.net/publication/283053696_Personal_Data_Lake_With_Data_Gravity_Pull
- Schmarzo, Bill. "Why do I need a Data Lake". infocus.ems.com. Retrieved May 24, 2017.
- "What, why and how of data lakes". 20 May 2016 – via TechiExpert.
- Needle, David (10 June 2015). "Hadoop Summit: Wrangling Big Data Requires Novel Tools, Techniques". Enterprise Apps. eWeek. Retrieved 1 November 2015.
  "Walter Maguire, chief field technologist at HP's Big Data Business Unit, discussed one of the more controversial ways to manage big data, so-called data lakes."
- "Are Data Lakes Fake News?". Sonra. 2017-08-08. Retrieved 2017-08-10.