Apache Hive

This article is about a data warehouse infrastructure. For the Java application framework, see Apache Beehive.
Apache Hive
Developer(s) Contributors
Stable release 1.1.0 [1] / March 8, 2015
Development status Active
Written in Java
Operating system Cross-platform
Type Database engine or management software released under an open source license

License Apache License 2.0
Website hive.apache.org

Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.[2] While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix.[3][4] Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.[5]

Features

Apache Hive supports analysis of large datasets stored in Hadoop's HDFS and in compatible file systems such as the Amazon S3 filesystem. It provides an SQL-like language called HiveQL[6] with schema on read, and transparently converts queries into MapReduce, Apache Tez,[7] or Apache Spark jobs. All three execution engines can run in Hadoop YARN. To accelerate queries, it provides indexes, including bitmap indexes.[8]
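A minimal HiveQL sketch of what this looks like in practice; the table, column names, and HDFS path below are illustrative, not taken from the article. The SET statement picks which of the three execution engines (mr, tez, or spark) the query is compiled to.

  -- Define a schema over files already sitting in HDFS (schema on read);
  -- the path and column names here are illustrative.
  CREATE EXTERNAL TABLE access_logs (
    ip    STRING,
    ts    STRING,
    url   STRING,
    bytes BIGINT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/logs/';

  -- Choose the execution engine the query is translated to (mr, tez, or spark).
  SET hive.execution.engine=tez;

  -- An ordinary-looking SQL query; Hive compiles it into a job for the engine above.
  SELECT url, COUNT(*) AS hits
  FROM access_logs
  GROUP BY url
  ORDER BY hits DESC
  LIMIT 10;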

By default, Hive stores metadata in an embedded Apache Derby database; other client/server databases such as MySQL can optionally be used.[9]
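As a sketch, switching the metastore from the embedded Derby database to a MySQL server is done through properties in hive-site.xml; the host, database name, and credentials below are placeholders.

  <!-- hive-site.xml: point the metastore at an external MySQL database
       instead of the embedded Derby database. -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-host:3306/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive_password</value>
  </property>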

Hive currently supports four file formats: TEXTFILE,[10] SEQUENCEFILE, ORC,[11] and RCFILE.[12][13][14] Apache Parquet can be read via a plugin in versions later than 0.10 and natively starting at 0.13.[15][16]
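The storage format is chosen per table with the STORED AS clause, as in the following sketch; the table and column names are illustrative.

  -- Text and ORC variants of the same illustrative table.
  CREATE TABLE page_views_text (user_id BIGINT, url STRING)
  STORED AS TEXTFILE;

  CREATE TABLE page_views_orc (user_id BIGINT, url STRING)
  STORED AS ORC;

  -- Data is converted between formats with an ordinary INSERT ... SELECT.
  INSERT OVERWRITE TABLE page_views_orc
  SELECT user_id, url FROM page_views_text;

  -- From Hive 0.13 onward, Parquet is also available natively.
  CREATE TABLE page_views_parquet (user_id BIGINT, url STRING)
  STORED AS PARQUET;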

Other features of Hive include built-in user-defined functions (UDFs) for manipulating dates, strings, and other data types, and the ability to operate on compressed data stored in the Hadoop ecosystem.

HiveQL

While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not found in SQL, including multi-table inserts and CREATE TABLE AS SELECT, but offers only basic support for indexes. HiveQL also lacks support for transactions and materialized views, and offers only limited subquery support.[17][18] Support for insert, update, and delete with full ACID functionality was made available with release 0.14.[19]
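The following sketch illustrates two of these extensions (multi-table insert and CREATE TABLE AS SELECT) and the ACID support added in 0.14, which requires a bucketed ORC table declared transactional (and a transaction manager enabled in the Hive configuration); all table and column names are illustrative.

  -- Multi-table insert: one scan of the source feeds several destination tables.
  FROM page_views
  INSERT OVERWRITE TABLE daily_counts SELECT url, COUNT(*) GROUP BY url
  INSERT OVERWRITE TABLE daily_users  SELECT DISTINCT user_id;

  -- CREATE TABLE AS SELECT.
  CREATE TABLE top_urls AS
  SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url;

  -- ACID UPDATE/DELETE (Hive 0.14+) on a bucketed, transactional ORC table.
  CREATE TABLE users_acid (user_id BIGINT, name STRING)
  CLUSTERED BY (user_id) INTO 4 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

  UPDATE users_acid SET name = 'anonymous' WHERE user_id = 42;
  DELETE FROM users_acid WHERE user_id = 7;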

Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.[20]
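The plan produced by the compiler can be inspected with EXPLAIN, as in this sketch (table and column names illustrative); the output lists the stages of the resulting DAG before anything is submitted to Hadoop.

  EXPLAIN
  SELECT url, COUNT(*) AS hits
  FROM page_views
  GROUP BY url;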
