H2O (software)

H2O

Original author(s)	H2O.ai
Developer(s)	H2O.ai
Initial release	2011

Stable release	Tutte (3.10.2.2) / January 12, 2017

Repository	github.com/h2oai/h2o-3
Development status	Active
Written in	Java, Python, and R^[1]^[2]^[3]
Operating system	Linux, macOS, and Microsoft Windows
Platform	Apache Hadoop Distributed File System; Amazon EC2, Google Compute Engine, and Microsoft Azure.
Standard(s)	Databricks certified on Spark.^[3]
Available in	English
Type	big data analytics, machine learning, statistical learning theory^[4]
License	Apache license 2.0^[5]
Website	www.h2o.ai
As of	1 June 2015

H2O is open-source software for big-data analysis. It is produced by the company H2O.ai (formerly 0xdata), which launched in 2011 in Silicon Valley. H2O allows users to fit thousands of potential models as part of discovering patterns in data.

H2O's mathematical core is developed under the leadership of Arno Candel, who was named one of Fortune's 2014 "Big Data All-Stars".^[6] The firm's scientific advisors are experts on statistical learning theory and mathematical optimization.

The H2O software can be called from the statistical package R, Python, and other environments. It is used for exploring and analyzing datasets held in cloud computing systems and in the Apache Hadoop Distributed File System, as well as on the conventional operating systems Linux, macOS, and Microsoft Windows. The H2O software is written in Java, Python, and R. Its graphical user interface is compatible with four browsers: Chrome, Safari, Firefox, and Internet Explorer.

H2O

The H2O project aims to develop an analytical interface for cloud computing, providing users with tools for data analysis.^[1]

Leadership

Cliff Click (left) and SriSatish Ambati (right) speak at an event for H2O.ai (0xdata).

H2O.ai was co-founded by Cliff Click and SriSatish Ambati. (Photograph by H2O.ai released under Creative Commons BY 2.0 license.)

H2O's chief executive, SriSatish Ambati, had helped to start Platfora, a big-data firm that develops software for the Apache Hadoop distributed file system.^[7] Ambati was frustrated with the performance of the R programming language on large data-sets and started the development of H2O software with encouragement from John Chambers,^[2] who created the S programming language at Bell Labs and who is a member of R's core team (which leads the development of R).^[2]^[8]^[9]

Ambati co-founded 0xdata with Cliff Click, who served as the chief technical officer of H2O and helped create much of H2O's product. Click helped to write the HotSpot Server Compiler and worked with Azul Systems to construct a big-data Java virtual machine (JVM).^[10] Click left H2O in February 2016.^[11] Leland Wilkinson, author of The Grammar of Graphics, serves as Chief Scientist and provides visualization leadership.^[12]

Scientific advisory council

Stanford University professor Trevor J. Hastie serves as an advisor to H2O.ai.

H2O's Scientific Advisory Council lists three mathematical scientists, who are all professors at Stanford University:^[13] Professor Stephen P. Boyd is an expert in convex minimization and applications in statistics and electrical engineering.^[14] Robert Tibshirani, a collaborator with Bradley Efron on bootstrapping,^[15] is an expert on generalized additive models and statistical learning theory.^[16]^[17] Trevor Hastie, a collaborator of John Chambers on S,^[9] is an expert on generalized additive models and statistical learning theory.^[16]^[17]

H2O.ai: A Silicon Valley start-up

The software is open-source and freely distributed. The company receives fees for providing customer service and customized extensions. In November 2014, its twenty clients included Cisco, eBay, Nielsen, and PayPal, according to VentureBeat.^[2]

Mining of big data

Big datasets are too large to be analyzed using traditional software like R. The H2O software provides data structures and methods suitable for big data. H2O allows users to analyze and visualize whole sets of data without using the Procrustean strategy of studying only a small subset with a conventional statistical package.^[2] H2O's statistical algorithms include k-means clustering, generalized linear models, distributed random forests, gradient boosting machines, naive Bayes, principal component analysis, and generalized low rank models.^[18]
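
The idea of analyzing a whole dataset rather than a small subset can be illustrated with a minimal sketch. It is not H2O's implementation: the Summary class and the summarize and chunks helpers are hypothetical names introduced here, and the example assumes only that the statistic of interest can be computed per chunk and then merged. Each chunk is reduced to a small summary, and the summaries are combined, so no single step needs all the rows at once.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Sufficient statistics for the mean and variance (hypothetical helper)."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    def merge(self, other):
        # Combining two summaries only requires adding their fields.
        return Summary(self.count + other.count,
                       self.total + other.total,
                       self.total_sq + other.total_sq)

    @property
    def mean(self):
        return self.total / self.count

    @property
    def variance(self):
        # Population variance: E[x^2] - (E[x])^2
        return self.total_sq / self.count - self.mean ** 2


def summarize(chunk):
    """Reduce one chunk of values to its summary."""
    return Summary(len(chunk), sum(chunk), sum(v * v for v in chunk))


def chunks(values, size):
    """Yield consecutive slices of at most `size` values."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


values = [float(v) for v in range(1, 101)]
overall = Summary()
for chunk in chunks(values, size=16):
    overall = overall.merge(summarize(chunk))

print(overall.count, overall.mean, overall.variance)
```

Because merging is associative, the per-chunk summaries could in principle be computed on different machines and combined in any order, which is the property that makes this style of computation suitable for data too large for one process.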

H2O is also able to run on Spark.^[19]

Iterative methods for real-time problems

H2O uses iterative methods that provide quick answers using all of the client's data. When a client cannot wait for an optimal solution, the client can interrupt the computations and use an approximate solution.^[1] In its approach to deep learning,^[2]^[18]^[20] H2O divides all the data into subsets and then analyzes each subset simultaneously using the same method. The results of these processes are combined to estimate parameters using the Hogwild scheme,^[21] a parallel stochastic gradient method.^[22] These methods allow H2O to provide answers that use all the client's data, rather than throwing away most of it and analyzing a subset with conventional software.
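
The Hogwild idea described above can be sketched on a toy problem. The following is an illustration under stated assumptions, not H2O's code: it fits a one-variable linear regression rather than a deep network, the function names (make_data, worker, hogwild_fit) are hypothetical, and it uses Python threads, which interleave rather than run truly in parallel. What it does show is the defining feature of the scheme: the data is split into subsets, each worker runs stochastic gradient descent on its own subset, and all workers update one shared parameter vector without taking any locks.

```python
import random
import threading


def make_data(n, true_w, true_b, seed=0):
    """Generate noiseless points on the line y = true_w * x + true_b."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, true_w * x + true_b) for x in xs]


def worker(shard, params, epochs, lr, seed):
    """Run SGD over one shard, updating the shared parameters without locks."""
    rng = random.Random(seed)
    order = list(range(len(shard)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = shard[i]
            # Read the shared values as they are; they may be slightly stale.
            err = params[0] * x + params[1] - y
            # Unsynchronized writes: other workers may update concurrently.
            params[0] -= lr * err * x
            params[1] -= lr * err


def hogwild_fit(data, n_workers=4, epochs=50, lr=0.05):
    """Split the data into shards and train on all shards concurrently."""
    params = [0.0, 0.0]  # shared [slope, intercept]
    shards = [data[k::n_workers] for k in range(n_workers)]
    threads = [
        threading.Thread(target=worker, args=(shard, params, epochs, lr, k))
        for k, shard in enumerate(shards)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return params


data = make_data(400, true_w=3.0, true_b=-1.0)
slope, intercept = hogwild_fit(data)
print(slope, intercept)
```

Because every update moves the shared parameters toward the same solution, an occasional overwritten or stale update costs a little progress but does not prevent convergence here. Stopping the workers early would leave an approximate but usable estimate in params, which corresponds to the interruptible behavior described above.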

Software

Programming languages

The H2O software has an interface to the following programming languages: Java (6 or later), Python (2.7.x, 3.5.x), R (3.0.0 or later) and Scala (1.4-1.6).^[2]^[3]

Operating systems

The H2O software runs on conventional operating systems: Microsoft Windows (7 or later), Mac OS X (10.9 or later), and Linux (Ubuntu 12.04; RHEL/CentOS 6 or later).^[3] It also runs on big-data systems, in particular on several popular distributions of the Apache Hadoop Distributed File System (HDFS): Cloudera (5.1 or later), MapR (3.0 or later), and Hortonworks (HDP 2.1 or later). It also operates in cloud computing environments, for example Amazon EC2, Google Compute Engine, and Microsoft Azure. The H2O Sparkling Water software is Databricks-certified on Apache Spark.^[3]

Graphical user interface and browsers

Its graphical user interface is compatible with four browsers (unless specified, in their latest versions as of 1 June 2015): Chrome, Safari, Firefox, Internet Explorer (IE10).^[3]

Notes

[1] Harris (2012)
[2] Novet (2014)
[3] "Recommended systems for H2O". 0xdata.com. H2O.ai. May 2015.
[4] Hardy (2014)
[5] https://github.com/h2oai/h2o-2/blob/master/LICENSE.txt
[6] Hackett (2014)
[7] Gage (2013)
[8] ACM honors Dr. John M. Chambers of Bell Labs with the 1998 ACM Software System Award for creating "S System" software, ACM press release, March 29, 1999. Accessed 8 December 2008.
[9] J. Chambers and T. Hastie, Statistical Models in S, Wadsworth/Brooks Cole, 1991.
[10] Schuster, Werner (10 January 2014). "Cliff Click on in-memory processing, 0xdata H20, efficient low latency Java and GCs". InfoQ. Retrieved 2 June 2015.
[11] "Winds of Change". Cliff Click. 2016.
[12] "H2O.ai". www.h2o.ai. Retrieved 2017-01-28.
[13] "About". 0xdata. 2015.
[14] Boyd, Stephen P.; Vandenberghe, Lieven (2004). Convex optimization. Cambridge University Press. ISBN 978-0-521-83378-3. Retrieved October 15, 2011. (Free download of PDF of corrected 7th printing, 2009)
[15] Bradley Efron; Robert Tibshirani (1994). An Introduction to the Bootstrap. Chapman & Hall/CRC. ISBN 978-0-412-04231-7.
[16] Hastie, T. J.; Tibshirani, R. J. (1990). Generalized additive models. Chapman & Hall/CRC. ISBN 978-0-412-34390-2.
[17] Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome H. (2011). The Elements of Statistical Learning (second ed.). Retrieved 15 June 2012. (Free download of 10th printing, June 2013)
[18] Aiello, Spencer; Tom Kraljevic; Petr Maj (2015), with contributions from the 0xdata team, "h2o: R Interface for H2O", The Comprehensive R Archive Network (CRAN), Contributed Packages, The R Project for Statistical Computing (3.0.0.12)
[19] "FAQ — H2O 3.10.2.1 documentation". docs.h2o.ai. Retrieved 2017-01-28.
[20] "Prediction of IncRNA using Deep Learning Approach". Tripathi, Rashmi; Kumari, Vandana; Patel, Sunil; Singh, Yashbir; Varadwaj, Pritish. International Conference on Advances in Biotechnology (BioTech). Proceedings: 138-142. Singapore: Global Science and Technology Forum. (2015)
[21] Description of the iterative method for computing maximum-likelihood estimates for a generalized linear model.
[22] Benjamin Recht; Re, Christopher; Wright, Stephen & Feng Niu (2011). J. Shawe-Taylor; R.S. Zemel; P.L. Bartlett; F. Pereira & K.Q. Weinberger, eds. "Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent" (PDF). Advances in Neural Information Processing Systems. Curran Associates, Inc. 24: 693–701. Recht's PDF

References

Gage, Deborah (15 April 2013). "Platfora founder goes in search of big-data answers". Wall Street Journal. Retrieved 2 June 2015.
Hackett, Robert (3 August 2014), Nusca, Andrew; Hackett, Robert; Gupta, Shalene, eds., "Arno Candel, physicist and hacker, 0xdata", Fortune, Meet Fortune's 2014 Big Data All-Stars, retrieved 2 June 2015
Hardy, Quentin (3 May 2014). "Valuable humans in our digital future". New York Times. Retrieved 1 June 2015.
Harris, Derrick (14 August 2012). "How 0xdata wants to help everyone become data scientists". Gigaom Research. Retrieved 1 June 2015.
Novet, Jordan (7 November 2014). "0xdata takes $8.9M and becomes H2O to match its open-source machine-learning project". VentureBeat. Retrieved 1 June 2015.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.