Data warehouse

From Wikipedia, the free encyclopedia

A data warehouse is a computer system for archiving and analyzing an organisation's historical data, such as sales, salaries, or other information from day-to-day operations. Normally, an organisation copies information from its operational systems (such as sales and human resources) to the data warehouse on a regular schedule, such as every night or every weekend; after that, management can perform complex queries and analysis (such as data mining) on the information without slowing down the operational systems.

Contents

[edit] Definition

A data warehouse is the main repository of the organization's historical data, its corporate memory. For example, an organization would use the information that's stored in its data warehouse to find out what day of the week they sold the most widgets in May 1992, or how employee sick leave the week before Christmas differed between California and Quebec from 2001-2005. In other words, the data warehouse contains the raw material for management's decision support system.

While operational systems are optimized for simplicity and speed of modification (online transaction processing, or OLTP) through heavy use of database normalization and an entity-relationship model, the data warehouse is optimized for reporting and analysis (online analytical processing, or OLAP). Frequently data in Data Warehouses is heavily denormalised, summarised and/or stored in a dimension-based model but this is not always required to achieve acceptable query response times.

More formally, Bill Inmon (one of the earliest and most influential practitioners) defined a data warehouse as follows:

  • Subject-oriented, meaning that the data in the database is organized so that all the data elements relating to the same real-world event or object are linked together;
  • Time-variant, meaning that the changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time;
  • Non-volatile, meaning that data in the database is never over-written or deleted, but retained for future reporting; and,
  • Integrated, meaning that the database contains data from most or all of an organization's operational applications, and that this data is made consistent.

[edit] History of data warehousing

Data Warehouses became a distinct type of computer database during the late 1980s and early 1990s. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems. Operational systems were unable to meet this need for a range of reasons:

  • The processing load of reporting reduced the response time of the operational systems,
  • The database designs of operational systems were not optimized for information analysis and reporting,
  • Most organizations had more than one operational system, so company-wide reporting could not be supported from a single system, and
  • Development of reports in operational systems often required writing specific computer programs which was slow and expensive

As a result, separate computer databases began to be built that were specifically designed to support management information and analysis purposes. These data warehouses were able to bring in data from a range of different data sources, such as mainframe computers, minicomputers, as well as personal computers and office automation software such as spreadsheet, and integrate this information in a single place. This capability, coupled with user-friendly reporting tools and freedom from operational impacts, has led to a growth of this type of computer system.

As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:

  • Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.
  • Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure
  • Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)
  • Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

[edit] Data Warehouse Architecture

The term Data Warehouse Architecture is primarily used to today to describe the overall structure of a Business Intelligence system. Other historical terms include Decision Support Systems (DSS), Management Information Systems (MIS), and others.

The Data Warehouse Architecture describes the overall system from various perspectives such as Data, Process, and Infrastructure needed to communicate the structure, function and interrelationships of each component. The infrastructure or technology perspective details the various hardware and software products used to implement the distinct components of the overall system. The Data perspective typically diagrams the source and target data structures and aid the user in understanding what data assets are available and how they are related. The process perspective is primarily concerned with communicating the process and flow of data from the originating source system through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse.

[edit] Components of a data warehouse

The primary components of the majority of data warehouses are shown in the attached diagram and described in more detail below:

[edit] Data Sources

Data sources refers to any electronic repository of information that contains data of interest for management use or analytics. This definition covers mainframe databases (e.g. IBM DB2, ISAM, Adabas, Teradata, etc.), client-server databases (e.g. Teradata, IBM DB2, Oracle database, Informix, Microsoft SQL Server, etc.), PC databases (e.g. Microsoft Access, Alpha Five), spreadsheets (e.g. Microsoft Excel) and any other electronic store of data. Data needs to be passed from these systems to the data warehouse either on a transaction-by-transaction basis for real-time data warehouses or on a regular cycle (e.g.daily or weekly) for offline data warehouses.

[edit] Data Transformation Layer

The data transformation layer (aka Extract, transform, load - ETL or some variant) is the subsystem concerned with extraction of data from the data sources (source systems), transformation from the source format and structure into the target(data warehouse) format and structure, and loading into the data warehouse.

[edit] Data Warehouse

The data warehouse is normally (but does not have to be) a relational database. It must be organized to hold information in a structure that best supports not only query and reporting, but also advanced analysis techniques, like data mining. Most data warehouses hold information for at least 1 year and sometimes can reach half century, depending on the business/operations data retention requirement. As a result these databases can become very large.

[edit] Reporting

The data in the data warehouse must be available to the organization's staff if the data warehouse is to be useful. There are a very large number of software applications that perform this function, or reporting can be custom-developed. Examples of types of reporting tools include:

  • Business intelligence tools: These are software applications that simplify the process of development and production of business reports based on data warehouse data.
  • Executive information systems (known more widely as Dashboard (business): These are software applications that are used to display complex business metrics and information in a graphical way to allow rapid understanding.
  • OLAP Tools: OLAP tools form data into logical multi-dimensional structures and allow users to select which dimensions to view data by.
  • Data Mining: Data mining tools are software that allow users to perform detailed mathematical and statistical calculations on detailed data warehouse data to detect trends, identify patterns and analyze data.

[edit] Metadata

Metadata, or "data about data", is used not only to inform operators and users of the data warehouse about its status and the information held within the data warehouse, but also as a means of integration of incoming data and a tool to update and refine the underlying DW model.

Examples of data warehouse metadata include table and column names, their detailed descriptions, their connection to business meaningful names, the most recent data load date, the business meaning of a data item and the number of users that are logged in currently.

[edit] Operations

Data warehouse operations are comprised of the processes of loading, manipulating and extracting data from the data warehouse. Operations also cover user management, security, capacity management and related functions

[edit] Optional Components

In addition, the following components exist in some data warehouses:

  1. Dependent Data Marts: A dependent data mart is a physical database (either on the same hardware as the data warehouse or on a separate hardware platform) that receives all its information from the data warehouse. The purpose of a Data Mart is to provide a sub-set of the data warehouse's data for a specific purpose or to a specific sub-group of the organization. A data mart is exactly like a data warehouse technically, but it serves a different business purpose: it either holds information for only part of a company (such as a division), or it holds a small selection of information for the entire company (to support extra analysis without slowing down the main system). In either case, however, it is not the organization's official repository, the way a data warehouse is.
  2. Logical Data Marts: A logical data mart is a filtered view of the main data warehouse but does not physically exist as a separate data copy. This approach to data marts delivers the same benefits but has the additional advantages of not requiring additional (costly) disk space and it is always as current with data as the main data warehouse. The downside is that Logical Data Marts can have slower response times than physicalized ones.
  3. Operational Data Store: An ODS is an integrated database of operational data. Its sources include legacy systems, and it contains current or near-term data. An ODS may contain 30 to 60 days of information, while a data warehouse typically contains years of data. ODSs are used in some data warehouse architectures to provide near-real-time reporting capability in the event that the Data Warehouse's loading time or architecture prevents it from being able to provide near-real-time reporting capability.

[edit] The Data Architecture - Different methods of storing data in a data warehouse

In OLTP - online transaction processing systems relational database design use the discipline of data modeling and generally follow the Codd rules of data normalization in order to ensure absolute data integrity. Using these rules complex information is broken down into its most simple structures (a table) where all of the individual atomic level elements relate to each other and satisfy the normalization rules. Codd defines 5 increasing stringent rules of normalization and typically OLTP systems achieve a 3rd level normalization. Fully normalized OLTP database designs often result in having information from a business transaction stored in dozens to hundreds of tables. Relational database managers are efficient at managing the relationships between tables and result in very fast insert/update performance because only a little bit of data is affected in each relational transaction.

The business challenge arises when people need to reassemble all the atomic bits into a meaningful record for reporting and analysis. OLTP databases are efficient because they are typically only dealing with the information around a single transaction. In reporting and analysis, thousands to billions of transactions may need to be reassembled imposing a huge workload on the relational database. Given enough time the software can usually return the requested results, but because of the negative performance impact on the machine and all of its hosted applications, data warehousing professional recommend that reporting databases be physically separated from the OLTP database.

In addition, data warehousing suggests that data be restructured and reformatted to facilitate query and analysis by novice users. OLTP databases are designed to provide good performance by rigidly defined applications built by programmers fluent in the constraints and conventions of the technology. Add in frequent enhancements, and to many a database is just a collection of cryptic names, seemingly unrelated and obscure structures that store data using incomprehensible coding schemes. All factors that while improving performance, complicate use by untrained people.

Lastly, the data warehouse needs to support high volumes of data gathered over extended periods of time and are subject to complex queries and need to accommodate formats and definitions of inherited from independently designed package and legacy systems.

Designing the data warehouse data Architecture is the realm of Data Warehouse Architects.

The goal of a data warehouse is to bring data together from a variety of existing databases to support management and reporting needs. The generally accepted principle is that data should be stored at its most elemental level because this provides for the most useful and flexible basis for use in reporting and information analysis.

However, because of different focus on specific requirements, there can be alternative methods for design and implementing data warehouses. There are two leading approaches to organizing the data in a data warehouse. The dimensional approach advocated by Ralph Kimball and the normalized approach advocated by Bill Inmon.

In the "dimensional" approach, transaction data is partitioned into either a measured "facts" which are generally numeric data that captures specific values or "dimensions" which contain the reference information that gives each transaction its context. As an example, a sales transaction would be broken up into facts such as the number of products ordered, and the price paid, and dimensions such as date, customer, product, geographical location and salesperson. The main advantages of a dimensional approach is that the Data Warehouse is easy for business staff with limited information technology experience to understand and use. Also, because the data is pre-joined into the dimensional form, the Data Warehouse tends to operate very quickly. The main disadvantage of the dimensional approach is that it is quite difficult to add or change later if the company changes the way in which it does business.

The "normalized" approach uses database normalization. In this method, the data in the data warehouse is stored in third normal form. Tables are then grouped together by subject areas that reflect the general definition of the data (Customer, Product, Finance etc.). The main advantage of this approach is that it is quite straightforward to add new information into the database -- the primary disadvantage of this approach is that because of the number of tables involved, it can be rather slow to produce information and reports.

Subject areas are just a method of organizing information and can be defined along any lines. The traditional approach has subjects defined as the subjects or nouns within an problem space. E.g in a Financial Services business area, you might have Customers, products, Contracts, etc. An alternative approach is to organize around the business transactions. E.g Customer enrollment, Sales, Trades.

[edit] Advantages of using data warehouse

There are many advantages to using a data warehouse, some of them are:

  • Enhances end-user access to a wide variety of data.
  • Business decision makers can obtain various kinds of trend reports e.g. the item with the most sales in a particular area / country for the last two years.

[edit] Concerns in using data warehouse

  • Extracting, cleaning and loading data could be time consuming.
  • Data warehousing project scope might increase.
  • Problems with compatibility with systems already in place.
  • Providing training to end-users, who end up not using the data warehouse.
  • Security could develop into a serious issue, especially if the data warehouse is web accessible.
  • Reports and Data may not be trustworthy because anybody can generate reports.
  • Data Storage controversy leaves it unclear how to structure my warehouse

[edit] See also

[edit] References