Talk:Data warehouse

From Wikipedia, the free encyclopedia

This article is within the scope of WikiProject Databases.
B rated as B-Class on the assessment scale
High rated as high-importance on the assessment scale
This article is within the scope of the Business and Economics WikiProject.
B rated as b-Class on the assessment scale
Mid rated as mid-importance on the assessment scale

Contents

[edit] Removed Section

I have removed the section "Important Considerations in Building a Data Warehouse" as it is not appropriate to have a "how-to" section in a definition. I've written to the originator of the change but have received no response and therefore have taken this action. The text that has been removed is listed below


Important questions to ask when using a data warehouse 
  • Do you need a data warehouse? Data warehouses are expensive,additionally you need to train employees on how to use them which is also costly.
  • Do all employees need to have access to the data warehouse? Not all employees need to access the whole data warehouse, in case they need some information you can consider building data marts.
  • How frequently should you update it? This depends on the nature of the data, some needs to be updated instantly, others can be updated weekly or monthly.
  • Which data mining tools should you use? End-users should be the primary determinants of what tools to use, yet the most important key is training.

[edit] Major Edits Made

I've made some major changes to the page and welcome any new comments

[edit] This page needs serious work!

I'm a Senior Architect in a data warehouse in a Fortune 100 company. We have 150 people on staff and an annual budget of 30 million, and if my management came to this page to see what they're paying all that staff for, we'd be fired!!! There is VASTLY more to data warehousing than this article outlines. I will be back to enhance it dramatically, but in the mean time, don't think this is all there is to data warehousing! --Somewherepurple 00:40, 16 December 2005 (UTC)

I guess management must have read the page :-) Anyway, 150 people and an annual budget of 30 million is pretty extreme for a data warehouse. What typically happens is that there are companies that need data to manage their business (most of them) and companies for whom data IS their business. My guess is that this writer was in the latter category. Neil Raden (talk) 23:30, 17 February 2008 (UTC)

[edit] This text isn't quite right

Conventional database systems use highly normalized data formats so that they will execute transactions and queries as fast as possible, in minimal time and space.

Normalization gives you tremendous flexibility on the kinds of queries you may run, but often at the expense of time. (unsigned)

[edit] tzeh 5.jan.2004 wrote

This entry of "data warhouse" is not the definition resp. description of a data warehouse. This text defines resp. describes the data warehouse system with some components of the data warehouse system. A data warehouse is a data store only which is build out of separate internal and/or external data sources where the data is integrated in a consistent manner. (unsigned)

[edit] Shark Tank on Computer World

State agency's data warehouse project is renamed: It's now officially an "administrative systems reporting database." Why not just call it a data warehouse? "Several legislators threatened to block the project if the warehouse wasn't constructed within their own districts," grumbles on-scene pilot fish. "All attempts to explain that no physical structures would ever be built fell short. The lawmakers had never heard of a warehouse without a building somewhere." [1] 4.250.198.106 14:18, 29 Mar 2005 (UTC)


[edit] References

I removed the Wiley title from the bibliography since they publish many titles on the subject, and singling one out seemed rather arbitrary. The link provided in the References section includes all Wiley titles on data warehousing.

The "References" and "External links" sections should really be combined, since "External links are still "references". Ringbang 13:15, 22 July 2005 (UTC)

That sounds reasonable to me. I added an external link yesterday (to the data warehousing knowledge base at http://www.datamgmt.com), but I see someone has removed it with no comment. That knowledge base is a very good source of information on data warehousing, authored by the UK's leading DW authority, and is every bit as useful here as Kimball's book. GreenInker 22:12, 25 May 2006 (UTC)

The reason I feel these aren’t arbitrary book choices--even though both are from Wiley--is because these are the two books I recommend people start with. One is a general overview of the Corporate Information Factory approach and the Kimball book is more of a nuts-and-bolts view. They aren’t opposing, but some people feel that they are, and these are really the two biggest names in Data Warehousing. They deserve to be highlighted. Stephen Pace 9 Aug

Makes sense. If you re-add them, I won't remove them, but it would be nice if the listings were qualified in some way in the article (i.e., some mention that they're the standard references for beginners). Better yet would be subsections for special topics in data warehousing, general titles for beginners, etc. I definitely see the value in helping people to sift through Wiley's huge information management library. — Ringbang 20:22, 10 August 2005 (UTC)

I found an article discussing the two cited approaches for organizing data in a datawarehouse, (Inmon & Kimball), from an ISI magazine: March 2005/Vol. 48, No. 3 COMMUNICATIONS OF THE ACM, page 79-84, "A Comparison of Data Warehousing Methodologies", but I don't know how to put a reference. I think this reference could be in the "The Data Architecture - Different methods of storing data in a data warehouse" section. Gueta 20:31, 21 December 2006 (UTC)

[edit] Merge

Recommend merging the article Data mart with this one. Comments? SqlPac 14:46, 17 May 2007 (UTC)


Absolutely Not. Data mart is a separate concept from data warehouse. Data mart is worthy of its own article. Steve 13 June 2007

I agree that data marts are sufficiently distinct from data warehouses to warrant a separate article. Further, the two terms are often used interchangably by the non-cogniscenti and combining the articles might add to that misconception. However, data marts and data warehouses are often used as complementary components of an overall Business Intelligence solution. For that reason, I think the relationship between them needs to be drawn out in both articles. John 22 June 2007


Well, I have seen (as an IBM consultant for seven years in Asia and NA) and used (as a developer now for six years) data marts that source its data from data warehouses. So, they can be closely related. Djoni 5 July 2007.

It may be closely related but does that mean data mart does not have an identity of its own distinct from a data warehouse? Cant a data mart operate without one? If it cant, then it may be merged. But if it can, it is a different concept in its own right and worthy of a separate article. --soum talk 19:45, 5 July 2007 (UTC)

[edit] A bit unbalanced.

Parts of this article are written very pro Inmon at the expense of describing the Kimball method completely. It could use a writeup of the Kimball method from Ralph Kimball's perspective I think rather than from Inmon's. Perhaps each method needs it's own article instead of trying to add a small amount of information on each to this article and end up being a bit lean on both approache's pros and cons.

Both Kimball and Inmon have their own articles. I'd suggest your expand their POV's there. We can draw comparisons here. Neil Raden (talk) 22:58, 24 February 2008 (UTC)

[edit] I think "less" should be "more"

I believe the sentence... "Less complex information is broken down into its most simple structures (a table) where all of the individual atomic level elements relate to each other and satisfy the normalization rules." could be better crafted as something like... "In this approach, each of the more complex information items are resolved into a set of records in multiple tables, each of which satisfies the normalization rules." Even that sounds like a mouthful, though...any thoughts? Kiwi137 (talk) In the absence of any objections I will amend text as per the above. Kiwi137 (talk) 11:51, 14 January 2008 (UTC)

[edit] Article Needs a Rewrite

Any article about data warehousing that starts out with Inmon's 20-year-old principles and decade-old debate with Kimball is a history lesson, not an informative piece. I can't even agree with the first sentence: "A data warehouse is the main repository of an organization's historical data, its corporate memory." That is only partially when true when there is a true enterprise data warehouse, which is rare. A data warehouse only contains strucutured data culled from operational systems. That's pretty short of the corporate memory.

Is anyone interested in working with me to improve this article? I'd likke to see more explanation of how a data warehouse is integrat4ed into today's computing environment, the design implications of scale and real-time, etc. Neil Raden (talk) 23:58, 27 January 2008 (UTC)

I'd be happy to help out. Inmon and Kimball can't be ignored in the article, however, as their methodologies and opinions are widely regarded. SqlPac (talk) 20:50, 24 February 2008 (UTC)
I'm not suggesting we ignore them, but they are not the most important issue, or rather, the debate over them isn't. Data warehousing is on the verge of a major re-think thanks to the convergence of operational and analytical processes. How does the high-latency batch update process through multiple physical layers (the Corporate Information Factory) work with today's externalized business, huge scale and demand for immediacy? Where do we fit streaming databases and the processing of complex events and decision automation? So I guess my point is that the history of data warehousing is interesting, but it's more important that we focus on writing an article that is useful to the reader. Neil Raden (talk) 22:47, 24 February 2008 (UTC)
Why don't you add a final section on the future of data warehousing? (Writerguy71 (talk) 15:27, 26 February 2008 (UTC))
I took a stab at a brief section on data warehousing futures. (Writerguy71 (talk) 10:47, 7 March 2008 (UTC))
"Warehousing futures" - I feel like I should call my stock broker and invest heavily <g>. SqlPac (talk) 17:20, 9 March 2008 (UTC)
I agree that they shouldn't be ignored, and I can agree that the discussion over these two personalities in particular may not be the most important issue on the plate--but to understand the future we need to know the past right? :) I think information about these two is important because their methodologies are in such widespread use today. It's important to look at what the future is bringing, but perhaps even more important to most readers is what is in common use today. SqlPac (talk) 17:20, 9 March 2008 (UTC)
I've rewritten the "Kimball versus Inmon" section to be a bit less dogmatic, and to frame the discussion more in terms of the "top-down" vs. "bottom-up" design methodologies. Obviously Inmon and Kimball were mentioned, but since this article isn't about them per se, but rather about data warehousing in general I've basically stated "this is the design approach... and this guy is one of the leading proponents of this approach..." I've also added references and listed some of the benefits and risks associated with the different design methodologies. Feel free to tweak or change as you feel necessary. I think we can add some more information about the "hybrid design" approaches commonly in use. SqlPac (talk) 19:15, 10 March 2008 (UTC)

[edit] One physical repository?

Someone put this sentence in there: "These two influential experts represent the traditional views on one aspect of data warehousing - whether it should be in one physical repository." I disagree with the wording, as I don't believe either one has really said your data needs to be stored in one "physical repository". What's a "physical repository"? A single database on a single server? A single server on a single network? Does this exclude server farms and geographically separated redundancy systems? SqlPac (talk) 20:50, 24 February 2008 (UTC)

I think the point is that Kimball and Inmon were largely focused on the database aspects of data warehousing, which is not really true. Kimball has written extensively about ETL, Web Analytics and the whole data warehouse lifecycle. Inmon started the first company (Prism) to do ETL and has written extensively about many aspects of data warehousing. I think what this person meant was that the Inmon vs Kimball debate was about was about how to build the mega data warehouse, but that isn't really true. Their differences were mostly methodology (how to proceed), though Imon rejected dimensional modeling initially, aceeding to it later for "data marts." But the really interesting part of this statement is the comment about one physical repository. Today, it should be thought of as a logical repository as it can be, as you said, distributed. In addition, there is a rapid pick-up of in-memory databases and query federation, through a meta-layer, which is making slow but steady progress. Neil Raden (talk) 22:57, 24 February 2008 (UTC)

[edit] Another round of edits

Another round of edits were attempted. There was an attempt to keep most of the substance intact. These edits will be fodder for making the article better. For example, the above comment on the physical repository was correct. Writerguy71 (talk) 25 February 2008 (UTC)

There was an attempt to incorporate someone's personal diagram into the substance the article. Frankly, the diagram does not add much though. Writerguy71 (talk) 25 February 2008 (UTC)

Also, admittedly, the article greatly lacks the annotation it should. If there are unannotated controversial statements, then, per Wikipedia policy, then remove them. That being said, the controversial material may revolve around, for lack of a better term, "Inmon - Kimball religion issues". An attempt was to put those issues in a "fair and balanced" light. Writerguy71 (talk) 25 February 2008 (UTC)

Finally, I do not know how the above discusser infers there is a statement that Inmon and Kimball themselves were focused on the database issue. Nonetheless, the language has been changed with the attempt to make it so other readers do not make the same inference. - This article needed so much work that there probably are many other points that could be better said. Writerguy71 (talk) 25 February 2008 (UTC)

It would help if you use indenting under the comments you are commenting one. Also, what would you think about removing the picture at the top? It depicts one type of data warehouse environment only. But more importantly, it uses the term "Data Vault" which is a term used by a vendor. Neil Raden (talk) 21:07, 25 February 2008 (UTC)
As stated previously, it is my opinion that the diagram adds little. (Writerguy71 (talk) 11:53, 26 February 2008 (UTC))
It's gone Neil Raden (talk) 00:16, 27 February 2008 (UTC)

[edit] Repository

The first sentence uses the word "repository." Is a data warehouse really a repository? Most diagrams depict it as much more than that. Is ETL part of a warehouse? Is metadata part of a warehouse (metadata may be housed in its own repository, but isn't metadata itself much more than a repository?). If you want to get expansive about it, aren't reports, metrics, templates and even the forgotten users part of a data warehouse? Neil Raden (talk) 21:12, 25 February 2008 (UTC)

For purposes of simplicity, keep the first sentence as a "repository". If you want to be technically correct, then include ETL, metadata, etc. in the first paragraph. Like to see your edits and how you draw the boundary. Just don't call data warehousing the corporate memory. (Writerguy71 (talk) 16:10, 26 February 2008 (UTC))
I changed the intro. around, using two definitions - one definition from an oft-cited ACM paper and another definition from an oft-cited Inmon paper. I also listed more benefits as cited by the Data Warehousing at Stanford project. You may want to revise or rearrange as necessary. I think it's important to establish a level of credibility with citations in the intro. though. SqlPac (talk) 18:59, 9 March 2008 (UTC)
I noticed others have been rearranging content, and it looks good. I've rearranged the intro. and the first paragraph to combine them, and I also moved the benefits out of the intro and into a separate section. This was basically to keep the intro short, sweet, and to the point. I also reworded the "expanded definition" slightly since it started with "The previous definition focuses on data..." and then goes on to explain that the expanded definition includes "the movement of *data*, the analysis of *data*, the extraction, transformation, and loading of *data*". Seemed like the author might have been trying to say that the previous definition was focused entirely on the "storage of data". Anyway, feel free to change around whatever as necessary. SqlPac (talk) 02:05, 12 March 2008 (UTC)

[edit] Normalization/Denormalization

This is all wrong. In my experience, operational systems are the most denormalized systems. Look at SAP for example. It violates every rule. The most normalized schema in the whole system are the Star Schemas in the SAP BI. It is completely false that dimensional schema are denormalized. A well-designed star schema is in 2nd normal form and if you snowflake the dimensions, which I always do to promote sparse aggregation, they are in 3rd normal form. Some may argue that duplicating data (via aggregates) violates the normalization rule, but if you stop to think about it, so does an index. If you look at a snowflaked schema, what you see are a series of 3rd normal form stars. It is a complete MYTH that 3rd normal form doesn't provide good performance. It depends on the type of schema. Eliminate circular joins and it performs quite well. Neil Raden (talk) 21:30, 25 February 2008 (UTC)

Okay, should the section exist? Or should there be a section on "design" that, perhaps, explains granularity, that some data warehouse designs follow some/all the Codd rules, and that some data warehouses follow dimensional modeling which makes use of "facts" and "dimensions" and which may or may not follow some of the Codd rules? If not, what should be in the section if you agree something like it should exist? (Writerguy71 (talk) 15:56, 26 February 2008 (UTC))
Dimensional schemas consist of denormalized tables in many instances. Indexes are a separate issue from "data", and are an implementation-specific detail. They are managed internally by the system, are usually not exposed to users, and are not considered "data" in most systems. Some indexes do not duplicate data, so I'm at a loss here as to what you're talking about. Whether or not 3NF provides good performance is dependent on many factors. Some systems perform better with a star schema, some perform better with a snowflake schema. I would be interested to see a "well-designed star schema in 2NF." Usually a simple Time dimension can be a determining factor as to whether it even meets the criteria for 1NF, much less 2NF or 3NF. Do you have any examples of 2NF well-designed star schemas (schemata? whatever...)? SqlPac (talk) 17:35, 9 March 2008 (UTC)
I took a stab at rewrites and moving information to a more appropriate section. The section still does not read well so your edits are welcome (Writerguy71 (talk) 11:57, 4 March 2008 (UTC))
Big improvement Neil Raden (talk) 04:48, 11 April 2008 (UTC)

[edit] Dates for key developments in early years of data warehousing

A list of some key dates was addded. I thought a little historical context would be good for the article. The list is debatable so edit away. (Writerguy71 (talk) 02:26, 6 March 2008 (UTC))

[edit] Data mart article badly in need of help!

I just visited what might be considered a "sister article" to this one - the Data mart article. It is badly in need of attention. I figured I'd post a quick note here, since people who are editing this one are more likely to be interested in that one as well (at least more likely to be interested than the general population of editors). Thanks. SqlPac (talk) 02:15, 12 March 2008 (UTC)

[edit] External Links, Inline References

Can anyone inline these references? I'm not sure what material was sourced from these "seminal works" in the article. If they're truly considered seminal works, there should be at least one sentence worth referencing directly from each, right? :)

  • William H. Inmon, Richard D. Hackathorn: Using the Data Warehouse, John Wiley & Son's, ISBN 0-471-05966-8 - One of Inmon's seminal work on data warehousing
  • Ralph Kimball, Margy Ross: The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition), John Wiley & Sons, ISBN 0-471-20024-7 - One of Kimball's (with the help of Ross) seminal works on data warehhousing

Also I moved all the external links to blogs, etc., from the References section to a new External Links section. Someone might want to go through those links at some point and make sure they meet the Wikipedia criteria for inclusion in the article. I'm not familiar enough with the criteria to do it myself, but there are a couple of blogs in there I'm not familiar with and I wouldn't be a good judge as to whether or not they meet inclusion criteria. Thanks. SqlPac (talk) 04:00, 12 March 2008 (UTC)