Data proliferation

From Wikipedia, the free encyclopedia

Data proliferation refers to the unprecedented amount of data, structured and unstructured, that businesses and governments continue to generate at an ever-increasing rate, and to the usability problems that result from attempting to store and manage that data. While the term originally pertained to problems associated with paper documentation, data proliferation has become a major problem in primary and secondary data storage on computers.

At the simplest level, company e-mail systems spawn large amounts of data. Business e-mail – some of it important to the enterprise, some much less so – is estimated to be growing at a rate of 25-30% annually. And whether it’s relevant or not, the load on the system is being magnified by practices such as multiple addressing and the attaching of large text, audio and even video files.

—IBM Global Technology Services[1]
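To give a sense of what the quoted 25-30% annual growth implies, the following minimal sketch compounds a starting volume over several years. The function name, the five-year horizon and the normalized starting volume of 1.0 are illustrative assumptions, not figures from the cited source.

```python
def projected_volume(initial, annual_rate, years):
    """Volume after `years` of compound growth at a fixed annual rate."""
    return initial * (1 + annual_rate) ** years


# Assumed horizon of five years; the starting volume is normalized to 1.0.
for rate in (0.25, 0.30):
    factor = projected_volume(1.0, rate, 5)
    print(f"{rate:.0%} annual growth over 5 years: x{factor:.2f}")
# 25% annual growth over 5 years: x3.05
# 30% annual growth over 5 years: x3.71
```

Under these assumptions, e-mail volume would roughly triple to nearly quadruple within five years if the growth rate held steady.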


Data proliferation has been documented as a problem for the U.S. military since August 1971, particularly with regard to the excessive documentation submitted during the acquisition of major weapon systems.[2] Efforts to mitigate data proliferation and the problems associated with it are ongoing.[3]

Problems caused by data proliferation

Data proliferation affects all areas of commerce as a result of the availability of relatively inexpensive data storage devices, which have made it very easy to dump data into secondary storage immediately after its window of usability has passed. This practice masks problems that could gravely affect the profitability of businesses and the efficient functioning of health services, police and security forces, local and national governments, and many other types of organization.[1] Data proliferation is problematic for several reasons:

  • Difficulty when trying to find and retrieve information. At Xerox, employees take on average more than one hour per week to find hard-copy documents, which cost $2,152 a year to manage and store. For businesses with more than 10 employees, this increases to almost two hours per week at $5,760 per year.[4] In large networks of primary and secondary data storage, the problems of finding electronic data are analogous to those of finding hard-copy data.
  • Data loss and legal liability when data is disorganized, not properly replicated, or cannot be found in a timely manner. In April of 2005, Ameritrade Holding Corporation told 200,000 current and past customers that a tape containing confidential information had been lost or destroyed in transit. In May of the same year, Time Warner Incorporated reported that 40 tapes containing personal data on 600,000 current and former employees had been lost en route to a storage facility. In March of 2005, a Florida judge hearing a $2.7 billion lawsuit against Morgan Stanley issued an "adverse inference order" against the company for "willful and gross abuse of its discovery obligations." The judge cited Morgan Stanley for repeatedly finding misplaced tapes of e-mail messages long after the company had claimed that it had turned over all such tapes to the court.[5]
  • Increased manpower requirements to manage increasingly chaotic data storage resources.
  • Slower networks and application performance due to excess traffic as users search and search again for the material they need.[1]
  • High cost in terms of the energy resources required to operate storage hardware. A 100-terabyte system will cost up to $35,040 a year to run, not counting cooling costs.[6]
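The energy figure in the last item can be read as continuous power draw multiplied by hours of operation and the price of electricity. The cited source gives only the total, so the sketch below uses assumed inputs: a 40 kW draw and a price of $0.10 per kWh are hypothetical values chosen because they reproduce the $35,040 figure, not numbers taken from the reference.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_energy_cost(power_kw, price_per_kwh, hours=HOURS_PER_YEAR):
    """Cost of running a constant load of `power_kw` kilowatts for `hours`."""
    return power_kw * hours * price_per_kwh


# Assumed values (not from the cited source): 40 kW draw, $0.10 per kWh.
cost = annual_energy_cost(power_kw=40.0, price_per_kwh=0.10)
print(f"${cost:,.0f} per year")
# $35,040 per year
```

Other combinations of power draw and price yield the same total, so the calculation shows only how such a figure is composed, not how the source derived it.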

Proposed solutions

  • Applications that better utilize modern technology
  • Reductions in duplicate data (especially as caused by data movement)
  • Improvement of metadata structures
  • Improvement of file and storage transfer structures
  • User education and discipline[2]
  • The implementation of Information Lifecycle Management solutions to eliminate low-value information as early as possible before putting the rest into actively managed long-term storage in which it can be quickly and cheaply accessed.[1]
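As an illustration of the second item in the list above, duplicate data can be detected by comparing cryptographic digests of content rather than file names. The following is a minimal sketch, not a description of any particular product: the function name and the sample mailbox are hypothetical, and content is held in memory for simplicity, whereas a real system would read files in chunks.

```python
import hashlib
from collections import defaultdict


def find_duplicates(blobs):
    """Group item names whose byte content is identical.

    `blobs` maps a name (for example a file path) to its content as bytes.
    Returns a list of sorted name groups, each holding two or more items.
    """
    by_digest = defaultdict(list)
    for name, content in blobs.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical mailbox: the same attachment stored under three names.
attachments = {
    "inbox/report.pdf": b"Q3 figures",
    "sent/report-copy.pdf": b"Q3 figures",
    "archive/old.pdf": b"Q2 figures",
    "shared/report-final.pdf": b"Q3 figures",
}
print(find_duplicates(attachments))
# [['inbox/report.pdf', 'sent/report-copy.pdf', 'shared/report-final.pdf']]
```

Once duplicates are grouped, all but one copy in each group can be replaced with a reference to the retained copy, which is the basic idea behind storage deduplication.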

See also

References