Open Data

From Wikipedia, the free encyclopedia

Open Data is a philosophy and practice requiring that certain data be freely available to everyone, without restrictions from copyright, patents or other mechanisms of control. It has a similar ethos to a number of other "Open" movements and communities such as open source and open access. However, these are not logically linked and many combinations of practice are found. The practice and ideology are well established (for example in the Mertonian tradition of science), but the term "Open Data" itself is recent. Much of the emphasis in this entry is on data from scientific research and from the data-driven web. In some cases Open Data may more properly be considered Open Metadata, and there is not yet a consistent formalisation. This article uses recent publications and activities to define the scope of the concept and term.

Overview

The concept of Open Data is not new, but although the term is currently in frequent use, there are no commonly agreed definitions (unlike, for example, Open Access, where several formal declarations have been made and signed).

Open Data is often focussed on non-textual material such as maps, genomes, chemical compounds, mathematical and scientific formulae, medical data and practice, bioscience and biodiversity. Problems often arise because these data are commercially valuable or can be aggregated into works of value. Access to, or re-use of, the data is controlled by organisations, both public and private. Control may be exercised through access restrictions, licenses, copyright, patents and charges for access or re-use. Advocates of Open Data argue that these restrictions are against the communal good and that these data should be made available without restriction or fee. In addition, it is important that the data are re-usable without requiring further permission, though the types of re-use (such as the creation of derivative works) may be controlled by license.

A typical depiction of the need for Open Data:

Numerous scientists have pointed out the irony that right at the historical moment when we have the technologies to permit worldwide availability and distributed process of scientific data, broadening collaboration and accelerating the pace and depth of discovery … we are busy locking up that data and preventing the use of correspondingly advanced technologies on knowledge

[1] John Wilbanks, Executive Director, Science Commons

Creators of data often do not consider the need to state the conditions of ownership, licensing and re-use. For example, many scientists do not regard the published data arising from their work as theirs to control, and they treat the act of publication in a journal as an implicit release of the data into the commons. However, the lack of a license makes it difficult to determine the status of a data set and may restrict the use of data offered in an Open spirit. Because of this uncertainty it is also possible for public or private organisations to aggregate such data, protect it with copyright and then resell it.

Under "Toward Open Data" Connolly (2005, v.i.) gives two quotations:

  • I want my data back. (Jon Bosak circa 1997)
  • I've long believed that customers of any application own the data they enter into it. (Jeffrey Veen) [2] (This quote refers to Veen's own heart-rate data.)

These quotations suggest that Openness refers to the metadata (formats, licenses, ontologies) rather than the data themselves.

[edit] History

Keith Jeffery writes:

Although the term open data is rather new, the concept is rather old. The International Geophysical Year of 1957-8 caused the setting up of several world data centres and - more importantly - set standards for descriptive metadata to be used for data exchange and utilisation.[3]

In 1995 GCDIS (US) put the position clearly in On the Full and Open Exchange of Scientific Data (A publication of the Committee on Geophysical and Environmental Data - National Research Council):

"The Earth's atmosphere, oceans, and biosphere form an integrated system that transcends national boundaries. To understand the elements of the system, the way they interact, and how they have changed with time, it is necessary to collect and analyze environmental data from all parts of the world. Studies of the global environment require international collaboration for many reasons:

  • to address global issues, it is essential to have global data sets and products derived from these data sets;
  • it is more efficient and cost-effective for each nation to share its data and information than to collect everything it needs independently; and
  • the implementation of effective policies addressing issues of the global environment requires the involvement from the outset of nearly all nations of the world.
International programs for global change research and environmental monitoring crucially depend on the principle of full and open data exchange (i.e., data and information are made available without restriction, on a non-discriminatory basis, for no more than the cost of reproduction and distribution)."

[4]

The last phrase highlights the traditional cost of disseminating information by print and post. It is the removal of this cost through the Internet which has made data vastly easier to disseminate technically. It is correspondingly cheaper to create, sell and control many data resources and this has led to the current concerns over non-Open data.

More recent uses of the term include:

  • SAFARI 2000 (South Africa, 2001) used a license informed by ICSU and NASA policies [5]
  • the human genome [6] (Kent, 2002)
  • An Open Data Consortium on geospatial data [7] (2003)
  • The Blue Obelisk group in chemistry (mantra: Open Data, Open Source, Open Standards) [8] (2004) doi:10.1021/ci050400b
  • Manifesto for Open Chemistry [9] (Murray-Rust and Rzepa, 2004)
  • Presentations to JISC and OAI under the title "Open Data" [10] (Murray-Rust, 2005)
  • Science Commons launch [11] (2004)
  • The Petition for Open Data in Crystallography, launched by the Crystallography Open Database Advisory Board [12] (2005)
  • XML Conference & Exposition 2005 [13] (Connolly 2005)
  • SPARC Open Data mailing list [14] (2005)
  • XTech [15] (Dumbill, 2005), [16] (Bray and O'Reilly 2006)


In 2004, the Science Ministers of all nations of the OECD (Organisation for Economic Co-operation and Development), which includes most developed countries of the world, signed a declaration which essentially states that all publicly-funded archive data should be made publicly available.[17] Following a request and an intense discussion with data-producing institutions in member states, the OECD published in 2007 the OECD Principles and Guidelines for Access to Research Data from Public Funding as a soft-law recommendation.[18]

In 2005 Edd Dumbill introduced an "Open Data" theme in XTech.

In 2006 Science Commons [19] ran a two-day conference in Washington whose primary topic could be described as Open Data. It was reported that the amount of micro-protection of data (e.g. by license) in areas such as biotechnology was creating a tragedy of the anticommons, in which the cost of obtaining licenses from a large number of owners makes research in the area uneconomic.

In 2007 SPARC and Science Commons announced a consolidation and enhancement of their author addenda. [20]

Fundamental Open Rights

Arguments made on behalf of Open Data include:

  • "Data belong to the human race". Typical examples are genomes, data on organisms, medical science, environmental data.
  • Public money was used to fund the work and so it should be universally available.
  • The data were created by or at a government institution (this is common in US National Laboratories and government agencies).
  • Facts cannot legally be copyrighted.
  • Sponsors of research do not get full value unless the resulting data are freely available
  • Restrictions on data re-use create an anticommons
  • Data are required for the smooth process of running communal human activities (map data, public institutions)
  • In scientific research, the rate of discovery is accelerated by better access to data. [21]


It is generally held that factual data cannot be copyrighted.[22] However, publishers frequently add their copyright statements (often forbidding re-use) to scientific data accompanying (supporting, supplementing) a publication. It is also usually unclear whether the factual data embedded in full text are covered by the copyright.

While the human abstraction of facts from paper publications is normally accepted as legal, there is often an implied restriction on machine extraction by robots.

As the term Open Data is relatively new, it is difficult to collect arguments against it. Unlike Open Access, where groups of publishers have stated their concerns, Open Data is normally challenged by individual institutions. Their arguments may include:

  • the data holder is a non-profit organisation and the revenue is necessary to support its other activities (e.g. learned society publishing supports the society)
  • the government gives specific legitimacy for certain organisations to recover costs (NIST in US, Ordnance Survey in UK)
  • government funding may not be used to duplicate or challenge the activities of the private sector (e.g. PubChem)

Relation to Open Access

Much data is made available through scholarly publication, which now attracts intense debate under "Open Access". The Budapest Open Access Initiative (2001) coined this term:

By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.

The logic of the declaration permits re-use of the data, although the term "literature" has connotations of human-readable text and can imply a scholarly publication process. Open Access discourse often uses the term "full-text", which does not emphasize the data contained within or accompanying the publication.

Some Open Access publishers do not require the authors to assign copyright, and the data associated with these publications can normally be regarded as Open Data. Other publishers have Open Access strategies that require assignment of the copyright, and in these cases it is unclear whether the data in publications can truly be regarded as Open Data.

The ALPSP and STM publishers have issued a statement about the desirability of making data freely available [23]:

Publishers recognise that in many disciplines data itself, in various forms, is now a key output of research. Data searching and mining tools permit increasingly sophisticated use of raw data. Of course, journal articles provide one ‘view’ of the significance and interpretation of that data – and conference presentations and informal exchanges may provide other ‘views’ – but data itself is an increasingly important community resource. Science is best advanced by allowing as many scientists as possible to have access to as much prior data as possible; this avoids costly repetition of work, and allows creative new integration and reworking of existing data.

and

We believe that, as a general principle, data sets, the raw data outputs of research, and sets or sub-sets of that data which are submitted with a paper to a journal, should wherever possible be made freely accessible to other scholars. We believe that the best practice for scholarly journal publishers is to separate supporting data from the article itself, and not to require any transfer of or ownership in such data or data sets as a condition of publication of the article in question.

However, this statement has had no effect on the open availability of primary data related to publications in the journals of ALPSP and STM members. Data tables provided by authors as a supplement to a paper are still available to subscribers only.

Relation to other Open Activities

There are a number of other "Open" philosophies which are similar to, but not synonymous with, Open Data; they may overlap with it, or be supersets or subsets of it. They are briefly listed and compared here.

  • Open Source (Software) is concerned with the licenses under which computer programs can be distributed and is not normally concerned primarily with data.
  • Open Content has similarities to Open Data and may be seen as a superset but differs in that it emphasizes creative works while Open Data is more oriented towards factual data and the output of the scientific research process.
  • Open Notebook Science refers to the application of the Open Data concept to as much of the scientific process as possible, including failed experiments and raw experimental data. [24]
  • Open Knowledge. The Open Knowledge Foundation argues for Openness in a range of issues including, but not limited to, those of Open Data. It covers (a) data, whether scientific, historical, geographic or otherwise; (b) content such as music, films and books; (c) government and other administrative information.

Funders' mandates

Several funding bodies which mandate Open Access also mandate Open Data. A good expression of requirements (truncated in places) is given by the Canadian Institutes of Health Research (CIHR) [25]:

  • to deposit bioinformatics, atomic and molecular coordinate data, experimental data into the appropriate public database immediately upon publication of research results.
  • to retain original data sets for a minimum of five years after the grant. This applies to all data, whether published or not.

Note the fundamental requirement to be able to replicate the experiment.

Other bodies active in promoting the deposition of data as well as full text include the Wellcome Trust.

Closed Data

Several intentional or unintentional mechanisms exist for restricting access to or re-use of data. They include:

  • compilation in databases or websites to which only registered members or customers can have access.
  • use of a proprietary or closed technology or encryption which creates a barrier for access.
  • copyright forbidding (or obfuscating) re-use of the data.
  • license forbidding (or obfuscating) re-use of the data (such as share-alike or non-commercial)
  • patent forbidding re-use of the data (for example the 3-dimensional coordinates of some experimental protein structures have been patented)
  • restriction of robots' access to websites, with preference given to certain search engines
  • aggregating factual data into "databases" which may be covered by "database rights" or "database directives" (e.g. Directive on the legal protection of databases)
  • time-limited access to resources such as e-journals (which on traditional print were available to the purchaser indefinitely)
  • political, commercial or legal pressure on the activity of organisations providing Open Data (for example, the American Chemical Society lobbied the US Congress to limit funding to the National Institutes of Health for its Open PubChem data). [26]
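
The restriction of robots mentioned in the list above is typically expressed in a site's robots.txt file. The following is a minimal sketch only, assuming a hypothetical robots.txt that favours one crawler; the names FavouredBot and OtherBot, the /data/ path and the example.org URLs are illustrative assumptions and are not taken from any real site. It uses Python's standard urllib.robotparser module to show how such a policy leaves data open to one robot and closed to all others:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one favoured crawler may fetch everything,
# every other robot is kept out of the /data/ area.
ROBOTS_TXT = """\
User-agent: FavouredBot
Disallow:

User-agent: *
Disallow: /data/
"""


def is_allowed(agent: str, url: str) -> bool:
    """Return True if the sample policy lets `agent` fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines, so no network
    # access is needed for this illustration.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    dataset = "https://example.org/data/set.csv"
    for agent in ("FavouredBot", "OtherBot"):
        print(agent, is_allowed(agent, dataset))
```

A robots.txt file is advisory rather than an enforced barrier, so in practice it works together with the licensing and access controls listed above.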

References

  1. ^ Science Commons
  2. ^ Jeffrey Veen
  3. ^ Keith G Jeffery on Peter Murray-Rust's blog
  4. ^ GCDIS
  5. ^ http://mercury.ornl.gov/safari2k/s2kpolicy.pdf
  6. ^ Jim Kent 2002
  7. ^ Open Data Consortium ca. 2003
  8. ^ http://www.blueobelisk.org Blue Obelisk, 2004
  9. ^ Peter Murray-Rust, Henry Rzepa 2004
  10. ^ "Open Data" at CERN Workshop on Innovations in Scholarly Communication (OAI4) Peter Murray-Rust, 2005
  11. ^ Report on Science Commons Dec 2004
  12. ^ http://www.crystallography.net/
  13. ^ Dan Connolly, Semantic Web Data Integration with hCalendar and GRDDL, From Syntax to Semantics (XML 2005), Atlanta, GA, USA. http://www.w3.org/2002/12/cal/mash/slides#(1)
  14. ^ SPARC Open Data Mailing list
  15. ^ XTech 2005
  16. ^ Tim Bray and Tim O'Reilly
  17. ^ OECD Declaration on Open Access to publicly-funded data
  18. ^ OECD Principles and Guidelines for Access to Research Data from Public Funding
  19. ^ Science Commons in Washington 2006
  20. ^ SPARC-OAF forum
  21. ^ How to Make the Dream Come True argues in one research area (Astronomy) that access to open data increases the rate of scientific discovery.
  22. ^ Towards a Science Commons includes an overview of the basis of Openness in science data.
  23. ^ http://www.alpsp.org/ForceDownload.asp?id=129
  24. ^ http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html creation of term
  25. ^ SPARC-OpenData@arl.org Mailing List Archive
  26. ^ Review of history and positions by the University of California