Open Data
From Wikipedia, the free encyclopedia
Open Data is a philosophy and practice requiring that certain data are freely available to everyone, without restrictions from copyright, patents or other mechanisms of control. It has a similar ethos to a number of other "Open" movements and communities such as Open Source and Open Access. However, these are not logically linked and many combinations of practice are found. The practice and ideology are well established (for example in the Mertonian tradition of science) but the term "Open Data" is recent. Much of the emphasis in this entry is on data from scientific research. There is not yet a consistent formalisation of Open Data and this article uses recent publications and activities to define it.
Overview
The concept of Open Data is not new, but although the term is commonly used, there is no commonly agreed definition (cf. Open Access, where several formal declarations have been made and signed).
Open Data is often focussed on non-textual material such as maps, genomes, chemical compounds, mathematical and scientific formulae, medical data and practice, bioscience and biodiversity. Problems often arise because these are commercially valuable or can be aggregated into works of value. Access to, or re-use of, the data are controlled by organisations, both public and private. Control may be through access restrictions, licenses, copyright, patents and charges for access or re-use. Advocates of Open Data argue that these restrictions are against the communal good and that these data should be made available without restriction or fee. In addition, it is important that the data are re-usable without requiring further permission, though the types of re-use (such as the creation of derivative works) may be controlled by license.
A typical depiction of the need for Open Data:
Numerous scientists have pointed out the irony that right at the historical moment when we have the technologies to permit worldwide availability and distributed process of scientific data, broadening collaboration and accelerating the pace and depth of discovery … we are busy locking up that data and preventing the use of correspondingly advanced technologies on knowledge
[1] John Wilbanks, Executive Director, Science Commons
Creators of data often do not consider the need to state the conditions of ownership, licensing and re-use. For example, many scientists do not regard the published data arising from their work as theirs to control, and treat the act of publication in a journal as an implicit release of the data into the commons. However, the lack of a license makes it difficult to determine the status of a data set and may restrict the use of data offered in an Open spirit. Because of this uncertainty it is also possible for public or private organisations to aggregate such data, protect it with copyright and then resell it.
Applicability
Open Data has been applied to many areas including:
- scientific data deemed to belong to the commons (e.g. the human genome)
- infrastructural data essential for scientific endeavour (e.g. in Geographic information systems)
- data published in scientific articles which are factual and therefore not copyrightable
- data, as opposed to software, which are therefore not covered by Open Source licenses and so are potentially capable of being misappropriated
- maps and other artifacts required for communal infrastructure.
History
Keith Jeffery writes:
Although the term open data is rather new, the concept is rather old. The International Geophysical Year of 1957-8 caused the setting up of several world data centres and - more importantly - set standards for descriptive metadata to be used for data exchange and utilisation.[2]
In 1995 GCDIS (US) put the position clearly in On the Full and Open Exchange of Scientific Data (a publication of the Committee on Geophysical and Environmental Data, National Research Council) [3]:
"The Earth's atmosphere, oceans, and biosphere form an integrated system that transcends national boundaries. To understand the elements of the system, the way they interact, and how they have changed with time, it is necessary to collect and analyze environmental data from all parts of the world. Studies of the global environment require international collaboration for many reasons:
- to address global issues, it is essential to have global data sets and products derived from these data sets;
- it is more efficient and cost-effective for each nation to share its data and information than to collect everything it needs independently; and
- the implementation of effective policies addressing issues of the global environment requires the involvement from the outset of nearly all nations of the world.
International programs for global change research and environmental monitoring crucially depend on the principle of full and open data exchange (i.e., data and information are made available without restriction, on a non-discriminatory basis, for no more than the cost of reproduction and distribution)."
The last phrase highlights the traditional cost of disseminating information by print and post. It is the removal of this cost through the Internet which has made data vastly easier to disseminate technically. It is correspondingly cheaper to create, sell and control many data resources and this has led to the current concerns over non-Open data.
More recent uses of the term include:
- SAFARI 2000 (South Africa, 2001) used a license informed by ICSU and NASA policies [4]
- the human genome [5] (Kent, 2002)
- An Open Data Consortium on geospatial data [6] (2003)
- The Blue Obelisk group in chemistry (mantra: Open Data, Open Source, Open Standards) [7] (2004), doi:10.1021/ci050400b
- Manifesto for Open Chemistry [8] (Murray-Rust and Rzepa, 2004)
- Presentations to JISC and OAI under the title "Open Data" [9] (Murray-Rust, 2005)
- Science Commons launch [10] (2004)
- The Petition for Open Data in Crystallography, launched by the Crystallography Open Database Advisory Board [11] (2005)
- SPARC Open Data mailing list [12] (2005)
- XMLTech [13] (Bray and O'Reilly 2006)
In 2006 Science Commons [14] ran a 2-day conference in Washington where the primary topic could be described as Open Data. It was reported that the amount of micro-protection of data (e.g. by license) in areas such as biotechnology was creating a Tragedy of the anticommons, in which the costs of obtaining licenses from a large number of owners made it uneconomic to do research in the area.
Fundamental Open Rights
Arguments made on behalf of Open Data include:
- "Data belong to the human race". Typical examples are genomes, data on organisms, medical science, environmental data.
- Public money was used to fund the work and so it should be universally available.
- It was created by or at a government institution (this is common in US National Laboratories and government agencies)
- Facts cannot legally be copyrighted.
- Sponsors of research do not get full value unless the resulting data are freely available
- Restrictions on data re-use create an anticommons
- Data are required for the smooth process of running communal human activities (map data, public institutions)
It is generally held that factual data cannot be copyrighted.[15] However, publishers frequently add their copyright statements (often forbidding re-use) to scientific data accompanying (supporting, supplementing) a publication. It is also usually unclear whether the factual data embedded in full text are part of the copyright.
While the human abstraction of facts from paper publications is normally accepted as legal, there is often an implied restriction on machine extraction by robots.
As the term Open Data is relatively new it is difficult to collect arguments against it. Unlike Open Access where groups of publishers have stated their concerns, Open Data is normally challenged by individual institutions. Their arguments may include:
- this is a non-profit organisation and the revenue is necessary to support other activities (e.g. learned society publishing supports the society)
- the government gives specific legitimacy for certain organisations to recover costs (NIST in US, Ordnance Survey in UK)
- government funding may not be used to duplicate or challenge the activities of the private sector (e.g. PubChem)
Relation to Open Access
Much data is made available through scholarly publication, which now attracts intense debate under "Open Access". The Budapest Open Access Initiative (2001) coined this term:
By "open access" to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.
The logic of the declaration permits re-use of the data, although the term "literature" has connotations of human-readable text and can imply a scholarly publication process. In Open Access discourse the term "full-text" is often used, which does not emphasize the data contained within or accompanying the publication.
Some Open Access publishers do not require the authors to assign copyright and the data associated with these publications can normally be regarded as Open Data. Some publishers have Open Access strategies where the publisher requires assignment of the copyright and where it is unclear that the data in publications can be truly regarded as Open Data.
The ALPSP and STM publishers have issued a statement about the desirability of making data freely available [16]:
Publishers recognise that in many disciplines data itself, in various forms, is now a key output of research. Data searching and mining tools permit increasingly sophisticated use of raw data. Of course, journal articles provide one ‘view’ of the significance and interpretation of that data – and conference presentations and informal exchanges may provide other ‘views’ – but data itself is an increasingly important community resource. Science is best advanced by allowing as many scientists as possible to have access to as much prior data as possible; this avoids costly repetition of work, and allows creative new integration and reworking of existing data.
and
We believe that, as a general principle, data sets, the raw data outputs of research, and sets or sub-sets of that data which are submitted with a paper to a journal, should wherever possible be made freely accessible to other scholars. We believe that the best practice for scholarly journal publishers is to separate supporting data from the article itself, and not to require any transfer of or ownership in such data or data sets as a condition of publication of the article in question.
The statement also comments on the tensions between freedom of data and ownership.
Relation to other Open Activities
There are a number of other "Open" philosophies which are similar to, but not synonymous with, Open Data; they may overlap with it or be supersets or subsets of it. Here they are briefly listed and compared.
- Open Source (Software) is concerned with the licenses under which computer programs can be distributed and is not normally concerned primarily with data.
- Open Content has similarities to Open Data and may be seen as a superset but differs in that it emphasizes creative works while Open Data is more oriented towards factual data and the output of the scientific research process.
- Open Notebook Science refers to the application of the Open Data concept to as much of the scientific process as possible, including failed experiments and raw experimental data. [17]
- Open Knowledge. The Open Knowledge Foundation argues for Openness in a range of issues including, but not limited to, those of Open Data. It covers (a) data, whether scientific, historical, geographic or otherwise; (b) content such as music, films and books; (c) government and other administrative information.
Funders' mandates
Several funding bodies which mandate Open Access also mandate Open Data. A good expression of requirements (truncated in places) is given by the Canadian Institutes of Health Research (CIHR) [18]:
- "Final research data" refers to the factual information that is necessary to replicate and verify research results. Data can include original data sets, data sets that are too large to be included in the peer-reviewed publication, and any other data sets supporting the research publication. Research data is typically an electronic data set, and may include interview transcripts and survey results provided confidential data and subject privacy is protected
- to make final data sets, generally in electronic form, available upon request after the publication date
- ensure the quality of the data and have accompanying metadata or codebooks.
- to deposit bioinformatics, atomic and molecular coordinate data, experimental data into the appropriate public database immediately upon publication of research results.
- to retain original data sets for a minimum of five years after the grant. This applies to all data, whether published or not.
Note the fundamental requirement to be able to replicate the experiment.
Other bodies active in promoting the deposition of data as well as full text include the Wellcome Trust.
Closed Data
Several intentional or unintentional mechanisms exist for restricting access to or re-use of data. They include:
- compilation in databases or websites to which only registered members or customers can have access.
- use of a proprietary or closed technology or encryption which creates a barrier for access.
- copyright forbidding (or obfuscating) re-use of the data.
- license forbidding (or obfuscating) re-use of the data
- patent forbidding re-use of the data (for example the 3-dimensional coordinates of some experimental protein structures have been patented)
- restriction of robots' access to websites, with preference given to certain search engines
- aggregating factual data into "databases" which may be covered by "database rights" or "database directives" (e.g. Directive on the legal protection of databases)
- time-limited access to resources such as e-journals (which in traditional print form were available to the purchaser indefinitely)
- political, commercial or legal pressure on the activity of organisations providing Open Data (for example, the American Chemical Society lobbied the US Congress to limit funding to the National Institutes of Health for its Open PubChem data) [19]
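One of the mechanisms listed above, restricting robots while favouring certain search engines, is commonly implemented with a robots.txt file. The following is a minimal sketch, using Python's standard urllib.robotparser module, of how such a file admits one crawler and turns away the rest. The robots.txt content, the crawler names (FavouredBot, ResearchHarvester) and the example.org URLs are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler may fetch everything,
# while every other robot is kept out of the /data/ area.
ROBOTS_TXT = """\
User-agent: FavouredBot
Disallow:

User-agent: *
Disallow: /data/
"""


def can_fetch(agent: str, url: str) -> bool:
    """Return True if the robots.txt above permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


# The favoured crawler may harvest the data set; a generic harvester may not.
favoured = can_fetch("FavouredBot", "https://example.org/data/structures.csv")
other = can_fetch("ResearchHarvester", "https://example.org/data/structures.csv")
print(favoured, other)
```

Such a file is only a request that well-behaved robots choose to honour. It is not a technical barrier, which is why the restriction on machine extraction is often described as implied rather than enforced.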
Organisations and Activities promoting Open Data
- CODATA
- Science Commons
- SPARC
- "Free our data" (The Guardian technology section)
- The Open Knowledge Foundation
- Talis
- Web2Express.org, Open data on semantic web
- Linking Open Data on the Semantic Web
See also
- Budapest Open Access Initiative
- Open Access
- Open Content
- Open Source
- Merton Thesis
- Talis Community License
References
- [1] Science Commons
- [2] Keith G Jeffery on Peter Murray-Rust's blog
- [3] GCDIS
- [4] http://mercury.ornl.gov/safari2k/s2kpolicy.pdf
- [5] Jim Kent 2002
- [6] Open Data Consortium ca. 2003
- [7] http://www.blueobelisk.org Blue Obelisk, 2004
- [8] Peter Murray-Rust, Henry Rzepa 2004
- [9] "Open Data" at CERN Workshop on Innovations in Scholarly Communication (OAI4) Peter Murray-Rust, 2005
- [10] Report on Science Commons Dec 2004
- [11] http://www.crystallography.net/
- [12] SPARC Open Data Mailing list
- [13] XTech 2005, Tim Bray and Tim O'Reilly
- [14] Science Commons in Washington 2006
- [15] Towards a Science Commons includes an overview of the basis of Openness in science data.
- [16] http://www.alpsp.org/news/ALPSP-STM-data-accessibility.pdf
- [17] http://drexel-coas-elearning.blogspot.com/2006/09/open-notebook-science.html creation of term
- [18] https://mx2.arl.org/Lists/SPARC-OpenData/Message/34.html
- [19] Review of history and positions by the University of California