ENCODE Pilot Phase Data Release Policy

(This is a copy of NHGRI's official ENCODE Data Release Policy (2003-2007))

Data Release Principle and Standard

The NHGRI is committed to the principle of rapid data release to the scientific community. This principle was initially implemented during the Human Genome Project and has been recognized as leading to one of the most effective ways of promoting the use of the human genome sequence to advance scientific knowledge. At a meeting in Ft. Lauderdale co-sponsored by the Wellcome Trust and NHGRI in January 2003, the concept of rapid data release by genomic sequence data producers was reaffirmed, and the attendees strongly recommended applying the practice to other types of data produced by "community resource projects". The attendees recognized, however, that different issues, particularly with respect to data validation, would be involved in the development of appropriate release practices for different types of data. Since they also recognized that sustaining the practice of rapid, prepublication data release by community resources requires that the interests of all involved - including the data producers, data users, and funding agencies - be addressed, they emphasized the need to develop a tripartite system of responsibility. A report summarizing the meeting at Ft. Lauderdale is also available.

The NHGRI has identified the Encyclopedia of DNA Elements (ENCODE) Project, designed to comprehensively identify functional elements in the human genome sequence, as a community resource project. ENCODE has begun as a pilot effort to test and compare methods for the exhaustive identification and validation of functional sequence elements in a limited (~1% or 30 Mb) amount of the human genome. In practice, the ENCODE data release policy will be affected by two important considerations: (1) several different data types will be generated, as a variety of experimental approaches will be taken in the Project to identify functional sequence elements, and (2) the criteria for validation for each data type, which will vary, need to be taken into account in developing appropriate data release standards for each data type.

At the outset of the project, the ENCODE Consortium considers it relevant to distinguish between data verification and data validation. 'Data verification' is understood to refer to assessing the reproducibility of an experiment, while 'data validation' is understood to refer to confirmation by other, independent methods. As outlined below, the Consortium believes that early deposit of data in public databases is important, and that this should happen as soon as the data are verified, even if they have not yet been validated. The Consortium is attempting to identify, for each data type, a minimal verification standard necessary for public release. The Consortium members will also identify additional levels of validation that will be applied in subsequent analyses of the data or with additional experimentation where appropriate. When possible, estimates of the false positive and false negative rates for the particular experimental approach will be included in the data releases as a measure of data validation. The data will be deposited in public databases, such as GenBank or ENCODE Consortium databases, and will be available for all to use without restriction (see Appendix A).


ENCODE Publication Policy / Intellectual Property Considerations

As recommended at the Ft. Lauderdale meeting for a community resource project, the ENCODE Consortium has published an initial manuscript, a so-called "marker paper", describing the goals of the project, its data release practices, and the publication policies that it intends to follow.

As noted, the main goal of the ENCODE pilot project is to compare the ability of a set of research methods to identify comprehensively all sequence-based functional elements in genomic DNA. Thus, the final product of the Consortium, which it intends to publish in a peer-reviewed journal, is planned to be an overall analysis of the different methods tested by the Consortium members, an annotated version of the full set of selected ENCODE target sequences, with all of the functional elements identified by the Project, and a recommendation for how to expand the ENCODE project to annotate the entire human genome. The Consortium expects to submit this manuscript or manuscripts for publication within six months of the end of the pilot project. In addition to group publication(s), all of the individual research groups in the ENCODE Consortium are free to publish the results of their own efforts in independent publications at any time. In these individual papers, Consortium participants will not be restricted to describing the methods developed for the project, but can and should expand into describing biological insights that arise from their analyses. To facilitate comparison of data between different groups involved in ENCODE, all publications by Consortium members should, when possible, include data on a common reference set of reagents agreed upon by the Consortium, e.g., a common cell line or a common antibody, as applicable.

Users of Consortium data, whether members of the Consortium or not, should be aware of the publication status of the data they use and treat them accordingly. For example, all investigators, including other Consortium members, should obtain the consent of the data producers before using unpublished data in their individual publications. Consortium members will not have privileged access to data from other members of the Consortium. Rather, all data shared by the Consortium members will be obtained from the data that have been released to public databases. Investigators outside of the ENCODE Consortium are free to use the ENCODE Consortium data, either en masse or specific subsets, but are asked to follow the guidelines developed at the Ft. Lauderdale meeting. Specifically, data users should cite the source of the data (referencing the initial ENCODE marker paper) and should acknowledge the data producers from the ENCODE Consortium. In addition, the data users are asked to recognize the interests of the data producers to publish reports on the generation and analysis of their data. The ENCODE data are released to public databases as pre-publication data and remain unpublished until they appear in peer-reviewed publications. Outside investigators who perform an in-depth analysis of data from the ENCODE Consortium and are interested in publishing a report before the data producers do so should discuss their results with the data producer(s) and are encouraged to establish collaborations. However, the ENCODE Consortium members are not required to collaborate with any outside investigators. All investigators, through their roles as journal and grant reviewers, should enforce a high standard of respect for the scientific contribution of the data producers.

This discussion of the ENCODE data release policy has been primarily directed at issues concerning the use of ENCODE data in scientific publications. The intent of the policy is to accelerate the use of the data by the scientific community. To facilitate this goal, the data producers agree not to restrict the use of the data by others while the data users are encouraged to act in a manner that is consistent with this unrestricted access policy. The associated issue of intellectual property as it pertains to the ENCODE data is addressed in Appendix B.


Appendix A: Data Release Standard for the First Level of Verification

The Data Sharing/Release working group has recommended that the ENCODE Consortium establish a well-articulated description of a first-level verification standard for each data type produced by Consortium members: ENCODE labs should release, to an appropriate public database, data obtained in experiments when this standard has been met. In most cases, it is anticipated that additional efforts for further verification and validation of the data will be carried out, but these should not delay the initial release of data. The working group acknowledges that releasing preliminary data may not be the first choice of the data producers. However, on the assumption that such data can be useful to the scientific community, NHGRI has adopted the policy for the ENCODE Project to make such data available in a timely manner. This policy is consistent with the Institute's commitment to rapid data release to the scientific community.

All of the data generated by the ENCODE project will be linked to the human genome sequence. Data from the ENCODE Project that can be directly displayed on the human genome sequence will be stored and delivered by the University of California, Santa Cruz (UCSC) Genome Browser; other Project data will be stored and delivered by the appropriate databases to be coordinated by the NHGRI Genome Technology Branch. All ENCODE data must have the associated information on how the experiment was performed and how the raw data were analyzed to generate the conclusions (i.e., sequence elements) to be displayed. As data are deposited into public databases, individual tracks will be created to display these data on the UCSC Browser. Where applicable, the primary data underlying any sequence elements will be linked directly to the browser track. Participating labs are encouraged to submit their data rapidly even if they conflict with data from other groups. As additional data validations are performed, the investigators can modify the submitted data or even withdraw the data if further tests call into question the validity of the released data. All data will be accompanied by prominent caveats notifying users of the level of verification of the data and that frequent data releases and updates will follow as further validation and analyses are performed.


Appendix B: ENCODE Intellectual Property Issues

Since the inception of the Human Genome Project, NHGRI policy has encouraged the rapid release and ready accessibility of genomic data to the broad research community. A related issue of availability pertains to any intellectual property rights that might be sought by data generators, and the effect that the exercise of such rights has on access to the data.

The Bayh-Dole Act of 1980 provides a statutory mandate to NIH grantees and contractors to seek patent protection, when appropriate, on inventions made using government funds and to license those inventions with the goal of promoting their utilization, commercialization and public accessibility. While the NHGRI has, in accordance with that law, encouraged grantees to seek patent protection for genomic technologies that have been developed with grant funds, the Institute has been concerned about the claims and exercises of those claims in the case of large-scale genomic data sets because of the Institute's belief that broad accessibility to the data is of paramount importance, and that such data are generally pre-competitive, i.e., a considerable amount of work would need to be performed beyond the initial data production to demonstrate utility. For genomic sequence data, for example, NHGRI indicated its opinion that raw data, in the absence of additional experimental biological information, lack demonstrated specific utility and therefore are inappropriate materials for patent filing. The grantees participating in the NHGRI large-scale sequencing program have been monitored for whether they filed patent claims and, to date, none have.

In the case of the HapMap Project, the participants (including the NHGRI grantees) agreed not to file for patents on the bulk data from the Project. However, there was a complication because the raw data produced by the Project (SNPs and individual genotypes) had to be processed to generate the Project's ultimate output (haplotypes). In considering the issue of data release, HapMap participants were concerned about the possibility that researchers outside of the Project could add some of their own data to the raw Project data, develop haplotypes prior to the Project's ability to do so, file patent claims based on the combined data, and then potentially restrict access by others to the HapMap data (a so-called parasitic patent). To deal with this concern, a click-wrap license was imposed on the individual genotype data; to gain access to the data, researchers are required to agree not to restrict the access of others to the data and not to share the data with anyone who has not agreed to the click-wrap license.

In some respects, the cases of genomic sequence data and haplotype data were relatively easy to deal with because the data themselves do not have "utility" (in the patent law sense of the term). As a result, grantees did not express concern about the NHGRI policies on data release. In the case of the ENCODE Project, however, the applicability of this argument is not as obvious. The ENCODE Consortium will include both members funded by NHGRI ENCODE grants and those funded by other sources. The purpose of the ENCODE Project is to generate data that identify or define genomic DNA sequence elements that have biological function, and therefore might be considered to have utility and be able to be patented. Therefore, the use of patents in ways that might restrict access to large amounts or broad categories of data, e.g., all transcription factor binding sites, is an issue that needs to be addressed.

NHGRI's primary interest is to ensure the widespread availability of all information and any inventions that are generated during the ENCODE Project. NHGRI, therefore, encourages all ENCODE data producers to consider placing all information generated from their project-related efforts in the public domain and to address the NIH guidelines on the sharing of research tools (http://www.ott.nih.gov/policy/rt_guide_final.html). In the cases in which the Consortium members elect to exercise their intellectual property rights, NHGRI encourages consideration of maximal use of non-exclusive licensing of patents to allow for broad access and stimulate the development of multiple products. As a criterion for joining the ENCODE Consortium, investigators have agreed to abide by the Project's data release policy.

NHGRI also encourages users of the ENCODE data to act responsibly and share the effort involved in maintaining unrestricted access to the data. Thus, for example, if a data user were to incorporate ENCODE data into an invention, the subsequent license should not restrict the access of others to the ENCODE data. For this purpose, the term "data users" is meant to include both researchers who are members of the ENCODE Consortium and researchers who are not.

The ENCODE pilot phase, during which time data corresponding to only 1% of the human genome will be produced, will provide NHGRI with an opportunity to observe data producer and data user practices with respect to intellectual property and the ENCODE Project. NHGRI grantees are reminded that the grantee institution is required to disclose each subject invention to the Federal Agency providing research funds within two months after the inventor discloses it in writing to grantee institution personnel responsible for patent matters. NHGRI will monitor grantee activity in this area to learn whether or not attempts are being made to patent large amounts of information derived from the ENCODE Project. If, in the future, circumstances arise that convince NIH that additional measures are needed to achieve the goal of widespread access to the results of the Project, the Institute reserves the right to consider a determination of exceptional circumstances to restrict or eliminate the right of parties, under future grants, to elect to retain title. Similarly, NHGRI will monitor the activity of data users to attempt to determine whether access to the ENCODE data is being encumbered by any restrictive licenses. If the policy of reliance on data user responsibility to maintain unrestricted data access is not effective, the NHGRI will consider adopting a click-wrap license similar to that used by the HapMap Project to protect the ENCODE data and to ensure unrestricted access to the use of these data.