ENA and INSDC Policies
The International Nucleotide Sequence Database Collaboration (INSDC) has been an international collaboration between DDBJ, EMBL, and GenBank for over 20 years. Its advisory committee, the International Advisory Committee (IAC), is made up of European, Japanese and US chapters; membership of the European chapter overlaps that of the ENA Scientific Advisory Board (SAB). In 2002, the IAC endorsed and reaffirmed the existing data-sharing policy of the three databases that make up the INSDC, which is stated below.
Individuals submitting data to the international sequence databases managed collaboratively by DDBJ, EMBL, and GenBank should be aware of the following:
- The INSDC has a uniform policy of free and unrestricted access to all of the data records their databases contain. Scientists worldwide can access these records to plan experiments or publish any analysis or critique. Appropriate credit is given by citing the original submission, following the practices of scientists utilising published scientific literature.
- The INSDC will not attach statements to records that restrict access to the data, limit the use of the information in these records, or prohibit certain types of publications based on these records. Specifically, no use restrictions or licensing requirements will be included in any sequence data records, and no restrictions or licensing fees will be placed on the redistribution or use of the database by any party.
- All database records submitted to the INSDC will remain permanently accessible as part of the scientific record. Corrections of errors and update of the records by authors are welcome and erroneous records may be removed from the next database release, but all will remain permanently accessible by accession number.
- Submitters are advised that the information displayed on the Web sites maintained by the INSDC is fully disclosed to the public. It is the responsibility of the submitters to ascertain that they have the right to submit the data.
- Beyond limited editorial control and some internal integrity checks (for example, proper use of INSDC formats and translation of coding regions specified in CDS entries are verified), the quality and accuracy of the record are the responsibility of the submitting author, not of the database. The databases will work with submitters and users of the database to achieve the best quality resource possible.
The INSDC is an outstanding example of success in building an immensely valuable, widely used public resource through voluntary cooperation across the international scientific community. This success has been achieved by following the guidelines and principles outlined above.
Data availability policy
While the INSDC databases hold public data, there are several levels of data availability which control access to these data. See the INSDC Data Availability Policy for full details of INSDC data access and control.
The two main levels of data availability are confidential (before publication) and public (after release).
Confidential data: A data owner can indicate during study/project registration that confidentiality is required until an owner-managed release date or publication in the literature, whichever comes earlier. During the confidential phase, data are not publicly available by any means. If a release date must be extended, the data owner can extend it at any point before the data become public.
Public data: A project is automatically released as public on reaching the specified release date, or earlier if the relevant INSDC accession is cited online or in a publication before that date.
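The release rule described above can be sketched in a few lines of Python. This is only an illustration of the stated policy, not ENA's implementation: the `Project` class, its field names, and the helper functions are all hypothetical, and the sketch assumes that the date of first citation is known.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Project:
    """Hypothetical record; the field names are illustrative, not ENA's schema."""
    release_date: date                  # owner-managed release date
    first_cited: Optional[date] = None  # date the accession was first cited, if any


def effective_release(project: Project) -> date:
    """Release date or first citation, whichever comes earlier."""
    if project.first_cited is not None:
        return min(project.release_date, project.first_cited)
    return project.release_date


def is_public(project: Project, today: date) -> bool:
    """A project is public from its effective release date onwards."""
    return today >= effective_release(project)


def extend_release(project: Project, new_date: date, today: date) -> Project:
    """Extend the release date; only possible while the project is confidential."""
    if is_public(project, today):
        raise ValueError("project is already public; the release date cannot be extended")
    if new_date <= project.release_date:
        raise ValueError("new release date must be later than the current one")
    return Project(release_date=new_date, first_cited=project.first_cited)
```

Under these assumptions, a project with a release date of 1 January 2030 whose accession is cited on 1 June 2029 becomes public on 1 June 2029, and any later attempt to extend its release date is rejected.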
Removing data from the public browser
ENA's general policy is that data released into the public domain should remain public. As the submitter, you need to specify the correct release date when submitting, and send any release date extension request to ENA at least two weeks before the release date. Once the data have been fully released, their availability is managed at ENA, and you must contact us if there is an issue with the public availability of your data.
In particular, please contact us in the event that:
- You realise that your data are incorrect or contaminated and cannot be updated immediately.
- You failed to manage your project release date and your project was released earlier than intended. If so, please give the reason your data require suppression from the browser and provide a new date for the project release.
- You requested Confidential status or an extension to an existing release date, but ENA, or its submission-brokering collaborator, failed to apply the appropriate release date correctly.
- Data are found to have been submitted to the databases without the permission of the rightful owner. This is expected to be extremely rare and requires formal institutional contact with the submitting institution.
In any case where the data have been distributed as public, the INSDC partners cannot exercise any control over the resultant use of the data by third parties, even if the data are subsequently removed from the service.
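As a small aid for planning, the two-week notice period for release date extension requests stated earlier in this section can be checked with simple date arithmetic. The function names below are hypothetical; only the two-week figure comes from the policy text.

```python
from datetime import date, timedelta

# Minimum notice for an extension request, as stated in the policy text.
MIN_NOTICE = timedelta(weeks=2)


def latest_request_date(release_date: date) -> date:
    """Last day on which an extension request still gives two weeks' notice."""
    return release_date - MIN_NOTICE


def request_is_timely(request_date: date, release_date: date) -> bool:
    """True if the request is sent at least two weeks before the release date."""
    return request_date <= latest_request_date(release_date)
```

For a release date of 15 March 2025, `latest_request_date` gives 1 March 2025, so a request sent on 2 March would already be late under this reading of the policy.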
ENA policy relating to compression of submitted data
The European Nucleotide Archive (ENA) is committed to safeguarding the world’s public domain nucleic acid sequencing data into the future.
To provide economically sustainable archiving, the ENA team is actively developing CRAM, a technology for raw sequence read data compression. This technology offers both lossless compression, in which read sequence and per-base quality information are faithfully preserved, and lossy models, in which data are selectively reduced to reach an optimal balance between data preservation and compression.
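The difference between the two modes can be illustrated with a toy example on per-base quality scores. This is not the CRAM algorithm: the lossless step uses generic `zlib` compression, and the lossy step uses a made-up binning table, chosen only to show that a lossless round trip returns the original values while a lossy reduction deliberately discards detail.

```python
import zlib

# Hypothetical quality bins: (lowest score, highest score, representative score).
# Illustrative only; not the binning scheme used by CRAM or by any sequencer.
BINS = [(0, 9, 5), (10, 19, 15), (20, 29, 25), (30, 93, 35)]


def bin_quality(score: int) -> int:
    """Replace a quality score with the representative value of its bin."""
    for low, high, representative in BINS:
        if low <= score <= high:
            return representative
    raise ValueError(f"quality score out of range: {score}")


def lossless_round_trip(qualities: list[int]) -> list[int]:
    """Compress and decompress; every original score is recovered exactly."""
    packed = zlib.compress(bytes(qualities))
    return list(zlib.decompress(packed))


def lossy_reduce(qualities: list[int]) -> list[int]:
    """Reduce each score to its bin; the original values cannot be recovered."""
    return [bin_quality(q) for q in qualities]
```

With fewer distinct values to encode, binned scores generally compress better, which is the trade-off between preservation and compression that the lossy models aim to balance.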
It is our aim with CRAM to provide a flexible technological framework in which data producers, the broad scientific community that consumes ENA data, and funding agencies are empowered to make decisions about the level of compression that can appropriately be applied to different data sets.
ENA does not currently apply CRAM compression on incoming data and will not in the future apply lossy compression on submitted data without prior announcement and prior consultation with principal stakeholders. In addition, for legacy data already submitted and loaded into ENA, we will not seek to apply lossy compression without discussion with data owners.
Users may be aware that we currently preserve original submitted data files. Once data are loaded, these files are redundant with the information integrated into ENA. As such, we have never committed to preserving these submitted files and will, in due course, cease to sustain their storage.
Third party data
Third Party data (TPA) are submitted to the International Nucleotide Sequence Databases as part of the process of publishing biological studies that include the assembly and/or annotation of existing INSDC reads and primary sequences. Publicly accessible TPA data are therefore linked to a publication or publications that document the derivation of the data supported by peer-reviewed scientific evidence.
The ENA Content team review and assist with TPA submissions on a case-by-case basis. Please contact us if you would like to submit a record which fits the above description.
Because TPA records are generated from public INSDC Read or Sequence/Trace data that the submitting group does not own, they are subject to a strict release policy. TPA sequences should be planned for publication in a peer-reviewed journal article that discusses the TPA records unambiguously and covers (re-)annotation, (re-)assembly or a combination of these. Once TPA records have been accepted by the database, they must be cited by accession number in that article.
Soren Brunak, Antoine Danchin, Masahira Hattori, Haruki Nakamura, Kazuo Shinozaki, Tara Matise, Daphne Preuss (2002). Nucleotide Sequence Database Policies. Science 298 (5597): 1333, 15 Nov 2002.