Moving away from the release

Introduction

The ENA retired its periodic assembled/annotated sequence release in March 2020. The last release was number 143.

The European Nucleotide Archive (ENA) captures, preserves and presents the world’s nucleotide sequence data. Since 1982 the ENA has made more than 140 individual releases, providing a quarterly snapshot of ENA assembled/annotated sequence data. During this time, changes to the ways in which users access ENA data have led us to develop a portfolio of data access tools, such as our daily FTP products and the ENA Browser API, which are currently offered in parallel to the traditional release. In recent years we have faced growing pressure on the release process in response to increases in data volume, and have also seen the majority of users shift towards our newer services. Our release process has remained largely unchanged for the last two decades, and following an internal review we have concluded that it is no longer viable for us to continue the current release process as part of our presentation portfolio.

New data is already included in the ENA on a continuous basis and distributed daily from our ENA Browser, FTP and RESTful API services. The key change is that we will no longer make an additional separate quarterly release of the assembled/annotated subset of sequences. We will focus our resources on further developing and supporting our continuous distribution presentation products.

Additionally, as part of the release retirement we will no longer be creating cumulative FTP files in the FTP update folders (e.g. http://ftp.ebi.ac.uk/pub/databases/ena/sequence/update/). These cumulative files tracked daily changes in between release cycles and thus cannot continue to be produced sustainably. Release 143 will be the last available in the release folder (available here once released: http://ftp.ebi.ac.uk/pub/databases/ena/sequence/release/), and the update folder will be removed after this last release. Set-based sequences have already been removed from the release and will continue to be added to the FTP in their corresponding folders (e.g. http://ftp.ebi.ac.uk/pub/databases/ena/wgs/public/; http://ftp.ebi.ac.uk/pub/databases/ena/tsa/public/; http://ftp.ebi.ac.uk/pub/databases/ena/tls/public/).

Deprecated: Release vs Update search results

We no longer maintain separately indexed datasets for RELEASE and UPDATE data for Sequence and Coding & NonCoding RNA records. Previously, RELEASE referred to the last ENA release, and UPDATE referred to any records that had been added or modified since the last release.

Following the final release (143) in March 2020, we merged the ‘_release’ and ‘_update’ data types for sequence, coding and non-coding in our Advanced Search service. For example, the data types ‘sequence_release’ and ‘sequence_update’ were replaced with the data type ‘sequence’. This affects users of our API and Browser advanced search services, who will need to use the updated data type endpoints.

The following guide has been created to assist users in moving away from the release. This guide outlines accessing assembled/annotated sequences, guidance on how to identify data based on a last updated timestamp and advice for establishing your own mirroring procedures using our portfolio of other access services.

Accessing assembled/annotated sequences

Assembled/annotated sequences can be obtained from our continuous daily distribution resources, with API, FTP and web browser-based options. For most use cases we would recommend the ENA Browser API as it provides the greatest specificity and flexibility for obtaining a tailored dataset of assembled/annotated sequences for your requirements.

ENA API

Assembled/annotated sequences can be identified and downloaded with our ENA Browser API. The API Swagger interface lists the endpoints, documenting expected parameters and errors.

Examples (we provide curl examples, but you could use wget, a web browser or a REST client):

Obtaining the latest version of a sequence record by accession:

In EMBL format:

curl -X GET "https://www.ebi.ac.uk/ena/browser/api/embl/BN000065"

In FASTA format:

curl -X GET "https://www.ebi.ac.uk/ena/browser/api/fasta/BN000065"

Obtaining a specific version, including suppressed versions, of a sequence record by accession:

In EMBL format:

curl -X GET "https://www.ebi.ac.uk/ena/browser/api/embl/KF961410.1"
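For readers who prefer a REST client over curl, the accession lookups above could be wrapped roughly as follows. This is only a sketch: the names `record_url` and `BROWSER_API` are our own illustrative choices, and it assumes nothing beyond the URL pattern shown in the curl examples.

```python
from urllib.parse import quote

# Base URL taken from the curl examples above.
BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"


def record_url(accession: str, fmt: str = "embl") -> str:
    """Build the Browser API URL for a single record.

    `accession` may carry a version suffix (e.g. 'KF961410.1');
    `fmt` is 'embl' or 'fasta', the two formats shown in this guide.
    """
    if fmt not in ("embl", "fasta"):
        raise ValueError(f"unsupported format: {fmt!r}")
    # Percent-encode anything unexpected in the accession; '.' is left as-is.
    return f"{BROWSER_API}/{fmt}/{quote(accession, safe='')}"


if __name__ == "__main__":
    print(record_url("BN000065"))
    print(record_url("KF961410.1", "embl"))
```

The returned URL can be passed to any HTTP client, for example `urllib.request.urlopen`, or pasted into the curl commands above.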

The ENA Browser API also allows the user to search for multiple assembled/annotated sequence records and download them. This example searches the sequence data type for human data distributed or updated since 18th August 2019, in EMBL format:

curl 'https://www.ebi.ac.uk/ena/browser/api/embl/search?result=sequence&query=tax_eq(9606)%20AND%20last_updated%3E%3D2019-08-18&limit=5' -o embl.txt

or in FASTA format:

curl 'https://www.ebi.ac.uk/ena/browser/api/fasta/search?result=sequence&query=tax_eq(9606)%20AND%20last_updated%3E%3D2019-08-18&limit=5' -o fasta.txt

We have added limits to the above examples to only return 5 records.

If not provided, limit defaults to 100000. To retrieve ALL records matching a query, use limit=0.

You can search using the sequence, coding or noncoding data type endpoints. In general when using the API search it is important to be as specific as possible with your query to save on downloading sequences that you do not require.
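The percent-encoding in the search URLs above is easy to get wrong by hand, so it may help to build the query string programmatically. The sketch below is one possible approach, not an official client; `search_url` is a hypothetical helper name, and the only parameters it uses are those that appear in this guide (`result`, `query`, `limit`, `offset`).

```python
from urllib.parse import quote, urlencode

BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"


def search_url(fmt, result, query, limit=None, offset=None):
    """Build a Browser API search URL.

    fmt:    'embl' or 'fasta'
    result: data type, e.g. 'sequence', 'coding' or 'noncoding'
    query:  unencoded query text, e.g. 'tax_eq(9606) AND last_updated>=2019-08-18'
    """
    params = {"result": result, "query": query}
    if limit is not None:
        params["limit"] = limit
    if offset is not None:
        params["offset"] = offset
    # quote (not quote_plus) encodes spaces as %20; parentheses are kept
    # literal so the output matches the style of the curl examples.
    return f"{BROWSER_API}/{fmt}/search?" + urlencode(params, quote_via=quote, safe="()")


if __name__ == "__main__":
    print(search_url("embl", "sequence",
                     "tax_eq(9606) AND last_updated>=2019-08-18", limit=5))
```

Writing the query in its readable form and encoding it in one place makes it harder to mix up `>=` (%3E%3D) and `<=` (%3C%3D).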

ENA FTP

The release folders, for example the sequence release folder (http://ftp.ebi.ac.uk/pub/databases/ena/sequence/release/), will contain the final release 143, made in March 2020. No further FTP releases will be made after release 143.

ENA Browser

For the majority of use cases we would recommend utilizing the ENA Browser API for obtaining assembled/annotated sequences. However, these are also available to search and download from the ENA Browser.

The ENA Browser provides direct access to sequences by accession, with subsequent options for downloading in EMBL or FASTA format; e.g. see https://www.ebi.ac.uk/ena/browser/view/BN000065

The ENA Browser also provides an Advanced Search for finding appropriate assembled/annotated sequences for download. This feature is also useful for assistance with constructing complex API queries. In particular one could use the graphical interface to construct the query and then export it for command line using the “Copy Curl Request” button.

Detailed guidance on the usage of Advanced Search is available in our Advanced Search documentation, but we make a brief mention here:

  1. Start an advanced search at https://www.ebi.ac.uk/ena/browser/advanced-search
  2. Select an assembled/annotated sequence data type such as ‘sequence’, ‘coding’ or ‘noncoding’
  3. (Recommended) Use the Query builder to be as specific as possible with the available filters to construct a query that will limit the resulting dataset to match your needs. e.g. Key filters include:
    • limiting by date. Database record -> last updated
    • taxon. Taxonomy and related -> NCBI taxonomy.
  4. (Optional) Select the fields you want in the resulting data. By default, the INSDC accession and description are provided.
  5. (Optional) Use inclusion and exclusion lists of accessions to finely alter the returned records.
  6. Once you have run your query you can click the hyperlinks to download the full data files in either EMBL or FASTA format.
  7. (Optional) If desired you can copy your query for command line use with the ENA APIs using the “Copy Curl Request” button.
  8. (Optional) You can save this query for future use by saving it to your Rulespace account using the ‘Save To Rulespace’ button. Please refer to this guide for more information.

Periodic Snapshots & Support API

For sequence, coding and noncoding RNA data, we produce a periodic snapshot which includes all public records at that time point. These are available from FTP. These snapshots differ from the old release approach in the following ways:

  1. They are more frequent. We aim to produce them twice a month.
  2. Release numbers are not updated in the flatfile DT lines.

Assembled/Annotated Sequences

The latest snapshot is available at ftp.ebi.ac.uk/pub/databases/ena/sequence/snapshot_latest/.

snapshot_latest is a symlink that points to the most recent snapshot. This is also listed in the text file snapshot_latest.txt in the parent folder. In this folder, the records are divided into con, expanded_con and std subfolders. The std subfolder contains all dataclasses that are not CON (STD, EST, GSS, PAT etc.). Records are in gzip files, further divided by taxonomic division, with up to 1,000,000 records per file.

Coding & Noncoding RNA Sequences

CDS and NCRNA subproducts from CON & STD (incl. EST, GSS etc.) are treated the same way as Assembled/Annotated Sequences. The latest snapshots are available at

ftp.ebi.ac.uk/pub/databases/ena/coding/snapshot_latest and

ftp.ebi.ac.uk/pub/databases/ena/non-coding/snapshot_latest respectively.

For subproducts from WGS/TSA/TLS sequences, however, the records are made available in a different manner. We group the coding (or non-coding) records from a given WGS set into one file. These files are then grouped into a tar file according to the first three characters of the set name, e.g. ftp.ebi.ac.uk/pub/databases/ena/non-coding/snapshot_latest/wgs/aaa.tar contains the non-coding features from AAAA02, AAAB01 and so on.

Individual set files are also made available separately on FTP. For example, consider the WGS set WYAA01, which includes the individual WGS records WYAA01000001-WYAA01000116. The WGS sequence set for this is available on FTP at ftp.ebi.ac.uk/pub/databases/ena/wgs/public/wya.

Correspondingly, the coding subproducts from sequences WYAA01000001-WYAA01000116 are available together in ftp.ebi.ac.uk/pub/databases/ena/coding/wgs/public/wya with the name WYAA01.cds.gz

Similarly, the noncoding RNA file is available in ftp.ebi.ac.uk/pub/databases/ena/non-coding/wgs/public/wya with the name WYAA01.ncr.gz

So, if you wanted all coding from WGS, you would need to start at the ftp.ebi.ac.uk/pub/databases/ena/coding/wgs/public level, delve into each subfolder and download the *.cds.gz files.
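The folder layout described above can be expressed as a small path-building helper. This is a sketch that generalises from the single WYAA01 example: it assumes that the subfolder is always the lowercased first three characters of the set name and that the file suffixes are always `.cds.gz` and `.ncr.gz`. Set names with a different structure may not follow this pattern, so please check against the FTP listing. The function name `wgs_subproduct_path` is hypothetical.

```python
# FTP root as written in this guide (no scheme prefix).
FTP_ROOT = "ftp.ebi.ac.uk/pub/databases/ena"

# Folder name -> file suffix, as seen in the WYAA01 example above.
SUFFIX = {"coding": "cds", "non-coding": "ncr"}


def wgs_subproduct_path(set_name: str, kind: str) -> str:
    """Return the expected FTP path of a WGS set's coding or non-coding file.

    ASSUMPTION: subfolder = first three characters of the set name, lowercased.
    """
    if kind not in SUFFIX:
        raise ValueError(f"kind must be one of {sorted(SUFFIX)}")
    prefix = set_name[:3].lower()
    return f"{FTP_ROOT}/{kind}/wgs/public/{prefix}/{set_name}.{SUFFIX[kind]}.gz"


if __name__ == "__main__":
    print(wgs_subproduct_path("WYAA01", "coding"))
    print(wgs_subproduct_path("WYAA01", "non-coding"))
```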

Find Deleted (suppressed/killed) Records

For Sequence, Coding & Non-coding, to find deleted record IDs since a given date, call the API as follows:

https://www.ebi.ac.uk/ena/browser/api/deleted/sequence/2020-07-01

https://www.ebi.ac.uk/ena/browser/api/deleted/coding/2020-07-01

https://www.ebi.ac.uk/ena/browser/api/deleted/noncoding/2020-07-01

Find Changed Sets

To get a list of Coding or ncRNA set files that have been added/updated since a given date, without having to check through all the subfolders, we provide an API. Call it as follows.

https://www.ebi.ac.uk/ena/browser/api/changed_sets/coding/2020-07-01

and

https://www.ebi.ac.uk/ena/browser/api/changed_sets/noncoding/2020-07-01
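Both the deleted-records and changed-sets endpoints take a data type and a date in the URL path, so they can be built with one helper. The sketch below only constructs the URLs shown above; the function names are illustrative, and it makes no assumption about the format of the responses.

```python
from datetime import date

BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"

# Data types listed in this guide for each endpoint.
DELETED_TYPES = ("sequence", "coding", "noncoding")
CHANGED_SET_TYPES = ("coding", "noncoding")


def deleted_url(data_type: str, since: date) -> str:
    """URL listing record IDs deleted (suppressed/killed) since `since`."""
    if data_type not in DELETED_TYPES:
        raise ValueError(f"data_type must be one of {DELETED_TYPES}")
    # date.isoformat() yields YYYY-MM-DD, the form used in the examples.
    return f"{BROWSER_API}/deleted/{data_type}/{since.isoformat()}"


def changed_sets_url(data_type: str, since: date) -> str:
    """URL listing coding/ncRNA set files added or updated since `since`."""
    if data_type not in CHANGED_SET_TYPES:
        raise ValueError(f"data_type must be one of {CHANGED_SET_TYPES}")
    return f"{BROWSER_API}/changed_sets/{data_type}/{since.isoformat()}"


if __name__ == "__main__":
    print(deleted_url("sequence", date(2020, 7, 1)))
    print(changed_sets_url("coding", date(2020, 7, 1)))
```

Passing a `date` object rather than a string avoids malformed dates such as 2020-7-1.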

How to identify data based on a last updated timestamp

One common usage of the ENA release was to obtain all assembled/annotated sequence data changed since the last release, either from the entire new release or from the incremental update folders. This can be fully replicated in the ENA Browser API or ENA Browser Advanced Search by using the “last_updated” query filter with a date value.

For the ENA Browser API search endpoint, you can include the ‘last_updated’ filter and provide a date. With the >= operator (URL-encoded as %3E%3D in the examples below) this performs a ‘greater than or equal to’ search, so it returns all records that are new or have been updated from the provided date to the present day. It is recommended that you further customize the query with additional filters (for example taxon or geographic) to avoid unnecessarily downloading data you do not require.

Example in FASTA format

curl 'https://www.ebi.ac.uk/ena/browser/api/fasta/search?result=sequence&query=last_updated%3E%3D2019-08-18&limit=5' -o fasta.txt

or in EMBL format

curl 'https://www.ebi.ac.uk/ena/browser/api/embl/search?result=sequence&query=last_updated%3E%3D2019-08-18&limit=5' -o embl.txt

You can also provide multiple timestamp filters to give a specific from and to date range, rather than all data to this date, for example data for the first 5 days of August 2019:

curl 'https://www.ebi.ac.uk/ena/browser/api/fasta/search?result=sequence&query=last_updated%3E%3D2019-08-01%20AND%20last_updated%3C%3D2019-08-05&limit=5' -o fasta.txt

We have added limits to the above examples to only return 5 records. Use limit=0 to retrieve ALL matching records. You can search using the sequence, coding or non-coding data type endpoints. In general when using the API search it is important to be as specific as possible with your query to save on downloading sequences that you do not require.
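The open-ended and date-range filters above differ only in whether an upper bound is added, which the following sketch makes explicit. `last_updated_query` is a hypothetical helper; it uses only the `last_updated` field and the `AND`, `>=` and `<=` operators that appear in the examples above.

```python
from datetime import date
from typing import Optional
from urllib.parse import quote


def last_updated_query(start: date,
                       end: Optional[date] = None,
                       extra: Optional[str] = None) -> str:
    """Compose an unencoded query string filtering on last_updated.

    start: records updated on or after this date are returned
    end:   optional inclusive upper bound, giving a closed date range
    extra: optional additional filter, e.g. 'tax_eq(9606)'
    """
    parts = []
    if extra:
        parts.append(extra)
    parts.append(f"last_updated>={start.isoformat()}")
    if end is not None:
        if end < start:
            raise ValueError("end date is before start date")
        parts.append(f"last_updated<={end.isoformat()}")
    return " AND ".join(parts)


if __name__ == "__main__":
    q = last_updated_query(date(2019, 8, 1), date(2019, 8, 5))
    print(q)                  # readable form
    print(quote(q, safe=""))  # percent-encoded form for use in a URL
```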

For the ENA Browser advanced search the ‘last_updated’ filter can be included in your query. It is located in the Database record filter section.

Establishing your own release mirroring procedures - Conducting your own release

This section covers the establishment of a mirroring of ENA assembled/annotated sequences without the ENA release. Successful mirroring includes the following concepts:

  • Data provenance: Track the accessions obtained in your mirroring, so that the data can be obtained again in future.
  • Periodic release: Obtain ENA assembled/annotated sequence data from a defined last updated timestamp.
  • Data specificity: By preference use a filtered query to only obtain the data you need, unless you really do need to mirror everything.
  • Recapturing the same data in future: Instructions for you or your users to use a summary file that you create to obtain the same dataset in future.

This equates to utilizing two separate ENA API services:

  • The Data Discovery API to obtain a summary for data provenance.
  • The Browser API to obtain the data most efficiently.

Data provenance

Save the accessions and sequence versions that match your search criteria as a report, which will act as the master document for creating the release. To create such a list, you can query the ENA Portal API with search parameters and save the results to a TSV or JSON file, which you can then use to retrieve the EMBL format or FASTA format records from the ENA Browser API. If you would like to be able to retrieve the currently public versions of the records at a later time, include ‘sequence_version’ in the fields list of your Portal API query. The reason for doing this is to have a fixed list with which you could re-download the same set of records in the future. As records are added, updated or suppressed, the public dataset is regularly changing, and as such you may not get a certain record, or may get a different version of a record, were you to run the same query at a future date.

e.g.

curl 'https://www.ebi.ac.uk/ena/portal/api/search?result=sequence&query=last_updated%3E%3D2019-08-01%20AND%20last_updated%3C%3D2019-08-05&fields=sequence_version,last_updated' -o sequence_report.tsv
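Once the report is saved, it can be turned into a list of versioned accessions for later retrieval. The sketch below assumes the TSV report has a header row with columns named `accession` and `sequence_version`; please check the header of your own report, as this layout is our assumption. The sample row is illustrative and is not real API output.

```python
import csv
import io


def versioned_accessions(report_text: str) -> list:
    """Return 'ACCESSION.VERSION' strings from a TSV provenance report.

    ASSUMPTION: the report has a header row containing the columns
    'accession' and 'sequence_version'.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return [f"{row['accession']}.{row['sequence_version']}" for row in reader]


if __name__ == "__main__":
    # Illustrative content only; in practice read sequence_report.tsv.
    sample = (
        "accession\tsequence_version\tlast_updated\n"
        "KF961410\t1\t2019-08-02\n"
    )
    print(versioned_accessions(sample))
```

Each resulting string has the same form as the versioned accession used in the ‘specific version’ example earlier in this guide (KF961410.1).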

Periodic release and data specificity

Do the above based on your preferred time period for releases and use the last_updated search parameter.

Instructions for verifying changes since you conducted your release

At a future date, you could rerun the same query and save a new version of the report, which can then be compared with the original master report to look for any differences. This is an important step, as you need to be aware of any sequences that have been killed, since these will not appear in the new data acquisition. We are working on an endpoint to which you could upload the original report and get the list of differences as a response.
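Until such an endpoint exists, the comparison can be done locally. The following is a minimal sketch, again assuming TSV reports with `accession` and `sequence_version` columns; the function names and the sample rows in the usage section are illustrative only.

```python
import csv
import io


def load_report(report_text: str) -> dict:
    """Map accession -> sequence_version from a TSV report (assumed columns)."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return {row["accession"]: row["sequence_version"] for row in reader}


def compare_reports(master: dict, latest: dict) -> dict:
    """Summarise differences between the master report and a newer one."""
    old_ids, new_ids = set(master), set(latest)
    return {
        # In the master report but absent from the new one.
        "missing": sorted(old_ids - new_ids),
        # Present only in the new report.
        "added": sorted(new_ids - old_ids),
        # Present in both, but with a different sequence version.
        "changed": sorted(a for a in old_ids & new_ids if master[a] != latest[a]),
    }


if __name__ == "__main__":
    master = load_report("accession\tsequence_version\nAAA00001\t1\nAAA00002\t1\n")
    latest = load_report("accession\tsequence_version\nAAA00002\t2\nAAA00003\t1\n")
    print(compare_reports(master, latest))
```

Note that a record listed as missing has not necessarily been killed: if your query has an upper bound on last_updated, a record that was updated afterwards will also drop out of the results.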

Instructions for obtaining same specific versions of sequences obtained in your release

If the sequence version has been captured in the report, you could retrieve the same specific versions at any time from the Browser API, except for any that may have been killed.

Using the accession and sequence_version fields from this report, you can then retrieve the specific version of each record from the Browser API in EMBL or FASTA format. If your list is large, retrieving records one at a time is not very efficient. Instead, you could run the exact same query against the Browser API’s search endpoint to retrieve all the matching records in EMBL or FASTA format at once.

e.g.

curl 'https://www.ebi.ac.uk/ena/browser/api/embl/search?result=sequence&query=last_updated%3E%3D2019-08-01%20AND%20last_updated%3C%3D2019-08-05' -o sequences.txt

Either of the above requests can be parallelized by using the offset and limit parameters to get different chunks of the data simultaneously.

curl 'https://www.ebi.ac.uk/ena/browser/api/fasta/search?result=sequence&query=last_updated%3E%3D2019-08-01%20AND%20last_updated%3C%3D2019-08-05&offset=0&limit=100000' -o sequences_1.txt

curl 'https://www.ebi.ac.uk/ena/browser/api/fasta/search?result=sequence&query=last_updated%3E%3D2019-08-01%20AND%20last_updated%3C%3D2019-08-05&offset=100000&limit=100000' -o sequences_2.txt

etc.
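The chunked requests above follow a regular pattern, so the list of URLs can be generated rather than typed. This sketch assumes you already know the total number of matching records (for instance from the number of rows in your provenance report); `chunk_urls` is a hypothetical helper name.

```python
from urllib.parse import quote, urlencode

SEARCH = "https://www.ebi.ac.uk/ena/browser/api/fasta/search"


def chunk_urls(query: str, total: int, chunk: int = 100000) -> list:
    """Return one search URL per chunk of `chunk` records.

    query: unencoded query text
    total: number of matching records (ASSUMED to be known in advance)
    """
    urls = []
    for offset in range(0, total, chunk):
        params = {
            "result": "sequence",
            "query": query,
            "offset": offset,
            "limit": chunk,
        }
        urls.append(SEARCH + "?" + urlencode(params, quote_via=quote))
    return urls


if __name__ == "__main__":
    q = "last_updated>=2019-08-01 AND last_updated<=2019-08-05"
    for url in chunk_urls(q, total=250000):
        print(url)
```

The resulting URLs can then be fetched concurrently, each written to its own output file as in the curl commands above.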

Hint: If in the future you want to only retrieve records that have been added or changed since your last pull, it is important that you record the timestamp from when you run the current query and store this so that you can use it for repeating the process for your next update. Obviously you can now pick an update frequency that most suits your use case.

Hint: If a download is interrupted, you can count the records retrieved so far (e.g. using grep), and then use the offset parameter to get the rest from there onwards. If there is a significant delay between the first and the second call, please be aware that the indexed data may have been updated.
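Counting the records retrieved so far can also be done in a few lines of Python instead of grep. The sketch below relies only on the file formats themselves: an EMBL flat file entry ends with a `//` line, and a FASTA record starts with a `>` header line. It counts complete records conservatively, because the last record in an interrupted download may be truncated. The function name is illustrative.

```python
def complete_records(text: str, fmt: str) -> int:
    """Conservatively count fully downloaded records in a partial file.

    embl:  count entry terminator lines ('//').
    fasta: count '>' headers minus one, since the final record may be cut off.
    """
    lines = text.splitlines()
    if fmt == "embl":
        return sum(1 for line in lines if line.strip() == "//")
    if fmt == "fasta":
        headers = sum(1 for line in lines if line.startswith(">"))
        return max(headers - 1, 0)
    raise ValueError(f"unsupported format: {fmt!r}")


if __name__ == "__main__":
    partial = ">rec1\nACGT\n>rec2\nAC"
    # Pass this value as the offset parameter of the follow-up request.
    print(complete_records(partial, "fasta"))
```

Before appending the resumed download, trim any trailing partial record from the first file so that it is not duplicated. This approach assumes that the order of results is the same in both calls, which may not hold if the index was updated in between.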

More information resources

Further documentation on the above services is available in their respective documentation:

  • ENA Discovery Portal API documentation
  • ENA Browser documentation

Further assistance

If you currently rely on any aspect of the separate assembled/annotated sequence release process for your work or resource, and cannot switch to one of our continuous distribution processes outlined above, please feel free to contact us to discuss your requirements.

In your query please list what features you utilised from the release process. We can discuss your requirements and determine how we might support your use case through one of our existing services or collaborate on an adapted or novel solution. Contacting us promptly with your requirements will allow us to ensure adequate time and resources to collaborate on a solution.

Please contact us with your questions or concerns at https://www.ebi.ac.uk/ena/browser/support with subject ‘ENA release retirement’.
