Entrez programming utilities for downloading nucleotide and protein sequences from NCBI

Renesh Bedre

  • The Entrez Programming Utilities (E-utilities) are a set of server-side programs that help download various biomedical data, including nucleotide and protein sequences, molecular structures, etc., from the National Center for Biotechnology Information (NCBI) using a programmatic approach.
  • E-utilities access the Entrez system (NCBI's molecular biology database system) to retrieve biomedical data.
  • E-utilities are helpful when we have to download a large number of nucleotide and protein sequences from NCBI, for example all plant protein sequences. The GUI approach to sequence download may not always work as expected when dealing with a large number of sequences.

Download nucleotide or protein sequences based on the GI list

  • If you have a list of nucleotide or protein GenInfo Identifiers (GIs), you can download the sequences in FASTA format using the following program (see original code here)
  • To run the following Perl scripts, you need Perl and the LWP::Simple Perl module installed. Because the scripts request https URLs, the LWP::Protocol::https module is also required
use strict;
use warnings;
use LWP::Simple;

# Download protein records corresponding to a list of GI numbers.
# database to query: 'nucleotide' or 'protein'
my $db = 'protein';
my $id_list = '2026800804,2026800803,2026800802,2026800801,2026800800';

# assemble the epost URL
my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'; # base URL for all E-utility requests
my $url = $base . "epost.fcgi?db=$db&id=$id_list";

# send the epost request (uploads the GI list to the Entrez History server)
my $output = get($url);
die "epost request failed\n" unless defined $output;

# parse WebEnv and QueryKey from the epost XML response
my ($web) = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
my ($key) = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
die "Could not parse WebEnv or QueryKey from the epost response\n"
    unless defined $web && defined $key;

# fetch the sequences in FASTA format (rettype=fasta)
# returned as plain text (retmode=text)
$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url .= "&rettype=fasta&retmode=text";

my $data = get($url);
die "efetch request failed\n" unless defined $data;
print $data;

# save this code in a file and run it with the perl command
  • Download the above code and run it as perl gi_download.pl
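  • The parsing step in the script above can also be wrapped in a small helper that fails loudly when either history value is missing. The following is only a minimal sketch: the function name parse_history and the sample XML are illustrative assumptions, not part of the original code and not an actual NCBI response.

```perl
use strict;
use warnings;

# Hypothetical helper: extract WebEnv and QueryKey from an epost or
# esearch XML response. Dies if either value cannot be found.
sub parse_history {
    my ($xml) = @_;
    die "Empty response from E-utilities\n"
        unless defined $xml && length $xml;
    my ($web) = $xml =~ /<WebEnv>(\S+)<\/WebEnv>/;
    my ($key) = $xml =~ /<QueryKey>(\d+)<\/QueryKey>/;
    die "WebEnv or QueryKey missing from response\n"
        unless defined $web && defined $key;
    return ($web, $key);
}

# Made-up sample response, for illustration only
my $sample = <<'XML';
<ePostResult>
    <QueryKey>1</QueryKey>
    <WebEnv>EXAMPLE_WEBENV_STRING</WebEnv>
</ePostResult>
XML

my ($web, $key) = parse_history($sample);
print "WebEnv=$web QueryKey=$key\n";
```

  • In a real script, the string passed to parse_history would be the value returned by get($url) for the epost request.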

Download a large number of nucleotide or protein sequences

  • E-utilities are helpful for downloading all protein or nucleotide sequences for a particular organism or a whole taxonomic branch
  • See here for generating the query ($query variable) used to retrieve the sequences
  • NCBI recommends running large jobs on weekends or outside office hours (between 9:00 PM and 5:00 AM Eastern time on weekdays)
use strict;
use warnings;
use LWP::Simple;

# database to query: 'nucleotide' or 'protein'
my $db = 'nucleotide';

# download all nucleotide sequences from the plant Arabidopsis thaliana
# avoid spaces in queries; if there are spaces, replace them with a plus sign (+)
my $query = 'txid3702[Organism:noexp]';
# use the following query to download all plant sequences
# $query = 'all[filter]+AND+plants[filter]';

# assemble the esearch URL
my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';  # base URL for all E-utility requests
my $url = $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

# send the esearch request
my $output = get($url);
die "esearch request failed\n" unless defined $output;

# parse WebEnv, QueryKey and Count (number of records matching the query)
my ($web)   = $output =~ /<WebEnv>(\S+)<\/WebEnv>/;
my ($key)   = $output =~ /<QueryKey>(\d+)<\/QueryKey>/;
my ($count) = $output =~ /<Count>(\d+)<\/Count>/;
die "Could not parse the esearch response\n"
    unless defined $web && defined $key && defined $count;

# open the output file for writing
# all sequences will be saved in this file
open(my $out, '>', 'Athaliana.fasta') or die "Can't open file: $!\n";

# retrieve data in batches of 500 Entrez Unique Identifiers (UIDs)
# efetch accepts a retmax of up to 10,000 records per request
my $retmax = 500;
for (my $retstart = 0; $retstart < $count; $retstart += $retmax) {
    my $efetch_url = $base . "efetch.fcgi?db=$db&WebEnv=$web";
    $efetch_url .= "&query_key=$key&retstart=$retstart";
    $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
    my $efetch_out = get($efetch_url);
    die "efetch request failed at retstart=$retstart\n" unless defined $efetch_out;
    print $out $efetch_out;
}
close $out;

# save this code in a file and run it with the perl command
  • Download the above code and run it as perl bulk_download.pl
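  • The batching logic of the script above can be sketched as two small helpers, which makes it easier to check the offsets before starting a long download and to pause between requests. NCBI asks clients without an API key to send no more than three requests per second. The helper names batch_offsets and efetch_url are hypothetical, and the WebEnv and QueryKey values in the usage lines are placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper: list the retstart offsets needed to cover
# $count records in batches of $retmax.
sub batch_offsets {
    my ($count, $retmax) = @_;
    die "retmax must be positive\n" unless $retmax > 0;
    my @offsets;
    for (my $start = 0; $start < $count; $start += $retmax) {
        push @offsets, $start;
    }
    return @offsets;
}

# Hypothetical helper: build the efetch URL for one batch.
sub efetch_url {
    my ($base, $db, $web, $key, $retstart, $retmax) = @_;
    return $base . "efetch.fcgi?db=$db&WebEnv=$web&query_key=$key"
         . "&retstart=$retstart&retmax=$retmax&rettype=fasta&retmode=text";
}

my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
for my $retstart (batch_offsets(1234, 500)) {
    # placeholder WebEnv and QueryKey; real values come from esearch
    print efetch_url($base, 'nucleotide', 'PLACEHOLDER_WEBENV', 1, $retstart, 500), "\n";
    # in a real download loop, fetch the URL here and then pause,
    # e.g. with sleep 1, to stay under the request rate limit
}
```

  • For 1234 records and a batch size of 500, this prints three URLs with retstart values of 0, 500 and 1000.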

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License