Entrez programming utilities for downloading the nucleotide and protein sequences from NCBI

Renesh Bedre

  • The Entrez Programming Utilities (E-utilities) are a set of server-side programs that provide programmatic access to biomedical data at the National Center for Biotechnology Information (NCBI), including nucleotide and protein sequences, molecular structures, etc.
  • E-utilities access the Entrez database (molecular biology database system) for downloading biomedical data.
  • E-utilities are helpful for downloading a large number of nucleotide or protein sequences from NCBI, for example all plant protein sequences. Downloading through the web interface (GUI) may not always work as expected with that many sequences.
  • Entrez Direct (EDirect), which accesses the Entrez database through E-utilities, provides an option to download the nucleotide or protein sequences from a Linux/Unix command line.
  • In addition to E-utilities, the ncbi-genome-download Python package can be used specifically to download genome sequences from the NCBI database.
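
As a rough illustration of what "programmatic" means here, every E-utility call is an HTTP request to one base URL followed by the utility name and its parameters. The shell sketch below only assembles such a URL and sends nothing. eutils_url is a hypothetical helper name (not part of E-utilities or EDirect), and it assumes the only special characters in the parameters are spaces.

```shell
# Base URL shared by all E-utility requests (the same value used by the Perl scripts in this post).
EUTILS_BASE='https://eutils.ncbi.nlm.nih.gov/entrez/eutils'

# eutils_url UTILITY KEY=VALUE...
# Hypothetical helper: joins the parameters with '&' and replaces spaces with '+'.
# It does not percent-encode any other characters.
eutils_url() {
    utility=$1
    shift
    query=''
    for pair in "$@"; do
        query="${query:+$query&}$pair"
    done
    printf '%s/%s.fcgi?%s\n' "$EUTILS_BASE" "$utility" "$query" | tr ' ' '+'
}

# Example call (prints the URL only, no request is made):
#   eutils_url esearch db=protein 'term=all[filter] AND plants[filter]' usehistory=y
```

Keeping URL assembly in one place makes it easier to check the query in a browser before starting a large download.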

Download nucleotide or protein sequences based on the GI list

  • If you have a list of nucleotide or protein GenInfo Identifiers (GIs), you can download the sequences in FASTA format using the following program (see original code here)
  • To run the following Perl scripts, you need Perl and the LWP::Simple module installed. Because the requests go to an https URL, the LWP::Protocol::https module is needed as well
use LWP::Simple;

# Download protein records corresponding to a list of GI numbers.
# database to query: 'nucleotide' or 'protein'
$db = 'protein';
$id_list = '2026800804,2026800803,2026800802,2026800801,2026800800';

# assemble the epost URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'; # basic URL to make all E-utility requests
$url = $base . "epost.fcgi?db=$db&id=$id_list";

# post the epost URL
$output = get($url);
# get() returns undef when the request fails
die "epost request failed\n" unless defined $output;

# parse WebEnv and QueryKey
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);

# get the sequences in FASTA (rettype)
# Retrieval mode (retmode) in plain text format
$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url .= "&rettype=fasta&retmode=text";

$data = get($url);
print "$data";

# save this code in a file and run using perl command
  • Download the above code and run as perl gi_download.pl
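
The same two-step flow (epost, then efetch with WebEnv and query_key) can also be driven from a shell. The sketch below covers only the parsing step. xml_field is a hypothetical helper, it uses a simple pattern match like the regular expressions in the Perl script rather than a real XML parser, and the curl calls shown in the comments are an assumption (any HTTP client would do).

```shell
# xml_field TAG : read XML on stdin and print the text of the first <TAG>...</TAG> element.
# Pattern-based, so it only suits flat responses such as the epost/esearch output.
xml_field() {
    sed -n "s:.*<$1>\([^<]*\)</$1>.*:\1:p" | head -n 1
}

# Usage sketch (not run here; needs curl and network access):
#   base='https://eutils.ncbi.nlm.nih.gov/entrez/eutils'
#   response=$(curl -s "$base/epost.fcgi?db=protein&id=2026800804,2026800803")
#   web=$(printf '%s\n' "$response" | xml_field WebEnv)
#   key=$(printf '%s\n' "$response" | xml_field QueryKey)
#   [ -n "$web" ] && [ -n "$key" ] || { echo 'epost failed' >&2; exit 1; }
#   curl -s "$base/efetch.fcgi?db=protein&query_key=$key&WebEnv=$web&rettype=fasta&retmode=text"
```

Checking that both values are non-empty before calling efetch avoids sending a request with a blank WebEnv when epost fails.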

Download a large number of nucleotide or protein sequences

  • E-utilities are helpful for downloading all protein or nucleotide sequences for a particular organism or a whole taxonomic branch
  • See here for generating the query ($query variable) used to retrieve the sequences
  • NCBI recommends running large jobs on weekends or outside office hours (between 9:00 PM and 5:00 AM Eastern Time on weekdays)
use LWP::Simple;

# nucleotide or protein database
$db = 'nucleotide';

# download all nucleotide sequences from Arabidopsis thaliana plant
# avoid spaces in queries. if there are spaces, replace them with a plus sign (+)
$query = 'txid3702[Organism:noexp]';
# use following query to download all plant sequences
# $query = 'all[filter]+AND+plants[filter]';

# assemble the esearch URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';  # basic URL to make all E-utility requests
$url = $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";

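# post the esearch URL
$output = get($url);
# get() returns undef when the request fails
die "esearch request failed\n" unless defined $output;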

# parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);

# open output file for writing
# all sequences will be saved in this file
open(OUT, ">Athaliana.fasta") || die "Can't open file!\n";

# retrieve data in batches of 500 Entrez Unique Identifiers (UIDs)
# efetch accepts a retmax of up to 10,000 records per request
$retmax = 500;
for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
        $efetch_url = $base ."efetch.fcgi?db=$db&WebEnv=$web";
        $efetch_url .= "&query_key=$key&retstart=$retstart";
        $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
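        $efetch_out = get($efetch_url);
        # stop instead of silently leaving a gap in the output file
        die "efetch failed at retstart=$retstart\n" unless defined $efetch_out;
        print OUT $efetch_out;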
}
close OUT;

# save this code in a file and run using perl command
  • Download the above code and run as perl bulk_download.pl
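
The batching logic in the loop above comes down to stepping retstart from 0 to the record count in increments of retmax. A minimal shell sketch of that arithmetic follows. batch_offsets is a hypothetical helper, and the commented loop assumes curl plus values for $web, $key and $count already parsed from an esearch response. NCBI asks clients without an API key to stay at or below three requests per second, which is why the sketch pauses between batches.

```shell
# batch_offsets COUNT RETMAX : print one retstart value per line, covering COUNT records.
batch_offsets() {
    count=$1
    retmax=$2
    # a batch size of zero would never advance, so refuse it
    [ "$retmax" -gt 0 ] || return 1
    retstart=0
    while [ "$retstart" -lt "$count" ]; do
        printf '%s\n' "$retstart"
        retstart=$((retstart + retmax))
    done
}

# Usage sketch (not run here; needs curl and network access):
#   base='https://eutils.ncbi.nlm.nih.gov/entrez/eutils'
#   for start in $(batch_offsets "$count" 500); do
#       curl -s "$base/efetch.fcgi?db=nucleotide&WebEnv=$web&query_key=$key&retstart=$start&retmax=500&rettype=fasta&retmode=text" >> Athaliana.fasta
#       sleep 1   # pause between requests
#   done
```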

Download nucleotide or protein sequences from Linux/Unix command line

  • The EDirect tool (E-utilities for the command line) can be used to download nucleotide or protein sequences programmatically from the command line. It works on Linux and macOS. On Windows, it can be used through Cygwin (a Unix-like environment).
  • EDirect provides navigation functions (esearch, elink, efilter, and efetch) to download sequences from NCBI’s sequence databases.
  • esearch performs an Entrez search for a query in a given database, elink looks up records in other databases that are associated with the results, efilter narrows the results, and efetch downloads the records in a specified format. You can combine these commands using Unix pipes (|).
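
As a hedged sketch of how these commands chain together, the function below pipes esearch into efilter and then efetch. filtered_fasta is a hypothetical wrapper name, and the query and filter terms in the example call are illustrative only. It assumes EDirect is installed and on the PATH.

```shell
# filtered_fasta QUERY FILTER : search nuccore, narrow the hits with a second
# Entrez query, and print the remaining records in FASTA format.
filtered_fasta() {
    esearch -db nuccore -query "$1" |
        efilter -query "$2" |
        efetch -format fasta
}

# Example call (not run here; needs EDirect and network access):
#   filtered_fasta 'GIGANTEA[Gene] AND txid3702[Organism]' 'biomol_mrna[PROP]' > gigantea_mrna.fasta
```

Each stage passes the search history to the next one through the pipe, so no identifiers need to be stored in intermediate files.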

Check how to install EDirect

Download nucleotide sequences using gi and GenBank accession in FASTA format,

# using gi
esearch -db nuccore -query 6002679 | efetch -format fasta

# using GenBank accession
esearch -db nuccore -query AF105064.1 | efetch -format fasta

# both commands will return same output 
>AF105064.1 Arabidopsis thaliana GIGANTEA (GI) mRNA, complete cds
CAGGGTTTAGCTGTTTGATTCAGCTTCGATTTAGTGTACAGTGTGTTGATTAGTATAAAAAGGATTTAAA
.
.

Download protein sequences using a GenBank accession (a GI number works the same way),

esearch -db protein -query AAF00092.1 | efetch -format fasta

# output 
>AAF00092.1 GIGANTEA [Arabidopsis thaliana]
MASSSSSERWIDGLQFSSLLWPPPRDPQQHKDQVVAYVEYFGQFTSEQFPDDIAELVRHQYPSTEKRLLD
.
.

Download nucleotide sequences in GenBank (gb) format,

esearch -db nuccore -query AF105064.1 | efetch -format gb

# output
LOCUS       AF105064                4001 bp    mRNA    linear   PLN 01-OCT-1999
DEFINITION  Arabidopsis thaliana GIGANTEA (GI) mRNA, complete cds.
ACCESSION   AF105064
.
.

Download protein sequences associated with nucleotide accessions,

# using GenBank accession
esearch -db nuccore -query AF105064.1 | elink -target protein | efetch -format fasta

# output
>AAF00092.1 GIGANTEA [Arabidopsis thaliana]
MASSSSSERWIDGLQFSSLLWPPPRDPQQHKDQVVAYVEYFGQFTSEQFPDDIAELVRHQYPSTEKRLLD
.
.

Get SRA run accessions associated with a BioSample accession,

esearch -db sra -query SAMN07304757 | efetch -format runinfo | cut -f1 -d','
# output
Run
SRR5790106
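
The cut command above relies on Run being the first column of the runinfo table. A small sketch that selects a column by its header name instead is shown below. csv_column is a hypothetical helper, and it assumes no field contains a quoted comma.

```shell
# csv_column NAME : read CSV on stdin and print the values of the column whose header is NAME.
# The header line itself and empty values are skipped.
csv_column() {
    awk -F',' -v name="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col && $col != "" { print $col }
    '
}

# Example call (not run here; needs EDirect and network access):
#   esearch -db sra -query SAMN07304757 | efetch -format runinfo | csv_column Run
```

Selecting by name also makes it easy to pull other runinfo fields without counting columns.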

Download genome sequences using ncbi-genome-download

The ncbi-genome-download Python package provides various options for downloading genome sequences from the NCBI RefSeq database

Install ncbi-genome-download

pip install ncbi-genome-download

Download the Arabidopsis thaliana genome sequence using the organism name,

ncbi-genome-download --genera "Arabidopsis thaliana" plant 

# multiple plant species
ncbi-genome-download --genera "Arabidopsis thaliana,Sorghum bicolor" plant 

Download the Arabidopsis thaliana genome sequence using NCBI taxonomy ID (3702)

ncbi-genome-download -t 3702 plant  

Download multiple genome sequences [Arabidopsis thaliana (3702) and Sorghum bicolor (4558)]

ncbi-genome-download -t 3702,4558 plant  

By default, genome sequences will be saved in GenBank format. To save in FASTA format,

ncbi-genome-download -t 3702,4558 -F fasta plant  

Download all plant genome sequences,

ncbi-genome-download -F fasta plant  

Download all plant genome sequences with complete genome assemblies,

ncbi-genome-download -F fasta --assembly-levels complete plant  

Download all plant genome sequences with complete and chromosome-level genome assemblies,

ncbi-genome-download -F fasta --assembly-levels complete,chromosome plant  
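
Because commands like the ones above can fetch a very large number of files, it may help to preview the selection first. The sketch below wraps the tool's --dry-run option, which only checks which files would be downloaded. preview_genomes is a hypothetical wrapper name, and the format and assembly levels are fixed to the values used in this post.

```shell
# preview_genomes TAXIDS GROUP : list what would be downloaded, without downloading.
preview_genomes() {
    ncbi-genome-download --dry-run -t "$1" -F fasta \
        --assembly-levels complete,chromosome "$2"
}

# Example call (not run here; needs ncbi-genome-download and network access):
#   preview_genomes 3702,4558 plant
```

Once the preview looks right, the same command without --dry-run performs the actual download.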

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License