blastp: Command-line Utility for Protein Sequence Search

Renesh Bedre

blastp is a command-line utility from the NCBI BLAST toolkit that performs protein-protein sequence similarity searches using the BLAST algorithm.

blastp compares a query protein sequence against a protein BLAST database to identify homologous protein sequences. If you want to compare a nucleotide sequence against a nucleotide BLAST database, please see the blastn tool.

The general syntax of blastp looks like this:

# basic command
blastp -query query_fasta -db blast_protein_db  -outfmt output_format -out output_file

# command with commonly used advanced options
blastp -query query_fasta -db blast_protein_db -evalue 1e-05 -max_target_seqs 5 \
    -num_threads 10 -outfmt output_format -out output_file


Parameter Description
-query Input protein sequences in FASTA format to search against a protein BLAST database
-db Formatted protein BLAST database. See makeblastdb for creating a formatted BLAST database.
-evalue E-value (expectation value) threshold for reporting hits (default 10). Hits with lower E-values are more statistically significant
-max_target_seqs Maximum number of aligned sequences to keep in the output (default 500). A value of 5 or more is recommended
-num_threads Number of threads (CPU cores) to use for the search (default 1). Using more threads generally speeds up the search, up to the number of available cores
-outfmt Numerical value representing a predefined output format or a custom string specifying the fields you want to include in the BLAST output (default 0, pairwise)
-out Name of the output file where results will be saved

In addition to the frequently used parameters above, you can list all parameters and their usage with the blastp -help command.
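A value for -num_threads is often derived from the number of cores on the machine. The following is only a minimal sketch: it assumes getconf is available, and leaving one core free is an illustrative choice rather than a BLAST recommendation. It prints the resulting command instead of running it, using the placeholder names from the syntax above.

```shell
# Detect the number of online CPU cores; fall back to 1 if detection fails.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

# Assumption: leave one core free for the rest of the system, never going below 1.
threads=$(( cores > 1 ? cores - 1 : 1 ))

# Dry run: print the command that would be executed (remove `echo` to run it).
echo blastp -query query_fasta -db blast_protein_db \
    -num_threads "$threads" -outfmt 6 -out output_file
```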

Note: blastp requires a formatted BLAST database. You can create one with the makeblastdb command, or you can download a preformatted BLAST database from NCBI.
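As a rough sketch of the database step, the commands below write a small FASTA file and format it as a protein database with makeblastdb. The file name target_proteins.fasta and the two toy sequences are made up for illustration; replace them with your own protein sequences. The database is only built if makeblastdb is found on your PATH.

```shell
# Hypothetical toy FASTA file (illustrative sequences only, not real proteins).
cat > target_proteins.fasta <<'EOF'
>toy1
MKLVAGSTWERPLKDNQHGFYACIMKLVAGSTWERPLKDN
>toy2
GSTWERPLKDNQHGFYACIMAAGGSTWER
EOF

# Build a protein database named target_protein_db if BLAST+ is installed.
if command -v makeblastdb >/dev/null 2>&1; then
    makeblastdb -in target_proteins.fasta -dbtype prot -out target_protein_db
else
    echo "makeblastdb not found; install NCBI BLAST+ first" >&2
fi
```

The name passed to -out here is the value you would later pass to blastp with -db.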

The following examples explain how to use blastp on the command line for protein-protein sequence similarity searches.

Let’s say you have an input query protein sequence (input.fasta) and a formatted protein database (target_protein_db).

Run basic blastp command

blastp -query input.fasta -db target_protein_db -outfmt 6 -out blastp_output.txt

The above blastp command compares the protein sequences in input.fasta against the formatted target_protein_db and saves the results in tabular format (-outfmt 6) in the blastp_output.txt file.

The output should look like this:

head -n5 blastp_output.txt
seq1    seq1    100.000 70      0       0       1       70      1       70      5.11e-48        133
seq1    seq2    100.000 25      0       0       13      37      32      56      1.18e-15        51.6
seq1    seq2    100.000 13      0       0       41      53      1       13      0.029   17.3
seq1    seq3    76.744  43      0       1       21      53      1       43      3.31e-12        43.1
seq1    seq3    100.000 13      0       0       7       19      57      69      3.64e-08        32.7

The columns in the output file (with -outfmt 6) are, in order: query ID, target (subject) ID, percentage of identical matches, alignment length, number of mismatches, number of gap openings, query start, query end, target start, target end, E-value, and bit score.
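Because the tabular output is plain tab-separated text, it can be filtered with standard tools. The sketch below copies the five rows shown above into a hypothetical file, sample_hits.tsv, and keeps hits with at least 90% identity (column 3) and an E-value of at most 1e-5 (column 11). The thresholds are arbitrary examples, not recommended cut-offs.

```shell
# Sample rows copied from the output above, converted to tab-separated text.
cat <<'EOF' | tr -s ' ' '\t' > sample_hits.tsv
seq1 seq1 100.000 70 0 0 1 70 1 70 5.11e-48 133
seq1 seq2 100.000 25 0 0 13 37 32 56 1.18e-15 51.6
seq1 seq2 100.000 13 0 0 41 53 1 13 0.029 17.3
seq1 seq3 76.744 43 0 1 21 53 1 43 3.31e-12 43.1
seq1 seq3 100.000 13 0 0 7 19 57 69 3.64e-08 32.7
EOF

# Keep rows with percent identity >= 90 and E-value <= 1e-5.
# Adding 0 forces awk to compare the fields as numbers.
awk -F'\t' '$3+0 >= 90 && $11+0 <= 1e-5' sample_hits.tsv > filtered_hits.tsv

cat filtered_hits.tsv
```

With these sample rows, the hit with an E-value of 0.029 and the hit with 76.744% identity are dropped. The same awk command can be pointed at a real blastp_output.txt file.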

Run blastp command with customized options

blastp -query input.fasta -db target_protein_db -evalue 1e-05  -max_target_seqs 5  -num_threads 10 \
  -outfmt "6 qseqid qlen sseqid slen qstart qend sstart send nident pident length mismatch gaps qcovs evalue bitscore" \
  -out blastp_output.txt

The above blastp command compares the protein sequences in input.fasta against target_protein_db using the given parameter cut-offs and saves the results in tabular format with the customized fields in the blastp_output.txt file.

The output should look like this:

head -n5 blastp_output.txt
seq1    70      seq1    70      1       70      1       70      70      100.000 70      0       0       100     5.11e-48        133
seq1    70      seq2    56      13      37      32      56      25      100.000 25      0       0       36      1.18e-15        51.6
seq1    70      seq3    69      21      53      1       43      33      76.744  43      0       10      66      3.31e-12        43.1
seq1    70      seq3    69      7       19      57      69      13      100.000 13      0       0       66      3.64e-08        32.7
seq2    56      seq2    56      1       56      1       56      56      100.000 56      0       0       100     1.85e-37        105

The columns in the output file correspond, in order, to the fields listed in the -outfmt parameter.
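With the customized fields, a common follow-up is to pick one best hit per query. The sketch below is one possible approach, not part of BLAST itself: it copies the rows shown above into a hypothetical file, sample_custom_hits.tsv, skips self-hits (query ID equal to target ID), and keeps the hit with the highest bit score (column 16) for each query.

```shell
# Sample rows copied from the output above, converted to tab-separated text.
cat <<'EOF' | tr -s ' ' '\t' > sample_custom_hits.tsv
seq1 70 seq1 70 1 70 1 70 70 100.000 70 0 0 100 5.11e-48 133
seq1 70 seq2 56 13 37 32 56 25 100.000 25 0 0 36 1.18e-15 51.6
seq1 70 seq3 69 21 53 1 43 33 76.744 43 0 10 66 3.31e-12 43.1
seq1 70 seq3 69 7 19 57 69 13 100.000 13 0 0 66 3.64e-08 32.7
seq2 56 seq2 56 1 56 1 56 56 100.000 56 0 0 100 1.85e-37 105
EOF

# For each query (column 1), remember the non-self hit with the highest
# bit score (column 16) and print: qseqid, sseqid, pident, qcovs, evalue, bitscore.
awk -F'\t' '
    $1 != $3 && (!($1 in score) || $16+0 > score[$1]) {
        score[$1] = $16 + 0
        best[$1]  = $1 "\t" $3 "\t" $10 "\t" $14 "\t" $15 "\t" $16
    }
    END { for (q in best) print best[q] }
' sample_custom_hits.tsv > best_hits.tsv

cat best_hits.tsv
```

In this sample, seq2 only matches itself, so only seq1 gets a best hit. Note that awk does not guarantee the order of queries in the END loop, so pipe the result through sort if you need a stable order.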

This work is licensed under a Creative Commons Attribution 4.0 International License
