How to Use bedtools getfasta to Extract DNA Sequences (With Example)

Renesh Bedre

bedtools getfasta is a command-line utility for extracting DNA sequences from a reference FASTA file based on genomic coordinates given in BED, GFF, or VCF format.

The general syntax of bedtools getfasta looks like this:

# default
bedtools getfasta -fi reference.fasta -bed regions.bed -fo output.fasta

# Extract sequences using name column
bedtools getfasta -fi reference.fasta -bed regions.bed -fo output.fasta -name

# Extract strand-aware sequences (minus-strand features are reverse complemented)
bedtools getfasta -s -fi reference.fasta -bed regions.bed -fo output.fasta

Where,

Parameter Description  
-fi Input FASTA file to extract sequences from  
-bed BED, GFF, or VCF file with the regions to extract  
-fo Output file for the extracted sequences (FASTA format)  
-name Use the name column (fourth column of the BED file) in the sequence headers of the output FASTA file  
-s Force strandedness: features on the minus strand (sixth column of the BED file) are reverse complemented, and the strand is added to the header in the output FASTA file  

In addition to the above parameters, bedtools getfasta has several other parameters for extracting sequences from the reference FASTA file.


The following examples demonstrate how to use bedtools getfasta to extract DNA sequences and other information from the FASTA file.

Extract sequences using a BED file (default behavior)

The following example shows how to use bedtools getfasta to extract DNA sequences from the genomic coordinates provided in the BED file.

The first three columns in BED format are chr, start, and end (BED3 file).

# input sequence
head reference.fasta

>chr1
ATGGCCTTAAATTTTAAA

# input BED file (three columns)
head regions.bed

chr1    4   7

# extract the sequence based on BED genomic coordinates 
bedtools getfasta -fi reference.fasta -bed regions.bed -fo output.fasta

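The above command extracts the sequence from chr1 for the interval 4 to 7. BED coordinates are 0-based and half-open (the start is counted from 0 and the end position is not included), so this interval covers the 5th through 7th bases of chr1.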

The output (extracted sequence) will be saved in output.fasta. By default, the sequence name in the output FASTA file will be written as chr:start-end.

head output.fasta

>chr1:4-7
CCT
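
Because BED starts count from 0 and the end is excluded, it can help to reproduce a slice by hand when an extracted sequence looks shifted by one base. The following is only a sketch using standard shell tools, not bedtools itself: the bed_slice function name is made up for illustration, and it assumes the sequence is passed as a single string without a FASTA header.

```shell
# Hypothetical helper (not part of bedtools): slice a sequence string
# with BED-style coordinates (0-based start, end excluded).
bed_slice() {
    # cut counts from 1 and includes both ends, so only the start shifts by one
    printf '%s\n' "$1" | cut -c "$(( $2 + 1 ))-$3"
}

bed_slice ATGGCCTTAAATTTTAAA 4 7   # prints CCT
```

Here the BED interval 4 to 7 becomes characters 5 to 7 for cut, which matches the CCT reported in output.fasta.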

Troubleshooting Tip: The sequence name in the BED file’s first column must exactly match the sequence name in the reference FASTA file. The BED file must be TAB separated. FASTA and BED files should have Unix line breaks (convert them with the dos2unix command if needed).
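
The three checks in this tip can be scripted and run before bedtools. The following is a minimal sketch with grep and awk, not a bedtools feature: the check_inputs function name is hypothetical, the demo writes throwaway copies of the tutorial files to a temporary directory, and the TAB check would also flag blank or comment lines in a BED file.

```shell
# Hypothetical pre-flight check (not a bedtools feature).
# Usage: check_inputs reference.fasta regions.bed
check_inputs() {
    fasta=$1; bed=$2; status=0
    cr=$(printf '\r')

    # 1. Windows (CRLF) line endings in either file
    for f in "$fasta" "$bed"; do
        if grep -q "$cr" "$f"; then
            echo "CRLF line endings in $f (try dos2unix)"
            status=1
        fi
    done

    # 2. Every BED line needs at least three TAB-separated columns
    if awk -F'\t' 'NF < 3 { bad = 1 } END { exit !bad }' "$bed"; then
        echo "lines with fewer than 3 TAB-separated columns in $bed"
        status=1
    fi

    # 3. Every name in BED column 1 must match a FASTA header
    #    (the first word after ">")
    missing=$(awk 'NR == FNR { if ($1 ~ /^>/) seen[substr($1, 2)] = 1; next }
                   !($1 in seen) { print $1 }' "$fasta" "$bed" | sort -u)
    if [ -n "$missing" ]; then
        echo "names in $bed missing from $fasta:" $missing
        status=1
    fi
    return "$status"
}

# Demo on throwaway copies of the tutorial files
tmp=$(mktemp -d)
printf '>chr1\nATGGCCTTAAATTTTAAA\n' > "$tmp/reference.fasta"
printf 'chr1\t4\t7\n' > "$tmp/regions.bed"
check_inputs "$tmp/reference.fasta" "$tmp/regions.bed" && echo "inputs look fine"
```

The function returns a non-zero status when any check fails, so it can gate a bedtools getfasta call in a larger script.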

You can also use seqtk subseq or Python to extract sequences from specific regions of a FASTA file.

Extract sequences using a BED file (assign the name column value to the sequence header)

If you use the -name parameter with bedtools getfasta, it will assign a sequence header based on the value in the name column in the BED file.

To use the -name parameter of bedtools getfasta, you should have a BED file with four columns.

The first four columns in BED format are chr, start, end, and name.

# input sequence
head reference.fasta

>chr1
ATGGCCTTAAATTTTAAA

# input BED file (four columns)
head regions.bed

chr1    4   7   geneA

# extract the sequence based on BED genomic coordinates 
bedtools getfasta -fi reference.fasta -bed regions.bed -fo output.fasta -name

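The output (extracted sequence) will be saved in output.fasta. The sequence name in the output FASTA file will be written as geneA (taken from the name column because of the -name parameter). Note that the header format depends on the bedtools version: newer releases write the name followed by the coordinates (for example, geneA::chr1:4-7) with -name and provide a -nameOnly parameter to write the name alone.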

head output.fasta

>geneA
CCT
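
Depending on the bedtools version, the header written with -name may include the coordinates after the name (for example, >geneA::chr1:4-7) instead of the name alone. Where the -nameOnly parameter is not available, one option is to trim the headers afterwards. The following is a sketch under that assumption; strip_coords is a hypothetical helper, and it assumes the names themselves do not contain "::".

```shell
# Hypothetical helper: reduce ">name::chr:start-end" headers to ">name".
# A trailing strand such as "(+)" is kept; headers without "::" and
# sequence lines pass through unchanged.
strip_coords() {
    sed '/^>/ s/::[^(]*//'
}

printf '>geneA::chr1:4-7\nCCT\n' | strip_coords
# prints:
# >geneA
# CCT
```

On real data this would be used as strip_coords < output.fasta > renamed.fasta, where renamed.fasta is just an example file name.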

Extract sequences using a BED file (with strand information)

You can use the -s parameter with bedtools getfasta to make the extraction strand-aware: sequences for features on the minus strand are reverse complemented, and the strand is added to the sequence header in the output FASTA file.

To use the strand information, you need a six-column BED file (BED6). The sixth column in the BED file holds the strand (+ or -).

# input sequence
head reference.fasta

>chr1
ATGGCCTTAAATTTTAAA

# input BED file (six columns)
head regions.bed

chr1    4   7   geneA   0   +
chr1    9   11  geneB   0   -

# extract the sequence based on BED genomic coordinates 
bedtools getfasta -s -fi reference.fasta -bed regions.bed -fo output.fasta -name

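The output (extracted sequence) will be saved in output.fasta. The sequence name and strand in the output FASTA file will be written as geneA(+). Because geneB is on the minus strand, its forward-strand bases (AA) are reverse complemented to TT.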

head output.fasta

>geneA(+)
CCT
>geneB(-)
TT
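
The minus-strand result can be checked by hand with standard shell tools. This is a sketch rather than what bedtools runs internally: revcomp is a hypothetical helper, and it assumes the rev utility is installed and that the sequence contains only A, C, G, T (other characters such as N are left as they are).

```shell
# Hypothetical helper: reverse-complement a DNA string.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# geneB is chr1 9 11 on the minus strand: BED positions 9 and 10
# are characters 10 and 11 for cut (1-based, inclusive)
fwd=$(printf '%s\n' ATGGCCTTAAATTTTAAA | cut -c 10-11)
echo "$fwd"      # prints AA (forward strand)
revcomp "$fwd"   # prints TT (as reported for geneB)
```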


This work is licensed under a Creative Commons Attribution 4.0 International License
