bedtools for genomics analysis

Renesh Bedre    8 minute read

1. bedtools introduction

  • bedtools is an essential command line toolkit for genomic arithmetics (sorting, finding overlap, merging, etc.), file format conversions, coverage calculations, multi-way file comparisons, and sequence manipulation analysis.
  • bedtools works with several input file formats including BED, GFF, VCF, FASTA, and SAM/BAM. The records in BED files must be tab separated.
  • BED (Browser Extensible Data) file contains 12 columns including chr, start, end, name, score, strand, thickstart, thickend, itemRgb, blockCount, blockSizes, and blockStarts. But, bedtools also supports the first three columns (chr, start, end) for common purpose utilities. Check the description of each column here.
  • There are various types of BED files (e.g. BED3, BED6, BED12, etc.) based on the presence of a number of columns. The most common BED3 format contains the first three columns and the BED12 format contains all 12 columns.
  • The genomic coordinates in BED files are 0-based (start position) as opposed to GFF and VCF which contains the coordinates in 1-based format. bedtools adjusts the coordinates accordingly if the input is in GFF or VCF file formats.

2. How to install bedtools

bedtools is available only for Linux and Mac. The easiest way to install bedtools using package managers such as apt-get (Ubuntu) and homebrew (Mac OS)

Learn more about Linux commands for Bioinformatics

2.1 Install on Linux (Ubuntu)

# using package manager
sudo apt update
sudo apt-get install bedtools

# using static binary
wget https://github.com/arq5x/bedtools2/releases/download/v2.30.0/bedtools.static.binary
mv bedtools.static.binary bedtools
chmod a+x bedtools
export PATH=/home/renesh/lin_proj/software:$PATH

2.2 Install on Mac

brew tap homebrew/science
brew install bedtools

# MacPorts
port install bedtools

2.3 Check version

bedtools --version
# output
bedtools v2.30.0

3. bedtools common utilities

3.1 Compare two BED files for overlapping coordinates

bedtools intersect (aka intersectBed) helps you to identify overlapping and non-overlapping genomic features in two BED, GFF, VCF, or BAM files.

bedtools intersect

Find overlapping features between two BED files,

Download files: ath_cds_1.bed, ath_cds_2.bed and ath_cds_3.bed

bedtools intersect -a ath_cds_1.bed -b ath_cds_2.bed
# output
Chr1    5300    5326
Chr1    5439    5630

Reports all genomic coordinates from -a that overlapped with -b file,

bedtools intersect -a ath_cds_1.bed -b ath_cds_2.bed -wa
# output
Chr1    5174    5326
Chr1    5439    5630

Report all non-overlapped genomic coordinates from -a file,

bedtools intersect -a ath_cds_1.bed -b ath_cds_2.bed -v
# output
Chr1    3996    4276
Chr1    4486    4605
Chr1    4706    5095

Report all overlapped and non-overlapped genomic coordinates from -a file (like left outer join)

bedtools intersect -a ath_cds_1.bed -b ath_cds_2.bed -loj
# output
Chr1    3996    4276    .       -1      -1
Chr1    4486    4605    .       -1      -1
Chr1    4706    5095    .       -1      -1
Chr1    5174    5326    Chr1    5300    5326
Chr1    5439    5630    Chr1    5439    5630

3.2 Compare multiple files for overlapping coordinates

bedtools multiinter (aka multiIntersectBed)) helps you to identify overlapping and non-overlapping genomic features in multiple BED, GFF, or VCF files. The input files must be sorted by chromosome and start positions.

Find overlapping features between three BED files,

bedtools multiinter -i ath_cds_1.bed ath_cds_2.bed ath_cds_3.bed -header
# output
chrom   start   end     num     list    ath_cds_1.bed   ath_cds_2.bed   ath_cds_3.bed
Chr1    3996    4276    2       1,3     1       0       1           #  overlapped in first and third bed files
Chr1    4486    4605    1       1       1       0       0
Chr1    4706    5095    2       1,3     1       0       1           #  overlapped in first and third bed files
Chr1    5174    5300    1       1       1       0       0
Chr1    5300    5326    3       1,2,3   1       1       1           #  overlapped in all three bed files
Chr1    5439    5630    2       1,2     1       1       0           #  overlapped in first two bed files
Chr1    7942    7987    1       2       0       1       0
Chr1    8236    8325    1       2       0       1       0
Chr1    8417    8464    1       2       0       1       0
Chr2    5439    5630    1       3       0       0       1

3.3 Calculate the depth and breadth of coverage

bedtools coverage utility helps you to calculate both depth and breadth of coverage between features between two BED, GFF, or VCF files. Basically, it is similar to intersect but give detailed reports on coverage analysis such as how many features from -b file overlapped with -a file (depth of coverage), how many bases are covered (breadth of coverage), and their covered fractions (0 to 1 scale).

bedtools coverage

By default, all features from -a file are reported in output. After each column in -a file, the output contains columns for number of overlapped features from -b file, the number of overlapped bases from -b file, length of feature from -a file, and their covered fractions (0 to 1 scale).

bedtools coverage -a ath_cds_1.bed -b ath_cds_2.bed
Chr1    3996    4276    0       0       280     0.0000000
Chr1    4486    4605    0       0       119     0.0000000
Chr1    4706    5095    0       0       389     0.0000000
Chr1    5174    5326    1       26      152     0.1710526
Chr1    5439    5630    1       191     191     1.0000000

For example, the feature (Chr1: 5174, 5326) from ath_cds_1.bed overlaps with one feature from ath_cds_2.bed, and it’s 17.10% of bases are covered.

3.4 Extract the sequence in FASTA format from BED file

bedtools getfasta (aka fastaFromBed) utility is useful to extract the sequence in FASTA format using the feature coordinates provided in BED, GFF, or VCF files.

bedtools getfasta -fi Athaliana_167_TAIR10.fa -bed at_cds.bed  -fo at_cds.fasta

The output from above command will be saved in at_cds.fasta file

head at_cds.fasta
# output
>Chr1:3631-5899
AATTATTAGATATACCAAACCAGAGAAAACAAATACATAATCGGAGAAATACAGATTACAGAGAGCGAGAGAGATCGACGGCGAAGCTCTTTACCCGGAAACCATTGAAATCGGACG
GTTTAGTGAAAATGGAGGATCAAGTTGGGTTTGGGTTCCGTCCGAACGACGAGGAGCTCGTTGGTCACTATCTCCGTAACAAAATCGAAGGAAACACTAGCCGCGACGTTGAAGTAG
CCATCAGCGAGGTCAACATCTGTAGCTACGATCCTTGGAACTTGCGCTGTAAGTTCCGAATTTTCTGAATTTCATTTGCAAGTAATCGATTTAGGTTTTTGATTTTAGGGTTTTTTT
T
(sequence is truncated for representation)

Note: The input genome FASTA must be indexed. You can index the genome using samtools faidx utility

Learn more about how to extract the DNA sequences from reference FASTA using bedtools getfasta

3.5 File format conversions

bedtools bedtobam (aka bedToBam), bamtobed (aka bamToBed), and bamtofastq are commonly used functions for file format conversions.

Convert BED to BAM format,

Download files: ath_cds.bed

bedtools bedtobam -i ath_cds.bed -g ath_chr1.sizes > ath_cds.bam
# view BAM records
samtools view ath_cds.bam | head -2
# output
cds1    0       Chr1    3997    255     280M    *       0       0       *       *
cds2    0       Chr1    4487    255     119M    *       0       0       *       *

Note: BED file should also contain the 4th name column. In addition, bedtools uses genome file in chromosome coordinates format and not in FASTA format. The genome file should contain chr and size columns separated by a tab.

Convert BAM to BED format,

bedtools bamtobed -i ath_cds.bam > ath_cds_con.bed
# view BED records
head -2 ath_cds_con.bed
# output
Chr1    3996    4276    cds1    255     +
Chr1    4486    4605    cds2    255     +

Convert BAM to FASTQ format,

bedtools bamtofastq -i sample.bam -fq sample.fastq

# BAM with paired-end reads
bedtools bamtofastq -i sample.bam -fq sample_1.fastq -fq2 sample_2.fastq

# create interleaved FASTQ file
bedtools bamtofastq -i sample.bam -fq /dev/stdout -fq2 /dev/stdout > sample_interleaved.fastq

3.6 Calculate GC and AT content

bedtools nuc (aka nucBed) can be used for calculating the GC content for genomic features available in BED, GFF, or VCF files. It requires sequence (FASTA) and genomic features (BED/GFF/VCF).

bedtools nuc -fi Athaliana_167_TAIR10.fa -bed Athaliana_167_gene.gff3 > gc_content.txt

Once you run the above command successfully, the gc_content.txt file contains several records including GC content (11th), AT content (10th column), and other sequence statistics. The output columns are appended to the GFF columns. You can see all detailed output information by using bedtools nuc -h

cut -f1-11 gc_content.txt | head
# output
#1_usercol      2_usercol       3_usercol       4_usercol       5_usercol       6_usercol       7_usercol       8_usercol       9_usercol       10_pct_at       11_pct_gc
Chr1    phytozome9_0    gene    3631    5899    .       +       .       ID=AT1G01010;Name=AT1G01010     0.613927        0.386073
Chr1    phytozome9_0    mRNA    3631    5899    .       +       .       ID=PAC:19656964;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010      0.613927        0.386073
Chr1    phytozome9_0    five_prime_UTR  3631    3759    .       +       .       ID=PAC:19656964.five_prime_UTR.1;Parent=PAC:19656964;pacid=19656964     0.612403        0.387597
Chr1    phytozome9_0    CDS     3760    3913    .       +       0       ID=PAC:19656964.CDS.1;Parent=PAC:19656964;pacid=19656964        0.487013        0.512987
Chr1    phytozome9_0    CDS     3996    4276    .       +       2       ID=PAC:19656964.CDS.2;Parent=PAC:19656964;pacid=19656964        0.551601        0.448399
Chr1    phytozome9_0    CDS     4486    4605    .       +       0       ID=PAC:19656964.CDS.3;Parent=PAC:19656964;pacid=19656964        0.541667        0.458333
Chr1    phytozome9_0    CDS     4706    5095    .       +       0       ID=PAC:19656964.CDS.4;Parent=PAC:19656964;pacid=19656964        0.582051        0.417949
Chr1    phytozome9_0    CDS     5174    5326    .       +       0       ID=PAC:19656964.CDS.5;Parent=PAC:19656964;pacid=19656964        0.588235        0.411765
Chr1    phytozome9_0    CDS     5439    5630    .       +       0       ID=PAC:19656964.CDS.6;Parent=PAC:19656964;pacid=19656964        0.567708        0.432292

The %GC and %AT content for AT1G01010 gene is 38.60% and 61.37% respectively.

3.7 Sort BED files

Sort BED file based on chromosome name and start position,

Download files: ath_cds_4.bed

bedtools sort -i ath_cds_4.bed
# output
Chr1    3996    4276
Chr1    4706    5095
Chr2    5439    5630

Sort BED file based on feature size,

# ASC order of feature size 
bedtools sort -i ath_cds_4.bed -sizeA
# output
Chr2    5439    5630
Chr1    3996    4276
Chr1    4706    5095

# DESC order of feature size
bedtools sort -i ath_cds_4.bed -sizeD
# output
Chr1    4706    5095
Chr1    3996    4276
Chr2    5439    5630

3.8 Split BED file by chromosome

bedtools does not have function to split the BED file by a choromosme. But you can use bioinfokit.HtsAna.split_bed package to split the BED file by chromosome names.

from bioinfokit.analys import HtsAna
HtsAna.split_bed('ath_cds_1.bed')

The split files by each chromosome name will be saved in the same directory

3.8 Merge intervals in a BED file

If there are any overlapping or nearby coordinate intervals in the BED (or VCF or GFF) file, you can merge them into a single interval.

Download example BED file: ath_cds_5.bed

For example, merge a overlapping coordinate intervals,

# use -s parameter if you want merge features from same strand
bedtools merge -i ath_cds_5.bed
# output
Chr1    3996    4276
Chr1    4000    4276
Chr1    4486    4605
Chr1    4706    5095
Chr1    5174    5326
Chr1    5439    5630
Chr1    5670    6680

Merge coordinate intervals with a specified distance,

# merge inbtervals when they are within 60 bp apart
# use -s parameter if you want merge features from same strand
bedtools merge -d 60 -i ath_cds_5.bed
# output
Chr1    3996    4276
Chr1    4486    4605
Chr1    4706    5095
Chr1    5174    5326
Chr1    5439    6680

Enhance your skills with courses on genomics and bioinformatics

References

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com


This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.