Parsing and analyzing BAM files

Renesh Bedre    5 minute read

Introduction

  • Binary Alignment/Map (BAM) file (.bam) is a compressed binary format of Sequence Alignment/Map (SAM) file (.sam), which is used for storing the sequence alignment information. BAM file is compressed by the BGZF library and it takes less disk space as compared to text-based SAM file. For example, the 6 GB SAM file can be stored as ~800 MB BAM file.
  • The majority of downstream bioinformatics analyses, including sequence assembly, read quantification, alignment viewer (IGV), and so on, require BAM files.
  • SAMtools, a software package, allows parsing and analyzing the SAM/BAM files. Here, I am using SAMtools v1.15.

How to install SAMtools

Using Source

To install SAMtools, you need to first install HTSlib. Download the latest version of SAMtools (v1.15) and HTSlib from here. You can directly download from the website or using wget,

# download
wget https://github.com/samtools/samtools/releases/download/1.15/samtools-1.15.tar.bz2

# install
tar  xvjf samtools-1.15.tar.bz2
cd samtools-1.15/
./configure
make
make install

Follow the similar steps to install the HTSlib.

Using Git

Using Git you need to first install HTSlib and then SAMtools as below,

git clone --recurse-submodules https://github.com/samtools/htslib
cd htslib
autoreconf -i
./configure 
make 
make install

git clone https://github.com/samtools/samtools
cd samtools
autoheader
autoconf -Wno-syntax
./configure
make
make install

You can use samtools version or which samtools commands to check if you have installed the right version of samtools.

SAMtools utilities

Interconversion of SAM and BAM files

Convert SAM to BAM,

samtools view -bS PC14_L001_R1.sam > PC14_L001_R1.bam

Convert BAM to SAM,

samtools view -h PC14_L001_R1.bam > out.sam

Convert BAM to CRAM,

CRAM is a highly compressed alternative to the BAM file. CRAM file uses indexed reference sequences for the compression. It is fully compatible with BAM. For example, the 800 MB BAM file can be stored as a ~400 MB CRAM file.

samtools view -C -T potato_dm_v404_all_pm_un.fasta PC14_L001_R1_pos_sorted.bam > PC14_L001_R1_pos_sorted.cram

# convert CRAM to BAM (slow as compared to BAM to CRAM)
samtools view -b -T potato_dm_v404_all_pm_un.fasta PC14_L001_R1_pos_sorted.cram > out.bam

Sorting and indexing BAM files

BAM files can be unsorted. Sorted (position-based) BAM files are essential for several Bioinformatics applications such as for StringTie transcript assembly, viewing the alignment using IGV, etc.

# sort BAM file by chromosomal position
# @: number of threads
samtools sort -@ 16 -o PC14_L001_R1_pos_sorted.bam PC14_L001_R1.bam

# sort BAM file by read names (QNAME field)
samtools sort -@ 16 -n -o PC14_L001_R1_read_sorted.bam PC14_L001_R1.bam

# index sorted BAM file (generate .bai file)
# indexing allows fast random access of records of sorted BAM files
samtools index -@ 16 PC14_L001_R1_pos_sorted.bam

You can also use the Picard SortSam command to sort the BAM file by chromosomal position and read name. Read more here

Split BAM file by chromosome

Split the BAM file based on chromosome name

samtools view -b PC14_L001_R1.bam chr00 > chr00.bam

Merge BAM files

Merge multiple BAM files into a single BAM file,

samtools merge merge_out.bam chr00.bam chr01.bam

# add RG tag to each alignment record for source file name (e.g., RG:Z:chr00)
samtools merge -r merge_out.bam chr00.bam chr01.bam

# merge large number of BAM files
samtools merge merge_out.bam *.bam

Calculate sequence coverage or depth

Get sequence coverage or depth from genome mapped data

# get read depth for each position on chromosome
# use -a parameter to get read depth for all positions
samtools depth PC14_L001_R1.bam > read_depth.txt

# get overall read depth
awk '{sum+=$3;} END {print sum/NR;}' read_depth.txt
14.6749

# $3 means read depth at each position of chromosome (third column from read_depth.txt)
# NR means total rows in read_depth.txt [total mapped chromosome size (with -a option, you will get whole chromosome 
# size)]

Extract records from specific regions

Specific region alignment records can be extracted from the BAM file by providing chromosome name and region coordinates

# extract alignment records from chr01 between specific regions
samtools view PC14_L001_R1.bam chr01:1322100-1332100

BAM to FASTA and FASTQ

Convert BAM file into FASTA and FASTQ file format,

# get all mapped reads
# 4 is SAM flag for reads unmapped (0x4)
samtools fasta -F 4 PC14_L001_R1.bam > PC14_L001_R1.fasta

# get unmapped reads
samtools fasta -f 4 PC14_L001_R1.bam > PC14_L001_R1.fasta

# similarly replace fasta with fastq for FASTQ format file

Count total, mapped, and unmapped reads from BAM

Get total number of alignment in a BAM file (mapped and unmapped). This count may also include secondary, supplementary, and duplicate alignments. For paired-end read, both reads are counted together.

# Get total number of alignment
samtools view -c PC14_L001_R1.bam
# output
69815336

Get total mapped reads (includes secondary alignments),

Parameter -F 0x04 or -F 4 (excludes the unmapped reads),

# total mapped reads
samtools view -c -F 4 PC14_L001_R1.bam
# output
66420111

Get total unmapped reads,

Parameter -f 0x04 or -f 4 includes the unmapped reads,

# total unmapped reads
samtools view -c -f 4 PC14_L001_R1.bam
# output
3395225

Get total count of single or paired reads (which is used in FASTQ file for mapping to genome),

For paired-end reads, sum of the counts of both reads is provided

# read counts
samtools view -c -f 1 -F 3328 PC14_L001_R1.bam
# output
62074528

Get primary mapped read counts ,

Parameter -F 0x04,0x100 or -F 260 (excludes unmapped and secondary alignments),

# primary mapped read counts
samtools view -c  -F 260 PC14_L001_R1.bam
# output
58679303

Get complete alignment statistics for each flag,

samtools flagstat PC14_L001_R1.bam
# output
69815336 + 0 in total (QC-passed reads + QC-failed reads)
7740808 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
66420111 + 0 mapped (95.14% : N/A)
62074528 + 0 paired in sequencing
31037264 + 0 read1
31037264 + 0 read2
58679160 + 0 properly paired (94.53% : N/A)
58679160 + 0 with itself and mate mapped
143 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Enhance your skills with courses on genomics and bioinformatics

References

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

If you enhanced your knowledge and practical skills from this article, consider supporting me on

Buy Me A Coffee


This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.