Parsing and analyzing BAM files

Renesh Bedre    3 minute read

  • Binary Alignment/Map (BAM) file (.bam) is a compressed binary format of Sequence Alignment/Map (SAM) file (.sam), which is used for storing the sequence alignment information. BAM file is compressed by the BGZF library and it takes less disk space as compared to text-based SAM file. For example, the 6 GB SAM file can be stored as ~800 MB BAM file.
  • The majority of downstream bioinformatics analyses, including sequence assembly, read quantification, alignment viewer (IGV), and so on, require BAM files.
  • SAMtools, a software package, allows parsing and analyzing the SAM/BAM files. Here, I am using SAMtools v1.10.

Interconversion of SAM and BAM files

Convert SAM to BAM,

samtools view -bS PC14_L001_R1.sam > PC14_L001_R1.bam

Convert BAM to SAM,

samtools view -h PC14_L001_R1.bam > out.sam

Convert BAM to CRAM,

CRAM is a highly compressed alternative to the BAM file. CRAM file uses indexed reference sequences for the compression. It is fully compatible with BAM. For example, the 800 MB BAM file can be stored as a ~400 MB CRAM file.

samtools view -C -T potato_dm_v404_all_pm_un.fasta PC14_L001_R1_pos_sorted.bam > PC14_L001_R1_pos_sorted.cram

# convert CRAM to BAM (slow as compared to BAM to CRAM)
samtools view -b -T potato_dm_v404_all_pm_un.fasta PC14_L001_R1_pos_sorted.cram > out.bam

Sorting and indexing BAM files

BAM files can be unsorted. Sorted (position-based) BAM files are essential for several Bioinformatics applications such as for StringTie transcript assembly, viewing the alignment using IGV, etc.

# sort BAM file by chromosomal position
# @: number of threads
samtools sort -@ 16 -o PC14_L001_R1_pos_sorted.bam PC14_L001_R1.bam

# sort BAM file by read names (QNAME field)
samtools sort -@ 16 -n -o PC14_L001_R1_read_sorted.bam PC14_L001_R1.bam

# index sorted BAM file (generate .bai file)
# indexing allows fast random access of records of sorted BAM files
samtools index -@ 16 PC14_L001_R1_pos_sorted.bam

You can also use the Picard SortSam command to sort the BAM file by chromosomal position and read name. Read more here

Split BAM file by chromosome

Split the BAM file based on chromosome name

samtools view -b PC14_L001_R1.bam chr00 > chr00.bam

Merge BAM files

Merge multiple BAM files into a single BAM file,

samtools merge merge_out.bam chr00.bam chr01.bam

# add RG tag to each alignment record for source file name (e.g., RG:Z:chr00)
samtools merge -r merge_out.bam chr00.bam chr01.bam

# merge large number of BAM files
samtools merge merge_out.bam *.bam

Calculate sequence coverage or depth

Get sequence coverage or depth from genome mapped data

# get read depth for each position on chromosome
# use -a parameter to get read depth for all positions
samtools depth PC14_L001_R1.bam > read_depth.txt

# get overall read depth
awk '{sum+=$3;} END {print sum/NR;}' read_depth.txt
14.6749

# $3 means read depth at each position of chromosome (third column from read_depth.txt)
# NR means total rows in read_depth.txt [total mapped chromosome size (with -a option, you will get whole chromosome 
# size)]

Extract records from specific regions

Specific region alignment records can be extracted from the BAM file by providing chromosome name and region coordinates

# extract alignment records from chr01 between specific regions
samtools view PC14_L001_R1.bam chr01:1322100-1332100

BAM to FASTA and FASTQ

Convert BAM file into FASTA and FASTQ file format,

# get all mapped reads
# 4 is SAM flag for reads unmapped (0x4)
samtools fasta -F 4 PC14_L001_R1.bam > PC14_L001_R1.fasta

# get unmapped reads
samtools fasta -f 4 PC14_L001_R1.bam > PC14_L001_R1.fasta

# similarly replace fasta with fastq for FASTQ format file

References

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License