Samtools: Extract Reads from Specific Genomic Regions

Renesh Bedre    2 minute read

In genomics and bioinformatics, samtools is widely used for extracting sequence reads from BAM file that fall within specific genomic regions.

samtools view command can be used as shown below to extract reads from single or multiple regions from the BAM file.

# extract reads from single region
samtools view -b input.bam "chr:start-end"  > regions.bam

# extract reads from multiple regions
samtools view -b -L region.bed input.bam > regions.sam

To extract reads from a BAM file using samtools, you need to first create an index file (.bai) for the BAM file using samtools index input.bam. This index file is necessary for extracting reads within specific genomic regions.

Before we begin, ensure that samtools is installed on your system. If not, you can read this article on installing samtools.

The following examples demonstrate how to extract reads within a specified region from the BAM file using samtools.

Extract reads from single region

If you want to extract the reads from chromosome 1 spanning positions 3000 to 5000, you can use the samtools as:

samtools view -b  input.bam "chr1:3000-5000" > regions.bam

Where, -b parameter specify the output should be in BAM format

The above command will create a new BAM file regions.bam which will contain the sequence reads that overlap the given region (chr1:3000-5000).

If you want to include header in the output, you can add -h parameter to the above command.

If you want to create an output file in SAM format, you can use the following command.

samtools view input.bam "chr1:3000-5000" > regions.sam

Extract reads from multiple regions

You can extract the sequence reads from the multiple regions using the samtools. You need to provide the multiple region coordinates as a BED file.

Suppose you have multiple region coordinates given in the BED file as below,

# region.bed
chr1    3000    5000
chr10   5000    9000

You can use this BED file to extract the sequences reads that overlaps genomic regions in BED file.

samtools view -b -L region.bed input.bam > regions.bam

Where, -b parameter specifies the output should be in BAM format, -L parameter specifies to extract the reads overlapping genomic region given in BED file.

The above command will create a new file regions.bam which will contain reads overlapping the multiple regions given in the BED file.

If you want to create an output file in SAM format, you can use the following command.

samtools view -L region.bed input.bam > regions.sam

Enhance your skills with courses on genomics and bioinformatics


This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.