Two-pass alignment of RNA-seq reads with STAR

Renesh Bedre    4 minute read

STAR aligner 2-pass tutorial

Background

STAR 1-pass mapping mode aligns RNA-seq reads to the genome and generates alignment (SAM/BAM) and splice junctions (SJ.out.tab) files. The 1-pass mapping mode generates all required data essential for many downstream analyses such as differential gene expression analysis.

However, if your goal is to robustly and accurately identify novel splice junctions for differential splicing analysis and variant discovery, it is highly recommended to use the STAR in 2-pass mode.

In the two-pass mode, the genome indices are re-generated from splice junctions obtained from a 1-pass mode with the usual parameters and then run the mapping step (2-pass mapping). Optionally, you can also bypass genome indices re-generation and directly provide a 1-pass splice junction file during the 2-pass mapping step.

STAR 2-pass mapping workflow

How to run STAR in 2-pass mode?

STAR can run in a 2-pass mode in two ways,

  • By re-generating the genome indices from splice junctions obtained from 1-pass
  • By using splice junctions directly during the mapping step

Re-generating the genome indices

Once you have successfully run the STAR in 1-pass mode and generated splice junctions (SJ.out.tab), you can use these splice junctions for genome index re-generation.

Before using splice junctions, you should filter out the likely false positives splice junctions such as junctions supported by very few reads (e.g. ≤ 2 reads), non-canonical junctions (The 0 value in column5 from SJ.out.tab denotes non-canonical junctions), and annotated junctions (The 1 value in column6 from SJ.out.tab denotes annotated junctions). The annotated junctions should be filtered out as they are already included in the gene annotation (GTF/GFF3 file).

Below is the example splice junctions filtering command (you can modify it as per your requirement). The seed_sampleSJ.out.tab is generated from running STAR in 1-pass mode.

cat seed_sampleSJ.out.tab | awk '($5 > 0 && $7 > 2 && $6==0)' | cut -f1-6 | sort | uniq > SJ_out_filtered.tab

# if you have multiple splice junctions files
cat *.tab | awk '($5 > 0 && $7 > 2 && $6==0)' | cut -f1-6 | sort | uniq > SJ_out_filtered.tab

Code source: https://groups.google.com/g/rna-star/c/Cpsf-_rLK9I

Non-canonical junctions may include a high number of false positive splice junctions and therefore it is recommended to filter them out. But if your study is interested in exploring the non-canonical junctions, you can keep them for 2-pass mapping.

Build genome index (with an additional parameter --sjdbFileChrStartEnd for filtered splice junctions),

# STAR version=2.7.10b
STAR --runThreadN 12 \ 
    --runMode genomeGenerate \ 
    --genomeDir ath_star_index_2pass \ 
    --genomeFastaFiles Athaliana_TAIR10.fasta \
    --sjdbGTFfile Athaliana_gene.gtf \
    --sjdbFileChrStartEnd SJ_out_filtered.tab \
    --sjdbOverhang 149

Once the genome indices are re-built, the single and paired-end RNA-seq reads can be mapped to the reference genome using STAR (2-pass mapping).

Genome mapping for single-end reads,

# STAR version=2.7.10b
# map single end reads to genome
STAR --runThreadN 12 \
     --readFilesIn ath_seed_sample.fastq \
     --genomeDir ath_star_index_2pass \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix seed_sample  \
     --outSAMunmapped Within

Genome mapping for paired-end reads,

# map paired-end reads to genome
STAR --runThreadN 12 \
     --readFilesIn ath_sample_read1.fastq ath_sample_read2.fastq \
     --genomeDir ath_star_index_2pass \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix seed_sample  \
     --outSAMunmapped Within

If the read files are gzip compressed (*.fastq.gz), you can add an additional --readFilesCommand zcat or --readFilesCommand gunzip -c parameter to the above mapping commands.

Direct 2-pass mapping

In this case, you can directly map the RNA-seq reads to the genome without re-building the genome indices (i.e. use genome indices created during 1-pass mode). You need to provide splice junctions obtained from 1-pass during the 2-pass mapping step. Please refer to my previous article on STAR 1-pass mapping for genome index building

Genome mapping for single-end reads,

# STAR version=2.7.10b
# map single end reads to genome
STAR --runThreadN 12 \
     --readFilesIn ath_seed_sample.fastq \
     --genomeDir ath_star_index \
     --sjdbFileChrStartEnd SJ_out_filtered.tab \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix seed_sample  \
     --outSAMunmapped Within

Genome mapping for paired-end reads,

# map paired-end reads to genome
STAR --runThreadN 12 \
     --readFilesIn ath_sample_read1.fastq ath_sample_read2.fastq \
     --genomeDir ath_star_index \
     --sjdbFileChrStartEnd SJ_out_filtered.tab \
     --outSAMtype BAM SortedByCoordinate \
     --outFileNamePrefix seed_sample  \
     --outSAMunmapped Within

After successful 2-pass mapping, the STAR generates several output files (based on the input command) such as aligned BAM , mapping summary statistics, and log files for analyzing the mapping data.

You can refer to my previous article on STAR for detailed descriptions of output files

Enhance your skills with courses on genomics and bioinformatics

References

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com


This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.