How to Determine Strandedness in RNA-seq Data (With Example)

Renesh Bedre    3 minute read

Strandedness in RNA-seq

In order to analyze RNA-seq data, it is crucial to know whether the sequencing data was generated from stranded or unstranded libraries.

Because different RNA-seq tools rely on library preparation protocols, incorrect use of RNA-seq strandedness can lead to false positives in differential gene expression analysis and significant loss in mapping rates to reference genomes.

As an example, the htseq-count command (from HTSeq) requires stranded parameter for counting the reads. By default, the stranded parameter is set to yes (stranded library). The default stranded parameter of the htseq-count function could result in a significant loss of read counts if the RNA-seq data is generated from an unstranded library.

Based on strandedness, the sequencing library could be fr-unstranded, fr-firststrand, and fr-secondstrand.

Library Description
fr-unstranded Orientation of the original mRNA transcripts is not known
read 1 (first sequenced read) and read 2 (second sequenced read) could be sequenced from either sense or antisense strand
fr-firststrand Orientation of the original mRNA transcripts is known
read 1 (first sequenced read) is sequenced from sense strand
read 2 (second sequenced read) is sequenced from antisense strand
fr-secondstrand Orientation of the original mRNA transcripts is known
read 1 (first sequenced read) is sequenced from antisense strand
read 2 (second sequenced read) is sequenced from sense strand

Note: When reads are paired-end, read 1 is left read (R1) and read 2 is right read (R2). Single-end reads result in only one read (read 1).

RNA-seq stranded and 
unstranded library protocol

How to determine the strandedness of RNA-seq data?

The strandedness of RNA-seq data can be determined by comparing the strandedness of reads with that of orientation of the transcripts.

The infer_experiment.py script from the RSeQC package compares the “strandedness of reads” with the known “strandedness of transcripts” to determine strandedness of RNA-seq data. The “strandedness of transcripts” could be obtained from GTF file.

We will use the example of Arabidopsis thaliana RNA-seq data. You can download the genome sequence and gene annotation (GTF format) from Ensembl Plants.

The infer_experiment.py script requires a GTF file in BED format and an alignment file (SAM/BAM format).

Convert GTF to BED format using gtf2bed,

gtf2bed --gtf Arabidopsis_thaliana.TAIR10.56.gtf --bed Arabidopsis_thaliana.TAIR10.56.bed

You can use a STAR or HISAT2 aligner to obtain a alignment file in SAM/BAM format. Please read my articles on STAR and HISAT2 to obtain SAM/BAM file.

Now, run the infer_experiment.py script to determine the strandedness of RNA-seq data,

infer_experiment.py -i alignment.sam -r Arabidopsis_thaliana.TAIR10.56.bed

# output
Reading reference gene model Arabidopsis_thaliana.TAIR10.56.bed ... Done
Loading SAM/BAM file ...  Total 200000 usable reads were sampled


This is PairEnd Data
Fraction of reads failed to determine: 0.0385
Fraction of reads explained by "1++,1--,2+-,2-+": 0.4824
Fraction of reads explained by "1+-,1-+,2++,2--": 0.4792

Based on the output obtained from infer_experiment.py, we can interpret it as below,

Output Interpretation
This is PairEnd Data The RNA-seq data sequenced using paired-end sequencing
Fraction of reads failed to determine: 0.0385 3.85% reads were aligned to genome region where “strandedness of transcripts” could not be determined
Fraction of reads explained by “1++,1–,2+-,2-+”: 0.4824 48.24% of the reads could be explained by “1++,1–,2+-,2-+” (fr-secondstrand)
Fraction of reads explained by “1+-,1-+,2++,2–”: 0.4792 47.92% of the reads could be explained by “1+-,1-+,2++,2–” (fr-firststrand)

A roughly equal number of reads are explained by the “1++,1-,2+-,2-+” and “1+-,1-+,2++,2-“. This concludes that RNA-seq data is unstranded.

If the RNA-seq data is stranded, you could get the majority of the reads explained by either “1++,1–,2+-,2-+” or “1+-,1-+,2++,2–”.

Tip: A majority of your reads are from “1+-,1-+,2++,2–”, which means they were sequenced using the fr-firststrand stranded method. When the majority of your reads come from “1++,1-,2+-,2-+”, the reads were sequenced with fr-secondstrand stranded methods.

Enhance your skills with courses on genomics and bioinformatics


This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.