HISAT2: Fast Aligner for NGS Data (complete Tutorial)

Renesh Bedre    6 minute read

What is HISAT2?

HISAT2 (hierarchical indexing for spliced alignment of transcripts 2) is a fast and sensitive splice-aware sequence alignment tool for aligning NGS generated DNA and RNA reads to the reference genomes.

For example, in transcriptomics (RNA-seq) analysis, HISAT2 can be used for aligning RNA-seq reads to the genome.

The aligned reads (HISAT2 output) are further used in various bioinformatics downstream analyses such as gene quantification, transcript assembly, and identifying differentially expressed genes and transcripts.

HISAT2 alignment workflow

As compared to STAR, HISAT2 is a memory-efficient aligner that requires relatively low (~6.7 GB) memory for mapping NGS reads to the human genome. Further, HISAT2 is a faster aligner as compared to other competitive aligners such as STAR and Bowtie2.

Besides, HISAT2 is known to have a higher accuracy for aligning the NGS reads containing SNPs.

Tip: HISAT2 is a successor of HISAT and TopHat2 and it is recommended to use HISAT2 over HISAT and TopHat2

Getting started with HISAT2

This tutorial explains computational requirements for HISAT2, how to download and install HISAT2, and how to align NSG reads to the reference genome using HISAT2.

Computational requirements for HISAT2

Though HISAT2 is a memory-efficient aligner, your computer need to have the following specifications to run HISAT2.

  • 64 bit computer with either Linux or Mac OS
  • At least 8 GB RAM
  • Sufficient space on hard drive (recommended 100 to 500 GB based on the input file and genome sizes)
  • Multiple cores for parallel computing (e.g. Xeon processor)

How to download and install HISAT2

If you have not installed the HISAT2 on your computer, you can install it on Linux (e.g. Ubuntu) or Mac OS using the following two preferred methods.

Tip: If you download executables, you should add the HISAT2 directory to the system PATH. This will help to run hisat2 commands without specifying the complete path to the HISAT2 directory.

  • Install using conda
conda install -c bioconda hisat2

Alignment with HISAT2

The HISAT2 workflow for alignment of NGS reads with HISAT2 to the reference genome involves two steps,

  • Building a genome index
  • Aligning the NGS reads to the indexed genome

Building genome index

In this tutorial, I will use the example of Arabidopsis thaliana genome for building the genome index using HISAT2 (v2.2.1). You can download the Arabidopsis thaliana genome sequence (FASTA format) from TAIR.

You can use hisat2-build build command for building genome index as follows,

hisat2-build -p 6 -f Athaliana_TAIR10.fasta Athaliana

hisat2-build will take ~5 min to create genome index. Time to create genome index is depends on the genome size.

Where,

  • -p : numbers of threads for parallel computation
  • -f : reference genome is in FASTA format

The Athalina at the end of the command is a basename for the genome index. Once the genome index is created, the basename will be appended to the indexed genome.

Note: HISAT2 is capable of generating a genome index for any genome size. If the genome size is small (< 4 B nucleotides), it creates a small index (.ht2 extension index files). For larger genome sizes, HISAT2 generates large index (.ht2l extension index files).

Optionally, HISAT2 also uses the gene annotation file (GTF format) to extract the exons and splice sites. This is recommended to use the gene annotation for RNA-seq data analysis.

# extract exons
hisat2_extract_exons.py Athaliana_gene.gtf > Athaliana.exon

# extract splice sites
hisat2_extract_splice_sites.py Athaliana_gene.gtf > Athaliana.ss

Now, build the genome index,

hisat2-build -p 6 --exon Athaliana.exon --ss Athaliana.ss Athaliana_TAIR10.fasta Athaliana

Tip: To build a genome index with gene annotation, the hisat2-build will require larger RAM (200 GB for human size genome) and may take a longer time to run (~1 hour for human genome). As an alternative, you can provide the splice sites file during the alignment stage (hisat2) using the --known-splicesite-infile parameter.

Aligning reads to genome

HISAT2 can be used to align the single and paired-end NGS reads to the reference genome once the genome indices are created.

If you want to align single-end reads, then you should pass -U parameter to hisat2,

hisat2 --phred33 --dta -x Athaliana -U ath_seed_sample_R1.fastq -S alignment.sam

Where,

  • --phred33 : Sequence quality score. Most Illumina sequencers generate sequences in PHRED33 format.
  • --dta : Use this option to output alignments suitable for transcriptome assembly. Mostly used for RNA-seq data analysis.
  • -S: Output alignment to file (SAM format) instead of standard output
  • -x: basename for indexed genome

You can read the HISAT2 manual for controlling other parameters such as output options, mismatch penalties, and input options.

If you want to align paired-end reads, then you should pass -1 and -2 parameter to hisat2,

hisat2 --phred33 --dta -x Athaliana -1 ath_seed_sample_R1.fastq -2 ath_seed_sample_R2.fastq -S alignment.sam

hisat2 will take ~15 min to align the paired-end reads to genome. This can vary based on genome and input file size.

If you have multiple paired-end files (from different samples), you can align them to genome at once,

hisat2 --phred33 --dta -x Athaliana -1 ath_seed_sample_R1.fastq,ath_root_sample_R1.fastq -2 ath_seed_sample_R2.fastq,ath_root_sample_R2.fastq -S alignment.sam

If you have skipped the gene annotation (splice sites and exons) for building genome index (due to memory restrictions), you can provide the gene annotation information such as splice sites during the alignment stage using --known-splicesite-infile parameter.

hisat2 --phred33 --dta -x Athaliana --known-splicesite-infile Athaliana.ss -1 ath_seed_sample_R1.fastq -2 
    ath_seed_sample_R2.fastq -S alignment.sam

HISAT2 alignment output

The HISAT2 writes the alignment in SAM format. You can view and read the SAM files (similar to general text files).

But, most of the downstream bioinformatics analysis such as gene quantification, assembly, differential expression analysis, etc. requires alignment file in BAM format (binary file). BAM file is also useful for storage purposes as it takes less space than SAM file.

Please read this article on how to convert a SAM file into a BAM file.

In addition to alignment in SAM format, hisat2 also writes the summary statistics of alignment on the console as a standard error message.

The following is an example of alignment summary statistics obtained from aligning the paired-end reads.

11889751 reads; of these:
  11889751 (100.00%) were paired; of these:
    1298966 (10.93%) aligned concordantly 0 times
    9865016 (82.97%) aligned concordantly exactly 1 time
    725769 (6.10%) aligned concordantly >1 times
    ----
    1298966 pairs aligned concordantly 0 times; of these:
      74231 (5.71%) aligned discordantly 1 time
    ----
    1224735 pairs aligned 0 times concordantly or discordantly; of these:
      2449470 mates make up the pairs; of these:
        1594292 (65.09%) aligned 0 times
        816748 (33.34%) aligned exactly 1 time
        38430 (1.57%) aligned >1 times
93.30% overall alignment rate

The alignment summary statistics show that the overall alignment rate is 93.30%.

In alignment summary statistics, the indentations are useful to break down the total. For example, 11889751 (total paired reads) is a total of 1298966 (aligned concordantly 0 times), 9865016 (aligned concordantly exactly 1 time), and 725769 (aligned concordantly >1 times).

Enhance your skills with courses on genomics and bioinformatics

References


This work is licensed under a Creative Commons Attribution 4.0 International License

Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.