Canu: Genome Assembly for PacBio and Nanopore Long-Reads (Detailed Guide)

Renesh Bedre

What is Canu?

Long-read sequencing using Pacific Biosciences (PacBio) or Oxford Nanopore technologies revolutionized the generation of reference quality genomes, especially for large and repetitive genomes.

Canu (the successor of Celera Assembler) is a hierarchical de novo assembler for single-molecule sequencing reads that produces more contiguous assemblies of large genomes. It is designed for PacBio and Oxford Nanopore long reads and accounts for their relatively high error rate.

Compared with Celera Assembler, Canu runs faster, requires lower sequencing coverage, and handles genomes with large repeats better.

A Canu assembly pipeline has three main stages: correction (improving the base accuracy of the raw reads), trimming (cutting the corrected reads down to their high-quality portion), and assembly (ordering the trimmed, corrected reads into contigs).

A coverage of 30x to 60x is recommended for eukaryotic genomes. Higher coverage generally gives better assemblies.
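Coverage is the total number of sequenced bases divided by the genome size, so you can check it before starting an assembly. The following is a minimal sketch, not part of Canu: the function name estimate_coverage is hypothetical, and it assumes an uncompressed FASTQ file with exactly four lines per read.

```shell
# Estimate sequencing coverage = total bases / genome size.
# Assumes an uncompressed FASTQ with 4 lines per record
# (sequence on the 2nd line of each record, not wrapped).
estimate_coverage() {
  fastq="$1"        # path to the FASTQ file
  genome_size="$2"  # haploid genome size in bp, e.g. 523000000
  awk -v g="$genome_size" '
    NR % 4 == 2 { bases += length($0) }      # sum the sequence lengths
    END         { printf "%.1f\n", bases / g }
  ' "$fastq"
}
```

For example, calling estimate_coverage pacbio.fastq 523000000 prints the approximate fold coverage for a 523 Mbp genome.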

You can also combine a Canu assembly with short reads to generate a high-quality finished assembly.

Getting started with Canu

This tutorial explains the computational requirements for Canu, how to download and install Canu, and how to assemble long reads (PacBio and Nanopore) using Canu.

Computational requirements for Canu

Canu automatically detects the available resources (memory and cores) on your computer for starting the assembly process. If you have insufficient resources, you may get memory errors.

You may assemble a bacterial genome using 8 GB of memory and 8 cores. But if you want to assemble larger eukaryotic genomes such as humans or other mammals, you may need at least 64 GB of memory and sufficient disk space (3 TB). If the genome is highly repetitive, you may need more disk space.

Tip: Consider using an HPC cluster for assembling large eukaryotic genomes. Bacterial genomes can be assembled on desktop or laptop computers.
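Before launching Canu, it can help to see what your machine actually offers. The snippet below is a rough sketch for Linux only (it relies on nproc and /proc/meminfo). The variable names are illustrative, and the 8 core / 8 GB threshold is simply the bacterial-genome figure mentioned above.

```shell
# Report CPU cores, total memory (GB), and free disk space (GB)
# in the current directory. Linux only.
cores=$(nproc)
mem_gb=$(awk '/^MemTotal:/ { printf "%.0f", $2 / 1024 / 1024 }' /proc/meminfo)
disk_gb=$(df -Pk . | awk 'NR == 2 { printf "%.0f", $4 / 1024 / 1024 }')

echo "cores=${cores} memory=${mem_gb}G free_disk=${disk_gb}G"

# Warn if the machine is below the bacterial-genome figures above
if [ "$cores" -lt 8 ] || [ "$mem_gb" -lt 8 ]; then
  echo "Warning: fewer than 8 cores or 8 GB of memory available"
fi
```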

How to download and install Canu

The easiest way to install Canu is to download the pre-compiled binaries, as shown below:

# download for Linux
curl -L https://github.com/marbl/canu/releases/download/v2.2/canu-2.2.Linux-amd64.tar.xz --output canu-2.2.Linux.tar.xz

# extract
tar -xJf canu-2.2.Linux.tar.xz

# add binaries to PATH (replace the path below with your own Canu location)
export PATH=$PATH:/home/renesh/software/canu-2.2/bin

# check canu version
canu -version
canu 2.2

Once the binaries are added to the PATH, you should be able to see the complete usage with the canu -h command.

Assemble PacBio reads

We will assemble banana PacBio reads. The input reads should be in FASTQ or FASTA format. I have not shared the FASTQ file due to its large size.

By default, Canu will correct, trim, and assemble the reads into contigs.

You can use the following command to generate the Canu assembly:

canu -p banana -d banana_pacbio_out genomeSize=523m -pacbio pacbio.fastq

Where,

| Parameter | Description |
|---|---|
| -p | Assembly prefix. The output files will have this prefix. |
| -d | Output directory in which the assembly files are saved |
| genomeSize | Haploid genome size. 523m means 523 Mbp; use g for Gbp and k for Kbp. If you do not know the exact genome size, use the best approximation. Canu needs this value to estimate the coverage of the input reads. |
| -pacbio | Specifies that the input reads are PacBio long reads; it is followed by the read file |

You can also pass other parameters to Canu to adjust memory use, coverage, and error rates. See the Canu documentation for the full list of parameters.
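As an illustration of such adjustments, the sketch below wraps the same PacBio run in a small shell function and adds a few options. The function name canu_pacbio_tuned is hypothetical, and all values are placeholders rather than recommendations. maxThreads and maxMemory (in GB) cap the resources Canu uses, corOutCoverage limits how much read coverage is kept after correction, and correctedErrorRate sets the allowed overlap error for corrected reads.

```shell
# Run the banana PacBio assembly with explicit resource,
# coverage, and error-rate settings (placeholder values).
canu_pacbio_tuned() {
  reads="$1"   # PacBio reads in FASTQ or FASTA format
  canu -p banana -d banana_pacbio_out \
    genomeSize=523m \
    maxThreads=16 \
    maxMemory=64 \
    corOutCoverage=40 \
    correctedErrorRate=0.045 \
    -pacbio "$reads"
}
```

Call it as canu_pacbio_tuned pacbio.fastq and adjust the values to your machine and data.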

Once Canu has finished assembling the PacBio reads, you should find the following files in the output directory.

| File | Description |
|---|---|
| banana.report | Detailed analysis report, including histograms of read lengths and k-mers, a summary of the corrected data, a summary of overlaps, and a summary of contig lengths |
| banana.correctedReads.fasta.gz | The reads after correction |
| banana.trimmedReads.fasta.gz | The corrected reads after overlap-based trimming |
| banana.contigs.fasta | The full assembly as contigs |
| banana.unassembled.fasta | Unassembled reads and low-coverage contigs |

See the Canu documentation for detailed information about the output files.
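A quick way to judge the contigs file is to compute the number of contigs, the total assembled length, and the N50. The function below is a minimal sketch using awk and sort. It is not part of Canu, and the name assembly_stats is hypothetical.

```shell
# Print contig count, total length, and N50 for a FASTA file.
assembly_stats() {
  # First pass: emit one length per sequence (handles wrapped FASTA)
  awk '
    /^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
    END  { if (len) print len }
  ' "$1" | sort -rn | awk '
    # Second pass: lengths arrive longest first
    { lens[NR] = $1; total += $1 }
    END {
      half = total / 2
      for (i = 1; i <= NR; i++) {
        run += lens[i]
        # N50: length of the contig at which the running sum reaches half
        if (run >= half) { n50 = lens[i]; break }
      }
      printf "contigs=%d total_bp=%.0f N50=%d\n", NR, total, n50
    }'
}
```

For the assembly above, you would run assembly_stats banana_pacbio_out/banana.contigs.fasta.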

Assemble Nanopore reads

We will use banana Nanopore reads as the example for assembly. The input Nanopore reads should be in either FASTQ or FASTA format.

You can use the following command to generate the Canu assembly from Nanopore reads:

canu -p banana -d banana_nanopore_out genomeSize=523m -nanopore nanopore.fastq
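The command above runs correction, trimming, and assembly in one go. The stages can also be run one at a time, which is useful if you want to reuse the corrected reads while trying different assembly settings. The sketch below follows the stage-by-stage pattern shown in the Canu quick start, as far as I understand it. The function name run_canu_stages and the separate assembly directory banana_nanopore_asm are hypothetical.

```shell
# Run the three Canu stages separately for Nanopore reads.
run_canu_stages() {
  reads="$1"                      # raw Nanopore reads (FASTQ or FASTA)
  corr_dir=banana_nanopore_out    # correction and trimming output
  asm_dir=banana_nanopore_asm     # hypothetical directory for the assembly

  # 1. Correction: writes banana.correctedReads.fasta.gz
  canu -correct -p banana -d "$corr_dir" genomeSize=523m \
    -nanopore "$reads" || return 1

  # 2. Trimming: writes banana.trimmedReads.fasta.gz
  canu -trim -p banana -d "$corr_dir" genomeSize=523m \
    -corrected -nanopore "$corr_dir/banana.correctedReads.fasta.gz" || return 1

  # 3. Assembly of the trimmed, corrected reads
  canu -p banana -d "$asm_dir" genomeSize=523m \
    -trimmed -corrected -nanopore "$corr_dir/banana.trimmedReads.fasta.gz"
}
```

Call run_canu_stages followed by the path to your Nanopore reads.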

Summary

In this article, you have learned how to use Canu for de novo genome assembly of PacBio and Nanopore long reads.


This work is licensed under a Creative Commons Attribution 4.0 International License
