How to Use seqtk subseq to Extract Sequences from FASTA/FASTQ Files

Renesh Bedre

Seqtk is a lightweight command-line utility developed for fast manipulation of sequences in either the FASTA or FASTQ format.

For example, the seqtk subseq command extracts sequences (complete sequences or subsequences) from FASTA/FASTQ files based on the provided sequence IDs and region coordinates.

The general syntax of seqtk subseq looks like this:

# extract sequences from FASTA
seqtk subseq input.fasta ids.txt > seq_subset.fasta

# extract sequences from FASTQ
seqtk subseq input.fastq ids.txt > seq_subset.fastq

Here, input.fasta and input.fastq are the names of your input FASTA/FASTQ files, and ids.txt contains the list of sequence IDs (one ID per line) to extract from them.

The ids.txt file can also contain sequence IDs together with specific sequence regions, in the same layout as a three-column BED file (sequence ID, start, end).
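If you need to run the same command from a script, it can be wrapped with Python's subprocess module. The following is a minimal sketch, assuming seqtk is available on your PATH; build_subseq_command and run_subseq are hypothetical helper names used only for illustration and are not part of seqtk.

```python
import shutil
import subprocess


def build_subseq_command(input_path, ids_path):
    """Return the argument list for `seqtk subseq <input> <ids>` (no extra flags)."""
    return ["seqtk", "subseq", str(input_path), str(ids_path)]


def run_subseq(input_path, ids_path, output_path):
    """Run seqtk subseq and write its standard output to output_path."""
    if shutil.which("seqtk") is None:
        raise FileNotFoundError("seqtk was not found on PATH")
    with open(output_path, "w") as out:
        # Equivalent to: seqtk subseq input ids > output
        subprocess.run(
            build_subseq_command(input_path, ids_path),
            stdout=out,
            check=True,
        )
```

For example, run_subseq("input.fasta", "ids.txt", "seq_subset.fasta") corresponds to the first command shown above.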

How to install seqtk?: If you don’t have seqtk installed, there are a few ways to install it: 1) using bioconda: conda install -c bioconda seqtk, 2) using brew on macOS: brew install seqtk, or 3) from source: obtain the source code from the GitHub repository and compile it.

The following examples explain how to use seqtk subseq to extract sequences from FASTA/FASTQ files.

Extract sequences from FASTA

For example, if you have the following FASTA file,

cat input.fasta

>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>JAMFTS010000002.1
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG

If you want to extract the sequences of specific genes from the FASTA file based on their sequence IDs, create an ids.txt file that lists the sequence IDs, one per line, as shown below:

cat ids.txt

KU562861.1
MH150936.1
CP097510.1
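A typo in ids.txt is easy to miss in the extracted output, so it can help to check the requested IDs against the FASTA headers before extraction. The sketch below assumes that a sequence ID is the header text after ">" up to the first whitespace; fasta_ids and missing_ids are hypothetical helper names, not seqtk functions.

```python
def fasta_ids(fasta_text):
    """Collect sequence IDs: the header text after '>' up to the first whitespace."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def missing_ids(fasta_text, ids_text):
    """Return requested IDs (first column of each line) that are absent from the FASTA."""
    available = fasta_ids(fasta_text)
    requested = [line.split()[0] for line in ids_text.splitlines() if line.strip()]
    return [seq_id for seq_id in requested if seq_id not in available]
```

To use it on the files above, read them first, for example missing_ids(open("input.fasta").read(), open("ids.txt").read()); an empty list means every requested ID is present.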

Now extract the sequences from input.fasta based on sequence IDs using seqtk subseq:

seqtk subseq input.fasta ids.txt

# output
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGACAGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
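Note that the records in the output above appear in the order of input.fasta, not in the order of ids.txt, and that each sequence is written on a single line. The same behaviour can be imitated in plain Python. This is only an illustrative sketch of ID-based filtering, not seqtk's implementation; read_fasta and subset_by_id are hypothetical names.

```python
def read_fasta(fasta_text):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_by_id(fasta_text, wanted_ids):
    """Return FASTA text for records whose ID is in wanted_ids, in input-file order."""
    wanted = set(wanted_ids)
    out = []
    for header, seq in read_fasta(fasta_text):
        fields = header.split()
        # The ID is the first whitespace-delimited word of the header
        if fields and fields[0] in wanted:
            out.append(f">{header}\n{seq}")
    return "\n".join(out)
```

Using a set for the wanted IDs keeps each lookup constant-time, which matters when ids.txt lists many thousands of sequences.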

Extract subsequences from specific regions of FASTA

seqtk subseq can also extract subsequences from specific regions. For example, suppose you have the following ids.txt file containing the sequence name and the region coordinates (separated by TAB):

cat ids.txt

KU562861.1      1       10
MH150936.1      1       5
CP097510.1      10      20

Now extract the specific region sequences from input.fasta based on sequence IDs and region coordinates using seqtk subseq:

# extract subsequences from the listed regions
seqtk subseq input.fasta ids.txt

>KU562861.1:2-10
GAGCAGGAG
>CP097510.1:11-20
CGGTGTAGTC
>MH150936.1:2-5
AGAA

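seqtk treats the coordinates in ids.txt as BED-style intervals: the start is 0-based and the end is exclusive. The output header reports the same region as 1-based, inclusive coordinates. For example, the line KU562861.1 1 10 yields KU562861.1:2-10, that is, bases 2 through 10 (9 bases in total).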
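The coordinate conversion can be checked with Python slicing, which follows the same convention as BED intervals (0-based start, end-exclusive). This is a small sketch for illustration; extract_region is a hypothetical helper, and the sequence is the first 20 bases of KU562861.1 from the example above.

```python
def extract_region(seq_id, sequence, start, end):
    """Slice a BED-style interval (0-based start, end-exclusive) from sequence.

    Returns the label in 1-based inclusive form together with the extracted bases.
    """
    label = f"{seq_id}:{start + 1}-{end}"
    return label, sequence[start:end]


label, bases = extract_region("KU562861.1", "GGAGCAGGAGAGTGTTCGAG", 1, 10)
print(label, bases)  # KU562861.1:2-10 GAGCAGGAG
```

Only the start is shifted by one in the label: because the BED end is exclusive, it equals the 1-based position of the last extracted base.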

Similarly, you can also use bedtools getfasta or Python for extracting the sequences from specific regions of the FASTA file.

Extract sequences from FASTQ

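For example, if you have the following FASTQ file (only the first four reads are shown here; the file also contains the read SRR22309490.5 used below),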

cat input.fastq

@SRR22309490.1 1 length=101
CTGTTTTGTCTATTTTTGTTTGGTGCATTAGCTCCAATTGTGAACGTTAATTATGGAGGAATTAGTGGTGCTTTTTATGGGAACTATAGATCTAATTATAT
+SRR22309490.1 1 length=101
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@SRR22309490.2 2 length=101
ACCGTATATGTTTTCTATGTTCTCCACCGCAACATACTCTCCTTGTGAGAGTTTAAAGATATTCTTCTTCCTGTCAATTATCTTCATGCTTCCATCTGGTT
+SRR22309490.2 2 length=101
<AAF<J7<<JJJJJJJJFJFF<FJFFJJJJJJJJJJJFJ-FJJFJJJJJJJJJJJFJJF<FJJJJJJJJFJJJJJJJJJJFFJJFFAJJFJFFJJ<FF-FA
@SRR22309490.3 3 length=101
CTCCACTACTATCTCTTCTTCTTTGGAATATCTCCACGGAAAATCATCTTCACAAAAGCGAGATATTCCATTATCGCACCAAAAGTGTCTATGTGAACCCA
+SRR22309490.3 3 length=101
AAAFA7AJFJ<FJ<<FFJJJJJJJJJJJJJJJAJAJJJFJJJJJJJJJJJJJJJJJAF-JJF<FFJJJJJJAFJJJFJFJJJJJJ<<AJJJJJF<A<FAJJ
@SRR22309490.4 4 length=101
CCATGACCTTGGATACAACTTGCCTAGTGGGTCATGGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCCGTATCTCGTATGCCGTCTTCTGCT
+SRR22309490.4 4 length=101
A<AFFJJFJJFJJAFFFJJJJAJAJJJJJFJJFFFFJ<F7FJJJAAAJJFFJJJJ-AFA-<JJF77FF<7A<J-A777AFFAJFJFFJJFFJ7JA-AJF-<

If you want to extract specific reads from the FASTQ file based on their read IDs, create an ids.txt file that lists the read IDs, one per line, as shown below:

cat ids.txt

SRR22309490.1
SRR22309490.5

Now extract the sequences from input.fastq based on sequence IDs using seqtk subseq,

seqtk subseq input.fastq ids.txt

@SRR22309490.1 1 length=101
CTGTTTTGTCTATTTTTGTTTGGTGCATTAGCTCCAATTGTGAACGTTAATTATGGAGGAATTAGTGGTGCTTTTTATGGGAACTATAGATCTAATTATAT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@SRR22309490.5 5 length=101
CTCGCAGTTGACTCATACTTAGCTCTATCGGTTTTGTACATGTGAGCAATCTCTGGAACCAATGGATCATCTGGGTTTGGGTCCGTTAACAATGAACATAT
+
AAAFFJJJJJJFJJJJF-FJFAFFAFFJJFF<FJFFJFJFFJJFFJJJJJJJFJJJJJJJJJJJJFFFJJJJJFJJJJJJJ<FFJFJJFJJFJFFFJJJJF
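In the output above, the read ID is matched against the first word of the header line, and the "+" separator line is written without the repeated header. A plain Python imitation of this filtering could look like the sketch below. It assumes unwrapped, four-line FASTQ records and is not seqtk's implementation; read_fastq and subset_fastq_by_id are hypothetical names.

```python
def read_fastq(fastq_text):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    lines = [line for line in fastq_text.splitlines() if line]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near line {i + 1}")
        yield header[1:], seq, qual


def subset_fastq_by_id(fastq_text, wanted_ids):
    """Return FASTQ text for reads whose ID is in wanted_ids, in input-file order."""
    wanted = set(wanted_ids)
    out = []
    for header, seq, qual in read_fastq(fastq_text):
        # The read ID is the first whitespace-delimited word of the header
        if header.split()[0] in wanted:
            # The '+' separator line is written without repeating the header
            out.append(f"@{header}\n{seq}\n+\n{qual}")
    return "\n".join(out)
```

Reading records in fixed groups of four lines avoids misinterpreting a quality string that happens to start with "@" as a new header.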

Extract subsequences from specific regions of FASTQ

seqtk subseq can also extract subsequences from specific regions of reads. For example, suppose you have the following ids.txt file containing the read name and the region coordinates (separated by TAB):

cat ids.txt

SRR22309490.1      1       10
SRR22309490.5     10      20

Now extract the specific region sequences from input.fastq based on read IDs and region coordinates using seqtk subseq:

# extract subsequences from the listed regions
seqtk subseq input.fastq ids.txt

@SRR22309490.1:2-10 1 length=101
TGTTTTGTC
+
AFFFJJJJJ
@SRR22309490.5:11-20 5 length=101
ACTCATACTT
+
JFJJJJF-FJ
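As the output above shows, the bases and the quality scores are cut at the same positions, and the region is appended to the read ID in 1-based form before the rest of the header. The sketch below reproduces this for a single read in Python, using the same BED-style convention (0-based start, end-exclusive). fastq_region is a hypothetical helper for illustration only, and the example strings are the first 14 bases and quality scores of SRR22309490.1.

```python
def fastq_region(header, sequence, quality, start, end):
    """Cut a BED-style interval (0-based start, end-exclusive) from one read.

    The same slice is applied to bases and quality scores so they stay aligned.
    """
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Split the header into the read ID and the remaining comment text
    read_id, _, comment = header.partition(" ")
    name = f"{read_id}:{start + 1}-{end}"
    if comment:
        name = f"{name} {comment}"
    return f"@{name}\n{sequence[start:end]}\n+\n{quality[start:end]}"


print(fastq_region("SRR22309490.1 1 length=101",
                   "CTGTTTTGTCTATT", "AAFFFJJJJJJJJJ", 1, 10))
```

Note that the length=101 text in the header is carried over unchanged, so it no longer reflects the length of the extracted subsequence.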



This work is licensed under a Creative Commons Attribution 4.0 International License
