How to Extract Sequences from FASTA in Python

Renesh Bedre

In the Python bioinfokit package (v2.1.3), the extract_seq() function extracts sequences (complete sequences or subsequences) from a FASTA file based on sequence IDs and, optionally, region coordinates.

The general syntax of the extract_seq() function looks like this:

# load package
from bioinfokit.analys import Fasta

# extract sequences based on sequence ID and region coordinates
Fasta.extract_seq(file="input.fasta", id="ids.txt")

Here, input.fasta is the name of your input FASTA file, and ids.txt contains the list of sequence IDs (one ID per line) to extract from it.

The ids.txt file can also contain each sequence ID together with specific region coordinates, in a layout similar to a three-column BED file.

The following examples illustrate how to extract sequences from FASTA files using the extract_seq() function from bioinfokit.

Extract sequences from FASTA

Suppose you have the following FASTA file:

cat input.fasta

>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>JAMFTS010000002.1
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG

To extract sequences by ID, create a file called ids.txt that lists the sequence IDs to extract, with each ID on a separate line:

cat ids.txt

GU056837.1
MH150936.1
KU562861.1
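
If the IDs already live in a Python list, the ID file can be written programmatically. The snippet below is a minimal sketch; write_id_file is a hypothetical helper name, not part of bioinfokit, and skipping blank or duplicate IDs is an assumption about what you would want in the file.

```python
def write_id_file(ids, path="ids.txt"):
    """Write one sequence ID per line, skipping blanks and duplicates."""
    seen = set()
    with open(path, "w") as handle:
        for seq_id in ids:
            seq_id = seq_id.strip()
            if seq_id and seq_id not in seen:
                seen.add(seq_id)
                handle.write(seq_id + "\n")


if __name__ == "__main__":
    # Recreates the ids.txt shown above in the current directory
    write_id_file(["GU056837.1", "MH150936.1", "KU562861.1"])
```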

Now extract the sequences from input.fasta based on the sequence IDs using extract_seq():

# load package
from bioinfokit.analys import Fasta
# extract sequences
Fasta.extract_seq(file="input.fasta", id="ids.txt")

# output (saved in output.fasta)
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTATTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG

The extracted sequence FASTA file (output.fasta) will be saved in the same directory.
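
To make the ID-matching step concrete, here is a minimal pure-Python sketch of the same idea. It is not bioinfokit's implementation: read_fasta and extract_by_id are hypothetical names, and matching on the first whitespace-delimited word of each header is an assumption.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs, joining sequences wrapped over lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def extract_by_id(fasta_path, ids_path, out_path):
    """Write the records whose ID appears in ids_path, in FASTA file order."""
    with open(ids_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}
    with open(out_path, "w") as out:
        for header, seq in read_fasta(fasta_path):
            # Assumption: the ID is the first word of the header line
            seq_id = header.split()[0] if header else ""
            if seq_id in wanted:
                out.write(f">{header}\n{seq}\n")
```

Because the records are written while the FASTA file is read, the output follows the order of the FASTA file rather than the order of ids.txt.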

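Extract subsequences from specific regions of a FASTA file

In addition, extract_seq() can extract subsequences from specific regions. As an example, suppose ids.txt contains the sequence ID followed by the start and end coordinates of the region (separated by TAB). Judging by the output of this example, both coordinates are 1-based and inclusive (for instance, 1 to 50 returns the first 50 bases), unlike the 0-based, half-open coordinates of a true BED file.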

cat ids.txt

GU056837.1	1	50
MH150936.1	10	40
KU562861.1	30	80

Now extract the subsequences from input.fasta based on the sequence IDs and region coordinates using extract_seq():

# load package
from bioinfokit.analys import Fasta
# extract sequences
Fasta.extract_seq(file="input.fasta", id="ids.txt")

# output (saved in output.fasta)
>KU562861.1
GTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGA
>MH150936.1
ATGAAAACTTTTCCTTTACTAAAAACCGTCA

The extracted sequence FASTA file (output.fasta) will be saved in the same directory.
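
The coordinate handling can be checked by hand with a short sketch. The helpers below (subsequence and parse_region_line, both hypothetical names) assume 1-based, inclusive coordinates. That convention is inferred from the output above, not taken from the bioinfokit documentation.

```python
def subsequence(seq, start, end):
    """Return seq from start to end, both 1-based and inclusive."""
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {start}-{end}")
    return seq[start - 1:end]


def parse_region_line(line):
    """Split an 'ID<TAB>start<TAB>end' line into (id, start, end)."""
    seq_id, start, end = line.rstrip("\n").split("\t")
    return seq_id, int(start), int(end)


# First 80 bases of KU562861.1, copied from input.fasta above
ku = ("GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA"
      "AGGGGCAGCA")
seq_id, start, end = parse_region_line("KU562861.1\t30\t80")
print(subsequence(ku, start, end))
# GTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCA
```

The printed 51-base string matches the KU562861.1 record in the output above, which is what supports the 1-based, inclusive reading.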

You can also use bedtools getfasta or seqtk subseq to extract sequences from specific regions of a FASTA file.
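
BED files use 0-based starts and exclusive ends, so an ids.txt with regions like the one above would likely need converting before being handed to a BED-based tool. The sketch below assumes the ids.txt coordinates are 1-based and inclusive (an inference from the example output). regions_to_bed is a hypothetical helper name and regions.bed is an assumed output file name.

```python
def regions_to_bed(ids_path, bed_path):
    """Convert 'ID start end' lines (1-based, inclusive) into BED lines."""
    with open(ids_path) as src, open(bed_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue
            seq_id, start, end = line.split()[:3]
            # BED start is 0-based; the exclusive BED end equals the
            # 1-based inclusive end, so only the start shifts by one
            dst.write(f"{seq_id}\t{int(start) - 1}\t{int(end)}\n")


# Possible follow-up on the command line (not run here):
#   bedtools getfasta -fi input.fasta -bed regions.bed -fo output.fasta
```

With this conversion, the region GU056837.1 1 50 becomes the BED interval GU056837.1 0 50, which still covers the first 50 bases.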



This work is licensed under a Creative Commons Attribution 4.0 International License
