Bioinformatics file readers and processing (FASTA, FASTQ, and VCF)

Renesh Bedre    1 minute read

Bioinformatics file readers

  • We will discuss how to read and process common bioinformatics files
  • We will use bioinfokit v2.0.4 or later
  • Check bioinfokit documentation for installation and documentation

FASTA reader

# you can use interactive python interpreter, jupyter notebook, google colab, spyder or python code
# I am using interactive python interpreter (Python 3.8.2)
from bioinfokit.analys import Fasta
fasta_iter = Fasta.fasta_reader(file='fasta_file')
# read fasta file
for record in fasta_iter:
    header, sequence = record
    print(header, sequence)
    # get sequence length
    sequence_len = len(sequence)
    # count A bases
    a_base = sequence.count('A')
    print(a_base)
    # reverse complementary
    rev_com_seq = fasta.rev_com(seq=sequence)
    print(rev_com_seq)

Split FASTA file into multiple FASTA files

Split one big FASTA into multiple smaller FASTA files

from bioinfokit.analys import Fasta
# split one big FASTA into 3 subset FASTA files
Fasta.split_fasta(file='fasta_file', n=3)

FASTQ reader

from bioinfokit.analys import fastq
fastq_iter = fastq.fastq_reader(file='fastq_file')
# read fastq file
for record in fastq_iter:
    # get sequence headers, sequence, and quality values
    header_1, sequence, header_2, qual = record
    # get sequence length
    sequence_len = len(sequence)
    # count A bases
    a_base = sequence.count('A')
    print(sequence, qual, a_base, sequence_len)

VCF (variant call format) reader

from bioinfokit.analys import marker
# id is for column name with chromosome identifier in VCF file
vcf_iter = marker.vcfreader(file='vcf_file', id='#CHROM')
# read vcf file
for record in vcf_iter:
    # get info lines, header, and variant records
    headers, info_lines, marker_lines = record
    # get variant records with specified chromosome name
    # change chr name as per need
    if marker_lines[0] == 'chr':
        print(marker_lines)
    # retrieve markers within specified positions
    # change chr, pos1, and pos2 variable as per need
    if marker_lines[0]=='SL4.0ch00' and int(marker_lines[1]>=764654) and int(marker_lines[1])<=1038399:
        print(marker_lines)

This work is licensed under a Creative Commons Attribution 4.0 International License