Annotation of genetic variants (SNPs/InDels) in VCF file

Renesh Bedre    3 minute read

  • The Variant Call Format (VCF) file produced by variant calling software (e.g. GATK, FreeBayes, BCFtools) contains the information for polymorphic loci present in the sample or population.
  • These genetic variants reported in VCF files generally lack the important variant annotation information such as genomic region-based annotation (CDS, introns, intergenic, etc.), associated genes, and their functions, etc.
  • Therefore, I have added the vcf_anot function in bioinfokit to add the variant annotation for variant location in the genome, associated genes, and gene functions. Currently, vcf_anot does not predict the variant effect on amino acid changes.

First, let’s check variant calls in test VCF file without adding annotation

##fileformat=VCFv4.2																																																																																																																																																																																																																																																																																																																																										
##filedate=20190219																																																																																																																																																																																																																																																																																																																																										
##source="beagle.21Jan17.6cc.jar (version 4.1)"																																																																																																																																																																																																																																																																																																																																										
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">																																																																																																																																																																																																																																																																																																																																										
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">																																																																																																																																																																																																																																																																																																																																										
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">																																																																																																																																																																																																																																																																																																																																										
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">																																																																																																																																																																																																																																																																																																																																										
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">																																																																																																																																																																																																																																																																																																																																										
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">																																																																																																																																																																																																																																																																																																																																										
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">																																																																																																																																																																																																																																																																																																																																										
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	100_S85	101_S86																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	2157586	.	A	T	.	PASS	AR2=0.94;DR2=0.95;AF=0.82	GT:DS:GP	1/1:1.92:0,0.07,0.93	1/1:1.98:0,0.02,0.98																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	2157587	.	G	A	.	PASS	AR2=0.91;DR2=0.93;AF=0.46	GT:DS:GP	./.	0/1:0.95:0.05,0.94,0																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	2157604	.	A	G	.	PASS	AR2=0.91;DR2=0.93;AF=0.47	GT:DS:GP	./.	0/1:0.96:0.04,0.95,0																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	2157617	.	A	C	.	PASS	AR2=0.95;DR2=0.96;AF=0.82	GT:DS:GP	1/1:1.94:0,0.06,0.94	1/1:1.99:0,0.01,0.99																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	2157620	.	G	A	.	PASS	AR2=0.91;DR2=0.93;AF=0.48	GT:DS:GP	./.	0/1:0.97:0.04,0.96,0																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	2157622	.	G	T	.	PASS	AR2=0.95;DR2=0.96;AF=0.82	GT:DS:GP	1/1:1.94:0,0.06,0.94	1/1:1.99:0,0.01,0.99																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	2157626	.	C	T	.	PASS	AR2=0.86;DR2=0.89;AF=0.17	GT:DS:GP	./.	0/0:0.06:0.94,0.06,0																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	2157627	.	A	G	.	PASS	AR2=0.95;DR2=0.96;AF=0.82	GT:DS:GP	1/1:1.95:0,0.05,0.95	1/1:1.99:0,0.01,0.99																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	2157650	.	C	A	.	PASS	AR2=0.96;DR2=0.96;AF=0.81	GT:DS:GP	1/1:1.95:0,0.05,0.95	1/1:1.99:0,0.01,0.99		
SL3.0ch00	10737602	.	A	G	.	PASS	AR2=0.81;DR2=0.85;AF=0.79	GT:DS:GP	1/1:1.92:0,0.08,0.92	1/1:2:0,0,1																																																																																																																																																																																																																																																																																																																																
SL3.0ch00	10905876	.	A	G	.	PASS	AR2=0.86;DR2=0.87;AF=0.21	GT:DS:GP	./.	0/0:0.09:0.91,0.09,0																																																																																																																																																																																																																																																																																																																																		

How to add annotation to variants (SNPs/InDels) in VCF file?

# I am using interactive python interpreter (Python 3.8.2)
>>> from bioinfokit.analys import marker
>>> marker.vcf_anot(file='SL3.0ch00_imputed.vcf', gff_file='itag30_chr0.gff3', anot_attr='Note')
# annotated file will be saved in the same directory (SL3.0ch00_imputed_anot.txt)

Annotated output file SL3.0ch00_imputed_anot will have additional columns including genomic region, and transcript IDs and name for variants in CDS or exon or UTR regions

Now check variant calls in annotated text file,

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	100_S85	101_S86	genomic region	transcript ID	transcript name	strand
SL3.0ch00	2157586	.	A	T	.	PASS	AR2=0.94;DR2=0.95;AF=0.82	GT:DS:GP	1/1:1.92:0,0.07,0.93	1/1:1.98:0,0.02,0.98	Intergenic	None	None	None
SL3.0ch00	2157587	.	G	A	.	PASS	AR2=0.91;DR2=0.93;AF=0.46	GT:DS:GP	./.	0/1:0.95:0.05,0.94,0	Intergenic	None	None	None
SL3.0ch00	2157604	.	A	G	.	PASS	AR2=0.91;DR2=0.93;AF=0.47	GT:DS:GP	./.	0/1:0.96:0.04,0.95,0	Intergenic	None	None	None
SL3.0ch00	2157617	.	A	C	.	PASS	AR2=0.95;DR2=0.96;AF=0.82	GT:DS:GP	1/1:1.94:0,0.06,0.94	1/1:1.99:0,0.01,0.99	Intergenic	None	None	None
SL3.0ch00	2157620	.	G	A	.	PASS	AR2=0.91;DR2=0.93;AF=0.48	GT:DS:GP	./.	0/1:0.97:0.04,0.96,0	Intergenic	None	None	None
SL3.0ch00	2157622	.	G	T	.	PASS	AR2=0.95;DR2=0.96;AF=0.82	GT:DS:GP	1/1:1.94:0,0.06,0.94	1/1:1.99:0,0.01,0.99	Intergenic	None	None	None
SL3.0ch00	2157626	.	C	T	.	PASS	AR2=0.86;DR2=0.89;AF=0.17	GT:DS:GP	./.	0/0:0.06:0.94,0.06,0	Intergenic	None	None	None
SL3.0ch00	2157627	.	A	G	.	PASS	AR2=0.95;DR2=0.96;AF=0.82	GT:DS:GP	1/1:1.95:0,0.05,0.95	1/1:1.99:0,0.01,0.99	Intergenic	None	None	None
SL3.0ch00	2157650	.	C	A	.	PASS	AR2=0.96;DR2=0.96;AF=0.81	GT:DS:GP	1/1:1.95:0,0.05,0.95	1/1:1.99:0,0.01,0.99	Intergenic	None	None	None
SL3.0ch00	10737602	.	A	G	.	PASS	AR2=0.81;DR2=0.85;AF=0.79	GT:DS:GP	1/1:1.92:0,0.08,0.92	1/1:2:0,0,1	CDS	mRNA:Solyc00g102000.2.1	LOW QUALITY:Unknown protein (AHRD V3.3 )	+
SL3.0ch00	10905876	.	A	G	.	PASS	AR2=0.86;DR2=0.87;AF=0.21	GT:DS:GP	./.	0/0:0.09:0.91,0.09,0	CDS	mRNA:Solyc00g020540.2.1	Aminotransferase-like protein (AHRD V3.3 *-* Q6EQM2_ORYSJ)	+

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License