A guide to understanding the variant information fields in a Variant Call Format (VCF) file

Renesh Bedre

  • The Variant Call Format (VCF) file produced by variant calling software (e.g. GATK, FreeBayes, SAMtools) describes the polymorphic loci found in a sample or population, together with probabilistic measures of confidence in each call.
  • The INFO and FORMAT columns of a VCF file contain several fields that report metrics for the called variants and genotypes. These metrics are useful for filtering variants before downstream analysis.

GT (Genotype)

  • GT refers to the most likely genotype of the sample. The alleles are separated by / (unphased genotype) or | (phased genotype).
  • Alleles are coded as 0 for the reference allele and 1 for the alternate (non-reference) allele, so a diploid organism has two allele values per genotype.
genotype    description
0/0         the sample is homozygous reference
0/1         the sample is heterozygous (carries both reference and alternate alleles)
1/1         the sample is homozygous alternate
./.         no genotype called (missing genotype)
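
The table above can be expressed as a small lookup. This is a minimal sketch, assuming a GT string as it appears in the FORMAT column; the function name classify_genotype is hypothetical and not part of any VCF library.

```python
import re

def classify_genotype(gt):
    """Classify a GT string such as '0/1' or '1|1' (hypothetical helper)."""
    # Alleles are separated by '/' (unphased) or '|' (phased).
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "heterozygous"
    return "homozygous reference" if alleles[0] == "0" else "homozygous alternate"

for gt in ["0/0", "0|1", "1/1", "./."]:
    print(gt, "->", classify_genotype(gt))
```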

GP (Genotype posterior probabilities)

  • Genotype posterior probabilities calculated using Bayes’ formula. Each value is a probability between 0 and 1 (this is the scale used by VCF v4.3 and by imputation tools such as Beagle; earlier versions of the VCF specification defined GP as Phred-scaled).
  • For a biallelic site in a diploid sample, the GP tag has three subfields: the probabilities of the homozygous reference, heterozygous, and homozygous alternate genotypes.
  • The genotype with the highest posterior probability is assigned to the sample. For example, if GP is 0.11,0.62,0.27, then the genotype is heterozygous (0/1) as it has the highest probability (0.62).
  • The genotype probabilities always sum to 1.
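
As an illustration of the rule above, the following sketch picks the genotype with the highest posterior probability. It assumes a biallelic diploid site and a GP value on the 0 to 1 scale; call_from_gp is a hypothetical helper name.

```python
def call_from_gp(gp):
    """Return (genotype, probability) for the most probable genotype.

    Assumes three comma-separated probabilities in the order
    homozygous reference, heterozygous, homozygous alternate.
    """
    probs = [float(x) for x in gp.split(",")]
    if len(probs) != 3:
        raise ValueError("expected three GP subfields for a biallelic diploid site")
    genotypes = ["0/0", "0/1", "1/1"]
    best = max(range(3), key=lambda i: probs[i])
    return genotypes[best], probs[best]

print(call_from_gp("0.11,0.62,0.27"))  # ('0/1', 0.62)
```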

GL and PL (Genotype likelihoods)

  • GL refers to log10-scaled genotype likelihoods.
  • PL refers to Phred-scaled genotype likelihoods, normalized so that the most likely genotype has a PL value of 0.
  • Similar to GP, the GL and PL tags have three subfields, giving the likelihoods of the homozygous reference, heterozygous, and homozygous alternate genotypes.
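
The relationship between the two scales can be sketched as follows: a Phred-scaled value is -10 times the log10 value, shifted so that the most likely genotype gets 0 and rounded to an integer. The helper name gl_to_pl is hypothetical, and individual callers may round or cap values differently.

```python
def gl_to_pl(gl):
    """Convert log10 genotype likelihoods (GL) to normalized Phred-scaled values (PL)."""
    best = max(gl)  # the most likely genotype has the largest log10 likelihood
    return [round(-10 * (g - best)) for g in gl]

def pl_to_relative_likelihood(pl):
    """Convert PL values back to likelihoods relative to the best genotype."""
    return [10 ** (-p / 10) for p in pl]

pl = gl_to_pl([-2.3, 0.0, -4.1])
print(pl)
print(pl_to_relative_likelihood(pl))
```

In this example the heterozygous genotype has a PL of 0, so it is the most likely genotype.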

DP, DP4, and AD (Read depth)

  • DP refers to read depth. In the INFO column, DP is the combined read depth at the site across all samples; in the FORMAT column, DP is the read depth for an individual sample.
  • Generally, markers with DP > 5 or DP > 10 are retained to get high-quality genotypes. This threshold can be changed based on the research application. For example, in clinical research a higher DP threshold is desirable for filtering.
  • The DP4 field has four subfields that count the sequence reads covering the variant: forward reads supporting the reference allele, reverse reads supporting the reference allele, forward reads supporting the alternate allele, and reverse reads supporting the alternate allele. DP4 may not sum to DP as it excludes low-quality bases.
  • AD refers to the allele depth. AD reports the number of informative reads supporting each allele. AD may not always sum to DP.
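
To make the DP4 layout concrete, here is a minimal sketch that splits a DP4 value into per-allele depths and an alternate allele fraction. The function name summarize_dp4 and the example value are illustrative assumptions.

```python
def summarize_dp4(dp4):
    """Summarize a DP4 string 'ref_fwd,ref_rev,alt_fwd,alt_rev' (hypothetical helper)."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    ref_depth = ref_fwd + ref_rev
    alt_depth = alt_fwd + alt_rev
    total = ref_depth + alt_depth
    return {
        "ref_depth": ref_depth,
        "alt_depth": alt_depth,
        "total": total,
        # Fraction of high-quality reads that support the alternate allele.
        "alt_fraction": alt_depth / total if total else None,
    }

print(summarize_dp4("10,8,6,4"))
```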

MQ (Mapping quality)

  • MQ refers to the root mean square (RMS) mapping quality of all the reads spanning the given variant site, i.e. the square root of the mean of the squared mapping qualities of those reads.
  • For aligners whose maximum mapping quality is 60 (e.g. BWA-MEM), MQ close to 60 represents good mapping quality. Variants with MQ below a chosen cutoff, commonly 40 (or 50 for stricter filtering), should be removed.
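
The RMS calculation can be sketched in a few lines. The list of per-read mapping qualities below is made up for illustration; a variant caller computes this from the reads in the alignment file.

```python
import math

def rms_mapping_quality(mapqs):
    """Root mean square of per-read mapping qualities."""
    if not mapqs:
        raise ValueError("no reads at this site")
    return math.sqrt(sum(q * q for q in mapqs) / len(mapqs))

# Three well-mapped reads and one poorly mapped read.
print(round(rms_mapping_quality([60, 60, 60, 20]), 2))
```

Because the values are squared before averaging, the RMS is pulled toward the higher mapping qualities; for this example it is larger than the arithmetic mean of 50.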

FS (FisherStrand) and SOR (StrandOddsRatio)

  • The FS and SOR tags are used for strand bias evaluation, i.e. whether the alternate allele is seen disproportionately on one strand (forward or reverse) compared with the reference allele.
  • FS refers to the Phred-scaled p value of Fisher’s exact test for strand bias. FS values close to zero indicate little to no bias at the variant site.
  • SOR was created as an alternative to FS. Most SOR values range from 0 to 9. SOR values greater than 3 indicate strand bias, and such variants should be removed.
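
To show where an FS-like value comes from, the sketch below runs a two-sided Fisher's exact test on the 2x2 table of strand counts and converts the p value to the Phred scale. This is an illustration under simplifying assumptions, not GATK's implementation: production callers may apply further adjustments to the table and to the p value, and the function names here are hypothetical.

```python
import math

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)

def phred_strand_bias(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Phred-scaled p value: 0 means no evidence of strand bias."""
    p = fisher_exact_two_sided(ref_fwd, ref_rev, alt_fwd, alt_rev)
    p = max(p, 1e-300)  # guard against log10(0) for extreme tables
    return max(0.0, -10 * math.log10(p))

print(phred_strand_bias(10, 10, 10, 10))  # balanced strands
print(phred_strand_bias(20, 0, 0, 20))    # alternate allele only on reverse reads
```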

QD (Quality by Depth)

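  • QD is the variant confidence (QUAL) normalized by the read depth at the variant site, so that deeply covered sites do not receive inflated confidence scores.
  • Because it accounts for both confidence and depth, QD is generally a better metric than DP or QUAL alone for variant filtering.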
  • Variants with QD < 2 should be removed.
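
A simplified version of this normalization and the QD < 2 cutoff might look like the following. It assumes QD is simply QUAL divided by depth; actual callers choose which reads count toward the depth, so treat the helper names and numbers as illustrative.

```python
def quality_by_depth(qual, depth):
    """Variant confidence (QUAL) divided by read depth (simplified QD)."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return qual / depth

def passes_qd_filter(qual, depth, min_qd=2.0):
    """Keep the variant only if QD is at least the cutoff."""
    return quality_by_depth(qual, depth) >= min_qd

print(passes_qd_filter(qual=50.0, depth=40))   # QD = 1.25
print(passes_qd_filter(qual=500.0, depth=40))  # QD = 12.5
```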

DS (Alternate allele dosage)

  • DS is the expected number of copies of the alternate allele (dosage), which ranges from 0 to 2 for a diploid sample
  • Calculated as p(heterozygous) + 2*p(homozygous alternate)
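
The dosage formula can be applied directly to a GP value. This sketch assumes a biallelic diploid site and GP on the 0 to 1 scale; dosage_from_gp is a hypothetical helper name.

```python
def dosage_from_gp(gp):
    """Alternate allele dosage: p(heterozygous) + 2 * p(homozygous alternate)."""
    p_hom_ref, p_het, p_hom_alt = (float(x) for x in gp.split(","))
    return p_het + 2 * p_hom_alt

# 0.62 + 2 * 0.27, i.e. about 1.16 expected copies of the alternate allele
print(round(dosage_from_gp("0.11,0.62,0.27"), 2))
```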

AC (Alternate allele count)

  • Total count of the alternate allele across all called genotypes

AN (Total allele count)

  • Total number of alleles across all called genotypes (missing genotypes are not counted)

AF (Alternate allele frequency)

  • AF is the frequency of the alternate allele
  • AF is calculated as AC/AN
  • The AF tag can be used to infer the minor allele frequency (MAF) (see the bcftools fill-tags plugin)
  • For a biallelic site, MAF is equal to AF if AF < 0.5; otherwise MAF is 1 - AF
  • Rare variants generally have AF or MAF < 5 % (0.05)

MAF (Minor allele frequency)

  • MAF refers to the frequency of the minor (least frequent) allele
  • The alternate allele is not always the minor allele
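
The AC, AN, AF, and MAF definitions can be tied together with a short sketch that counts alleles from a list of GT strings. It assumes a biallelic site; allele_stats is a hypothetical helper, and in practice a tool such as the bcftools fill-tags plugin computes these tags.

```python
import re

def allele_stats(genotypes):
    """Compute AC, AN, AF, and MAF from GT strings at a biallelic site."""
    # Collect all called alleles; '.' marks a missing allele and is skipped.
    alleles = [a for gt in genotypes for a in re.split(r"[/|]", gt) if a != "."]
    an = len(alleles)                             # total called alleles
    ac = sum(1 for a in alleles if a == "1")      # alternate allele count
    af = ac / an if an else None
    maf = min(af, 1 - af) if af is not None else None
    return {"AC": ac, "AN": an, "AF": af, "MAF": maf}

print(allele_stats(["0/0", "0/1", "1/1", "./.", "1/1"]))
```

In this example the alternate allele is the major allele (AF = 0.625), so MAF is 1 - AF = 0.375.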

AR2 (Allelic R-Squared) and DR2 (Dosage R-Squared)

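  • The DR2 and AR2 fields estimate the imputation accuracy for each SNP. DR2 and AR2 are highly correlated.
  • Allelic R-Squared refers to the estimated squared correlation between the most probable imputed allele dosage and the true allele dosage. Dosage R-Squared refers to the estimated squared correlation between the expected imputed allele dosage and the true allele dosage.
  • Thresholds between 0.3 and 0.8 are typically used for filtering, i.e. a variant is retained if its estimated imputation accuracy is above the chosen cutoff (e.g. > 0.3 for lenient filtering or > 0.8 for strict filtering)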
  • R-Squared measures are highly correlated with minor allele frequency (MAF)
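
To illustrate what a squared correlation between imputed and true dosages means, the sketch below computes it for a made-up set of samples. Note the assumption: imputation software estimates these fields without access to the true genotypes, whereas this example needs known true dosages (e.g. from a validation set). The function name and data are hypothetical.

```python
def squared_correlation(imputed, truth):
    """Squared Pearson correlation between imputed and true allele dosages."""
    n = len(imputed)
    mean_i = sum(imputed) / n
    mean_t = sum(truth) / n
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(imputed, truth))
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_i == 0 or var_t == 0:
        raise ValueError("correlation is undefined when a dosage vector is constant")
    return cov * cov / (var_i * var_t)

imputed_dosage = [0.1, 0.9, 1.8, 0.2, 1.1]  # illustrative imputed dosages
true_dosage = [0, 1, 2, 0, 1]               # illustrative true genotypes
print(round(squared_correlation(imputed_dosage, true_dosage), 3))
```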

NS (Samples with data)

  • NS refers to the number of samples with called (non-missing) genotypes at the site

RPB (Read Position Bias)

  • RPB refers to the read position bias (tail distance bias), which compares the positions within the mapped reads of the bases supporting the reference and alternate alleles
  • Sequencing errors are more frequent near the ends of reads, and therefore alleles observed mostly at read ends may not be real.
  • RPB is a z-score from the Mann-Whitney U test. RPB > 2 or RPB < -2 (p < 0.05) indicates significant bias. An RPB value close to zero is ideal (p > 0.05). The larger the p value, the better.
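
The link between the z-score cutoff of about 2 and p < 0.05 follows from the standard normal distribution. A minimal sketch of the two-sided conversion, using only the Python standard library (the function name is hypothetical):

```python
import math

def two_sided_p_from_z(z):
    """Two-sided p value for a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

for z in [0.0, 1.0, 1.96, 2.0, -3.0]:
    print(z, round(two_sided_p_from_z(z), 4))
```

A z-score of 0 gives p = 1 (no evidence of bias), while |z| of about 1.96 or more gives p of 0.05 or less.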

MQB (Mapping Quality Bias)

  • MQB refers to the mapping quality bias between the reads supporting the reference and alternate alleles
  • MQB reports the p value from the Mann-Whitney U test. If p < 0.05, there is significant bias, e.g. the reads supporting the alternate allele have lower mapping quality than the reads supporting the reference allele. The larger the p value, the better.

VDB (Variant Distance Bias)

  • VDB refers to the variant distance bias, which evaluates how likely the observed mean pairwise distance between the positions of the variant bases within the mapped reads is under a random distribution.
  • VDB checks whether the variant bases are randomly distributed along the mapped part of the reads. A lower value suggests that the position of the alternate allele is biased. A higher value is better.
  • VDB is useful in identifying false-positive variants (artifacts resulting from misaligned regions, such as RNA-seq reads spanning splice sites).

If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License