genomize-Seq is a next-generation sequencing (NGS) data analysis, management and sharing platform. In this blog post, we explain the parameters genomize-Seq uses for variant confidence classification.
genomize-Seq uses a three-level confidence classification scheme with the classes High, Low and Failed (Figure 1). Parameter optimization is necessary so that real variants are classified as high confidence. Because NGS analysis involves a variety of algorithms and dozens of parameters, different analyses of the same raw data may produce different results (Chapman; O’Rawe et al.). No matter how complex the parameter optimization and machine learning algorithms are, the clinician and/or the patient will want to know the real result.
Therefore, genomize-Seq assists laboratories with parameter optimization and ensures reproducible production of the most accurate result, which is ultimately limited by the technology and/or the DNA processing kit. To optimize the parameters, the results must be compared with an alternative technique such as Sanger sequencing in order to calculate sensitivity, specificity and accuracy.
Figure 1 – Variant Confidence Classification System
As seen in Figure 2, low confidence variants will be shown in a separate tab with their associated reason(s).
Figure 2 – A List View of Low Confidence Variants
genomize-Seq classifies the variants with the following parameters:
• Primary coverage threshold per allele
• Secondary coverage threshold per allele
• Strand bias metric (SBM) threshold for heterozygous variants
• Alternative allele ratio threshold for heterozygous variants
The primary coverage threshold is a minimum: any variant whose per-allele coverage is lower than the primary threshold is classified as ‘Failed’ and hidden from the Samples Detail Page. For example, the coverage of the forward strand (F) is calculated as the sum of the alternative (alt) and reference (ref) alleles on that strand (see also Box 1 and Figure 3).
A variant whose per-allele coverage lies between the primary and secondary coverage thresholds is classified as low confidence. Therefore, to be classified as high confidence, a variant’s per-allele coverage must be greater than the secondary threshold (along with the other requirements explained below).
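The two coverage thresholds can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not genomize-Seq's actual code: the function names and the example threshold values (10 and 20) are hypothetical, and the handling of values exactly equal to a threshold follows the wording above ("lower than" the primary threshold fails, "greater than" the secondary threshold is required for high confidence).

```python
def strand_coverage(alt_count: int, ref_count: int) -> int:
    """Coverage of one strand: alternative plus reference allele reads on it."""
    return alt_count + ref_count


def classify_by_coverage(coverage: int, primary: int, secondary: int) -> str:
    """Classify a variant from the coverage thresholds alone.

    Hypothetical helper: returns 'Failed', 'Low' or 'High'. A 'High' result
    here is provisional, since the other filters can still lower it.
    """
    if primary > secondary:
        raise ValueError("primary threshold must not exceed secondary threshold")
    if coverage < primary:
        return "Failed"   # below the minimum: hidden from the results
    if coverage <= secondary:
        return "Low"      # between the primary and secondary thresholds
    return "High"         # strictly greater than the secondary threshold


# Example with made-up read counts and thresholds
forward = strand_coverage(alt_count=12, ref_count=18)
print(forward, classify_by_coverage(forward, primary=10, secondary=20))
```

Keeping the strand coverage calculation separate from the classification makes it easy to try different threshold pairs against the same read counts.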
Alternative allele ratio filter
When searching for heterozygous germline variants, the alternative allele ratio should ideally be close to 50%. Depending on the combination of the technology and the kit used, this is achieved only partly or fully. Therefore, one will often need an alternative allele ratio filter to achieve accurate results. If the alternative allele ratio of a variant is less than the threshold or more than [1 – threshold], the variant is assigned “Low confidence”. Changing the alternative allele ratio threshold does not affect the classification of homozygous variants, because their ratio is expected to be close to 100%.
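As a rough illustration of the rule above, the following Python sketch computes the alternative allele ratio from read counts and checks it against a symmetric window around 50%. The function names and the example threshold of 0.25 are assumptions made for this example; they are not taken from genomize-Seq.

```python
def alt_allele_ratio(alt_count: int, ref_count: int) -> float:
    """Reads supporting the alternative allele divided by all reads."""
    total = alt_count + ref_count
    if total == 0:
        raise ValueError("ratio is undefined when there are no reads")
    return alt_count / total


def passes_het_ratio_filter(ratio: float, threshold: float) -> bool:
    """True if a heterozygous variant's ratio lies in [threshold, 1 - threshold].

    Hypothetical helper: a ratio below the threshold or above 1 - threshold
    would mark the variant as low confidence.
    """
    if not 0.0 <= threshold <= 0.5:
        raise ValueError("threshold is expected to lie between 0 and 0.5")
    return threshold <= ratio <= 1.0 - threshold


# Example with made-up counts: 30 alt reads, 70 ref reads, threshold 0.25
ratio = alt_allele_ratio(alt_count=30, ref_count=70)
print(ratio, passes_het_ratio_filter(ratio, threshold=0.25))
```

A variant for which this check returns False would be assigned “Low confidence”; homozygous variants are not passed through this check at all.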
Figure 3 – Distribution of the Reference and Alternative Alleles Displayed in IGV
Strand Bias Metric
Sometimes a sequencing machine error produces a contradiction in which the genotypes inferred from the forward strand and the reverse strand differ, with one homozygous and the other heterozygous. This phenomenon is known as strand bias (Guo et al.). genomize-Seq uses a strand bias metric (SBM) to infer strand bias, which is calculated as below:
Variants with SBM values greater than the SBM threshold are assigned “Low confidence”.
The strand bias metric is not applicable to homozygous variant calls because there is not enough evidence in the reference allele to infer such a bias.
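The thresholding step itself can be sketched as follows. The SBM value is taken here as an already computed input, since this sketch does not attempt to reproduce the metric's formula; the function name, argument names and example values are hypothetical.

```python
def apply_sbm_filter(confidence: str, sbm: float, sbm_threshold: float,
                     is_heterozygous: bool) -> str:
    """Downgrade a heterozygous variant to 'Low' when its SBM exceeds the threshold.

    Hypothetical helper: `sbm` is assumed to be computed elsewhere. Homozygous
    calls are left untouched because the metric does not apply to them, and
    variants that are already 'Low' or 'Failed' keep their class.
    """
    if confidence == "High" and is_heterozygous and sbm > sbm_threshold:
        return "Low"
    return confidence


# Example with made-up values
print(apply_sbm_filter("High", sbm=0.9, sbm_threshold=0.5, is_heterozygous=True))
print(apply_sbm_filter("High", sbm=0.9, sbm_threshold=0.5, is_heterozygous=False))
```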
A summary of the variant classification principles explained so far can be seen in Table 1.
Table 1 – The Overall Principles of Variant Confidence Classification
Finally, we developed a binary parameter, “Require Allele Ratio Evidence From Both Strands”. Bidirectional support is important for heterozygous calls, but the option can be deselected for gene panel enrichment kits with less than 100% bidirectional coverage. A comparison of confidence classifications made with and without this option can be seen in Table 2.
You can use the Variant Confidence Calculator to visualize the results with different parameters.
• Coverage – The number of reads made for a variant
• Primary coverage filter – a lower threshold for classifying failed variants
• Secondary coverage filter – a threshold for classifying high and low confidence variants
• Reference allele – DNA sequence of the related region in reference genome (hg19)
• Alternative allele – an allele other than reference allele in a variant
• altF – The number of alternative alleles counted in the reads for the forward strand of the variant
• refF – The number of reference alleles counted in the reads for the forward strand of the variant
• altR – The number of alternative alleles counted in the reads for the reverse strand of the variant
• refR – The number of reference alleles counted in the reads for the reverse strand of the variant
• alternative allele ratio – The number of reads supporting the alternative allele divided by the sum of all reads. genomize-Seq calculates this ratio per strand.
• strand bias – Sometimes a sequencing machine error produces a contradiction in which the genotypes inferred from the forward strand and the reverse strand differ, with one homozygous and the other heterozygous. This phenomenon is known as strand bias.
• strand bias metric (SBM) – A parameter for detecting strand bias. If the SBM value of a variant is higher than the given threshold, the variant is flagged for strand bias and its confidence is classified as low.
• homopolymer – a polymer which contains only one kind of residue (e.g. the polynucleotide GGGGG…)
• homopolymer stretch existence – determines if the variant resides in a homopolymer region
• Chapman, Brad. ‘Updated Comparison Of Variant Detection Methods: Ensemble, Freebayes And Minimal BAM Preparation Pipelines’. Blue Collar Bioinformatics 2013. 7 July 2015.
• O’Rawe et al.: Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing. Genome Medicine 2013 5:28.
• Guo et al.: The effect of strand bias in Illumina short-read sequencing data. BMC Genomics 2012 13:666.