genomize-Seq is a next generation sequencing (NGS) data analysis, management and sharing platform. In this blog post, we explain the parameters genomize-Seq uses for variant confidence classification.
genomize-Seq uses a three-level confidence classification scheme with the classes High, Low and Failed (Figure 1). Parameter optimization is necessary so that real variants are classified as high confidence. Because a variety of algorithms and dozens of parameters are used in NGS, different analyses of the same raw data may produce different results (Chapman, O’Rawe et al.). No matter how complex the parameter optimization and machine learning algorithms are, the clinician and/or the patient will want to know the real result.

Therefore, genomize-Seq assists laboratories with correct parameter optimization and ensures reproducible production of the most accurate result, which is ultimately limited by the technology and/or the DNA processing kit. For this optimization, a comparison with an alternative technique such as Sanger sequencing must be performed to calculate sensitivity, specificity and accuracy.
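Once each call has been checked against the alternative technique, the three figures follow from simple counts. The sketch below is a minimal illustration, assuming every call has already been labelled as a true/false positive or negative against Sanger results; the function name and the example counts are hypothetical and are not part of genomize-Seq.

```python
def validation_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute sensitivity, specificity and accuracy from confusion counts.

    tp/fp/tn/fn are counts of NGS calls compared against an orthogonal
    method (for example Sanger sequencing). Undefined ratios become NaN.
    """
    total = tp + fp + tn + fn
    nan = float("nan")
    return {
        # Fraction of real variants that the NGS analysis detected.
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Fraction of non-variant sites correctly reported as such.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        # Fraction of all compared sites where both methods agree.
        "accuracy": (tp + tn) / total if total else nan,
    }


# Hypothetical counts, for illustration only.
metrics = validation_metrics(tp=95, fp=2, tn=98, fn=5)
```

Re-running the same calculation after each parameter change shows whether the new thresholds moved the results closer to the orthogonal method.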

Figure 1 – Variant Confidence Classification System

As seen in Figure 2, low confidence variants will be shown in a separate tab with their associated reason(s).

Figure 2 – A List View of Low Confidence Variants

genomize-Seq classifies the variants with the following parameters:
•Coverage Thresholds
•Primary coverage filter threshold per allele
•Secondary coverage threshold per allele
•Strand Bias Metric (SBM) Threshold For Heterozygous Variants
•Alternative Allele Ratio Threshold For Heterozygous Variants

Coverage Thresholds
The primary coverage threshold is a minimum: any variant with per-allele coverage lower than the primary threshold is classified as ‘Failed’ and hidden from the Samples Detail Page. For example, to find the coverage of the forward strand (F), the sum of the alternative (alt) and reference (ref) alleles in that strand is calculated (see also Box 1 and Figure 3).
A variant whose per-allele coverage lies between the primary and secondary coverage thresholds is classified as low confidence. Therefore, to be classified as high confidence, a variant’s per-allele coverage must be greater than the secondary threshold (along with the other requirements explained below).
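The two coverage thresholds can be read as a simple three-way decision. The following is a minimal sketch, not genomize-Seq's actual code: the function name and the example threshold values are hypothetical, and sending a coverage exactly equal to a threshold to the low confidence class is an assumption inferred from the wording "lower than" and "greater than" above.

```python
def classify_by_coverage(per_allele_coverage: int, primary: int, secondary: int) -> str:
    """Classify a variant using only the coverage rule.

    Returns "Failed", "Low" or "High". "High" here only means the
    coverage requirement is met; other filters may still downgrade it.
    """
    if primary > secondary:
        raise ValueError("primary threshold must not exceed secondary threshold")
    if per_allele_coverage < primary:
        return "Failed"  # below the minimum: hidden from the results
    if per_allele_coverage <= secondary:
        return "Low"     # between the two thresholds (boundaries assumed Low)
    return "High"        # strictly greater than the secondary threshold


# Hypothetical thresholds: primary = 10 reads, secondary = 30 reads.
for coverage in (5, 10, 30, 31):
    print(coverage, classify_by_coverage(coverage, primary=10, secondary=30))
```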
Alternative Allele Ratio Filter
When searching for heterozygous germline variants, the alternative allele ratio should ideally be close to 50%. Depending on the combination of the technology and the kit used, this is achieved only partly or fully. Therefore, an alternative allele ratio filter is often needed to obtain accurate results. If the alternative allele ratio of the variant is less than the threshold or more than [1 – threshold], the variant is assigned as “Low confidence”. Changing the alternative allele ratio threshold does not affect the classification of homozygous variants, as their ratio must be close to 100%.
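The filter amounts to a symmetric window around 50%. Below is a minimal sketch under stated assumptions: the ratio is taken as alt reads divided by alt plus ref reads, the zygosity is supplied by the caller, and the function name and the 0.3 threshold are hypothetical. How genomize-Seq computes these values internally is not specified here.

```python
def allele_ratio_flag(alt_reads: int, ref_reads: int,
                      threshold: float, is_heterozygous: bool) -> bool:
    """Return True if the allele ratio rule marks the variant low confidence."""
    if not is_heterozygous:
        # The threshold does not affect homozygous calls.
        return False
    total = alt_reads + ref_reads
    if total == 0:
        raise ValueError("no reads covering the variant position")
    ratio = alt_reads / total  # assumed definition of the alternative allele ratio
    # Outside the window [threshold, 1 - threshold] means low confidence.
    return ratio < threshold or ratio > 1 - threshold


# Hypothetical threshold of 0.3, i.e. accepted ratios lie between 30% and 70%.
print(allele_ratio_flag(50, 50, 0.3, True))   # balanced heterozygous call
print(allele_ratio_flag(20, 80, 0.3, True))   # ratio 0.2, below the threshold
```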

Figure 3 – Distribution of the Reference and Alternative Alleles Displayed in IGV

Strand Bias Metric
Sometimes a failure of the sequencing machine produces a contradiction in which the genotypes inferred from the forward strand and the reverse strand differ, with one homozygous and the other heterozygous. This phenomenon is known as strand bias (Guo et al.). genomize-Seq uses a strand bias metric (SBM) to infer strand bias, which is calculated as below:

Variants with SBM values greater than the SBM threshold are assigned as “Low confidence”.
The strand bias metric is not applicable to homozygous variant calls because there is not enough evidence in the reference allele to infer such a bias.

A summary of the variant classification principles explained so far can be seen in Table 1.

Table 1 – The Overall Principles of Variant Confidence Classification
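The rules described so far can also be combined into one decision that returns the class together with the reasons for a low confidence call. This is a minimal sketch under several assumptions: the SBM value is taken as an already computed input, zygosity is supplied by the caller, a failed coverage check takes precedence over all other rules, and every name, reason string and threshold value is hypothetical rather than taken from genomize-Seq.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Thresholds:
    primary_coverage: int
    secondary_coverage: int
    alt_allele_ratio: float
    sbm: float


def classify_variant(coverage: int, alt_ratio: float, sbm: Optional[float],
                     is_heterozygous: bool, t: Thresholds) -> Tuple[str, List[str]]:
    """Return (confidence class, list of reasons) for a single variant."""
    if coverage < t.primary_coverage:
        return "Failed", ["coverage below primary threshold"]

    reasons: List[str] = []
    if coverage <= t.secondary_coverage:
        reasons.append("coverage not above secondary threshold")
    if is_heterozygous:
        # Ratio and strand bias rules apply to heterozygous calls only.
        if alt_ratio < t.alt_allele_ratio or alt_ratio > 1 - t.alt_allele_ratio:
            reasons.append("alternative allele ratio out of range")
        if sbm is not None and sbm > t.sbm:
            reasons.append("strand bias metric above threshold")

    return ("Low", reasons) if reasons else ("High", [])


# Hypothetical thresholds and a heterozygous call with two problems.
limits = Thresholds(primary_coverage=10, secondary_coverage=30,
                    alt_allele_ratio=0.3, sbm=0.5)
print(classify_variant(50, 0.2, 0.9, True, limits))
```

Collecting every triggered rule, rather than stopping at the first, mirrors the way low confidence variants are listed with their associated reason(s).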

Finally, we developed a binary parameter, “Require Allele Ratio Evidence From Both Strands”. Bi-directional support is important for heterozygous calls, but this option can be deselected for gene panel enrichment kits with less than 100% bidirectional coverage. A comparison of confidence classifications made with and without this option can be seen in Table 2.

You can use the Variant Confidence Calculator to visualize the results with different parameters.