Variant Calling Practical

Part 1: Variant Calling

The web-based IGV pileup viewer below is loaded with the 1000 Genome Phase 3 high coverage whole genome sequencing (WGS) data for the NA12878 parent-child trio (NA12878 and her parents). Enter each of the genomic coordinates below into the search box and on Socrative, describe the genomic feature you see at this location in the child (NA12878) including type (e.g. SNV), zygosity (e.g. heterozygous) and inheritance (i.e. is the variant de novo or inherited). You may need to zoom in/out and adjust the viewer settings (with the gear) to fully investigate a variant.

We are using the web-based pileup viewer to get started quickly (without installing any software). In practice you would likely use a locally-installed version of the IGV pileup viewer.

Part 2: Bitter Tasting

Kim et al. identified a 3 variant haplotype in the TAS2R38 gene associated with the ability to taste the substance phenylthiocarbamide (PTC). The haplotype variants are described in the paper as A49P, V262A, and I296V, with "PAV" the dominant tasting haplotype. A little work is required to (reverse) translate that specification into genomic coordinates. We can translate those amino acid substitutions to nucleotide substitutions using the protein accession obtained from the NCBI and the Mutalyzer "back translator", e.g. NP_789787.4(TAS2R38):p.A49P. We take the coding nomenclature Mutalyzer reports NM_176817.4:c.145G>C to its position converter to obtain the GRCh37 genomic coordinates, NC_000007.13:g.141673345C>G, or chr7:141673345C>G. If we follow this process for all three variants in the paper we obtain the following genomic variants corresponding to the protein variants described in the paper:

Paper VariantGenomic Variant (GRCh37/hg19)
I296V chr7:141672604T>C
V262A chr7:141672705G>A
A49P chr7:141673345C>G

We can use MySeq to query the WGS VCF (Variant Call File) for those variants by genomic position, e.g. chr7:141673345. Clicking on any of the variant rows will bring up additional annotations, including "Functional" annotations that we can use to verify our reverse translation.

The heterozygous T/C, G/A and C/G genotypes are very suggestive that NA12878 carries one copy of the tasting haplotype. However, the variant call data alone is insufficient to know what haplotypes are present (i.e. which alleles are in phase). How could you (partially) verify that NA12878 carries the previously reported haplotype? Show the answer.

The first two variants are only 101bp apart. Thus we can verify that those variants are indeed on the same chromosome by looking to see if the alleles are on same fragment (i.e. in the same read or read-pair), i.e. perform "read-backed phasing". The pileup for NA12878's whole genome sequencing (WES) data is shown below for the region spanning the first two variants (from the 1000 Genome Phase 3 dataset).

Review the pileup data in this region for NA12878. Can you verify the presence of the haplotype? What would you initially look for? Show the answer.

We would look for the C ("blue") and A ("green") alternate alleles to occur in the same fragment.

Did you find what you are looking for? When we review the pileup in this region we don't see any fragments with the C ("blue") and A ("green") alternate alleles. However, when we review the GenBank entry for TAS2R38, we see the following misc_difference.

      /gene_synonym="PTC; T2R38; T2R61; THIOT"
      /note="This sequence differs from the reference genome
      assembly (NCBI Build 37) at this position. C was replaced
      by T to represent the more common allele."

The reference G allele in chr7:141672705G>A is the alternate Ala262 amino acid. Thus the "taster" haplotype PAV described in the paper corresponds to the bolded alleles shown below. As we see above, those alleles do appear to be on the same chromosome in NA12878.

chr7:141673345C>G C/G
chr7:141672705G>A G/A
chr7:141672604T>C T/C

Having verified the haplotype structure, we can now predict that NA12878, who carries one copy of the dominant taster haplotype, tastes PTC as bitter.


Today we did a lot of manual variant calling. Obviously we don't call 4-5 million variants by hand. But manual review of the pileup is still an important component of the workflow, especially for tricky variants. Similarly the variant callers (and downstream tools like GATK's read-backed phasing and WhatsHap) can automaticall perform local phasing (and, as you read, having long reads enables you to phase over larger regions), but tricky situations (like above) may require manual review.

MySeq provides variant query and a variety of other analyses, I encourage you to check it out.