Variant Calling

Germline Variant Calling

SNV calling from NGS data refers to a range of methods for identifying single nucleotide variants (SNVs) from the results of high-throughput sequencing (HTS) experiments. Most HTS-based methods for SNV detection are designed to detect germline variants in an individual's genome. These are the variants that an individual inherits from their parents, and they are the usual target of such analysis (except in specific applications where somatic mutations are sought). Somatic variants are mutations that have arisen de novo within groups of somatic cells in an individual (that is, they are not present in the individual's germline cells). Somatic variant analysis has been applied frequently to the study of cancer, where many studies investigate the profile of somatic mutations within cancerous tissues.

Germline SNV calling is typically based on the following steps:

  1. Filtering the set of HTS reads to remove sources of error/bias
  2. Aligning the reads to a reference genome
  3. Using an algorithm, either based on a statistical model or some heuristics, to predict the likelihood of variation at each locus, based on the quality scores and allele counts of the aligned reads at that locus
  4. Filtering the predicted results, often based on metrics relevant to the application
  5. Annotating the SNPs to predict the functional effect of each variant
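
Step 3 can be illustrated with a deliberately simple heuristic. The sketch below is not how any production caller works. It assumes a hypothetical tab-separated input with one row per aligned base (chrom, pos, ref, read base, Phred quality), ignores bases below an assumed quality cutoff of 20, and reports an alternate allele when it reaches an assumed 20% of the remaining depth. The function name call_snvs, the input layout and both thresholds are illustrative choices, not taken from any tool.

```shell
#!/bin/bash
# Toy input (made-up values): chrom, pos, ref, read_base, phred_quality.
reads=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  1 100 A A 30 \
  1 100 A A 31 \
  1 100 A G 35 \
  1 100 A G 32 \
  1 100 A G 8 \
  1 200 C C 30 \
  1 200 C C 33 \
  1 200 C T 5 > "$reads"

# Heuristic caller: count alleles per locus using only bases with
# quality >= min_q, then print loci where an alternate allele makes up
# at least min_frac of the filtered depth.
# Output columns: chrom, pos, ref, alt, alt_count/depth.
call_snvs() {
  awk -F'\t' -v min_q=20 -v min_frac=0.2 '
    $5 >= min_q {
      locus = $1 "\t" $2 "\t" $3
      depth[locus]++
      if ($4 != $3) alt[locus "\t" $4]++
    }
    END {
      for (k in alt) {
        split(k, f, "\t")
        locus = f[1] "\t" f[2] "\t" f[3]
        if (alt[k] / depth[locus] >= min_frac)
          print locus "\t" f[4] "\t" alt[k] "/" depth[locus]
      }
    }' "$1" | sort -k1,1 -k2,2n
}

call_snvs "$reads"
```

In the toy data, the low-quality G at position 100 and the low-quality T at position 200 are discarded before counting, so only position 100 is reported. Statistical callers replace the fixed fraction with a genotype likelihood model.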

The usual output of these procedures is a VCF file.
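
As a rough illustration of what such a file contains, the sketch below writes a tiny VCF with made-up records and counts the SNVs in it using awk. It relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function names count_snvs and count_pass_snvs are hypothetical helpers, and multi-allelic ALT fields such as "G,T" are not counted by this simple length test.

```shell
#!/bin/bash
# Write a minimal VCF with three hypothetical records.
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$vcf"
printf '1\t1000\t.\tA\tG\t50\tPASS\t.\n' >> "$vcf"       # SNV, passed filters
printf '1\t2000\t.\tAT\tA\t45\tPASS\t.\n' >> "$vcf"      # deletion, not an SNV
printf '2\t3000\t.\tC\tT\t12\tLowQual\t.\n' >> "$vcf"    # SNV, failed a filter

# Count records whose REF and ALT are both a single base (header lines skipped).
count_snvs() {
  awk -F'\t' '!/^#/ && length($4) == 1 && length($5) == 1 && $5 != "." { n++ }
              END { print n + 0 }' "$1"
}

# Same count, restricted to records whose FILTER column is PASS.
count_pass_snvs() {
  awk -F'\t' '!/^#/ && length($4) == 1 && length($5) == 1 && $5 != "." && $7 == "PASS" { n++ }
              END { print n + 0 }' "$1"
}

echo "SNVs: $(count_snvs "$vcf"), passing filters: $(count_pass_snvs "$vcf")"
```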

Human reference Data

The GATK resource bundle is a collection of standard files for working with human resequencing data with the GATK. The bundle is located at /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle

GRCh37

Genome Reference Consortium Human Build 37

hg38

Genome Reference Consortium Human Build 38

hg19

Similar to GRCh37, this is the February 2009 assembly of the human genome with a different mitochondrial sequence and additional alternate haplotype assemblies.

b37

The reference genome bundled with some versions of the GATK software. It combines data from GRCh37, the rCRS mitochondrial sequence, and Human herpesvirus 4 type 1 in one file.

Under the b37 directory, you can find a file named human_g1k_v37_decoy.fasta. This GRCh37-derived alignment set includes chromosomal plus unlocalized and unplaced contigs, the rCRS mitochondrial sequence (AC:NC_012920), Human herpesvirus 4 type 1 (AC:NC_007605), and decoy sequences derived from HuRef, human BAC and fosmid clones, and NA12878.

The main differences between the major reference genome releases are the coordinate system and the sequence content. After you pick a genome, stick with it throughout your entire analysis, because coordinates from one release do not map directly onto another. If you plan to use GATK for variant analysis, human_g1k_v37 (b37) is the currently recommended reference.
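
One common symptom of mixing references is a contig naming mismatch: UCSC-style hg19 files name chromosomes "chr1", "chrM", while b37 uses "1", "MT". The sketch below is a minimal check that reports which style a FASTA file uses, so that files can be compared before they are combined. The function name contig_style and the two toy FASTA files are hypothetical. The check looks only at names, not at sequence content.

```shell
#!/bin/bash
# Report the contig naming style found in the header lines of a FASTA file.
contig_style() {
  awk '/^>/ {
         name = substr($1, 2)            # drop the leading ">"
         if (name ~ /^chr/) with_prefix++
         else without_prefix++
       }
       END {
         if (with_prefix && without_prefix) print "mixed"
         else if (with_prefix)              print "chr-prefixed"
         else if (without_prefix)           print "unprefixed"
         else                               print "none"
       }' "$1"
}

# Two toy FASTA files with made-up sequence, one per naming style.
b37_like=$(mktemp)
hg19_like=$(mktemp)
printf '>1 toy contig\nACGT\n>MT\nACGT\n' > "$b37_like"
printf '>chr1\nACGT\n>chrM\nACGT\n' > "$hg19_like"

contig_style "$b37_like"
contig_style "$hg19_like"
```

Against the real bundle, the same function could be pointed at human_g1k_v37_decoy.fasta and at any other reference-derived file before starting an analysis.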

Getting sample data

Somatic Variant Calling

RNA-seq variant analysis

RNA fusion detection. Another important application of RNA-seq is the detection of fusion genes, which are abnormal genes produced by the concatenation of two separate genes as a result of chromosomal translocations or trans-splicing events. Fusion genes play a critical role in investigating the causes and development of various cancer types. According to a recent publication (PubMed: 28680106), FusionCatcher yielded the most sensitive and precise predictions.

This example finds fusion genes in the BT474 cell line using publicly available RNA-seq data from the SRA.

mkdir Fusioncatcher_Test
cd Fusioncatcher_Test
 
wget http://ftp-private.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR064/SRR064439/SRR064439.sra

Submit the following batch job to the HTC cluster.

#!/bin/bash
#SBATCH -N 1
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
#SBATCH -J Fusioncatcher_human_sample
#SBATCH -o Fusioncatcher.out
#SBATCH -t 24:00:00
 
module load fusioncatcher/0.99.7b
 
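# Run from the parent of Fusioncatcher_Test, the directory that holds SRR064439.sra
fusioncatcher -p 4 -i ./Fusioncatcher_Test/ -o ./results/
 
# Alternative: pass the Ensembl data directory explicitly with -d
# fusioncatcher -d /ihome/sam/apps/fusioncatcher/fusioncatcher/data/ensembl_v86/ -i ./Fusioncatcher_Test/ -o ./results/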

Options specified as follows:

  • '-p PROCESSES' or '--threads=PROCESSES': Number of processes/threads used for running SORT, Bowtie, BLAT, STAR, BOWTIE2 and other tools/programs. If this parameter is not specified, 1 core is used.
  • '--config=CONFIGURATION_FILENAME': The configuration file. The default configuration file is at /ihome/sam/apps/fusioncatcher/fusioncatcher/etc/configuration.cfg
  • '-i INPUT_FILENAME': The input file(s) or directory.
  • '-o OUTPUT_DIRECTORY': The output directory, where all output files containing information about the candidate fusion genes are written.
  • '-d DATA_DIRECTORY': The data directory holding the annotation files from the Ensembl database. This directory should be built using 'fusioncatcher-build'. The default directory is /ihome/sam/apps/fusioncatcher/fusioncatcher/data/ensembl_v86/
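
Once the job has finished, the candidate fusions can be summarized from the output directory. The sketch below assumes that FusionCatcher writes a tab-separated summary named final-list_candidate-fusion-genes.txt with the 5' and 3' partner gene symbols in the first two columns. Verify both the file name and the column layout against the installed version before relying on this. The function name list_fusion_pairs is hypothetical, and the toy file uses placeholder gene names and a simplified header, not real results.

```shell
#!/bin/bash
# Print unique "5' partner--3' partner" pairs from a tab-separated summary
# file, skipping the header line. Assumes partners are in columns 1 and 2.
list_fusion_pairs() {
  awk -F'\t' 'NR > 1 { print $1 "--" $2 }' "$1" | sort -u
}

# Toy stand-in for the results directory (placeholder genes, made-up counts).
toy_results=$(mktemp -d)
toy_file="$toy_results/final-list_candidate-fusion-genes.txt"
printf 'gene_5prime\tgene_3prime\tsupporting_reads\n' > "$toy_file"
printf 'GENE_A\tGENE_B\t12\n' >> "$toy_file"
printf 'GENE_C\tGENE_D\t7\n' >> "$toy_file"
printf 'GENE_A\tGENE_B\t3\n' >> "$toy_file"    # second breakpoint, same pair

list_fusion_pairs "$toy_file"
```

For the job above, the call would be list_fusion_pairs ./results/final-list_candidate-fusion-genes.txt, under the same assumptions.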