Variant Calling

Germline Variant Calling

SNV calling from NGS data refers to a range of methods for identifying single nucleotide variants (SNVs) in the results of high-throughput sequencing (HTS) experiments. Most HTS-based methods for SNV detection are designed to detect germline variants in an individual's genome: the variants biologically inherited from the individual's parents, and the usual targets of such analyses (except for certain applications where somatic mutations are sought). Somatic variants are mutations that have arisen de novo within groups of somatic cells of an individual (that is, they are not present in the individual's germline cells). Somatic analysis is frequently applied to the study of cancer, where many studies are designed around profiling the somatic mutations within cancerous tissues.

Germline SNV calling is typically based on the following steps:

  1. Filtering the set of HTS reads to remove sources of error/bias
  2. Aligning the reads to a reference genome
  3. Using an algorithm, either based on a statistical model or some heuristics, to predict the likelihood of variation at each locus, based on the quality scores and allele counts of the aligned reads at that locus
  4. Filtering the predicted results, often based on metrics relevant to the application
  5. Annotating the variants (SNP annotation) to predict the functional effect of each one.

The usual output of these procedures is a VCF file.
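For orientation, each data line of a VCF file is a tab-separated record. A minimal, hypothetical example (the position, ID, and values are invented for illustration) can be pulled apart with standard shell tools:

```shell
# A minimal, hypothetical VCF data line (fields are tab-separated):
# CHROM  POS       ID         REF  ALT  QUAL    FILTER  INFO   FORMAT  sample
record=$'20\t10042538\trs2298108\tT\tC\t736.24\tPASS\tDP=55\tGT:DP\t0/1:55'

# Extract the coordinates and alleles (columns 1, 2, 4 and 5)
chrom=$(printf '%s\n' "$record" | cut -f1)
pos=$(printf '%s\n' "$record" | cut -f2)
ref=$(printf '%s\n' "$record" | cut -f4)
alt=$(printf '%s\n' "$record" | cut -f5)
echo "SNV at $chrom:$pos $ref>$alt"
```

The FORMAT column (here GT:DP) defines how each sample column is structured, which is how multi-sample VCFs from joint genotyping stay self-describing.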

Human Reference Data

The GATK resource bundle is a collection of standard files for working with human resequencing data with the GATK. On our system it is located at /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle

GRCh37

Genome Reference Consortium Human Build 37

hg38

The UCSC release of Genome Reference Consortium Human Build 38 (GRCh38)

hg19

Similar to GRCh37, this is the February 2009 assembly of the human genome with a different mitochondrial sequence and additional alternate haplotype assemblies.

b37

The reference genome included with some versions of the GATK software, combining GRCh37, the rCRS mitochondrial sequence, and Human herpesvirus 4 type 1 in one file.

Under the b37 directory, you can find a file named human_g1k_v37_decoy.fasta. This GRCh37-derived alignment set includes the chromosomes plus unlocalized and unplaced contigs, the rCRS mitochondrial sequence (AC:NC_012920), Human herpesvirus 4 type 1 (AC:NC_007605), and decoy sequences derived from HuRef, human BAC and fosmid clones, and NA12878.

The major reference genome releases differ in both coordinate system and content. Once you pick a genome, stick with it throughout your entire analysis to avoid coordinate mismatches. If you plan to use GATK for variant analysis, human_g1k_v37 is the current recommended reference.

Getting sample data

We will use the GATK Best Practices to analyze six exome sequencing samples. The samples are from the 1000 Genomes CHB (Han Chinese in Beijing) exome data; their SRA accessions are SRR098333 through SRR098338.

Use the following commands to download the data.

module load SRAToolkit/2.8.2
 
prefetch SRR098333
fastq-dump --split-3 --qual-filter-1 SRR098333
 
prefetch SRR098334
fastq-dump --split-3 --qual-filter-1 SRR098334
 
prefetch SRR098335
fastq-dump --split-3 --qual-filter-1 SRR098335
 
prefetch SRR098336
fastq-dump --split-3 --qual-filter-1 SRR098336
 
prefetch SRR098337
fastq-dump --split-3 --qual-filter-1 SRR098337
 
prefetch SRR098338
fastq-dump --split-3 --qual-filter-1 SRR098338
  • --split-3 Legacy three-file splitting for mate pairs: biological reads satisfying the dumping conditions are placed in *_1.fastq and *_2.fastq; if only one biological read is present, it is placed in *.fastq; biological reads 3 and above are ignored.
  • --qual-filter-1 Filter used in the current 1000 Genomes data
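After dumping, it is worth confirming that the *_1.fastq and *_2.fastq mate files still contain the same number of reads. A quick sketch, run here on two tiny hypothetical FASTQ files (a FASTQ record is exactly four lines):

```shell
# Create two tiny, hypothetical paired FASTQ files (4 lines = 1 read each)
printf '@r1/1\nACGT\n+\nIIII\n' > sample_1.fastq
printf '@r1/2\nTGCA\n+\nIIII\n' > sample_2.fastq

# Reads per file = lines / 4
n1=$(( $(wc -l < sample_1.fastq) / 4 ))
n2=$(( $(wc -l < sample_2.fastq) / 4 ))

if [ "$n1" -eq "$n2" ]; then
    echo "pairs OK: $n1 read(s) per mate file"
else
    echo "WARNING: mate files differ ($n1 vs $n2 reads)" >&2
fi
```

On the real data, substitute SRR098333_1.fastq and SRR098333_2.fastq (and so on) for the toy files.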

Aligning the reads to a reference genome

We can use bwa mem or bowtie2 to align the reads to a reference genome. We choose b37 as the human reference genome.

To align the reads with bwa mem, submit the following job to the HTC cluster.

#!/bin/bash
#
#SBATCH --job-name=bwa_mem
#SBATCH -N 1
#SBATCH --cpus-per-task=8 # Request that ncpus be allocated per process.
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH --output=bwa_mem.out
 
module load bwa/0.7.15-gcc5.2.0
 
#bwa index /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta
 
bwa mem -t 8 /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta SRR098333_1.fastq SRR098333_2.fastq > alignments/SRR098333.bwa_mem.sam
 
bwa mem -t 8 /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta SRR098334_1.fastq SRR098334_2.fastq > alignments/SRR098334.bwa_mem.sam
 
bwa mem -t 8 /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta SRR098335_1.fastq SRR098335_2.fastq > alignments/SRR098335.bwa_mem.sam
 
bwa mem -t 8 /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta SRR098336_1.fastq SRR098336_2.fastq > alignments/SRR098336.bwa_mem.sam
 
bwa mem -t 8 /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta SRR098337_1.fastq SRR098337_2.fastq > alignments/SRR098337.bwa_mem.sam
 
bwa mem -t 8 /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta SRR098338_1.fastq SRR098338_2.fastq > alignments/SRR098338.bwa_mem.sam

bwa mem needs a BWA index built from the reference sequences. For b37, this index has already been built; uncomment the bwa index line in the script above to build your own.
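Rather than repeating one bwa mem line per sample, the six runs can be driven by a loop over the accession IDs. A sketch that echoes the commands instead of executing them (bwa and the FASTQ files are assumed to be present on the cluster):

```shell
# Reference path from the GATK resource bundle (as used above)
REF=/mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta

count=0
for id in SRR098333 SRR098334 SRR098335 SRR098336 SRR098337 SRR098338; do
    # Replace 'echo' with the real invocation inside the SBATCH script
    echo "bwa mem -t 8 $REF ${id}_1.fastq ${id}_2.fastq > alignments/${id}.bwa_mem.sam"
    count=$((count + 1))
done
echo "generated $count alignment commands"
```

The same pattern applies to the bowtie2, samtools, and Picard steps below; the job-array approach used later in this document is an alternative that also parallelizes across cluster nodes.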

To align the reads with bowtie2, submit the following job to the HTC cluster.

#!/bin/bash
#
#SBATCH --job-name=bowtie2
#SBATCH -N 1
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
#SBATCH -t 1-00:00 # Runtime in D-HH:MM
#SBATCH --output=bowtie2.out
 
module load bowtie2/2.3.2-gcc5.2.0
 
# bowtie2-build --threads 16 -f /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.bowtie2_index
 
bowtie2 -p 16 -x /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.bowtie2_index -1 SRR098333_1.fastq -2 SRR098333_2.fastq -S alignments/SRR098333.bowtie2.sam
 
bowtie2 -p 16 -x /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.bowtie2_index -1 SRR098334_1.fastq -2 SRR098334_2.fastq -S alignments/SRR098334.bowtie2.sam
 
bowtie2 -p 16 -x /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.bowtie2_index -1 SRR098335_1.fastq -2 SRR098335_2.fastq -S alignments/SRR098335.bowtie2.sam
 
bowtie2 -p 16 -x /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.bowtie2_index -1 SRR098336_1.fastq -2 SRR098336_2.fastq -S alignments/SRR098336.bowtie2.sam
 
bowtie2 -p 16 -x /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.bowtie2_index -1 SRR098337_1.fastq -2 SRR098337_2.fastq -S alignments/SRR098337.bowtie2.sam
 
bowtie2 -p 16 -x /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.bowtie2_index -1 SRR098338_1.fastq -2 SRR098338_2.fastq -S alignments/SRR098338.bowtie2.sam

Germline SNP & Indel Discovery in Exome Sequence

We then followed GATK's Best Practices for Germline SNP & Indel Discovery in Whole Genome and Exome Sequence to analyze the data set.

Convert to BAM and sort

First, use samtools to sort the alignments by coordinate and convert them to BAM format.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J samtools_sort_human_samples
#SBATCH --output=samtools.out
 
#SBATCH --cpus-per-task=16 # Request that ncpus be allocated per process.
 
module load samtools/1.3.1-gcc5.2.0
 
samtools sort -@ 16 -o SRR098333.bwa_mem.sorted.bam SRR098333.bwa_mem.sam
samtools sort -@ 16 -o SRR098334.bwa_mem.sorted.bam SRR098334.bwa_mem.sam
samtools sort -@ 16 -o SRR098335.bwa_mem.sorted.bam SRR098335.bwa_mem.sam
samtools sort -@ 16 -o SRR098336.bwa_mem.sorted.bam SRR098336.bwa_mem.sam
samtools sort -@ 16 -o SRR098337.bwa_mem.sorted.bam SRR098337.bwa_mem.sam
samtools sort -@ 16 -o SRR098338.bwa_mem.sorted.bam SRR098338.bwa_mem.sam
 
samtools sort -@ 16 -o SRR098333.bowtie2.sorted.bam SRR098333.bowtie2.sam
samtools sort -@ 16 -o SRR098334.bowtie2.sorted.bam SRR098334.bowtie2.sam
samtools sort -@ 16 -o SRR098335.bowtie2.sorted.bam SRR098335.bowtie2.sam
samtools sort -@ 16 -o SRR098336.bowtie2.sorted.bam SRR098336.bowtie2.sam
samtools sort -@ 16 -o SRR098337.bowtie2.sorted.bam SRR098337.bowtie2.sam
samtools sort -@ 16 -o SRR098338.bowtie2.sorted.bam SRR098338.bowtie2.sam

Identify read group information

The read group information is key for downstream GATK functionality. The GATK will not work without a read group tag. Make sure to enter as much metadata as you know about your data in the read group fields provided.

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J picard_human_samples
# #SBATCH --output=samtools.out
 
#SBATCH --cpus-per-task=1 # Request that ncpus be allocated per process.
 
module load picard/2.11.0
 
java -Xms5g -Xmx16g -jar /ihome/sam/apps/picard/picard-2.11.0/picard.jar AddOrReplaceReadGroups I=SRR098333.bwa_mem.sorted.bam O=SRR098333.bwa_mem.sorted_rg.bam SORT_ORDER=coordinate RGLB=NA18634 RGPL=illumina RGPU=NA18634 RGSM=NA18634
 
java -Xms5g -Xmx16g -jar /ihome/sam/apps/picard/picard-2.11.0/picard.jar AddOrReplaceReadGroups I=SRR098334.bwa_mem.sorted.bam O=SRR098334.bwa_mem.sorted_rg.bam SORT_ORDER=coordinate RGLB=NA18553 RGPL=illumina RGPU=NA18553 RGSM=NA18553
 
java -Xms5g -Xmx16g -jar /ihome/sam/apps/picard/picard-2.11.0/picard.jar AddOrReplaceReadGroups I=SRR098335.bwa_mem.sorted.bam O=SRR098335.bwa_mem.sorted_rg.bam SORT_ORDER=coordinate RGLB=NA18631 RGPL=illumina RGPU=NA18631 RGSM=NA18631
 
java -Xms5g -Xmx16g -jar /ihome/sam/apps/picard/picard-2.11.0/picard.jar AddOrReplaceReadGroups I=SRR098336.bwa_mem.sorted.bam O=SRR098336.bwa_mem.sorted_rg.bam SORT_ORDER=coordinate RGLB=NA18616 RGPL=illumina RGPU=NA18616 RGSM=NA18616
 
java -Xms5g -Xmx16g -jar /ihome/sam/apps/picard/picard-2.11.0/picard.jar AddOrReplaceReadGroups I=SRR098337.bwa_mem.sorted.bam O=SRR098337.bwa_mem.sorted_rg.bam SORT_ORDER=coordinate RGLB=NA18638 RGPL=illumina RGPU=NA18638 RGSM=NA18638
 
java -Xms5g -Xmx16g -jar /ihome/sam/apps/picard/picard-2.11.0/picard.jar AddOrReplaceReadGroups I=SRR098338.bwa_mem.sorted.bam O=SRR098338.bwa_mem.sorted_rg.bam SORT_ORDER=coordinate RGLB=NA18567 RGPL=illumina RGPU=NA18567 RGSM=NA18567
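A read group appears in the BAM header as an @RG line, and the RG fields set above map onto its tags directly. A small sketch that builds such a line (the ID:1 tag is an assumption based on Picard's default RGID; the other values are those used for SRR098333 above) and extracts the SM tag, which is what GATK uses as the sample name:

```shell
# @RG header line as Picard AddOrReplaceReadGroups would write it
rg=$'@RG\tID:1\tLB:NA18634\tPL:illumina\tPU:NA18634\tSM:NA18634'

# Pull out the SM (sample) field
sm=$(printf '%s\n' "$rg" | tr '\t' '\n' | grep '^SM:' | cut -d: -f2)
echo "sample: $sm"
```

On the cluster you can inspect the real header with samtools view -H on the output BAM to confirm the read group was written as expected.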

Mark duplicates

Once your data has been mapped to the reference genome, you can proceed to mark duplicates. The idea here is that during the sequencing process, the same DNA fragments may be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant.
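By default MarkDuplicates does not delete the duplicate reads; it sets bit 0x400 (decimal 1024) in the SAM FLAG field so downstream tools can skip them. A sketch that tests the bit on a hypothetical alignment record:

```shell
# Hypothetical SAM FLAG (column 2 of a record):
# 1024 (0x400, PCR/optical duplicate) + 99 (paired, proper pair, mate reverse, read 1) = 1123
flag=1123

# Test bit 0x400 with shell arithmetic
if [ $(( flag & 1024 )) -ne 0 ]; then
    dup="duplicate"
else
    dup="not duplicate"
fi
echo "FLAG $flag: $dup"
```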

#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J picard_human_samples
# #SBATCH --output=picard.out
#SBATCH --array=3-8 # job array index 3, 4, ..., 8
 
#SBATCH --cpus-per-task=1 # Request that ncpus be allocated per process.
 
module load picard/2.11.0
 
echo "parsing sample: SRR09833" $SLURM_ARRAY_TASK_ID
 
java -Xms5g -Xmx16g -jar /ihome/sam/apps/picard/picard-2.11.0/picard.jar MarkDuplicates I=SRR09833$SLURM_ARRAY_TASK_ID.bwa_mem.sorted_rg.bam O=SRR09833$SLURM_ARRAY_TASK_ID.bwa_mem.sorted_dups_removed.bam METRICS_FILE=SRR09833${SLURM_ARRAY_TASK_ID}_metrics.txt CREATE_INDEX=TRUE

Here we used a job array, which offers a mechanism for submitting and managing collections of similar jobs quickly and easily.
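One shell detail worth noting when building filenames from the array index: if the index is followed immediately by more identifier characters, braces are required, or the shell reads a different (unset) variable name. A quick demonstration using a stand-in for the value SLURM assigns:

```shell
# Stand-in for the value SLURM assigns to each array task
SLURM_ARRAY_TASK_ID=3

# Without braces the shell parses this as $SLURM_ARRAY_TASK_ID_metrics (unset -> empty)
wrong="SRR09833$SLURM_ARRAY_TASK_ID_metrics.txt"

# With braces the end of the variable name is explicit
right="SRR09833${SLURM_ARRAY_TASK_ID}_metrics.txt"

echo "without braces: $wrong"
echo "with braces:    $right"
```

This is why the METRICS_FILE argument in the MarkDuplicates command needs `${SLURM_ARRAY_TASK_ID}` rather than the bare variable.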

Perform local realignment of reads around indels

The local realignment process is designed to consume one or more BAM files and to locally realign reads such that the number of mismatching bases is minimized across all the reads. Indel realignment is no longer necessary for variant discovery if you plan to use a variant caller that performs a haplotype assembly step, such as HaplotypeCaller or MuTect2.

There are 2 steps to the realignment process:

  • Determining (small) suspicious intervals which are likely in need of realignment (see the RealignerTargetCreator tool)
#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J GATK_human_samples
#SBATCH --output=RealignerTargetCreator.out
#SBATCH --array=3-8 # job array index 3, 4, ..., 8
 
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
 
module load GATK/3.8.0
 
echo "parsing sample: SRR09833" $SLURM_ARRAY_TASK_ID
 
java -Xms5g -Xmx32g  -jar /ihome/sam/apps/GATK/GenomeAnalysisTK-3.8.0/GenomeAnalysisTK.jar -T RealignerTargetCreator -nt 4 -R /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta -I SRR09833$SLURM_ARRAY_TASK_ID.bwa_mem.sorted_dups_removed.bam -o SRR09833$SLURM_ARRAY_TASK_ID.forIndelRealigner.intervals -known /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/1000G_phase1.indels.b37.vcf -known /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf
  • Running the realigner over those intervals (IndelRealigner)
#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J GATK_human_samples
#SBATCH --output=IndelRealigner-%A_%a.out
#SBATCH --array=3-8 # job array index 3, 4, ..., 8
 
#SBATCH --cpus-per-task=1 # Request that ncpus be allocated per process.
 
module load GATK/3.8.0
 
echo "parsing sample: SRR09833" $SLURM_ARRAY_TASK_ID
 
java -Xms5g -Xmx32g -jar /ihome/sam/apps/GATK/GenomeAnalysisTK-3.8.0/GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta \
    -I SRR09833$SLURM_ARRAY_TASK_ID.bwa_mem.sorted_dups_removed.bam \
    -known /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/1000G_phase1.indels.b37.vcf \
    -known /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \
    -targetIntervals SRR09833$SLURM_ARRAY_TASK_ID.forIndelRealigner.intervals \
    -o SRR09833$SLURM_ARRAY_TASK_ID.bwa_mem.sorted_dups_removed_indelrealigner.bam

Note the -known option, which provides known sites. Each GATK tool uses known sites differently, but all of them use known sites to help distinguish true variants from false positives, which is central to how these tools work. See this documentation ( https://gatkforums.broadinstitute.org/gatk/discussion/1247/what-should-i... ) for guidance on which known variants/sites to use when running GATK tools.

Recalibrate Bases

Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic technical error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration (BQSR) is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This allows us to get more accurate base qualities, which in turn improves the accuracy of our variant calls.
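The quality scores being recalibrated are Phred-scaled: Q = -10·log10(P(error)), so Q30 corresponds to a 1-in-1000 error probability. A quick sketch of the relationship (awk supplies the floating-point math):

```shell
# Phred score for a given error probability: Q = -10 * log10(p)
p=0.001
q=$(awk -v p="$p" 'BEGIN { printf "%.0f", -10 * log(p) / log(10) }')
echo "error probability $p -> Q$q"
```

BQSR shifts these scores so that, empirically, bases reported at Q30 really do mismatch the reference at known-variant-free sites about 0.1% of the time.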

The base recalibration process involves two key steps:

  • first the program builds a model of covariation based on the data and a set of known variants (which you can bootstrap if there is none available for your organism)
#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J GATK_human_samples
#SBATCH --output=BaseRecalibrator-%A_%a.out
#SBATCH --array=3-8 # job array index 3, 4, ..., 8
 
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
 
module load GATK/3.8.0
 
echo "parsing sample: SRR09833" $SLURM_ARRAY_TASK_ID
 
java -Xms5g -Xmx32g -jar /ihome/sam/apps/GATK/GenomeAnalysisTK-3.8.0/GenomeAnalysisTK.jar \
    -T BaseRecalibrator \
    -nct 4 \
    -R /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta \
    -knownSites /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/1000G_phase1.indels.b37.vcf \
    -knownSites /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \
    -knownSites /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/dbsnp_138.b37.vcf \
    -I SRR09833$SLURM_ARRAY_TASK_ID.bwa_mem.sorted_dups_removed_indelrealigner.bam \
    -o SRR09833$SLURM_ARRAY_TASK_ID.recal_data.table
  • then it adjusts the base quality scores in the data based on the model. This step is unnecessary for tools that can consume the recal_data.table directly (HaplotypeCaller, for example, accepts it via -BQSR).
#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J GATK_human_samples
#SBATCH --exclusive
#SBATCH --output=PrintReads-3.8.0_lowmem.out
#SBATCH --array=3-8 # job array index 3, 4, ..., 8
 
#SBATCH --cpus-per-task=1 # Request that ncpus be allocated per process.
 
module load GATK/3.8.0
 
echo "parsing sample: SRR09833" $SLURM_ARRAY_TASK_ID
 
java -Xms1g -Xmx5g -jar /ihome/sam/apps/GATK/GenomeAnalysisTK-3.8.0/GenomeAnalysisTK.jar -T PrintReads -R /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta -BQSR SRR09833$SLURM_ARRAY_TASK_ID.recal_data.table -I SRR09833$SLURM_ARRAY_TASK_ID.bwa_mem.sorted_dups_removed_indelrealigner.bam -o SRR09833$SLURM_ARRAY_TASK_ID.bwa_mem.sorted_dups_removed_indelrealigner_BQSR.bam

Variant Discovery

Once you have pre-processed your data according to the GATK recommendations, you are ready to undertake the variant discovery process, i.e. identify the sites where your data displays variation relative to the reference genome, and calculate genotypes for each sample at those sites.

For DNA, the variant calling step is further subdivided into two separate steps (per-sample calling followed by joint genotyping across samples) in order to enable scalable and incremental processing of cohorts comprising many individual samples.

  • This workflow involves running HaplotypeCaller on each sample separately in GVCF mode, to produce an intermediate file format called GVCF (for Genomic VCF).
#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J GATK_human_samples
#SBATCH --output=HaplotypeCaller-%A_%a.out
#SBATCH --array=3-8 # job array index 3, 4, ..., 8
 
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
 
module load GATK/3.8.0
 
echo "parsing sample: SRR09833" $SLURM_ARRAY_TASK_ID
 
java -Xms5g -Xmx32g -jar /ihome/sam/apps/GATK/GenomeAnalysisTK-3.8.0/GenomeAnalysisTK.jar \
    -T HaplotypeCaller \
    -nct 4 \
    -R /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta \
    --emitRefConfidence GVCF \
    --dbsnp /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/dbsnp_138.b37.vcf \
    -I SRR09833$SLURM_ARRAY_TASK_ID.bwa_mem.sorted_dups_removed_indelrealigner.bam \
    -BQSR SRR09833$SLURM_ARRAY_TASK_ID.recal_data.table \
    -o SRR09833$SLURM_ARRAY_TASK_ID.raw.snps.indels.g.vcf
  • The GVCFs of multiple samples are then run through a joint genotyping step to produce a multi-sample VCF callset, which can then be filtered to balance sensitivity and specificity as desired.
#!/bin/bash
#
#SBATCH -N 1 # Ensure that all cores are on one machine
#SBATCH -t 3-00:00 # Runtime in D-HH:MM
#SBATCH -J GATK_human_samples
#SBATCH --output=GenotypeGVCFs.out
 
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
 
module load GATK/3.8.0
 
java -Xms5g -Xmx64g -jar /ihome/sam/apps/GATK/GenomeAnalysisTK-3.8.0/GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -nt 4 \
    -R /mnt/mobydisk/pan/genomics/refs/GATK_Resource_Bundle/b37/human_g1k_v37.fasta \
    -V SRR098333.raw.snps.indels.g.vcf \
    -V SRR098334.raw.snps.indels.g.vcf \
    -V SRR098335.raw.snps.indels.g.vcf \
    -V SRR098336.raw.snps.indels.g.vcf \
    -V SRR098337.raw.snps.indels.g.vcf \
    -V SRR098338.raw.snps.indels.g.vcf \
    -o SRR098333-8.GVCF.vcf
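The list of -V arguments can also be assembled programmatically, which scales better as samples are added to the cohort. A sketch that builds the argument string for the six accessions:

```shell
# Build the per-sample GVCF argument list for GenotypeGVCFs
vargs=""
for i in 3 4 5 6 7 8; do
    vargs="$vargs -V SRR09833${i}.raw.snps.indels.g.vcf"
done

echo "joint-genotyping arguments:$vargs"
# Inside the SBATCH script this would be passed unquoted:
#   java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fasta $vargs -o cohort.vcf
```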

Filter Variants

The GATK's variant calling tools are designed to be very lenient in order to achieve a high degree of sensitivity. This is good because it minimizes the chance of missing real variants, but it means the raw callset must be filtered to reduce the number of false positives, which can be substantial.

  • Recalibrate variant quality scores = run VQSR

VQSR requires at least 30 exome samples. Exome data yields fewer variants per sample than WGS, so a few exome samples are typically insufficient to build a robust recalibration model. If necessary, add exomes from the 1000 Genomes Project or a comparable source.

The GVCF files for each alignment, along with 30 whole-exome sequencing samples from the 1000 Genomes Project, were therefore merged and joint-genotyped with GenotypeGVCFs.

  • Apply hard filters to a call set
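As an illustration of what a hard filter does, the sketch below applies a QD < 2.0 cutoff (one of the SNP thresholds commonly suggested in the GATK documentation) to two hypothetical VCF records using awk. In practice you would use GATK's VariantFiltration, which writes a filter name into the FILTER column rather than dropping the line:

```shell
# Two hypothetical VCF data lines with QD in the INFO field
printf '1\t100\t.\tA\tG\t50\t.\tQD=25.3;DP=40\n'  > toy.vcf
printf '1\t200\t.\tC\tT\t30\t.\tQD=1.1;DP=12\n'  >> toy.vcf

# Keep records with QD >= 2.0 (crude stand-in for a "QD < 2.0" hard filter)
kept=$(awk -F'\t' '{
    split($8, info, ";")
    for (k in info) if (info[k] ~ /^QD=/) { split(info[k], q, "="); qd = q[2] }
    if (qd + 0 >= 2.0) print $2
}' toy.vcf)
echo "positions passing filter: $kept"
```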

Variant Annotation

The called variants can be further analyzed using VariantAnnotator, and SNPs and indels can be subjected to variant quality score recalibration separately. The resulting callset can then be annotated with ANNOVAR.

Somatic Variant Calling

Detection of cancer mutations is different from the detection of germline genetic variants, mainly because:

  • We are only interested in somatic mutations which are the differences between tumor and germline DNA.
  • Tumor samples are often impure due to a mixture of tumor and normal cells.
  • Tumors consist of subclones bearing different somatic mutations.

Germline genotype callers such as GATK's HaplotypeCaller are optimized for diploid samples (or samples of known ploidy) and for detecting variants with allele frequencies close to 0.5 or 1. Somatic variants should therefore be called with specialized callers. We follow the GATK best practices for somatic mutation calling in exome/panel and whole-genome data. MuTect2 is a somatic SNP and indel caller that combines the DREAM challenge-winning somatic genotyping engine of the original MuTect with the assembly-based machinery of HaplotypeCaller.
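The low and variable allele fractions are the key difference. As a hypothetical illustration, the fraction of reads supporting the alternate allele (the variant allele fraction, VAF) can be computed from an AD-style allelic depth annotation:

```shell
# Hypothetical allelic depths from a tumor sample: 90 ref reads, 10 alt reads
ref_depth=90
alt_depth=10

# Variant allele fraction = alt / (ref + alt); awk supplies the floating point
vaf=$(awk -v r="$ref_depth" -v a="$alt_depth" 'BEGIN { printf "%.2f", a / (r + a) }')
echo "VAF = $vaf"
# A germline heterozygote would sit near 0.50; a value like 0.10 suggests a
# subclonal somatic variant or heavy normal-cell contamination of the tumor.
```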

Data

As part of its goal to overcome obstacles in realizing the full benefits of applying genomic methods to cancer diagnosis and the guidance of precision treatment, the BCM Human Genome Sequencing Center and the TCRB have made genomic sequence data from high-quality human tumor specimens with matched normals freely available to the public, without the barriers that other, more restrictive data sharing initiatives impose. This data set can be accessed at http://txcrb.org/open.html

We have applied MuTect2 to the case_001, case_002 and case_003 tumor-normal pair data. For case_001, download TCRBOA1-N-WEX.read1.fastq.bz2, TCRBOA1-N-WEX.read2.fastq.bz2, TCRBOA1-T-WEX.read1.fastq.bz2 and TCRBOA1-T-WEX.read2.fastq.bz2. Download the corresponding data for case_002 and case_003.

Initial QC and adapter trimming

You should perform proper initial QC and adapter trimming on your own data.

Generation and initial processing of BAM files

BAM files should be generated and preprocessed according to GATK's Best Practices for variant discovery (https://www.broadinstitute.org/gatk/guide/best-practices).

RNA-seq variant analysis

RNA fusion detection. Another important application of RNA-seq is to detect fusion genes, which are abnormal genes produced by the concatenation of two separate genes arising from chromosomal translocations or trans-splicing events. Fusion genes play a critical role in the development of various cancer types and in investigating their causes. Based on a recent comparison (PubMed: 28680106), FusionCatcher yielded the most sensitive and precise predictions.

This is an example of finding fusion genes in the BT474 cell line using publicly available RNA-seq data (from the SRA archive).

mkdir Fusioncatcher_Test
cd Fusioncatcher_Test
 
wget http://ftp-private.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/...
 
 
wget http://ftp-private.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR064/SRR064439/SRR064439.sra

Submit the following batch job to the HTC cluster.

#!/bin/bash
#SBATCH -N 1
#SBATCH --cpus-per-task=4 # Request that ncpus be allocated per process.
#SBATCH -J Fusioncatcher_human_sample
#SBATCH -o Fusioncatcher.out
#SBATCH -t 24:00:00
 
module load fusioncatcher/0.99.7b
 
fusioncatcher -p 4 -i ./Fusioncatcher_samples/ -o ./results/
 
# fusioncatcher -d /ihome/sam/apps/fusioncatcher/fusioncatcher/data/ensembl_v86/ -i ./Fusioncatcher_Test/ -o ./results/

Options specified as follows:

  • '-p PROCESSES', or '--threads=PROCESSES' Number of processes/threads to be used for running SORT, Bowtie, BLAT, STAR, BOWTIE2 and other tools/programs. If this parameter is not specified, 1 core is used.
  • ' --config=CONFIGURATION_FILENAME', The default configuration file is at /ihome/sam/apps/fusioncatcher/fusioncatcher/etc/configuration.cfg
  • '-i INPUT_FILENAME' The input file(s) or directory.
  • '-o OUTPUT_DIRECTORY', The output directory where all the output files containing information about the found candidate fusion genes are written.
  • '-d DATA_DIRECTORY', The data directory where all the annotations files from Ensembl database are placed. This directory should be built using 'fusioncatcher-build'. The default directory is /ihome/sam/apps/fusioncatcher/fusioncatcher/data/ensembl_v86/