Question: What are the best-known workflows or tutorials for SNP calling?

Neelam Jha
1252 days ago


Hi Neelam,

Extracting single nucleotide polymorphisms (SNPs) from raw sequence data involves many processing steps and a diverse set of tools. A typical pipeline includes quality control, mapping of short reads to the reference genome, visualization, and post-processing of the alignment, including base quality recalibration. The following are useful pipeline links for SNP calling:

A simple SNP calling pipeline:


A beginners guide to SNP calling from high-throughput DNA-sequencing data.


SNP Calling Workshop


Calling SNPs with Samtools


Variant Callers for Next-Generation Sequencing Data: A Comparison Study


GotCloud: Variant Calling Pipeline


ngs_backbone: a pipeline for read cleaning, mapping and SNP calling using Next Generation Sequence


iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data


QualitySNPng: a user-friendly SNP detection and visualization tool


Calling variants using BWA and GATK best practice pipeline


SNP calling pipeline


dDocent: a RADseq, variant-calling pipeline designed for population genomics of non-model organisms


Pipeline for SNP Analysis


UGP Variant Pipeline 0.0.3


SNPs Calling




Hi Neelam,

There are several workflows, but you might find this SNP pipeline useful. It uses GATK for SNP calling and currently starts from a BWA alignment.

In a nutshell, the flow is: realign the BAM file using GATK -> call SNPs using GATK -> call indels -> filter the resulting VCF files -> annotate the called and filtered SNPs.
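To make the filtering step of such a flow concrete, here is a minimal sketch that keeps only VCF records passing a quality and depth cutoff. It is not part of the pipeline linked above: the function name `filter_vcf`, the helper `demo_vcf`, the thresholds, and the sample records are all made up for illustration, and it assumes the depth is stored as `DP=` in the INFO column.

```shell
#!/usr/bin/env bash
# filter_vcf MIN_QUAL MIN_DP
# Hypothetical helper: reads a VCF on stdin, keeps header lines and any
# record whose QUAL (column 6) and INFO DP value meet the given minimums.
filter_vcf() {
    local min_qual="$1" min_dp="$2"
    awk -F'\t' -v q="$min_qual" -v d="$min_dp" '
        /^#/ { print; next }                  # pass header lines through
        {
            dp = 0
            n = split($8, info, ";")          # INFO is column 8
            for (i = 1; i <= n; i++)
                if (info[i] ~ /^DP=/) dp = substr(info[i], 4) + 0
            if ($6 != "." && $6 + 0 >= q && dp >= d) print
        }'
}

# Tiny made-up VCF, only for demonstration.
demo_vcf() {
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t60.5\t.\tDP=25\n'
    printf 'chr1\t200\t.\tC\tT\t12.0\t.\tDP=40\n'
    printf 'chr1\t300\t.\tG\tA\t80.0\t.\tDP=4\n'
}

# Keep records with QUAL >= 30 and DP >= 10.
demo_vcf | filter_vcf 30 10
```

On real data you would pipe your own file in, e.g. `filter_vcf 30 10 < calls.vcf > filtered.vcf` (file names are placeholders). Dedicated tools offer much richer filtering; this only shows the idea.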



Hi Poonam,

The various aspects of SNP calling are covered in this recent paper, http://www.ncbi.nlm.nih.gov/pubmed/21478889, entitled "A framework for variation discovery and genotyping using next-generation DNA sequencing data", from the authors of GATK. In addition, keep an eye on the software manual at http://www.broadinstitute.org/gatk/ for up-to-date options incorporated in the toolkit.



Hi Neelam,

Here is my GATK workflow for paired-end Illumina data. SNP calling uses the following steps:

Download the SNP and indel databases from ftp://gsapubftp-anonymous@ftp.broadinstitute.org (bundle -> 1.5 -> hg19).

Get the exome intervals using the UCSC Table Browser: http://genome.ucsc.edu/cgi-bin/hgTables?command=start

$ bwa aln -t 4 hg19.fa seq1.fastq > 1.sai
$ bwa aln -t 4 hg19.fa seq2.fastq > 2.sai
$ bwa sampe -r "@RG\tID:exomeID\tLB:exomeLB\tSM:exomeSM\tPL:illumina\tPU:exomePU" hg19.fa 1.sai 2.sai seq1.fastq seq2.fastq > original.sam

$ java -Xmx5g -jar FixMateInformation.jar I=original.sam O=fixed.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT
$ java -Xmx5g -jar SortSam.jar I=fixed.sam SO=coordinate O=first.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
$ java -Xmx5g -jar MarkDuplicates.jar I=first.bam O=marked.bam METRICS_FILE=metricsFile CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true

$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T RealignerTargetCreator -R hg19.fa -o intervalsList -I marked.bam -known Mills_and_1000G_gold_standard.indels.hg19.vcf
$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T IndelRealigner -R hg19.fa -I marked.bam -targetIntervals intervalsList -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o realigned.bam
$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T CountCovariates -l INFO -R hg19.fa -I realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile recalFile -knownSites dbsnp_135.hg19.vcf
$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T TableRecalibration -R hg19.fa -I realigned.bam -o recalibrated.bam -recalFile recalFile
$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T UnifiedGenotyper -R hg19.fa -I recalibrated.bam -o resultSNPs.vcf -D dbsnp_135.hg19.vcf -metrics UniGenMetrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A DepthOfCoverage -A AlleleBalance -L exomes.bed

Note: When using this workflow, bear in mind that it will only call SNPs, not indels.
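If you want to check that for yourself, a rough sanity check is to count how many records in the output VCF look like SNPs. This is only a sketch: `classify_vcf` is a name invented here, and it uses the simple rule that a record is a SNP when REF and ALT are each a single base, so multi-allelic sites such as `G,T` are counted under "other".

```shell
#!/usr/bin/env bash
# classify_vcf: hypothetical helper that reads a VCF on stdin and counts
# records with single-base REF and ALT (SNPs) versus everything else.
classify_vcf() {
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        length($4) == 1 && length($5) == 1 { snp++; next }
        { other++ }
        END { printf "snps=%d other=%d\n", snp, other }'
}
```

For the workflow above, you would run it as `classify_vcf < resultSNPs.vcf`.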



Hi Neelam,

I think this deep catalog of human genetic variation from the 1000 Genomes analysis is very useful: http://www.1000genomes.org/analysis



Hi Neelam,

Please go through this paper "A fast and accurate SNP detection algorithm for next-generation sequencing data" http://www.nature.com/ncomms/journal/v3/n12/abs/ncomms2256.html

You can also call variants with the freebayes software: http://clavius.bc.edu/~erik/CSHL-advanced-sequencing/freebayes-tutorial.html




Hi Neelam,

There are several free software packages and pipelines for SNP calling. I suggest you read this beginner's guide to SNP calling from high-throughput DNA-sequencing data, http://www.ncbi.nlm.nih.gov/pubmed/22886560, and try an automatic analysis pipeline for next-generation sequencing data: http://www.ncbi.nlm.nih.gov/pubmed/24929521

dDocent: a RADseq, variant-calling pipeline designed for population genomics of non-model organisms https://peerj.com/articles/431/

snpqc – an R pipeline for quality control of Illumina SNP genotyping array data. http://onlinelibrary.wiley.com/doi/10.1111/age.12198/abstract;jsessionid=A3B89DD95DB7E06F361B0E7CB903F63F.f01t03

The basic variant-calling and annotation pipeline developed at the Victorian Life Sciences Computation Initiative (VLSCI), University of Melbourne. https://github.com/claresloggett/variant_calling_pipeline



Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing http://genomemedicine.com/content/5/3/28





Hi Neelam,

I prefer VarScan ( http://varscan.sourceforge.net/ ), a tool that detects variants (SNPs and indels) in next-generation sequencing data. VarScan now takes SAMtools pileup as input, so it’s compatible with most SAM-friendly short read aligners. It supports SNP, indel, and consensus calling: in addition to detecting variants, VarScan calls consensus genotypes based on read counts and allele frequency. For more information, see http://www.ncbi.nlm.nih.gov/pubmed/22300766
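To give a feel for what calling a genotype from read counts and allele frequency means, here is a toy sketch. It is not VarScan's actual algorithm: the function name `call_genotype` and the default cutoffs (0.20 to report a variant, 0.75 to call it homozygous) are assumptions chosen for illustration, so check the VarScan manual for the real options and defaults.

```shell
#!/usr/bin/env bash
# call_genotype REF_READS ALT_READS [MIN_VAR_FREQ] [MIN_HOM_FREQ]
# Hypothetical helper: computes the variant allele frequency from read
# counts and prints a call plus the frequency, separated by a tab.
call_genotype() {
    awk -v r="$1" -v a="$2" -v lo="${3:-0.20}" -v hi="${4:-0.75}" 'BEGIN {
        total = r + a
        if (total == 0) { print "no_coverage"; exit }
        freq = a / total                    # variant allele frequency
        if (freq >= hi)      call = "hom_alt"
        else if (freq >= lo) call = "het"
        else                 call = "hom_ref"
        printf "%s\t%.2f\n", call, freq
    }'
}

# Example: 12 reads support the reference allele, 8 support the variant.
call_genotype 12 8
```

Real callers also weigh base quality, strand bias, and minimum coverage, which this sketch ignores.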




Hi Neelam,

I found this tutorial very useful https://wikis.utexas.edu/display/bioiteam/Variant+calling+tutorial



Hi Neelam,

I think Virmid (Virtual Microdissection for SNP calling) will be useful for you. It is a Java-based variant caller designed for disease-control matched samples. Virmid is also specialized for identifying potential within-individual contamination where the disease sample cannot be purified enough. While the SNP calling rate is severely compromised by this heterogeneity, Virmid can uncover SNPs with low allele frequency by considering the level of contamination (alpha). http://sourceforge.net/p/virmid/wiki/Home/
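A quick bit of arithmetic shows why contamination lowers the observed allele frequency. This is only a back-of-the-envelope sketch, not Virmid's model: it assumes alpha is the fraction of control cells mixed into the disease sample, a diploid genome, and a heterozygous variant present only in the disease cells, so the expected variant allele frequency is (1 - alpha) / 2. The function name `expected_vaf` is made up.

```shell
#!/usr/bin/env bash
# expected_vaf ALPHA
# Hypothetical helper: expected variant allele frequency of a heterozygous,
# disease-only variant when a fraction ALPHA of the cells are control cells.
expected_vaf() {
    awk -v alpha="$1" 'BEGIN { printf "%.3f\n", (1 - alpha) * 0.5 }'
}

# With 60% control cells, the variant is expected in about 20% of reads.
expected_vaf 0.6
```

So a caller that expects roughly 50% for a heterozygous site would tend to miss such variants unless it accounts for alpha.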


FermiKit: assembly-based variant calling for Illumina resequencing data https://github.com/lh3/fermikit


The discoSnp++ software is designed to discover single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) from raw set(s) of reads obtained with next-generation sequencers (NGS).
Note that the number of input read sets is not constrained; it can be one, two, or more. Note also that no other data, such as a reference genome or annotations, are needed.
The software is composed of two modules. The first module, kissnp2, detects SNPs from read sets. The second module, kissreads2, enhances the kissnp2 results by computing, per read set and for each variant found, (i) its mean read coverage and (ii) the (Phred) quality of the reads generating the polymorphism. https://colibread.inria.fr/software/discosnp/