BOL: Question: What are the best-known workflows or tutorials for SNP calling?

Neel
3956 days ago

Question: What are the best-known workflows or tutorials for SNP calling?

Answers

Hi Neelam,

Extracting single nucleotide polymorphisms (SNPs) from raw sequence data involves many processing steps and a diverse set of tools. The pipeline includes quality control, mapping of short reads to the reference genome, and visualization and post-processing of the alignment, including base quality recalibration. The following links cover useful pipelines for SNP calling:

A simple SNP calling pipeline:

http://www.tgac.ac.uk/Event%20Docs/Summer%20School:%20Walk%20Through%20BioInf/NGS%20Challenges.pdf

A beginners guide to SNP calling from high-throughput DNA-sequencing data.

http://www.ncbi.nlm.nih.gov/pubmed/22886560

SNP Calling Workshop

http://www.ebi.ac.uk/training/sites/ebi.ac.uk.training/files/materials/2014/140217_AgriOmics/dan_bolser_snp_calling_tutorial.pdf

Calling SNPs with Samtools

http://ged.msu.edu/angus/tutorials-2012/snp_tutorial.html

Variant Callers for Next-Generation Sequencing Data: A Comparison Study

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0075619

GotCloud: Variant Calling Pipeline

http://genome.sph.umich.edu/wiki/GotCloud:_Variant_Calling_Pipeline

ngs_backbone: a pipeline for read cleaning, mapping and SNP calling using Next Generation Sequence

http://www.biomedcentral.com/1471-2164/12/285

iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data

http://www.biomedcentral.com/1752-0509/7/S6/S8

QualitySNPng: a user-friendly SNP detection and visualization tool

http://nar.oxfordjournals.org/content/early/2013/04/29/nar.gkt333.full

Calling variants using BWA and GATK best practice pipeline

http://varianttools.sourceforge.net/Calling/BwaGatkHg19

SNP calling pipeline

http://www.bbmriwiki.nl/wiki/SnpCallingPipeline

dDocent: a RADseq, variant-calling pipeline designed for population genomics of non-model organisms

https://peerj.com/preprints/314/

Pipeline for SNP Analysis

http://sniplay.cirad.fr/cgi-bin/analysis.cgi

UGP Variant Pipeline 0.0.3

http://weatherby.genetics.utah.edu/UGP/wiki/index.php/UGP_Variant_Pipeline_0.0.3

SNPs Calling

https://code.google.com/p/rseqflow/wiki/PipelineDescription#SNPs_Calling

Thanks

Jit 3955 days ago

Hi Neelam,

There are several workflows, but you might find this SNP pipeline useful. It uses GATK for SNP calling and currently starts from a BWA alignment.

In a nutshell, the flow is: realign the BAM file using GATK -> call SNPs using GATK -> call indels -> filter the resulting VCF files -> annotate the called and filtered SNPs.
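To make the filtering step a bit more concrete, here is a minimal sketch of a hard filter on the QUAL column of a VCF. It is not the filter used by the pipeline mentioned above: the function names, the threshold of 50, and the tiny demo records are all made up for illustration, and a real pipeline would normally use a dedicated tool.

```shell
#!/bin/sh
# Hypothetical helper: reads a VCF on standard input, keeps header lines
# and records whose QUAL (column 6) is at least the given threshold.
# The threshold is an assumption, not a value recommended in this thread.
filter_vcf_by_qual() {
    # $1 = minimum QUAL
    awk -F'\t' -v min="$1" '/^#/ { print; next } $6 != "." && $6 + 0 >= min'
}

# Made-up VCF records, used only to demonstrate the filter.
demo_vcf() {
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t75.2\t.\tDP=30\n'
    printf 'chr1\t200\t.\tC\tT\t12.0\t.\tDP=8\n'
    printf 'chr2\t300\t.\tG\tA\t50.0\t.\tDP=22\n'
}

# Prints the two header lines plus the records at chr1:100 and chr2:300.
demo_vcf | filter_vcf_by_qual 50
```

On real data you would pipe your own file in, for example `filter_vcf_by_qual 50 < calls.vcf > filtered.vcf`, where both file names are placeholders.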

Thanks

Poonam Mahapatra 3955 days ago

Hi Poonam,

The various aspects of SNP calling are covered in this recent paper, http://www.ncbi.nlm.nih.gov/pubmed/21478889, entitled "A framework for variation discovery and genotyping using next-generation DNA sequencing data", from the authors of GATK. In addition, keep an eye on the software manual at http://www.broadinstitute.org/gatk/ for up-to-date options incorporated in the toolkit.

Thanks

Abhimanyu Singh 3955 days ago

Hi Neelam,

This is my GATK workflow for paired-end Illumina data. SNP calling uses the following steps:

Download the SNP and indel databases from ftp://gsapubftp-anonymous@ftp.broadinstitute.org (bundle -> 1.5 -> hg19)

Get the exome intervals using the UCSC Table Browser http://genome.ucsc.edu/cgi-bin/hgTables?command=start

$ bwa aln -t 4 hg19.fa seq1.fastq > 1.sai
$ bwa aln -t 4 hg19.fa seq2.fastq > 2.sai
$ bwa sampe -r "@RG\tID:exomeID\tLB:exomeLB\tSM:exomeSM\tPL:illumina\tPU:exomePU" hg19.fa 1.sai 2.sai seq1.fastq seq2.fastq > original.sam

$ java -Xmx5g -jar FixMateInformation.jar I=original.sam O=fixed.sam SO=coordinate VALIDATION_STRINGENCY=LENIENT
$ java -Xmx5g -jar SortSam.jar I=fixed.sam SO=coordinate O=first.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
$ java -Xmx5g -jar MarkDuplicates.jar I=first.bam O=marked.bam METRICS_FILE=metricsFile CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true

$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T RealignerTargetCreator -R hg19.fa -o intervalsList -I marked.bam -known Mills_and_1000G_gold_standard.indels.hg19.vcf
$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T IndelRealigner -R hg19.fa -I marked.bam -targetIntervals intervalsList -known Mills_and_1000G_gold_standard.indels.hg19.vcf -o realigned.bam
$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T CountCovariates -l INFO -R hg19.fa -I realigned.bam -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate -recalFile recalFile -knownSites dbsnp_135.hg19.vcf
$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T TableRecalibration -R hg19.fa -I realigned.bam -o recalibrated.bam -recalFile recalFile
$ java -Xmx5g -jar GenomeAnalysisTK.jar -nt 4 -T UnifiedGenotyper -R hg19.fa -I recalibrated.bam -o resultSNPs.vcf -D dbsnp_135.hg19.vcf -metrics UniGenMetrics -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 1000 -A DepthOfCoverage -A AlleleBalance -L exomes.bed

Note: Please bear in mind that this workflow will only call SNPs, not indels.
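If you want to check that a result file really contains only SNPs, a rough sketch like the one below counts records by the length of their REF and ALT alleles. The function name and the demo records are hypothetical, and anything that is not a single-base substitution (indels, multi-base substitutions, symbolic alleles) is simply lumped into OTHER. As far as I know, UnifiedGenotyper has a `-glm` option (SNP, INDEL or BOTH) that controls what gets called, but please confirm that against the documentation for your GATK version.

```shell
#!/bin/sh
# Hypothetical helper: reads a VCF on standard input and counts records
# where REF and every ALT allele are exactly one base long (SNP) versus
# everything else (OTHER). This is a simplification, not a full classifier.
count_variant_types() {
    awk -F'\t' '
        /^#/ { next }
        {
            n = split($5, alts, ",")
            kind = "SNP"
            if (length($4) != 1) kind = "OTHER"
            for (i = 1; i <= n; i++)
                if (length(alts[i]) != 1) kind = "OTHER"
            count[kind]++
        }
        END { printf "SNP=%d OTHER=%d\n", count["SNP"], count["OTHER"] }
    '
}

# Made-up records: two SNPs (one multi-allelic) and one deletion.
demo_calls() {
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t100\t.\tA\tG\t75.2\t.\t.\n'
    printf 'chr1\t200\t.\tCT\tC\t60.0\t.\t.\n'
    printf 'chr2\t300\t.\tG\tA,T\t50.0\t.\t.\n'
}

demo_calls | count_variant_types
```

For the workflow above you would run it as `count_variant_types < resultSNPs.vcf` and expect OTHER to be zero.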

Thanks

John Parker 3953 days ago

Hi Neelam,

I think the 1000 Genomes analysis, a deep catalog of human genetic variation, is very useful: http://www.1000genomes.org/analysis

Thanks

Surabhi Chaudhary 3953 days ago

Hi Neelam,

Please go through this paper "A fast and accurate SNP detection algorithm for next-generation sequencing data" http://www.nature.com/ncomms/journal/v3/n12/abs/ncomms2256.html

You can also call variants with the freebayes software: http://clavius.bc.edu/~erik/CSHL-advanced-sequencing/freebayes-tutorial.html

https://github.com/ekg/freebayes

Thanks

Suleman Khan 3945 days ago

Hi Neelam,

There are several free software packages and pipelines for SNP calling. I suggest you read this beginner's guide to SNP calling from high-throughput DNA-sequencing data, http://www.ncbi.nlm.nih.gov/pubmed/22886560, and try an automatic analysis pipeline for next-generation sequencing data, http://www.ncbi.nlm.nih.gov/pubmed/24929521

dDocent: a RADseq, variant-calling pipeline designed for population genomics of non-model organisms https://peerj.com/articles/431/

snpqc – an R pipeline for quality control of Illumina SNP genotyping array data. http://onlinelibrary.wiley.com/doi/10.1111/age.12198/abstract;jsessionid=A3B89DD95DB7E06F361B0E7CB903F63F.f01t03

The basic variant-calling and annotation pipeline developed at the Victorian Life Sciences Computation Initiative (VLSCI), University of Melbourne. https://github.com/claresloggett/variant_calling_pipeline

http://compbio.ufl.edu/wp-content/uploads/2014/02/Azarian_Bioinfo_Seminar_013914.pdf

http://ngsda.blogspot.in/2010/10/snp-call-pipeline.html

Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing http://genomemedicine.com/content/5/3/28

http://www.biotech.cornell.edu/brc/genomic-diversity-facility/services/gbs-data-analysis

Thanks

Radha Agarkar 3941 days ago

Hi Neelam,

I prefer VarScan ( http://varscan.sourceforge.net/ ), a tool that detects variants (SNPs and indels) in next-generation sequencing data. VarScan takes SAMtools pileup as input, so it is compatible with most SAM-friendly short read aligners. It does SNP, indel, and consensus calling: in addition to detecting variants, VarScan calls consensus genotypes based on read counts and allele frequency. For more information, see http://www.ncbi.nlm.nih.gov/pubmed/22300766

http://massgenomics.org/varscan
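To give a feel for what calling a genotype "based on read counts and allele frequency" means, here is a toy sketch. It is not VarScan's code or its statistical test: the function name, the input layout (position, reference-supporting reads, variant-supporting reads) and the three thresholds are assumptions chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads tab-separated lines of
#   position, reference-supporting reads, variant-supporting reads
# and makes a naive call from coverage and variant allele frequency.
# All three thresholds below are illustrative assumptions.
call_from_counts() {
    awk -F'\t' -v min_cov=8 -v min_freq=0.2 -v hom_freq=0.75 '
        {
            cov = $2 + $3
            if (cov < min_cov) { printf "%s\tlow_coverage\n", $1; next }
            freq = $3 / cov
            if (freq < min_freq)      call = "reference"
            else if (freq < hom_freq) call = "heterozygous"
            else                      call = "homozygous_variant"
            printf "%s\t%.2f\t%s\n", $1, freq, call
        }
    '
}

# Made-up read counts for four positions.
demo_counts() {
    printf '100\t10\t10\n'
    printf '200\t1\t19\n'
    printf '300\t19\t1\n'
    printf '400\t2\t3\n'
}

demo_counts | call_from_counts
```

With these made-up counts, position 100 comes out heterozygous, 200 homozygous variant, 300 reference, and 400 is skipped for low coverage. A real caller also weighs base quality, strand, and a significance test, which this sketch leaves out.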

Cheers

Martin Jones 3941 days ago

Hi Neelam,

I found this tutorial very useful https://wikis.utexas.edu/display/bioiteam/Variant+calling+tutorial

Thanks

Pragati Singh 3936 days ago

Hi Neelam,

I think Virmid (Virtual Microdissection for SNP calling) will be useful for you. It is a Java-based variant caller designed for disease-control matched samples. Virmid is also specialized for identifying potential within-individual contamination, where the disease sample cannot be purified enough. While the SNP calling rate is severely compromised by this heterogeneity, Virmid can uncover SNPs with low allele frequency by considering the level of contamination (alpha). http://sourceforge.net/p/virmid/wiki/Home/
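A quick back-of-the-envelope sketch shows why contamination pushes allele frequencies down. This is not Virmid's model, only simple arithmetic under stated assumptions: alpha is taken to be the fraction of the disease sample that actually comes from control cells, the variant is heterozygous in the disease cells and absent from the control cells, and both are diploid. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: expected fraction of variant-supporting reads for a
# heterozygous, disease-only variant when a fraction alpha of the sample
# comes from control cells. Assumes diploid cells and unbiased sequencing.
expected_variant_fraction() {
    # $1 = alpha, between 0 and 1
    awk -v alpha="$1" 'BEGIN { printf "%.2f\n", 0.5 * (1 - alpha) }'
}

for alpha in 0.0 0.3 0.6 0.9; do
    printf 'alpha=%s expected_fraction=%s\n' \
        "$alpha" "$(expected_variant_fraction "$alpha")"
done
```

So under these assumptions a sample that is 60% control cells would show such a variant in only about 20% of reads, which a caller expecting roughly 50% could easily miss.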

FermiKit: assembly-based variant calling for Illumina resequencing data https://github.com/lh3/fermikit

The discoSnp++ software is designed for discovering single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) from raw set(s) of reads obtained with next-generation sequencers (NGS).
Note that the number of input read sets is not constrained: it can be one, two, or more. Note also that no other data, such as a reference genome or annotations, are needed.
The software is composed of two modules. The first module, kissnp2, detects SNPs from read sets. The second module, kissreads2, enhances the kissnp2 results by computing, per read set and for each variant found, (i) its mean read coverage and (ii) the (phred) quality of the reads generating the polymorphism. https://colibread.inria.fr/software/discosnp/

Cheers

Jit 3567 days ago

You should try the following as well:

Genome-Wide Association Studies

Variant Calling Pipeline: FastQ to Annotated SNPs in Hours

A simple SNP calling pipeline

Hands-on Tutorial on SNP Calling

Genome Analysis Toolkit

Jit 2692 days ago