BOL: Related items

Seal: SEquence ALignment evaluation suite

Jit — Wed, 03 Jan 2018 05:05:46 -0600

Seal is a comprehensive sequencing simulation and alignment tool evaluation suite. This software (implemented in Java) provides several utilities that can be used to evaluate alignment algorithms, including:

Reading a pre-existing reference genome from one or more FASTA files.
Alternatively, generating an artificial reference genome based on input parameters (length, repeat count, repeat length, repeat variability rate).
Simulating reads from random locations in the genome based on input parameters of read length, coverage, sequencing error rate, and indel rate.
Applying alignment tools to the genome and the reads through a standardized interface.
Parsing the output of the alignment tool and calculating the number of reads that were correctly or incorrectly mapped.
Computing run times and measures of accuracy.
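
The simulate-then-score workflow listed above can be sketched in a few lines of Python. This is only an illustration, not Seal's Java implementation: the function names `simulate_reads` and `score_alignments` are hypothetical, reads are drawn uniformly, only substitution errors are modeled (indels are left out), and a read counts as correctly mapped only if the reported position equals its true start.

```python
import random

def simulate_reads(genome, read_length, coverage, error_rate, seed=0):
    """Sample reads uniformly from the genome and add substitution errors."""
    rng = random.Random(seed)
    # Number of reads needed to reach the requested average coverage.
    n_reads = (coverage * len(genome)) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        bases = list(genome[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with a different one (substitution only).
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        # Keep the true start so the aligner's answer can be checked later.
        reads.append((start, "".join(bases)))
    return reads

def score_alignments(true_positions, reported_positions):
    """Return (correctly mapped, incorrectly mapped) read counts."""
    correct = sum(1 for t, r in zip(true_positions, reported_positions) if t == r)
    return correct, len(true_positions) - correct
```

In practice the reported positions would come from parsing an aligner's output; here they are simply passed in as a list.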

Seal has interfaces to evaluate the following software packages:

Bowtie
BWA
MAQ
mrFAST
mrsFAST
Novoalign
SHRiMP
SOAPv2

Address of the bookmark: http://compbio.case.edu/seal/

Tools to Predict the Impact of Missense Variants

Jit — Mon, 23 Apr 2018 12:57:33 -0500

Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen‐2, SIFT, FatHMM, MutationTaster‐2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants.

A study of 10 tools on five datasets shows that such a comparative evaluation of these tools is hindered by two types of circularity. These arise when (1) the same variants or (2) different variants from the same protein occur both in the datasets used for training and in those used for evaluation, which may lead to overly optimistic results. Comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity-confounded tools are the most accurate among all tools, and may even outperform optimized combinations of tools.
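
The two kinds of circularity can be checked for before an evaluation is run. The sketch below is a hypothetical helper, not code from the study: it assumes each variant is represented as a `(protein_id, variant)` tuple and splits an evaluation set into variants also seen in training (type 1), new variants in proteins seen in training (type 2), and variants free of both overlaps.

```python
def circularity_overlap(training, evaluation):
    """Partition evaluation variants by their overlap with the training set.

    training, evaluation: iterables of (protein_id, variant) tuples.
    Returns (type1, type2, clean) lists of evaluation variants.
    """
    train_variants = set(training)
    train_proteins = {protein for protein, _ in train_variants}
    # Type 1: the identical variant appears in both sets.
    type1 = [v for v in evaluation if v in train_variants]
    # Type 2: a different variant, but from a protein seen in training.
    type2 = [v for v in evaluation
             if v not in train_variants and v[0] in train_proteins]
    # Clean: the protein does not occur in the training data at all.
    clean = [v for v in evaluation if v[0] not in train_proteins]
    return type1, type2, clean
```

Scoring a tool only on the `clean` subset is one way to reduce the optimistic bias described above, at the cost of a smaller evaluation set.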

The following tools are useful for predicting the impact of missense mutations:

PolyPhen‐2 (PP2)
“Predicts possible impact of an amino acid substitution on the structure and function of a human protein using straightforward physical and comparative considerations”

MutationTaster‐2 (MT2)
“Evaluation of the disease‐causing potential of DNA sequence alterations”

MutationAssessor (MASS)
“Predicts the functional impact of amino acid substitutions in proteins, such as mutations discovered in cancer or missense polymorphisms”

LRT
“Identify a subset of deleterious mutations that disrupt highly conserved amino acids within protein‐coding sequences, which are likely to be unconditionally deleterious”

SIFT
“Predicts whether an amino acid substitution affects protein function”

GERP++
“Identifies constrained elements in multiple alignments by quantifying substitution deficits. These deficits represent substitutions that would have occurred if the element were neutral DNA, but did not occur because the element has been under functional constraint. We refer to these deficits as “rejected substitutions.” Rejected substitutions are a natural measure of constraint that reflects the strength of past purifying selection on the element”

phyloP
“Compute conservation or acceleration P values based on an alignment and a model of neutral evolution”

FatHMM unweighted (FatHMM‐U)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants”

FatHMM weighted (FatHMM‐W)
Predicts “functional consequences of both coding variants, that is, nonsynonymous single‐nucleotide variants, and noncoding variants” and its weighting scheme attributes higher tolerance scores to SNVs in proteins, related proteins, or domains that already include a high fraction of pathogenic variants.

Combined Annotation Dependent Depletion (CADD)
“CADD is a tool for scoring the deleteriousness of single‐nucleotide variants as well as insertion/deletions variants in the human genome”

AfterQC: Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data

Jit — Fri, 29 Jun 2018 03:26:03 -0500

AfterQC can simply go through all fastq files in a folder and then output three folders: good, bad and QC, which contain the good reads, the bad reads and the QC results of each fastq file/pair. Currently it supports processing data from HiSeq 2000/2500/3000/4000, NextSeq 500/550, MiniSeq and other Illumina 1.8 or newer formats.

The author has reimplemented this tool in C++ with multithreading support to make it much faster. The new tool is called fastp and can be found at https://github.com/OpenGene/fastp. If you prefer a C++ based tool, please use fastp instead.
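
To make the good/bad split concrete, here is a minimal Python sketch of one possible filter: reads are separated by mean Phred quality, assuming the Phred+33 encoding used by Illumina 1.8 and newer. The function names and the threshold of 20 are assumptions for illustration; AfterQC's own filtering criteria are not described in this entry and are likely more involved.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    if not quality_string:
        return 0.0
    return sum(ord(c) - offset for c in quality_string) / len(quality_string)

def split_reads(fastq_lines, min_mean_quality=20.0):
    """Split 4-line FASTQ records into (good, bad) lists by mean quality."""
    good, bad = [], []
    # Step through the file four lines at a time: header, bases, '+', qualities.
    for i in range(0, len(fastq_lines) - 3, 4):
        header, sequence, _separator, quality = fastq_lines[i:i + 4]
        record = (header, sequence, quality)
        if mean_phred(quality) >= min_mean_quality:
            good.append(record)
        else:
            bad.append(record)
    return good, bad
```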

Address of the bookmark: https://github.com/OpenGene/AfterQC

MITOS: improved de novo metazoan mitochondrial genome annotation

Jit — Fri, 26 Oct 2018 08:25:39 -0500

Allows automatic annotation of metazoan mitochondrial genomes. MITOS is a pipeline designed to compute a consistent de novo annotation of the mitogenomic sequences. The software allows for a systematic error screening, the standardisation of gene name and gene boundary designation, anticodon labelling of tRNAs, and provides the means for the assessment of the validity of a gene assignment.

Address of the bookmark: http://mitos.bioinf.uni-leipzig.de/index.py

kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity

Rahul Nayak — Tue, 29 May 2018 08:37:53 -0500

The k-mer Weighted Inner Product.

This software implements a de novo, alignment free measure of sample genetic dissimilarity which operates upon raw sequencing reads. It is able to calculate the genetic dissimilarity between samples without any reference genome, and without assembling one.
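
The idea of a weighted inner product over k-mer counts can be sketched as follows. This is an illustration only, not kWIP's code: the helper names are hypothetical, and the weighting used here (Shannon entropy of how often a k-mer is present across samples) and the normalization to a distance are assumptions to be checked against the kWIP documentation.

```python
import math
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurring in a sample's reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def entropy_weights(samples):
    """Weight each k-mer by the binary entropy of its presence frequency."""
    n = len(samples)
    presence = Counter()
    for counts in samples:
        presence.update(counts.keys())
    weights = {}
    for kmer, seen in presence.items():
        p = seen / n
        # A k-mer present in every sample carries no information: weight 0.
        weights[kmer] = 0.0 if p >= 1.0 else -(
            p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return weights

def weighted_inner_product(a, b, weights):
    """Sum of weight * count_a * count_b over k-mers shared by both samples."""
    return sum(weights[k] * a[k] * b[k] for k in a.keys() & b.keys())

def distance(a, b, weights):
    """Distance derived from the normalized weighted inner product."""
    kaa = weighted_inner_product(a, a, weights)
    kbb = weighted_inner_product(b, b, weights)
    if kaa == 0 or kbb == 0:
        raise ValueError("sample has no informative k-mers")
    similarity = weighted_inner_product(a, b, weights) / math.sqrt(kaa * kbb)
    # Clamp to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 2.0 - 2.0 * similarity))
```

No reference genome or assembly appears anywhere in the computation: the inputs are raw reads only.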

De novo estimates of genetic relatedness from next-gen sequencing data. Documentation: https://kwip.readthedocs.org

Address of the bookmark: https://github.com/kdmurray91/kwip

Tallymer: method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

Jit — Thu, 15 Feb 2018 10:21:02 -0600

Tallymer is based on enhanced suffix arrays. This gives a much larger flexibility concerning the choice of the k-mer size. Tallymer can process large data sizes of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10⁹ bp).
Tallymer was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available.
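
Why a suffix array gives flexibility in the choice of k can be seen in a toy Python sketch. It is not Tallymer's implementation: it uses a plain suffix array built naively, with none of the extra tables of an enhanced suffix array, and the function names are hypothetical. The point is that the index is built once and k-mer frequencies for any k are then read off as runs of adjacent suffixes sharing the same first k characters.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (toy inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def kmer_frequencies(text, suffix_array, k):
    """Count k-mers by scanning runs of suffixes with an equal k-prefix."""
    runs = []
    previous = None
    for start in suffix_array:
        kmer = text[start:start + k]
        if len(kmer) < k:
            continue  # suffix too short to contain a full k-mer
        if kmer == previous:
            runs[-1][1] += 1  # same k-prefix as the previous suffix
        else:
            runs.append([kmer, 1])
            previous = kmer
    return {kmer: count for kmer, count in runs}
```

The same `suffix_array` can be reused with `k=2`, `k=4`, or any other size without rebuilding the index.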

A manual can be found here.

Address of the bookmark: https://www.zbh.uni-hamburg.de/forschung/arbeitsgruppe-genominformatik/software/tallymer.html

CheckM: Assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes

Rahul Nayak — Wed, 23 May 2018 04:39:26 -0500

CheckM provides a set of tools for assessing the quality of genomes recovered from isolates, single cells, or metagenomes. It provides robust estimates of genome completeness and contamination by using collocated sets of genes that are ubiquitous and single-copy within a phylogenetic lineage. Assessment of genome quality can also be examined using plots depicting key genomic characteristics (e.g., GC, coding density) which highlight sequences outside the expected distributions of a typical genome. CheckM also provides tools for identifying genome bins that are likely candidates for merging based on marker set compatibility, similarity in genomic characteristics, and proximity within a reference genome tree.
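
A simplified sketch of how completeness and contamination can be estimated from single-copy marker genes grouped into sets is given below. It is written under assumptions and is not CheckM's code: the function name and input layout are hypothetical, each marker set contributes the fraction of its genes found (completeness) and the number of extra copies per gene (contamination), and both are averaged over the sets.

```python
def completeness_contamination(marker_sets, gene_copies):
    """Estimate completeness and contamination as percentages.

    marker_sets: list of sets of marker gene identifiers.
    gene_copies: dict mapping marker gene identifier -> copies found in the genome.
    """
    completeness = 0.0
    contamination = 0.0
    for markers in marker_sets:
        # A marker counts as present if at least one copy was found.
        present = sum(1 for g in markers if gene_copies.get(g, 0) >= 1)
        # Every copy beyond the first is evidence of contamination.
        extra = sum(max(gene_copies.get(g, 0) - 1, 0) for g in markers)
        completeness += present / len(markers)
        contamination += extra / len(markers)
    n = len(marker_sets)
    return 100 * completeness / n, 100 * contamination / n
```

For example, with marker sets `{a, b}` and `{c}`, a genome holding one copy of `a` and two copies of `b` scores 50% completeness and 25% contamination under this scheme.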

Address of the bookmark: http://ecogenomics.github.io/CheckM/

Platypus: A Haplotype-Based Variant Caller For Next Generation Sequence Data

Shruti Paniwala — Thu, 25 Oct 2018 06:14:55 -0500

Platypus is a tool designed for efficient and accurate variant detection in high-throughput sequencing data. By using local realignment of reads and local assembly it achieves both high sensitivity and high specificity. Platypus can detect SNPs, MNPs, short indels, replacements and (using the assembly option) deletions up to several kb. It has been extensively tested on whole-genome, exon-capture, and targeted capture data. It has been run on very large datasets as part of the Thousand Genomes and WGS500 projects, and is being used in clinical sequencing trials in the Mainstreaming Cancer Genetics programme.

Tutorial: https://github.com/andyrimmer/Platypus/blob/master/misc/README.txt

Address of the bookmark: http://www.well.ox.ac.uk/platypus

Variant Calling - Resequencing-Based Genome Inference

Abhi — Wed, 31 Jul 2024 02:02:24 -0500

Erik Garrison
University of Tennessee Health Science Center
Workshop on Genomics - Český Krumlov
January 12, 2024

Address of the bookmark: https://evomics.org/wp-content/uploads/2024/01/Variant-calling-Workshop-on-Genomics-2024-Cesky-Krumlov.pdf

vcfR: a package to manipulate and visualize VCF data in R

Jit — Thu, 25 Oct 2018 09:05:59 -0500

vcfR is an R package intended to allow easy manipulation and visualization of variant call format (VCF) data. Functions are provided to rapidly read from and write to VCF files. Once VCF data is read into R, a parser function extracts matrices from the VCF data for use with typical R functions. This information can then be used for quality control or other purposes. Additional functions provide visualization of genomic data. Once processing is complete, data may be written to a VCF file or converted into other popular R objects (e.g., genlight, DNAbin). vcfR provides a link between VCF data and the R environment, connecting familiar software with genomic data.

Address of the bookmark: https://github.com/knausb/vcfR