BOL: Related items

nQuire: a statistical framework for ploidy estimation using next generation sequencing

Jit — Thu, 04 Oct 2018 05:23:59 -0500

nQuire provides a statistical framework to study organisms with intraspecific variation in ploidy. nQuire is likely to be useful in epidemiological studies of pathogens, artificial selection experiments, and for historical or ancient samples where intact nuclei are not preserved. It is implemented as a stand-alone Linux command line tool in the C programming language and is available at https://github.com/clwgg/nQuire under the MIT license.

Address of the bookmark: https://github.com/clwgg/nQuire

NanoPack: visualizing and processing long-read sequencing data

Jit — Tue, 25 Dec 2018 21:20:50 -0600

The NanoPack tools are written in Python3 and released under the GNU GPL3.0 License. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools.

Address of the bookmark: https://github.com/wdecoster/nanopack

jackalope: A swift, versatile phylogenomic and high-throughput sequencing simulator

Abhimanyu Singh — Fri, 26 Jul 2019 00:58:12 -0500

jackalope simply and efficiently simulates (i) variants from reference genomes and (ii) reads from both Illumina and Pacific Biosciences (PacBio) platforms. It can either read reference genomes from FASTA files or simulate new ones. Genomic variants can be simulated using summary statistics, phylogenies, Variant Call Format (VCF) files, and coalescent simulations, the latter of which can include selection, recombination, and demographic fluctuations. jackalope can simulate single, paired-end, or mate-pair Illumina reads, as well as reads from Pacific Biosciences. These simulations include sequencing errors, mapping qualities, multiplexing, and optical/PCR duplicates. All outputs can be written to standard file formats.

https://jackalope.lucasnell.com

Address of the bookmark: https://github.com/lucasnell/jackalope

DeepVariant: an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data

Jit — Sat, 25 Jan 2020 13:28:09 -0600

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. DeepVariant relies on Nucleus, a library of Python and C++ code for reading and writing data in common genomics file formats (like SAM and VCF) designed for painless integration with the TensorFlow machine learning framework.

https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html

https://www.biorxiv.org/content/10.1101/092890v6

Address of the bookmark: https://github.com/google/deepvariant

McClintock: Meta-pipeline to identify transposable element insertions using next generation sequencing data

BioStar — Tue, 27 Oct 2020 00:21:18 -0500

McClintock (https://github.com/bergmanlab/mcclintock) is an integrated bioinformatics pipeline for the detection of transposable element (TE) insertions in whole-genome shotgun data. It automatically runs multiple TE detection methods and standardizes their output. The authors demonstrate the utility of McClintock by evaluating six TE detection methods using simulated and real genome data from the model microbial eukaryote, Saccharomyces cerevisiae.

Address of the bookmark: https://github.com/bergmanlab/mcclintock

Scalpel

Shruti Paniwala — Wed, 20 Aug 2014 02:07:58 -0500

A team from Cold Spring Harbor Laboratory has released an algorithm, called Scalpel, for finding insertions and deletions in next generation sequencing data sets. Scalpel, which is open source and available for download on SourceForge, outperformed the popular tools GATK HaplotypeCaller and SOAPindel in test runs on both simulated and real whole human exomes.

Like other indel callers, Scalpel works by performing de novo assembly of regions of interest, so that misalignment to the reference genome cannot obscure the presence of an insertion or deletion. Scalpel's innovation is to repeatedly check its assembly before comparing to the reference genome, to account for simple sequence repeats that are a regular source of error in indel calling. When Scalpel assembles an exon, it collects reads that map to that exon (including partial matches), splits them into k-mers, and creates a de Bruijn graph to span the exon; however, if it detects repeats in the map, it iteratively increases the size of the k-mers by one base until the repeats are eliminated. This ensures that the final assembly of the exon is highly accurate while minimizing compute time.
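
The k-mer growth step described above can be illustrated with a toy sketch. This is not Scalpel's code: the function names, the example reads, and the repeat test (a k-mer occurring more than once within a single read) are assumptions made for illustration, and the real tool's repeat detection is likely more involved.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def has_repeat(seq, k):
    """Assumed repeat test: True if any k-mer occurs more than once in seq."""
    seen = set()
    for kmer in kmers(seq, k):
        if kmer in seen:
            return True
        seen.add(kmer)
    return False


def choose_k(reads, k_start, k_max):
    """Increase k by one base at a time until no read has a repeated k-mer."""
    for k in range(k_start, k_max + 1):
        if not any(has_repeat(read, k) for read in reads):
            return k
    return None  # no repeat-free k found within the allowed range


def build_graph(reads, k):
    """De Bruijn graph: nodes are (k-1)-mers, one edge per distinct k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Hypothetical reads covering one region; "ACGT" occurs twice in the first.
reads = ["ACGTACGTTAGC", "GTACGTTAGCCA"]
k = choose_k(reads, k_start=3, k_max=10)
graph = build_graph(reads, k)
print(k, dict(graph))
```

Starting from a small k and growing it only when needed keeps the graph as small as the repeat structure allows, which is one way to read the "minimizing compute time" remark.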

The Cold Spring Harbor team's validation of Scalpel, published over the weekend in Nature Methods, compares Scalpel's performance on a live whole exome against HaplotypeCaller and SOAPindel. The donor is an individual with serious neurological disorders, which may be linked to a high incidence of indels. One thousand indels from this individual's exome, called by one or more of the informatics pipelines, were selected for focused resequencing. This resequencing revealed a 77% true positive rate for Scalpel calls, dramatically better than the rates for either of the competing tools; Scalpel performed especially well with indels longer than five base pairs, a traditional weak point for indel callers.

Finally, the authors demonstrate Scalpel's use on a large set of genetic data from nearly 600 families who donated samples to the Simons Simplex Collection, a project of the Simons Foundation Autism Research Initiative. Scalpel found a very high enrichment for indels in children affected by autism, compared with their unaffected siblings, a pattern that persisted even after excluding common variants.

Release Notes for Genome Workbench 2.10.5

Jit — Thu, 12 May 2016 13:49:41 -0500

New Features in latest release

New ProSplign tool integrated with Genome Workbench (Tutorial, Video)
New export function for BAM/cSRA coverage graphs (Tutorial)
New export function for alignments in GFF3 format (Tutorial)
Tree View: implemented new export mode based on selections (tutorial coming)
Tree View: added support for distance based circular trees
Tree View: new rooting mode (Midpoint Root) results in more balanced trees (Tutorial)
Tree View: added possibility to right-click on an edge between two nodes and "Place Root at Middle of Branch" – to re-root at mid-branch (Tutorial)

GKNO

Neel — Fri, 20 May 2016 18:56:37 -0500

gkno opens the world of complex bioinformatic analysis to people of all levels of computational expertise. This site contains documentation, tutorials and information on all the tools that comprise gkno.

http://gkno.me/how-to/install.html

http://gkno.me/software.html

Address of the bookmark: http://gkno.me/

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)

Abhimanyu Singh — Thu, 29 Dec 2016 03:26:45 -0600

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. Key features of Prodigal include:

Speed: Prodigal is an extremely fast gene recognition tool (written in very vanilla C). It can analyze an entire microbial genome in 30 seconds or less.
Accuracy: Prodigal is a highly accurate gene finder. It correctly locates the 3' end of every gene in the experimentally verified Ecogene data set (except those containing introns). It possesses a very sophisticated ribosomal binding site scoring system that enables it to locate the translation initiation site with great accuracy (96% of the 5' ends in the Ecogene data set are located correctly).
Specificity: Prodigal's false positive rate compares favorably with other gene identification programs, and usually falls under 5%.
GC-Content Indifferent: Prodigal performs well even in high GC genomes, with over a 90% perfect match (5'+3') to the Pseudomonas aeruginosa curated annotations.
Metagenomic Version: Prodigal can run in metagenomic mode and analyze sequences even when the organism is unknown.
Ease of Use: Prodigal can be run in one step on a single genomic sequence or on a draft genome containing many sequences. It does not need to be supplied with any knowledge of the organism, as it learns all the properties it needs to on its own.
Open Source: Prodigal source code is freely available under the General Public License.

Download the latest version of Prodigal at the Prodigal GitHub page, or browse the wiki documentation.

Address of the bookmark: http://prodigal.ornl.gov/

BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

Poonam Mahapatra — Wed, 03 Jan 2018 00:25:27 -0600

BBSplit internally uses BBMap to map reads to multiple genomes at once and determine which genome they match best. This differs from ordinary mapping. If a genome (say, human) contains an exact repeat somewhere, reads mapping to it will be mapped ambiguously. But if you want to determine whether reads are mouse or human, it does not matter whether they map ambiguously within human, only whether they are ambiguous between human and mouse. BBSplit tracks this additional ambiguity information and decides how to use it based on the “ambig2” flag. The normal use of BBSplit is like Seal: either quantifying how many reads go to each reference, or splitting the reads into multiple output files, one per reference. BBSplit can only be run using references indexed with BBSplit, as they contain additional information regarding which sequences came from which reference file.

BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special "ambiguous" file for each of them. Paired reads will always be kept together.
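
The binning rule described above can be sketched as a toy decision function. This is not BBSplit's implementation: the score values, the "unmapped" and "ambiguous_" bin names, and the policy names ("all", "none", "first", "ambiguous") are hypothetical labels that mirror the four outcomes listed in the text, and they are not actual BBSplit flag values.

```python
def assign_bins(scores, policy="ambiguous"):
    """Return the list of bin names one read (or read pair) is written to.

    scores: dict mapping reference name -> mapping score (higher is better).
    policy: hypothetical tie policy used when several references tie for best.
    """
    if not scores:
        return ["unmapped"]
    best = max(scores.values())
    winners = sorted(name for name, s in scores.items() if s == best)
    if len(winners) == 1:
        return winners  # unambiguous between references
    if policy == "all":
        return winners  # binned with every tied reference
    if policy == "none":
        return []  # discarded
    if policy == "first":
        return winners[:1]  # binned with exactly one of them
    if policy == "ambiguous":
        # one special "ambiguous" bin per tied reference
        return ["ambiguous_" + name for name in winners]
    raise ValueError("unknown policy: " + policy)


print(assign_bins({"human": 90, "mouse": 70}))
print(assign_bins({"human": 90, "mouse": 90}, policy="all"))
print(assign_bins({"human": 90, "mouse": 90}))
```

Note that only ties between references count as ambiguous here; a read that maps to several places within one reference still has a single best reference.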

For example, if you had a library that was contaminated with E. coli and Salmonella, you could do this:

bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t

This will produce 3 output files:
out_ecoli.fq (ecoli reads)
out_salmonella.fq (salmonella reads)
clean.fq (unmapped reads)

In this case, "int=t" means that the input file is paired and interleaved. For single-end reads you would leave that out. For paired reads in 2 files, you would do this:
bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq
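
When many libraries need the same treatment, the two command forms above can be generated from a script. The helper below is a minimal sketch: bbsplit_command is a hypothetical name, the output file names are copied from the examples above, and only flags that appear in those examples are used. It builds the argument list without running anything.

```python
import shlex


def bbsplit_command(refs, reads, reads2=None, interleaved=False):
    """Build a bbsplit.sh argument list like the examples above.

    refs: list of reference FASTA paths.
    reads: single-end or interleaved FASTQ, or the first mate file.
    reads2: second mate file for paired reads in two files, else None.
    interleaved: add int=t for a single paired, interleaved input file.
    """
    cmd = ["bbsplit.sh"]
    if reads2 is None:
        cmd.append("in=" + reads)
    else:
        cmd += ["in1=" + reads, "in2=" + reads2]
    cmd.append("ref=" + ",".join(refs))
    cmd.append("basename=out_%.fq")
    if reads2 is None:
        cmd.append("outu=clean.fq")
        if interleaved:
            cmd.append("int=t")
    else:
        cmd += ["outu1=clean1.fq", "outu2=clean2.fq"]
    return cmd


refs = ["ecoli.fa", "salmonella.fa"]
single = bbsplit_command(refs, "reads.fq", interleaved=True)
paired = bbsplit_command(refs, "reads1.fq", reads2="reads2.fq")
print(shlex.join(single))
print(shlex.join(paired))
```

The returned list can be passed to subprocess.run on a machine where bbsplit.sh is on the PATH.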

BBSplit is available here:
https://sourceforge.net/projects/bbmap/

The sensitivity can be raised to be equivalent to BBMap with these flags: "minratio=0.56 minhits=1 maxindel=16000"