BOL: Related items

ALE: a Generic Assembly Likelihood Evaluation Framework for Assessing the Accuracy of Genome and Metagenome Assemblies

Neel — Tue, 26 Apr 2016 03:38:43 -0500

The Assembly Likelihood Evaluation (ALE) framework systematically evaluates the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. The framework is comprehensive and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single-genome and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

More at http://www.ncbi.nlm.nih.gov/pubmed/23303509

Address of the bookmark: http://sc932.github.io/ALE/about.html

RASTtk: a modular and extensible implementation of the RAST algorithm for building custom annotation pipelines and annotating batches of genomes

Abhi — Wed, 27 Apr 2016 11:07:59 -0500

The RAST (Rapid Annotation using Subsystem Technology) annotation engine was built in 2008 to annotate bacterial and archaeal genomes. It works by offering a standard software pipeline for identifying genomic features (i.e., protein-encoding genes and RNA) and annotating their functions. Recently, in order to make RAST a more useful research tool and to keep pace with advancements in bioinformatics, it has become desirable to build a version of RAST that is both customizable and extensible. In this paper, we describe the RAST tool kit (RASTtk), a modular version of RAST that enables researchers to build custom annotation pipelines. RASTtk offers a choice of software for identifying and annotating genomic features as well as the ability to add custom features to an annotation job. RASTtk also accommodates the batch submission of genomes and the ability to customize annotation protocols for batch submissions. This is the first major software restructuring of RAST since its inception.

More at http://www.nature.com/articles/srep08365

Address of the bookmark: http://rast.nmpdr.org/

WgSim

Jit — Thu, 23 Jun 2016 07:26:49 -0500

Reads simulator

Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated by simulating INDEL polymorphisms.

Wgsim outputs the simulated polymorphisms, and writes the true read coordinates as well as the number of polymorphisms and sequencing errors in read names. One can evaluate the accuracy of a mapper or a SNP caller with wgsim_eval.pl that comes with the package.
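
Because the truth is stored in the read names, a short awk filter can usually recover it for an evaluation script. The sketch below assumes the name layout `<contig>_<start>_<end>_<e1:s1:i1>_<e2:s2:i2>_<counter>/<mate>`, where each colon-separated triple holds error, substitution and indel counts for one mate; that layout and the meaning of the triples are assumptions to verify against your wgsim version. The function name and the sample record are made up for illustration.

```shell
# Parse wgsim-style read names from a FASTQ stream into tab-separated fields:
#   contig, start, end, counts for mate 1, counts for mate 2
# Assumed name layout (verify against your wgsim version):
#   @<contig>_<start>_<end>_<e1:s1:i1>_<e2:s2:i2>_<hexcounter>/<mate>
parse_wgsim_names() {
  awk 'NR % 4 == 1 {
    name = substr($1, 2)            # drop the leading "@"
    sub(/\/[12]$/, "", name)        # drop the /1 or /2 mate suffix
    n = split(name, f, "_")
    # The contig name may itself contain "_", so count fields from the right.
    contig = f[1]
    for (i = 2; i <= n - 5; i++) contig = contig "_" f[i]
    print contig "\t" f[n-4] "\t" f[n-3] "\t" f[n-2] "\t" f[n-1]
  }'
}

# Hypothetical record, for illustration only.
printf '@chr_1_1000_1500_1:0:0_2:1:0_1f/1\nACGT\n+\nIIII\n' | parse_wgsim_names
```

Splitting from the right keeps contig names that contain underscores intact, which a plain left-to-right split would break.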

Address of the bookmark: https://github.com/lh3/wgsim

PHYMMBL

Jit — Mon, 10 Oct 2016 08:56:34 -0500

Metagenomics sequencing projects collect samples of DNA from uncharacterized environments that may contain hundreds or even thousands of species. One of the main challenges in analyzing a metagenome is phylogenetic classification of raw sequence reads into groups representing the same or similar species. Such classification is a useful prerequisite for genome assembly and for analysis of the biological diversity present in a sample. The newest sequencing technologies have simultaneously made metagenomics easier, by making the sequencing process faster, and more difficult, by producing shorter read lengths than previous technologies. Methods for classifying sequences as short as 100 base pairs (bp) have until now been relatively inaccurate, requiring metagenomics projects to use older, long-read technologies. Phymm, a new classification approach for metagenomics data which uses interpolated Markov models (IMMs) to taxonomically classify DNA sequences, can accurately classify reads as short as 100 bp. Its accuracy for short reads represents a significant leap forward over previous composition-based classification methods. PhymmBL (rhymes with "thimble"), the hybrid classifier included in this distribution which combines analysis from both Phymm and BLAST, produces even higher accuracy.

Address of the bookmark: http://www.cbcb.umd.edu/software/phymm/

BIMA V3: an aligner customized for mate pair library sequencing

Abhimanyu Singh — Wed, 14 Dec 2016 15:20:00 -0600

Summary: Mate pair library sequencing is an effective and economical method for detecting genomic structural variants and chromosomal abnormalities. Unfortunately, the mapping and alignment of mate pair read pairs to a reference genome is a challenging and time-consuming process for most NGS alignment programs. Large insert sizes, the introduction of library preparation protocol artifacts (biotin junction reads, paired-end read contamination, chimeras, etc.), and the presence of structural variant breakpoints within reads increase mapping and alignment complexity. We describe an algorithm that is up to 20 times faster and 25% more accurate than popular NGS alignment programs when processing mate pair sequencing data.
Availability: http://bioinformaticstools.mayo.edu/research/bima/
Contact: vasmatzis.george@mayo.edu

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2014/02/12/bioinformatics.btu078.full.pdf

Understanding HiFi Reads!

Rahul Nayak — Thu, 24 Mar 2022 19:48:11 -0500

While little public data is available for either of the new synthetic long-read approaches, Illumina showed an example comparison earlier this year at the Festival of Genomics & Biodata conference (FoG 2022). In the IGV screenshot shown in the linked post, synthetic Infinity reads, labeled “Longas”, are at the top, followed by standard Illumina short reads, with PacBio HiFi reads labeled “CCS” at the bottom.

Address of the bookmark: http://pacb.com/blog/the-hifi-difference-true-long-reads-vs-synthetic-long-reads/

Filtlong: quality filtering tool for long reads

Radha Agarkar — Wed, 13 May 2020 10:23:55 -0500

Filtlong is a tool for filtering long reads by quality. It can take a set of long reads and produce a smaller, better subset. It uses both read length (longer is better) and read identity (higher is better) when choosing which reads pass the filter.

Filtlong builds into a stand-alone executable:

git clone https://github.com/rrwick/Filtlong.git
cd Filtlong
make -j
bin/filtlong -h
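
Once built, a typical run discards short reads and then keeps the best fraction of what remains. The sketch below wraps an invocation modeled on the project's README; the `filter_long_reads` helper, the `FILTLONG` variable and the file names are placeholders introduced here, and the thresholds are example values rather than recommendations.

```shell
# Path to the executable built above; override FILTLONG if it is installed elsewhere.
FILTLONG=${FILTLONG:-bin/filtlong}

# Hypothetical helper: filter the reads in $1 and write gzipped FASTQ to $2.
# Drops reads shorter than 1000 bp, then keeps the best 90% of the remaining
# reads, measured in bases rather than read count.
filter_long_reads() {
  "$FILTLONG" --min_length 1000 --keep_percent 90 "$1" | gzip > "$2"
}

# Example call (placeholder file names):
# filter_long_reads input.fastq.gz output.fastq.gz
```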

Address of the bookmark: https://github.com/rrwick/Filtlong

mixtureS: a novel tool for bacterial strain reconstruction from reads

BioStar — Fri, 21 Aug 2020 08:23:19 -0500

mixtureS is a tool that can identify bacterial strains de novo from shotgun reads of a clonal or metagenomic sample, without prior knowledge of the strains or their variations. Tested on 243 simulated datasets and 195 experimental datasets, mixtureS reliably identified the strains, their number and their abundance. Compared with three other tools, mixtureS showed better performance in almost all simulated datasets and the vast majority of experimental datasets.

Availability

The mixtureS source code and tool are available at http://www.cs.ucf.edu/~xiaoman/mixtureS/.

Address of the bookmark: http://www.cs.ucf.edu/~xiaoman/mixtureS/

LoRMA: a tool for correcting sequencing errors in long reads such as those produced by Pacific Biosciences sequencing machines

Jit — Wed, 15 Jun 2016 17:18:36 -0500

LoRMA is a tool for correcting sequencing errors in long reads such as those produced by Pacific Biosciences sequencing machines.

Publication:

L. Salmela, R. Walve, E. Rivals, and E. Ukkonen: Accurate self-correction of errors in long reads using de Bruijn graphs. Accepted to RECOMB-Seq 2016.

Address of the bookmark: https://www.cs.helsinki.fi/u/lmsalmel/LoRMA/

Liftoff: an accurate tool that maps annotations in GFF or GTF between assemblies

Jit — Tue, 30 Jun 2020 21:40:52 -0500

Liftoff is an accurate tool that maps annotations in GFF or GTF between assemblies of the same or closely related species. Unlike current coordinate lift-over tools, which require a pre-generated “chain” file as input, Liftoff is a standalone tool that takes two genome assemblies and a reference annotation as input and outputs an annotation of the target genome.
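
The three inputs and one output map directly onto the command line. The sketch below wraps a basic invocation as described in the project's README (`-g` for the reference annotation, `-o` for the output, then the target and reference FASTA files); the `lift_annotation` helper, its argument order and the file names are placeholders introduced here for illustration.

```shell
# Hypothetical helper: lift annotations from a reference assembly to a target assembly.
#   $1 = reference annotation (GFF or GTF)   $2 = reference FASTA
#   $3 = target FASTA                        $4 = output annotation for the target
lift_annotation() {
  # Liftoff's positional arguments are the target first, then the reference.
  liftoff -g "$1" -o "$4" "$3" "$2"
}

# Example call (placeholder file names):
# lift_annotation reference.gff3 reference.fa target.fa target.gff3
```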

Address of the bookmark: https://github.com/agshumate/Liftoff