BOL: Related items

SimLoRD: A read simulator for third generation sequencing reads

Aaryan Lokwani — Wed, 22 Aug 2018 10:40:27 -0500

SimLoRD is a read simulator for third generation sequencing reads and is currently focused on the Pacific Biosciences SMRT error model.

Reads are simulated from both strands of a provided or randomly generated reference sequence.

The reference can be read from a FASTA file or randomly generated with a given GC content. It can consist of several chromosomes, whose structure is respected when drawing reads. (Simulation of genome rearrangements may be incorporated at a later stage.)
The read lengths can be determined in four ways: drawing from a log-normal distribution (typical for genomic DNA), sampling from an existing FASTQ file (typical for RNA), sampling from a text file with integers (RNA), or using a fixed length.
Quality values and number of passes depend on fragment length.
Provided subread error probabilities are modified according to the number of passes.
Outputs reads in FASTQ format and alignments in SAM format.
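
The sampling steps listed above (a random reference with a given GC content, log-normal read lengths, reads from both strands) can be illustrated with a minimal Python sketch. This is not SimLoRD's code: the function names and the log-normal parameters mu and sigma are assumptions made for illustration, and no sequencing errors, passes, or quality values are modelled.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def random_reference(length, gc_content, rng):
    """Generate a random reference with the given expected GC content."""
    at = (1 - gc_content) / 2
    gc = gc_content / 2
    # Weights follow the order of the alphabet string "ACGT".
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def draw_read(reference, rng, mu=8.0, sigma=0.5, min_len=50):
    """Draw one error-free read from a randomly chosen strand.

    mu and sigma are hypothetical log-normal parameters,
    not SimLoRD defaults.
    """
    length = int(rng.lognormvariate(mu, sigma))
    # Clamp the length to [min_len, len(reference)].
    length = min(max(length, min_len), len(reference))
    start = rng.randrange(len(reference) - length + 1)
    fragment = reference[start:start + length]
    strand = rng.choice("+-")
    if strand == "-":
        fragment = reverse_complement(fragment)
    return start, strand, fragment


if __name__ == "__main__":
    rng = random.Random(42)
    ref = random_reference(20000, 0.45, rng)
    for _ in range(3):
        start, strand, read = draw_read(ref, rng)
        print(start, strand, len(read))
```

A seeded random.Random instance is passed in explicitly so that a simulation run can be reproduced.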

Address of the bookmark: https://bitbucket.org/genomeinformatics/simlord/

PANDASEQ is a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.

BioStar — Fri, 21 Sep 2018 10:19:52 -0500

Development packages for zlib and libbz2 are needed, as well as a standard compiler environment. On Ubuntu, these can be installed with:

sudo apt-get install build-essential libtool automake zlib1g-dev libbz2-dev pkg-config

On macOS, the Apple Developer tools and Fink (or MacPorts or Homebrew) must be installed, then:

sudo fink install bzip2-dev pkgconfig
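
The idea of reconstructing an overlapping sequence from a read pair can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a plain mismatch count, no quality scores, no primers); it is not PANDASEQ's algorithm, and the function name merge_pair and its parameters are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a read pair using the longest acceptable overlap.

    The reverse read is reverse-complemented, then the suffix of the
    forward read is compared with the prefix of that sequence.
    Returns the merged sequence, or None if no overlap qualifies.
    """
    rev = reverse_complement(reverse)
    longest = min(len(forward), len(rev))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = forward[-overlap:]
        head = rev[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * overlap:
            # At mismatching positions the forward base is kept.
            return forward + rev[overlap:]
    return None


if __name__ == "__main__":
    fragment = "ACGTACCTGATTCGAGGCTAAGTCCATGCA"
    fwd = fragment[:20]
    rev = reverse_complement(fragment[10:])
    print(merge_pair(fwd, rev, max_mismatch_rate=0.0))
```

Trying the longest overlap first is a design choice of this sketch; a quality-aware scorer would instead weigh each candidate overlap by the base qualities.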

Address of the bookmark: https://github.com/neufeld/pandaseq

Pacasus: Correction of palindromes in long reads from PacBio and Nanopore

BioStar — Mon, 12 Nov 2018 05:26:48 -0600

Tool for detecting and correcting palindromic sequences in PacBio / Nanopore long reads after whole genome amplification. See the poster from the Revolutionizing Next-Generation Sequencing (2nd edition) conference in the source folder: https://github.com/swarris/Pacasus/blob/master/vib2017.pdf.

The preprint version is available at http://www.biorxiv.org/content/early/2017/08/09/173872

It uses the pyPaSWAS framework for sequence alignment (https://github.com/swarris/pyPaSWAS)

Address of the bookmark: https://github.com/swarris/Pacasus

Flye: Fast and accurate de novo assembler for single molecule sequencing reads

Rahul Nayak — Sat, 06 Jul 2019 03:48:22 -0500

Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PB / ONT reads as input and outputs polished contigs. Flye also includes a special mode for metagenome assembly.

Address of the bookmark: https://github.com/fenderglass/Flye

Free genomics data!

BioStar — Fri, 07 Feb 2020 14:08:31 -0600

The specimens were collected by the Oxford Wytham Woods and Edinburgh Lohse lab teams. DNA extraction and sequencing were carried out by the Sanger Institute Scientific Operations teams. Assemblies were carried out by the Tree of Life team (Shane McCarthy) and colleagues at Pacific Biosciences (Jonas Korlach).

https://www.darwintreeoflife.org/an-initial-set-of-raw-genome-assemblies-from-the-darwin-tree-of-life-project/

Address of the bookmark: https://www.darwintreeoflife.org/an-initial-set-of-raw-genome-assemblies-from-the-darwin-tree-of-life-project/

Filtlong: quality filtering tool for long reads

Radha Agarkar — Wed, 13 May 2020 10:23:55 -0500

Filtlong is a tool for filtering long reads by quality. It can take a set of long reads and produce a smaller, better subset. It uses both read length (longer is better) and read identity (higher is better) when choosing which reads pass the filter.

Filtlong builds into a stand-alone executable:

git clone https://github.com/rrwick/Filtlong.git
cd Filtlong
make -j
bin/filtlong -h
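
The selection principle described above (longer reads and higher-identity reads are preferred) can be sketched in Python. This is an illustration only, not Filtlong's actual scoring: the score used here (length times an identity estimate derived from Phred qualities) and the names mean_identity and filter_reads are assumptions.

```python
def mean_identity(quality_string, phred_offset=33):
    """Estimate read identity as the mean per-base probability of being correct."""
    error_probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    return 1.0 - sum(error_probs) / len(error_probs)


def filter_reads(reads, min_length=1000, keep_fraction=0.9):
    """Keep the best reads from a list of (name, sequence, quality) tuples.

    Reads shorter than min_length are dropped first. The rest are ranked
    by length * identity (an assumed score chosen for illustration) and
    the top keep_fraction of them is returned.
    """
    candidates = [r for r in reads if len(r[1]) >= min_length]
    candidates.sort(key=lambda r: len(r[1]) * mean_identity(r[2]), reverse=True)
    keep = int(len(candidates) * keep_fraction)
    return candidates[:keep]


if __name__ == "__main__":
    reads = [
        ("a", "A" * 2000, "I" * 2000),  # Q40 bases
        ("b", "A" * 2000, "#" * 2000),  # Q2 bases
        ("c", "A" * 500, "I" * 500),    # too short
        ("d", "A" * 3000, "5" * 3000),  # Q20 bases
    ]
    for name, seq, qual in filter_reads(reads, keep_fraction=0.67):
        print(name, len(seq), round(mean_identity(qual), 4))
```

Averaging error probabilities rather than raw Phred scores avoids overstating the quality of reads that mix very good and very poor bases.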

Address of the bookmark: https://github.com/rrwick/Filtlong

Hifiasm: a haplotype-resolved assembler for accurate HiFi reads

Jit — Thu, 24 Dec 2020 10:03:36 -0600

Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. It can assemble a human genome in several hours and works with the California redwood genome, one of the most complex genomes sequenced so far. Hifiasm can produce primary/alternate assemblies of quality competitive with the best assemblers. It also introduces a new graph binning algorithm and achieves the best haplotype-resolved assembly given trio data.

Address of the bookmark: https://github.com/chhylp123/hifiasm

Bioinformatics tools for telomere-to-telomere assembly!

BioStar — Tue, 17 Aug 2021 13:17:09 -0500

● Merfin – k-mer-based assembly and variant calling evaluation for improved consensus accuracy (Arang Rhie)
● PanGenie – algorithm that leverages a pangenome reference built from haplotype-resolved genome assemblies in conjunction with k-mer count information from raw, short-read sequencing data to genotype a wide spectrum of genetic variation (Tobias Marschall)
● SQANTI3 – an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline (Rocío Amorín de Hegedüs @rocioadh)
● tama (Transcriptome Annotation by Modular Algorithms) – software designed for processing Iso-Seq data and other long-read transcriptome data (Richard Kuo @GenomeRIK)
● pbaa (PacBio Amplicon Analysis) – separates complex mixtures of amplicon targets from genomic samples to cluster and generate high-quality consensus sequences from HiFi reads (Zev Kronenberg @zevkronenberg)
● bellerophon – analyzes MHC typing and other low-complexity gene amplicon data; performs allele calling while detecting polymorphic sites within the sequences and removing potential chimeric sequence variants (Yuanyuan Cheng @Yuanyuan929)
● svpack – tools for filtering, comparing, and annotating structural variant (SV) calls in VCF format (Aaron Wenger)
● JumboDB – tool for de Bruijn graph construction (Anton Bankevich @AntonBankevich)
● uLTRA – tool for splice alignment of long transcriptomic reads to a genome, guided by a database of exon annotations (Kristoffer Sahlin @krsahlin)
● LeafGo – workflow to rapidly produce high-quality de novo plant genomes (Luca Ermini @ermini_luca)

Reference:

https://www.pacb.com/blog/young-investigators-share-stellar-science-career-advice-and-bioinformatics-tools-at-smrt-leiden-2021/

Stacks

Jitendra Narayan — Wed, 24 Feb 2016 15:52:30 -0600

Stacks is a software pipeline for building loci from short-read sequences, such as those generated on the Illumina platform. Stacks was developed to work with restriction enzyme-based data, such as RAD-seq, for the purpose of building genetic maps and conducting population genomics and phylogeography.

More at http://catchenlab.life.illinois.edu/stacks/

Address of the bookmark: http://catchenlab.life.illinois.edu/stacks/

ALE: a Generic Assembly Likelihood Evaluation Framework for Assessing the Accuracy of Genome and Metagenome Assemblies

Neel — Tue, 26 Apr 2016 03:38:43 -0500

The Assembly Likelihood Evaluation (ALE) framework systematically evaluates the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single-genome and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies present in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

More at http://www.ncbi.nlm.nih.gov/pubmed/23303509

Address of the bookmark: http://sc932.github.io/ALE/about.html