BOL: Related items

Hifiasm: a haplotype-resolved assembler for accurate HiFi reads

Jit — Thu, 24 Dec 2020 10:03:36 -0600

Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. It can assemble a human genome in several hours and works with the California redwood genome, one of the most complex genomes sequenced so far. Hifiasm can produce primary/alternate assemblies of quality competitive with the best assemblers. It also introduces a new graph binning algorithm and achieves the best haplotype-resolved assembly given trio data.

Address of the bookmark: https://github.com/chhylp123/hifiasm

Bioinformatics tools for telomere-to-telomere assembly!

BioStar — Tue, 17 Aug 2021 13:17:09 -0500

● Merfin – k-mer-based assembly and variant calling evaluation for improved consensus accuracy (Arang Rhie)
● PanGenie – algorithm that leverages a pangenome reference built from haplotype-resolved genome assemblies in conjunction with k-mer count information from raw, short-read sequencing data to genotype a wide spectrum of genetic variation (Tobias Marschall)
● SQANTI3 – an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline (Rocío Amorín de Hegedüs @rocioadh)
● tama (Transcriptome Annotation by Modular Algorithms) – software designed for processing Iso-Seq data and other long-read transcriptome data (Richard Kuo @GenomeRIK)
● pbaa (PacBio Amplicon Analysis) – separates complex mixtures of amplicon targets from genomic samples to cluster and generate high-quality consensus sequences from HiFi reads (Zev Kronenberg @zevkronenberg)
● bellerophon – analyzes MHC typing and other low-complexity gene amplicon data; performs allele calling while detecting polymorphic sites within the sequences and removing potential chimeric sequence variants (Yuanyuan Cheng @Yuanyuan929)
● svpack – tools for filtering, comparing, and annotating structural variant (SV) calls in VCF format (Aaron Wenger)
● JumboDB – tool for de Bruijn graph construction (Anton Bankevich @AntonBankevich)
● uLTRA – tool for splice alignment of long transcriptomic reads to a genome, guided by a database of exon annotations (Kristoffer Sahlin @krsahlin)
● LeafGo – workflow to rapidly produce high-quality de novo plant genomes (Luca Ermini @ermini_luca)

Reference:

https://www.pacb.com/blog/young-investigators-share-stellar-science-career-advice-and-bioinformatics-tools-at-smrt-leiden-2021/

MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads

Abhi — Tue, 05 Sep 2023 07:31:35 -0500

MitoHiFi v3.2 is a Python pipeline distributed under the MIT License.

MitoHiFi was first developed to assemble the mitogenomes of a wide range of species in the Darwin Tree of Life Project (DToL).

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05385-y

Address of the bookmark: https://github.com/marcelauliano/MitoHiFi

Trimmomatic: A flexible read trimming tool for Illumina NGS data

Jit — Fri, 15 Apr 2016 05:58:53 -0500

Paired End:

java -jar trimmomatic-0.35.jar PE -phred33 input_forward.fq.gz input_reverse.fq.gz output_forward_paired.fq.gz output_forward_unpaired.fq.gz output_reverse_paired.fq.gz output_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
Remove leading low quality or N bases (below quality 3) (LEADING:3)
Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
Drop reads shorter than 36 bases (MINLEN:36)
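
The quality-based steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Trimmomatic's implementation: the function name `trim_read` is hypothetical, qualities are assumed to be already decoded Phred scores, adapter clipping is left out, and the sliding-window step simply cuts at the start of the first window whose average falls below the threshold.

```python
def trim_read(seq, quals, leading=3, trailing=3, window=4, min_avg=15, min_len=36):
    """Simplified quality trimming (hypothetical helper, not Trimmomatic code).

    Returns the trimmed (seq, quals) pair, or None if the read ends up
    shorter than min_len.
    """
    # LEADING: skip bases at the start whose quality is below the threshold.
    start = 0
    while start < len(seq) and quals[start] < leading:
        start += 1

    # TRAILING: skip bases at the end whose quality is below the threshold.
    end = len(seq)
    while end > start and quals[end - 1] < trailing:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # SLIDINGWINDOW: scan from the 5' end and cut at the first window
    # whose average quality drops below min_avg.
    cut = len(seq)
    for i in range(len(seq) - window + 1):
        if sum(quals[i:i + window]) / window < min_avg:
            cut = i
            break
    seq, quals = seq[:cut], quals[:cut]

    # MINLEN: discard reads that are now too short.
    if len(seq) < min_len:
        return None
    return seq, quals


if __name__ == "__main__":
    read = "A" * 40
    quals = [2, 2] + [30] * 36 + [2, 2]
    print(trim_read(read, quals))
```

The defaults mirror the values in the command line above (3, 3, 4:15 and 36), so the effect of changing one parameter can be tried on a toy read before running the real tool.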

More at http://www.usadellab.org/cms/?page=trimmomatic

Address of the bookmark: http://www.usadellab.org/cms/?page=trimmomatic

segemehl

Anjana — Tue, 10 May 2016 08:10:15 -0500

segemehl is software for mapping short sequencer reads to reference genomes. Unlike other methods, segemehl is able to detect not only mismatches but also insertions and deletions. Furthermore, segemehl is not limited to a specific read length and is able to map primer- or polyadenylation-contaminated reads correctly. segemehl implements a matching strategy based on enhanced suffix arrays (ESA).
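
To give a feel for suffix-array matching, here is a minimal Python sketch. It is not segemehl's algorithm: it builds a plain (not enhanced) suffix array in a naive way and finds exact matches only, with no mismatches, insertions or deletions. The function names are hypothetical.

```python
def build_suffix_array(text):
    """Naive suffix array: start positions of all suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions where pattern occurs exactly in text."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    sa = build_suffix_array(reference)
    print(find_occurrences(reference, sa, "ACG"))
```

Because all suffixes sharing a prefix sit next to each other in the sorted array, two binary searches are enough to locate every exact occurrence of a read.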

More at http://www.bioinf.uni-leipzig.de/Software/segemehl/

Manual http://www.bioinf.uni-leipzig.de/Software/segemehl/segemehl_manual_0_1_7.pdf

Address of the bookmark: http://hoffmann.bioinf.uni-leipzig.de/LIFE/segemehl.html

CovCal: Coverage / Read Count Calculator

Jit — Wed, 15 Jun 2016 18:08:13 -0500

Calculate how much sequencing you need to hit a target depth of coverage (or vice versa).

Instructions: set the read length/configuration and genome size, then select what you want to calculate.

Written by Stephen Turner, based on the Lander-Waterman formula and inspired by a similar calculator written by James Hadfield. Coverage is calculated as C = LN/G and the number of reads as N = CG/L, where C = coverage (X), L = read length (bp), G = haploid genome size (bp), and N = number of reads. Source code on GitHub.
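
The two formulas are easy to check by hand. Below is a small Python sketch of them; the function names are hypothetical and this is not the calculator's own source code. Each sequenced read counts as one read of length L, and the read count is rounded up to a whole number.

```python
import math


def coverage(read_length, n_reads, genome_size):
    """C = L * N / G."""
    return read_length * n_reads / genome_size


def reads_needed(target_coverage, genome_size, read_length):
    """N = C * G / L, rounded up to a whole number of reads."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Illustrative numbers: 150 bp reads, a 3.1 Gbp haploid genome, 30X target.
    n = reads_needed(30, 3_100_000_000, 150)
    print(n)                                  # 620000000
    print(coverage(150, n, 3_100_000_000))    # 30.0
```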

Address of the bookmark: http://apps.bioconnector.virginia.edu/covcalc/

Maq: Mapping and Assembly with Quality

Jit — Tue, 22 Nov 2016 04:51:39 -0600

Maq stands for Mapping and Assembly with Quality. It builds assemblies by mapping short reads to reference sequences. Maq is a project hosted by SourceForge.net. The project page is available at http://sourceforge.net/projects/maq/. Maq was previously known as mapass2.

Run Maq Now

Follow these steps to try Maq. All you need is a reference sequence file in the FASTA format.

Prepare a reference sequence (ref.fasta), preferably a bacterial genome.
Download maq, maq-data and maqview from the download page.
Copy maq, maq.pl and maq_eval.pl to a directory on your $PATH, or put them all in the same directory.
Simulate diploid reference and read sequences, map reads, call variants and evaluate the results in one go:
```
maq.pl demo ref.fasta calib-30.dat
```
where calib-30.dat is contained in maq-data.

View the alignment:

cd maqdemo/easyrun;
maqindex -i -c consensus.cns all.map;
maqview -c consensus.cns all.map

Even for advanced maq users, running `maq.pl demo` is recommended. You may find something helpful.

Address of the bookmark: http://maq.sourceforge.net

Understanding PacBio

Jitendra Narayan — Fri, 24 Feb 2017 10:17:36 -0600

This tutorial includes resources for learning more about PacBio data and bioinformatics analysis, and includes content suitable for both beginners and experts. Below are links to training modules (webinars and PowerPoint presentations) to help you get started with your data processing, as well as information for specialized applications.

Training Resources:

Specialized Applications:

Address of the bookmark: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki

Software and tools to detect structural variation with long reads!

Archana Malhotra — Wed, 15 Mar 2017 14:31:09 -0500

Uncovering the connection between genetics and heritable diseases requires an approach that looks at all the variant bases and types in a genome. While a PacBio de novo assembly resolves the most novel structural variants (SVs), 8-10X PacBio coverage of single genomes or trios reveals triple the SVs detectable by short-read data.

With Single Molecule, Real-Time (SMRT) Sequencing, you can access structural variations having a broad range of sizes, types, and GC content with the ability to:

Uncover missing heritability linked to structural variation
Unambiguously identify genomic context and variant breakpoints at the sequence level to unravel the genetic etiology of disease
Resolve structural variation across the complete size spectrum with base-pair resolution

The following SV tools can help you achieve this.

Sniffles: Structural variation caller using third generation sequencing

Sniffles is a structural variation caller using third generation sequencing (PacBio or Oxford Nanopore). It detects all types of SVs using evidence from split-read alignments, high-mismatch regions, and coverage analysis. Please note that the current version of Sniffles requires sorted output from BWA-MEM (use the -M and -x parameters) or NGM-LR with the optional SAM attributes enabled!

More at https://github.com/fritzsedlazeck/Sniffles

MultiBreak-SV: It identifies structural variants from next-generation paired end data, third-generation long read data, or data from a combination of sequencing platforms.

There are two pieces of software in this release: (1) a pre-processor that takes machineformat (.m5) BLASR files, and (2) MultiBreak-SV. For installation and usage instructions, see doc/MultiBreakSV-Manual.txt.

More at https://github.com/raphael-group/multibreak-sv

Parliament: A Structural Variation Tool. Why ask a single sv-detection approach to find every variant when you can have a parliament of tools deciding?

Publication about the algorithm and “…the first long-read characterization of structural variation in a diploid human personal genome…” (HS1011) - “Assessing structural variation in a personal genome—towards a human reference diploid genome”

More at https://sourceforge.net/projects/parliamentsv/

https://www.dnanexus.com/papers/Parliament_Info_Sheet.pdf

PBHoney: the structural variation discovery tool

PBHoney is an implementation of two variant-identification approaches designed to exploit the high mappability of long reads (i.e., greater than 10,000 bp). PBHoney considers both intra-read discordance and soft-clipped tails of long reads to identify structural variants.
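
To make the idea of soft-clipped tails concrete, here is a short Python sketch that reads the tail lengths off a SAM CIGAR string. It is not PBHoney's code: the function names and the `min_tail` threshold of 200 bp are assumptions made for this example.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM format.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_tails(cigar):
    """Return (left, right) soft-clip lengths of an alignment, 0 if absent."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside the soft clips; ignore them here.
    ops = [(n, op) for n, op in ops if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return left, right


def has_long_tail(cigar, min_tail=200):
    """True if either soft-clipped tail is at least min_tail bases long."""
    return max(soft_clipped_tails(cigar)) >= min_tail


if __name__ == "__main__":
    print(soft_clipped_tails("10H300S9000M250S"))
    print(has_long_tail("500S9500M"))
```

A read with a long unaligned tail at one end is the kind of signal worth inspecting near a candidate breakpoint; a real caller would also need the alignment position and would cluster such reads.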

Read The Paper http://www.biomedcentral.com/1471-2105/15/180/abstract

More at https://sourceforge.net/projects/pb-jelly/

SMRT-SV: Structural variant and indel caller for PacBio reads

Structural variant (SV) and indel caller for PacBio reads based on methods from Chaisson et al. 2014.

SMRT-SV provides an official software package for tools described in Chaisson et al. 2014 and adds several key features including the following.

Unified variant calling user interface with built-in cluster compute support
Small indel calling (2-49 bp)
Improved inversion calling (screenInversions)
Quality metric for SV calls based on number of local assemblies supporting each call
Higher sensitivity for SV calls using tiled local assemblies across the entire genome instead of "signature" regions
Genotyping of SVs with Illumina paired-end reads from WGS samples

More at https://github.com/EichlerLab/pacbio_variant_caller

Meraculous: De Novo Genome Assembly with Short Paired-End Reads

Jit — Tue, 07 Nov 2017 04:36:10 -0600

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
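
The notion of k-mers with unique extensions can be illustrated with a toy Python sketch. It is a strong simplification and not the Meraculous implementation: it looks only at right-hand extensions, ignores base qualities and reverse complements, and uses a hypothetical `min_support` count in place of the paper's quality criteria.

```python
from collections import Counter, defaultdict


def unique_right_extensions(reads, k, min_support=2):
    """Map each k-mer to its next base when exactly one base is well supported."""
    extensions = defaultdict(Counter)
    for read in reads:
        # Record which base follows each k-mer occurrence.
        for i in range(len(read) - k):
            extensions[read[i:i + k]][read[i + k]] += 1

    unique = {}
    for kmer, counts in extensions.items():
        supported = [base for base, c in counts.items() if c >= min_support]
        # Keep the k-mer only if a single extension passes the support cutoff;
        # k-mers with competing extensions (forks) are left out.
        if len(supported) == 1:
            unique[kmer] = supported[0]
    return unique


if __name__ == "__main__":
    reads = ["ACGTA", "ACGTA", "ACGTC"]
    print(unique_right_extensions(reads, k=3))
```

In this toy input the single read ending in C falls below the support cutoff, so CGT still has one accepted extension. With two such reads the k-mer would become a fork and be dropped.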

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3158087/