BOL: Related items

SiLiX: implements an ultra-efficient algorithm for the clustering of homologous sequences

Jit — Wed, 12 Dec 2018 09:22:41 -0600

The software package SiLiX implements an ultra-efficient algorithm for the clustering of homologous sequences, based on single transitive links (single linkage) with alignment coverage constraints.

SiLiX adopts a graph-theoretical framework to interpret similarity pairs as edges of a network. A very efficient algorithm, based on the Disjoint Sets Data Structure, allows the computation of sequence families with low time and space requirements.
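
The idea of treating similarity pairs as edges and merging them with a disjoint-sets structure can be sketched in a few lines of Python. This is a minimal illustration, not SiLiX's actual code: the tuple layout of `hits`, the function name `cluster_families`, and the identity and coverage thresholds are assumptions chosen for the example.

```python
from collections import defaultdict


class DisjointSets:
    """Union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        # Register unseen items as singleton sets.
        if x not in self.parent:
            self.parent[x] = x
            self.size[x] = 1
        # Locate the root, then point every node on the path at it.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the smaller tree under the larger one.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]


def cluster_families(hits, min_identity=0.35, min_coverage=0.80):
    """Single-linkage clustering of (seq_a, seq_b, identity, coverage) pairs.

    The thresholds are illustrative assumptions, not values taken from the text.
    """
    sets = DisjointSets()
    for a, b, identity, coverage in hits:
        # Every sequence becomes a node, even if its hit is rejected.
        sets.find(a)
        sets.find(b)
        # Only pairs passing both constraints count as edges.
        if identity >= min_identity and coverage >= min_coverage:
            sets.union(a, b)
    families = defaultdict(set)
    for seq in list(sets.parent):
        families[sets.find(seq)].add(seq)
    return sorted(sorted(members) for members in families.values())
```

Because each pair is processed once and never stored as an explicit graph, memory grows with the number of sequences rather than the number of similarity pairs, which is the property the paragraph above attributes to the disjoint-sets approach.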

A parallel version of SiLiX, based on MPI, is also available in this package and has been shown to scale well, which allows the study of very large datasets.

SiLiX is already included in the analysis pipeline for HOGENOM.

Address of the bookmark: http://lbbe.univ-lyon1.fr/SiLiX?lang=fr

AlfaPang: alignment free algorithm for pangenome graph construction

BioStar — Thu, 28 Aug 2025 02:56:35 -0500

AlfaPang constructs variation graphs, leveraging its alignment-free and reference-free approach, based solely on intrinsic sequence properties. This design allows AlfaPang's runtime and memory usage to scale linearly with the size of input sequences, enabling it to handle significantly larger genome sets compared to other methods.

Address of the bookmark: https://github.com/AdamCicherski/AlfaPang

HECIL: A Hybrid Error Correction Algorithm for Long Reads with Iterative Learning

Abhimanyu Singh — Tue, 01 Jan 2019 12:01:00 -0600

HECIL (Hybrid Error Correction with Iterative Learning) is a hybrid error correction framework that determines a correction policy for erroneous long reads, based on optimal combinations of decision weights obtained from short read alignments.

HECIL's core algorithm is extended with an iterative learning paradigm that enhances the correction policy at each iteration by incorporating knowledge gathered from previous iterations via data-driven confidence metrics assigned to prior corrections.

Address of the bookmark: https://github.com/NDBL/HECIL

Omega2: metagenome assembly pipeline

Jit — Mon, 10 Jul 2017 05:56:07 -0500

Omega found overlaps between reads using a prefix/suffix hash table. The overlap graph of reads was simplified by removing transitive edges and trimming short branches. Unitigs were generated based on minimum cost flow analysis of the overlap graph and then merged to contigs and scaffolds using mate-pair information. In comparison with three de Bruijn graph assemblers (SOAPdenovo, IDBA-UD and MetaVelvet), Omega provided comparable overall performance on a HiSeq 100-bp dataset and superior performance on a MiSeq 300-bp dataset. In comparison with Celera on the MiSeq dataset, Omega provided more continuous assemblies overall using a fraction of the computing time of existing overlap-layout-consensus assemblers. This indicates Omega can more efficiently assemble longer Illumina reads, and at deeper coverage, for metagenomic datasets.
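
The first two steps described above (finding overlaps through a prefix hash table, then removing transitive edges) can be illustrated with a toy Python sketch. It is not Omega's implementation: it assumes exact, same-strand suffix/prefix matches, ignores contained reads, and uses a simplified transitivity test; the function names and the `min_overlap` parameter are hypothetical.

```python
from collections import defaultdict


def find_overlaps(reads, min_overlap=3):
    """Return {(a, b): length} where a suffix of reads[a] equals a prefix of reads[b]."""
    # Index every read by its first min_overlap bases.
    prefix_index = defaultdict(list)
    for name, seq in reads.items():
        if len(seq) >= min_overlap:
            prefix_index[seq[:min_overlap]].append(name)

    overlaps = {}
    for a, seq_a in reads.items():
        # Each window of seq_a that matches an indexed prefix is a candidate overlap start.
        for start in range(1, len(seq_a) - min_overlap + 1):
            window = seq_a[start:start + min_overlap]
            for b in prefix_index.get(window, []):
                if b == a or (a, b) in overlaps:
                    continue
                suffix = seq_a[start:]
                # Verify the whole suffix; smaller starts come first, so the
                # first verified hit per pair is the longest overlap.
                if reads[b].startswith(suffix):
                    overlaps[(a, b)] = len(suffix)
    return overlaps


def remove_transitive_edges(overlaps):
    """Drop edge a->c whenever some b gives a->b and b->c (simplified test)."""
    successors = defaultdict(set)
    for a, b in overlaps:
        successors[a].add(b)
    kept = {}
    for (a, c), length in overlaps.items():
        transitive = any(
            c in successors.get(b, ()) for b in successors[a] if b != c
        )
        if not transitive:
            kept[(a, c)] = length
    return kept


reads = {"r1": "ACGTACGA", "r2": "GTACGATT", "r3": "ACGATTCC"}
overlaps = find_overlaps(reads)
simplified = remove_transitive_edges(overlaps)
```

In this example `r1` overlaps both `r2` and `r3`, but the `r1` to `r3` edge is implied by the path through `r2`, so only the two edges of the linear path remain after simplification.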

Address of the bookmark: http://omega.omicsbio.org/

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is a fast, approximate tool for mapping long reads (PacBio/ONT) or assemblies to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
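
The conversion from a MinHash Jaccard estimate to an identity estimate can be sketched as follows. This is a simplified illustration rather than MashMap's code: it uses plain bottom-s MinHash over all k-mers and leaves out the winnowing step, and the hash function, function names, and parameters `k` and `s` are assumptions. The Mash distance used here is D = -(1/k) ln(2j / (1 + j)), with identity estimated as 1 - D.

```python
import hashlib
import math


def kmers(seq, k):
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    # Stable 64-bit hash so results do not change between runs.
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k, s):
    """Bottom-s sketch: the s smallest k-mer hashes."""
    return set(sorted(hash_kmer(km) for km in kmers(seq, k))[:s])


def jaccard_estimate(sketch_a, sketch_b, s):
    """Fraction of the s smallest hashes of the union that occur in both sketches."""
    union_bottom = sorted(sketch_a | sketch_b)[:s]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in sketch_a and h in sketch_b)
    return shared / len(union_bottom)


def mash_identity(j, k):
    """Convert a Jaccard estimate j into an identity estimate via the Mash distance."""
    if j <= 0:
        return 0.0
    distance = -math.log(2 * j / (1 + j)) / k
    return max(0.0, 1.0 - distance)


query = "ACGTTGCAAGCTTAGGCTAACGATCGATCGGATCCGATTAGC"
sketch = minhash_sketch(query, k=5, s=10)
identity = mash_identity(jaccard_estimate(sketch, sketch, s=10), k=5)
```

A mapper built on this idea would report a query/reference pair only when the estimated identity reaches the chosen threshold, which is why no base-level alignment is needed.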

Address of the bookmark: https://github.com/marbl/MashMap

RGFA: powerful and convenient handling of assembly graphs

Rahul Nayak — Thu, 25 Jan 2018 05:47:53 -0600

RGFA is an implementation of the proposed GFA specification in Ruby. It allows the user to conveniently parse, edit and write GFA files. Complex operations such as the separation of the implicit instances of repeats and the merging of linear paths can be performed. A typical application of RGFA is the editing of a graph, to finish the assembly of a sequence, using information not available to the assembler. We illustrate a use case, in which the assembly of a repetitive metagenomic fosmid insert was completed using a script based on RGFA.
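
To make the parse, edit, write cycle concrete, here is a small Python sketch that handles only the segment (`S`) and link (`L`) records of a GFA1 file. It does not use or reproduce RGFA's Ruby API; the function names are hypothetical, optional tags are dropped, and other record types are ignored.

```python
def parse_gfa(text):
    """Parse GFA1 text into (segments, links); other record types are skipped."""
    segments = {}  # segment name -> sequence
    links = []     # (from, from_orient, to, to_orient, overlap)
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 6:
            links.append(tuple(fields[1:6]))
    return segments, links


def remove_segment(segments, links, name):
    """Edit step: drop a segment together with every link that touches it."""
    kept_segments = {n: s for n, s in segments.items() if n != name}
    kept_links = [link for link in links if name not in (link[0], link[2])]
    return kept_segments, kept_links


def write_gfa(segments, links):
    """Serialize segments and links back to GFA1 text with a version header."""
    lines = ["H\tVN:Z:1.0"]
    lines += [f"S\t{name}\t{seq}" for name, seq in segments.items()]
    lines += ["L\t" + "\t".join(link) for link in links]
    return "\n".join(lines) + "\n"


EXAMPLE = (
    "H\tVN:Z:1.0\n"
    "S\t1\tACGT\n"
    "S\t2\tGTTA\n"
    "S\t3\tTTAC\n"
    "L\t1\t+\t2\t+\t2M\n"
    "L\t2\t+\t3\t+\t3M\n"
)

segments, links = parse_gfa(EXAMPLE)
trimmed_segments, trimmed_links = remove_segment(segments, links, "3")
```

Operations such as merging linear paths or separating repeat instances build on the same representation but need orientation-aware graph traversal, which this sketch does not attempt.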

https://github.com/ggonnella/rgfa

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5103826/

HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies

Jit — Tue, 15 May 2018 07:35:26 -0500

HapCUT2 is a maximum-likelihood-based tool for assembling haplotypes from DNA sequence reads, designed to "just work" with excellent speed and accuracy. We found that previously described haplotype assembly methods are specialized for specific read technologies or protocols, with slow or inaccurate performance on others. With this in mind, HapCUT2 is designed for speed and accuracy across diverse sequencing technologies, including but not limited to:

- NGS short reads (Illumina HiSeq)
- clone-based sequencing (Fosmid or BAC clones)
- SMRT reads (PacBio)
- Oxford Nanopore reads
- 10X Genomics Linked-Reads
- proximity-ligation (Hi-C) reads
- high-coverage sequencing (>40x coverage-per-SNP) using the above technologies
- combinations of the above technologies (e.g. scaffold long reads with Hi-C reads)

See the project README for specific examples of command line options and best practices for some of these technologies.

NOTE: At this time HapCUT2 is for diploid organisms only. VCF input should contain diploid variants.

If you use HapCUT2 in your research, please cite: Edge, P., Bafna, V. & Bansal, V. HapCUT2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res. gr.213462.116 (2016). doi:10.1101/gr.213462.116

Address of the bookmark: https://github.com/vibansal/HapCUT2

wtdbg2: A fuzzy Bruijn graph approach to long noisy reads assembly

BioStar — Mon, 04 Feb 2019 04:53:47 -0600

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output.

./wtdbg2 -x rs -g 4.6m -t 16 -i reads.fa.gz -fo prefix
./wtpoa-cns -t 16 -i prefix.ctg.lay.gz -fo prefix.ctg.fa

Address of the bookmark: https://github.com/ruanjue/wtdbg2

TRITEX sequence assembly pipeline for Triticeae genomes

Jit — Tue, 20 Aug 2019 09:47:14 -0500

The pipeline is open-source and hosted in a public Bitbucket repository.

TRITEX has been run on highly inbred genotypes of barley (Hordeum vulgare), tetraploid wheat (Triticum turgidum) and hexaploid wheat (T. aestivum) with reasonable results: super-scaffold N50 values in the range of dozens of Mb and pseudomolecules with better gene space representation than a BAC-by-BAC assembly. It has never been tested and is not expected to work on heterozygous or autopolyploid genomes.
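
For readers unfamiliar with the N50 statistic quoted above, it is the length L such that sequences of length L or longer contain at least half of the total assembly length. A minimal Python sketch, with a hypothetical function name and made-up lengths:

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Comparing 2 * running with total avoids fractional arithmetic.
        if 2 * running >= total:
            return length
    return 0


# Made-up scaffold lengths in Mb, for illustration only.
scaffold_lengths = [80, 70, 50, 40, 30, 20]
print(n50(scaffold_lengths))  # prints 70
```

Here the total is 290 Mb, and the two longest scaffolds (80 + 70 = 150 Mb) already cover more than half of it, so the N50 is 70 Mb.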

A protocol for generating chromosome-conformation capture sequencing (Hi-C) data suitable for use with the pipeline is described in Himmelbach et al. 2018. Refer to the technical notes of 10X Genomics on how to generate Chromium data.

Address of the bookmark: https://tritexassembly.bitbucket.io/

Free Genomics data!

BioStar — Fri, 07 Feb 2020 14:08:31 -0600

The specimens were collected by the Oxford Wytham Woods and Edinburgh Lohse lab teams. DNA extraction and sequencing were carried out by the Sanger Institute Scientific Operations teams. Assemblies were carried out by the Tree of Life team (Shane McCarthy) and colleagues in Pacific Biosciences (Jonas Korlach).

Address of the bookmark: https://www.darwintreeoflife.org/an-initial-set-of-raw-genome-assemblies-from-the-darwin-tree-of-life-project/