BOL: Related items

CAT/BAT: tool for taxonomic classification of contigs and metagenome-assembled genomes (MAGs)

Jit — Mon, 18 May 2020 10:53:32 -0500

Contig Annotation Tool (CAT) and Bin Annotation Tool (BAT) are pipelines for the taxonomic classification of long DNA sequences and metagenome-assembled genomes (MAGs/bins) of both known and (highly) unknown microorganisms, as generated by contemporary metagenomics studies. The core algorithm of both programs involves gene calling, mapping of predicted ORFs against the nr protein database, and voting-based classification of the entire contig / MAG based on the classification of the individual ORFs. CAT and BAT can be run from intermediate steps if files are formatted appropriately (see Usage).
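The voting step described above can be illustrated with a small sketch. This is not CAT's or BAT's actual code or parameter set: the function name `classify_contig`, the `min_fraction` threshold and the example lineages and scores are all assumptions made for illustration. Each ORF contributes a score to every rank of its lineage, and the contig receives the deepest lineage whose summed score exceeds the chosen fraction of the total.

```python
from collections import defaultdict


def classify_contig(orf_hits, min_fraction=0.5):
    """Assign a lineage to a contig by score-weighted voting over its ORFs.

    orf_hits: list of (lineage, score) pairs, where lineage is a tuple
    ordered from the most general to the most specific taxon.
    Returns the deepest lineage prefix supported by more than
    min_fraction of the total score (an empty tuple if none qualifies).
    """
    total = sum(score for _, score in orf_hits)
    if total == 0:
        return ()

    # Every ORF votes for each prefix of its own lineage.
    support = defaultdict(float)
    for lineage, score in orf_hits:
        for depth in range(1, len(lineage) + 1):
            support[lineage[:depth]] += score

    # Keep the deepest prefix whose share of the vote is large enough.
    best = ()
    for prefix, prefix_score in support.items():
        if prefix_score / total > min_fraction and len(prefix) > len(best):
            best = prefix
    return best


# Hypothetical ORF classifications for a single contig.
hits = [
    (("Bacteria", "Proteobacteria", "Escherichia"), 200.0),
    (("Bacteria", "Proteobacteria", "Salmonella"), 150.0),
    (("Bacteria", "Firmicutes", "Bacillus"), 50.0),
]
print(classify_contig(hits))
```

With a threshold above one half, at most one taxon per rank can win, so conflicting ORFs push the result to a more general rank rather than forcing an arbitrary choice.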

Address of the bookmark: https://github.com/dutilh/CAT

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

Rahul Nayak — Tue, 05 Jun 2018 09:57:11 -0500

PERGA (Paired-End Reads Guided Assembler) is a novel read-guided de novo assembly approach that adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds. Instead of using single-end reads to construct contigs, PERGA uses paired-end reads and different read overlap sizes, from O ≥ Omax to Omin, to resolve gaps and branches. Moreover, by constructing a decision model with a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. PERGA tries to extend the contigs by all feasible nucleotides and uses a look-ahead technique to determine whether multiple extensions are due to sequencing errors or to repeats. It also tries to separate the different repeats of nearby genomic regions to make the assembly longer and more accurate. The simulated E. coli paired-end read data were generated using GemSim (KE McElroy, F Luciani, T Thomas. GemSIM: General, Error-Model Based Simulator of Next-Generation Sequencing Data. BMC Genomics 2012, 13:74), with coverage of 50x, 60x and 100x and a read length of 100 bp, and can be downloaded from https://github.com/zhuxiao/data_PERGA.

Address of the bookmark: https://github.com/hitbio/PERGA

UniqueKmer: Generate unique KMERs for every contig in a FASTA file

Abhi — Fri, 17 Dec 2021 00:08:15 -0600

Generate unique k-mers for every contig in a FASTA file.

Unique k-mers are k-mer keys (e.g. ATCGATCCTTAAGG) that are present in only one contig and absent from all other contigs (on both forward and reverse strands).

This tool accepts a FASTA file consisting of many contigs as input and extracts the unique k-mers for each contig.
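The idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the UniqueKMER implementation: the function names and the toy contigs are assumptions, and each k-mer is stored in canonical form (the lexicographically smaller of the k-mer and its reverse complement) so that both strands are treated as the same key.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Strand-independent key: the smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def unique_kmers(contigs, k):
    """Map each contig name to the canonical k-mers found in no other contig.

    contigs: dict of contig name -> sequence.
    """
    owners = defaultdict(set)  # canonical k-mer -> names of contigs containing it
    for name, seq in contigs.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with N or other symbols
                owners[canonical(kmer)].add(name)

    result = {name: set() for name in contigs}
    for kmer, names in owners.items():
        if len(names) == 1:
            result[next(iter(names))].add(kmer)
    return result


# Toy input: c2 contains the reverse complement of part of c1.
toy = {"c1": "ACGTA", "c2": "TACGG"}
print(unique_kmers(toy, 3))
```

In the toy input, every 3-mer of c1 also occurs in c2 on one strand or the other, so c1 has no unique k-mers at this k, which shows why the reverse strand has to be taken into account.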

The output unique k-mer file and Genome file can be used for fastv: https://github.com/OpenGene/fastv, which is an ultra-fast tool to identify and visualize microbial sequences from sequencing data.

https://github.com/OpenGene/UniqueKMER

Address of the bookmark: https://github.com/OpenGene/UniqueKMER

npScarf: Scaffolding and Completing Assemblies in Real-time Fashion

Jit — Tue, 23 May 2017 04:53:29 -0500

npScarf (jsa.np.npscarf) is a program that scaffolds and completes draft genome assemblies in real time with Oxford Nanopore sequencing. The pipeline can run on a computing cluster as well as on a laptop computer for microbial datasets. It also facilitates the real-time analysis of positional information such as gene ordering and the detection of genes from mobile elements (plasmids and genomic islands).

Complete paper at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5321748/

Address of the bookmark: https://github.com/mdcao/npScarf

HiTE: a fast and accurate dynamic boundary adjustment approach for full-length Transposable Elements detection and annotation in Genome Assemblies

LEGE — Sat, 20 Sep 2025 09:34:04 -0500

HiTE is a Python tool that uses a dynamic boundary adjustment approach to detect and annotate full-length transposable elements (TEs) in genome assemblies. In comparison to other tools, HiTE detects a greater number of full-length TEs.

panHiTE

We have developed panHiTE, a comprehensive and accurate pipeline for TE detection in large-scale population genomes. It has been successfully applied to hundreds of plant population genomes, demonstrating its effectiveness and scalability.

For detailed instructions, please refer to the panHiTE tutorial.

Address of the bookmark: https://github.com/CSU-KangHu/HiTE

Circlator: automated circularization of genome assemblies using long sequencing reads

Poonam Mahapatra — Tue, 15 May 2018 09:42:32 -0500

A tool to circularize genome assemblies. The algorithm and benchmarks are described in the Genome Biology manuscript. Citation: "Circlator: automated circularization of genome assemblies using long sequencing reads", Hunt et al, Genome Biology 2015 Dec 29;16(1):294. doi: 10.1186/s13059-015-0849-0. PMID: 26714481.

Address of the bookmark: http://sanger-pathogens.github.io/circlator/

ARC: pipeline which facilitates iterative, reference guided de novo assemblies

Jit — Thu, 26 Jul 2018 09:20:26 -0500

ARC is a pipeline which facilitates iterative, reference guided de novo assemblies with the intent of:

Reducing time in analysis and increasing accuracy of results by only considering those reads which should assemble together.
Reducing/removing reference bias as compared to mapping based approaches.

The software is designed to work in situations where a whole-genome assembly is not the objective, but rather when the researcher wishes to assemble discrete 'targets' contained within next-generation shotgun sequence data. ARC simplifies the traditionally difficult problem of assembly by breaking the reads into small, manageable subsets which can then be assembled quickly and efficiently in parallel. Applications include those in which the researcher wishes to de novo assemble specific content and a set of semi-similar reference targets is available to initialize the assembly process.

https://ibest.github.io/ARC/

Address of the bookmark: https://ibest.github.io/ARC/

SKESA: strategic k-mer extension for scrupulous assemblies

Jit — Wed, 14 Nov 2018 04:45:41 -0600

SKESA is a de Bruijn graph-based de novo assembler designed for assembling reads of microbial genomes sequenced using Illumina. Comparison with SPAdes and MegaHit shows that SKESA produces assemblies that have high sequence quality and contiguity, handles low-level contamination in reads, is fast, and produces an identical assembly for the same input when assembled multiple times with the same or different compute resources.

Source code for SKESA is freely available at https://github.com/ncbi/SKESA/releases.
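For readers unfamiliar with de Bruijn graph assembly, the following minimal sketch shows the general technique only. It is not SKESA's algorithm: it ignores reverse complements, sequencing errors, k-mer counts and incoming branches, and the function names and toy reads are assumptions. Reads are cut into k-mers, each k-mer links its (k-1)-prefix to its (k-1)-suffix, and a contig is extended for as long as the current node has exactly one successor.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a de Bruijn graph: (k-1)-mer prefix -> set of (k-1)-mer suffixes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_unambiguous(graph, start):
    """Extend a contig from start while there is exactly one way forward."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop instead of looping around a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig


# Two overlapping toy reads that share the k-mer GTC.
reads = ["ACGTC", "GTCAA"]
graph = build_de_bruijn(reads, 3)
print(extend_unambiguous(graph, "AC"))
```

Because the graph depends only on the set of k-mers in the input, this kind of construction is deterministic for a given read set, which is consistent with the reproducibility property described above.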


Address of the bookmark: https://github.com/ncbi/SKESA/releases

Hawkeye: an interactive visual analytics tool for genome assemblies

Abhimanyu Singh — Tue, 01 Jan 2019 11:56:17 -0600

Genome sequencing remains an inexact science, and genome sequences can contain significant errors if they are not carefully examined. Hawkeye is our new visual analytics tool for genome assemblies, designed to aid in identifying and correcting assembly errors. Users can analyze all levels of an assembly along with summary statistics and assembly metrics, and are guided by a ranking component towards likely mis-assemblies. Hawkeye is freely available and released as part of the open source AMOS project http://amos.sourceforge.net/hawkeye.

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2007-8-3-r34

Address of the bookmark: http://amos.sourceforge.net/wiki/index.php?title=Hawkeye

CrossMap: program for genome coordinates conversion between different assemblies

Rahul Nayak — Tue, 25 Jan 2022 17:59:32 -0600

CrossMap is a program for converting genome coordinates between different assemblies (such as hg18 (NCBI36) <=> hg19 (GRCh37)). It supports commonly used file formats including BAM, CRAM, SAM, Wiggle, BigWig, BED, GFF, GTF, MAF, VCF, and gVCF.
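Coordinate conversion of this kind boils down to looking up which aligned block a position falls in and applying that block's offset. The sketch below illustrates the principle and is not CrossMap's code or file format: the block tuples, the `make_lifter` name and the coordinates are hypothetical, all blocks are assumed to be on the same strand and chromosome, and real conversions use alignment data between the two assemblies rather than hand-written blocks.

```python
from bisect import bisect_right


def make_lifter(blocks):
    """Return a function mapping a source position to a target position.

    blocks: iterable of (src_start, src_end, dst_start) tuples describing
    non-overlapping, half-open aligned intervals on the same strand.
    """
    blocks = sorted(blocks)
    starts = [block[0] for block in blocks]

    def lift(pos):
        # Find the last block that starts at or before pos.
        i = bisect_right(starts, pos) - 1
        if i < 0:
            return None
        src_start, src_end, dst_start = blocks[i]
        if pos >= src_end:  # pos lies in an unaligned gap
            return None
        return dst_start + (pos - src_start)

    return lift


# Hypothetical alignment blocks between an old and a new assembly.
lift = make_lifter([(100, 200, 1100), (300, 400, 1250)])
print(lift(150), lift(250), lift(300))
```

Positions that fall in a gap between blocks return `None`, reflecting that some regions of one assembly have no counterpart in the other.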

Address of the bookmark: http://crossmap.sourceforge.net/