BOL: Related items

GenomeScope: open-source web tool to rapidly estimate the overall characteristics of a genome, including genome size, heterozygosity rate, and repeat content from unprocessed short reads

Jit — Fri, 21 Oct 2016 05:46:43 -0500

Summary: GenomeScope is an open-source web tool to rapidly estimate the overall characteristics of a genome, including genome size, heterozygosity rate, and repeat content from unprocessed short reads. These features are essential for studying genome evolution, and help to choose parameters for downstream analysis. We demonstrate its accuracy on 324 simulated and 16 real datasets with a wide range in genome sizes, heterozygosity levels, and error rates. Availability and Implementation: http://qb.cshl.edu/genomescope/, https://github.com/schatzlab/genomescope.git

Address of the bookmark: http://qb.cshl.edu/genomescope/

Software for genome assembly !

LEGE — Sun, 30 Aug 2020 09:51:38 -0500

List of bioinformatics tools and software websites for genome assembly:

1 Falcon https://github.com/PacificBiosciences/pb-assembly

2 Canu assembler http://canu.readthedocs.io/en/latest/index.html

3 Miniasm assembler https://github.com/lh3/miniasm

4 PBJelly scaffolding tool https://sourceforge.net/projects/pb-jelly/

5 ARCS scaffolding tool https://github.com/bcgsc/arcs

6 Redundans reduction and scaffolding tool https://github.com/Gabaldonlab/redundans

7 Arrow error correction https://github.com/PacificBiosciences/GenomicConsensus

8 PILON error correction https://github.com/broadinstitute/pilon/wiki

9 BUSCO single copy gene markers http://busco.ezlab.org/

10 Bandage graph assembly viewer https://rrwick.github.io/Bandage/

11 Gepard dotter http://cube.univie.ac.at/gepard

12 MUMmer aligner and plotter http://mummer.sourceforge.net/

odgi: optimized dynamic genome/graph implementation

Abhimanyu Singh — Tue, 01 Feb 2022 23:42:21 -0600

odgi provides an efficient and succinct dynamic DNA sequence graph model, as well as a host of algorithms that allow the use of such graphs in bioinformatic analyses.

Careful encoding of graph entities allows odgi to efficiently compute and transform pangenomes with minimal overhead. odgi implements a dynamic data structure that leverages multi-core CPUs and can be updated on the fly.

The edges and path steps are recorded as deltas between the current node id and the target node id, where the node id corresponds to the rank in the global array of nodes. Graphs built from biological data sets tend to have local partial order, so when they are sorted the deltas will be small. This allows them to be compressed with a variable-length integer representation, resulting in a small in-memory footprint at the cost of packing and unpacking.

The RAM and computational savings are substantial. In partially ordered regions of the graph, most deltas will require only a single byte.
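
The delta idea can be illustrated with a small Python sketch. It is not odgi's actual code: the zigzag mapping and the LEB128-style 7-bit varint used here are assumptions chosen for illustration, and the function names are hypothetical. It only shows why deltas between nearby node ranks fit in a single byte while links to distant nodes cost more.

```python
def zigzag(n: int) -> int:
    """Map a signed delta to an unsigned int so small magnitudes stay small."""
    return 2 * n if n >= 0 else -2 * n - 1


def encode_varint(value: int) -> bytes:
    """Encode a non-negative int in 7-bit groups, low group first.

    The high bit of each byte says whether another byte follows.
    """
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # last byte
            return bytes(out)


def encode_edge_deltas(node_rank: int, target_ranks: list) -> bytes:
    """Store each edge as (target rank - current rank), varint-packed."""
    out = bytearray()
    for target in target_ranks:
        out += encode_varint(zigzag(target - node_rank))
    return bytes(out)


# In a well-sorted graph, neighbours have nearby ranks: one byte per edge.
local = encode_edge_deltas(1_000_000, [1_000_001, 1_000_002, 999_999])
# Links to distant nodes need more bytes each.
distant = encode_edge_deltas(1_000_000, [5, 2_000_000])
print(len(local), len(distant))
```

With these inputs the three local edges take three bytes in total, while the two distant edges take three bytes each.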

Address of the bookmark: https://github.com/pangenome/odgi

Bioinformatics tools for genome assembly !

BioStar — Mon, 24 Jul 2023 07:04:26 -0500

There are numerous genome assembly tools available, each with its strengths and weaknesses. Here is a list of some widely used genome assembly tools, current as of September 2021:

SPAdes: An assembler specifically designed for single-cell and multi-cell bacterial genomes, as well as small eukaryotic genomes.
ABySS: A parallelized assembler for large genomes that uses de Bruijn graphs.
Velvet: Another de Bruijn graph-based assembler optimized for short-read sequencing data.
SOAPdenovo: A de Bruijn graph-based assembler designed for short reads, widely used for assembling large and complex genomes.
MaSuRCA: A hybrid assembler that combines data from multiple sequencing technologies, such as Illumina and PacBio.
Canu: A long-read assembler optimized for PacBio and Oxford Nanopore sequencing data.
Flye: A long-read assembler suitable for bacterial and small eukaryotic genomes.
SMARTdenovo: An assembler designed for long reads, particularly suited for PacBio data.
SPAdes Long Read (SPAdesLR): An extension of SPAdes for long-read data, such as those from PacBio or Nanopore.
Minia: An assembler optimized for low memory consumption, suitable for small and medium-sized genomes.
Unicycler: A hybrid assembler that combines short and long reads for circular bacterial genome assembly.
wtdbg2: A fuzzy Bruijn graph assembler for long reads, efficient for very large genomes.
Shasta: A long-read assembler that uses the Overlap-Layout-Consensus approach, suitable for PacBio and Nanopore data.
Sparc: A consensus tool for long-read assembly, designed to handle noisy long reads such as those from Nanopore sequencing.
CANA: An assembler for metagenomic data, particularly for complex and diverse microbial communities.
Ra Assembler: A metagenome assembler for long reads, designed for highly complex metagenomic samples.
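
Several of the tools listed above (ABySS, Velvet, SOAPdenovo) are described as de Bruijn graph-based. As a rough illustration of that data structure only, and not of how any of these assemblers is implemented, the following Python sketch builds a de Bruijn graph from reads: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function name is hypothetical, edge multiplicities are ignored, and real assemblers also handle reverse complements, sequencing errors and graph simplification.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Return {prefix: set of suffixes} over all k-mers found in the reads."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


graph = de_bruijn(["ACGTC"], k=3)
for node, successors in sorted(graph.items()):
    print(node, "->", sorted(successors))
```

The read ACGTC with k=3 yields the k-mers ACG, CGT and GTC, and therefore the path AC -> CG -> GT -> TC.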

Please note that the field of bioinformatics is constantly evolving, and new assembly tools may have emerged since this list was compiled. Additionally, the performance of these tools can vary depending on the characteristics of the sequencing data and the genome being assembled. When selecting an assembly tool, consider the specific requirements of your project, the available data types, and the computational resources at your disposal. Always refer to the respective tool's documentation and publications for the most up-to-date information and recommendations.

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of a closely related species to order contigs and estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes, ignoring:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
and removing overlapping matches on reference.
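
The three filtering rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not pyScaf's code: the Hit record, the function name, the cut-off values in the example and the tie-breaking policy (when two matches overlap on the reference, the higher-scoring one is kept) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str        # contig name
    ref: str          # reference chromosome
    ref_start: int    # 0-based, inclusive
    ref_end: int      # exclusive
    identity: float   # fraction of identical bases in the alignment
    overlap: float    # fraction of the contig covered by the alignment
    score: float


def filter_hits(hits, min_identity, min_overlap):
    # 1. Drop matches that do not satisfy the identity and overlap cut-offs.
    passing = [h for h in hits
               if h.identity >= min_identity and h.overlap >= min_overlap]
    # 2. Keep only the best match of each query.
    best = {}
    for h in passing:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    # 3. Remove matches overlapping an already kept match on the reference
    #    (assumed policy: visit matches from highest to lowest score).
    kept = []
    for h in sorted(best.values(), key=lambda h: -h.score):
        if all(h.ref != k.ref or h.ref_end <= k.ref_start
               or h.ref_start >= k.ref_end for k in kept):
            kept.append(h)
    # Contigs ordered along each reference chromosome.
    return sorted(kept, key=lambda h: (h.ref, h.ref_start))


hits = [
    Hit("c1", "chr1", 0, 1000, 0.95, 0.90, 900),
    Hit("c1", "chr2", 0, 500, 0.90, 0.60, 400),      # suboptimal match of c1
    Hit("c2", "chr1", 800, 1500, 0.92, 0.80, 600),   # overlaps c1 on chr1
    Hit("c3", "chr1", 1000, 2000, 0.70, 0.90, 650),  # below identity cut-off
    Hit("c4", "chr1", 2000, 3000, 0.99, 0.95, 990),
]
order = [h.query for h in filter_hits(hits, min_identity=0.8, min_overlap=0.5)]
print(order)
```

With this toy input only c1 and c4 survive, in that order along chr1.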

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, correctly reconstructing all chromosomes in every run for CANPA and in nearly every run for ARATH (Figures in dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.
pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which breaks synteny.
pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. Note, however, that this is an experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

CrossMap

Jitendra Narayan — Mon, 08 Feb 2016 15:47:00 -0600

CrossMap is a program for convenient conversion of genome coordinates (or annotation files) between different assemblies (such as Human hg18 (NCBI36) <> hg19 (GRCh37), Mouse mm9 (MGSCv37) <> mm10 (GRCm38)).

It supports most commonly used file formats including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, VCF.

CrossMap is designed to lift over genome coordinates between assemblies. It is not a program for aligning sequences to a reference genome.

We do not recommend using CrossMap to convert genome coordinates between species.
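
To make the idea of coordinate conversion concrete, here is a minimal Python sketch. It does not use or reproduce CrossMap itself, which takes its mapping from UCSC chain files; the block table and the function name below are hypothetical, and strand changes and chain parsing are left out. Each block says that a stretch of the old assembly corresponds, base for base, to a stretch of the new one.

```python
from bisect import bisect_right

# Hypothetical ungapped blocks on one old chromosome, sorted by old_start:
# (old_start, old_end, new_chrom, new_start), 0-based, end exclusive.
BLOCKS = [
    (0, 1000, "chr1", 0),
    (1000, 2500, "chr1", 1200),  # 200 bp were inserted in the new assembly
    (3000, 4000, "chr1", 2700),  # old 2500-3000 has no counterpart
]


def lift_position(pos, blocks=BLOCKS):
    """Convert an old coordinate to (chrom, new coordinate), or None."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    old_start, old_end, new_chrom, new_start = blocks[i]
    if pos >= old_end:                 # falls in a gap between blocks
        return None
    return new_chrom, new_start + (pos - old_start)


print(lift_position(1500))  # inside the second block
print(lift_position(2700))  # inside the gap, cannot be lifted
```

Position 1500 maps to position 1700 on chr1, while position 2700 lies in a region missing from the new assembly and returns None.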

More at http://crossmap.sourceforge.net/

Address of the bookmark: http://crossmap.sourceforge.net/

Easyfig

Poonam Mahapatra — Fri, 29 Apr 2016 05:49:39 -0500

Easyfig has moved to GitHub; for newer releases of Easyfig, please visit the new webpage: https://mjsull.github.io/Easyfig. Easyfig is a Python application for creating linear comparison figures of multiple genomic loci with an easy-to-use graphical user interface (GUI).

More at http://easyfig.sourceforge.net/

Address of the bookmark: http://easyfig.sourceforge.net/

BUSCO: Assessing genome assembly and annotation completeness with Benchmarking Universal Single-Copy Orthologs

Anjana — Tue, 10 May 2016 07:46:24 -0500

High-throughput genomics has revolutionized biological research. However, while the number of sequenced genomes grows by the day, quality assessment of the resulting assembled sequences remains complicated and mostly limited to technical measures like N50.
BUSCO provides measures for quantitative assessment of genome assembly, gene set, and transcriptome completeness based on evolutionarily informed expectations of gene content from near-universal single-copy orthologs selected from OrthoDB.
BUSCO assessments are implemented in open-source software, with comprehensive lineage-specific sets of Benchmarking Universal Single-Copy Orthologs for arthropods, vertebrates, metazoans, fungi, eukaryotes, and bacteria.
These conserved orthologs are ideal candidates for large-scale phylogenomics studies, and the annotated BUSCO gene models built during genome assessments provide a comprehensive gene predictor training set for use as part of genome annotation pipelines.
BUSCO assessments offer intuitive metrics, based on evolutionarily informed expectations of gene content from hundreds of species, to gauge completeness of rapidly accumulating genomic data and satisfy an Iberian's quest for quality - "Busco calidad/qualidade".
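
For comparison, N50, the technical measure mentioned above, is simple to compute: it is the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the assembly size. The Python sketch below is a generic illustration with a hypothetical function name and made-up contig lengths; it is not part of BUSCO. It shows why N50 reflects contiguity only and says nothing about gene content.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    if not lengths:
        raise ValueError("no contigs given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the assembly
            return length


contigs = [80, 70, 50, 40, 30, 20]  # total 290, half is 145
print(n50(contigs))                 # 80 + 70 = 150 >= 145, so N50 is 70
```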

Address of the bookmark: http://busco.ezlab.org/

Blobology

Jit — Mon, 13 Jun 2016 10:18:33 -0500

Tools for making blobplots or Taxon-Annotated-GC-Coverage plots (TAGC plots) to visualise the contents of genome assembly data sets as a QC step

Blaxter Lab, Institute of Evolutionary Biology, University of Edinburgh

Goal: To create blobplots or Taxon-Annotated-GC-Coverage plots (TAGC plots) to visualise the contents of genome assembly data sets as a QC step.

This repository accompanies the paper:
Blobology: exploring raw genome data for contaminants, symbionts and parasites using taxon-annotated GC-coverage plots. Sujai Kumar, Martin Jones, Georgios Koutsovoulos, Michael Clarke, Mark Blaxter
(submitted 2013-10-01 to Frontiers in Bioinformatics and Computational Biology special issue : Quality assessment and control of high-throughput sequencing data).

It contains bash/perl/R scripts for running the analysis presented in the paper: creating a preliminary assembly, then creating and collating GC content, read coverage and taxon annotation for that assembly. The results can then be visualised; for example, Figure 2a of the paper shows TAGC plots/blobplots for Caenorhabditis sp. 5.
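
As a rough illustration of the data behind a blobplot, the following Python sketch collates GC content, coverage and taxon label per contig. It is not taken from the blobology repository, which uses bash/perl/R; the function names, the toy sequences, the coverage values and the "no-hit" label are all hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def blob_table(contigs, coverage, taxa):
    """One row per contig: name, length, GC fraction, coverage, taxon."""
    rows = []
    for name, seq in contigs.items():
        rows.append((name, len(seq), round(gc_fraction(seq), 3),
                     coverage.get(name, 0.0), taxa.get(name, "no-hit")))
    return rows


contigs = {"contig1": "GGCCAATT", "contig2": "ATATATGC"}
coverage = {"contig1": 52.0, "contig2": 310.5}  # mean read depth, made up
taxa = {"contig1": "Nematoda"}                  # contig2 has no annotation
for row in blob_table(contigs, coverage, taxa):
    print(row)
```

Plotting GC fraction against coverage, coloured by taxon, gives the TAGC plot; contigs from a contaminant or symbiont tend to form a separate cluster.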

Address of the bookmark: https://github.com/blaxterlab/blobology

HGA

Jit — Tue, 29 Nov 2016 07:25:53 -0600

HGA tool version 1.0. This tool helps to apply the Hierarchical Genome Assembly (HGA) method. The tool will:
1. Partition a given reads dataset into a given number of partitions.
2. Assemble each partition using a pre-specified assembler (Velvet or SPAdes in this version) with a given k-mer size.
3. Merge all the assemblies of the partitions.
4. Combine all the assemblies of the partitions (using Velvet with a k-mer value of 31).
5. Finally, re-assemble the whole dataset with the merged contigs or the combined contigs, using a given k-mer size.
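
The first step, partitioning the reads, might look like the Python sketch below. The round-robin scheme and the function name are assumptions made for illustration; the description above does not say how HGA assigns reads to partitions. The assembly, merging and re-assembly steps call external assemblers and are not sketched here.

```python
def partition_reads(reads, n_partitions):
    """Distribute reads over n_partitions lists in round-robin order."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be at least 1")
    partitions = [[] for _ in range(n_partitions)]
    for i, read in enumerate(reads):
        partitions[i % n_partitions].append(read)
    return partitions


reads = ["r1", "r2", "r3", "r4", "r5"]
parts = partition_reads(reads, 2)
print(parts)
```

For paired-end data, both mates of a pair would need to land in the same partition, for example by treating each pair as a single item in the input list.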

Address of the bookmark: https://github.com/aalokaily/Hierarchical-Genome-Assembly-HGA