BOL: Related items

Mash: fast genome and metagenome distance estimation using MinHash

Jit — Tue, 12 Dec 2017 17:30:12 -0600

Mash is normally distributed as a dependency-free binary for Linux or OSX (see https://github.com/marbl/Mash/releases). This source distribution is intended for other operating systems or for development. Mash requires C++11 to build, which is available in GCC >= 4.8 and OSX >= 10.7.

See http://mash.readthedocs.org for more information.
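
The MinHash idea named in the title can be sketched in a few lines of Python. This is only an illustration, not Mash's code: the hash function (SHA-1 truncated to 64 bits), the function names, and the default k-mer and sketch sizes are assumptions chosen for readability, and canonical (strand-independent) k-mers are ignored. The distance formula, D = -(1/k) ln(2j / (1 + j)), is the one given in the Mash paper.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k=5, s=50):
    """Bottom-s sketch: the s smallest hash values of the k-mer set.

    SHA-1 truncated to 64 bits is a stand-in hash for this illustration.
    """
    hashes = sorted(
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    )
    return hashes[:s]


def jaccard_estimate(a, b, s):
    """Estimate the Jaccard index from two bottom-s sketches."""
    set_a, set_b = set(a), set(b)
    # Keep the s smallest hashes of the merged sketches, then count
    # how many of them appear in both inputs.
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash distance for k-mer size k."""
    if j == 0:
        return 1.0  # no shared hashes: report the maximum distance
    return -math.log(2 * j / (1 + j)) / k


if __name__ == "__main__":
    x = sketch("ACGTACGTTGCAACGTAGCTAGCTAGGATCCA", k=5, s=10)
    y = sketch("ACGTACGTTGCAACGTAGCTAGCTAGGTTCCA", k=5, s=10)
    j = jaccard_estimate(x, y, 10)
    print(j, mash_distance(j, 5))
```

Because only the s smallest hashes are kept per genome, comparing two sketches costs the same regardless of genome size, which is what makes this approach fast.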

Address of the bookmark: https://github.com/marbl/Mash/releases

dna2bit: an ultra-fast and accurate genomic distance estimation software

LEGE — Sun, 31 Aug 2025 06:24:58 -0500

dna2bit is a software tool developed in C++11, leveraging the capabilities of OpenMP for parallel computing and the popcount technique for efficient bit manipulation. It has been thoroughly tested using the g++ and clang compilers on both Linux and macOS platforms.
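
To illustrate what a popcount-based comparison can look like, here is a minimal Python sketch. It is not taken from dna2bit: the 2-bit encoding table, the function names, and the mismatch-counting scheme are all assumptions made for this example, and it handles only the bases A, C, G and T.

```python
# Hypothetical 2-bit encoding; dna2bit's actual encoding may differ.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(a, b):
    """Count differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0
    diff = pack(a) ^ pack(b)
    # A base differs if either of its two bits differs. Fold the high bit
    # of each pair onto the low bit, then keep only the low bits.
    low_bits = int("01" * len(a), 2)
    collapsed = (diff | (diff >> 1)) & low_bits
    # Population count: the number of set bits equals the mismatch count.
    return bin(collapsed).count("1")


if __name__ == "__main__":
    print(mismatches("ACGT", "ACGA"))
```

In C++ the final count would typically map to a single hardware popcount instruction per machine word, which is where the speed-up comes from.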

Address of the bookmark: https://github.com/lijuzeng/dna2bit

MimiLook: A Phylogenetic Workflow for Detection of Gene Acquisition in Major Orthologous Groups of Megavirales

Abhi — Mon, 10 Jan 2022 06:32:22 -0600

This tool detects statistically validated gene acquisition events with the help of the T-REX algorithm by comparing individual gene trees with the NCBI species tree. Between steps, the workflow handles paralogs, filters outputs, identifies Megavirales-specific OGs, detects HGTs, and retrieves information about those OGs that are monophyletic with organisms from the cellular domains of life.

https://www.readcube.com/articles/10.3390%2Fv9040072

Address of the bookmark: https://pubmed.ncbi.nlm.nih.gov/28387730/

kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome

Jit — Fri, 08 Dec 2017 16:48:40 -0600

Sept. 20, 2017: Version 3.1 released. Major upgrade. Version 3.1 fixes the problems with SNP annotation that arose when NCBI discontinued use of GI numbers. Please read carefully the Preface (page 3) and the File of annotated genomes section (pages 9-10) in the version 3.1 User Guide. Thanks to Tom Slezak for revising the get_genbank_file3 script and to Tod Stuber (USDA) for testing version 3.1 even though he doesn't need the annotation feature. All users are encouraged to upgrade to version 3.1.

Address of the bookmark: https://sourceforge.net/projects/ksnp/files/

Orthoflow: workflow for phylogenetic inference of genome-scale datasets of protein-coding genes

LEGE — Wed, 21 Feb 2024 06:13:08 -0600

Orthoflow is a workflow for phylogenetic inference of genome-scale datasets of protein-coding genes. Our goal was to make it straightforward to work from a combination of input sources including annotated contigs in GenBank format and FASTA files containing CDSs. It uses several state-of-the-art methods for orthology inference, either based on HMM profiles or de novo inference of orthogroups. Through the use of OrthoSNAP, many additional ortholog alignments can be generated from multi-copy gene families. For phylogenetic inference, users can choose a supermatrix approach and/or gene tree inference followed by supertree reconstruction. Users can specify a range of alignment filtering settings to retain high-quality alignments for phylogenetic inference. The workflow produces a detailed report that, in addition to the phylogenetic results, includes a range of diagnostics to verify the quality of the results.

Address of the bookmark: https://github.com/rbturnbull/orthoflow

AVID: A Global Alignment Program

Archana Malhotra — Wed, 24 May 2017 05:19:28 -0500

AVID is a new global alignment method. The method is designed to be fast, memory efficient, and practical for sequence alignments of large genomic regions up to megabases long. We present numerous applications of the method, ranging from the comparison of assemblies to alignment of large syntenic genomic regions and whole genome human/mouse alignments. We have also performed a quantitative comparison of AVID with other popular alignment tools. To this end, we have established a format for the representation of alignments and methods for their comparison. These formats and methods should be useful for future studies. The tools we have developed for the alignment comparisons, as well as the AVID program, are publicly available. See Web Site References section for AVID Web address and Web addresses for other programs discussed in this paper.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC430967/

GffCompare: Program for processing GTF/GFF files

Jit — Tue, 09 Jul 2019 13:35:13 -0500

The program gffcompare can be used to compare, merge, annotate and estimate accuracy of one or more GFF files (the “query” files), when compared with a reference annotation (also provided as GFF).

Address of the bookmark: https://ccb.jhu.edu/software/stringtie/gffcompare.shtml

Opera: An optimal genome scaffolding program

Jit — Mon, 27 Nov 2017 10:18:20 -0600

Opera (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end or long reads to optimally order and orient contigs assembled from shotgun-sequencing reads.

An updated version called OPERA-LG has been re-engineered with features for the assembly of large and complex genomes.

Song Gao, Denis Bertrand, Burton K. H. Chia and Niranjan Nagarajan. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology, May 2016, doi: 10.1186/s13059-016-0951-y.

Song Gao, Wing-Kin Sung, Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology, Sept. 2011, doi:10.1089/cmb.2011.0170.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0951-y

Address of the bookmark: https://sourceforge.net/projects/operasf/

Rebaler: program for conducting reference-based assemblies using long reads

Jit — Tue, 18 Sep 2018 07:52:41 -0500

Rebaler is a program for conducting reference-based assemblies using long reads. It relies mainly on minimap2 for alignment and Racon for making consensus sequences.

I made Rebaler for bacterial genomes (specifically for the task of testing basecallers). It should in principle work for non-bacterial genomes as well, but I haven't tested it.

Address of the bookmark: https://github.com/rrwick/Rebaler

CroCo: A program to detect potential cross contaminations in HTS assembled transcriptomes using expression level quantification

Jit — Mon, 07 Jan 2019 18:17:44 -0600

CroCo is a program to detect cross contamination events in assembled transcriptomes using sequencing reads to determine the true origin of every transcript.
Such cross contaminations can be expected if several RNA-Seq experiments were prepared during the same period at the same lab, or by the same people, or if they were processed or sequenced by the same sequencing service facility.
Our approach first determines a subset of transcripts that are suspiciously similar across samples using a pairwise BLAST procedure. CroCo then combines all transcriptomes into a metatranscriptome and quantifies the "expression level" of all transcripts successively using every sample's read data (e.g. several species sequenced by the same lab for a particular study) while allowing read multi-mappings.
Several mapping tools implemented in CroCo can be used to estimate expression level (default is RapMap).
This information is then used to assign each transcript to one of the following five categories:

clean: the transcript origin is from the focal sample.

cross contamination: the transcript origin is from an alien sample of the same experiment.

dubious: expression levels are too close between focal and alien samples to determine the true origin of the transcript.

low coverage: expression levels are too low in all samples, thus hampering our procedure (which relies on differential expression) to confidently assign it to any category.

over expressed: expression levels are very high in at least 3 samples and CroCo will not try to categorize it. Indeed, such a pattern does not correspond to expectations for cross contaminations, but often reflects highly conserved genes such as ribosomal genes, or external contamination shared by several samples (e.g. Escherichia coli contaminations).
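
The five categories above can be read as a small decision procedure. The Python sketch below is one possible reading, not CroCo's implementation: the function name, the order of the checks, and every threshold value (fold, min_expr, over_expr) are hypothetical and chosen only to make the logic concrete. Only the "at least 3 samples" rule for over-expressed transcripts comes from the description.

```python
def categorize(focal, levels, fold=2.0, min_expr=0.2, over_expr=300.0,
               over_samples=3):
    """Assign a transcript to one of the five categories.

    focal  -- name of the sample the transcript was assembled from
    levels -- dict mapping sample name to the transcript's expression level
    All threshold defaults are hypothetical illustration values.
    """
    # Very high expression in several samples: do not try to categorize.
    if sum(1 for v in levels.values() if v >= over_expr) >= over_samples:
        return "over expressed"

    # Too little signal everywhere to compare expression levels.
    if all(v < min_expr for v in levels.values()):
        return "low coverage"

    focal_level = levels[focal]
    alien_max = max(
        (v for sample, v in levels.items() if sample != focal), default=0.0
    )

    # Clearly higher in the focal sample than in any alien sample.
    if focal_level >= fold * alien_max:
        return "clean"
    # Clearly higher in some alien sample than in the focal sample.
    if alien_max >= fold * focal_level:
        return "cross contamination"
    # Levels too close to call.
    return "dubious"


if __name__ == "__main__":
    print(categorize("A", {"A": 100.0, "B": 1.0, "C": 0.0}))
    print(categorize("A", {"A": 1.0, "B": 100.0}))
```

Checking the over-expressed and low-coverage cases first mirrors the description: in both situations the differential-expression comparison is skipped rather than attempted.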

Address of the bookmark: https://gitlab.mbb.univ-montp2.fr/mbb/CroCo