BOL: Related items

PhyloHerb: A high‐throughput phylogenomic pipeline for processing genome skimming data

Abhi — Wed, 06 Sep 2023 00:14:28 -0500

Phylogenomic Analysis Pipeline for Herbarium Specimens

What is PhyloHerb: PhyloHerb is a wrapper program for processing genome skimming data collected from plant materials. Its outputs include plastid genome (plastome) assemblies, mitochondrial genome assemblies, nuclear ribosomal DNAs (NTS+ETS+18S+ITS1+5.8S+ITS2+28S), alignments of gene and intergenic regions, and a species tree. It is designed as a high-throughput program for lower-quality data. Examples include low-coverage (5x cpDNA) plastome phylogeny, recycling plastid genes from target enrichment data, and retrieving low-copy nuclear genes from medium-coverage (5x nucDNA) genome skimming data.

License: GNU General Public License

Citation:

Cai, Liming, Hongrui Zhang, and Charles C. Davis. 2022. PhyloHerb: A high‐throughput phylogenomic pipeline for processing genome‐skimming data. Applications in Plant Sciences 10(3): 1–9. https://doi.org/10.1002/aps3.11475

Address of the bookmark: https://github.com/lmcai/PhyloHerb/

Libraries or management tools for high throughput sequencing data

LEGE — Fri, 04 Oct 2024 02:45:06 -0500

GATB Library: the Genome Analysis Toolbox with de-Bruijn graph. Many of the tools developed by the GenScale team are based on this library.
These methods enable the analysis of data sets of any size on multi-core desktop computers, including very large volumes of read data from any kind of organism, such as bacteria, plants, animals, and even complex samples (e.g. metagenomes). Among them are (the full list is available here: https://gatb.inria.fr/software/):
LRez: C++ Library and toolkit for the barcode-based management and indexation of linked-read datasets.

Variant calling and/or genotyping

DiscoSNP++ and discoSnpRAD: Reference-free small variant discovery (SNPs and indels)
MindTheGap: Detection and assembly of large insertion variants
TakeABreak: reference-free inversion discovery tool
SVJedi: Structural Variant genotyper with long read data
SVJedi-graph: Structural Variant genotyper with long read data using a variation graph

Sequence assembly

MinYS: reference-guided genome assembly in metagenomics data
MTG-link: local assembly tool for linked-read data
Minia: De novo short read assembler
de-novo pipeline: de-novo assembly pipeline (error correction / contigs / scaffolding) for genomes and meta-genomes
Mapsembler2: Targeted assembly (not maintained)

Managing k-mers & indexation

findere: simple strategy for speeding up queries and for reducing false positive calls from any Approximate Membership Query data structure.
fimpera: extends findere by adding abundance information.
kmtricks: modular tool suite for counting kmers, and constructing Bloom filters or kmer matrices, for large collections of sequencing data.
kmindex: tool for indexing and querying sequencing samples, built on top of kmtricks.
back to sequences: Find sequences (reads, unitigs, genes) related to a set of kmers in large datasets, in a matter of seconds.
Backpack Quotient Filter: k-mer indexing data structure with abundance
short read connector: Detect similar reads from potentially large read set
DSK: counts k-mers in sequences

Pangenome graph manipulation

Pancat: Pangenome Comparison and Analysis Toolkit
GFAGraphs: a Python library to handle pangenome graph files in GFA format.

Comparative metagenomics with k-mers

Simka and SimkaMin: Comparative metagenomics for large-scale datasets
Comparead & Commet: comparison of metagenomic datasets

Species and bacterial strains identification

ORI: software using long nanopore reads to identify bacteria present in a sample at the strain level
StrainFLAIR: STRAIN-level proFiLing using vArIation gRaph

General-purpose sequencing data manipulation

GASSST: long read mapper
Leon: short read compressor (now included in GATB-core)
Bloocoo: short read corrector
BCALM: Construct compacted de Bruijn graphs (unitigs)

Protein Structure

A_Purva: Contact Map Overlap solver
MD-Jeep: Distance Geometry solver
CSA: Comparative Structural Alignment

Workflow

SLICEE: parallel execution of bioinformatics workflows

Comparative Genomics

CASSIS: detection of rearrangement breakpoints
PLAST: intensive bank-to-bank sequence comparison
DRJBreakpointFinder: detection and precise localization of excision sites in proviral segments

ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data

Jit — Mon, 19 Feb 2018 06:46:15 -0600

ETE v3 features numerous improvements in the underlying library of methods and provides a novel set of standalone tools to perform common tasks in comparative genomics and phylogenetics.

The new features include

(i) building gene-based and supermatrix-based phylogenies using a single command,

(ii) testing and visualizing evolutionary models,

(iii) calculating distances between trees of different size or including duplications, and

(iv) providing seamless integration with the NCBI taxonomy database.

ETE is freely available at http://etetoolkit.org

Address of the bookmark: http://etetoolkit.org

lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data

BioJoker — Tue, 27 Nov 2018 04:43:57 -0600

lordFAST is a sensitive tool for mapping long reads with high error rates. lordFAST is specially designed for aligning reads from PacBio sequencing technology but provides the user the ability to change alignment parameters depending on the reads and application.

lordFAST is a novel long-read mapper specifically designed to align reads generated by PacBio, and potentially other SMS technologies, to a reference. It not only has higher sensitivity than the available alternatives, it is also among the fastest and has a very low memory footprint.

Address of the bookmark: https://github.com/vpc-ccg/lordfast

Genome in a Bottle (GIAB) Consortium

Jit — Sat, 25 Jan 2020 13:50:52 -0600

The Genome in a Bottle (GIAB) Consortium is a public-private-academic consortium hosted by NIST to develop the technical infrastructure (reference standards, reference methods, and reference data) to enable translation of whole human genome sequencing to clinical practice.

https://www.nist.gov/news-events/news/2016/09/nist-releases-new-family-standardized-genomes

Address of the bookmark: https://jimb.stanford.edu/giab/

AutoGluon: AutoML for Text, Image, and Tabular Data

Jit — Thu, 07 Jan 2021 05:33:17 -0600

AutoGluon automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models on text, image, and tabular data.

Address of the bookmark: https://github.com/awslabs/autogluon

NASA Open Science Data Repository

Abhi — Wed, 18 Dec 2024 11:54:47 -0600

The NASA Open Science Data Repository (OSDR) enables access to space-related data from experiments and missions that investigate biological and health responses of terrestrial life to spaceflight. The goal of OSDR is to enable multi-modal and multi-hierarchical fundamental space life science data to be reused toward basic science, applied science, and operational outcomes for space exploration and knowledge discovery. These data include ‘omics, phenotypic, physiological, behavioral, hardware, and environmental telemetry data; raw and processed data; and tabular, text, code, bioimaging, and video formats.

https://www.nasa.gov/reference/osdr-data-processing/

Address of the bookmark: https://www.nasa.gov/osdr/

Installing Perl environment on Linux

biogeek — Tue, 26 Dec 2017 21:21:50 -0600

With plenv, you can easily install and switch among different versions of Perl. It is installed under your home directory in ~/.plenv.

Install the latest Perl (with multithreading support) and CPANMinus.

 $ cd
 $ git clone https://github.com/tokuhirom/plenv.git ~/.plenv
 $ git clone https://github.com/tokuhirom/Perl-Build.git ~/.plenv/plugins/perl-build/
 $ echo 'export PATH="$HOME/.plenv/bin:$PATH"' >> ~/.bashrc
 $ echo 'eval "$(plenv init -)"' >> ~/.bashrc
 $ source ~/.bashrc
 $ plenv install 5.18.1 -Dusethreads
 $ plenv rehash
 $ plenv global 5.18.1
 $ plenv install-cpanm

git is distributed revision control and source code management software; here it is used to download files from the GitHub server.
echo means "print".
>> appends the output to the end of the file, while > overwrites the whole file. Use > with extra care.
On Linux, a command produces two kinds of output: standard output (STDOUT) and standard error (STDERR). 1> redirects STDOUT only, 2> redirects STDERR only, and &> redirects both. By default, > is the same as 1>.
eval runs the text it is given as a shell command; here it executes the setup code printed by plenv init -. exec, by contrast, runs a command in place of the current shell.
Remember to build Perl with multithreading support (option -Dusethreads), which is important for many NGS analysis packages (e.g. Trinity). With this setting, Perl software can use multiple CPUs.
Install the CPAN (Comprehensive Perl Archive Network) client, CPANMinus, with plenv install-cpanm.
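
The redirection operators described in these notes can be tried safely in a scratch directory. The following is a minimal sketch, assuming a bash shell; the file names and the helper function both are made up for illustration.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so nothing real is overwritten.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1

# '>' creates the file or overwrites it; '>>' appends to the end.
echo "first"  >  notes.txt
echo "second" >> notes.txt   # notes.txt now holds two lines
echo "third"  >  notes.txt   # overwritten: only "third" remains

# Illustrative helper that writes one line to each output stream.
both() { echo "to stdout"; echo "to stderr" >&2; }

both 1> out.txt 2> err.txt   # STDOUT and STDERR go to separate files
both &> all.txt              # bash shorthand: both streams into one file
both > all_posix.txt 2>&1    # portable equivalent of '&>'
```

After running it, notes.txt contains only the last line written with >, while out.txt and err.txt each hold one of the two streams.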

You can use plenv global and plenv local to switch between Perl versions to meet the needs of different Perl software.

For example, if a specific version of Perl is not compatible with your script, you can switch to a different version with:

 $ plenv local <version>

This is similar to setting the local version of a scripting language with pyenv or rbenv.

Add the following line to your ~/.bashrc file.

export PERL5LIB="$HOME/.plenv/build/perl-5.18.1/lib"

Install BioPerl and PerlIO::gzip

CPANMinus is a very good Perl module manager; using cpanm to install BioPerl can save you a lot of time. Here are some useful modules:

$ cpanm Bio::Perl
$ cpanm Bio::SearchIO
$ cpanm PerlIO::gzip
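
After installation, it can be worth confirming that the active Perl is threaded and that the modules actually load. This is a minimal sketch; has_perl_module is a hypothetical helper name used only for illustration and is not part of plenv or cpanm.

```shell
# Report whether the active perl was built with thread support.
perl -V:usethreads

# Hypothetical helper: succeeds if the named Perl module can be loaded.
has_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Check the modules installed in the step above.
for mod in Bio::Perl Bio::SearchIO PerlIO::gzip; do
    if has_perl_module "$mod"; then
        echo "$mod: installed"
    else
        echo "$mod: missing"
    fi
done
```

A threaded build prints usethreads='define'; and an unthreaded one prints usethreads='undef';.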

For more information, please visit: https://github.com/tokuhirom/plenv

Environment for Tree Exploration (ETE) is a Python programming toolkit that assists in the reconstruction, manipulation, analysis and visualization of phylogenetic trees

Rahul Nayak — Wed, 27 Nov 2019 05:32:33 -0600

The Environment for Tree Exploration (ETE) is a Python programming toolkit that assists in the reconstruction, manipulation, analysis and visualization of phylogenetic trees (although clustering trees or any other tree-like data structure are also supported).

Other tools

https://github.com/shenwei356/taxonkit

ETE, version: 3.1.1
BioPython, version: 1.73
taxadb, version: 0.10.1
TaxonKit, version: 0.5.0

Address of the bookmark: https://pypi.org/project/ete3/3.1.1/

BASE: a practical de novo assembler for large genomes using long NGS reads

Rahul Nayak — Fri, 19 Oct 2018 07:25:21 -0500

BASE is a new de novo assembler. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. These consensus sequences are then extended into contigs.

Address of the bookmark: https://github.com/dhlbh/BASE