BOL: Related items

Libraries or management tools for high throughput sequencing data

LEGE — Fri, 04 Oct 2024 02:45:06 -0500

GATB Library. The Genome Analysis Toolbox with de-Bruijn graph. Many of the tools developed by the GenScale team are built on this library.
These methods enable the analysis of data sets of any size on multi-core desktop computers, including very large volumes of read data from any kind of organism, such as bacteria, plants, animals and even complex samples (e.g. metagenomes). Among them are (the full list is available here: https://gatb.inria.fr/software/):
LRez: C++ Library and toolkit for the barcode-based management and indexation of linked-read datasets.

Variant calling and/or genotyping

DiscoSNP++ and discoSnpRAD: Reference-free small variant discovery (SNPs and indels)
MindTheGap: Detection and assembly of large insertion variants
TakeABreak: reference-free inversion discovery tool
SVJedi: Structural Variant genotyper with long read data
SVJedi-graph: Structural Variant genotyper with long read data using a variation graph

Sequence assembly

MinYS: reference-guided genome assembly in metagenomics data
MTG-link: local assembly tool for linked-read data
Minia: De novo short read assembler
de-novo pipeline: de-novo assembly pipeline (error correction / contigs / scaffolding) for genomes and meta-genomes
Mapsembler2: Targeted assembly (not maintained)

Managing k-mers & indexation

findere: simple strategy for speeding up queries and for reducing false positive calls from any Approximate Membership Query data structure.
- fimpera: extends findere by adding abundance information.
kmtricks: modular tool suite for counting kmers, and constructing Bloom filters or kmer matrices, for large collections of sequencing data.
kmindex: tool for indexing and querying sequencing samples, built on top of kmtricks.
back to sequences: Find sequences (reads, unitigs, genes) related to a set of kmers in large datasets, in a matter of seconds.
Backpack Quotient Filter: k-mer indexing data structure with abundance
short read connector: Detect similar reads in potentially large read sets
DSK: Count k-mers in sequences
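
As a toy illustration of what the k-mer counters in this list compute, the sketch below counts canonical k-mers in memory. It is not how DSK or kmtricks work internally (those tools are designed for data sets far too large for a plain in-memory table); the function names `canonical` and `count_kmers` are made up for this example.

```python
from collections import Counter

# Translation table used to complement a DNA string.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(sequences, k):
    """Count canonical k-mers over an iterable of DNA strings.

    Windows containing anything other than A, C, G or T (for example N)
    are skipped.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


if __name__ == "__main__":
    for kmer, n in sorted(count_kmers(["ACGTACGT", "TTNACG"], 3).items()):
        print(kmer, n)
```

Counting canonical k-mers means a k-mer and its reverse complement share one counter, which is the usual convention when the strand of each read is unknown.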

Pangenome graph manipulation

Pancat: Pangenome Comparison and Analysis Toolkit
GFAGraphs: a Python library to handle pangenome graph files in GFA format.

Comparative metagenomics with k-mers

Simka and SimkaMin: Comparative metagenomics for large-scale datasets
Comparead & Commet: comparison of metagenomic datasets

Species and bacterial strains identification

ORI: software using long nanopore reads to identify bacteria present in a sample at the strain level
StrainFLAIR: STRAIN-level proFiLing using vArIation gRaph

General-purpose sequencing data manipulation

GASSST: long read mapper
Leon: short read compressor (now included in GATB-core)
Bloocoo: short read corrector
BCALM: Construct compacted de Bruijn graphs (unitigs)

Protein Structure

A_Purva: Contact Map Overlap solver
MD-Jeep: Distance Geometry solver
CSA: Comparative Structural Alignment

Workflow

SLICEE: parallel execution of bioinformatics workflows

Comparative Genomics

CASSIS: detection of rearrangement breakpoints
PLAST: intensive bank-to-bank sequence comparison
DRJBreakpointFinder: detection and precise localization of excision sites in proviral segments

ACANA: An accurate and consistent alignment tool for DNA sequences

Jit — Wed, 06 Dec 2017 09:45:29 -0600

ACANA is an accurate and consistent alignment tool for DNA sequences. ACANA is specifically designed for aligning sequences that share only some moderately conserved regions and/or have a high frequency of long insertions or deletions. It attempts to combine the best of local and global alignment algorithms in searching for evolutionarily related regions of sequences in order to achieve the best alignment. ACANA is also robust to small changes in alignment parameters, particularly the gap extension score. As an accurate alignment tool, ACANA is particularly useful in comparative sequence analysis for identifying conserved functional regulatory elements.

Address of the bookmark: https://www.niehs.nih.gov/research/resources/software/biostatistics/acana/index.cfm

MSAProbs - Parallel and accurate multiple sequence alignment

Neel — Tue, 09 Jul 2019 23:58:44 -0500

MSAProbs is a well-established, state-of-the-art multiple sequence alignment algorithm for protein sequences. Its design combines pair hidden Markov models and partition functions to calculate posterior probabilities. Assessed on the popular benchmarks BAliBASE, PREFAB, SABmark and OXBENCH, MSAProbs achieves statistically significant accuracy improvements over the existing top-performing aligners, including ClustalW, MAFFT, MUSCLE, ProbCons and Probalign. In addition, MSAProbs is optimized for shared-memory CPUs through a multi-threaded design, and is further parallelized for distributed-memory systems using MPI to overcome the high memory overhead barrier and achieve good parallel and data-size scalability.

Address of the bookmark: http://msaprobs.sourceforge.net/homepage.htm#latest

Liftoff: An accurate GFF3/GTF lift over pipeline

Neel — Sun, 20 Dec 2020 01:36:37 -0600

Liftoff is a tool that accurately maps annotations in GFF or GTF format between assemblies of the same or closely related species. Unlike current coordinate lift-over tools, which require a pre-generated “chain” file as input, Liftoff is a standalone tool that takes two genome assemblies and a reference annotation as input and outputs an annotation of the target genome.
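
The three inputs and one output described above map onto a short command line. The sketch below only builds that command; the `-g` and `-o` options and the target-then-reference argument order follow the Liftoff README as far as I recall, so check `liftoff -h` for the installed version. The file names and the helper `liftoff_command` are placeholders invented for this example.

```python
import shlex


def liftoff_command(target_fasta, reference_fasta, reference_gff, output_gff):
    """Build an argument list for a basic Liftoff run.

    Assumed usage (verify with `liftoff -h`):
        liftoff -g <reference annotation> -o <output> <target.fa> <reference.fa>
    """
    return [
        "liftoff",
        "-g", reference_gff,   # annotation to lift over
        "-o", output_gff,      # lifted-over annotation of the target genome
        target_fasta,          # assembly to annotate
        reference_fasta,       # assembly the annotation belongs to
    ]


if __name__ == "__main__":
    cmd = liftoff_command("target.fa", "reference.fa", "reference.gff3", "target.gff3")
    # Print the command instead of running it; pass `cmd` to
    # subprocess.run(cmd, check=True) once the paths are real.
    print(shlex.join(cmd))
```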

Address of the bookmark: https://github.com/agshumate/Liftoff

OrthoGNC: A Software for Accurate Identification of Orthologs Based on Gene Neighborhood Conservation

Jit — Tue, 14 Nov 2017 09:30:35 -0600

Orthology relations can be used to transfer annotations from one gene (or protein) to another. Hence, detecting orthology relations has become an important task in the post-genomic era. Various genomic events, such as duplication and horizontal gene transfer, can cause erroneous assignment of orthology relations. In closely-related species, gene neighborhood information can be used to resolve many ambiguities in orthology inference. Here we present OrthoGNC, a software for accurately predicting pairwise orthology relations based on gene neighborhood conservation. Analyses on simulated and real data reveal the high accuracy of OrthoGNC. In addition to orthology detection, OrthoGNC can be employed to investigate the conservation of genomic context among potential orthologs detected by other methods. OrthoGNC is freely available online at http://bs.ipm.ir/softwares/orthognc and http://tinyurl.com/orthoGNC.

http://www.comp.nus.edu.sg/~wongls/projects/orthoGNC/

Address of the bookmark: http://www.sciencedirect.com/science/article/pii/S1672022917301663

JASMINE: Jointly Accurate Sv Merging with Intersample Network Edges

Shruti Paniwala — Sat, 02 Jul 2022 11:41:53 -0500

This tool is used to merge structural variants (SVs) across samples. Each sample has a number of SV calls, consisting of position information (chromosome, start, end, length), type and strand information, and a number of other values. Jasmine represents the set of all SVs across samples as a network, and uses a modified minimum spanning forest algorithm to determine the best way of merging the variants such that each merged variant represents a set of analogous variants occurring in different samples.
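
To make the network idea concrete, here is a heavily simplified sketch of a spanning-forest merge in the Kruskal style: candidate edges are visited from shortest to longest, and an edge is accepted only if the two groups it would join share no sample. This is not Jasmine's implementation. The tuple layout, the distance (sum of start and length differences), the `max_dist` threshold and the name `merge_svs` are all assumptions made for illustration.

```python
from itertools import combinations


def merge_svs(svs, max_dist):
    """Group SV calls across samples with a constrained Kruskal-style merge.

    svs: list of (sample, chrom, sv_type, start, length) tuples.
    Returns a sorted list of groups, each a list of indices into svs.
    """
    parent = list(range(len(svs)))
    samples = [{sv[0]} for sv in svs]  # samples present in each group

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Candidate edges: same chromosome and type, different samples, close enough.
    edges = []
    for i, j in combinations(range(len(svs)), 2):
        a, b = svs[i], svs[j]
        if a[0] == b[0] or a[1:3] != b[1:3]:
            continue
        dist = abs(a[3] - b[3]) + abs(a[4] - b[4])
        if dist <= max_dist:
            edges.append((dist, i, j))

    # Shortest edges first; skip any edge that would put two calls
    # from the same sample into one merged variant.
    for _, i, j in sorted(edges):
        ri, rj = find(i), find(j)
        if ri != rj and not (samples[ri] & samples[rj]):
            parent[rj] = ri
            samples[ri] |= samples[rj]

    groups = {}
    for i in range(len(svs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    calls = [
        ("s1", "chr1", "DEL", 100, 50),
        ("s2", "chr1", "DEL", 105, 52),
        ("s1", "chr1", "DEL", 110, 50),
        ("s2", "chr2", "DEL", 100, 50),
    ]
    print(merge_svs(calls, max_dist=20))
```

The all-pairs edge construction is quadratic and only suitable for small inputs; a real tool would restrict candidates to near neighbours.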

Address of the bookmark: https://github.com/mkirsche/Jasmine

FOGSAA: Fast Optimal Global Sequence Alignment Algorithm

Jit — Fri, 08 Dec 2017 14:41:08 -0600

Sequence alignment algorithms are widely used to infer similarity and points of difference between a pair of sequences. FOGSAA is a fast global alignment algorithm. It is essentially a branch-and-bound approach that expands branches greedily, taking symbols from the given pair of sequences (protein or nucleotide), and produces an optimal alignment faster than conventional dynamic programming techniques. It also gives better alignment quality than heuristic methods.
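
For reference, the "conventional dynamic programming" baseline that FOGSAA is compared against can be sketched as a Needleman-Wunsch score computation. This is the baseline, not FOGSAA's branch-and-bound search, and the scoring values (match +1, mismatch -1, linear gap -2) are arbitrary choices for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (Needleman-Wunsch).

    Uses a linear gap penalty and keeps only two rows of the DP table,
    so memory is O(len(b)) while time stays O(len(a) * len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

The dynamic program always fills every cell of the table, which is the cost a branch-and-bound method tries to avoid by pruning branches that cannot beat the best alignment found so far.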

Address of the bookmark: http://www.isical.ac.in/~bioinfo_miu/FOGSAA.htm

Opera: An optimal genome scaffolding program

Jit — Mon, 27 Nov 2017 10:18:20 -0600

Opera (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end or long reads to optimally order and orient contigs assembled from shotgun-sequencing reads.

An updated version called OPERA-LG has been re-engineered with features for the assembly of large and complex genomes.

Song Gao, Denis Bertrand, Burton K. H. Chia and Niranjan Nagarajan. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology, May 2016, doi: 10.1186/s13059-016-0951-y.

Song Gao, Wing-Kin Sung, Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology, Sept. 2011, doi:10.1089/cmb.2011.0170.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0951-y

Address of the bookmark: https://sourceforge.net/projects/operasf/

SPAdes hybrid genome assembly

Jit — Mon, 27 Nov 2017 08:05:40 -0600

When you have both Illumina and Nanopore data, SPAdes remains a good option for hybrid assembly; it was used to produce the B. fragilis assembly by Mick Watson’s group.

Again, running spades.py will show you the options:

spades.py

This produces:

SPAdes genome assembler v3.10.1

Usage: /usr/local/SPAdes-3.10.1-Linux/bin/spades.py [options] -o 

Basic options:
-o          directory to store all the resulting files (required)
--sc                    this flag is required for MDA (single-cell) data
--meta                  this flag is required for metagenomic sample data
--rna                   this flag is required for RNA-Seq data
--plasmid               runs plasmidSPAdes pipeline for plasmid detection
--iontorrent            this flag is required for IonTorrent data
--test                  runs SPAdes on toy dataset
-h/--help               prints this usage message
-v/--version            prints version

Input data:
--12          file with interlaced forward and reverse paired-end reads
-1            file with forward paired-end reads
-2            file with reverse paired-end reads
-s            file with unpaired reads
--pe<#>-12            file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-1             file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-2             file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-s             file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-    orientation of reads for paired-end library number <#> (<#> = 1,2,..,9;  = fr, rf, ff)
--s<#>                file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)
--mp<#>-12            file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-1             file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-2             file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-s             file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-    orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9;  = fr, rf, ff)
--hqmp<#>-12          file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-1           file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-2           file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-s           file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-  orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9;  = fr, rf, ff)
--nxmate<#>-1         file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--nxmate<#>-2         file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--sanger              file with Sanger reads
--pacbio              file with PacBio reads
--nanopore            file with Nanopore reads
--tslr        file with TSLR-contigs
--trusted-contigs             file with trusted contigs
--untrusted-contigs           file with untrusted contigs

Pipeline options:
--only-error-correction runs only read error correction (without assembling)
--only-assembler        runs only assembling (without read error correction)
--careful               tries to reduce number of mismatches and short indels
--continue              continue run from the last available check-point
--restart-from      restart run with updated options and from the specified check-point ('ec', 'as', 'k', 'mc')
--disable-gzip-output   forces error correction not to compress the corrected reads
--disable-rr            disables repeat resolution stage of assembling

Advanced options:
--dataset             file with dataset description in YAML format
-t/--threads               number of threads
                                [default: 16]
-m/--memory                RAM limit for SPAdes in Gb (terminates if exceeded)
                                [default: 250]
--tmp-dir              directory for temporary files
                                [default: /tmp]
-k                 comma-separated list of k-mer sizes (must be odd and
                                less than 128) [default: 'auto']
--cov-cutoff             coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset  <33 or 64>      PHRED quality offset in the input reads (33 or 64)
                                [default: auto-detect]

As you can see this is also a “pipeline” of tools that can be switched on or off. SPAdes takes quite a long time, so for the purposes of this practical, something like this may suffice:

spades.py -t 4 \
          -m 32 \
          -k 31,51,71 \
          --only-assembler \
          -1 miseq.1.fastq -2 miseq.2.fastq \
          --nanopore minion.fastq \
          -o hybrid_assembly

In turn, these parameters mean:

use 4 threads
cap memory at 32 GB (SPAdes terminates if the limit is exceeded)
use three k-mer sizes to build the de Bruijn graph(s): 31, 51 and 71
run only the assembler, skipping read error correction (for speed)
read 1 and read 2 of the MiSeq data
the nanopore data
put the output in the folder “hybrid_assembly”
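
If the same run needs to be launched from a script, the command above can be assembled programmatically. The sketch below is one way to do that, using only options that appear in the help text and example above; the helper name `spades_hybrid_command` and its defaults are illustrative. The k-mer check mirrors the help text ("must be odd and less than 128").

```python
import shlex


def spades_hybrid_command(read1, read2, nanopore, outdir,
                          threads=4, memory_gb=32, kmers=(31, 51, 71),
                          only_assembler=True):
    """Build the argument list for a hybrid Illumina + Nanopore SPAdes run."""
    for k in kmers:
        # SPAdes requires odd k-mer sizes below 128.
        if k % 2 == 0 or k >= 128:
            raise ValueError(f"invalid k-mer size: {k}")
    cmd = [
        "spades.py",
        "-t", str(threads),
        "-m", str(memory_gb),
        "-k", ",".join(str(k) for k in kmers),
    ]
    if only_assembler:
        cmd.append("--only-assembler")  # skip read error correction
    cmd += ["-1", read1, "-2", read2, "--nanopore", nanopore, "-o", outdir]
    return cmd


if __name__ == "__main__":
    cmd = spades_hybrid_command("miseq.1.fastq", "miseq.2.fastq",
                                "minion.fastq", "hybrid_assembly")
    # Printed rather than executed; use subprocess.run(cmd, check=True)
    # to launch the assembly for real.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when file names contain spaces.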

jobTree-based Python wrapper to run the genome simulation tool suite Evolver

Jit — Fri, 08 Dec 2017 16:26:32 -0600

evolverSimControl (eSC) can be used to simulate multi-chromosome genome evolution on an arbitrary phylogeny (Newick format). In addition to simply running evolver, eSC also automatically creates statistical summaries of the simulation as it runs including text and image files. Also included are convenience scripts to: check on a running simulation and see detailed status and logging information; extract fasta sequence files from the leaf nodes of a completed simulation; extract pairwise multiple alignment files (.maf) from leaf and branch nodes from a completed simulation and with the help of mafJoin, join them together into a single maf covering the entire simulation.

Address of the bookmark: https://github.com/dentearl/evolverSimControl