BOL: Related items

GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies.

Abhimanyu Singh — Tue, 23 May 2017 05:20:32 -0500

GRASS (GeneRic ASsembly Scaffolder)-a novel algorithm for scaffolding second-generation sequencing assemblies capable of using diverse information sources. GRASS offers a mixed-integer programming formulation of the contig scaffolding problem, which combines contig order, distance and orientation in a single optimization objective. The resulting optimization problem is solved using an expectation-maximization procedure and an unconstrained binary quadratic programming approximation of the original problem. We compared GRASS with existing HTS scaffolders using Illumina paired reads of three bacterial genomes. Our algorithm constructs a comparable number of scaffolds, but makes fewer errors. This result is further improved when additional data, in the form of related genome sequences, are used.

Address of the bookmark: https://github.com/AlexeyG/GRASS

AMStat: display statistics of large sequence files from next generation sequencing projects

Neel — Fri, 09 Nov 2018 13:34:56 -0600

SAMStat is an efficient C program to quickly display statistics of large sequence files from next generation sequencing projects. When applied to SAM/BAM files all statistics are reported for unmapped, poorly and accurately mapped reads separately. This allows for identification of a variety of problems, such as remaining linker and adaptor sequences, causing poor mapping. Apart from this SAMStat can be used to verify individual processing steps in large analysis pipelines.

Address of the bookmark: http://samstat.sourceforge.net/

FiNGS: Filters for Next Generation Sequencing

Neel — Sat, 27 Feb 2021 01:18:35 -0600

Key features

Filters SNVs from any variant caller to remove false positives
Calculates metrics based on BAM files and provides filtering not possible with other tools
Fully user-configurable filtering (including which filters to use and their thresholds)
Option to use filters identical to ICGC recommendations

FiNGS provides researchers with a tool to reproducibly filter somatic variants that is simple to both deploy and use, with filters and thresholds that are fully configurable by the user. It ingests and emits standard variant call format (VCF) files and will slot into existing sequencing pipelines. It allows users to develop and implement their own filtering strategies and simple sharing of these with others.

FiNGS reliably improves upon the precision of default variant caller outputs and performs better than other tools designed for the same task.

Address of the bookmark: https://github.com/cpwardell/FiNGS

Libraries or management tools for high throughput sequencing data

LEGE — Fri, 04 Oct 2024 02:45:06 -0500

GATB Library. The Genome Analysis Toolbox with de-Bruijn graph. A large part of tools developed by the GenScale team are based on this library.
These methods enable the analysis of data sets of any size on multi-core desktop computers, including very huge amount of reads data coming from any kind of organisms such as bacteria, plants, animals and even complex samples (e.g. metagenomes). Among them are (the full is available here: https://gatb.inria.fr/software/):
LRez: C++ Library and toolkit for the barcode-based management and indexation of linked-read datasets.

Variant calling and/or genotyping

DiscoSNP++ and discoSnpRAD: Reference-free small variant discovery (SNPs and indels)
MindTheGap: Detection and assembly of large insertion variants
TakeABreak: reference-free inversion discovery tool
SVJedi: Structural Variant genotyper with long read data
SVJedi-graph: Structural Variant genotyper with long read data using a variation graph

Sequence assembly

MinYS: reference-guided genome assembly in metagenomics data
MTG-link: local assembly tool for linked-read data
Minia: De novo short read assembler
de-novo pipeline: de-novo assembly pipeline (error correction / contigs / scaffolding) for genomes and meta-genomes
Mapsembler2: Targeted assembly (not maintained)

Managing k-mers & indexation

findere: simple strategy for speeding up queries and for reducing false positive calls from any Approximate Membership Query data structure.
- fimpera extends findere adding the abundance information.
kmtricks: modular tool suite for counting kmers, and constructing Bloom filters or kmer matrices, for large collections of sequencing data.
kmindex is a tool for indexing and querying sequencing samples. It is built on top of kmtricks.
back to sequences: Find sequences (reads, unitigs, genes) related to a set of kmers in large datasets, in a matter of seconds.
Backpack Quotient Filter: k-mer indexing data structure with abundance
short read connector: Detect similar reads from potentially large read set
DSK: Count K-mer in sequences

Pangenome graph manipulation

Pancat: Pangenome Comparison and Analysis Toolkit
GFAGraphs: a Python library to handle pangenome graph files in GFA format.

Comparative metagenomics with k-mers

Simka and SimkaMin: Comparative metagenomics for large-scale datasets
Comparead & Commet: comparison of metagenomic datasets

Species and bacterial strains identification

ORI: software using long nanopore reads to identify bacteria present in a sample at the strain level
StrainFLAIR: STRAIN-level proFiLing using vArIation gRaph

General-purpose sequencing data manipulation

GASSST: long read mapper
Leon: short read compressor (now included in GATB-core)
Bloocoo: short read corrector
BCALM: Construct compacted de Bruijn graphs (unitigs)

Protein Structure

A_Purva: Contact Map Overlap solver
MD-Jeep: Distance Geometry solver
CSA: Comparative Structural Alignment

Workflow

SLICEE: parallel execution of bioinformatics workflows

Comparative Genomics

CASSIS: detection of rearrangement breakpoints
PLAST: intensive bank-to-bank sequence comparison
DRJBreakpointFinder: detection and precise localization of excision sites in proviral segments

NanoPack: visualizing and processing long-read sequencing data

Jit — Tue, 25 Dec 2018 21:20:50 -0600

The NanoPack tools are written in Python3 and released under the GNU GPL3.0 License. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools.

Address of the bookmark: https://github.com/wdecoster/nanopack

Circlator: automated circularization of genome assemblies using long sequencing reads

Poonam Mahapatra — Tue, 15 May 2018 09:42:32 -0500

A tool to circularize genome assemblies. The algorithm and benchmarks are described in the Genome Biology manuscript. Citation: "Circlator: automated circularization of genome assemblies using long sequencing reads", Hunt et al, Genome Biology 2015 Dec 29;16(1):294. doi: 10.1186/s13059-015-0849-0. PMID: 26714481.

Address of the bookmark: http://sanger-pathogens.github.io/circlator/

SimLoRD: A read simulator for third generation sequencing reads

Aaryan Lokwani — Wed, 22 Aug 2018 10:40:27 -0500

SimLoRD is a read simulator for third generation sequencing reads and is currently focused on the Pacific Biosciences SMRT error model.

Reads are simulated from both strands of a provided or randomly generated reference sequence.

The reference can be read from a FASTA file or randomly generated with a given GC content. It can consist of several chromosomes, whose structure is respected when drawing reads. (Simulation of genome rearrangements may be incorporated at a later stage.)
The read lengths can be determined in four ways: drawing from a log-normal distribution (typical for genomic DNA), sampling from an existing FASTQ file (typical for RNA), sampling from a a text file with integers (RNA), or using a fixed length
Quality values and number of passes depend on fragment length.
Provided subread error probabilities are modified according to number of passes
Outputs reads in FASTQ format and alignments in SAM format

Address of the bookmark: https://bitbucket.org/genomeinformatics/simlord/

LoRMA: a tool for correcting sequencing errors in long reads such those produced by Pacific Biosciences sequencing machines

Jit — Wed, 15 Jun 2016 17:18:36 -0500

LoRMA is a tool for correcting sequencing errors in long reads such those produced by Pacific Biosciences sequencing machines.

Publication:

L. Salmela, R. Walve, E. Rivals, and E. Ukkonen: Accurate selfcorrection of errors in long reads using de Bruijn graphs. Accepted to RECOMB-Seq 2016.

Download:

Address of the bookmark: https://www.cs.helsinki.fi/u/lmsalmel/LoRMA/

Jabba: Hybrid Error Correction for Long Sequencing Reads

Jit — Fri, 05 Jan 2018 03:58:14 -0600

Jabba is a hybrid error correction tool to correct third generation (PacBio / ONT) sequencing data, using second generation (Illumina) data.

Input

Jabba takes as input a concatenated de Bruijn graph and a set of sequences:

the de Bruijn graph should appear in fasta format with 1 entry per node, the meta information should be in the format:
>NODE
the set of sequences should be in fasta or fastq format. These sequences will be corrected (e.g. PacBio reads). The corrections will be written to a file Jabba fasta.
The output is a file in fasta format with corrections of the long reads, and additionally a file in the input format containing uncorrected reads.

https://github.com/biointec/jabba/wiki

https://almob.biomedcentral.com/articles/10.1186/s13015-016-0075-7

Address of the bookmark: https://github.com/biointec/jabba

Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data

Jit — Thu, 11 Feb 2021 21:39:05 -0600

Ktrim is written in C++ for GNU Linux/Unix platforms. After uncompressing the source package, you can find an executable file ktrim under bin/ directory compiled using g++ v4.8.5 and linked with libz v1.2.7 for Linux x86_64 system. If you could not run it (which is usually caused by low version of libc++ or libz library) or you want to build a version optimized for your system, you can re-compile the programs:

user@linux$ make clean && make

Address of the bookmark: https://github.com/hellosunking/Ktrim