BOL: Related items

EXCAVATOR: detecting copy number variants from whole-exome sequencing data

Radha Agarkar — Fri, 04 Jan 2019 10:10:48 -0600

EXCAVATOR is a novel software package for the detection of copy number variants (CNVs) from whole-exome sequencing data. EXCAVATOR combines a three-step normalization procedure with a novel heterogeneous hidden Markov model algorithm and a calling method that classifies genomic regions into five copy number states. We validate EXCAVATOR on three datasets and compare the results with three other methods. These analyses show that EXCAVATOR outperforms the other methods and is therefore a valuable tool for the investigation of CNVs in large-scale projects, as well as in clinical research and diagnostics. EXCAVATOR is freely available at http://sourceforge.net/projects/excavatortool/.

EXCAVATOR has been published in Genome Biology (http://genomebiology.com/2013/14/10/R120/abstract).

Address of the bookmark: https://sourceforge.net/projects/excavatortool/

McClintock: Meta-pipeline to identify transposable element insertions using next generation sequencing data

BioStar — Tue, 27 Oct 2020 00:21:18 -0500

McClintock (https://github.com/bergmanlab/mcclintock) is an integrated bioinformatics pipeline for the detection of transposable element (TE) insertions in whole-genome shotgun data. It automatically runs and standardizes output for multiple TE detection methods. We demonstrate the utility of McClintock by evaluating six TE detection methods using simulated and real genome data from the model microbial eukaryote, Saccharomyces cerevisiae.

Address of the bookmark: https://github.com/bergmanlab/mcclintock

Libraries or management tools for high throughput sequencing data

LEGE — Fri, 04 Oct 2024 02:45:06 -0500

GATB Library: the Genome Analysis Toolbox with de-Bruijn graph. Many of the tools developed by the GenScale team are based on this library.
These methods enable the analysis of data sets of any size on multi-core desktop computers, including very large volumes of read data from any kind of organism, such as bacteria, plants, animals and even complex samples (e.g. metagenomes). Among them are (the full list is available here: https://gatb.inria.fr/software/):
LRez: C++ Library and toolkit for the barcode-based management and indexation of linked-read datasets.

Variant calling and/or genotyping

DiscoSNP++ and discoSnpRAD: Reference-free small variant discovery (SNPs and indels)
MindTheGap: Detection and assembly of large insertion variants
TakeABreak: reference-free inversion discovery tool
SVJedi: Structural Variant genotyper with long read data
SVJedi-graph: Structural Variant genotyper with long read data using a variation graph

Sequence assembly

MinYS: reference-guided genome assembly in metagenomics data
MTG-link: local assembly tool for linked-read data
Minia: De novo short read assembler
de-novo pipeline: de-novo assembly pipeline (error correction / contigs / scaffolding) for genomes and meta-genomes
Mapsembler2: Targeted assembly (not maintained)

Managing k-mers & indexation

findere: simple strategy for speeding up queries and for reducing false positive calls from any Approximate Membership Query data structure.
fimpera: extension of findere that adds abundance information.
kmtricks: modular tool suite for counting kmers, and constructing Bloom filters or kmer matrices, for large collections of sequencing data.
kmindex: tool for indexing and querying sequencing samples, built on top of kmtricks.
back to sequences: Find sequences (reads, unitigs, genes) related to a set of kmers in large datasets, in a matter of seconds.
Backpack Quotient Filter: k-mer indexing data structure with abundance
short read connector: Detect similar reads from potentially large read set
DSK: k-mer counting in sequences

Pangenome graph manipulation

Pancat: Pangenome Comparison and Analysis Toolkit
GFAGraphs: a Python library to handle pangenome graph files in GFA format.

Comparative metagenomics with k-mers

Simka and SimkaMin: Comparative metagenomics for large-scale datasets
Comparead & Commet: comparison of metagenomic datasets

Species and bacterial strains identification

ORI: software using long nanopore reads to identify bacteria present in a sample at the strain level
StrainFLAIR: STRAIN-level proFiLing using vArIation gRaph

General-purpose sequencing data manipulation

GASSST: long read mapper
Leon: short read compressor (now included in GATB-core)
Bloocoo: short read corrector
BCALM: Construct compacted de Bruijn graphs (unitigs)

Protein Structure

A_Purva: Contact Map Overlap solver
MD-Jeep: Distance Geometry solver
CSA: Comparative Structural Alignment

Workflow

SLICEE: parallel execution of bioinformatics workflows

Comparative Genomics

CASSIS: detection of rearrangement breakpoints
PLAST: intensive bank-to-bank sequence comparison
DRJBreakpointFinder: detection and precise localization of excision sites in proviral segments

LRCstats: a tool for evaluating long reads correction methods

Aaryan Lokwani — Wed, 22 Aug 2018 11:05:04 -0500

LRCstats is an open-source pipeline for benchmarking DNA long read correction algorithms for long reads produced by third generation sequencing technologies, such as the machines made by Pacific Biosciences. As the name suggests, these reads are longer than those produced by next generation sequencing technologies, such as Illumina. However, long reads are plagued by high error rates, which can cause issues in downstream analysis. Long read correction algorithms reduce the error rate of long reads either through self-correction or by using accurate short reads produced by next generation sequencing technologies.

Address of the bookmark: https://github.com/cchauve/lrcstats

Wtdbg2: a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore

Neel — Fri, 19 Oct 2018 08:48:43 -0500

Wtdbg2 is a de novo sequence assembler for long noisy reads produced by PacBio or Oxford Nanopore Technologies (ONT). It assembles raw reads without error correction and then builds the consensus from intermediate assembly output. Wtdbg2 is able to assemble the human and even the 32Gb Axolotl genome at a speed tens of times faster than CANU and FALCON while producing contigs of comparable base accuracy.

Address of the bookmark: https://github.com/ruanjue/wtdbg2

TULIP - The Uncorrected Long read Integration Pipeline

Jit — Thu, 23 Nov 2017 09:30:01 -0600

Running TULIP (The Uncorrected Long-read Integration Process), version 0.4, late 2016 (European eel)

TULIP currently consists of two Perl scripts, tulipseed.perl and tulipbulb.perl. These are very much intended as prototypes, and additional components and/or implementations are likely to follow.
Tulipseed takes as input alignment files of long reads to sparse short seeds, and outputs a graph and scaffold structures. Tulipbulb adds long read sequencing data to these.

Address of the bookmark: https://github.com/Generade-nl/TULIP

TULIP - The Uncorrected Long read Integration Pipeline

Jit — Tue, 15 May 2018 09:06:37 -0500

TULIP currently consists of two Perl scripts, tulipseed.perl and tulipbulb.perl. These are very much intended as prototypes, and additional components and/or implementations are likely to follow. Tulipseed takes as input alignment files of long reads to sparse short seeds, and outputs a graph and scaffold structures.

Address of the bookmark: https://github.com/Generade-nl/TULIP

SimLoRD: A read simulator for third generation sequencing reads

Aaryan Lokwani — Wed, 22 Aug 2018 10:40:27 -0500

SimLoRD is a read simulator for third generation sequencing reads and is currently focused on the Pacific Biosciences SMRT error model.

Reads are simulated from both strands of a provided or randomly generated reference sequence.

The reference can be read from a FASTA file or randomly generated with a given GC content. It can consist of several chromosomes, whose structure is respected when drawing reads. (Simulation of genome rearrangements may be incorporated at a later stage.)
The read lengths can be determined in four ways: drawing from a log-normal distribution (typical for genomic DNA), sampling from an existing FASTQ file (typical for RNA), sampling from a text file with integers (RNA), or using a fixed length.
Quality values and number of passes depend on fragment length.
Provided subread error probabilities are modified according to the number of passes.
Reads are output in FASTQ format and alignments in SAM format.
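
As a rough illustration of the first read-length option and of drawing reads from both strands, here is a minimal Python sketch. It is not SimLoRD's code: the function names, the parameter values (mu, sigma, min_len, max_len) and the rejection of out-of-range lengths are assumptions made for this example, and no error model or quality values are simulated.

```python
import random

# Complement table for DNA bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def draw_read_lengths(n, mu, sigma, min_len=50, max_len=None, seed=None):
    """Draw n read lengths from a log-normal distribution.

    mu and sigma are the mean and standard deviation of log(length).
    They are illustrative inputs, not SimLoRD defaults.
    """
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        length = int(round(rng.lognormvariate(mu, sigma)))
        if length < min_len:
            continue  # reject reads that are too short
        if max_len is not None and length > max_len:
            continue  # reject reads that are too long
        lengths.append(length)
    return lengths


def sample_read(reference, length, rng):
    """Cut a read of the given length from a random position and strand."""
    length = min(length, len(reference))
    start = rng.randrange(len(reference) - length + 1)
    fragment = reference[start:start + length]
    if rng.random() < 0.5:
        fragment = reverse_complement(fragment)  # read from the reverse strand
    return fragment
```

Calling `draw_read_lengths(1000, mu=8.0, sigma=0.5, seed=42)` gives a reproducible list of lengths that can then be passed one by one to `sample_read`.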

Address of the bookmark: https://bitbucket.org/genomeinformatics/simlord/

GSP4PDB: a web tool to visualize, search and explore protein-ligand structural patterns

Neel — Sun, 15 Mar 2020 03:41:12 -0500

GSP4PDB is a user-friendly and efficient application to search and discover new patterns of protein-ligand interaction.

GSP4PDB is part of the services provided by the Bioinformatic Group of the University of Talca.

http://gdblab.com/gsp4pdb/gsp4pdb2/

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3352-x

Address of the bookmark: http://gdblab.com/gsp4pdb/gsp4pdb2/

MetaGraph: Ultra Scalable Framework for DNA Search, Alignment, Assembly

Abhi — Sat, 08 Jun 2024 16:15:25 -0500

The MetaGraph framework is designed to work with a wide range of input data sets, indexing from a few samples up to the contents of entire archives with hundreds of thousands of records. The indexing workflow always follows the same principle, transforming single input samples into error-removed, refined sample graphs, which are then merged into a joint metagraph index. Each input sample is annotated in the joint index as a subgraph. This graph index enriched with metadata can then be used for downstream applications such as sequence search or differential assembly.
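
The indexing principle described above (clean each sample, merge into a joint index, annotate by sample, then search) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not MetaGraph's implementation or API: it uses plain k-mer sets and a dictionary instead of a compressed graph, treats low-abundance k-mers as errors, and all function names and thresholds are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(sequence, k):
    """Yield every k-mer of a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def clean_sample(reads, k, min_count=2):
    """Keep k-mers seen at least min_count times (crude error removal)."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, count in counts.items() if count >= min_count}


def build_joint_index(samples, k, min_count=2):
    """Merge cleaned per-sample k-mer sets into one index.

    Each k-mer maps to the set of sample labels that contain it.
    """
    index = defaultdict(set)
    for label, reads in samples.items():
        for km in clean_sample(reads, k, min_count):
            index[km].add(label)
    return dict(index)


def search(index, query, k, min_fraction=0.8):
    """Return labels of samples holding at least min_fraction of the query k-mers."""
    query_kmers = list(kmers(query, k))
    if not query_kmers:
        return set()
    hits = Counter()
    for km in query_kmers:
        for label in index.get(km, ()):
            hits[label] += 1
    return {label for label, n in hits.items()
            if n / len(query_kmers) >= min_fraction}
```

For example, with `samples = {"s1": ["ACGTACGT", "ACGTACGT"]}`, building the index with `k=4` and searching for `"ACGTACGT"` returns the label `s1`.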

Search link: https://metagraph.ethz.ch/search

Pre-print: https://www.biorxiv.org/content/10.1101/2020.10.01.322164v4

Address of the bookmark: https://metagraph.ethz.ch/