BOL: Related items

VariantBam: Filtering and profiling of next-generational sequencing data using region-specific rules

Rahul Nayak — Thu, 04 Oct 2018 16:30:44 -0500

VariantBam is a tool to extract/count specific sets of sequencing reads from next-generational sequencing files. To save money, disk space and I/O, one may not want to store an entire BAM on disk. In many cases, it would be more efficient to store only those read-pairs or reads who intersect some region around the variant locations. Alternatively, if your scientific question is focused on only one aspect of the data (e.g. breakpoints), many reads can be removed without losing the information relevant to the problem.

Address of the bookmark: https://github.com/broadinstitute/VariantBam

MIX: Combining multiple assemblies from NGS data

Rahul Nayak — Tue, 08 May 2018 04:58:05 -0500

Mix is a tool that combines two or more draft assemblies, without relying on a reference genome and has the goal to reduce contig fragmentation and thus speed-up genome finishing. The proposed algorithm builds an extension graph where vertices represent extremities of contigs and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a path in the extension graph that maximizes the cumulative contig length.

The Mix algorithm, approach and results were published in BMC bioinformatics : http://www.biomedcentral.com/1471-2105/14/S15/S16.

Address of the bookmark: https://github.com/cbib/MIX

Parliament2: Runs a combination of tools to generate structural variant calls on whole-genome sequencing data

Jit — Thu, 28 May 2020 21:57:03 -0500

Parliament2 identifies structural variants in a given sample relative to a reference genome. These structural variants cover large deletion events that are called as Deletions of a region, Insertions of a sequence into a region, Duplications of a region, Inversions of a region, or Translocations between two regions in the genome.

Parliament2 runs a combination of tools to generate structural variant calls on whole-genome sequencing data. It can run the following callers: Breakdancer, Breakseq2, CNVnator, Delly2, Manta, and Lumpy. Because of synergies in how the programs use computational resources, these are all run in parallel. Parliament2 will produce the outputs of each of the tools for subsequent investigation.

Address of the bookmark: https://github.com/dnanexus/parliament2

BFC: a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data

Jit — Thu, 31 May 2018 09:35:23 -0500

BFC is a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data. It is specifically designed for high-coverage whole-genome human data, though also performs well for small genomes. The BFC algorithm is a variant of the classical spectrum alignment algorithm introduced by Pevzner et al (2001). It uses an exhaustive search to find a k-mer path through a read that minimizes a heuristic objective function jointly considering penalties on correction, quality and k-mer support. This algorithm was first implemented in my fermi assembler and then refined a few times in fermi, fermi2 and now in BFC. In the k-mer counting phase, BFC uses a blocked bloom filter to filter out most singleton k-mers and keeps the rest in a hash table (Melsted and Pritchard, 2011). The use of bloom filter is how BFC is named, though other correctors such as Lighter and Bless actually rely more on bloom filter than BFC. https://github.com/lh3/bfc

Address of the bookmark: https://github.com/lh3/bfc

ngs-bits - Short-read sequencing tools

Neel — Thu, 16 Jan 2020 23:14:00 -0600

Binaries of ngs-bits are available via Bioconda. Alternatively, ngs-bits can be built from sources:

Binaries for Linux/macOS
From sources for Linux/macOS
From sources for Windows

Address of the bookmark: https://github.com/imgag/ngs-bits

NanoPack: visualizing and processing long-read sequencing data

Jit — Tue, 25 Dec 2018 21:20:50 -0600

The NanoPack tools are written in Python3 and released under the GNU GPL3.0 License. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools.

Address of the bookmark: https://github.com/wdecoster/nanopack

GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies.

Abhimanyu Singh — Tue, 23 May 2017 05:20:32 -0500

GRASS (GeneRic ASsembly Scaffolder)-a novel algorithm for scaffolding second-generation sequencing assemblies capable of using diverse information sources. GRASS offers a mixed-integer programming formulation of the contig scaffolding problem, which combines contig order, distance and orientation in a single optimization objective. The resulting optimization problem is solved using an expectation-maximization procedure and an unconstrained binary quadratic programming approximation of the original problem. We compared GRASS with existing HTS scaffolders using Illumina paired reads of three bacterial genomes. Our algorithm constructs a comparable number of scaffolds, but makes fewer errors. This result is further improved when additional data, in the form of related genome sequences, are used.

Address of the bookmark: https://github.com/AlexeyG/GRASS

nanofilt: Filtering and trimming of long read sequencing data

Jit — Mon, 30 Jul 2018 12:01:52 -0500

Filtering on quality and/or read length, and optional trimming after passing filters.
Reads from stdin, writes to stdout.

Intended to be used:

directly after fastq extraction
prior to mapping
in a stream between extraction and mapping

https://github.com/wdecoster/nanofilt

Address of the bookmark: https://github.com/wdecoster/nanofilt

AMStat: display statistics of large sequence files from next generation sequencing projects

Neel — Fri, 09 Nov 2018 13:34:56 -0600

SAMStat is an efficient C program to quickly display statistics of large sequence files from next generation sequencing projects. When applied to SAM/BAM files all statistics are reported for unmapped, poorly and accurately mapped reads separately. This allows for identification of a variety of problems, such as remaining linker and adaptor sequences, causing poor mapping. Apart from this SAMStat can be used to verify individual processing steps in large analysis pipelines.

Address of the bookmark: http://samstat.sourceforge.net/

List of bioinformatics workflow management tools !

Rahul Nayak — Sat, 20 Mar 2021 00:15:25 -0500

Here are list of Workflow Managers

BigDataScript – A cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities. [ paper-2014 | web ]
Bpipe – A small language for defining pipeline stages and linking them together to make pipelines. [ web ]
Common Workflow Language – a specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. [ web ]
Cromwell – A Workflow Management System geared towards scientific workflows. [ web ]
Galaxy – a popular open-source, web-based platform for data intensive biomedical research. Has several features, from data analysis to workflow management to visualization tools. [ paper-2018 | web ]
Nextflow (recommended) – A fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel and scalable pipelines in a portable manner. [ paper-2018 | web ]
Ruffus – Computation Pipeline library for python widely used in science and bioinformatics. [ paper-2010 | web ]
SeqWare – Hadoop Oozie-based workflow system focused on genomics data analysis in cloud environments. [ paper-2010 | web ]
Snakemake – A workflow management system in Python that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment. [ paper-2018 | web ]
Workflow Descriptor Language – Workflow standard developed by the Broad. [ web ]