BOL: Related items

List of bioinformatics workflow management tools !

Rahul Nayak — Sat, 20 Mar 2021 00:15:25 -0500

Here is a list of workflow managers:

BigDataScript – A cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities. [ paper-2014 | web ]
Bpipe – A small language for defining pipeline stages and linking them together to make pipelines. [ web ]
Common Workflow Language – a specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. [ web ]
Cromwell – A Workflow Management System geared towards scientific workflows. [ web ]
Galaxy – a popular open-source, web-based platform for data intensive biomedical research. Has several features, from data analysis to workflow management to visualization tools. [ paper-2018 | web ]
Nextflow (recommended) – A fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel and scalable pipelines in a portable manner. [ paper-2018 | web ]
Ruffus – Computation pipeline library for Python, widely used in science and bioinformatics. [ paper-2010 | web ]
SeqWare – Hadoop Oozie-based workflow system focused on genomics data analysis in cloud environments. [ paper-2010 | web ]
Snakemake – A workflow management system in Python that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment. [ paper-2018 | web ]
Workflow Description Language (WDL) – Workflow standard developed by the Broad Institute. [ web ]
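
To make the idea shared by these tools concrete, here is a minimal sketch in plain Python of what "defining pipeline stages and linking them together" amounts to: each stage declares the stages it depends on, and the runner executes them in dependency order. The stage names and the `run_pipeline` helper are hypothetical, and none of the tools listed above exposes this API; real workflow managers add caching, parallel execution and cluster or cloud back ends on top of this.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: each name maps to the set of stages it depends on.
STAGES = {
    "trim": set(),
    "align": {"trim"},
    "sort": {"align"},
    "call_variants": {"sort"},
    "qc_report": {"trim", "sort"},
}


def run_pipeline(stages, actions):
    """Run every stage after all of its dependencies; return the order used."""
    order = list(TopologicalSorter(stages).static_order())
    for name in order:
        actions[name]()
    return order


log = []
# Toy actions that only record their own name; real stages would call tools.
actions = {name: (lambda n=name: log.append(n)) for name in STAGES}
order = run_pipeline(STAGES, actions)
print(order)
```

Independent stages (here `call_variants` and `qc_report`) may come out in either order, which is exactly the freedom a real workflow manager exploits to run them in parallel.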

MEGADOCK 4.0

Suleman Khan — Thu, 07 Aug 2014 18:08:54 -0500

An ultra–high-performance protein–protein docking software for heterogeneous supercomputers

Summary: The application of protein–protein docking in large-scale interactome analysis is a major challenge in structural bioinformatics and requires huge computing resources. In this work, we present MEGADOCK 4.0, an FFT-based docking software that makes extensive use of recent heterogeneous supercomputers and shows powerful, scalable performance of over 97% strong scaling.
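
For readers unfamiliar with the term, strong scaling measures how well the run time of a fixed-size problem shrinks as more nodes are added. The sketch below shows the usual efficiency calculation; the function name and the timings are made up for illustration and are not taken from the MEGADOCK paper.

```python
def strong_scaling_efficiency(base_nodes, base_time, nodes, time):
    """Efficiency of a fixed-size problem relative to a baseline run.

    Under ideal scaling the run time shrinks in proportion to the node
    count, so efficiency = (base_time * base_nodes) / (time * nodes).
    """
    if min(base_nodes, nodes) <= 0 or min(base_time, time) <= 0:
        raise ValueError("node counts and times must be positive")
    return (base_time * base_nodes) / (time * nodes)


# Made-up timings in seconds, for illustration only (not MEGADOCK results).
eff = strong_scaling_efficiency(base_nodes=10, base_time=1000.0,
                                nodes=100, time=103.0)
print(f"strong-scaling efficiency: {eff:.1%}")
```

An efficiency of 1.0 means a tenfold increase in nodes gave a tenfold speed-up; values just below 1.0 indicate small communication or load-balancing overheads.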

Availability and Implementation: MEGADOCK 4.0 is written in C++ with OpenMPI and NVIDIA CUDA 5.0 (or later) and is freely available to all academic and non-profit users at: http://www.bi.cs.titech.ac.jp/megadock.

Contact: akiyama@cs.titech.ac.jp

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2014/08/06/bioinformatics.btu532.short

VariantBam: Filtering and profiling of next-generational sequencing data using region-specific rules

Rahul Nayak — Thu, 04 Oct 2018 16:30:44 -0500

VariantBam is a tool to extract or count specific sets of sequencing reads from next-generation sequencing files. To save money, disk space and I/O, one may not want to store an entire BAM file on disk. In many cases, it is more efficient to store only those reads or read pairs that intersect some region around the variant locations. Alternatively, if your scientific question focuses on only one aspect of the data (e.g. breakpoints), many reads can be removed without losing the information relevant to the problem.
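
The core idea of keeping only reads near variant sites can be sketched in a few lines of plain Python. This is not VariantBam's implementation or its rule syntax; the function names, the tuple layout and the toy data are all assumptions made for illustration, and a real tool would read BAM records and use an interval index rather than a linear scan.

```python
from collections import defaultdict


def build_regions(variants, pad):
    """Group padded [start, end) windows around each variant by chromosome."""
    regions = defaultdict(list)
    for chrom, pos in variants:
        regions[chrom].append((max(0, pos - pad), pos + pad + 1))
    return regions


def reads_near_variants(reads, variants, pad=100):
    """Keep names of reads whose [start, end) span overlaps a padded window."""
    regions = build_regions(variants, pad)
    kept = []
    for name, chrom, start, end in reads:
        windows = regions.get(chrom, [])
        if any(start < w_end and w_start < end for w_start, w_end in windows):
            kept.append(name)
    return kept


# Toy data with 0-based, half-open coordinates; all names are hypothetical.
variants = [("chr1", 1000), ("chr2", 500)]
reads = [
    ("r1", "chr1", 850, 950),    # overlaps the chr1 window [900, 1101)
    ("r2", "chr1", 2000, 2100),  # far from any variant
    ("r3", "chr2", 601, 700),    # starts just past the chr2 window [400, 601)
    ("r4", "chr2", 590, 690),    # overlaps the chr2 window
    ("r5", "chr3", 1000, 1100),  # chromosome without variants
]
kept = reads_near_variants(reads, variants)
print(kept)
```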

Address of the bookmark: https://github.com/broadinstitute/VariantBam

GrapheR !!!

John Parker — Thu, 14 Aug 2014 14:02:17 -0500

What a wonderful gem GrapheR is... Oh yes it is. GrapheR is a GUI for base graphics in R by http://www.maximeherve.com/. The package provides a graphical user interface for creating base charts in R. It is ideal for beginners in R, as the user interface is very clear and the code is written alongside into a text file, allowing users to recreate the charts directly in the console.

Adding and changing legends? Messing around with the plotting window settings? It is much easier/quicker with this GUI than reading the help file and trying to understand the various parameters.
Here is a little example using the iris data set.

library(GrapheR)
data(iris)
run.GrapheR()

This will bring up a window that helps me to create the chart and tweak the various parameters.

Finally, I find the underlying R code in a file created by GrapheR. For more details read also the package vignette, which is available in English, French and German!

NanoPack: visualizing and processing long-read sequencing data

Jit — Tue, 25 Dec 2018 21:20:50 -0600

The NanoPack tools are written in Python3 and released under the GNU GPL3.0 License. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools.

Address of the bookmark: https://github.com/wdecoster/nanopack

pybedtools

Shruti Paniwala — Wed, 20 Aug 2014 01:03:41 -0500

pybedtools is a Python wrapper for Aaron Quinlan's BEDTools programs (https://github.com/arq5x/bedtools), which are widely used for genomic interval manipulation or "genome algebra". pybedtools extends BEDTools by offering feature-level manipulations from within Python. See the full online documentation, including installation instructions, at http://pythonhosted.org/pybedtools/.
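
As a taste of what "genome algebra" means, the following sketch merges overlapping or touching intervals in plain Python. It deliberately does not use pybedtools or BEDTools, so it illustrates the kind of operation these tools perform rather than their API; the `merge_intervals` helper and the toy features are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals.

    Coordinates are 0-based and half-open, as in BED files.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


# Toy features; the first three on chr1 collapse into one interval.
features = [
    ("chr1", 100, 200),
    ("chr1", 150, 300),
    ("chr1", 300, 350),
    ("chr1", 500, 600),
    ("chr2", 100, 200),
]
merged = merge_intervals(features)
print(merged)
```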

More at http://pythonhosted.org/pybedtools/

A powerful toolset for genome arithmetic: http://code.google.com/p/bedtools/

ngs-bits - Short-read sequencing tools

Neel — Thu, 16 Jan 2020 23:14:00 -0600

Binaries of ngs-bits are available via Bioconda. Alternatively, ngs-bits can be built from sources:

Binaries for Linux/macOS
From sources for Linux/macOS
From sources for Windows

Address of the bookmark: https://github.com/imgag/ngs-bits

Understanding your reads and mapping !

Neel — Wed, 29 Jan 2020 06:29:55 -0600

One of the best tutorials for beginners:

https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2017/Day1/Session4-seqIntro.html

Address of the bookmark: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2017/Day1/Session4-seqIntro.html

Bioinformatics JRF/SRF position at IARI

Thu, 04 Sep 2014 04:14:01 -0500

DIVISION OF NEMATOLOGY
INDIAN AGRICULTURAL RESEARCH INSTITUTE
NEW DELHI 110012
Applications are invited for the posts of one Junior Research Fellow (JRF) and one Research Associate (RA) in the DBT-funded project entitled "Plant parasitic nematode genome informatics - in silico resource development". The project is for a period of three years.

Essential qualifications for JRF: M.Sc. in Bioinformatics with experience in proteomics, genomics and structural biology. Knowledge of programming languages, Perl and databases (HTML, CSS, PHP and JavaScript).
Essential qualifications for Research Associate: M.Sc./M.Tech. in Bioinformatics with three years' experience, or Ph.D. in Bioinformatics with experience in proteomics, genomics and structural biology. Knowledge of programming languages, Perl and databases (HTML, CSS, JavaScript). NGS sequence assembly and analysis, and algorithm design.
Age limit: 35 years maximum (5-year relaxation for SC/ST and women candidates)
Emoluments:
JRF: Rs 16,000 + 30% HRA
Research Associate: Rs 22,000 + 30% HRA
The post is purely temporary in nature and is co-terminous with the project. The appointment will initially be for one year and may be extended upon satisfactory performance.
Interested candidates should send the duly filled application form (format on the following page), along with all the relevant documents, so that it arrives on or before 20.9.2014.

More at http://www.iari.res.in/files/JRF_RA-03092014-20140903-135319.pdf

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.
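
The read self-mappings that miniasm consumes are stored in PAF, a tab-separated text format. As a rough sketch, assuming the standard layout of 12 mandatory columns, one overlap record can be parsed as below; the `PafRecord` type, the `parse_paf_line` helper and the example line are hypothetical and are not part of minimap or miniasm.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int


def parse_paf_line(line):
    """Parse the 12 mandatory PAF columns; optional tags are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]))


# Made-up overlap between two reads, for illustration only.
line = "readA\t10000\t6000\t9900\t+\treadB\t12000\t100\t4000\t1500\t3900\t255\n"
rec = parse_paf_line(line)
overlap_len = rec.query_end - rec.query_start
print(rec.query, rec.target, overlap_len)
```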

So far miniasm is in an early development stage. It has only been tested on a dozen PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio data sets and 3 out of 4 ONT data sets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X are used, not the whole data set), miniasm finishes the assembly, including read overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, HGAP3 produces a 104Mb assembly with an N50 of 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.
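
Since N50 is the headline metric in the comparison above, here is a small sketch of how it is usually computed: sort the contig lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The `n50` helper and the toy lengths are illustrative assumptions and have nothing to do with the C. elegans numbers.

```python
def n50(lengths):
    """Return the contig length at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one contig length")


# Toy contig lengths (total 300): 80 + 70 = 150 reaches half of the total.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
result = n50(contig_lengths)
print(result)
```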

Miniasm confirms that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors).

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble:

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count how many contigs:

grep ">" reads.fa | wc -l
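
The last two steps can also be done in Python, which is convenient when the result feeds into further analysis. The sketch below converts the segment (S) lines of a GFA file to FASTA records and counts them; the `gfa_to_fasta` helper and the toy GFA lines are hypothetical, and the function assumes tab-separated GFA in which each S line carries the sequence in its third column.

```python
def gfa_to_fasta(gfa_lines, width=80):
    """Return one FASTA record (as text) per segment (S) line of a GFA file."""
    records = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            # Wrap the sequence at a fixed width, like `fold` does.
            chunks = [seq[i:i + width] for i in range(0, len(seq), width)]
            records.append(">" + name + "\n" + "\n".join(chunks) + "\n")
    return records


# Toy GFA content for illustration; real input would be reads.gfa.
toy_gfa = [
    "H\tVN:Z:1.0\n",
    "S\tutg000001l\tACGTACGTAC\n",
    "S\tutg000002l\tGGCC\n",
    "L\tutg000001l\t+\tutg000002l\t+\t0M\n",
]
fasta = gfa_to_fasta(toy_gfa, width=4)
n_contigs = len(fasta)
print("".join(fasta), end="")
print(n_contigs)
```

To run it on a real assembly, pass an open file handle instead of the toy list, for example `gfa_to_fasta(open("reads.gfa"))`, and write the joined records to `reads.fa`.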

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm