BOL: Related items

BBMap/BBTools package: Multipurpose tool designed for converting reads or other nucleotide data between different formats.

Jit — Mon, 13 Jun 2016 05:47:21 -0500

Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can interconvert, the following formats and encodings:

fastq
fasta
fasta+qual
sam
scarf (an old Illumina format)
bam (if samtools is installed)
gzip
zip
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded and can process data at over 500 megabytes per second. It can accept streams from standard input and write to standard output, so it is easy to drop into the middle of a pipeline for format conversion. Reformat autodetects formats based on file extensions and content, which makes it very easy to use. The autodetection can be overridden, which gives flexibility to people who don't like to follow naming conventions, or who have out-of-spec fastq files with quality values like -17 or 120.

The program has been gradually expanded, and can now perform various other functions. None of these will break pairing, if the input is paired.

Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Length-filtering
Testing for corrupted interleaved files
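To make the "will not break pairing" point concrete, here is a minimal Python sketch of a GC-content and length filter that treats each pair as a unit. It is not Reformat's code. The function names are made up for illustration, and the rule that a pair is dropped when either mate fails is an assumption; the real tool's rule may differ.

```python
def gc_fraction(seq):
    """Fraction of G and C among A, C, G and T bases (N and other symbols are ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


def filter_pairs(pairs, min_gc=0.0, max_gc=1.0, min_length=0):
    """Yield (seq1, seq2) pairs in which both mates pass the filters.

    Assumption for this sketch: a pair is kept or dropped as a whole,
    so mates are never separated.
    """
    for seq1, seq2 in pairs:
        if all(len(s) >= min_length and min_gc <= gc_fraction(s) <= max_gc
               for s in (seq1, seq2)):
            yield seq1, seq2
```

Because the decision is made per pair, the output can still be written as an interleaved file or as two synchronized files.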

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:

Reformat fastq into fasta:
reformat.sh in=x.fq out=y.fa

Interleave paired reads:
reformat.sh in1=x1.fq in2=x2.fq out=y.fq

Note: if the paired read files have the same name apart from a 1 and a 2, you can use a shortcut in which the # symbol stands in for the 1 and the 2. This is equivalent to the above command:
reformat.sh in=x#.fq out=y.fq
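For readers who want to see what interleaving and the # shortcut amount to, here is a rough Python sketch. It is an illustration only, not how Reformat is implemented. It assumes plain 4-line fastq records, and the function names are hypothetical.

```python
from itertools import islice


def expand_pair_pattern(path):
    """Expand 'x#.fq' into ('x1.fq', 'x2.fq'); paths without '#' are returned alone."""
    if "#" not in path:
        return (path,)
    return path.replace("#", "1", 1), path.replace("#", "2", 1)


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines.

    Assumes each record occupies exactly four lines.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated fastq record")
        yield tuple(record)


def interleave(reads1, reads2):
    """Alternate read 1 and read 2 records, as in an interleaved file.

    zip() stops at the shorter input, so unequal files are silently
    truncated in this sketch.
    """
    for r1, r2 in zip(reads1, reads2):
        yield r1
        yield r2
```

De-interleaving is the reverse: send records 1, 3, 5, ... to one output and records 2, 4, 6, ... to the other.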

De-interleave reads:
reformat.sh in=x.fq out1=y1.fq out2=y2.fq

Verify that interleaving appears correct, assuming Illumina naming conventions:
reformat.sh in=x.fq vint
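A name-based interleaving check can be sketched in Python as follows. This is not Reformat's actual logic. It assumes two common Illumina header styles (a "/1" or "/2" suffix, or a " 1:" or " 2:" field after the first space), and the helper names are invented for the example.

```python
def pair_key(header):
    """Return (base_name, read_number) for an Illumina-style fastq header.

    read_number is None when neither naming style is recognised.
    """
    name = header[1:] if header.startswith("@") else header
    first, _, rest = name.partition(" ")
    if first.endswith("/1") or first.endswith("/2"):
        return first[:-2], int(first[-1])
    if rest[:2] in ("1:", "2:"):
        return first, int(rest[0])
    return first, None


def looks_interleaved(headers):
    """True if headers alternate read 1 / read 2 with matching base names."""
    headers = list(headers)
    if len(headers) % 2:
        return False
    for h1, h2 in zip(headers[0::2], headers[1::2]):
        (name1, num1), (name2, num2) = pair_key(h1), pair_key(h2)
        if name1 != name2 or (num1, num2) != (1, 2):
            return False
    return True
```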

Convert ASCII-33 to ASCII-64:
reformat.sh in=x.fq out=y.fq qin=33 qout=64
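The offset conversion itself is simple arithmetic on the quality characters: subtract the input offset to get the score, then add the output offset. A small Python sketch (the function name is hypothetical, and this is not Reformat's code):

```python
def convert_quality(qual, qin=33, qout=64):
    """Re-encode a fastq quality string from one ASCII offset to another."""
    scores = [ord(c) - qin for c in qual]          # decode to Phred scores
    return "".join(chr(q + qout) for q in scores)  # encode with the new offset
```

For example, convert_quality("I") gives "h", since "I" is Q40 in ASCII-33 and "h" is Q40 in ASCII-64.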

Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming:
reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50
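As an illustration of what the command above asks for, here is a deliberately naive Python sketch: bases are removed from each end while their quality is below the threshold, and the pair is then routed by length. Reformat's real trimming algorithm may well differ. The routing of a surviving mate to the singleton output is an assumption based on the outsingle parameter in the command, and all names are hypothetical.

```python
def trim_ends(seq, qual, trimq=10, offset=33):
    """Strip bases with quality below trimq from both ends of a read."""
    scores = [ord(c) - offset for c in qual]
    left = 0
    while left < len(scores) and scores[left] < trimq:
        left += 1
    right = len(scores)
    while right > left and scores[right - 1] < trimq:
        right -= 1
    return seq[left:right], qual[left:right]


def route_pair(seq1, seq2, minlength=50):
    """Decide where a trimmed pair goes: 'paired', 'single' or 'discard'."""
    keep1 = len(seq1) >= minlength
    keep2 = len(seq2) >= minlength
    if keep1 and keep2:
        return "paired"
    if keep1 or keep2:
        return "single"   # only one mate survived (assumed to go to outsingle)
    return "discard"
```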

Subsample 10% of the first 20000 pairs in an interleaved file:
reformat.sh in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(in this case "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs)
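Conceptually, this kind of subsampling looks at only the first N pairs and keeps each one with a fixed probability, so both mates of a pair are kept or dropped together. A minimal Python sketch under that assumption (the seed parameter and function name are additions for the example, not taken from Reformat):

```python
import random
from itertools import islice


def subsample_pairs(pairs, reads=20000, samplerate=0.1, seed=0):
    """Keep each of the first `reads` pairs with probability `samplerate`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    for pair in islice(pairs, reads):
        if rng.random() < samplerate:
            yield pair
```

Note that a rate-based sample returns roughly, not exactly, 10% of the pairs.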

Pipe in a gzipped sam file and pipe out fasta:
reformat.sh in=stdin.sam.gz out=stdout.fa

Reverse-complement reads:
reformat.sh in=x.fq out=y.fq rcomp
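For reference, reverse-complementing a fastq read means complementing and reversing the bases and also reversing the quality string so each score stays with its base. A short Python sketch (it handles only A, C, G, T and N; the function name is hypothetical):

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq, qual=None):
    """Return the reverse complement of seq, plus the reversed qualities if given."""
    rc = seq.translate(_COMPLEMENT)[::-1]
    if qual is None:
        return rc
    return rc, qual[::-1]
```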

For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters:
reformat.sh -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70
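Changing the line-wrapping length just means re-splitting each sequence into fixed-width lines, which is also why a whole sequence may need to be held in memory at once. A tiny Python sketch of the idea (not Reformat's implementation; the function name is made up):

```python
def wrap_fasta_sequence(seq, width=70):
    """Split one sequence into lines of at most `width` characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]
```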

For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.

For people using a non-bash terminal, you may need to type "bash reformat.sh" instead of just "reformat.sh".
For users of Windows or other platforms that do not support bash shellscripts, replace "reformat.sh" with "java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads".
For example:
java -ea -Xmx200m -cp C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa

Reformat can be downloaded with BBTools here:
https://sourceforge.net/projects/bbmap/

EXCAVATOR2tool

Bulbul — Wed, 30 Nov 2016 04:09:19 -0600

EXCAVATOR2 is a collection of bash, R and Fortran scripts and code that analyses Whole Exome Sequencing (WES) data to identify CNVs. EXCAVATOR2 enhances the identification of all genomic CNVs, both overlapping and non-overlapping targeted exons, by integrating the analysis of In-target and Off-target reads. Specifically, it improves the precision of calling CNVs overlapping targeted exons from WES data and enlarges the spectrum of detectable CNVs to off-target events.
EXCAVATOR2 can be effectively employed for the identification of CNVs in small as well as large-scale re-sequencing population and cancer studies. Lastly, it is of particular interest that all WES experiments can be re-analysed using our method, with the benefit of identifying novel CNVs in extra-exonic regions from the full-genome CN profile.

Address of the bookmark: https://sourceforge.net/projects/excavator2tool/

Research Associate Bioinformatics in IISc Recruitment 2020

Tue, 23 Jun 2020 21:53:34 -0500

Essential Qualifications: Ph.D. (Bioinformatics/ Biophysics/ Biotechnology or any other stream of biological/ physical sciences) with a minimum of two publications in reputed peer reviewed journals in the area of structural bioinformatics or biophysics or biomolecular modeling/ simulation.

Job description: Development of bioinformatics tools and algorithms/software for structure-based analysis of biomolecular systems. Programmatic access to major biomolecular databases using APIs. Knowledge-based prediction and analysis of biomolecular structure, function and interactions. Docking/simulations for inhibitor design.

Desirable Qualifications (Research Associate/s):
i) Strong computer programming skills (in Python/Perl/PHP or C++ or object oriented database management systems like MySQL etc. or scripting languages under LINUX/UNIX environment).
ii) Extensive experience in computational analysis of biomolecular structure/interactions and usage of advanced biomolecular simulation software.
iii) Adequate knowledge of major databases, webservers and software in the area of biomolecular structure/function and drug design.
iv) Familiarity with parallel programming environments and experience in usage of high-end HPC clusters.

Candidates must highlight their experience in the above-mentioned fields/topics in their CV. Initial appointment will be for a period of 1 year, subject to extension after review of performance.

Emoluments: As per DST, GOI norms and commensurate with experience.

More at https://www.iisc.ac.in/positions-open/

MGERT: Mobile Genetic Elements Retrieving Tool

Neel — Sat, 18 May 2019 08:58:01 -0500

MGERT is a computational pipeline for easy retrieval of the coding sequences of mobile genetic elements (MGEs) of a particular family from genome assemblies. MGERT combines several established bioinformatic tools into a single pipeline, which hides various technical quirks from an inexperienced user.

Address of the bookmark: https://github.com/andrewgull/MGERT

vt: a variant tool set that discovers short variants from Next Generation Sequencing data.

Jit — Tue, 28 Jan 2020 03:44:43 -0600

vt is a variant tool set that discovers short variants from Next Generation Sequencing data.

https://genome.sph.umich.edu/wiki/Vt

https://github.com/atks/vt

Address of the bookmark: https://genome.sph.umich.edu/wiki/Vt

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)

Abhimanyu Singh — Thu, 29 Dec 2016 03:26:45 -0600

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm) is a microbial (bacterial and archaeal) gene finding program developed at Oak Ridge National Laboratory and the University of Tennessee. Key features of Prodigal include:

Speed: Prodigal is an extremely fast gene recognition tool (written in very vanilla C). It can analyze an entire microbial genome in 30 seconds or less.
Accuracy: Prodigal is a highly accurate gene finder. It correctly locates the 3' end of every gene in the experimentally verified Ecogene data set (except those containing introns). It possesses a very sophisticated ribosomal binding site scoring system that enables it to locate the translation initiation site with great accuracy (96% of the 5' ends in the Ecogene data set are located correctly).
Specificity: Prodigal's false positive rate compares favorably with other gene identification programs, and usually falls under 5%.
GC-Content Indifferent: Prodigal performs well even in high GC genomes, with over a 90% perfect match (5'+3') to the Pseudomonas aeruginosa curated annotations.
Metagenomic Version: Prodigal can run in metagenomic mode and analyze sequences even when the organism is unknown.
Ease of Use: Prodigal can be run in one step on a single genomic sequence or on a draft genome containing many sequences. It does not need to be supplied with any knowledge of the organism, as it learns all the properties it needs to on its own.
Open Source: Prodigal source code is freely available under the General Public License.

Download the latest version of Prodigal at the Prodigal GitHub page.
or
Browse the wiki documentation.

Address of the bookmark: http://prodigal.ornl.gov/

HivePlot

Jit — Thu, 16 Feb 2017 11:39:34 -0600

The hive plot is a rational visualization method for drawing networks. Nodes are mapped to and positioned on radially distributed linear axes — this mapping is based on network structural properties. Edges are drawn as curved links. Simple and interpretable.

The purpose of the hive plot is to establish a new baseline for visualization of large networks — a method that is both general and tunable and useful as a starting point in visually exploring network structure.
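To give a feel for the node-to-axis mapping described above, here is a rough Python sketch that computes hive-plot coordinates without drawing anything. The choices here are assumptions made for the example: node degree is the structural property, nodes are assigned to one of three axes by degree bucket, and they are positioned along the axis by normalised degree. The hive plot method itself allows other properties, and the function name is hypothetical.

```python
import math
from collections import Counter


def hive_layout(edges, n_axes=3):
    """Map each node to (axis_index, x, y) on radially distributed axes.

    Expects a non-empty list of (u, v) edges.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    max_deg = max(degree.values())
    layout = {}
    for node, deg in degree.items():
        # Bucket by degree: low-degree nodes on axis 0, hubs on the last axis.
        axis = min(n_axes - 1, (deg - 1) * n_axes // max_deg)
        angle = 2 * math.pi * axis / n_axes
        radius = deg / max_deg  # position along the axis
        layout[node] = (axis, radius * math.cos(angle), radius * math.sin(angle))
    return layout
```

Edges would then be drawn as curves between the computed positions of their endpoints.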

More at http://www.hiveplot.com/

Address of the bookmark: http://www.hiveplot.com/

Bioinformatics Head (Bioinformatics Manager III), Cancer Genomics Research Laboratory at Frederick National Laboratory

Wed, 18 Aug 2021 00:19:48 -0500

Frederick National Laboratory is seeking an enthusiastic, creative, and seasoned bioinformatics professional to join our leadership team and direct the exceptional Bioinformatics Group at the Cancer Genomics Research Laboratory (CGR). CGR has a diverse team of bioinformatics and computational scientists that support all areas of bioinformatics and data analysis (infrastructure, data QC, pipeline development and maintenance, data curation and sharing, methodology development, statistical analyses, machine learning approaches, and scientific interpretation).

More at https://leidosbiomed.csod.com/ats/careersite/jobdetails.aspx?site=4&c=leidosbiomed&id=2040

TARDIS: Toolkit for automated and rapid discovery of structural variants

Neel — Fri, 09 Jun 2017 04:43:31 -0500

tardis

Toolkit for Automated and Rapid DIscovery of Structural variants

Requirements

zlib (http://www.zlib.net)
mrfast (https://github.com/BilkentCompGen/mrfast)
htslib (included as submodule; http://htslib.org/)

Fetching tardis

git clone https://github.com/BilkentCompGen/tardis.git --recursive

https://github.com/BilkentCompGen/tardis

Address of the bookmark: https://github.com/BilkentCompGen/tardis

DeepVariant: an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data.

Jit — Sat, 25 Jan 2020 13:28:09 -0600

DeepVariant is an analysis pipeline that uses a deep neural network to call genetic variants from next-generation DNA sequencing data. DeepVariant relies on Nucleus, a library of Python and C++ code for reading and writing data in common genomics file formats (like SAM and VCF) designed for painless integration with the TensorFlow machine learning framework.

https://ai.googleblog.com/2017/12/deepvariant-highly-accurate-genomes.html

https://www.biorxiv.org/content/10.1101/092890v6

Address of the bookmark: https://github.com/google/deepvariant