BOL: Related items

BBMap/BBTools package: Multipurpose tool designed for converting reads or other nucleotide data between different formats.

Jit — Mon, 13 Jun 2016 05:47:21 -0500

Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can inter-convert:

fastq
fasta
fasta+qual
sam
scarf (an old Illumina format)
bam (if samtools is installed)
gzip
zip
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded and can process data at over 500 megabytes per second. It can accept streams from standard in and write to standard out, so it can easily be dropped into the middle of a pipeline for format conversion. Reformat autodetects formats based on file extensions and content, making it very easy to use; the autodetection can be overridden, allowing flexibility for people who don't like to follow naming conventions, or for out-of-spec fastq files with quality values like -17 or 120.
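To make the ascii-33 versus ascii-64 distinction concrete, here is a minimal Python sketch of guessing a quality offset from content and re-encoding a quality string. It is an illustration only, not Reformat's autodetection logic; the function names and the threshold heuristic are assumptions.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest quality character seen.

    Assumption: offset-64 data never goes below ';' (ASCII 59), which is
    the old Illumina minimum score of -5 plus 64. Anything lower must be
    offset 33. Raises ValueError if no quality characters are supplied.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


def convert_quality(quality, offset_in, offset_out):
    """Re-encode one quality string from one ASCII offset to another."""
    return "".join(chr(ord(ch) - offset_in + offset_out) for ch in quality)
```

A file containing only high-quality ascii-33 reads could be misclassified by a heuristic like this, which is one reason an explicit override is useful.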

The program has been gradually expanded, and can now perform various other functions. None of these will break pairing, if the input is paired.

Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Length-filtering
Testing for corrupted interleaved files
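The idea of filtering without breaking pairing can be sketched in Python as follows. This is a hypothetical illustration of pair-aware length filtering, not Reformat's code; routing a surviving mate to a separate singleton list is an assumption made for the sketch.

```python
def filter_pairs_by_length(pairs, min_length):
    """Split read pairs into kept pairs and singletons.

    pairs: iterable of (read1, read2) sequence strings.
    A pair is kept only if both mates pass. If exactly one mate passes,
    it is reported as a singleton rather than being paired with a
    different read, so the paired output stays in sync.
    """
    kept, singletons = [], []
    for read1, read2 in pairs:
        ok1 = len(read1) >= min_length
        ok2 = len(read2) >= min_length
        if ok1 and ok2:
            kept.append((read1, read2))
        elif ok1:
            singletons.append(read1)
        elif ok2:
            singletons.append(read2)
        # If neither mate passes, the whole pair is dropped.
    return kept, singletons
```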

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:

Reformat fastq into fasta:
reformat.sh in=x.fq out=y.fa

Interleave paired reads:
reformat.sh in1=x1.fq in2=x2.fq out=y.fq

Note: as a shortcut, if the paired read files have the same name apart from a 1 and a 2, the "#" symbol can stand in for both. This is equivalent to the above command:
reformat.sh in=x#.fq out=y.fq

De-interleave reads:
reformat.sh in=x.fq out1=y1.fq out2=y2.fq

Verify that interleaving appears correct, assuming Illumina naming conventions:
reformat.sh in=x.fq vint
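What such a check involves can be sketched in Python. This is an assumption-laden illustration, not the logic behind "vint": it accepts only headers ending in /1 and /2, or headers with a space followed by "1:" or "2:", and the function names are made up for the example.

```python
def split_mate(header):
    """Return (read name, mate number) for a FASTQ header line.

    Handles "name/1" and "name 1:N:0:..." styles; the mate number is
    None if neither pattern is found.
    """
    name = header.lstrip("@")
    first, _, rest = name.partition(" ")
    if first.endswith("/1") or first.endswith("/2"):
        return first[:-2], first[-1]
    if rest[:2] in ("1:", "2:"):
        return first, rest[0]
    return first, None


def interleaving_looks_correct(headers):
    """Check that headers alternate mate 1, mate 2 with matching names."""
    if len(headers) % 2 != 0:
        return False
    for i in range(0, len(headers), 2):
        name1, mate1 = split_mate(headers[i])
        name2, mate2 = split_mate(headers[i + 1])
        if name1 != name2 or mate1 != "1" or mate2 != "2":
            return False
    return True
```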

Convert ASCII-33 to ASCII-64:
reformat.sh in=x.fq out=y.fq qin=33 qout=64

Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming:
reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50

Subsample 10% of the first 20000 pairs in an interleaved file:
reformat.sh in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(in this case "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs)

Pipe in a gzipped sam file and pipe out fasta:
reformat.sh in=stdin.sam.gz out=stdout.fa

Reverse-complement reads:
reformat.sh in=x.fq out=y.fq rcomp
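For reference, reverse-complementing a read amounts to the following Python sketch. It is illustrative only; the function names are assumptions, and ambiguity codes other than N are left unchanged here.

```python
# Translation table for the four bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_fastq(seq, qual):
    """Reverse-complement a read; the quality string is only reversed."""
    return reverse_complement(seq), qual[::-1]
```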

For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters:
reformat.sh -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70

For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.

For people using a non-bash terminal, you may need to type "bash reformat.sh" instead of just "reformat.sh".
For users of Windows or other platforms that do not support bash shellscripts, replace "reformat.sh" with "java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads", for example:
java -ea -Xmx200m -cp C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa

Reformat can be downloaded with BBTools here:
https://sourceforge.net/projects/bbmap/

Blobsplorer

Jit — Tue, 14 Jun 2016 10:28:58 -0500

Blobsplorer is a tool for interactive visualization of assembled DNA sequence data ("contigs") derived from (often unintentionally) mixed-species pools. It allows the simultaneous display of GC content, coverage, and taxonomic annotation for collections of contigs with a view to separating out those belonging to different taxa.
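The three quantities displayed per contig can be illustrated with a short Python sketch. This is not Blobsplorer's input format or code; the (sequence, coverage, taxon) tuple layout and the function names are assumptions made for the example.

```python
from collections import defaultdict


def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def summarise_by_taxon(contigs):
    """Mean GC fraction and mean coverage per taxon.

    contigs: iterable of (sequence, coverage, taxon) tuples.
    Returns {taxon: (mean_gc, mean_coverage)}.
    """
    groups = defaultdict(list)
    for seq, coverage, taxon in contigs:
        groups[taxon].append((gc_fraction(seq), coverage))
    summary = {}
    for taxon, points in groups.items():
        n = len(points)
        mean_gc = sum(gc for gc, _ in points) / n
        mean_cov = sum(cov for _, cov in points) / n
        summary[taxon] = (mean_gc, mean_cov)
    return summary
```

Contigs from different taxa in a mixed pool often differ in GC content, coverage, or both, which is why plotting the two against each other can help separate them.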

Blobsplorer is unlikely to be of use on its own as it requires contig data to be supplied in a format that involves considerable preprocessing (see below for a description). The easiest way to use Blobsplorer is as part of a workflow using scripts from here.

Address of the bookmark: http://nematodes.org/martin/blobsplorer/blobsplorer.html

CNIDARIA: fast, reference-free phylogenomic clustering

Shruti Paniwala — Thu, 16 Jun 2016 17:55:17 -0500

Motivation: Identification of biological specimens is a major requirement for a range of applications. Reference-free methods analyse unprocessed sequencing data without relying on prior knowledge, but these do not scale to arbitrarily large genomes and arbitrarily large phylogenetic distances.

Results: We present Cnidaria, a practical tool for clustering genomic and transcriptomic data with no limitation on genome size or phylogenetic distances. We successfully clustered 169 genomic and transcriptomic datasets from 4 kingdoms simultaneously, achieving 100% accuracy at supra-species level and 78% accuracy for species level.
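Reference-free comparison of this kind is commonly built on shared k-mers. The Python sketch below assumes a Jaccard distance over k-mer sets; the abstract above does not state Cnidaria's exact measure, so treat this as a generic illustration rather than the tool's method.

```python
def kmer_set(seq, k):
    """All distinct substrings of length k in a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(kmers_a, kmers_b):
    """1 minus (shared k-mers / all k-mers); 0.0 for two empty sets."""
    union = kmers_a | kmers_b
    if not union:
        return 0.0
    return 1.0 - len(kmers_a & kmers_b) / len(union)
```

A matrix of such pairwise distances between datasets could then be passed to any standard clustering or tree-building method.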

Availability and Implementation: Cnidaria is written in C++ and Python and is available at http://www.ab.wur.nl/cnidaria.

Contact: Saulo Aflitos - sauloal@gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.

Address of the bookmark: https://github.com/sauloal/cnidaria/wiki

Greengenes database

Jit — Wed, 29 Jun 2016 10:03:31 -0500

The greengenes web application provides access to the 2011 version of the greengenes 16S rRNA gene sequence alignment for browsing, blasting, probing, and downloading. The data and tools presented by greengenes can assist the researcher in choosing phylogenetically specific probes, interpreting microarray results, and aligning/annotating novel sequences. If you are an ARB user, you can use greengenes to keep your own local database current.

Address of the bookmark: http://greengenes.lbl.gov/cgi-bin/nph-index.cgi

NearHGT

Jit — Wed, 22 Jun 2016 05:41:57 -0500

Horizontal gene transfer (HGT), the transfer of genetic material between organisms, is crucial for genetic innovation and the evolution of genome architecture. Existing HGT detection algorithms rely on a strong phylogenetic signal distinguishing the transferred sequence from ancestral (vertically derived) genes in its recipient genome. Detecting HGT between closely related species or strains is challenging, as the phylogenetic signal is usually weak and the nucleotide composition is normally nearly identical. Nevertheless, there is a great importance in detecting HGT between congeneric species or strains, especially in clinical microbiology, where understanding the emergence of new virulent and drug-resistant strains is crucial, and often time-sensitive.

We developed a novel, self-contained technique named Near HGT, based on the synteny index, to measure the divergence of a gene from its native genomic environment and used it to identify candidate HGT events between closely related strains. The method confirms candidate transferred genes based on the constant relative mutability (CRM). Using CRM, the algorithm assigns a confidence score based on “unusual” sequence divergence. A gene exhibiting exceptional deviations according to both synteny and mutability criteria is considered a validated HGT product. We first applied the technique to a set of three E. coli strains and detected several highly probable horizontally acquired genes. We then compared the method to existing HGT detection tools using a larger strain data set.
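A synteny index can be sketched in Python under one common definition, assumed here: the number of genes shared between the k-neighbourhoods of the same gene in two genomes. The sketch treats genomes as linear ordered lists of gene identifiers, does not cover the CRM step, and is not taken from the Near HGT code.

```python
def neighbourhood(genome, gene, k):
    """Genes within k positions of `gene` on either side (gene excluded).

    genome: ordered list of gene identifiers, treated as linear.
    Raises ValueError if the gene is not present.
    """
    i = genome.index(gene)
    left = genome[max(0, i - k):i]
    right = genome[i + 1:i + 1 + k]
    return set(left) | set(right)


def synteny_index(genome_a, genome_b, gene, k):
    """Number of neighbours the gene has in common in both genomes (0 to 2k)."""
    shared = neighbourhood(genome_a, gene, k) & neighbourhood(genome_b, gene, k)
    return len(shared)
```

Under this definition, a gene that keeps its neighbours scores close to 2k, while a gene that sits in an unrelated genomic context in the second genome scores close to 0.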

When combined with additional approaches, our new algorithm provides a richer picture and brings us closer to the goal of detecting all newly acquired genes in a particular strain.

Availability: The method is publicly available at http://research.haifa.ac.il/~ssagi/software/nearHGT.zip

Address of the bookmark: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004408

WgSim

Jit — Thu, 23 Jun 2016 07:26:49 -0500

Reads simulator

Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated by simulating INDEL polymorphisms.

Wgsim outputs the simulated polymorphisms, and writes the true read coordinates as well as the number of polymorphisms and sequencing errors in read names. One can evaluate the accuracy of a mapper or a SNP caller with wgsim_eval.pl that comes with the package.
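The core idea of uniform substitution errors with a recorded ground truth can be illustrated with a short Python sketch. It is far simpler than wgsim (no diploid genome, polymorphisms, or paired reads), and the function name, parameters, and return layout are assumptions for the example.

```python
import random


def simulate_reads(reference, n_reads, read_length, error_rate, seed=0):
    """Sample reads from a reference with uniform substitution errors.

    reference: uppercase string over A, C, G, T.
    Returns a list of (start, n_errors, sequence) tuples, so the true
    origin and error count of every read are known for later evaluation.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(reference) - read_length + 1)
        bases = list(reference[start:start + read_length])
        n_errors = 0
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the three other bases, chosen uniformly.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
                n_errors += 1
        reads.append((start, n_errors, "".join(bases)))
    return reads
```

Keeping the true start position alongside each read is what makes it possible to score a mapper afterwards: a mapped position can be compared directly with the recorded one.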

Address of the bookmark: https://github.com/lh3/wgsim

QuIN’s web server

Jit — Mon, 27 Jun 2016 10:44:16 -0500

Recent studies of the human genome have indicated that regulatory elements (e.g. promoters and enhancers) at distal genomic locations can interact with each other via chromatin folding and affect gene expression levels. Genomic technologies for mapping interactions between DNA regions, e.g., ChIA-PET and HiC, can generate genome-wide maps of interactions between regulatory elements. These interaction datasets are important resources to infer distal gene targets of non-coding regulatory elements and to facilitate prioritization of critical loci for important cellular functions. With the increasing diversity and complexity of genomic information and public ontologies, making sense of these datasets demands integrative and easy-to-use software tools. Moreover, network representation of chromatin interaction maps enables effective data visualization, integration, and mining. Currently, there is no software that can take full advantage of network theory approaches for the analysis of chromatin interaction datasets. To fill this gap, we developed a web-based application, QuIN, which enables: 1) building and visualizing chromatin interaction networks, 2) annotating networks with user-provided private and publicly available functional genomics and interaction datasets, 3) querying network components based on gene name or chromosome location, and 4) utilizing network based measures to identify and prioritize critical regulatory targets and their direct and indirect interactions.
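The network view described above can be illustrated with a minimal Python sketch: interactions become edges, and a simple network measure (degree) is used to rank nodes. This is not QuIN's code; the node labels, function names, and the choice of degree as the measure are assumptions for the example.

```python
from collections import defaultdict


def build_network(interactions):
    """Build an undirected adjacency map from (anchor_a, anchor_b) pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        if a != b:  # ignore self-interactions
            graph[a].add(b)
            graph[b].add(a)
    return graph


def rank_by_degree(graph):
    """Nodes ordered by number of direct partners, ties broken by name."""
    return sorted(graph, key=lambda node: (-len(graph[node]), node))


def indirect_partners(graph, node):
    """Nodes reachable in exactly two steps that are not direct partners."""
    direct = graph.get(node, set())
    two_hop = set()
    for neighbour in direct:
        two_hop |= graph.get(neighbour, set())
    return two_hop - direct - {node}
```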

AVAILABILITY: QuIN’s web server is available at http://quin.jax.org. QuIN is developed in Java and JavaScript, utilizing an Apache Tomcat web server and MySQL database, and the source code is available under the GPLv3 license on GitHub: https://github.com/UcarLab/QuIN/.

Address of the bookmark: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004809

Genome Workbench 2.10.7

Gudiya Pal — Fri, 01 Jul 2016 12:09:59 -0500

Genome Workbench 2.10.7 is here! New features include added support for local custom BLAST databases and improvements to Tree View.

For the full list of features, improvements and fixes, see the release notes: https://ncbi.nlm.nih.gov/tools/gbench/releasenotes

New Features

BLAST Tool: added support for local custom BLAST databases
Graphical Sequence View: added log scaling option for graph tracks
Generic Table View: new tutorial added

Bug Fixes and Improvements

Project Tree View: Genomic Collections/Assemblies now show accessions, not just names
Tree View: layout updated to better accommodate nodes of different sizes
Table Import Dialog (MacOS): fixed issue with table visibility
Fixed bug where different molecule IDs in GenBank could resolve to the same sequence
Graphical Sequence View: fixed issue where sequence track was not shown for some sequences
Graphical Sequence View: fixed protein coloration methods
Graphical Sequence View: improved rendering of Markers to better indicate boundaries and produce higher quality PDF images
Create Gene Model tool: fixed scenario when gene model tool failed with local sequences
Search View: ORF Finder – fixed incorrect protein lengths
Fixed bug with not opening project file (.gbp) on a click
Fixed issues in GVF import
Fixed BLAST Search tool against NCBI databases not working
Fixed tblastn (protein BLAST) not working in standalone mode
Fixed GTF export failure

Bioinformatics tools and software

Jit — Tue, 05 Jul 2016 10:02:26 -0500

USEARCH: Extreme high-throughput sequence analysis. Orders of magnitude faster than BLAST.
MUSCLE: Multiple sequence alignment. Faster and more accurate than CLUSTALW.
UPARSE: OTU clustering for 16S and other marker genes. Highly accurate OTU sequences and improved diversity measures.
UCHIME: Chimeric sequence detection.
PILER: De novo genome repeat finder.
PILER-CR: Detection of CRISPR repeats in bacterial genomes.
QSCORE: Compare two multiple alignments for benchmarking.
PALS: Whole-genome alignment.
PREFAB: Protein Reference Alignment Database.
MSA benchmark collection: Selected multiple alignment benchmarks in a standardized FASTA format.

Address of the bookmark: http://drive5.com/software.html

Advertisement for Junior Research Fellow(JRF) at School of Computational and Integrative Sciences Jawaharlal Nehru University

Thu, 14 Jul 2016 07:24:53 -0500

Advertisement for Junior Research Fellow(JRF) - (1)

Applications are invited for a post in the DST, India funded project entitled "Positive and negative impacts of macromolecular crowding agents during target site location by DNA binding proteins – origin of optimal search at physiological ionic concentration" (Reference Number: ECR/2016/000188). The selected candidate will be appointed on a purely temporary basis, initially for two years as a JRF, which may be extended by one year as an SRF based on performance.

Position: Junior Research Fellow (1)

Qualifications & Experience: Candidate must have a consistently good academic record with at least 60% marks throughout and must have qualified NET/GATE.

Desirable: Basic knowledge of biophysics, molecular simulations and computational biology.

Salary: Consolidated Rs. 25,000 per month.

Tenure: The project duration is for three years and the selected candidate would be appointed after an interview. Appointment will be purely on temporary basis as stipulated by the existing rules of the University.

Interested candidates need to send an application to the address mentioned below mentioning the name of the project and post applied for (on the cover of the envelope).

The application, along with a CV, should be mailed to the address given below. The name, address, contact number and e-mail address of two referees must be enclosed with the application. The last date for applications is July 31st 2016.

Dr. Arnab Bhattacharjee (Principal Investigator)
Assistant Professor
School of Computational and Integrative Sciences
Jawaharlal Nehru University
New Delhi-110067
E-mail: arnab@jnu.ac.in

Note: 1. Only shortlisted candidates will be invited to appear for the interview at SCIS, JNU; no other communication in this regard will be entertained.

2. No TA/DA will be paid for appearing in the interview.