BOL: Related items

BBMap/BBTools package: Multipurpose tool designed for converting reads or other nucleotide data between different formats.

Jit — Mon, 13 Jun 2016 05:47:21 -0500

Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can inter-convert:

fastq
fasta
fasta+qual
sam
scarf (an old Illumina format)
bam (if samtools is installed)
gzip
zip
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded and can process data at over 500 megabytes per second. It can accept streams from standard in and write to standard out, so it is easy to drop into the middle of a pipeline for format conversion. Reformat autodetects formats based on file extensions and content, which makes it very easy to use. The autodetection can be overridden, allowing flexibility for people who don't like to follow naming conventions, or for out-of-spec fastq files with quality values like -17 or 120.

The program has been gradually expanded, and can now perform various other functions. None of these will break pairing, if the input is paired.

Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Length-filtering
Testing for corrupted interleaved files

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:

Reformat fastq into fasta:
reformat.sh in=x.fq out=y.fa

Interleave paired reads:
reformat.sh in1=x1.fq in2=x2.fq out=y.fq

Note - you can use a shortcut if the paired read files have the same name apart from a 1 and a 2: put a # where that digit goes. This is equivalent to the above command:
reformat.sh in=x#.fq out=y.fq
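
To make the naming convention concrete, here is a minimal shell sketch of how a name containing # maps onto the two file names. The function expand_pair is a hypothetical helper written for this note, not part of BBTools, and it assumes only the first # is substituted:

```shell
# Hypothetical helper: expand a '#' placeholder into read-1 and read-2 file names.
# Illustrates the naming convention only; this is not Reformat's own code.
expand_pair() {
    pattern=$1
    # Replace the first '#' with 1 and with 2 respectively.
    r1=$(printf '%s\n' "$pattern" | sed 's/#/1/')
    r2=$(printf '%s\n' "$pattern" | sed 's/#/2/')
    printf 'in1=%s in2=%s\n' "$r1" "$r2"
}

expand_pair 'x#.fq'   # prints: in1=x1.fq in2=x2.fq
```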

De-interleave reads:
reformat.sh in=x.fq out1=y1.fq out2=y2.fq

Verify that interleaving appears correct, assuming Illumina naming conventions:
reformat.sh in=x.fq vint
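
For intuition about what such a check involves, the following awk sketch tests that headers alternate between name/1 and name/2. It is only an approximation under stated assumptions (4-line records, the older "/1" and "/2" header suffixes), not a description of what vint does internally; check_interleaved is a hypothetical name:

```shell
# Hypothetical check: do the headers of an interleaved FASTQ alternate name/1, name/2?
# Assumes 4-line records and the older "/1" and "/2" header suffixes.
check_interleaved() {
    awk '
        NR % 4 == 1 {                      # header line of each record
            name = $1
            n++
            if (n % 2 == 1) {              # first read of a pair
                if (name !~ /\/1$/) { bad = 1; exit }
                base = substr(name, 1, length(name) - 2)
            } else {                       # second read of a pair
                if (name != base "/2") { bad = 1; exit }
            }
        }
        END {
            # An odd number of reads means the last pair is incomplete.
            if (bad || n % 2 == 1) { print "not interleaved"; exit 1 }
            print "looks interleaved"
        }
    ' "$1"
}

tmp=$(mktemp)
printf '@r1/1\nACGT\n+\nIIII\n@r1/2\nTTGA\n+\nIIII\n' > "$tmp"
check_interleaved "$tmp"   # prints: looks interleaved
rm -f "$tmp"
```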

Convert ASCII-33 to ASCII-64:
reformat.sh in=x.fq out=y.fq qin=33 qout=64
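
The two encodings differ only in the offset added to the Phred score before it is stored as a character. A small shell sketch of the arithmetic (phred_score is a hypothetical helper for illustration):

```shell
# Decode one FASTQ quality character to a Phred score for a given offset (33 or 64).
phred_score() {
    char=$1
    offset=$2
    # printf with a leading quote prints the numeric code of the character.
    code=$(printf '%d' "'$char")
    echo $((code - offset))
}

phred_score 'I' 33   # 'I' is ASCII 73, so this prints 40
phred_score 'h' 64   # 'h' is ASCII 104, so this also prints 40
```

So a Q40 base is written as I in an ASCII-33 file and as h in an ASCII-64 file, which is why the offset has to be known before the qualities can be read correctly.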

Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming:
reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50

Subsample 10% of the first 20000 pairs in an interleaved file:
reformat.sh in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(in this case "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs)

Pipe in a gzipped sam file and pipe out fasta:
reformat.sh in=stdin.sam.gz out=stdout.fa

Reverse-complement reads:
reformat.sh in=x.fq out=y.fq rcomp

For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters:
reformat.sh -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70
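
As an illustration of what changing the line-wrapping length means, here is a small awk sketch that re-wraps FASTA sequence lines to a fixed width. It is a toy for small files, not a substitute for Reformat; wrap_fasta is a hypothetical name:

```shell
# Hypothetical helper: re-wrap FASTA sequence lines to a fixed width.
# Reads FASTA on standard input; header lines are passed through unchanged.
wrap_fasta() {
    width=$1
    awk -v w="$width" '
        function emit(   i) {              # print the buffered sequence in chunks of w
            for (i = 1; i <= length(seq); i += w) print substr(seq, i, w)
            seq = ""
        }
        /^>/ { if (seq != "") emit(); print; next }
        { seq = seq $0 }                   # join the sequence lines of one record
        END { if (seq != "") emit() }
    '
}

printf '>s\nACGTAC\nGT\n' | wrap_fasta 3
# prints:
# >s
# ACG
# TAC
# GT
```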

For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.

For people using a non-bash terminal, you may need to type "bash reformat.sh" instead of just "reformat.sh".
For users of Windows or other platforms that do not support bash shellscripts, replace "reformat.sh" with "java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads"
For example:
java -ea -Xmx200m -cp C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa

Reformat can be downloaded with BBTools here:
https://sourceforge.net/projects/bbmap/

Linux command line exercises for NGS data processing

Jit — Wed, 22 Jun 2016 07:59:39 -0500

The purpose of this tutorial is to introduce students to the frequently used tools for NGS analysis and to give them experience in writing one-liners. Copy the required files to your current directory, change directory (cd) to the linuxTutorial folder, and do all the processing inside:

[uzi@quince-srv2 ~/]$ cp -r /home/opt/MScBioinformatics/linuxTutorial .
[uzi@quince-srv2 ~/]$ cd linuxTutorial
[uzi@quince-srv2 ~/linuxTutorial]$

I have deliberately chosen Awk for the exercises because it is a language in itself and is used more often to manipulate NGS data than other command line tools such as grep, sed, perl etc. Furthermore, having a command of awk will make it easier to understand advanced tutorials such as Illumina Amplicons Processing Workflow.
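
As a taste of the kind of awk one-liner meant here, the sketch below converts FASTQ to FASTA. It is not taken from the tutorial itself and assumes strict 4-line FASTQ records; fastq_to_fasta is a hypothetical name:

```shell
# Convert a 4-line-per-record FASTQ stream to FASTA with awk.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # header: swap the leading @ for >
         NR % 4 == 2 { print }                      # sequence line, printed as-is
        '
}

printf '@read1\nACGT\n+\nIIII\n' | fastq_to_fasta
# prints:
# >read1
# ACGT
```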

In Linux, we use a shell, which is a program that takes your commands from the keyboard and gives them to the operating system. Most Linux systems use the Bourne Again SHell (bash), but a typical Linux system has several other shell programs, such as ksh, tcsh, and zsh. To see which shell you are using, type

[uzi@quince-srv2 ~/linuxTutorial]$ echo $SHELL

/bin/bash

Address of the bookmark: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/linux.html

A5-miseq

Jit — Thu, 18 Aug 2016 04:05:23 -0500

_A5-miseq_ is a pipeline for assembling DNA sequence data generated on the Illumina sequencing platform. This README will take you through the steps necessary for running _A5-miseq_.

Point to note:

There are many situations where A5-miseq is not the right tool for the job. In order to produce accurate results, A5-miseq requires Illumina data with certain characteristics. A5-miseq will likely not work well with Illumina reads shorter than around 80nt, or reads where the base qualities are low in all or most reads before 60nt. A5-miseq assumes it is assembling homozygous haploid genomes. Use a different assembler for metagenomes and heterozygous diploid or polyploid organisms. Use a different assembler if a tool like FastQC reports your data quality is dubious. You have been warned! Datasets consisting solely of unpaired reads are not currently supported.

Address of the bookmark: https://sourceforge.net/projects/ngopt/

Scarpa

Poonam Mahapatra — Wed, 13 Jul 2016 07:59:25 -0500

Scarpa is a stand-alone scaffolding tool for NGS data. It can be used together with virtually any genome assembler and any NGS read mapper that supports SAM format. Other features include support for multiple libraries and an option to estimate insert size distributions from data. Scarpa is available free of charge for academic and commercial use under the GNU General Public License (GPL).

See the user manual or the paper for more information about Scarpa. Click here for the supplementary material.

Address of the bookmark: http://compbio.cs.toronto.edu/hapsembler/scarpa.html

CrossMap

Abhimanyu Singh — Mon, 05 Sep 2016 04:07:38 -0500

CrossMap is a program for convenient conversion of genome coordinates (or annotation files) between different assemblies (such as Human hg18 (NCBI36) <> hg19 (GRCh37), Mouse mm9 (MGSCv37) <> mm10 (GRCm38)).
It supports most commonly used file formats including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, VCF.
CrossMap is designed to lift over genome coordinates between assemblies. It is not a program for aligning sequences to a reference genome.
We do not recommend using CrossMap to convert genome coordinates between species.

Address of the bookmark: http://crossmap.sourceforge.net/

TEannot

Jit — Thu, 18 Aug 2016 10:02:03 -0500

We advise running the TEdenovo pipeline first, but it is not compulsory. We assume you begin by running the TEannot pipeline on the example provided in the directory "db/" rather than directly on your own genomic sequences. Thus, from now on, the project name is "DmelChr4".

Address of the bookmark: https://urgi.versailles.inra.fr/Tools/REPET/TEannot-tuto

LUMPY

Shruti Paniwala — Thu, 25 Aug 2016 08:05:02 -0500

A probabilistic framework for structural variant discovery.

Ryan M Layer, Colby Chiang, Aaron R Quinlan, and Ira M Hall. 2014. "LUMPY: a Probabilistic Framework for Structural Variant Discovery." Genome Biology 15 (6): R84. doi:10.1186/gb-2014-15-6-r84.

More at https://github.com/arq5x/lumpy-sv

Address of the bookmark: https://github.com/arq5x/lumpy-sv

Ka, Ks and Ka/Ks calculations

Poonam Mahapatra — Mon, 29 Aug 2016 11:44:11 -0500

gKaKs is a codon-based genome-level Ka/Ks computation pipeline developed and based on programs from four widely used packages: BLAT, BLASTALL (including bl2seq, formatdb and fastacmd), PAML (including codeml and yn00) and KaKs_Calculator (including 10 substitution rate estimation methods). gKaKs can automatically detect and eliminate frameshift mutations and premature stop codons to compute the substitution rates (Ka, Ks and Ka/Ks) between a well-annotated genome and a non-annotated genome or even a poorly assembled scaffold dataset. It is especially useful for newly sequenced genomes that have not been well annotated.

Look for KaKs calculation:

https://github.com/fumba/kaks-calculator

http://longlab.uchicago.edu/?q=gKaKs

http://www.ncbi.nlm.nih.gov/pubmed/23314322

Address of the bookmark: http://longlab.uchicago.edu/?q=gKaKs

Redundans

Jit — Thu, 01 Sep 2016 08:28:11 -0500

The Redundans pipeline assists the assembly of heterozygous genomes.
The program takes assembled contigs and paired-end and/or mate-pair sequencing libraries as input, and returns a scaffolded homozygous genome assembly that should be less fragmented and have a smaller total size than the input contigs. In addition, Redundans will automatically close the gaps resulting from genome assembly or scaffolding.

The pipeline consists of three steps/modules:

redundancy reduction: detection and selective removal of redundant contigs from an initial de novo assembly
scaffolding: joining of genome fragments using paired-end and/or mate-pairs reads
gap closing

Redundans is:

fast & lightweight, with multi-core support and memory optimisation, so it can be run even on a laptop for small-to-medium size genomes
flexible toward many sequencing technologies (Illumina, 454 or Sanger) and library types (paired-end, mate pairs, fosmids)
modular: every step can be omitted or replaced by other tools

Address of the bookmark: https://github.com/Gabaldonlab/redundans

Assembly tutorial PPT

Jit — Wed, 07 Sep 2016 03:12:53 -0500

Saved Cornell University assembly workshop PPT.

Reference:

http://cbsu.tc.cornell.edu/lab/doc/assembly_workshop_20150420_lecture1.pdf