BOL: Related items

Linux command line exercises for NGS data processing

Jit — Wed, 22 Jun 2016 07:59:39 -0500

The purpose of this tutorial is to introduce students to the frequently used tools for NGS analysis and to give them experience in writing one-liners. Copy the required files to your current directory, change directory (cd) to the linuxTutorial folder, and do all the processing inside it:

[uzi@quince-srv2 ~/]$ cp -r /home/opt/MScBioinformatics/linuxTutorial .
[uzi@quince-srv2 ~/]$ cd linuxTutorial
[uzi@quince-srv2 ~/linuxTutorial]$

I have deliberately chosen awk for the exercises because it is a language in itself and is used more often to manipulate NGS data than other command-line tools such as grep, sed, and perl. Furthermore, having a command of awk will make it easier to understand advanced tutorials such as Illumina Amplicons Processing Workflow.

In Linux, we use a shell, a program that takes your commands from the keyboard and passes them to the operating system. Most Linux systems use the Bourne Again SHell (bash), but a typical Linux system has several other shell programs, such as ksh, tcsh, and zsh. To see which shell you are using, type

[uzi@quince-srv2 ~/linuxTutorial]$ echo $SHELL

/bin/bash

Address of the bookmark: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/linux.html

Kaiju

Jit — Mon, 27 Jun 2016 11:23:04 -0500

Kaiju is a program for the taxonomic classification of metagenomic high-throughput sequencing reads. Each read is directly assigned to a taxon within the NCBI taxonomy by comparing it to a reference database containing microbial and viral protein sequences.

By default, Kaiju uses either the available complete genomes from NCBI RefSeq or the microbial subset of the non-redundant protein database nr used by NCBI BLAST, optionally also including fungi and microbial eukaryotes.

Kaiju translates reads into amino acid sequences, which are then searched in the database using a modified backward search on a memory-efficient implementation of the Burrows-Wheeler transform, which finds maximum exact matches (MEMs), optionally allowing mismatches in the protein alignment. The search can process up to millions of reads per minute using, for example, only 10 GB RAM with a protein database comprising 4821 microbial genomes. Kaiju can also be used for querying any other protein database without taxonomic classification, using either protein or nucleotide queries.
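
As a rough illustration of the translation step described above, the following Python sketch translates a nucleotide read in all six reading frames using the standard genetic code. It is not Kaiju's own code (Kaiju's search over the Burrows-Wheeler transform is not shown), and the function names `translate`, `reverse_complement`, and `six_frame_translations` are hypothetical names chosen for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(read):
    """Return the three forward and three reverse-strand translations."""
    read = read.upper()
    frames = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            frames.append(translate(strand[offset:]))
    return frames


if __name__ == "__main__":
    for frame in six_frame_translations("ATGGCCTAA"):
        print(frame)
```

Stop codons are shown as `*`. A protein-level classifier would typically search the fragments between stop codons rather than the whole translated frame.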

Kaiju is described in Menzel, P. et al. (2016) Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nat. Commun. 7:11257 (open access).

Address of the bookmark: http://kaiju.binf.ku.dk/

WiseScaffolder

Poonam Mahapatra — Wed, 13 Jul 2016 08:08:57 -0500

Function

WiseScaffolder is a stand-alone, semi-automatic application for genome scaffolding of pre-assembled contigs using mate-pair data. It also produces editable scaffold maps, which can be used either to build gapped scaffolds or as a common thread for the manual improvement of scaffolds.

Description

WiseScaffolder includes 4 subcommands:

dumpconfig generates a configuration file that notably specifies the average insert size of the mate-pair library.

preprocess detects and corrects chimeras, estimates contig copy number, and produces valuable outputs for the manual improvement of scaffolds.

scaffold constitutes the central scaffold-builder and comprises two modules:

i) the interative_scaffold_extender, which works with big, unambiguous contigs or, when they run out, single-copy contigs, and

ii) the small_contig_inserter, which inserts the small contigs within scaffolds.

buildfasta converts the scaffold map(s) into Fasta sequences.

Address of the bookmark: http://abims.sb-roscoff.fr/wisescaffolder

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is fast, approximate software for mapping long reads (PacBio/ONT) or an assembly to one or more reference genomes. It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
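
To make the conversion from Jaccard similarity to identity concrete, here is a minimal Python sketch. It computes the exact Jaccard similarity j of two k-mer sets and applies the Mash distance D = -(1/k) ln(2j / (1 + j)), with identity estimated as 1 - D. This is a simplification: MashMap estimates j from winnowed MinHash sketches rather than from full k-mer sets, and the function names below are hypothetical.

```python
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_distance(j, k):
    """Mash distance for Jaccard similarity j and k-mer size k, clamped to [0, 1]."""
    if j <= 0:
        return 1.0
    return min(1.0, max(0.0, -math.log(2 * j / (1 + j)) / k))


def estimated_identity(query, reference, k):
    """Estimate sequence identity as 1 minus the Mash distance."""
    j = jaccard(kmers(query, k), kmers(reference, k))
    return 1.0 - mash_distance(j, k)


if __name__ == "__main__":
    ref = "ACGTACGGTCAGTTAGCCATGCATTGACCGATAGG"
    qry = "ACGTACGGTCAGTTAGCGATGCATTGACCGATAGG"  # one substitution
    print(round(estimated_identity(qry, ref, k=8), 3))
```

A mapper following this scheme would report a query only when the estimated identity is above the chosen threshold.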

Address of the bookmark: https://github.com/marbl/MashMap

valet

Jit — Thu, 22 Sep 2016 04:27:09 -0500

VALET is a pipeline for performing de novo validation of metagenomic assemblies. VALET checks a number of properties that should hold true for a correct assembly (e.g., mate-pairs are aligned at the correct distance from each other in the assembly, the depth of coverage is fairly uniform along contigs, etc.). Violations of these invariants are reported, allowing one to pinpoint areas that were potentially mis-assembled or to compare the quality of different assemblies. For comparing multiple assemblies of the same datasets, VALET also reports an overall estimate of the likelihood that a particular assembly is correct.
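
The coverage invariant mentioned above can be illustrated with a small Python sketch that flags windows of a contig whose mean depth is far from the contig's typical depth. This is only an illustration of the idea, not VALET's method: the function name, the window size, and the 0.5x/2x thresholds are all assumptions made for this example.

```python
from statistics import median


def flag_coverage_outliers(depths, window=100, low=0.5, high=2.0):
    """Return (start, end, mean_depth) for windows with unusual coverage.

    depths is a list of per-base read depths along one contig. A window is
    flagged when its mean depth is below low * median or above high * median
    of all window means (thresholds are illustrative assumptions).
    """
    windows = [depths[i:i + window] for i in range(0, len(depths), window)]
    means = [sum(w) / len(w) for w in windows]
    if not means:
        return []
    typical = median(means)
    flagged = []
    for idx, mean_depth in enumerate(means):
        if mean_depth < low * typical or mean_depth > high * typical:
            start = idx * window
            flagged.append((start, start + len(windows[idx]), mean_depth))
    return flagged


if __name__ == "__main__":
    # A contig with a 100 bp dip in coverage starting at position 300.
    depths = [30] * 300 + [5] * 100 + [30] * 200
    print(flag_coverage_outliers(depths))
```

Flagged windows are candidates for closer inspection, since a sharp drop or spike in depth can indicate a collapsed repeat or a chimeric join.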

Home Page:

VALET code repository

Address of the bookmark: https://www.cbcb.umd.edu/software/valet

RepeatModeler

Jit — Thu, 18 Aug 2016 09:57:15 -0500

RepeatModeler is a de-novo repeat family identification and modeling package. At the heart of RepeatModeler are two de-novo repeat finding programs (RECON and RepeatScout) which employ complementary computational methods for identifying repeat element boundaries and family relationships from sequence data. RepeatModeler assists in automating the runs of RECON and RepeatScout given a genomic database and uses the output to build, refine and classify consensus models of putative interspersed repeats.

Address of the bookmark: http://www.repeatmasker.org/RepeatModeler.html

FinisherSC: a repeat-aware tool for upgrading de novo assembly using long reads

Jit — Mon, 20 Aug 2018 04:08:50 -0500

Here is the command to run the tool:

python finisherSC.py destinedFolder mummerPath

If you are running on a server and would like to use multiple threads, the following command runs FinisherSC with 20 threads:

python finisherSC.py -par 20 destinedFolder mummerPath

Sometimes, if the names of raw reads and contigs contain special characters or formats, FinisherSC/MUMmer may not parse them correctly. In that case, you can quickly rename the contigs/reads in contigs.fasta or raw_reads.fasta using the following commands.

    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' raw_reads.fasta > newRaw_reads.fasta
    cp newRaw_reads.fasta raw_reads.fasta
    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' contigs.fasta > newContigs.fasta
    cp newContigs.fasta contigs.fasta
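
For readers less familiar with perl, the same renaming can be sketched in Python. This is an assumed equivalent rather than part of FinisherSC: it replaces every FASTA header line (a line starting with `>`) with `>Seg1`, `>Seg2`, and so on, and leaves sequence lines untouched. The function names and the `prefix` parameter are hypothetical.

```python
def rename_fasta_headers(lines, prefix="Seg"):
    """Yield FASTA lines with each header replaced by >prefix1, >prefix2, ..."""
    n = 0
    for line in lines:
        if line.startswith(">"):
            n += 1
            yield f">{prefix}{n}\n"
        else:
            yield line


def rename_fasta_file(src, dst, prefix="Seg"):
    """Write a copy of src to dst with sequentially renamed headers."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(rename_fasta_headers(fin, prefix))


if __name__ == "__main__":
    example = [">read/1 length=4\n", "ACGT\n", ">read|2\n", "GGCC\n"]
    print("".join(rename_fasta_headers(example)), end="")
```

As with the perl commands, write to a new file first and copy it over the original only after checking the result.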

Address of the bookmark: https://github.com/kakitone/finishingTool

LUMPY

Shruti Paniwala — Thu, 25 Aug 2016 08:05:02 -0500

A probabilistic framework for structural variant discovery.

Ryan M Layer, Colby Chiang, Aaron R Quinlan, and Ira M Hall. 2014. "LUMPY: a Probabilistic Framework for Structural Variant Discovery." Genome Biology 15 (6): R84. doi:10.1186/gb-2014-15-6-r84.

More at https://github.com/arq5x/lumpy-sv

Address of the bookmark: https://github.com/arq5x/lumpy-sv

Ka, Ks and Ka/Ks calculations

Poonam Mahapatra — Mon, 29 Aug 2016 11:44:11 -0500

gKaKs is a codon-based genome-level Ka/Ks computation pipeline developed and based on programs from four widely used packages: BLAT, BLASTALL (including bl2seq, formatdb and fastacmd), PAML (including codeml and yn00) and KaKs_Calculator (including 10 substitution rate estimation methods). gKaKs can automatically detect and eliminate frameshift mutations and premature stop codons to compute the substitution rates (Ka, Ks and Ka/Ks) between a well-annotated genome and a non-annotated genome or even a poorly assembled scaffold dataset. It is especially useful for newly sequenced genomes that have not been well annotated.
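
To show what the reported quantities mean, the following Python sketch computes Ka, Ks and Ka/Ks from already-counted differences and sites, applying the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3) to each proportion, as in the final step of a Nei-Gojobori style estimate. It is not part of gKaKs, which relies on codeml, yn00 and KaKs_Calculator; the function names and the input counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def ka_ks(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (Ka, Ks, Ka/Ks); the ratio is None when Ks is zero."""
    ka = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ks = jukes_cantor(syn_diffs / syn_sites)
    ratio = ka / ks if ks > 0 else None
    return ka, ks, ratio


if __name__ == "__main__":
    # Illustrative counts only, not taken from any real alignment.
    ka, ks, ratio = ka_ks(nonsyn_diffs=6, nonsyn_sites=300,
                          syn_diffs=12, syn_sites=100)
    print(f"Ka={ka:.4f} Ks={ks:.4f} Ka/Ks={ratio:.3f}")
```

A Ka/Ks ratio below 1 is usually read as purifying selection, a ratio near 1 as neutral evolution, and a ratio above 1 as possible positive selection.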

Resources for Ka/Ks calculation:

https://github.com/fumba/kaks-calculator

http://longlab.uchicago.edu/?q=gKaKs

http://www.ncbi.nlm.nih.gov/pubmed/23314322

Address of the bookmark: http://longlab.uchicago.edu/?q=gKaKs

BRAKER: pipeline for fully automated prediction of protein coding genes with GeneMark-ES/ET and AUGUSTUS in novel eukaryotic genomes

Jit — Thu, 01 Sep 2016 08:02:59 -0500

Gene finding in eukaryotic genomes is notoriously difficult to automate. The task is to design a workflow with a minimal set of tools that would reach state-of-the-art performance across a wide range of species. GeneMark-ET is a gene prediction tool that incorporates RNA-Seq data into unsupervised training and subsequently generates ab initio gene predictions. AUGUSTUS is a gene finder that usually requires supervised training and uses information from RNA-Seq reads in the prediction step. The complementary strengths of GeneMark-ET and AUGUSTUS provided the motivation for designing a new combined tool for automatic gene prediction.

http://www.ncbi.nlm.nih.gov/pubmed/26559507

Address of the bookmark: http://bioinf.uni-greifswald.de/bioinf/braker/