BOL: Related items

coursera genome assembly tutorial

Jit — Sat, 25 Nov 2017 08:57:25 -0600

Solutions to Coursera Genome Sequencing (Bioinformatics II)

Address of the bookmark: https://github.com/iansealy/coursera-assembly

Tools for bacterial whole genome annotation

Radha Agarkar — Sat, 16 Dec 2017 17:37:47 -0600

RAST – Web tool (upload contigs), uses the subsystems in the SEED database and provides detailed annotation and pathway analysis. Takes several hours per genome but I think this is the best way to get a high quality annotation (if you have only a few genomes to annotate).

Prokka – Standalone command line tool, takes just a few minutes per genome. This is the best way to get good quality annotation in a flash, which is particularly useful if you have loads of genomes or need to annotate a pangenome or metagenome. Note however that the quality of functional information is not as good as RAST, and you will need several extra steps if you want to do functional profiling and pathway analysis of your genome(s)… which is in-built in RAST.

NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.

PGAP: NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Automatic Annotation Pipeline (PGAAP), developed in 2005, has been replaced with an upgraded version that is capable of processing a larger data volume. NCBI's annotation pipeline depends on several internal databases and is not currently available for download or use outside of the NCBI environment.

BEACON (automated tool for Bacterial GEnome Annotation ComparisON) is a fast tool for the automated and systematic comparison of different annotations of single genomes. Its extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/.

BlastKOALA: Assigns K numbers to the user's sequence data by BLAST searches against a nonredundant set of KEGG GENES. KOALA (KEGG Orthology And Links Annotation) is KEGG's internal annotation tool for K number assignment of KEGG GENES using SSEARCH computation. Annotate Sequence in KEGG Mapper and Pathogen Checker in KEGG Pathogen are special interfaces to this server and can be executed in an interactive mode. BlastKOALA is suitable for annotating fully sequenced genomes.

PAGIT: Provides a toolkit for improving the quality of genome assemblies produced by assembly software. PAGIT combines four tools: (i) ABACAS, which classifies and orientates contigs and estimates the sizes of gaps between them; (ii) IMAGE, which uses paired-end reads to extend contigs and close gaps within the scaffolds; (iii) ICORN, which identifies and corrects small errors in consensus sequences; and (iv) RATT, which helps with annotation. The software was mainly created to analyze parasite genomes of up to about 300 Mb.

MAKER: A portable and easily configurable genome annotation pipeline. MAKER allows smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases. It identifies repeats, aligns ESTs and proteins to a genome, produces ab-initio gene predictions and automatically synthesizes these data into gene annotations having evidence-based quality values. MAKER's inputs are minimal and its outputs can be directly loaded into a Generic Model Organism Database (GMOD). They can also be viewed in the Apollo genome browser; this feature of MAKER provides an easy means to annotate, view and edit individual contigs and BACs without the overhead of a database. MAKER is available for download and can be tested online via the MAKER Web Annotation Service (MWAS).

MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease.

Magic-BLAST: a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome.

Jit — Tue, 26 Dec 2017 22:23:39 -0600

Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score, taking into account simultaneously the two reads of a pair, and in case of RNA-seq, locating the candidate introns and adding up the score of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.

Magic-BLAST incorporates within the NCBI BLAST code framework ideas developed in the NCBI Magic pipeline, in particular hit extensions by local walk and jump (http://www.ncbi.nlm.nih.gov/pubmed/26109056), and recursive clipping of mismatches near the edges of the reads, which avoids accumulating artefactual mismatches near splice sites and is needed to distinguish short indels from substitutions near the edges.

Address of the bookmark: https://ncbi.github.io/magicblast/

List of visualization tools for genome alignments

Rahul Nayak — Fri, 02 Feb 2018 13:25:33 -0600

Genome browsers are useful not only for showing final results but also for improving analysis protocols, testing data quality, and generating result drafts. Integrating them into analysis pipelines allows parameters to be optimized, which leads to better results. Sometimes, however, we need publication-ready figures of genomes. The following genome alignment visualization tools can be useful for analysis and interpretation of results:

ABySS Explorer

Interactive Java application that uses a novel graph-based representation to display a sequence assembly and associated metadata

http://www.bcgsc.ca/platform/bioinfo/software/abyss-explorer

BamView

Genome browser and annotation tool that allows visualization of sequence features, next-generation sequencing (NGS) data and the results of analyses within the context of the sequence, and also its six-frame translation

http://www.sanger.ac.uk/resources/software/artemis/

DNannotator

Annotation web toolkit for regional genomic sequences

http://bioapp.psych.uic.edu/DNannotator.htm

JVM

Java Visual Mapping tool for NGS reads

http://www.springer.com/cda/content/document/cda_downloaddocument/9789401792448-c2.pdf?SGWID=0-0-45-1487072-p176815501

LookSeq

Web-based visualization of sequences derived from multiple sequencing technologies. Low- or high-depth read pileups and easy visualization of putative single nucleotide and structural variation

http://lookseq.sourceforge.net

MagicViewer

Visualization of short read alignment, identification of genetic variation and association with annotation information of a reference genome

http://bioinformatics.zj.cn/magicviewer/

MapView

Alignments of huge-scale single-end and pair-end short reads

http://omictools.com/mapview-s1367.html

MultiPipMaker

Computes alignments of similar regions in two DNA sequences. The resulting alignments are summarized with a ‘percent identity plot’ (pip)

http://pipmaker.bx.psu.edu/pipmaker/

PileLineGUI

Handling genome position files in NGS studies

http://sing.ei.uvigo.es/pileline/pilelinegui.html

SAMtools tview

Simple and fast text alignment viewer; NGS compatible

http://www.htslib.org/

SEWAL

Uses a locality-sensitive hashing algorithm to enumerate all unique sequences in an entire Illumina sequencing run

http://www.sourceforge.net/projects/sewal

STAR

A web-based integrated solution to management and visualization of sequencing data

http://wanglab.ucsd.edu/star/browser

SVA

Software for annotating and visualizing sequenced human genomes

http://www.svaproject.org

Integrative Genomics Viewer (IGV)

Visualization of large heterogeneous datasets, providing a smooth and intuitive user experience at all levels of genome resolution

https://www.broadinstitute.org/igv/

ZOOM Lite

NGS data mapping and visualization software

http://bioinfor.com/zoom/lite/

CrossMap: a program for convenient conversion of genome coordinates

Jit — Thu, 31 May 2018 06:00:47 -0500

CrossMap is a program for convenient conversion of genome coordinates (or annotation files) between different assemblies (such as Human hg18 (NCBI36) <> hg19 (GRCh37), Mouse mm9 (MGSCv37) <> mm10 (GRCm38)). It supports most commonly used file formats including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF, VCF. CrossMap is designed to liftover genome coordinates between assemblies. It’s not a program for aligning sequences to reference genome. We do not recommend using CrossMap to convert genome coordinates between species.
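To make the idea of coordinate conversion concrete, here is a minimal Python sketch of how a position can be lifted through the ungapped blocks of a UCSC-style chain alignment. It is not CrossMap's implementation: the function names (chain_blocks, lift) and the example numbers are made up for illustration, and only forward-strand chains are handled.

```python
from typing import List, Optional, Tuple

# One ungapped aligned block: (source_start, target_start, length), 0-based.
Block = Tuple[int, int, int]


def chain_blocks(src_start: int, dst_start: int,
                 rows: List[Tuple[int, int, int]]) -> List[Block]:
    """Expand chain rows (size, gap_in_source, gap_in_target) into blocks.

    Each row says: 'size' bases align without gaps, then the source assembly
    skips gap_in_source bases and the target skips gap_in_target bases.
    """
    blocks: List[Block] = []
    src, dst = src_start, dst_start
    for size, gap_src, gap_dst in rows:
        blocks.append((src, dst, size))
        src += size + gap_src
        dst += size + gap_dst
    return blocks


def lift(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a 0-based source position to the target assembly.

    Returns None when the position falls in a gap (no aligned block).
    """
    for src, dst, size in blocks:
        if src <= pos < src + size:
            return dst + (pos - src)
    return None


# Hypothetical chain: source starts at 1000, target at 5000.
example_blocks = chain_blocks(1000, 5000, [(100, 10, 0), (50, 0, 5), (200, 0, 0)])
print(lift(1050, example_blocks))  # inside the first block
print(lift(1105, example_blocks))  # inside a source-only gap
```

A position that falls in a gap between blocks has no counterpart in the target assembly, which is why liftover tools report some records as unmapped.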

Address of the bookmark: http://crossmap.sourceforge.net

assemblytics: delta file to analyze alignments of an assembly to another assembly or a reference genome

Jit — Thu, 14 Jun 2018 07:31:00 -0500

Download and install MUMmer.

Align your assembly to a reference genome using nucmer (from the MUMmer package):

$ nucmer -maxmatch -l 100 -c 500 REFERENCE.fa ASSEMBLY.fa -prefix OUT

Consult the MUMmer manual if you encounter problems.

Optional: gzip the delta file to speed up the upload (usually 2-4X faster):

$ gzip OUT.delta

Then use the OUT.delta.gz file for upload.

Upload the .delta or .delta.gz file to Assemblytics.

Important: Use only contigs rather than scaffolds from the assembly. This prevents false positives when the number of Ns in the scaffolded sequence does not exactly match the distance in the reference.

The unique sequence length required represents an anchor for determining if a sequence is unique enough to safely call variants from, which is an alternative to the mapping quality filter for read alignment.
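If only a scaffolded assembly is at hand, one way to follow the "contigs rather than scaffolds" advice is to cut each scaffold at its runs of N before aligning. The Python sketch below shows the idea; the function name split_scaffold, the contig naming scheme and the min_gap parameter are illustrative assumptions, not part of Assemblytics or MUMmer.

```python
import re
from typing import Iterator, Tuple


def split_scaffold(name: str, seq: str, min_gap: int = 1) -> Iterator[Tuple[str, str]]:
    """Yield (contig_name, contig_sequence) by cutting a scaffold at N runs.

    Runs of at least min_gap consecutive N/n characters are treated as gaps
    and removed; shorter runs are left inside the contig.
    """
    pieces = re.split(r"[Nn]{%d,}" % min_gap, seq)
    index = 0
    for piece in pieces:
        if piece:  # skip empty pieces from leading or trailing gaps
            index += 1
            yield f"{name}_contig{index}", piece


for contig_name, contig_seq in split_scaffold("scaffold1", "ACGTNNNNNGGCCNNTT"):
    print(f">{contig_name}\n{contig_seq}")
```

Raising min_gap keeps isolated ambiguous bases inside a contig and splits only at longer gap runs.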

Address of the bookmark: http://assemblytics.com/

LINKS scaffolder bloomfilter setting !

Jit — Fri, 15 Jun 2018 10:39:54 -0500

➜ bin git:(master) ✗ ls -l
total 68
drwxrwxr-x 3 urbe urbe 4096 Jun 15 12:15 lib
-rwxrwxrwx 1 urbe urbe 65141 Jun 15 17:13 LINKS
➜ bin git:(master) ✗ pwd
/home/urbe/Tools/LINKS_1.8.6/bin

➜ bloomfilter git:(master) ✗ swig -Wall -c++ -perl5 BloomFilter.i
➜ bloomfilter git:(master) ✗ g++ -c BloomFilter_wrap.cxx -I/home/urbe/anaconda3/lib/perl5/5.22.0/x86_64-linux-thread-multi/CORE/ -fPIC -Dbool=char -O3
BloomFilter_wrap.cxx:1892:30: fatal error: ../BloomFilter.hpp: No such file or directory
compilation terminated.
➜ bloomfilter git:(master) ✗ cd swig
➜ swig git:(master) ✗ g++ -c BloomFilter_wrap.cxx -I/home/urbe/anaconda3/lib/perl5/5.22.0/x86_64-linux-thread-multi/CORE/ -fPIC -Dbool=char -O3
In file included from BloomFilter_wrap.cxx:1877:0:
../BloomFilter.hpp: In member function ‘void BloomFilter::loadHeader(FILE*)’:
../BloomFilter.hpp:141:59: warning: ignoring return value of ‘size_t fread(void*, size_t, size_t, FILE*)’, declared with attribute warn_unused_result [-Wunused-result]
fread(&header, sizeof(struct FileHeader), 1, file);
^
➜ swig git:(master) ✗ g++ -Wall -shared BloomFilter_wrap.o -o BloomFilter.so -O3
➜ swig git:(master) ✗ cd ..
➜ bloomfilter git:(master) ✗ cd ..
➜ lib git:(master) ✗ cd ..
➜ bin git:(master) ✗ ./LINKS
Usage: ./LINKS [v1.8.6]
-f sequences to scaffold (Multi-FASTA format, required)
-s file-of-filenames, full path to long sequence reads or MPET pairs [see below] (Multi-FASTA/fastq format, required)
-m MPET reads (default -m 1 = yes, default = no, optional)
! DO NOT SET IF NOT USING MPET. WHEN SET, LINKS WILL EXPECT A SPECIAL FORMAT UNDER -s
! Paired MPET reads in their original outward orientation <- -> must be separated by ":"
>template_name
ACGACACTATGCATAAGCAGACGAGCAGCGACGCAGCACG:ATATATAGCGCACGACGCAGCACAGCAGCAGACGAC
-d distance between k-mer pairs (ie. target distances to re-scaffold on. default -d 4000, optional)
Multiple distances are separated by comma. eg. -d 500,1000,2000,3000
-k k-mer value (default -k 15, optional)
-t step of sliding window when extracting k-mer pairs from long reads (default -t 2, optional)
Multiple steps are separated by comma. eg. -t 10,5
-o offset position for extracting k-mer pairs (default -o 0, optional)
-e error (%) allowed on -d distance e.g. -e 0.1 == distance +/- 10% (default -e 0.1, optional)
-l minimum number of links (k-mer pairs) to compute scaffold (default -l 5, optional)
-a maximum link ratio between two best contig pairs (default -a 0.3, optional)
*higher values lead to least accurate scaffolding*
-z minimum contig length to consider for scaffolding (default -z 500, optional)
-b base name for your output files (optional)
-r Bloom filter input file for sequences supplied in -s (optional, if none provided will output to .bloom)
NOTE: BLOOM FILTER MUST BE DERIVED FROM THE SAME FILE SUPPLIED IN -f WITH SAME -k VALUE
IF YOU DO NOT SUPPLY A BLOOM FILTER, ONE WILL BE CREATED (.bloom)
-p Bloom filter false positive rate (default -p 0.001, optional; increase to prevent memory allocation errors)
-x Turn off Bloom filter functionality (-x 1 = yes, default = no, optional)
-v Runs in verbose mode (-v 1 = yes, default = no, optional)

Error: Missing mandatory options -f and -s.
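The -k, -d, -t and -o options in the usage text describe how k-mer pairs are pulled from long reads. The Python sketch below illustrates that idea only; it is not the LINKS code. In particular, it assumes the distance is measured between the start positions of the two k-mers, which may differ from how LINKS defines -d, and the function name kmer_pairs is made up.

```python
from typing import Iterator, Tuple


def kmer_pairs(read: str, k: int = 15, distance: int = 4000,
               step: int = 2, offset: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, first_kmer, second_kmer) along a long read.

    The window starts at 'offset' and slides by 'step'. The second k-mer
    starts 'distance' bases after the first one (an assumption made here).
    """
    last_start = len(read) - distance - k  # last start where both k-mers fit
    for i in range(offset, last_start + 1, step):
        yield i, read[i:i + k], read[i + distance:i + distance + k]


# Toy read with tiny parameters so the pairs are easy to check by eye.
for pos, first, second in kmer_pairs("AACCGGTTAC", k=3, distance=4, step=2):
    print(pos, first, second)
```

A larger step yields fewer pairs per read, which trades sensitivity for speed and memory.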

Error fixed:

perl: symbol lookup error: /home/urbe/Tools/LINKS_new/bin/./lib/bloomfilter/swig/BloomFilter.so: undefined symbol: Perl_Gthr_key_ptr
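On the -p note in the usage text (increase the false positive rate to prevent memory allocation errors): this follows from how Bloom filters are sized. The Python sketch below uses the textbook formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m/n) ln 2 hash functions. Whether LINKS sizes its filter exactly this way is an assumption, and the function names are illustrative.

```python
import math


def bloom_bits(n_items: int, fpr: float) -> int:
    """Bits needed to hold n_items at the given false positive rate."""
    if n_items <= 0 or not 0.0 < fpr < 1.0:
        raise ValueError("n_items must be positive and 0 < fpr < 1")
    return math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2))


def bloom_hashes(n_items: int, n_bits: int) -> int:
    """Number of hash functions that minimizes false positives."""
    return max(1, round(n_bits / n_items * math.log(2)))


# Compare memory for a strict and a relaxed false positive rate.
n = 1_000_000_000  # hypothetical number of k-mers
for p in (0.001, 0.01):
    bits = bloom_bits(n, p)
    print(f"p={p}: {bits / 8 / 1e9:.2f} GB, {bloom_hashes(n, bits)} hash functions")
```

Under these formulas, relaxing the rate from 0.001 to 0.01 cuts the filter size by about a third.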

Converting a VCF into a FASTA given some reference !

Jit — Fri, 20 Jul 2018 10:03:53 -0500

Samtools/BCFtools (Heng Li) provides a Perl script, vcfutils.pl, whose vcf2fq function (lines 469-528) does this.

This script has been modified by others to convert indels as well, e.g. the vcf2fq.pl script by David Eccles:

./vcf2fq.pl -f <input.fasta> <all-site.vcf> > <output.fastq>

https://github.com/gringer/bioinfscripts/blob/master/vcf2fq.pl

https://github.com/lh3/samtools/blob/master/bcftools/vcfutils.pl
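For readers who want to see the underlying idea rather than the Perl scripts, here is a minimal Python sketch that applies VCF records to a reference sequence. It is not a port of vcf2fq: it ignores genotypes and quality, always takes the first ALT allele, and assumes the variants do not overlap. The function names are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple

Variant = Tuple[int, str, str]  # (1-based position, REF allele, ALT allele)


def read_vcf_records(lines: Iterable[str], chrom: str) -> Iterator[Variant]:
    """Yield (pos, ref, alt) for one chromosome from VCF text lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom:
            continue
        alt = fields[4].split(",")[0]  # first ALT allele only (assumption)
        if alt in (".", "*", "") or alt.startswith("<"):
            continue  # no alternative allele, or a symbolic one
        yield int(fields[1]), fields[3], alt


def apply_variants(ref_seq: str, variants: Iterable[Variant]) -> str:
    """Return ref_seq with all variants applied (SNVs and simple indels)."""
    seq = ref_seq
    # Work from the right so earlier coordinates stay valid after indels.
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)].upper() != ref.upper():
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with wrapped lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

A typical call would be to_fasta("sample", apply_variants(reference, read_vcf_records(open("calls.vcf"), "chr1"))), where reference and calls.vcf are placeholders for your own data.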

Scribl : HTML5 canvas genomics graphic library

Jit — Thu, 25 Oct 2018 09:38:53 -0500

Scribl is a JavaScript, Canvas-based graphics library that easily generates biological visuals of genomic regions, alignments, and assembly data. Scribl can also be used in conventional offline pipelines, since everything needed to generate charts can be contained in a single HTML file.

Address of the bookmark: http://chmille4.github.io/Scribl/

pyGenomeTracks: Standalone program and library to plot beautiful genome browser tracks

Neel — Fri, 09 Nov 2018 12:34:23 -0600

pyGenomeTracks aims to produce high-quality genome browser tracks that are highly customizable. Currently, it is possible to plot:

bigwig
bed (many options)
bedgraph
links (represented as arcs)
Hi-C matrices (if HiCExplorer is installed)
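The track types listed above are declared in an INI-style tracks file that is passed to the pyGenomeTracks command. The Python sketch below writes such a file for a bigwig and a bed track and assembles a matching command line. The file names, region and track titles are placeholders, and the option names used (file, title, height, file_type, --tracks, --region, --outFileName) should be checked against the documentation for your installed version.

```python
import configparser
import io
from typing import Dict, List, Tuple


def build_tracks_ini(tracks: List[Tuple[str, Dict[str, str]]]) -> str:
    """Return the text of a tracks file with an x-axis plus the given tracks."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep option names exactly as written
    config["x-axis"] = {}
    for section, options in tracks:
        config[section] = options
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder input files; replace with your own data.
ini_text = build_tracks_ini([
    ("coverage", {"file": "coverage.bw", "title": "coverage",
                  "height": "4", "file_type": "bigwig"}),
    ("genes", {"file": "genes.bed", "title": "genes",
               "height": "6", "file_type": "bed"}),
])
print(ini_text)

# Command to run after saving ini_text as tracks.ini (not executed here).
command = ["pyGenomeTracks", "--tracks", "tracks.ini",
           "--region", "chr1:1000000-2000000", "--outFileName", "tracks.png"]
print(" ".join(command))
```

Generating the file from code is convenient when the same layout has to be repeated for many samples or regions.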

Address of the bookmark: https://github.com/deeptools/pyGenomeTracks