BOL: Related items

Maq: Mapping and Assembly with Quality

Jit — Tue, 22 Nov 2016 04:51:39 -0600

Maq stands for Mapping and Assembly with Quality It builds assembly by mapping short reads to reference sequences. Maq is a project hosted by SourceForge.net. The project page is available athttp://sourceforge.net/projects/maq/. Maq is previously known as mapass2.

Run Maq Now

Follow these steps to try Maq. All you need is a reference sequence file in the FASTA format.

Prepare a reference sequence (ref.fasta). Better a bacterial genome.
Download maq, maq-data and maqview at the download page.
Copy maq, maq.pl and maq_eval.pl to the $PATH or to the same directory.
Simulate diploid reference and read sequences, map reads, call variants and evaluate the results in one go:
```
maq.pl demo ref.fasta calib-30.dat
```
where calib-30.dat is contained in maq-data.

View the alignment:

cd maqdemo/easyrun;
maqindex -i -c consensus.cns all.map;
maqview -c consensus.cns all.map

Even for advanced maq users, running `maq.pl demo' is recommended. You may find something helpful.

Address of the bookmark: http://maq.sourceforge.net

BIMA V3: an aligner customized for mate pair library sequencing

Abhimanyu Singh — Wed, 14 Dec 2016 15:20:00 -0600

Summary: Mate pair library sequencing is an effective and economical method for detecting genomic structural variants and chromosomal abnormalities. Unfortunately, the mapping and alignment of mate pair read pairs to a reference genome is a challenging and
time consuming process for most NGS alignment programs. Large insert sizes, introduction of library preparation protocol artifacts (biotin junction reads, paired-end read contamination, chimeras, etc.), and presence of structural variant breakpoints within reads increases mapping and alignment complexity. We describe an algorithm that is up to 20 times faster and 25% more accurate than popular NGS alignment programs when processing mate pair sequencing.
Availability: http://bioinformaticstools.mayo.edu/research/bima/
Contact: vasmatzis.george@mayo.edu

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2014/02/12/bioinformatics.btu078.full.pdf

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of closely related species in order to order contigs and estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes, ignoring:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
and removing overlapping matches on reference.

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, reconstructing correctly all chromosomes always for CANPA and nearly always for ARATH (Figures in dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly before (fasta2homozygous.py) as any redundancy will likely break the synteny.
pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which breaks synteny.
pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
pyScaf deals with large rearrangements ie. deletions, insertion, inversions, translocations. Note however, this is experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

Mercator

Jit — Mon, 06 Feb 2017 04:20:36 -0600

Our basic strategy in building homology maps is to use exons that are orthologous in multiple genomes as map "anchors." Given K genomes, the steps in the map construction are as follows:

For each genome, obtain a set of exon annotations. These annotations can be a combination of both exon predictions (e.g. Genscan) and annotations that have been experimentally verified (e.g. RefSeq). Ideally, we would like to have these annotations be as sensitive as possible. Specificity is not a concern, as incorrect annotations are not likely not have significant alignments with other gene annotations.
Compare all exons against all exons in other genomes and record significant alignments between exons. Currently, we use BLAT to do this all-vs-all comparison with alignments being performed in protein space.
Construct a graph with each vertex corresponding to a exon and edges between vertices whose corresponding exons have significant alignments.
Identify cliques in this graph. These cliques are potential anchors to be used in the map.
Starting with the largest cliques (those that have exons in all or most of the genomes), join neighboring (adjacent in genomic coordinates, in each genome) cliques to form runs. Smaller cliques that are inconsistent with runs formed by larger cliques are filtered out. After the smallest cliques have been considered, cliques that are not part of a run are discarded.
The extents of each run in each genome are outputted as orthologous segments. The cliques from each run are used to output the exact genomic coordinates of anchors within each orthologous segment. These anchors can be used by genomic alignment programs (such as MAVID) to do a detailed alignment of each orthologous segment.

https://www.biostat.wisc.edu/~cdewey/mercator/

Address of the bookmark: https://www.biostat.wisc.edu/~cdewey/mercator/

Cgaln

Jit — Wed, 22 Feb 2017 05:14:15 -0600

Cgaln (Coarse grained alignment) is a program designed to align a pair of whole genomic sequences of not only bacteria but also entire chromosomes of vertebrates on a nominal desktop computer. Cgaln performs an alignment job in two steps, at the block level and then at the nucleotide level. The former "coarse-grained" alignment can explore genomic rearrangements and reduce the regions to be analyzed in the next step. The latter is devoted to detailed alignment within the limited regions found in the first stage. The output of Cgaln is 'glocal' in the sense that rearrangements are taken into consideration while each alignable region is extended as long as possible. Thus, Cgaln is not only fast and memory-efficient, but also can filter noisy outputs without missing the most important homologous segment pairs.

http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/cgaln/

Address of the bookmark: http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/cgaln/

gbtools: Interactive Visualization of Metagenome Bins in R

Jit — Sun, 26 Mar 2017 15:41:31 -0500

We have developed gbtools, a software package that allows users to visualize metagenomic assemblies by plotting coverage (sequencing depth) and GC values of contigs, and also to annotate the plots with taxonomic information. Different sets of annotations, including taxonomic assignments from conserved marker genes or SSU rRNA genes, can be imported simultaneously; users can choose which annotations to plot. Bins can be manually defined from plots, or be imported from third-party binning tools and overlaid onto plots, such that results from different methods can be compared side-by-side. gbtools reports summary statistics of bins including marker gene completeness, and allows the user to add or subtract bins with each other.

Tool at https://github.com/kbseah/genome-bin-tools

Address of the bookmark: http://journal.frontiersin.org/article/10.3389/fmicb.2015.01451/full

Genome Annotation Transfer Utility (GATU)

Jit — Mon, 29 May 2017 05:54:53 -0500

Genome Annotation Transfer Utility (GATU) was designed to facilitate quick, efficient annotation of similar genomes using genomes that have already been annotated. For example, whenever a new strain of SARS coronavirus is sequenced, it is possible, using GATU, to automatically annotate the new strain using a previously-annotated strain of SARS CoV. This saves researchers from tedious manual annotation of these sequences.

The program utilizes tBLASTn and BLASTn algorithms to map genes from the reference genome (the annotated strain) to the new sequence (the unannotated strain). The goal is to annotate the majority of the new genome’s genes in a single step. ORFs present in the target genome and absent from the reference genome are also identified; these ORFs can be further analyzed using BLAST, VGO and BBB. Afterwards, they can either be accepted for/rejected from annotation. GATU can handle multiple-exon genes as well as mature peptides. Although it was designed for use with viral genomes, GATU can also be used to help annotate larger genomes (ie. bacterial genomes).

The output is saved in GenBank, XML, or EMBL file format.

Address of the bookmark: https://virology.uvic.ca/help/tool-help/help-books/genome-annotation-transfer-utility-gatu-documentation/

funannotate: Eukaryotic Genome Annotation Pipeline

Jit — Wed, 19 Sep 2018 07:47:22 -0500

Funannotate is a genome prediction, annotation, and comparison software package. It was originally written to annotate fungal genomes (small eukaryotes ~ 30 Mb genomes), but has evolved over time to accomodate larger genomes. The impetus for this software package was to be able to accurately and easily annotate a genome for submission to NCBI GenBank. Existing tools (such as Maker) require significant manually editing to comply with GenBank submission rules, thus funannotate is aimed at simplifying the genome submission process.

Address of the bookmark: https://github.com/nextgenusfs/funannotate

Jvarkit : Java utilities for Bioinformatics

Jit — Fri, 08 Jun 2018 09:31:55 -0500

Collection of Java tool kits for bioinformatics works: Jvarkit : Java utilities for Bioinformatics

Address of the bookmark: http://lindenb.github.io/jvarkit/

You can't hide from Genome Hackers

Neel — Sat, 13 Oct 2018 14:17:28 -0500

Young computational biologist named Yaniv Erlich shocked the research world by showing it was possible to unmask the identities of people listed in anonymous genetic databases using only an Internet connection

Paper: http://science.sciencemag.org/content/early/2018/10/10/science.aau4832

More at https://www.wired.com/story/genome-hackers-show-no-ones-dna-is-anonymous-anymore/