BOL: Related items

MIRO : miRNA omics

Jit — Tue, 04 Oct 2016 14:50:48 -0500

The MIRO (miRNA omics) pipeline is a flexible and powerful tool for the analysis of miRNA (or, more generally, short RNA) expression using short-read deep sequencing data. In its present implementation, MIRO is especially adapted for the analysis of reads generated with the Illumina sequencing platform. MIRO can preprocess the Solexa reads, map them flexibly to several reference genomes using one of four different mappers, create differential gene (miRNA) expression profiles, and cluster reads using one of several algorithms. MIRO output is furthermore compatible with software such as genome browsers and miRDeep.

Address of the bookmark: http://seq.crg.es/download/software/Miro/

Understanding Greedy Algorithms

Jit — Mon, 12 Dec 2016 04:37:40 -0600

An introduction to greedy algorithms for biologists.

https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

These pages are also useful on the same topic:

http://learninglover.com/examples.php?id=59

http://www.cs.rpi.edu/~magdon/ps/conference/super_biokdd.pdf

https://ocw.mit.edu/courses/biology/7-91j-foundations-of-computational-and-systems-biology-spring-2014/lecture-slides/MIT7_91JS14_Lecture6.pdf

http://schatzlab.cshl.edu/teaching/AssemblyClass/01.%20Assembly%20Intro.pdf

http://lsl.sinica.edu.tw/Services/Class/files/20150612449.pdf

http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-43.pdf
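
Several of the links above discuss greedy methods in the context of sequence assembly. As a small illustration of the greedy idea, here is a minimal sketch of a greedy shortest common superstring heuristic: repeatedly merge the two strings with the longest suffix-prefix overlap. The function names are hypothetical and are not taken from any of the linked resources, and the greedy result is not guaranteed to be the shortest possible superstring.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_scs(reads):
    """Greedy superstring: always merge the pair with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    if not reads:
        return ""
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            # Nothing overlaps any more: just concatenate what is left.
            return "".join(reads)
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads[0]
```

For example, `greedy_scs(["ABCD", "CDEF", "EFGH"])` merges the reads into a single string containing all three. The locally best merge is taken at every step without looking ahead, which is what makes the method greedy.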

Address of the bookmark: https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of closely related species in order to order contigs and estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes, ignoring:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
and removing overlapping matches on reference.
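
The three filtering steps above can be sketched roughly as follows. This is only an illustration and not pyScaf's actual code: the `Hit` record, the `filter_hits` function, the default cut-off values and the rule of keeping the higher-scoring hit when two hits overlap on the reference are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical alignment record: identity and overlap are fractions (0..1).
Hit = namedtuple("Hit", "query ref rstart rend identity overlap score")


def filter_hits(hits, min_identity=0.5, min_overlap=0.5):
    """Apply cut-offs, keep the best hit per query, drop reference overlaps."""
    # 1. Ignore matches not satisfying the identity and overlap cut-offs.
    passing = [h for h in hits
               if h.identity >= min_identity and h.overlap >= min_overlap]

    # 2. Keep only the best (highest-scoring) match of each query.
    best = {}
    for h in passing:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h

    # 3. Remove matches overlapping on the reference (assumed rule:
    #    the higher-scoring hit wins, intervals are half-open).
    kept = []
    for h in sorted(best.values(), key=lambda h: h.score, reverse=True):
        clash = any(k.ref == h.ref and h.rstart < k.rend and k.rstart < h.rend
                    for k in kept)
        if not clash:
            kept.append(h)

    # Ordering the survivors along each reference gives the contig order.
    return sorted(kept, key=lambda h: (h.ref, h.rstart))
```

Sorting the surviving hits by reference position yields the contig order, and the gaps between consecutive hits on the same chromosome could then serve as rough estimates of the distances between adjacent contigs.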

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, correctly reconstructing all chromosomes in every run for CANPA and in nearly every run for ARATH (Figures in dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.
pyScaf works better with contigs than with scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which break synteny.
pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. Note, however, that this is an experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

PANDASEQ

Shruti Paniwala — Mon, 23 Jan 2017 04:54:32 -0600

PANDASEQ assembles paired-end Illumina reads into sequences, trying to correct for errors and uncalled bases. The assembler reads two files in FASTQ format with quality information. If amplification primers were used (e.g., to isolate a variable region of the 16S gene, or the constant regions around zinc finger binding residues), they can be removed from the sequence during assembly. The final sequence will correct any uncalled bases in the overlapping region using the complementary strand. When mismatches occur in the overlapping region, the base with the better quality score is chosen.
The algorithm is as follows:

1. Find the positions where the forward and reverse primers match best above the threshold and discard the ends of the sequence, including the primer.
2. Pick an overlap to maximise the probability of the forward and reverse reads having come from a single piece of DNA.
3. Identify the masking of the end of the read with the quality score B or # as done by CASAVA and adjust the probabilities in this region.
4. Construct an assembled sequence between the primers and calculate the quality.
5. Check for various constraints, including quality, length, uncalled bases, and user-supplied modules.
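
The rule described above for the overlapping region (fill in uncalled bases from the other read and, on a mismatch, take the base with the better quality score) can be illustrated with a minimal sketch. This is not PANDASEQ's implementation, which scores overlaps probabilistically. The function `merge_overlap` is hypothetical; it assumes that the overlap length is already known, that the reverse read has already been reverse-complemented, and that qualities are given as integer Phred scores.

```python
def merge_overlap(fwd, fwd_qual, rev, rev_qual, overlap):
    """Merge two same-strand reads whose last/first `overlap` bases coincide."""
    if overlap < 0 or overlap > min(len(fwd), len(rev)):
        raise ValueError("overlap must fit inside both reads")

    split = len(fwd) - overlap          # first overlapping position in fwd
    seq = list(fwd[:split])             # forward-only part
    qual = list(fwd_qual[:split])

    for i in range(overlap):
        f_base, f_q = fwd[split + i], fwd_qual[split + i]
        r_base, r_q = rev[i], rev_qual[i]
        if f_base == "N" and r_base != "N":
            base, q = r_base, r_q       # uncalled base filled from the other read
        elif r_base == "N" and f_base != "N":
            base, q = f_base, f_q
        elif f_q >= r_q:
            base, q = f_base, f_q       # agreement, or mismatch won by forward
        else:
            base, q = r_base, r_q       # mismatch won by reverse
        seq.append(base)
        qual.append(q)

    seq.extend(rev[overlap:])           # reverse-only part
    qual.extend(rev_qual[overlap:])
    return "".join(seq), qual
```

In this simplified version the quality of a merged position is simply the quality of the base that was chosen; a real assembler would combine the evidence from both reads.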

http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq_man1.html

Address of the bookmark: http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq_man1.html

Cgaln

Jit — Wed, 22 Feb 2017 05:14:15 -0600

Cgaln (Coarse grained alignment) is a program designed to align a pair of whole genomic sequences of not only bacteria but also entire chromosomes of vertebrates on a nominal desktop computer. Cgaln performs an alignment job in two steps, at the block level and then at the nucleotide level. The former "coarse-grained" alignment can explore genomic rearrangements and reduce the regions to be analyzed in the next step. The latter is devoted to detailed alignment within the limited regions found in the first stage. The output of Cgaln is 'glocal' in the sense that rearrangements are taken into consideration while each alignable region is extended as long as possible. Thus, Cgaln is not only fast and memory-efficient, but also can filter noisy outputs without missing the most important homologous segment pairs.

http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/cgaln/

Address of the bookmark: http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/cgaln/

gbtools: Interactive Visualization of Metagenome Bins in R

Jit — Sun, 26 Mar 2017 15:41:31 -0500

We have developed gbtools, a software package that allows users to visualize metagenomic assemblies by plotting coverage (sequencing depth) and GC values of contigs, and also to annotate the plots with taxonomic information. Different sets of annotations, including taxonomic assignments from conserved marker genes or SSU rRNA genes, can be imported simultaneously; users can choose which annotations to plot. Bins can be manually defined from plots, or be imported from third-party binning tools and overlaid onto plots, such that results from different methods can be compared side-by-side. gbtools reports summary statistics of bins including marker gene completeness, and allows the user to add or subtract bins with each other.

Tool at https://github.com/kbseah/genome-bin-tools

Address of the bookmark: http://journal.frontiersin.org/article/10.3389/fmicb.2015.01451/full

Salzberg lab

Mon, 15 May 2017 05:14:01 -0500

We are a computational biology lab that develops novel methods for analysis of DNA and RNA sequences. Our research includes software for aligning and assembling RNA-seq data, whole-genome assembly, and microbiome analysis. We work closely with biomedical scientists to apply these methods to current problems arising in a broad spectrum of biological and medical research areas. We’re also part of the Center for Computational Biology, a group of 20+ faculty members and their labs at Johns Hopkins working on computational, statistical, and mathematical methods that can turn massive genomic data sets into biologically and clinically useful information.

https://salzberg-lab.org/

DESCHRAMBLER

Jit — Thu, 29 Jun 2017 11:54:59 -0500

Data simulation and benchmarking against other reconstruction tools show that DESCHRAMBLER produces highly accurate reconstructions.

Details of the reconstructed data are available at http://bioinfo.konkuk.ac.kr/DESCHRAMBLER/

Address of the bookmark: https://github.com/jkimlab/DESCHRAMBLER

Rectangle Graph for Repeat Resolution in Genome Assembly

Rahul Nayak — Thu, 28 Dec 2017 09:43:03 -0600

Ultimate tool for resolving repeats in genome assemblies.

Though a specific implementation of the rectangle graph approach is already included in the current SPAdes distribution, we're also releasing the Rectangle Graph Module (RGM) as separate code which can be run independently of SPAdes. Although RGM differs from the current implementation of the rectangle graph approach in SPAdes, we plan to integrate RGM into SPAdes in the future. RGM can be run with other genome assemblers if they use the same graph format as SPAdes files.

For more details see: Nikolay Vyahhi, Son K. Pham, Pavel Pevzner. From de Bruijn Graphs to Rectangle Graphs for Genome Assembly, Lecture Notes in Bioinformatics 7534 (2012), pp. 249-261.

Address of the bookmark: http://bioinf.spbau.ru/en/rectangles

Genome assembly stats plotting

Jit — Wed, 28 Feb 2018 03:45:39 -0600

A de novo genome assembly can be summarised by a number of metrics, including:

Overall assembly length
Number of scaffolds/contigs
Length of longest scaffold/contig
Scaffold/contig N50 and N90
Assembly base composition, in particular percentage GC and percentage Ns
CEGMA completeness
Scaffold/contig length/count distribution
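
Several of the metrics listed above are simple to compute from a list of scaffold or contig lengths. The following is a minimal sketch, assuming the common definition of Nx: the length of the shortest sequence in the smallest set of longest sequences that together cover at least x% of the assembly. The function names are illustrative and are not part of assembly-stats.

```python
def nx(lengths, x=50):
    """Return the Nx value (e.g. N50 for x=50) of a list of sequence lengths."""
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length


def summarise(lengths):
    """Collect a few basic length-based assembly metrics in a dict."""
    return {
        "total_length": sum(lengths),
        "count": len(lengths),
        "longest": max(lengths),
        "N50": nx(lengths, 50),
        "N90": nx(lengths, 90),
    }
```

For lengths 80, 70, 50, 40, 30, 20 and 10 (total 300), the two longest sequences already cover half of the assembly, so N50 is 70; five sequences are needed to reach 90%, so N90 is 30.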

assembly-stats supports two widely used presentations of these values, tabular and cumulative length plots, and introduces an additional circular plot that summarises most commonly used assembly metrics in a single visualisation. Each of these presentations is generated using javascript from a common (JSON) data structure, allowing toggling between alternative views, and each can be applied to a single or multiple assemblies to allow direct comparison of alternate assemblies.

Tabular presentation allows direct comparison of exact values between assemblies; the limitations of this approach lie in the necessary omission of distributions and the challenge of interpreting ratios of values that may vary by several orders of magnitude.

Address of the bookmark: https://github.com/rjchallis/assembly-stats