BOL: Related items

OPERA : Optimal Paired-End Read Assembler

Jit — Fri, 09 Sep 2016 05:28:58 -0500

OPERA (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end/mate-pair/long reads to order and orient the intermediate contigs/scaffolds assembled in a genome assembly project, a process known as scaffolding. OPERA is based on an exact algorithm that is guaranteed to minimize the discordance of scaffolds with the information provided by the paired-end/mate-pair/long reads (for further details see Gao et al., 2011).

Note that since the original publication, we have made significant changes to OPERA (v1.0 onwards) including refinements to its basic algorithm (to reduce local errors, improve efficiency etc.) and incorporated features that are important for scaffolding large genomes (multi-library support, better repeat-handling etc.), in addition to other scalability and usability improvements (bam and gzip support, smaller memory footprint). We therefore encourage you to download and use our latest version: OPERA-LG. In our benchmarks, it has significantly improved corrected N50 and reduced the number of scaffolding errors. Furthermore, our latest release contains the wrapper script OPERA-long-read that enables scaffolding with long-reads from third-generation sequencing technologies (PacBio or Oxford Nanopore). The manuscript describing the new features and algorithms is available at Genome Biology. We look forward to getting your feedback to improve it further.

Address of the bookmark: https://sourceforge.net/p/operasf/wiki/The%20OPERA%20wiki/

DECIPHER

Anjana — Fri, 30 Sep 2016 09:33:12 -0500

DECIPHER is a software toolset that can be used to maintain, analyze, and decipher large amounts of DNA sequence data. To install DECIPHER, see the Downloads page.

To begin using DECIPHER read the "Getting Started DECIPHERing" tutorial. Refer to the PDF documents below for instructions on how to use DECIPHER for various tasks.

Address of the bookmark: http://decipher.cee.wisc.edu/Documentation.html

GeneBreak: a tool to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach

Jit — Sat, 01 Oct 2016 15:15:29 -0500

Development of cancer is driven by somatic alterations, including numerical and structural chromosomal aberrations. Currently, several computational methods are available and are widely applied to detect numerical copy number aberrations (CNAs) of chromosomal segments in tumor genomes. However, there is a lack of computational methods that systematically detect structural chromosomal aberrations by virtue of the genomic location of CNA-associated chromosomal breaks and identify genes that appear non-randomly affected by chromosomal breakpoints across (large) series of tumor samples. ‘GeneBreak’ was developed to systematically identify genes recurrently affected by the genomic location of chromosomal CNA-associated breaks by a genome-wide approach, which can be applied to DNA copy number data obtained by array-Comparative Genomic Hybridization (CGH) or by (low-pass) whole genome sequencing (WGS). First, ‘GeneBreak’ collects the genomic locations of chromosomal CNA-associated breaks that were previously pinpointed by the segmentation algorithm that was applied to obtain CNA profiles. Next, a tailored annotation approach for breakpoint-to-gene mapping is implemented. Finally, dedicated cohort-based statistics are incorporated, with correction for covariates that influence the probability of being a breakpoint gene. In addition, multiple testing correction is integrated to reveal recurrent breakpoint events. This easy-to-use algorithm, ‘GeneBreak’, is implemented in R (www.cran.r-project.org) and is available from Bioconductor (www.bioconductor.org/packages/release/bioc/html/GeneBreak.html).
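
The first two steps described above (collecting breakpoints from segmented profiles, then mapping them to genes) can be sketched roughly as follows. This is not GeneBreak's code, which is an R package. The function names, the tuple layouts, and the choice to place a breakpoint at the start of the segment that follows a copy-number change are all assumptions made for illustration. The cohort statistics, covariate correction and multiple-testing step are left out.

```python
from collections import Counter

def segment_breakpoints(segments):
    """Return (chrom, position) for each copy-number change between adjacent segments.

    segments: list of (chrom, start, end, copy_state), sorted by position.
    Assumption: the break is placed at the start of the following segment.
    """
    bps = []
    for (c1, _s1, _e1, v1), (c2, s2, _e2, v2) in zip(segments, segments[1:]):
        if c1 == c2 and v1 != v2:
            bps.append((c2, s2))
    return bps

def recurrent_break_genes(samples, genes):
    """Count, per gene, the number of samples with at least one break inside it.

    samples: dict of sample name -> segment list
    genes:   list of (name, chrom, start, end)
    """
    counts = Counter()
    for segments in samples.values():
        bps = segment_breakpoints(segments)
        hit = {name for name, chrom, start, end in genes
               if any(c == chrom and start <= pos <= end for c, pos in bps)}
        counts.update(hit)
    return counts

# Hypothetical toy data: two genes, three samples.
genes = [("GENE_A", "chr1", 100, 200), ("GENE_B", "chr1", 500, 600)]
samples = {
    "s1": [("chr1", 0, 150, 0), ("chr1", 150, 1000, 1)],
    "s2": [("chr1", 0, 180, 0), ("chr1", 180, 550, -1), ("chr1", 550, 1000, 0)],
    "s3": [("chr1", 0, 1000, 0)],
}
counts = recurrent_break_genes(samples, genes)
print(counts["GENE_A"], counts["GENE_B"])
```

A gene hit in many samples is only a candidate. Judging whether the recurrence is non-random is what the statistical steps in the package are for.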

Address of the bookmark: http://www.bioconductor.org/packages/release/bioc/html/GeneBreak.html

eFORGE.v1.2

Jit — Fri, 28 Oct 2016 09:06:59 -0500

The eFORGE tool provides a method to view the tissue-specific regulatory component of a set of EWAS DMPs. eFORGE analysis takes a set of DMPs, such as hits above the genome-wide significance threshold in an EWAS, and analyses whether there is enrichment for overlap with putative functional elements compared to matched background DMPs. It assesses enrichment on a per-cell-type basis, since functional elements are differentially active in different cell types, and hence can expose tissue-specific signals of enrichment for the given test DMP set. This can reveal the sites of action underlying the EWAS signal, and provide confirmation of the validity of the EWAS where a tissue-specific mechanism is known or expected for the phenotype. Conversely, unknown tissue involvements can also be revealed.
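
To make the idea of per-cell-type enrichment against a background concrete, here is a minimal sketch. It uses a one-sided binomial tail probability as a stand-in test. That choice, the function names and the numbers are all assumptions for illustration and are not taken from the eFORGE documentation.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def enrichment_by_cell_type(test_overlaps, background_rates, n_dmps):
    """Return a p-value per cell type.

    test_overlaps:    cell type -> number of test DMPs overlapping functional elements
    background_rates: cell type -> overlap fraction seen in matched background DMPs
    n_dmps:           size of the test DMP set
    """
    return {cell: binomial_tail(k, n_dmps, background_rates[cell])
            for cell, k in test_overlaps.items()}

# Hypothetical counts for a test set of 50 DMPs.
pvals = enrichment_by_cell_type(
    test_overlaps={"blood": 12, "liver": 3},
    background_rates={"blood": 0.05, "liver": 0.05},
    n_dmps=50,
)
for cell, p in sorted(pvals.items(), key=lambda kv: kv[1]):
    print(cell, p)
```

With the same background rate in both cell types, the cell type with the larger overlap gets the smaller p-value. A tissue-specific signal would show up in this way.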

Address of the bookmark: http://eforge.cs.ucl.ac.uk/eFORGE.v1.2/?documentation

Standardized velvet assembly report

Poonam Mahapatra — Fri, 09 Dec 2016 03:59:59 -0600

Requirements:

velvet (velveth and velvetg should be in your PATH)
R (with Sweave)
pdflatex (usually part of TeTeX)
ggplot2, proto and xtable (from the R prompt type install.packages(c("ggplot2","proto","xtable")))
Perl

Optional:

BLAT or BLAST (to generate alignments against a reference genome). If using BLAT, add faToTwoBit,gfClient,gfServer to your PATH. If using BLAST, add blastall and formatdb.

Edit permute.sh to your liking, paying particular attention to the kmer, cvCut, expCov, and other flags

To Run:

perl fastaAllSize mysequences.fa > mysequences.stat
or, for gzipped input:
gunzip -c mysequences.fa.gz | fastaAllSize > mysequences.stat
Substitute fastqAllSize for fastq files.
./permute.sh mysequences (leave out the .fa)

Address of the bookmark: https://github.com/leipzig/standardized-velvet-assembly-report

Understanding Greedy Algorithms

Jit — Mon, 12 Dec 2016 04:37:40 -0600

An introduction to greedy algorithms for biologists.

https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

These pages are also useful on the same topic:

http://learninglover.com/examples.php?id=59

http://www.cs.rpi.edu/~magdon/ps/conference/super_biokdd.pdf

https://ocw.mit.edu/courses/biology/7-91j-foundations-of-computational-and-systems-biology-spring-2014/lecture-slides/MIT7_91JS14_Lecture6.pdf

http://schatzlab.cshl.edu/teaching/AssemblyClass/01.%20Assembly%20Intro.pdf

http://lsl.sinica.edu.tw/Services/Class/files/20150612449.pdf

http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-43.pdf
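
A standard teaching example of a greedy algorithm in sequence assembly is the greedy shortest common superstring heuristic: repeatedly merge the two reads with the longest suffix-prefix overlap. The sketch below is a minimal version written for this note, with illustrative function names. Being greedy, it is not guaranteed to return the shortest possible superstring.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_superstring(reads, min_len=1):
    """Merge reads greedily by largest overlap and return one superstring."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to merge; remaining reads are concatenated
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return "".join(reads)

reads = ["ATGGC", "GGCAT", "CATTA"]
superstring = greedy_superstring(reads)
print(superstring)  # ATGGCATTA
```

Each step makes the locally best choice (the largest overlap) and never revisits it. That is what makes the method fast, and it is also why repeats in real genomes can mislead it.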

Address of the bookmark: https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation

Neel — Thu, 15 Dec 2016 05:47:35 -0600

MyPro is an improved genomics software pipeline for prokaryotic genomes. It is user-friendly and requires minimal programming skills, so high-quality prokaryotic genome assembly and annotation can be obtained with ease. It reportedly performed better than de novo assemblers and contig integration software, producing more contiguous assemblies with higher N50 values and fewer contigs.
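
For readers unfamiliar with N50: it is the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembly length. A minimal sketch follows. The function name and the contig lengths are made up for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

contig_lengths = [80, 70, 50, 40, 30, 20]  # total 290, half is 145
print(len(contig_lengths), n50(contig_lengths))  # 6 70
```

A higher N50 combined with fewer contigs indicates a more contiguous assembly, though neither number says anything about correctness.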

More at https://sourceforge.net/projects/sb2nhri/files/MyPro/

Address of the bookmark: http://www.sciencedirect.com/science/article/pii/S0167701215001207

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of a closely related species to order contigs and estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes. The following matches are then discarded:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only the best match of each query to the reference is kept)
matches that overlap on the reference
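
The filtering steps above can be sketched as follows. This is an illustration written for this note, not pyScaf's code. The Hit fields, the default cut-off values, and the rule of keeping the higher-scoring hit when two hits overlap on the reference are all assumptions.

```python
from collections import namedtuple

# Hypothetical alignment record; overlap is the aligned fraction of the contig.
Hit = namedtuple("Hit", "query ref start end identity overlap score")

def filter_hits(hits, min_identity=0.8, min_overlap=0.66):
    """Apply cut-offs, keep the best hit per contig, drop reference overlaps."""
    passing = [h for h in hits
               if h.identity >= min_identity and h.overlap >= min_overlap]
    best = {}
    for h in passing:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h
    kept = []
    for h in sorted(best.values(), key=lambda h: h.score, reverse=True):
        clash = any(k.ref == h.ref and h.start < k.end and k.start < h.end
                    for k in kept)
        if not clash:
            kept.append(h)
    return sorted(kept, key=lambda h: (h.ref, h.start))

def order_with_gaps(kept):
    """Return (contig, estimated gap to the next contig) in reference order."""
    out = []
    for cur, nxt in zip(kept, kept[1:] + [None]):
        gap = nxt.start - cur.end if nxt and nxt.ref == cur.ref else None
        out.append((cur.query, gap))
    return out

hits = [
    Hit("c1", "chr1", 0, 1000, 0.95, 0.99, 950),
    Hit("c2", "chr1", 1200, 2000, 0.90, 0.90, 700),
    Hit("c2", "chr1", 5000, 5600, 0.85, 0.70, 400),  # suboptimal hit for c2
    Hit("c3", "chr1", 900, 1500, 0.92, 0.95, 500),   # overlaps c1 on reference
    Hit("c4", "chr1", 2500, 3000, 0.60, 0.95, 300),  # fails identity cut-off
]
layout = order_with_gaps(filter_hits(hits))
print(layout)  # [('c1', 200), ('c2', None)]
```

The gap estimate here is simply the distance between adjacent hits on the reference, which assumes the assembled genome and the reference are collinear in that region.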

In preliminary tests on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, pyScaf correctly reconstructed all chromosomes in every run for CANPA and in nearly every run for ARATH (Figures in dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.
pyScaf works better with contigs than with scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which break synteny.
pyScaf works very well if the divergence between the reference genome and the assembled contigs is below 20% at the nucleotide level.
pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. Note, however, that this is an experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

PANDASEQ

Shruti Paniwala — Mon, 23 Jan 2017 04:54:32 -0600

PANDASEQ assembles paired-end Illumina reads into sequences, trying to correct for errors and uncalled bases. The assembler reads two files in FASTQ format with quality information. If amplification primers were used (e.g., to isolate a variable region of the 16S gene, or the constant regions around zinc finger binding residues), they can be removed from the sequence during assembly. The final sequence will correct any uncalled bases in the overlapping region using the complementary strand. When mismatches occur in the overlapping region, the base with the better quality score is chosen.
The algorithm is as follows:

1. Find the positions where the forward and reverse primers match best above the threshold and discard the ends of the sequence, including the primer.
2. Pick an overlap to maximise the probability of the forward and reverse reads having come from a single piece of DNA.
3. Identify the masking of the end of the read with the quality score B or # as done by CASAVA and adjust the probabilities in this region.
4. Construct an assembled sequence between the primers and calculate the quality.
5. Check for various constraints, including quality, length, uncalled bases, and user-supplied modules.
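
The overlap and consensus steps can be illustrated with a much simplified sketch. It is not PANDASEQ's implementation. The overlap is chosen by a plain matches-minus-mismatches score instead of a probability model, primer removal and CASAVA quality masking are ignored, and all names and reads are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def best_overlap(fwd, rev_rc, min_overlap=3):
    """Overlap length with the best (matches - mismatches) score, 0 if none is positive."""
    best_len, best_score = 0, 0
    for n in range(min_overlap, min(len(fwd), len(rev_rc)) + 1):
        score = 0
        for a, b in zip(fwd[-n:], rev_rc[:n]):
            if "N" in (a, b):
                continue  # uncalled bases carry no evidence either way
            score += 1 if a == b else -1
        if score > best_score:
            best_len, best_score = n, score
    return best_len

def merge_pair(fwd, fwd_q, rev, rev_q, min_overlap=3):
    """Merge a read pair; qualities are lists of integer scores, one per base."""
    rev_rc, rev_rc_q = reverse_complement(rev), rev_q[::-1]
    n = best_overlap(fwd, rev_rc, min_overlap)
    if n == 0:
        return None
    merged = list(fwd[:-n])
    offset = len(fwd) - n
    for i in range(n):
        a, qa = fwd[offset + i], fwd_q[offset + i]
        b, qb = rev_rc[i], rev_rc_q[i]
        if a == "N":
            merged.append(b)   # fill an uncalled base from the other strand
        elif b == "N":
            merged.append(a)
        else:
            merged.append(a if qa >= qb else b)  # mismatch: better quality wins
    merged.extend(rev_rc[n:])
    return "".join(merged)

# Hypothetical pair from a 14-base fragment with a 6-base overlap:
# the forward read has one uncalled base, the reverse read one low-quality error.
fwd = "ACGTTNCAAG"
fwd_q = [30, 30, 30, 30, 30, 2, 30, 30, 30, 30]
rev = reverse_complement("TGCACGGCTA")
rev_q = [30, 30, 30, 30, 30, 5, 30, 30, 30, 30]
print(merge_pair(fwd, fwd_q, rev, rev_q))  # ACGTTGCAAGGCTA
```

In the example the N in the forward read is filled from the reverse read, and the mismatch is resolved in favour of the forward base because its quality score is higher.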

Address of the bookmark: http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq_man1.html

ABACAS

Surabhi Chaudhary — Thu, 16 Feb 2017 12:15:55 -0600

ABACAS is intended to rapidly contiguate (align, order, orientate), visualize and design primers to close gaps on shotgun-assembled contigs based on a reference sequence. It uses MUMmer to find alignment positions and identify syntenies of assembly contigs against the reference. The output is then processed to generate a pseudomolecule, taking overlapping contigs and gaps into account. MUMmer's alignment-generating programs, Nucmer and Promer, are used, followed by the 'delta-filter' utility function. Users can also run tblastx on contigs that are not used to generate the pseudomolecule.
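
The pseudomolecule idea can be sketched as follows: place each contig at its reference position, reverse-complement contigs that align to the minus strand, pad gaps with N, and trim where neighbouring contigs overlap. This is a simplified illustration and not ABACAS's code. The input layout, the padding from reference coordinate 0, and the trimming rule are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def build_pseudomolecule(placements, contigs):
    """Join contigs in reference order into one sequence.

    placements: list of (contig_id, ref_start, strand), ref_start 0-based
    contigs:    dict of contig_id -> sequence
    """
    pieces = []
    cursor = 0  # next reference coordinate not yet covered
    for cid, start, strand in sorted(placements, key=lambda p: p[1]):
        seq = contigs[cid]
        if strand == "-":
            seq = reverse_complement(seq)
        if start > cursor:
            pieces.append("N" * (start - cursor))  # gap between contigs
        elif start < cursor:
            seq = seq[cursor - start:]  # trim the part already covered
        pieces.append(seq)
        cursor = max(cursor, start) + len(seq)
    return "".join(pieces)

# Hypothetical contigs and placements.
contigs = {"a": "ACGTAC", "b": "GGTTA", "c": "TTTT"}
placements = [("b", 8, "-"), ("a", 0, "+"), ("c", 11, "+")]
pseudomolecule = build_pseudomolecule(placements, contigs)
print(pseudomolecule)  # ACGTACNNTAACCTT
```

The runs of N mark the gaps for which closing primers would be designed.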

Address of the bookmark: http://abacas.sourceforge.net/Manual.html#9._Colour_code