BOL: Related items

RECORD

Bulbul — Fri, 25 Nov 2016 08:23:36 -0600

Background. Next-generation sequencing technologies now produce total reads amounting to multiple times the genome size from a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate a new, modified reference sequence that is closer to the actual sequenced genome and has full coverage. In this paper, we describe our approach and test its implementation, called RECORD. We evaluate RECORD on both simulated and real data. We have made our software publicly available on SourceForge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.

More at https://sourceforge.net/projects/record-genome-assembler/files/

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pubmed/26558255

SGA: String Graph Assembler

Jit — Thu, 08 Dec 2016 05:08:59 -0600

SGA is a de novo genome assembler based on the concept of string graphs. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads.

More at https://github.com/jts/sga

SGA dependencies:
-google sparse hash library (http://code.google.com/p/google-sparsehash/)
-the bamtools library (https://github.com/pezmaster31/bamtools)
-zlib (http://www.zlib.net/)
-(optional but suggested) the jemalloc memory allocator (http://www.canonware.com/jemalloc/download.html)

Address of the bookmark: https://github.com/jts/sga

Understanding Greedy Algorithms

Jit — Mon, 12 Dec 2016 04:37:40 -0600

An introduction to greedy algorithms for biologists.

https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

These pages are also useful on the same topic:

http://learninglover.com/examples.php?id=59

http://www.cs.rpi.edu/~magdon/ps/conference/super_biokdd.pdf

https://ocw.mit.edu/courses/biology/7-91j-foundations-of-computational-and-systems-biology-spring-2014/lecture-slides/MIT7_91JS14_Lecture6.pdf

http://schatzlab.cshl.edu/teaching/AssemblyClass/01.%20Assembly%20Intro.pdf

http://lsl.sinica.edu.tw/Services/Class/files/20150612449.pdf

http://www.cs.jhu.edu/~langmea/resources/lecture_notes/assembly_scs.pdf

https://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-43.pdf
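
Several of the links above discuss greedy approaches to read assembly. As a rough illustration only, here is a minimal Python sketch of the greedy shortest-common-superstring heuristic: repeatedly merge the two reads with the longest suffix-prefix overlap. The function names `overlap` and `greedy_superstring` are hypothetical and not taken from any of the linked pages, and the greedy result is not guaranteed to be the shortest possible superstring.

```python
from itertools import permutations


def overlap(a: str, b: str, min_length: int = 1) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_superstring(reads, min_length: int = 1) -> str:
    """Greedy heuristic: always merge the pair with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(reads) > 1:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            olen = overlap(a, b, min_length)
            if olen > best_len:
                best_len, best_a, best_b = olen, a, b
        if best_len == 0:
            # No remaining overlaps: just concatenate what is left.
            return "".join(reads)
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_len:])
    return reads[0] if reads else ""


if __name__ == "__main__":
    print(greedy_superstring(["ATGCG", "GCGTA", "GTACC"]))
```

The pairwise search is quadratic in the number of reads per merge, which is fine for a teaching example but not for real data sets.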

Address of the bookmark: https://www.topcoder.com/community/data-science/data-science-tutorials/greedy-is-good/

Genome Assembly Tutorial

Abhimanyu Singh — Tue, 20 Dec 2016 07:56:01 -0600

If genomes were completely random sequences in a statistical sense, the 'overlap-consensus-layout' method would have been enough to assemble large genomes from Sanger reads. In contrast, real genomes often have long repetitive regions, which are hard to assemble using the overlap-consensus-layout approach. The De Bruijn graph-based assembly approach was originally proposed to handle the assembly of repetitive regions better.
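
To make the idea concrete, here is a minimal Python sketch of building a De Bruijn graph from reads: every k-mer contributes an edge from its first k-1 bases to its last k-1 bases. The function name `de_bruijn_graph` is hypothetical and this is not code from the linked tutorial; it only shows how a repeat collapses into a small cycle in the graph.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the list of (k-1)-mers that follow it."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Edge from the k-mer's prefix node to its suffix node.
            graph[kmer[:-1]].append(kmer[1:])
    return graph


if __name__ == "__main__":
    # A short tandem repeat: only two nodes, connected in a cycle.
    for node, successors in de_bruijn_graph(["ATATAT"], 3).items():
        print(node, "->", successors)
```

Repeated edges are kept in the lists here, so edge multiplicity can serve as a crude stand-in for k-mer coverage.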

More at http://www.homolog.us/Tutorials/index.php?p=1.4&s=1

Address of the bookmark: http://www.homolog.us/Tutorials/index.php?p=1.4&s=1

Harvest

Jit — Tue, 31 Jan 2017 10:57:56 -0600

Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes, including variant calls, recombination detection, and phylogenetic trees.

Tools

Parsnp - Core-genome alignment and analysis
Gingr - Interactive visualization of alignments, trees and variants
HarvestTools - Archiving and postprocessing

Citation

Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology, 15 (11), 1-15.

Address of the bookmark: http://harvest.readthedocs.io/en/latest/index.html

bedtools

Jit — Fri, 24 Feb 2017 04:50:44 -0600

Collectively, the bedtools utilities are a Swiss-army knife of tools for a wide range of genomics analysis tasks. The most widely used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF, and VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
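
To illustrate what "set theory on the genome" means, here is a minimal Python sketch of two interval operations, merging and intersecting. It is not bedtools code: the function names are hypothetical, intervals are assumed to be (chrom, start, end) tuples with BED-style half-open coordinates, and touching intervals are merged by choice in this sketch.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extends the previous interval on the same chromosome.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(iv) for iv in merged]


def intersect_intervals(a, b):
    """Return the overlapping portion of every pair drawn from `a` and `b`."""
    hits = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # half-open coordinates: empty if start == end
                hits.append((chrom_a, start, end))
    return hits


if __name__ == "__main__":
    genes = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500)]
    peaks = [("chr1", 180, 420)]
    print(merge_intervals(genes))
    print(intersect_intervals(merge_intervals(genes), peaks))
```

The nested loop in `intersect_intervals` is quadratic; real tools use sorted sweeps or interval trees to handle large files.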

bedtools is developed in the Quinlan laboratory at the University of Utah and benefits from fantastic contributions made by scientists worldwide.

Address of the bookmark: http://bedtools.readthedocs.io/en/latest/index.html

splitbam: splits a BAM by chromosomes

Jit — Tue, 28 Feb 2017 09:01:28 -0600

splitbam splits a BAM by chromosomes.

Using the reference sequence dictionary (*.dict), it also creates empty BAM files for chromosomes for which no SAM record was found. A pair of 'mock' SAM records can also be added to those empty BAMs to keep some tools (such as samtools) from crashing.

Usage

java -jar splitbam.jar -p OUT/__CHROM__/__CHROM__.bam -R ref.fasta (bam|sam|stdin)

Options

-h help; This screen.
-R (indexed reference file) REQUIRED.
-u (unmapped chromosome name): default:Unmapped
-e | --empty : generate EMPTY bams for chromosome having no read mapped
-m | --mock : if option '-e', add a mock pair of sam records to the empty bam
-p (output file/bam pattern) REQUIRED. MUST contain __CHROM__ and end with .bam
-s assume input is sorted.
-x | --index create index.
-t | --tmp (dir) tmp file directory
-G (file) chrom-group file (see below)
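
The grouping logic behind the -p pattern can be sketched in a few lines of Python. This is not the splitbam implementation: it works on plain-text SAM lines rather than BAM, the function name `split_sam_by_chrom` is hypothetical, and treating a reference name of `*` as unmapped is an assumption of this sketch. It returns a mapping from output path to records instead of writing files.

```python
from collections import defaultdict


def split_sam_by_chrom(sam_lines, pattern, unmapped="Unmapped"):
    """Group SAM records by reference name, keyed by the output path."""
    if "__CHROM__" not in pattern or not pattern.endswith(".bam"):
        raise ValueError("pattern must contain __CHROM__ and end with .bam")
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        chrom = fields[2]  # RNAME is the third SAM column
        if chrom == "*":
            chrom = unmapped  # assumption: '*' means no reference assigned
        groups[pattern.replace("__CHROM__", chrom)].append(line)
    return dict(groups)


if __name__ == "__main__":
    records = [
        "@SQ\tSN:chr1\tLN:1000",
        "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t0\tchr2\t200\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    for path, recs in split_sam_by_chrom(records, "OUT/__CHROM__/__CHROM__.bam").items():
        print(path, len(recs))
```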

Address of the bookmark: https://code.google.com/archive/p/jvarkit/wikis/SplitBam.wiki

MaxBin: software for binning assembled metagenomic sequences based on an Expectation-Maximization algorithm.

Jit — Mon, 06 Mar 2017 04:03:38 -0600

MaxBin is software for binning assembled metagenomic sequences based on an Expectation-Maximization algorithm. Users can understand the underlying bins (genomes) of the microbes in their metagenomes by simply providing assembled metagenomic sequences and either the read coverage information or the sequencing reads. For users' convenience, MaxBin will report genome-related statistics, including estimated completeness, GC content and genome size, in the binning summary page.
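
For readers unfamiliar with Expectation-Maximization, here is a toy Python sketch that separates contigs into two bins from their read coverage alone, using a two-component Poisson mixture. This is not MaxBin's algorithm or code: the function name `em_poisson_bins`, the Poisson model, the fixed two bins, and the min/max initialization are all assumptions made for illustration.

```python
import math


def em_poisson_bins(coverages, n_iter=50):
    """Fit a two-component Poisson mixture with EM; return labels and means."""
    # Assumed initialization: start the two bin means at the extremes.
    lam = [max(min(coverages), 1e-9), max(max(coverages), 1e-9)]
    weight = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each bin for each contig (log space).
        resp = []
        for x in coverages:
            logp = [math.log(weight[k]) + x * math.log(lam[k]) - lam[k]
                    for k in range(2)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate each bin's mean coverage and mixing weight.
        for k in range(2):
            mass = sum(r[k] for r in resp)
            if mass > 0:
                mean = sum(r[k] * x for r, x in zip(resp, coverages)) / mass
                lam[k] = max(mean, 1e-9)
            weight[k] = max(mass / len(coverages), 1e-12)
    labels = [0 if r[0] >= r[1] else 1 for r in resp]
    return labels, lam


if __name__ == "__main__":
    labels, means = em_poisson_bins([9, 10, 11, 48, 50, 52])
    print(labels, [round(m, 2) for m in means])
```

The E-step assigns each contig a soft membership in each bin and the M-step updates the bin parameters from those memberships; the two steps alternate until the estimates settle.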

Users can use MEGAN or similar software on MaxBin bins to find the taxonomy of each bin after the binning process is finished.

https://academic.oup.com/bioinformatics/article/32/4/605/1744462/MaxBin-2-0-an-automated-binning-algorithm-to

The most recent version of MaxBin is 2.2, which supports the analysis of coassemblies of multiple samples. It is available at the JBEI downloads site as well as at the MaxBin and MaxBin 2.0 SourceForge sites.

Address of the bookmark: http://downloads.jbei.org/data/microbial_communities/MaxBin/MaxBin.html

GroopM: Metagenomic binning toolset

Jit — Tue, 07 Mar 2017 08:59:45 -0600

GroopM is a metagenomic binning toolset. It leverages spatio-temporal dynamics (differential coverage) to accurately (and almost automatically) extract population genomes from multi-sample metagenomic datasets.
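
The intuition behind differential coverage is that contigs from the same population rise and fall together across samples. The Python sketch below illustrates only that intuition by correlating per-sample coverage profiles; it is not GroopM's method, and the function name `pearson` and the example profiles are made up for illustration.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length coverage profiles."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0  # a flat profile carries no co-variation signal
    return cov / math.sqrt(var_u * var_v)


if __name__ == "__main__":
    # Hypothetical coverage of three contigs across three samples.
    contig_a = [10, 40, 5]
    contig_b = [20, 80, 10]   # tracks contig_a across samples
    contig_c = [50, 5, 30]    # behaves differently
    print(round(pearson(contig_a, contig_b), 3))
    print(round(pearson(contig_a, contig_c), 3))
```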

GroopM is largely parameter-free. Use: groopm -h for more info.

For installation and usage instructions see: http://ecogenomics.github.io/GroopM/

Address of the bookmark: https://github.com/ecogenomics/GroopM

DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies

Jit — Wed, 19 Apr 2017 10:09:51 -0500

Our work is published in Scientific Reports:

Ye, C. et al. DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies. Sci. Rep. 6, 31900; doi: 10.1038/srep31900 (2016).

http://www.nature.com/articles/srep31900

The manual can be downloaded from:

https://github.com/yechengxi/DBG2OLC/raw/master/Manual.docx

To use precompiled versions, please go to:

https://github.com/yechengxi/DBG2OLC/tree/master/compiled

Address of the bookmark: https://github.com/yechengxi/DBG2OLC