BOL: Related items

Genome assembly training tutorial at Galaxy !

Jit — Sun, 27 Dec 2020 05:25:45 -0600

In this tutorial we assemble and annotate the genome of E. coli strain C-1. This strain is routinely used in experimental evolution studies involving bacteriophages. For instance, now classic works by Holly Wichman and Jim Bull (Bull 1997, Bull 1998, Wichman 1999) have been performed using this strain and bacteriophage phiX174.

Address of the bookmark: https://training.galaxyproject.org/training-material/topics/assembly/tutorials/unicycler-assembly/tutorial.html

IVA: accurate de novo assembly of RNA virus genomes

Neel — Wed, 23 Jun 2021 07:51:59 -0500

IVA (Iterative Virus Assembler) designed specifically for read pairs sequenced at highly variable depth from RNA virus samples. We tested IVA on datasets from 140 sequenced samples from human immunodeficiency virus-1 or influenza-virus-infected people and demonstrated that IVA outperforms all other virus de novo assemblers.

Availability and implementation: The software runs under Linux, has the GPLv3 licence and is freely available from http://sanger-pathogens.github.io/iva

https://pubmed.ncbi.nlm.nih.gov/25725497/

Address of the bookmark: https://github.com/sanger-pathogens/iva

auN: a new metric to measure assembly contiguity

Jit — Tue, 02 Aug 2022 01:18:47 -0500

Given a de novo assembly, we often measure the “average” contig length by N50. N50 is neither the real average nor median. It is the length of the contig such that this and longer contigs cover at least 50% of the assembly. A longer N50 indicates better contiguity. We can similarly define Nx such that contigs no shorter than Nx covers x% of the assembly. The Nx curve plots Nx as a function of x, where x is ranged from 0 to 100.

Address of the bookmark: https://lh3.github.io/2020/04/08/a-new-metric-on-assembly-contiguity

RATT

Jitendra Narayan — Sun, 07 Feb 2016 16:09:40 -0600

RATT is software to transfer annotation from a reference (annotated) genome to an unannotated query genome.

It was first developed to transfer annotations between different genome assembly versions. However, it can also transfer annotations between strains and even different species, like Plasmodium chabaudi onto P. berghei, between different Leishmania species or Salmonella enterica onto other Salmonella serotypes. RATT is able to transfer any entries present on a reference sequence, such as the systematic id or an annotator's notes; such information would be lost in a de novo annotation.

More at http://ratt.sourceforge.net/

Address of the bookmark: http://ratt.sourceforge.net/

Sequence assembly with MIRA 4

Priya Singh — Wed, 06 Apr 2016 08:21:22 -0500

MIRA is a multi-pass DNA sequence data assembler/mapper for whole genome and EST/RNASeq projects. MIRA assembles/maps reads gained by

electrophoresis sequencing (aka Sanger sequencing)
454 pyro-sequencing (GS20, FLX or Titanium)
Ion Torrent
Solexa (Illumina) sequencing
(in development) Pacific Biosciences sequencing

into contiguous sequences (called contigs). One can use the sequences of different sequencing technologies either in a single assembly run (a true hybrid assembly) or by mapping one type of data to an assembly of other sequencing type (a semi-hybrid assembly (or mapping)) or by mapping a data against consensus sequences of other assemblies (a simple mapping).

The MIRA acronym stands for Mimicking Intelligent Read Assembly and the program pretty well does what its acronym says (well, most of the time anyway). It is the Swiss army knife of sequence assembly that I've used and developed during the past 14 years to get assembly jobs I work on done efficiently - and especially accurately. That is, without me actually putting too much manual work into it.

More at http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html

Address of the bookmark: http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html

Understanding Fastqc Output

Jit — Fri, 15 Apr 2016 05:47:40 -0500

Understanding Following table and graphs

Duplication level
kmer profile
per base GC content
per base N content
per base quality
per base sequence content
per sequence GC content
per sequence quality
sequence length distribution

More at http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/

Address of the bookmark: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/3%20Analysis%20Modules/

ALE: a Generic Assembly Likelihood Evaluation Framework for Assessing the Accuracy of Genome and Metagenome Assemblies

Neel — Tue, 26 Apr 2016 03:38:43 -0500

Assembly Likelihood Evaluation (ALE) framework that overcomes these limitations, systematically evaluating the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. This framework is comprehensive, and integrates read quality, mate pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

More at http://www.ncbi.nlm.nih.gov/pubmed/23303509

Address of the bookmark: http://sc932.github.io/ALE/about.html

Bambus

Jit — Tue, 16 Aug 2016 08:09:15 -0500

Bambus 2.0, the second generation Bambus scaffolder available as an open source package. While most other scaffolders are closely tied to a specific assembly program, Bambus accepts the output from most current assemblers and provides the user with great flexibility in choosing the scaffolding parameters. In particular, Bambus is able to accept contig linking data other than specified by mate-pairs. Such sources of information include alignment to a reference genome (Bambus can directly use the output of MUMmer), physical mapping data, or information about gene synteny.

Home Page:

http://sourceforge.net/apps/mediawiki/amos/index.php?title=Bambus2

Address of the bookmark: https://www.cbcb.umd.edu/software/bambus2

RECORD

Bulbul — Fri, 25 Nov 2016 08:23:36 -0600

Background. Next-generation sequencing technologies are now producing multiple times the genome size in total reads from a single experiment. This is enough information to reconstruct at least some of the differences between the individual genome studied in the experiment and the reference genome of the species. However, in most typical protocols, this information is disregarded and the reference genome is used. Results. We provide a new approach that allows researchers to reconstruct genomes very closely related to the reference genome (e.g., mutants of the same species) directly from the reads used in the experiment. Our approach applies de novo assembly software to experimental reads and so-called pseudoreads and uses the resulting contigs to generate a modified reference sequence. In this way, it can very quickly, and at no additional sequencing cost, generate new, modified reference sequence that is closer to the actual sequenced genome and has a full coverage. In this paper, we describe our approach and test its implementation called RECORD. We evaluate RECORD on both simulated and real data. We made our software publicly available on sourceforge. Conclusion. Our tests show that on closely related sequences RECORD outperforms more general assisted-assembly software.

More at https://sourceforge.net/projects/record-genome-assembler/files/

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pubmed/26558255

SGA: String Graph Assembler

Jit — Thu, 08 Dec 2016 05:08:59 -0600

SGA is a de novo genome assembler based on the concept of string graphs. The major goal of SGA is to be very memory efficient, which is achieved by using a compressed representation of DNA sequence reads.

More at

https://github.com/jts/sga

SGA dependencies:
-google sparse hash library (http://code.google.com/p/google-sparsehash/)
-the bamtools library (https://github.com/pezmaster31/bamtools)
-zlib (http://www.zlib.net/)
-(optional but suggested) the jemalloc memory allocator (http://www.canonware.com/jemalloc/download.html)

Address of the bookmark: https://github.com/jts/sga