BOL: Related items

Short-read assembly using SPAdes

Abhimanyu Singh — Mon, 31 Jan 2022 07:18:16 -0600

If we only had Illumina reads, we could also assemble them using the tool SPAdes.

You can try this here, or try it later on your own data.

Get data

We will use the same Illumina data as we used above:

illumina_R1.fastq.gz: the Illumina forward reads
illumina_R2.fastq.gz: the Illumina reverse reads

Assemble

Run SPAdes:

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o spades_assembly_all_illumina

-1 is input file of forward reads
-2 is input file of reverse reads
--careful minimizes mismatches and short indels
--cov-cutoff auto computes the coverage threshold (rather than the default setting, “off”)
-o is the output directory
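
If you want to launch the same run from a script, the command above can be assembled as an argument list. This is only a minimal sketch: the helper names build_spades_command and run_spades are hypothetical, and the flags are exactly the ones listed above, nothing more.

```python
import shutil
import subprocess


def build_spades_command(forward, reverse, outdir):
    """Return the SPAdes invocation shown above as an argument list."""
    return [
        "spades.py",
        "-1", forward,            # forward reads
        "-2", reverse,            # reverse reads
        "--careful",              # reduce mismatches and short indels
        "--cov-cutoff", "auto",   # let SPAdes compute the coverage threshold
        "-o", outdir,             # output directory
    ]


def run_spades(forward, reverse, outdir):
    """Run SPAdes if spades.py is on PATH; return True if it was started."""
    cmd = build_spades_command(forward, reverse, outdir)
    if shutil.which(cmd[0]) is None:
        print("spades.py not found on PATH; command would be:", " ".join(cmd))
        return False
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return True


# Show the command without running anything.
print(" ".join(build_spades_command(
    "illumina_R1.fastq.gz", "illumina_R2.fastq.gz", "spades_assembly_all_illumina")))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems if file names contain spaces.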

Results

Move into the output directory and look at the contigs:

infoseq contigs.fasta
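
If infoseq is not installed, a few lines of Python give a comparable summary of the contigs. The sketch below is an assumption-laden stand-in, not part of SPAdes or EMBOSS: read_fasta_lengths and n50 are hypothetical helper names, and the file is assumed to be plain (uncompressed) FASTA.

```python
import os


def read_fasta_lengths(path):
    """Return {sequence name: length} for a plain-text FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                parts = line[1:].split()
                # Use the first word of the header as the name.
                name = parts[0] if parts else "seq%d" % (len(lengths) + 1)
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Largest L such that contigs of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


# Only runs if the SPAdes output file is present in the current directory.
if os.path.exists("contigs.fasta"):
    contigs = read_fasta_lengths("contigs.fasta")
    print("contigs:", len(contigs))
    print("total bases:", sum(contigs.values()))
    print("N50:", n50(contigs.values()))
```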

Mitochondrial genome assembly tools

Abhi — Wed, 06 Sep 2023 00:37:18 -0500

Mitochondrial genome assembly tools are specialized software and algorithms designed to accurately reconstruct the mitochondrial genome (mitogenome) from sequencing data, typically obtained through techniques like next-generation sequencing (NGS). The mitochondrial genome is relatively small compared to the nuclear genome, making it an ideal target for assembly. Here are some commonly used mitochondrial genome assembly tools:

MitoFinder: MitoFinder is a pipeline that assembles mitochondrial genomes and annotates mitochondrial genes from trimmed sequencing reads.

MitoHiFi: MitoHiFi is a Python pipeline for mitochondrial genome assembly from PacBio high-fidelity (HiFi) reads.

MITObim: MITObim is a tool specifically developed for the iterative assembly of mitochondrial genomes. It starts with a reference mitogenome and iteratively refines the assembly using the read data.

MITOS: MITOS is a web-based platform that provides a pipeline for annotating mitochondrial genomes. It takes an already assembled mitogenome as input and integrates multiple software tools to predict and annotate its genes; it does not perform the assembly itself.

MIRA: MIRA (Mimicking Intelligent Read Assembly) is a versatile genome assembly tool that can be used for mitochondrial genome assembly. It supports various sequencing technologies and allows for reference-based or de novo assembly.

NOVOPlasty: NOVOPlasty is a user-friendly tool designed for de novo assembly of organelle genomes, including mitochondria. It uses a seed-and-extend algorithm and is designed for short-read (Illumina) whole-genome sequencing data.

MITOS2: MITOS2 is an updated version of the MITOS pipeline, which automates the annotation of mitochondrial genomes. It provides improved accuracy and additional features for mitochondrial genome analysis.

GetOrganelle: While primarily designed for chloroplast genome assembly, GetOrganelle can also be used for mitochondrial genome assembly. It is particularly useful for dealing with high-throughput sequencing data.

SPAdes: SPAdes (St. Petersburg genome assembler) is a versatile genome assembly tool that can be employed for mitochondrial genome assembly, especially when dealing with complex datasets that may contain nuclear mitochondrial DNA sequences (numts).

IDBA-UD: IDBA-UD (Iterative De Bruijn Graph De Novo Assembler for data with highly uneven depth) is another de novo assembly tool that can be used for mitochondrial genome assembly, especially when sequencing depth is uneven or relatively low in places.

Velvet: Velvet is a de novo assembly tool that can be applied to mitochondrial genome assembly, especially when working with short-read data.

When selecting a mitochondrial genome assembly tool, it's important to consider the specific characteristics of your sequencing data, such as read length and coverage, as well as the complexity of the mitochondrial genome. Additionally, some tools are better suited for specific organisms or research objectives, so choosing the right tool will depend on your particular project requirements.

CANU: Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing.

Jit — Tue, 26 Apr 2016 11:38:10 -0500

Canu is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION). The software is currently alpha level; feel free to use it and report any issues you encounter.

Canu is a hierarchical assembly pipeline which runs in four steps:

Detect overlaps in high-noise sequences using MHAP
Generate corrected sequence consensus
Trim corrected sequences
Assemble trimmed corrected sequences
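
MHAP (the MinHash Alignment Process) finds candidate overlaps by comparing small MinHash sketches of reads instead of the reads themselves. The toy example below illustrates only that core idea: it estimates the Jaccard similarity of two reads' k-mer sets from their sketches. It is not Canu's or MHAP's implementation; the function names, the value of k, the number of hash functions, and the use of BLAKE2 as the hash are all assumptions made for illustration.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        kmer.encode("ascii"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(seq, k=8, num_hashes=64):
    """For each hash function, keep the smallest hash over the read's k-mers."""
    kmer_set = kmers(seq, k)
    if not kmer_set:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimated_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


# Two made-up reads that share a 20-base stretch.
read_a = "ACGTTGCATGCCATGGATCCGATTACAGGC"
read_b = "CCATGGATCCGATTACAGGCTTAACGGTCA"
print("exact:    ", exact_jaccard(kmers(read_a, 8), kmers(read_b, 8)))
print("estimated:", estimated_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

Read pairs whose estimated similarity is high enough would then be passed on for a proper overlap check; the sketches make that first screening cheap even for very large read sets.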

Read the documentation

New release https://github.com/marbl/canu/releases

Address of the bookmark: https://github.com/marbl/canu

Redundans

Jit — Thu, 01 Sep 2016 08:28:11 -0500

The Redundans pipeline assists the assembly of heterozygous genomes.
The program takes assembled contigs and paired-end and/or mate-pair sequencing libraries as input and returns a scaffolded homozygous genome assembly that should be less fragmented and have a smaller total size than the input contigs. In addition, Redundans automatically closes the gaps that result from genome assembly or scaffolding.

The pipeline consists of three steps/modules:

redundancy reduction: detection and selective removal of redundant contigs from an initial de novo assembly
scaffolding: joining of genome fragments using paired-end and/or mate-pair reads
gap closing

Redundans is:

fast and lightweight, with multi-core support and memory optimisation, so it can be run even on a laptop for small-to-medium-sized genomes
flexible toward many sequencing technologies (Illumina, 454 or Sanger) and library types (paired-end, mate pairs, fosmids)
modular: every step can be omitted or replaced by other tools

Address of the bookmark: https://github.com/Gabaldonlab/redundans

Standardized velvet assembly report

Poonam Mahapatra — Fri, 09 Dec 2016 03:59:59 -0600

Requirements:

velvet (velveth and velvetg should be in your PATH)
R (with Sweave)
pdflatex (usually part of TeTeX)
ggplot2, proto and xtable (from the R prompt, type install.packages(c("ggplot2","proto","xtable")))
Perl

Optional:

BLAT or BLAST (to generate alignments against a reference genome). If using BLAT, add faToTwoBit,gfClient,gfServer to your PATH. If using BLAST, add blastall and formatdb.

Edit permute.sh to your liking, paying particular attention to the kmer, cvCut, expCov, and other flags

To Run:

perl fastaAllSize mysequences.fa > mysequences.stat
or, for gzipped input: gunzip -c mysequences.fa.gz | fastaAllSize > mysequences.stat
Substitute fastqAllSize for fastq files.
./permute.sh mysequences (leave out the .fa)

Address of the bookmark: https://github.com/leipzig/standardized-velvet-assembly-report

ScaffMatch

Jit — Tue, 13 Dec 2016 10:23:56 -0600

ScaffMatch is a novel scaffolding tool based on Maximum-Weight Matching that produces high-quality scaffolds from NGS data (reads and contigs). The tool is written in Python 2.7. It also includes a bash wrapper script that calls an aligner in case you first need to map reads to contigs (instead of providing .sam files).

The arguments accepted by ScaffMatch are:

-w) Working directory -- the directory where ScaffMatch files are stored. These are the .sam files produced by mapping reads to contigs and the resulting scaffolds fasta file `scaffolds.fa`;

-c) Contig fasta file;

-m) Flag that takes no value. Use it when .sam files are supplied instead of read .fastq files. Do not use this option if you provide read files;

-1) (Comma separated list of) either .fastq or .sam file(s) corresponding to the first read of the read pair;

-2) (Comma separated list of) either .fastq or .sam file(s) corresponding to the second read of the read pair;

-i) (Comma separated list of) insert size(s) of the library(-ies);

-s) (Comma separated list of) library(-ies) standard deviation(s) of insert size(s);

-t) Bundle threshold. Pairs of contigs supported by fewer read pairs than the value of this argument are discarded. Optional argument; the default is 5;

-g) Matching heuristics: use `max_weight` for Maximum Weight Matching heuristics with the Insertion step, use `backbone` for Maximum Weight Matching heuristics without the Insertion step, use `greedy` for Greedy Matching heuristics;

-l) Log file - where to store the logs. Optional argument. By default, stdout is used.
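
To make the bundle threshold (-t) concrete, here is a small sketch of the filtering it describes: read pairs that link two different contigs are counted per contig pair, and pairs with too little support are dropped. This is an illustration of the idea only, not ScaffMatch's code; bundle_links and the input format (one tuple of contig names per read pair) are assumptions.

```python
from collections import Counter


def bundle_links(read_pair_links, threshold=5):
    """Count supporting read pairs per contig pair and apply the threshold.

    read_pair_links: iterable of (contig_of_read1, contig_of_read2) tuples.
    Returns {(contig_a, contig_b): support} for pairs with support >= threshold.
    """
    counts = Counter(
        tuple(sorted(pair))          # (c1, c2) and (c2, c1) are the same link
        for pair in read_pair_links
        if pair[0] != pair[1]        # both reads on one contig: not a link
    )
    return {pair: n for pair, n in counts.items() if n >= threshold}


# Made-up links: c1-c2 is seen 7 times, c2-c3 only 4 times.
links = [("c1", "c2")] * 6 + [("c2", "c1")] + [("c2", "c3")] * 4
print(bundle_links(links))               # {('c1', 'c2'): 7}
print(bundle_links(links, threshold=4))  # {('c1', 'c2'): 7, ('c2', 'c3'): 4}
```

The surviving pairs and their counts are the kind of weighted links on which a matching heuristic (the -g option) could then operate.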

Address of the bookmark: http://alan.cs.gsu.edu/NGS/?q=content/scaffmatch

PEAR

Jit — Mon, 19 Dec 2016 09:28:30 -0600

PEAR is an ultrafast, memory-efficient and highly accurate paired-end read merger. It is fully parallelized and can run with as little as a few kilobytes of memory.

PEAR evaluates all possible paired-end read overlaps without requiring the target fragment size as input. In addition, it implements a statistical test for minimizing false-positive results. Together with a highly optimized implementation, this lets it merge millions of paired-end reads within a couple of minutes on a standard desktop computer.
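
The idea of trying every possible overlap can be shown with a toy merger. The sketch below is not PEAR's algorithm: it ignores base qualities and replaces PEAR's statistical test with a simple mismatch-rate cutoff, and the names merge_pair, min_overlap and max_mismatch_rate are assumptions chosen for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward, reverse, min_overlap=10, max_mismatch_rate=0.1):
    """Merge a read pair at its best-scoring overlap, or return None.

    Every overlap length from min_overlap up to the shorter read length is
    tried; an overlap scores (matches - mismatches) and is skipped if its
    mismatch rate exceeds max_mismatch_rate.
    """
    rev = reverse_complement(reverse)  # put both reads on the same strand
    best = None                        # (score, overlap length)
    for olen in range(min_overlap, min(len(forward), len(rev)) + 1):
        tail, head = forward[-olen:], rev[:olen]
        mismatches = sum(1 for x, y in zip(tail, head) if x != y)
        if mismatches / olen > max_mismatch_rate:
            continue
        score = olen - 2 * mismatches
        if best is None or score > best[0]:
            best = (score, olen)
    if best is None:
        return None
    # Overlapping bases are taken from the forward read.
    return forward + rev[best[1]:]


# A made-up 40-base fragment sequenced as two 30-base reads.
fragment = "ACGTTGCATGCCATGGATCCGATTACAGGCTTAACGGTCA"
read1 = fragment[:30]
read2 = reverse_complement(fragment[10:])
print(merge_pair(read1, read2) == fragment)  # True
```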

Address of the bookmark: http://sco.h-its.org/exelixis/web/software/pear/doc.html

Genome Assembly Tutorial

Abhimanyu Singh — Tue, 20 Dec 2016 07:56:01 -0600

If genomes were completely random sequences in a statistical sense, the overlap-layout-consensus method would have been enough to assemble large genomes from Sanger reads. Real genomes, however, often contain long repetitive regions, which are hard to assemble with the overlap-layout-consensus approach. The de Bruijn graph-based assembly approach was originally proposed to handle the assembly of repetitive regions better.
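
As a rough illustration of the de Bruijn idea, the sketch below breaks reads into k-mers, links each k-mer's (k-1)-base prefix to its (k-1)-base suffix, and then follows an unbranched path to spell out a contig. It is a teaching toy under simplifying assumptions (error-free reads, a single strand, no repeats longer than k-1), and the function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a graph whose nodes are (k-1)-mers and whose edges are distinct k-mers."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:        # each distinct k-mer contributes one edge
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def walk_unbranched(graph, start):
    """Follow edges from start while the path is unambiguous; return the contig."""
    contig = start
    node = start
    visited = {start}
    while node in graph and len(graph[node]) == 1:
        node = graph[node][0]
        if node in visited:         # stop on a cycle
            break
        visited.add(node)
        contig += node[-1]          # each step adds one new base
    return contig


reads = ["ACGTC", "GTCA"]
graph = de_bruijn_graph(reads, k=3)
print(graph)                         # {'AC': ['CG'], 'CG': ['GT'], 'GT': ['TC'], 'TC': ['CA']}
print(walk_unbranched(graph, "AC"))  # ACGTCA
```

A repeat longer than k-1 bases shows up as a node with several outgoing edges, which is where the walk above stops and where real assemblers need extra information to continue.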

More at http://www.homolog.us/Tutorials/index.php?p=1.4&s=1

Address of the bookmark: http://www.homolog.us/Tutorials/index.php?p=1.4&s=1

Harvest

Jit — Tue, 31 Jan 2017 10:57:56 -0600

Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes, including variant calls, recombination detection, and phylogenetic trees.

Tools

Parsnp - Core-genome alignment and analysis
Gingr - Interactive visualization of alignments, trees and variants
HarvestTools - Archiving and postprocessing

Citation

Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology, 15 (11), 1-15.

Address of the bookmark: http://harvest.readthedocs.io/en/latest/index.html

bedtools

Jit — Fri, 24 Feb 2017 04:50:44 -0600

Collectively, the bedtools utilities are a Swiss-army knife of tools for a wide range of genomics analysis tasks. The most widely used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely used genomic file formats such as BAM, BED, GFF/GTF, VCF. While each individual tool is designed to do a relatively simple task (e.g., intersect two interval files), quite sophisticated analyses can be conducted by combining multiple bedtools operations on the UNIX command line.
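
To show what "set theory on the genome" means in practice, here is a small pure-Python sketch of two such operations on BED-style intervals (chromosome, 0-based start, exclusive end). It mimics the general idea of intersecting and merging intervals; it is not bedtools code, does not reproduce its options or performance, and the function names are assumptions.

```python
def intersect(a, b):
    """Return the overlapping portion of every pair of intervals from a and b."""
    out = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:                      # non-empty overlap
                out.append((chrom_a, start, end))
    return out


def merge(intervals):
    """Combine overlapping or directly adjacent intervals on each chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


genes = [("chr1", 10, 20), ("chr1", 30, 40)]
peaks = [("chr1", 15, 35), ("chr2", 10, 20)]
print(intersect(genes, peaks))  # [('chr1', 15, 20), ('chr1', 30, 35)]
print(merge([("chr1", 10, 20), ("chr1", 15, 25), ("chr2", 1, 5)]))
# [('chr1', 10, 25), ('chr2', 1, 5)]
```

The nested loop is quadratic and only suitable for small examples; for real data, the bedtools utilities themselves are the practical choice.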

bedtools is developed in the Quinlan laboratory at the University of Utah and benefits from fantastic contributions made by scientists worldwide.

Address of the bookmark: http://bedtools.readthedocs.io/en/latest/index.html