BOL: Related items

Short-read assembly using SPAdes

Abhimanyu Singh — Mon, 31 Jan 2022 07:18:16 -0600

If we had only Illumina reads, we could also assemble them with the tool SPAdes.

You can try this here, or try it later on your own data.

Get data

We will use the same Illumina data as we used above:

illumina_R1.fastq.gz: the Illumina forward reads
illumina_R2.fastq.gz: the Illumina reverse reads

Assemble

Run SPAdes:

spades.py -1 illumina_R1.fastq.gz -2 illumina_R2.fastq.gz --careful --cov-cutoff auto -o spades_assembly_all_illumina

-1 is the input file of forward reads
-2 is the input file of reverse reads
--careful tries to reduce the number of mismatches and short indels in the assembly
--cov-cutoff auto lets SPAdes compute the coverage cutoff itself (rather than using the default setting, "off")
-o is the output directory

Results

Move into the output directory and look at the contigs:

infoseq contigs.fasta
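
If EMBOSS (which provides infoseq) is not installed, a similar per-contig summary can be sketched in Python. This is a minimal sketch, not a replacement for infoseq: it assumes contigs.fasta is an uncompressed FASTA file, and the function names (read_fasta, gc_percent, n50, summarize) are made up for this example.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the contig name.
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(path: str) -> None:
    """Print name, length and %GC per contig, then contig count, total size and N50."""
    with open(path) as handle:
        records = list(read_fasta(handle))
    for name, seq in records:
        print(f"{name}\t{len(seq)}\t{gc_percent(seq):.2f}")
    lengths = [len(seq) for _, seq in records]
    print(f"contigs: {len(lengths)}\ttotal bp: {sum(lengths)}\tN50: {n50(lengths)}")
```

From inside the output directory, calling summarize("contigs.fasta") prints one line per contig followed by the totals.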

Understanding PacBio

Jitendra Narayan — Fri, 24 Feb 2017 10:17:36 -0600

This tutorial includes resources for learning more about PacBio data and bioinformatics analysis, and includes content suitable for both beginners and experts. Below are links to training modules (webinars and PowerPoint presentations) to help you get started with your data processing, as well as information for specialized applications.

Training Resources:

Specialized Applications:

Address of the bookmark: https://github.com/PacificBiosciences/Bioinformatics-Training/wiki

fragScaff: Genome Assembly with Contiguity Preserving Transposition

Jit — Mon, 14 May 2018 04:28:14 -0500

Contiguity preserving transposition and sequencing (CPT-seq) is an entirely in vitro means of generating libraries comprised of 9216 indexed pools, each of which contains thousands of sparsely sequenced long fragments ranging from 5 kilobases to >1 megabase. This software, fragScaff, leverages coincidences between the content of different pools as a source of contiguity information for scaffolding de novo genome assemblies. FragScaff is complementary to Lachesis, providing midrange contiguity to support robust, accurate chromosome-scale de novo genome assemblies without the need for laborious in vivo cloning steps.
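
The idea of using pool coincidences as contiguity evidence can be illustrated with a small sketch. This is not fragScaff's actual scoring model, which is described in the manuscript. It assumes only that each contig has a set of pool indexes whose reads map to it, and it uses the Jaccard index as a stand-in similarity measure. The names jaccard and rank_contig_pairs and the example data are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Fraction of pools shared by two contigs out of all pools hitting either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_contig_pairs(
    pools_by_contig: Dict[str, Set[int]], min_score: float = 0.0
) -> List[Tuple[str, str, float]]:
    """Score every contig pair by shared pools and return pairs above min_score,
    best first. High-scoring pairs are candidates for being close together."""
    scored = []
    for left, right in combinations(sorted(pools_by_contig), 2):
        score = jaccard(pools_by_contig[left], pools_by_contig[right])
        if score > min_score:
            scored.append((left, right, score))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


# Hypothetical input: pool indexes with reads mapping to each contig.
example = {
    "contigA": {1, 2, 3, 4},
    "contigB": {2, 3, 4, 5},
    "contigC": {7, 8},
}
candidates = rank_contig_pairs(example)
```

In this toy input contigA and contigB share three of the five pools that hit either of them, so they form the only candidate link. contigC shares no pools with the others and is left unlinked.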

Further information about fragScaff, including source code, is available at: https://sourceforge.net/projects/fragscaff/files

The manuscript describing fragScaff was published as: Adey A, Kitzman JO, Burton JN, Daza R, Kumar A, Christiansen L, Ronaghi M, Amini S, Gunderson KL, Steemers FJ, Shendure J. In vitro, long-range sequence information for de novo genome assembly via transposase contiguity. Genome Research. 2014 Dec;24(12):2041-9. doi: 10.1101/gr.178319.114. PubMed PMID: 25327137.

Address of the bookmark: https://sourceforge.net/projects/fragscaff/files/

Genome assembly tutorial "Genome Assembly for short and long reads"

Jit — Sat, 19 Jan 2019 17:29:53 -0600

In this lab we will perform de novo genome assembly of a bacterial genome. You will be guided through the genome assembly starting with data quality control, through to building contigs and analysis of the results. At the end of the lab you will know:

How to perform basic quality checks on the input data
How to run a short read assembler on Illumina data
How to run a long read assembler on Pacific Biosciences or Oxford Nanopore data
How to improve the accuracy of a long read assembly using short reads
How to assess the quality of an assembly

https://bioinformaticsdotca.github.io/high-throughput_biology_2017

Address of the bookmark: https://bioinformaticsdotca.github.io/high-throughput_biology_2017_module6_lab

CoLoRMap: Correcting Long Reads by Mapping short reads

Jit — Mon, 20 Aug 2018 14:17:05 -0500

Second-generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. However, long reads have so far been characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.

We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and extending corrected regions by local assembly of unmapped mates of mapped short reads. Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.

The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap

Contact: ehaghshe@sfu.ca or cedric.chauve@sfu.ca

Address of the bookmark: https://github.com/sfu-compbio/colormap

EAGLER: a scaffolding tool for long reads.

Jit — Mon, 04 Jun 2018 05:26:03 -0500

EAGLER is a scaffolding tool for long reads. The scaffolder takes as input a draft genome created by any NGS assembler and a set of long reads. The long reads are used to extend the contigs present in the NGS draft and possibly join overlapping contigs. EAGLER supports both PacBio and Oxford Nanopore reads.

The tool should be compatible with most UNIX flavors and has been successfully tested on the following operating systems:

Mac OS X 10.11.1
Mac OS X 10.10.3
Ubuntu 14.04 LTS

https://bib.irb.hr/datoteka/844447.Diplomski_2015_Luka_terbi.pdf

Address of the bookmark: https://github.com/mculinovic/EAGLER

LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly

Rahul Nayak — Thu, 14 May 2020 15:09:52 -0500

LR_Gapcloser is a gap-closing tool that uses long reads from the species under study. The long reads can be downloaded from a public read archive (for instance, the NCBI SRA database) or can be your own data. They are fragmented and aligned to the scaffolds using the BWA-MEM algorithm from the BWA package. The package includes a compiled bwa, so users do not need to install it separately. LR_Gapcloser uses the alignments to find long reads that bridge a gap, and then fills the gap with the original long-read sequence.

Address of the bookmark: https://github.com/CAFS-bioinformatics/LR_Gapcloser

Long-read assembly workshop

Rahul Nayak — Thu, 04 Oct 2018 17:23:18 -0500

This is a tutorial for a workshop on long-read (PacBio) genome assembly.

It demonstrates how to use long PacBio sequencing reads to assemble a bacterial genome, and includes additional steps for circularising, trimming, finding plasmids, and correcting the assembly with short-read Illumina data.

Please comment if you know of any other long-read assembly tutorials.

Address of the bookmark: http://sepsis-omics.github.io/tutorials/modules/cmdline_assembly_v2/

Cogent: a tool for reconstructing the coding genome using high-quality full-length transcriptome sequences.

Jit — Tue, 18 Jun 2019 05:33:04 -0500

Cogent is a tool that identifies gene families and reconstructs the coding genome using high-quality transcriptome data without a reference genome, and can be used to check assemblies for the presence of these known coding sequences.

Cogent is a tool for reconstructing the coding genome using high-quality full-length transcriptome sequences. It is designed to be used on Iso-Seq data in cases where there is no reference genome or the reference genome is highly incomplete.

See a recent presentation on Cogent being applied to the Cuttlefish Iso-Seq data.

Cogent preliminary draft paper (updated 2016Dec version), Supplementary

Please see wiki for details on usage.

Address of the bookmark: https://github.com/Magdoll/Cogent

odgi: optimized dynamic genome/graph implementation

Abhimanyu Singh — Tue, 01 Feb 2022 23:42:21 -0600

odgi provides an efficient and succinct dynamic DNA sequence graph model, as well as a host of algorithms that allow the use of such graphs in bioinformatic analyses.

Careful encoding of graph entities allows odgi to efficiently compute and transform pangenomes with minimal overhead. odgi implements a dynamic data structure that leverages multi-core CPUs and can be updated on the fly.

The edges and path steps are recorded as deltas between the current node id and the target node id, where the node id corresponds to the rank in the global array of nodes. Graphs built from biological data sets tend to have local partial order and, when sorted, the deltas will be small. This allows them to be compressed with a variable-length integer representation, resulting in a small in-memory footprint at the cost of packing and unpacking.

The RAM and computational savings are substantial. In partially ordered regions of the graph, most deltas will require only a single byte.
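
The delta-plus-varint idea described above can be sketched in a few lines of Python. This illustrates the general technique only and is not odgi's implementation. The zigzag mapping for negative deltas and the LEB128-style byte layout are assumptions chosen for the example, and all function names are hypothetical.

```python
from typing import List, Tuple


def zigzag(n: int) -> int:
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(z: int) -> int:
    """Inverse of zigzag."""
    return z // 2 if z % 2 == 0 else -(z + 1) // 2


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer with 7 payload bits per byte.
    The high bit of each byte signals that more bytes follow."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_edges(node_id: int, targets: List[int]) -> bytes:
    """Store each target as a varint-encoded delta from the current node id."""
    out = bytearray()
    for target in targets:
        out += encode_varint(zigzag(target - node_id))
    return bytes(out)


def decode_edges(node_id: int, data: bytes) -> List[int]:
    """Recover the target node ids from the packed deltas."""
    targets = []
    pos = 0
    while pos < len(data):
        z, pos = decode_varint(data, pos)
        targets.append(node_id + unzigzag(z))
    return targets


# Node 1000 linked to close neighbours: each delta fits in a single byte.
local = encode_edges(1000, [1001, 999, 1005])
# A distant target produces a larger delta and therefore more bytes.
distant = encode_edges(1000, [5000])
```

With this layout any delta between -64 and 63 fits in one byte, which matches the point that well-sorted, locally ordered graphs compress well. The cost is the packing and unpacking work in encode_edges and decode_edges.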

Address of the bookmark: https://github.com/pangenome/odgi