BOL: Related items

Rebaler: program for conducting reference-based assemblies using long reads.

Jit — Tue, 18 Sep 2018 07:52:41 -0500

Rebaler is a program for conducting reference-based assemblies using long reads. It relies mainly on minimap2 for alignment and Racon for making consensus sequences.

I made Rebaler for bacterial genomes (specifically for the task of testing basecallers). It should in principle work for non-bacterial genomes as well, but I haven't tested it.

Address of the bookmark: https://github.com/rrwick/Rebaler

Jabba: Hybrid Error Correction for Long Sequencing Reads

Jit — Fri, 05 Jan 2018 03:58:14 -0600

Jabba is a hybrid error correction tool to correct third generation (PacBio / ONT) sequencing data, using second generation (Illumina) data.

Input

Jabba takes as input a concatenated de Bruijn graph and a set of sequences:

The de Bruijn graph should be in FASTA format with one entry per node; the meta information should be in the format:
>NODE
The set of sequences should be in FASTA or FASTQ format. These sequences will be corrected (e.g. PacBio reads). The corrections will be written to a file Jabba fasta.
The output is a file in FASTA format with the corrected long reads, plus a file in the input format containing the reads that were left uncorrected.

https://github.com/biointec/jabba/wiki

https://almob.biomedcentral.com/articles/10.1186/s13015-016-0075-7

Address of the bookmark: https://github.com/biointec/jabba

Porechop: tool for finding and removing adapters from Oxford Nanopore reads

Rahul Nayak — Tue, 29 May 2018 07:33:44 -0500

Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.
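The trim-and-split behaviour described above can be illustrated with a toy sketch. This is not Porechop's method: Porechop finds adapters by alignment and tolerates sequencing errors, whereas the hypothetical function below (trim_and_split, with made-up adapter and read strings) only handles exact adapter copies. It is meant to show the idea that end adapters are dropped and internal adapters split a read into pieces.

```python
from typing import List


def trim_and_split(read: str, adapter: str, min_len: int = 1) -> List[str]:
    """Toy illustration only: remove exact copies of `adapter` from `read`.

    Adapters at either end are trimmed away; an adapter in the middle
    splits the read into separate pieces (the "chimeric read" case).
    Exact matching is an assumption made for brevity.
    """
    if not adapter:
        raise ValueError("adapter must be a non-empty string")
    # str.split leaves empty strings where the adapter sat at an end,
    # so filtering by length discards those along with tiny fragments.
    return [piece for piece in read.split(adapter) if len(piece) >= min_len]


# Hypothetical example data: adapter at the start and in the middle.
ADAPTER = "ACGTAC"
READ = ADAPTER + "TTTTGGGG" + ADAPTER + "CCCCAAAA"
PIECES = trim_and_split(READ, ADAPTER)
```

Here PIECES holds the two sub-reads that flank the internal adapter. Real reads need error-tolerant matching, which is why a dedicated tool is used in practice.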

Porechop also supports demultiplexing of Nanopore reads that were barcoded with the Native Barcoding Kit, PCR Barcoding Kit or Rapid Barcoding Kit.

Address of the bookmark: https://github.com/rrwick/Porechop

coursera genome assembly tutorial

Jit — Sat, 25 Nov 2017 08:57:25 -0600

Solutions to Coursera Genome Sequencing (Bioinformatics II)

Address of the bookmark: https://github.com/iansealy/coursera-assembly

ALPACA: A hybrid strategy for assembly of genomic DNA shotgun sequencing reads.

Seema Singh — Mon, 30 Apr 2018 04:38:40 -0500

ALPACA requires Celera Assembler 8.3 or later. It is recommended to build Celera Assembler from source. (Why? The pre-built binaries CA_8.3rc1 and CA8.3rc2 will work for any large data set.)

Detail paper at https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3927-8

Address of the bookmark: https://github.com/VicugnaPacos/ALPACA

Nanopolish: polish a genome assembly

Rahul Nayak — Thu, 26 Jul 2018 04:51:28 -0500

Nanopolish is a software package for signal-level analysis of Oxford Nanopore sequencing data. It can calculate an improved consensus sequence for a draft genome assembly, detect base modifications, call SNPs and indels with respect to a reference genome, and more.

Quickstart

http://nanopolish.readthedocs.io/en/latest/quickstart_consensus.html

Algorithms

http://simpsonlab.github.io/2017/06/30/nanopolish-v0.7.0/

Address of the bookmark: https://github.com/jts/nanopolish

Canu genome assembly parameters

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose the appropriate parameters and run Canu. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Since we compute on a single Amazon server, switch off cluster execution with useGrid=false. For your own project, discuss these settings with a local computing guru. Parameters in square brackets [] are optional; the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s ] \
               -p  \
               -d  \
               genomeSize=[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found below. However, there are still many parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or smash them together, we can alter the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz

There is a brilliant section in the documentation about parameter tweaking.

The output directory will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after correction, trimmed and split based on consensus evidence
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing information about read inclusion in the final assembly
*.gfa : file containing the assembly graph produced by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of assembly can be read from reports generated by the assembler, or calculated using standard UNIX command line tools.
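As an alternative to command-line tools, the basic statistics can be computed with a few lines of Python. The sketch below is a minimal illustration, not part of Canu. The function names (read_fasta, assembly_stats) are hypothetical, and N50 is taken here as the length of the contig at which the running total of lengths, sorted longest first, reaches at least half of the assembly size.

```python
import gzip
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if path.endswith(".gz") else open
    name, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:].strip(), []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def assembly_stats(lengths: Iterable[int]) -> Dict[str, int]:
    """Return contig count, total size, longest contig and N50."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    for length in ordered:
        running += length
        # First contig at which the cumulative length covers half the assembly.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(ordered),
        "total_bp": total,
        "longest": ordered[0] if ordered else 0,
        "n50": n50,
    }


# Small worked example with made-up contig lengths.
EXAMPLE = assembly_stats([2, 3, 4, 5, 6, 7, 8, 9, 10])
```

For a real assembly, the lengths would come from the contigs file, for example assembly_stats(len(seq) for _, seq in read_fasta("asm_test3/test_asmbl.contigs.fasta")), where the path follows from the -d and -p values in the command above.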

More at https://canu.readthedocs.io/en/latest/faq.html

NxRepair: error correction in de novo assemblies using Nextera Mate Pair Reads

BioStar — Thu, 24 Jan 2019 10:35:12 -0600

NxRepair is a Python module that automatically detects large structural errors in de novo assemblies using Nextera mate pair reads. The detector will break a contig at the site of an identified misassembly and will generate a new FASTA file containing both the corrected contigs and the unaffected contigs.

https://nxrepair.readthedocs.io/en/latest/tutorial.html

nxrepair aligned_matepairs.bam assemblyfasta.fasta error_locations.csv new_fasta.fasta
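The "break a contig at the site of a misassembly" step can be pictured with a small sketch. This is not NxRepair's code: the detection from mate pair alignments is not shown, and the function name break_contig, the 0-based cut coordinates and the _partN naming scheme are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def break_contig(name: str, seq: str, breakpoints: Iterable[int]) -> List[Tuple[str, str]]:
    """Split `seq` at each breakpoint (0-based position where a new piece starts).

    Contigs without any valid breakpoint are returned unchanged, mirroring
    the idea that unaffected contigs are copied to the output as they are.
    """
    # Keep only cut positions strictly inside the contig, without duplicates.
    cuts = sorted({p for p in breakpoints if 0 < p < len(seq)})
    if not cuts:
        return [(name, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [
        (f"{name}_part{i + 1}", seq[start:end])
        for i, (start, end) in enumerate(zip(bounds, bounds[1:]))
    ]


# Hypothetical contig with two suspected misassembly sites.
PARTS = break_contig("contig1", "AAAACCCCGGGG", [4, 8])
```

PARTS then holds three named pieces. A contig passed with an empty breakpoint list comes back as a single unchanged entry.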

Address of the bookmark: https://github.com/rebeccaroisin/nxrepair

Flye: Fast and accurate de novo assembler for single molecule sequencing reads

BioJoker — Tue, 02 Apr 2019 21:54:55 -0500

Flye is a de novo assembler for single molecule sequencing reads, such as those produced by PacBio and Oxford Nanopore Technologies. It is designed for a wide range of datasets, from small bacterial projects to large mammalian-scale assemblies. The package represents a complete pipeline: it takes raw PB / ONT reads as input and outputs polished contigs. Flye also includes a special mode for metagenome assembly.

Address of the bookmark: https://github.com/fenderglass/Flye

BlobToolKit: A toolkit for genome assembly QC

Jit — Fri, 21 Feb 2020 00:17:50 -0600

Filtering raw genomic datasets is essential to avoid chimeric assemblies and to increase the validity of sequence-based biological inference. BlobToolKit extends the BlobTools/Blobology approach to simplify interactive and reproducible filtering.

BlobToolKit is comprised of four components:

BlobToolKit Viewer allows browser-based interactive visualisation and filtering of preliminary or published genomic datasets even for highly fragmented assemblies.
BlobTools2 is a command-line program to convert assemblies and analysis results into datasets that can be further processed using BlobTools2 and/or visualised in the Viewer.
The BlobToolKit Specification features a formal schema and validator for the JSON-based BlobDir format used by BlobTools2 and the Viewer.
The BlobToolKit Pipeline is a configurable Snakemake pipeline that automates all steps from retrieving public datasets through running analyses and generating a BlobDir dataset with BlobTools2, ready for visualisation in the Viewer.

Paper https://www.biorxiv.org/content/10.1101/844852v1.full.pdf

Address of the bookmark: https://blobtoolkit.genomehubs.org/