BOL: Related items

MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads

Rahul Nayak — Fri, 11 May 2018 05:07:45 -0500

MECAT is an ultra-fast mapping, error correction and de novo assembly toolkit for single-molecule sequencing (SMRT) reads. MECAT employs novel alignment and error correction algorithms that are much more efficient than state-of-the-art aligners and error correction tools. MECAT can be used to assemble large genomes de novo effectively. For example, on a 32-thread computer with a 2.0 GHz CPU, MECAT takes 9.5 days to assemble a human genome from 54x SMRT data, which is 40 times faster than the current PBcR-Mhap pipeline. MECAT's performance was compared with the PBcR-Mhap pipeline, FALCON and Canu (v1.3) on five real datasets. The quality of the assembled contigs produced by MECAT is the same as or better than that of the PBcR-Mhap pipeline and FALCON.

https://www.nature.com/articles/nmeth.4432

Address of the bookmark: https://github.com/xiaochuanle/MECAT

ChopStitch: exon annotation and splice graph construction using transcriptome assembly and whole genome sequencing data

Rahul Nayak — Tue, 03 Jul 2018 04:14:52 -0500

ChopStitch is a new method for finding putative exons and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also detects base substitutions in transcript sequences corresponding to sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are reported as splice graphs in dot output format.
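The Bloom filter idea can be illustrated with a small Python sketch. It is not ChopStitch's implementation: the `BloomFilter` class, the function names, the filter size and the hash scheme are all assumptions made for illustration. The sketch loads genomic k-mers into a Bloom filter and reports the transcript k-mers that are missing from it, which is where an exon-exon junction (or a substitution) would show up.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter (illustrative only): several hash positions per item."""

    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def absent_kmer_positions(transcript, genome_filter, k):
    """Start positions of transcript k-mers not found in the genomic filter."""
    return [i for i, kmer in enumerate(kmers(transcript, k))
            if kmer not in genome_filter]


# Toy data (hypothetical): two exons separated by an intron in the genome.
exon1, intron, exon2 = "ACGTACGGTCA", "GTAAGTTTTTTTTCAG", "TGCCATGGAAC"
genome_filter = BloomFilter()
for kmer in kmers(exon1 + intron + exon2, 5):
    genome_filter.add(kmer)

# k-mers spanning the exon1/exon2 junction are absent from the genome.
junction_hits = absent_kmer_positions(exon1 + exon2, genome_filter, 5)
```

In this toy case the absent k-mers are the ones overlapping the junction between the two exons. A Bloom filter can return false positives, so a real tool has to tolerate an occasional junction k-mer that appears to be present.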

Address of the bookmark: https://github.com/bcgsc/ChopStitch

LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly

Rahul Nayak — Thu, 14 May 2020 15:09:52 -0500

LR_Gapcloser is a gap-closing tool that uses long reads from the species under study. The long reads can be downloaded from a public read archive (for instance, the NCBI SRA database) or be your own data. They are fragmented and aligned to the scaffolds using the BWA-MEM algorithm from the BWA package. The package includes a compiled bwa, so the user does not need to install it. LR_Gapcloser uses the alignments to find reads that bridge a gap, and then fills the gap with the original long-read sequence.
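A rough Python sketch of the two preparatory ideas follows: locating gaps in a scaffold and cutting a long read into tiles. The function names, the tile length and the assumption that gaps are runs of `N` are illustrative choices, not taken from LR_Gapcloser itself; the alignment step with bwa is left out.

```python
import re


def find_gaps(scaffold):
    """Return half-open (start, end) intervals of N runs in a scaffold."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", scaffold)]


def fragment_read(read, tile_len):
    """Cut a long read into consecutive tiles, keeping each tile's offset."""
    return [(offset, read[offset:offset + tile_len])
            for offset in range(0, len(read), tile_len)]


def fill_gap(scaffold, gap, read_segment):
    """Replace one gap interval with sequence taken from a bridging read."""
    start, end = gap
    return scaffold[:start] + read_segment + scaffold[end:]


gaps = find_gaps("ACGTNNNNACGTNNAC")
tiles = fragment_read("ABCDEFGHIJ", 4)
filled = fill_gap("ACGTNNNNACGT", (4, 8), "TTGCA")
```

Keeping the offset of every tile is what would allow the tiles aligned on either side of a gap to be traced back to the stretch of the original read that lies between them.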

Address of the bookmark: https://github.com/CAFS-bioinformatics/LR_Gapcloser

Genobuntu: A software package containing more than 70 software and packages oriented towards NGS and genome assembly

BioStar — Tue, 11 Dec 2018 05:15:57 -0600

Genobuntu is a software package containing more than 70 software tools and packages oriented towards NGS. In its current version, Genobuntu supports pre-assembly tools, genome assemblers, and post-assembly tools.

Commonly used biological software and example script files for different assembly pipelines are also provided, and the example scripts can be updated to suit one's experimental needs. Genobuntu attempts to reduce the time and energy needed to build software workstations, and it can also serve as a good teaching resource in a classroom setting.

https://sourceforge.net/projects/genobuntu/

Address of the bookmark: https://sourceforge.net/projects/genobuntu/

GMASS: a novel measure for genome assembly structural similarity

Abhimanyu Singh — Sun, 14 Apr 2019 20:35:40 -0500

The GMASS score is a novel measure for representing structural similarity between two assemblies. It will contribute to the understanding of assembly output and to the development of de novo assemblers.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-2710-z

Address of the bookmark: http://bioinfo.konkuk.ac.kr/GMASS/htdocs/syncircos.php

RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes

Rahul Nayak — Fri, 01 May 2020 03:00:40 -0500

RefKA is a reference-based approach for long-read genome assembly. It relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel, followed by a final bin-stitching step.
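The binning idea can be sketched in a few lines of Python. This is an illustration under assumptions rather than RefKA's code: the function names, the fixed bin size, and the use of exact k-mer matching in place of alignment are all hypothetical, and k-mers that span a bin boundary are ignored.

```python
from collections import Counter


def split_into_bins(reference, bin_size):
    """Cut the reference into consecutive fixed-size bins."""
    return [reference[i:i + bin_size]
            for i in range(0, len(reference), bin_size)]


def kmer_set(seq, k):
    """Set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers_per_bin(bins, k):
    """For each bin, keep only the k-mers that occur in no other bin."""
    per_bin = [kmer_set(b, k) for b in bins]
    bins_containing = Counter(kmer for kmers in per_bin for kmer in kmers)
    return [{kmer for kmer in kmers if bins_containing[kmer] == 1}
            for kmers in per_bin]


def best_bin(read, unique_sets, k):
    """Index of the bin sharing the most bin-specific k-mers with the read."""
    read_kmers = kmer_set(read, k)
    shared = [len(read_kmers & unique) for unique in unique_sets]
    return shared.index(max(shared))


bins = split_into_bins("AAACCCAAAGGG", 6)
unique_sets = unique_kmers_per_bin(bins, 3)
assigned = best_bin("ACCCA", unique_sets, 3)
```

Once every read has been assigned to a bin this way, the bins are independent of each other, which is what makes assembling them in parallel possible.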

Address of the bookmark: https://github.com/AppliedBioinformatics/RefKA

CANU: Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing.

Jit — Tue, 26 Apr 2016 11:38:10 -0500

Canu is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION). The software is currently alpha level; feel free to use it and report any issues you encounter.

Canu is a hierarchical assembly pipeline which runs in four steps:

Detect overlaps in high-noise sequences using MHAP
Generate corrected sequence consensus
Trim corrected sequences
Assemble trimmed corrected sequences
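MHAP, used in the first step, is built on MinHash, a form of locality sensitive hashing. The Python sketch below shows the general MinHash idea only; it is not MHAP's or Canu's implementation, and the function names, the SHA-256 based hashes and the parameters are assumptions. Each read is reduced to a short sketch of minimum hash values, and the fraction of matching positions in two sketches estimates how many k-mers the reads share.

```python
import hashlib


def kmer_set(seq, k):
    """Set of all k-mers in seq (seq must be at least k long)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(seed, kmer):
    """One member of a family of hash functions, selected by seed."""
    digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_sketch(seq, k, num_hashes):
    """For each hash function, keep the smallest hash over the read's k-mers."""
    kmers = kmer_set(seq, k)
    return [min(_hash(seed, kmer) for kmer in kmers)
            for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree; estimates k-mer set similarity."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


read = "ACGTTGCAAGGCTTACGATC"
same = estimate_jaccard(minhash_sketch(read, 4, 64), minhash_sketch(read, 4, 64))
unrelated = estimate_jaccard(minhash_sketch("AAAAAAAA", 4, 64),
                             minhash_sketch("CCCCCCCC", 4, 64))
```

Comparing fixed-size sketches instead of full reads is what keeps all-against-all overlap detection affordable on noisy long reads.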

Read the documentation

New release https://github.com/marbl/canu/releases

Address of the bookmark: https://github.com/marbl/canu

Redundans

Jit — Thu, 01 Sep 2016 08:28:11 -0500

The Redundans pipeline assists the assembly of heterozygous genomes.
The program takes as input assembled contigs and paired-end and/or mate-pair sequencing libraries, and returns a scaffolded homozygous genome assembly that should be less fragmented and have a smaller total size than the input contigs. In addition, Redundans will automatically close the gaps resulting from genome assembly or scaffolding.

The pipeline consists of three steps/modules:

redundancy reduction: detection and selective removal of redundant contigs from an initial de novo assembly
scaffolding: joining of genome fragments using paired-end and/or mate-pair reads
gap closing

Redundans is:

fast & lightweight, with multi-core support and memory optimisation, so it can be run even on a laptop for small-to-medium size genomes
flexible toward many sequencing technologies (Illumina, 454 or Sanger) and library types (paired-end, mate pairs, fosmids)
modular: every step can be omitted or replaced by other tools

Address of the bookmark: https://github.com/Gabaldonlab/redundans

Standardized velvet assembly report

Poonam Mahapatra — Fri, 09 Dec 2016 03:59:59 -0600

Requirements:

velvet (velveth velvetg should be in your PATH)
R (with Sweave)
pdflatex (usually part of TeTeX)
ggplot2 (from the R prompt, type install.packages(c("ggplot2","proto","xtable")))
Perl

Optional:

BLAT or BLAST (to generate alignments against a reference genome). If using BLAT, add faToTwoBit,gfClient,gfServer to your PATH. If using BLAST, add blastall and formatdb.

Edit permute.sh to your liking, paying particular attention to the kmer, cvCut, expCov, and other flags
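To give a sense of what permuting these parameters involves, here is a small Python sketch that lists one velveth/velvetg command pair per parameter combination. It does not reproduce permute.sh: the function name, the output directory naming, the `-fasta -short` read type and the mapping of cvCut and expCov onto velvetg's `-cov_cutoff` and `-exp_cov` options are assumptions. The commands are only built, not executed.

```python
from itertools import product


def velvet_commands(prefix, kmers, cov_cutoffs, exp_covs):
    """Build (velveth, velvetg) argument lists for every parameter combination."""
    runs = []
    for kmer, cv_cut, exp_cov in product(kmers, cov_cutoffs, exp_covs):
        # Hypothetical naming scheme: one output directory per combination.
        out_dir = f"{prefix}_k{kmer}_cv{cv_cut}_exp{exp_cov}"
        velveth = ["velveth", out_dir, str(kmer), "-fasta", "-short", f"{prefix}.fa"]
        velvetg = ["velvetg", out_dir,
                   "-cov_cutoff", str(cv_cut), "-exp_cov", str(exp_cov)]
        runs.append((velveth, velvetg))
    return runs


runs = velvet_commands("mysequences", kmers=[21, 31], cov_cutoffs=[5], exp_covs=[10, 20])
for velveth, velvetg in runs:
    print(" ".join(velveth))
    print(" ".join(velvetg))
```

The number of assemblies grows as the product of the list lengths, so short lists for each parameter are a sensible starting point.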

To Run:

perl fastaAllSize mysequences.fa > mysequences.stat
or, for gzipped input: gunzip -c mysequences.fa.gz | fastaAllSize > mysequences.stat
Substitute fastqAllSize for fastq files.
./permute.sh mysequences (leave out the .fa)

https://github.com/leipzig/standardized-velvet-assembly-report

Address of the bookmark: https://github.com/leipzig/standardized-velvet-assembly-report

ScaffMatch

Jit — Tue, 13 Dec 2016 10:23:56 -0600

ScaffMatch is a novel scaffolding tool based on Maximum-Weight Matching that can produce high-quality scaffolds from NGS data (reads and contigs). The tool is written in Python 2.7. It also includes a bash wrapper script that calls an aligner in case one needs to first map reads to contigs (instead of providing .sam files).

The arguments accepted by ScaffMatch are:

-w) Working directory -- the directory where ScaffMatch files are stored: the .sam files produced by mapping reads to contigs and the resulting scaffolds fasta file `scaffolds.fa`;

-c) Contig fasta file;

-m) Flag that takes no value. Use it when .sam files are provided instead of .fastq read files. Do not use this option if you provide read files;

-1) (Comma separated list of) either .fastq or .sam file(s) corresponding to the first read of the read pair;

-2) (Comma separated list of) either .fastq or .sam file(s) corresponding to the second read of the read pair;

-i) (Comma separated list of) insert size(s) of the library(-ies);

-s) (Comma separated list of) library(-ies) standard deviation(s) of insert size(s);

-t) Bundle threshold. Pairs of contigs supported by fewer read pairs than this value are discarded. Optional argument; the default is 5;

-g) Matching heuristics: use `max_weight` for Maximum Weight Matching heuristics with the Insertion step, use `backbone` for Maximum Weight Matching heuristics without the Insertion step, use `greedy` for Greedy Matching heuristics;

-l) Log file - where to store the logs. Optional argument. By default, stdout is used.
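How the bundle threshold (-t) and the `greedy` heuristic (-g) might interact can be pictured with the Python sketch below. It is a generic greedy matching over contig ends, not ScaffMatch's own code: the function name, the link tuples, the contig-end labels and the use of read-pair counts as edge weights are assumptions made for illustration.

```python
def greedy_matching(links, bundle_threshold=5):
    """Pick contig-end pairs greedily by support.

    links: iterable of (contig_end_a, contig_end_b, supporting_read_pairs).
    Links with fewer supporting read pairs than bundle_threshold are dropped.
    """
    kept = [link for link in links if link[2] >= bundle_threshold]
    kept.sort(key=lambda link: link[2], reverse=True)  # strongest links first
    matched_ends = set()
    matching = []
    for end_a, end_b, support in kept:
        # Each contig end may be joined to at most one other contig end.
        if end_a not in matched_ends and end_b not in matched_ends:
            matching.append((end_a, end_b, support))
            matched_ends.update((end_a, end_b))
    return matching


# Hypothetical links between contig ends (L = left end, R = right end).
links = [
    ("c1:R", "c2:L", 12),
    ("c2:L", "c3:L", 9),
    ("c3:L", "c4:R", 7),
    ("c1:R", "c4:R", 3),  # below the default threshold of 5, so discarded
]
chosen = greedy_matching(links)
```

A greedy pass is fast but can miss the pairing with the highest total weight, which is the motivation for offering maximum-weight matching as an alternative heuristic.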

Address of the bookmark: http://alan.cs.gsu.edu/NGS/?q=content/scaffmatch