BOL: Related items

Liftoff: An accurate GFF3/GTF lift over pipeline

Neel — Sun, 20 Dec 2020 01:36:37 -0600

Liftoff is a tool that accurately maps annotations in GFF or GTF between assemblies of the same, or closely-related species. Unlike current coordinate lift-over tools which require a pre-generated “chain” file as input, Liftoff is a standalone tool that takes two genome assemblies and a reference annotation as input and outputs an annotation of the target genome.

Address of the bookmark: https://github.com/agshumate/Liftoff

LncPipe:A Nextflow-based pipeline for comprehensive analyses of long non-coding RNAs from RNA-seq datasets

LEGE — Fri, 17 Sep 2021 01:57:02 -0500

The pipeline was developed based on a popular workflow framework Nextflow, composed of four core procedures including reads alignment, assembly, identification and quantification. It contains various unique features such as well-designed lncRNAs annotation strategy, optimized calculating efficiency, diversified classification and interactive analysis report. LncPipe allows users additional control in interuppting the pipeline, resetting parameters from command line, modifying main script directly and resume analysis from previous checkpoint.

Ref https://www.lncrnablog.com/lncpipe-a-nextflow-based-pipeline-for-identification-and-analysis-of-long-non-coding-rnas-from-rna-seq-data/

Address of the bookmark: https://github.com/likelet/LncPipe

SqueezeMeta: a fully automated metagenomics pipeline, from reads to bins

BioStar — Sat, 06 Jul 2024 04:29:16 -0500

SqueezeMeta is a full automatic pipeline for metagenomics/metatranscriptomics, covering all steps of the analysis. SqueezeMeta includes multi-metagenome support allowing the co-assembly of related metagenomes and the retrieval of individual genomes via binning procedures. Thus, SqueezeMeta features several unique characteristics:

Co-assembly procedure with read mapping for estimation of the abundances of genes in each metagenome
Co-assembly of a large number of metagenomes via merging of individual metagenomes
Includes binning and bin checking, for retrieving individual genomes
The results are stored in a database, where they can be easily exported and shared, and can be inspected anywhere using a web interface.
Internal checks for the assembly and binning steps inform about the consistency of contigs and bins, allowing to spot potential chimeras.
Metatranscriptomic support via mapping of cDNA reads against reference metagenomes

Address of the bookmark: https://github.com/jtamames/SqueezeMeta

Comparison of Short Read De Novo Alignment Algorithms

Rahul Agarwal — Wed, 21 Aug 2013 07:56:01 -0500

Excellent article to introduce different sequencing methods along with tools for de novo assembly of sequencing reads and their relevant references.

Title: Comparison of Short Read De Novo Alignment Algorithms

Author: Nikhil Gopal

Address of the bookmark: http://biochem218.stanford.edu/Projects%202011/Gopal%202011.pdf

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

So far miniasm is in early development stage. It has only been tested on a dozen of PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio datasets and 3 out of 4 ONT datasets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X are used, not the whole dataset), miniasm finishes the assembly, including reads overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, the HGAP3produces a 104Mb assembly with N50 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.

Miniasm confirms that at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as the more sophisticated overlapers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors)

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count how many contigs:

grep ">" reads.fa | wc -l

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm

RGFA: powerful and convenient handling of assembly graphs

Rahul Nayak — Thu, 25 Jan 2018 05:47:53 -0600

RGFA, an implementation of the proposed GFA specification in Ruby. It allows the user to conveniently parse, edit and write GFA files. Complex operations such as the separation of the implicit instances of repeats and the merging of linear paths can be performed. A typical application of RGFA is the editing of a graph, to finish the assembly of a sequence, using information not available to the assembler. We illustrate a use case, in which the assembly of a repetitive metagenomic fosmid insert was completed using a script based on RGFA.

https://github.com/ggonnella/rgfa

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5103826/

MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads

Rahul Nayak — Fri, 11 May 2018 05:07:45 -0500

MECAT is an ultra-fast Mapping, Error Correction and de novo Assembly Tools for single molecula sequencing (SMRT) reads. MECAT employs novel alignment and error correction algorithms that are much more efficient than the state of art of aligners and error correction tools. MECAT can be used for effectively de novo assemblying large genomes. For example, on a 32-thread computer with 2.0 GHz CPU , MECAT takes 9.5 days to assemble a human genome based on 54x SMRT data, which is 40 times faster than the current PBcR-Mhap pipeline. MECAT performance were compared with PBcR-Mhap pipeline, FALCON and Canu(v1.3) in five real datasets. The quality of assembled contigs produced by MECAT is the same or better than that of the PBcR-Mhap pipeline and FALCON.

https://www.nature.com/articles/nmeth.4432

Address of the bookmark: https://github.com/xiaochuanle/MECAT

Cerulean: A hybrid assembly using high throughput short and long reads

Rahul Nayak — Tue, 05 Jun 2018 10:10:15 -0500

Cerulean extends contigs assembled using short read datasets like Illumina paired-end reads using long reads like PacBio RS long reads. Cerulean v0.1 has been implemented with bacterial genomes in mind. The method is fully described in Deshpande, V., Fung, E. D., Pham, S., & Bafna, V. (2013). Cerulean: A hybrid assembly using high throughput short and long reads. arXiv preprint arXiv:1307.7933. http://arxiv.org/abs/1307.7933

Address of the bookmark: https://sourceforge.net/projects/ceruleanassembler/

ChopStitch: exon annotation and splice graph construction using transcriptome assembly and whole genome sequencing data

Rahul Nayak — Tue, 03 Jul 2018 04:14:52 -0500

ChopStitch is a new method for finding putative exons and constructing splice graphs using an assembled transcriptome and whole genome shotgun sequencing (WGSS) data. ChopStitch identifies exon-exon boundaries in de novo assembled RNA-seq data with the help of a Bloom filter that represents the k-mer spectrum of WGSS reads. The algorithm also detects base substitutions in transcript sequences corresponding to sequencing or assembly errors, haplotype variations, or putative RNA editing events. The primary output of our tool is a FASTA file containing putative exons. Further, exon edges are interrogated for alternative exon-exon boundaries to detect transcript isoforms, which are reported as splice graphs in dot output format.

Address of the bookmark: https://github.com/bcgsc/ChopStitch

FinisherSC:a repeat-aware tool for upgrading de novo assembly using long reads

Jit — Mon, 20 Aug 2018 04:08:50 -0500

Here is the command to run the tool:

python finisherSC.py destinedFolder mummerPath

If you are running on server computer and would like to use multiple threads, then the following commands can generate 20 threads to run FinisherSC.

python finisherSC.py -par 20 destinedFolder mummerPath

Sometimes, if the names of raw reads and contigs consists of special characters/formats, FinisherSC/MUMmer may not parse them correctly. In that case, you want to have a quick renaming of the names of contigs/reads in contigs.fasta or raw_reads.fasta using the following command.

    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' raw_reads.fasta > newRaw_reads.fasta
    cp newRaw_reads.fasta raw_reads.fasta
    perl -pe 's/>[^\$]*$/">Seg" . ++$n ."\n"/ge' contigs.fasta > newContigs.fasta
    cp newContigs.fasta contigs.fasta

Address of the bookmark: https://github.com/kakitone/finishingTool