BOL: Related items

Run the miniasm assembler on nanopore reads!

Jit — Mon, 18 Dec 2017 04:07:50 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically produced by minimap) as input and outputs an assembly graph in the GFA format. Unlike mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to that of the raw input reads.

Find the repeats within individual reads:

fq2fa ONT_A.fastq ONT_A.fasta

minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf

awk '{if($1==$6){print}}' AONT.paf > AONTself.paf

awk '$5=="-" {print $1}' AONTself.paf | sort -u > invertedrepeat.list
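The two awk filters above can also be sketched in Python, which makes the PAF columns explicit: column 1 is the query name, column 5 the strand and column 6 the target name. This is only an illustration under stated assumptions: the function names and the sample records are made up for the example and are not part of minimap2 or miniasm.

```python
def self_hits(paf_lines):
    """Yield PAF records (as lists of fields) whose query and target are the same read."""
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # PAF has 12 mandatory columns; skip truncated lines
        if fields[0] == fields[5]:  # same test as awk '$1==$6'
            yield fields


def inverted_repeat_reads(paf_lines):
    """Return sorted, unique names of reads that have a reverse-strand self-hit."""
    return sorted({f[0] for f in self_hits(paf_lines) if f[4] == "-"})


# Made-up PAF records (hypothetical read names and coordinates).
example = "\n".join([
    "readA\t1000\t10\t400\t-\treadA\t1000\t500\t900\t380\t390\t60",
    "readA\t1000\t10\t400\t+\treadB\t2000\t100\t490\t380\t390\t60",
    "readB\t2000\t0\t500\t+\treadB\t2000\t1000\t1500\t480\t500\t60",
    "readC\t3000\t0\t800\t-\treadC\t3000\t2000\t2800\t780\t800\t60",
])

print(inverted_repeat_reads(example.splitlines()))
```

With a real file, the same functions can be given an open file handle on AONT.paf instead of the sample list.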

Generate a few palindrome and repeat plots (showing only repeats longer than 10, 20 and 30 kb):

minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps

minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps

minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps

Assemble with miniasm:

miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta
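For readers less used to awk, the GFA-to-FASTA step above can be sketched in Python. In a GFA file each segment ("S") line carries the unitig name in the second field and its sequence in the third. The function name and the sample lines below are invented for the example.

```python
def gfa_segments_to_fasta(gfa_lines):
    """Convert GFA segment (S) lines into FASTA text, one record per segment."""
    records = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        # Only S lines hold sequences; other line types are ignored.
        if fields[0] == "S" and len(fields) >= 3:
            records.append(f">{fields[1]}\n{fields[2]}")
    return "\n".join(records) + ("\n" if records else "")


# Made-up GFA content (hypothetical unitig names and sequences).
example_gfa = [
    "S\tutg000001l\tACGTACGT\tLN:i:8",
    "a\tutg000001l\t0\tread1:1-8\t+\t8",
    "S\tutg000002l\tGGCC\tLN:i:4",
]

print(gfa_segments_to_fasta(example_gfa), end="")
```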

minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf

awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf

minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps

Enjoy the assembly!

Frequently used paired-end read (PE 2x100) mapping command lines

Jit — Tue, 15 May 2018 08:59:29 -0500

bowtie2 -x hs37m -X 650 -q -1 r1.fq -2 r2.fq -S r12.bowtie2.sam

bwa aln hs37m.fa r1.fq > r1.sai && bwa aln hs37m.fa r2.fq > r2.sai \
&& bwa sampe hs37m r1.sai r2.sai r1.fq r2.fq > r12.bwa.sam

bwa bwasw ../index/bwa/hs37m.fa r12.fq > r12.bwasw.sam

gsnap -A sam -d hs37m r1.fq r2.fq > r12.gsnap.sam

novoalign -r Random -o SAM -f r1.fq r2.fq -i 500 50 -d hs37m-k14s3.novo > r12.novo.sam

smalt map -f samsoft -i 650 -o r12.smalt-k20s13.sam hs37m-k20s13 r1.fq r2.fq

stampy.py -g hs37m -h hs37m -o r12.stampy.sam -M r1.fq,r2.fq

soap -D hs37m.fa.index -a r1.fq -b r2.fq -l 32 -g 3 -u dummy -2 dummy -o r12.soap

npScarf: real-time scaffolder using SPAdes contigs and Nanopore sequencing reads

Shruti Paniwala — Mon, 11 Jun 2018 05:14:57 -0500

npScarf (jsa.np.npscarf) is a program that connects contigs from a draft genome to generate sequences that are closer to finished. These pipelines can run on a single laptop for microbial datasets. In real-time mode, it can be integrated with simple structural analyses such as gene ordering and plasmid forming.

Address of the bookmark: http://japsa.readthedocs.io/en/latest/tools/jsa.np.npscarf.html

Rebaler: program for conducting reference-based assemblies using long reads.

Jit — Tue, 18 Sep 2018 07:52:41 -0500

Rebaler is a program for conducting reference-based assemblies using long reads. It relies mainly on minimap2 for alignment and Racon for making consensus sequences.

I made Rebaler for bacterial genomes (specifically for the task of testing basecallers). It should in principle work for non-bacterial genomes as well, but I haven't tested it.

Address of the bookmark: https://github.com/rrwick/Rebaler

Deepbinner: a signal-level demultiplexer for Oxford Nanopore reads

Neel — Tue, 27 Nov 2018 03:38:49 -0600

Deepbinner is a tool for demultiplexing barcoded Oxford Nanopore sequencing reads. It does this with a deep convolutional neural network classifier, using many of the architectural advances that have proven successful in image classification. Unlike other demultiplexers (e.g. Albacore and Porechop), Deepbinner identifies barcodes from the raw signal (a.k.a. squiggle) which gives it greater sensitivity and fewer unclassified reads.

Reasons to use Deepbinner:
- To minimise the number of unclassified reads (use Deepbinner by itself).
- To minimise the number of misclassified reads (use Deepbinner in conjunction with Albacore demultiplexing).
- You plan on running signal-level downstream analyses, like Nanopolish. Deepbinner can demultiplex the fast5 files, which makes this easier.

Reasons to not use Deepbinner:
- You only have basecalled reads, not the raw fast5 files (which Deepbinner requires).
- You have a small/slow computer. Deepbinner is more computationally intensive than Porechop.
- You used a sequencing/barcoding kit other than the ones Deepbinner was trained on.

Address of the bookmark: https://github.com/rrwick/Deepbinner

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads

BioStar — Tue, 04 Feb 2020 23:23:16 -0600

Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.

Usage: perl run_rcorrector.pl [OPTIONS]
OPTIONS:
	Required
	-s seq_files: comma separated files for single-end data sets
	-1 seq_files_left: comma separated files for the first mate in the paired-end data sets
	-2 seq_files_right: comma separated files for the second mate in the paired-end data sets
	-i seq_files_interleaved: comma separated files for interleaved paired-end data sets
	Optional
	-k INT: kmer_length (<=32, default: 23)
	-od STRING: output_file_directory (default: ./)
	-t INT: number of threads to use (default: 1)
	-trim : allow trimming (default: false)
	-maxcorK INT: the maximum number of corrections within k-bp window (default: 4)
	-wk FLOAT: the proportion of kmers that are used to estimate weak kmer count threshold, lower for more divergent genome (default: 0.95)
	-ek INT: expected number of kmers; does not affect the correctness of program but affects the memory usage (default: 100000000)
	-stdout: output the corrected reads to stdout (default: not used)
	-verbose: output some correction information to stdout (default: not used)
	-stage INT: start from which stage (default: 0)
		0-start from beginning (storing kmers in bloom filter);
		1-start from counting kmers that showed up in bloom filter;
		2-start from dumping kmer counts into a jf_dump file;
		3-start from error correction.
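As a rough illustration of how the options above combine, the following Python sketch builds (but does not run) a paired-end command line. The helper name rcorrector_command and the file names sample_1.fq and sample_2.fq are hypothetical; only the flags and their defaults come from the usage text.

```python
import shlex


def rcorrector_command(left, right, kmer=23, threads=1, outdir="./"):
    """Build an argument list for a paired-end run_rcorrector.pl call."""
    if not 1 <= kmer <= 32:
        raise ValueError("kmer length must be at most 32 according to the usage text")
    return [
        "perl", "run_rcorrector.pl",
        "-1", ",".join(left),    # first mates, comma separated
        "-2", ",".join(right),   # second mates, comma separated
        "-k", str(kmer),
        "-t", str(threads),
        "-od", outdir,
    ]


# Hypothetical input files, used only to show the resulting command.
cmd = rcorrector_command(["sample_1.fq"], ["sample_2.fq"], threads=4)
print(shlex.join(cmd))
```

The list can be handed to subprocess.run once Rcorrector is installed and the script path is adjusted to the local setup.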

Address of the bookmark: https://github.com/mourisl/Rcorrector/

SHAMAN: a user-friendly website for metataxonomic analysis from raw reads to statistical analysis

BioStar — Mon, 17 Aug 2020 05:21:09 -0500

SHAMAN is a shiny application for differential analysis of metagenomic data (16S, 18S, 23S, 28S, ITS and WGS) including bioinformatics treatment of raw reads for targeted metagenomics, statistical analysis and results visualization with a large variety of plots (barplot, boxplot, heatmap, …).
The bioinformatics treatment is based on Vsearch [Rognes 2016], which has been shown to be both accurate and fast [Wescott 2015]. The statistical analysis is based on the DESeq2 R package [Anders and Huber 2010], which robustly identifies the differentially abundant features as suggested in [McMurdie and Holmes 2014] and [Jonsson 2016]. SHAMAN robustly identifies the differentially abundant genera with the Generalized Linear Model implemented in DESeq2 [Love 2014].
SHAMAN is compatible with standard formats for metagenomic analysis (.csv, .tsv, .biom) and figures can be downloaded in several formats. A presentation about SHAMAN is available here and a poster here.

More at https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03666-4

Address of the bookmark: https://github.com/aghozlane/shaman

FastProNGS: fast preprocessing of next-generation sequencing reads

Rahul Nayak — Sat, 26 Dec 2020 08:35:21 -0600

FastProNGS integrates the quality control process with automatic adapter removal. Parallel processing was implemented to speed up the process by allocating multiple threads. Compared with similar up-to-date preprocessing tools, FastProNGS is by far the fastest.

Address of the bookmark: https://github.com/Megagenomics/FastProNGS

HairSplitter: assembling long reads in an unknown number of haplotypes

BioStar — Wed, 07 Dec 2022 00:13:40 -0600

Pros and cons of HairSplitter

Limitations of HairSplitter:

Not very fast: it re-polishes the whole assembly

Limited in the number of haplotypes

Strengths of HairSplitter:

Very modular, can be used with any assembler

Naive: makes no assumption on ploidy, parameter-free

Safe: won’t artificially duplicate contigs

HairSplitter splits collapsed assemblies from “draft” assemblies obtained by any means

HairSplitter can recover haplotypes and distinguish repeated elements

Only needs sequencing reads, potentially error-prone

Not really available yet (github.com/RolandFaure/HairSplitter)

https://hal.archives-ouvertes.fr/hal-03864075/file/RolandFaure_presentation_SeqBIM_2022.pdf

Address of the bookmark: https://hal.archives-ouvertes.fr/hal-03817928/document

CovCal: Coverage / Read Count Calculator

Jit — Wed, 15 Jun 2016 18:08:13 -0500

Calculate how much sequencing you need to hit a target depth of coverage (or vice versa).

Instructions: set the read length/configuration and genome size, then select what you want to calculate.

Written by Stephen Turner, based on the Lander-Waterman formula, inspired by a similar calculator written by James Hadfield. Coverage is calculated as C = LN/G and reads as N = CG/L, where C = coverage (X), L = read length (bp), G = haploid genome size (bp), and N = number of reads. Source code on GitHub.
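The two formulas are easy to check by hand. The sketch below is not the calculator's own source code, only a minimal Python rendering of C = LN/G and N = CG/L. The function names and the 5 Mb example genome are assumptions for illustration, and rounding the read count up to a whole read is a choice made here.

```python
import math


def coverage(read_length, n_reads, genome_size):
    """C = L * N / G: expected depth of coverage."""
    return read_length * n_reads / genome_size


def reads_needed(target_coverage, genome_size, read_length):
    """N = C * G / L, rounded up to a whole number of reads."""
    return math.ceil(target_coverage * genome_size / read_length)


# Example: a 5 Mb genome sequenced with 100 bp reads to 30X.
n = reads_needed(30, 5_000_000, 100)
print(n)                                # 1500000
print(coverage(100, n, 5_000_000))      # 30.0
```

For paired-end data such as 2x100, each pair contributes two reads of length L, so the number of read pairs is half of N.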

Address of the bookmark: http://apps.bioconnector.virginia.edu/covcalc/