BOL: Related items

Hercules: a profile HMM-based hybrid error correction algorithm for long reads

Jit — Mon, 20 Aug 2018 14:14:11 -0500

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several studies require long and accurate reads including de novo assembly, fusion and structural variation detection. In such cases researchers often combine both technologies and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time- efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies. Results: We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platformtextquoterights error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses this to correct errors in these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2), and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and highest accuracy when most of the basepairs of a long read are covered with short reads. Availability:

Hercules source code is available at https://github.com/BilkentCompGen/Hercules

Address of the bookmark: https://github.com/BilkentCompGen/Hercules

Rainbow: an integrated tool for efficient clustering and assembling RAD-seq reads

Rahul Nayak — Fri, 19 Oct 2018 08:23:42 -0500

Rainbow is developed to provide an ultra-fast and memory-efficient solution to clustering and assembling short reads produced by RAD-seq. First, Rainbow clusters reads using a spaced seed method. Then, Rainbow implements a heterozygote calling like strategy to divide potential groups into haplotypes in a top–down manner. And along a guided tree, it iteratively merges sibling leaves in a bottom–up manner if they are similar enough. Here, the similarity is defined by comparing the 2nd reads of a RAD segment. This approach tries to collapse heterozygote while discriminate repetitive sequences. At last, Rainbow uses a greedy algorithm to locally assemble merged reads into contigs. Rainbow not only outputs the optimal but also suboptimal assembly results. Based on simulation and a real guppy RAD-seq data, we show that Rainbow is more competent than the other tools in dealing with RAD-seq data

Address of the bookmark: https://sourceforge.net/projects/bio-rainbow/files/

Genome assembly tutorial "Genome Assembly for short and long reads"

Jit — Sat, 19 Jan 2019 17:29:53 -0600

In this lab we will perform de novo genome assembly of a bacterial genome. You will be guided through the genome assembly starting with data quality control, through to building contigs and analysis of the results. At the end of the lab you will know:

How to perform basic quality checks on the input data
How to run a short read assembler on Illumina data
How to run a long read assembler on Pacific Biosciences or Oxford Nanopore data
How to improve the accuracy of a long read assembly using short reads
How to assess the quality of an assembly

https://bioinformaticsdotca.github.io/high-throughput_biology_2017

Address of the bookmark: https://bioinformaticsdotca.github.io/high-throughput_biology_2017_module6_lab

SViper: Swipe your Structural Variants called on long (ONT/PacBio) reads with short exact (Illumina) reads.

Neel — Sun, 22 Dec 2019 03:48:28 -0600

Call sviper

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants

This will output a polished_variants.vcf file, that contains all the refined variants.

Sometimes it is helpful to look at the polished sequence, e.g. with the IGV browser. In that case you want SViper to output the polished and aligned sequences in a bam file via the option --output-polished-bam:

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants --output-polished-bam

Address of the bookmark: https://github.com/smehringer/SViper

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads

BioStar — Tue, 04 Feb 2020 23:23:16 -0600

Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.

Usage: perl run_rcorrector.pl [OPTIONS]
OPTIONS:
	Required
	-s seq_files: comma separated files for single-end data sets
	-1 seq_files_left: comma separated files for the first mate in the paried-end data sets
	-2 seq_files_right: comma separated files for the second mate in the paired-end data sets
	-i seq_files_interleaved: comma sperated files for interleaved paired-end data sets
	Optional
	-k INT: kmer_length (<=32, default: 23)
	-od STRING: output_file_directory (default: ./)
	-t INT: number of threads to use (default: 1)
	-trim : allow trimming (default: false)
	-maxcorK INT: the maximum number of correction within k-bp window (default: 4)
	-wk FLOAT: the proportion of kmers that are used to estimate weak kmer count threshold, lower for more divergent genome (default: 0.95)
	-ek INT: expected number of kmers; does not affect the correctness of program but affects the memory usage (default: 100000000)
	-stdout: output the corrected reads to stdout (default: not used)
	-verbose: output some correction information to stdout (default: not used)
	-stage INT: start from which stage (default: 0)
		0-start from begining(storing kmers in bloom filter) ;
		1-start from count kmers showed up in bloom filter;
		2-start from dumping kmer counts into a jf_dump file;
		3-start from error correction.

Address of the bookmark: https://github.com/mourisl/Rcorrector/

SLR-superscaffolder: A scaffold assemble pipeline for stLFR reads.

Jit — Fri, 14 Feb 2020 14:23:30 -0600

This is a scaffold assembler designed for stLFR reads[1]. It uses the link-reads information from stLFR reads to assemble contigs to scaffolds.

Here is an illustration of this pipeline:

Address of the bookmark: https://github.com/BGI-Qingdao/SLR-superscaffolder

SHAMAN: a user-friendly website for metataxonomic analysis from raw reads to statistical analysis

BioStar — Mon, 17 Aug 2020 05:21:09 -0500

SHAMAN is a shiny application for differential analysis of metagenomic data (16S, 18S, 23S, 28S, ITS and WGS) including bioinformatics treatment of raw reads for targeted metagenomics, statistical analysis and results visualization with a large variety of plots (barplot, boxplot, heatmap, …).
The bioinformatics treatment is based on Vsearch [Rognes 2016] which showed to be both accurate and fast [Wescott 2015].The statistical analysis is based on DESeq2 R package [Anders and Huber 2010] which robustly identifies the differential abundant features as suggested in [McMurdie and Holmes 2014] and [Jonsson2016]. SHAMAN robustly identifies the differential abundant genera with the Generalized Linear Model implemented in DESeq2 [Love 2014].
SHAMAN is compatible with standard formats for metagenomic analysis (.csv, .tsv, .biom) and figures can be downloaded in several formats. A presentation about SHAMAN is available here and a poster here.

More at https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03666-4

Address of the bookmark: https://github.com/aghozlane/shaman

FastProNGS: fast preprocessing of next-generation sequencing reads

Rahul Nayak — Sat, 26 Dec 2020 08:35:21 -0600

FastProNGS to integrate the quality control process with automatic adapter removal. Parallel processing was implemented to speed up the process by allocating multiple threads. Compared with similar up-to-date preprocessing tools, FastProNGS is by far the fastest.

Address of the bookmark: https://github.com/Megagenomics/FastProNGS

Understanding kmer !

BioStar — Wed, 18 Aug 2021 04:27:51 -0500

What is a k-mer anyway? A k-mer is just a sequence of k characters in a string (or nucleotides in a DNA sequence). Now, it is important to remember that to get all k-mers from a sequence you need to get the first k characters, then move just a single character for the start of the next k-mer and so on. Effectively, this will create sequences that overlap in k-1 positions.

Address of the bookmark: https://bioinfologics.github.io/post/2018/09/17/k-mer-counting-part-i-introduction/

MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads

Abhi — Tue, 05 Sep 2023 07:31:35 -0500

MitoHiFi v3.2 is a python pipeline distributed under MIT License !

MitoHiFi was first developed to assemble the mitogenomes for a wide range of species in the Darwin Tree of Life Project (DToL)

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-023-05385-y

Address of the bookmark: https://github.com/marcelauliano/MitoHiFi