BOL: Related items

SViper: Swipe your Structural Variants called on long (ONT/PacBio) reads with short exact (Illumina) reads.

Neel — Sun, 22 Dec 2019 03:48:28 -0600

Call sviper

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants

This writes a polished_variants.vcf file that contains all the refined variants.

Sometimes it is helpful to look at the polished sequence, e.g. in the IGV browser. In that case, have SViper write the polished and aligned sequences to a BAM file with the option --output-polished-bam:

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants --output-polished-bam

Address of the bookmark: https://github.com/smehringer/SViper

Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads

BioStar — Tue, 04 Feb 2020 23:23:16 -0600

Rcorrector has an accuracy higher than or comparable to existing methods, including the only other method (SEECER) designed for RNA-seq reads, and is more time and memory efficient. With a 5 GB memory footprint for 100 million reads, it can be run on virtually any desktop or server. The software is available free of charge under the GNU General Public License from https://github.com/mourisl/Rcorrector/.

Usage: perl run_rcorrector.pl [OPTIONS]
OPTIONS:
	Required
	-s seq_files: comma separated files for single-end data sets
	-1 seq_files_left: comma separated files for the first mate in the paired-end data sets
	-2 seq_files_right: comma separated files for the second mate in the paired-end data sets
	-i seq_files_interleaved: comma separated files for interleaved paired-end data sets
	Optional
	-k INT: kmer_length (<=32, default: 23)
	-od STRING: output_file_directory (default: ./)
	-t INT: number of threads to use (default: 1)
	-trim : allow trimming (default: false)
	-maxcorK INT: the maximum number of corrections within k-bp window (default: 4)
	-wk FLOAT: the proportion of kmers that are used to estimate weak kmer count threshold, lower for more divergent genome (default: 0.95)
	-ek INT: expected number of kmers; does not affect the correctness of program but affects the memory usage (default: 100000000)
	-stdout: output the corrected reads to stdout (default: not used)
	-verbose: output some correction information to stdout (default: not used)
	-stage INT: start from which stage (default: 0)
		0-start from beginning (storing kmers in bloom filter);
		1-start from counting kmers that showed up in bloom filter;
		2-start from dumping kmer counts into a jf_dump file;
		3-start from error correction.

Address of the bookmark: https://github.com/mourisl/Rcorrector/

SLR-superscaffolder: A scaffold assemble pipeline for stLFR reads.

Jit — Fri, 14 Feb 2020 14:23:30 -0600

This is a scaffold assembler designed for stLFR reads[1]. It uses the linked-read information from stLFR reads to assemble contigs into scaffolds.

[Figure: illustration of the SLR-superscaffolder pipeline]

Address of the bookmark: https://github.com/BGI-Qingdao/SLR-superscaffolder

KAD: Assessing genome assemblies using K-mer copies in assemblies and K-mer abundance in Illumina reads

Jit — Fri, 19 Jun 2020 07:34:12 -0500

KAD is designed for evaluating the nucleotide base accuracy of genome assemblies. Briefly, k-mer abundances are quantified for both the sequencing reads and the assembly sequences. Comparing the two values yields a single value per k-mer, the K-mer Abundance Difference (KAD), which indicates how well the assembly matches the read data for that k-mer.

In the KAD formula, c is the count of a k-mer in the reads, m is the mode of the read k-mer counts, and n is the copy number of the k-mer in the assembly.
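The formula these symbols belong to does not appear in the text above (it was probably an image). As a hedged sketch, the block below assumes the definition KAD = log2((c + m) / (m * (n + 1))), written from memory of the KAD project documentation, so check it against the linked repository before relying on it. The function name `kad` is hypothetical.

```python
import math


def kad(c, m, n):
    """K-mer Abundance Difference for one k-mer.

    ASSUMED formula (not shown in this post): log2((c + m) / (m * (n + 1)))
      c: count of the k-mer in the reads
      m: mode of the read k-mer counts
      n: copy number of the k-mer in the assembly
    """
    if m <= 0:
        raise ValueError("m (mode of read k-mer counts) must be positive")
    if c < 0 or n < 0:
        raise ValueError("c and n must be non-negative")
    return math.log2((c + m) / (m * (n + 1)))


if __name__ == "__main__":
    # A k-mer seen at the modal depth and present once in the assembly.
    print(kad(c=30, m=30, n=1))
    # The same k-mer missing from the assembly.
    print(kad(c=30, m=30, n=0))
    # A k-mer in the assembly with no read support.
    print(kad(c=0, m=30, n=1))
```

Under this assumed formula, a value near 0 means the assembly copy number agrees with the read depth, a positive value means the k-mer is under-represented in the assembly, and a negative value means the assembly contains a k-mer the reads do not support.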

Address of the bookmark: https://github.com/liu3zhenlab/KAD

Hifiasm: a haplotype-resolved assembler for accurate Hifi reads

Jit — Thu, 24 Dec 2020 10:03:36 -0600

Hifiasm is a fast haplotype-resolved de novo assembler for PacBio HiFi reads. It can assemble a human genome in several hours and works with the California redwood genome, one of the most complex genomes sequenced so far. Hifiasm can produce primary/alternate assemblies of quality competitive with the best assemblers. It also introduces a new graph binning algorithm and achieves the best haplotype-resolved assembly given trio data.

Address of the bookmark: https://github.com/chhylp123/hifiasm

Bioinformatics tools for telomere-to-telomere assembly!

BioStar — Tue, 17 Aug 2021 13:17:09 -0500

● Merfin – k-mer-based assembly and variant calling evaluation for improved consensus accuracy (Arang Rhie)
● PanGenie – algorithm that leverages a pangenome reference built from haplotype-resolved genome assemblies in conjunction with k-mer count information from raw, short-read sequencing data to genotype a wide spectrum of genetic variation (Tobias Marschall)
● SQANTI3 – an automated pipeline for the classification of long-read transcripts that can assess the quality of data and the preprocessing pipeline (Rocío Amorín de Hegedüs @rocioadh)
● tama (Transcriptome Annotation by Modular Algorithms) – software designed for processing Iso-Seq data and other long-read transcriptome data (Richard Kuo @GenomeRIK)
● pbaa (PacBio Amplicon Analysis) – separates complex mixtures of amplicon targets from genomic samples to cluster and generate high-quality consensus sequences from HiFi reads (Zev Kronenberg @zevkronenberg)
● bellerophon – analyzes MHC typing and other low-complexity gene amplicon data; performs allele calling while detecting polymorphic sites within the sequences and removing potential chimeric sequence variants (Yuanyuan Cheng @Yuanyuan929)
● svpack – tools for filtering, comparing, and annotating structural variant (SV) calls in VCF format (Aaron Wenger)
● JumboDB – tool for de Bruijn graph construction (Anton Bankevich @AntonBankevich)
● uLTRA – tool for splice alignment of long transcriptomic reads to a genome, guided by a database of exon annotations (Kristoffer Sahlin @krsahlin)
● LeafGo – workflow to rapidly produce high-quality de novo plant genomes (Luca Ermini @ermini_luca)

Reference:

https://www.pacb.com/blog/young-investigators-share-stellar-science-career-advice-and-bioinformatics-tools-at-smrt-leiden-2021/

Steps to find palindromes in genomes!

BioStar — Thu, 09 Mar 2023 02:56:54 -0600

In DNA, a palindrome is a sequence that reads the same 5' to 3' on both strands, i.e. it equals its own reverse complement. Palindromes are present in many genomes and have various biological functions. Here are some methods for discovering them:

Direct sequence search: One of the simplest ways to discover palindromes is to search the genome sequence directly for palindromic sequences using pattern matching tools, such as regular expressions or string algorithms. This approach can be useful for discovering simple palindromes, but may miss more complex palindromic structures.
Dot plot analysis: Dot plot analysis is a graphical method that can be used to identify palindromic regions in a genome. It involves plotting the genome sequence against itself and examining the diagonal patterns that emerge. Palindromic regions will appear as symmetrical patterns along the diagonal.
Restriction enzyme analysis: Some restriction enzymes, such as EcoRI and HindIII, recognize palindromic sequences and cleave DNA at these sites. By digesting the genome with these enzymes and examining the resulting fragments, palindromic regions can be identified.
Long-read sequencing: Long-read technologies, such as PacBio and Oxford Nanopore, generate reads that can span entire palindromic regions. By mapping these reads to the genome, palindromic regions can be identified and characterized.
Comparative genomics: Comparing the genomes of related species can also reveal palindromic regions that are conserved across evolutionarily divergent lineages. This approach can help identify functional palindromes that are under selective pressure.

Overall, palindromic sequences in genomes can be discovered with a variety of methods, each with its own advantages and limitations. Combining these methods gives a more comprehensive picture of the palindromic landscape of a genome.
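As a minimal sketch of the direct sequence search described above, the following scans a sequence for windows that equal their own reverse complement, which is the usual sense of "palindrome" for double-stranded DNA (for example the EcoRI site GAATTC). The function names and the length parameters are hypothetical, and the brute-force scan is only practical for short sequences or small window ranges.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
VALID_BASES = set("ACGT")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_palindromes(seq, min_len=4, max_len=12):
    """Return (start, length, window) for each reverse-complement palindrome.

    Only even lengths are scanned: an odd-length window cannot equal its
    reverse complement because the middle base would have to pair with itself.
    Windows containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    hits = []
    first_len = min_len + (min_len % 2)  # round up to the next even length
    for length in range(first_len, max_len + 1, 2):
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            if not set(window) <= VALID_BASES:
                continue
            if window == reverse_complement(window):
                hits.append((start, length, window))
    return hits


if __name__ == "__main__":
    for start, length, window in find_palindromes("TTGAATTCAA", 6, 10):
        print(start, length, window)
```

Longer or imperfect palindromes (with spacers or mismatches) need a different approach, such as aligning the sequence against its own reverse complement.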

CovCal: Coverage / Read Count Calculator

Jit — Wed, 15 Jun 2016 18:08:13 -0500

Coverage / Read Count Calculator

Calculate how much sequencing you need to hit a target depth of coverage (or vice versa).

Instructions: set the read length/configuration and genome size, then select what you want to calculate.

Written by Stephen Turner, based on the Lander-Waterman formula, inspired by a similar calculator written by James Hadfield. Coverage is calculated as C = LN/G and reads as N = CG/L, where C = Coverage (X), L = Read length (bp), G = Haploid genome size (bp), and N = Number of reads. Source code on GitHub.
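The two formulas can be sketched in a few lines. This is an illustration of C = LN/G and N = CG/L only, not the calculator's own source code; the function names are hypothetical. For paired-end data where N counts read pairs, L would be the combined length of both mates.

```python
import math


def coverage(read_length_bp, num_reads, genome_size_bp):
    """Expected depth of coverage: C = L * N / G."""
    return read_length_bp * num_reads / genome_size_bp


def reads_needed(target_coverage, genome_size_bp, read_length_bp):
    """Reads required for a target depth: N = C * G / L, rounded up."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


if __name__ == "__main__":
    # One million 150 bp reads on a 5 Mb genome.
    print(coverage(150, 1_000_000, 5_000_000))
    # Reads of 150 bp needed for 30X on the same genome.
    print(reads_needed(30, 5_000_000, 150))
```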

Address of the bookmark: http://apps.bioconnector.virginia.edu/covcalc/

LoRDEC: a hybrid error correction program for long, PacBio reads

Jit — Mon, 10 Apr 2017 04:16:09 -0500

LoRDEC is a program to correct sequencing errors in long reads from 3rd generation sequencing with high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.

Usually, errors in PacBio reads include many insertions and deletions, and comparatively fewer substitutions. LoRDEC can correct errors of all these types.
After correction, a larger portion of each PacBio read is usable for detecting regions of similarity with other sequences, for aligning the reads to the contigs of an assembly, etc.

Why is LoRDEC different?

It is efficient and can process large read data sets, including those from eukaryotic or vertebrate species, on a standard computing server, and even works on desktop/laptop computers.
It adopts a novel graph-based approach: it builds a succinct De Bruijn Graph (DBG) representing the short reads, and seeks a corrective sequence for each erroneous region of a long read by traversing chosen paths in the graph.
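To make the graph idea concrete, here is a toy sketch and not LoRDEC's implementation: it stores the De Bruijn graph in a plain dictionary rather than a succinct structure, and it uses a simple breadth-first search instead of LoRDEC's path selection. The function names, the choice of k, and the example reads are all made up for illustration.

```python
from collections import defaultdict, deque


def build_dbg(short_reads, k):
    """Build a De Bruijn graph: each (k-1)-mer maps to the (k-1)-mers following it."""
    graph = defaultdict(set)
    for read in short_reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def find_bridge(graph, source, target, max_len):
    """Breadth-first search for a shortest path from source to target.

    Returns the sequence spelled by the path, or None if no path of at most
    max_len bases exists. Both endpoints are (k-1)-mers assumed to be
    error-free anchors flanking an erroneous region of a long read.
    """
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        node, spelled = queue.popleft()
        if node == target:
            return spelled
        if len(spelled) >= max_len:
            continue
        # Sorted for a deterministic result when several paths tie.
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, spelled + nxt[-1]))
    return None


if __name__ == "__main__":
    short_reads = ["ATGGCGT", "GCGTGCA"]  # accurate short reads
    long_read = "ATGGAGTGCA"              # long read with one wrong base
    graph = build_dbg(short_reads, k=4)
    # Bridge the region between the first and last 3-mers of the long read.
    print(find_bridge(graph, long_read[:3], long_read[-3:], max_len=20))
```

In this toy case the path through the short-read graph spells a replacement for the region between the two anchors; a real corrector also has to choose among competing paths and handle regions without solid anchors.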

Address of the bookmark: http://www.atgc-montpellier.fr/lordec/

GAPPadder: A Sensitive Approach for Closing Gaps on Draft Genomes with Short Sequence Reads

Jit — Mon, 14 May 2018 05:25:48 -0500

This software is provided "as is" without warranty of any kind. In no event shall the author be held responsible for any damage resulting from the use of this software. The program package, including source codes, executables, and this documentation, is distributed free of charge. If you use this program in a publication, please cite the following reference:
Chong Chu, Xin Li, and Yufeng Wu. "GAPPadder: A Sensitive Approach for Closing Gaps on Draft Genomes with Short Sequence Reads." bioRxiv (2017): 125534.

Address of the bookmark: https://github.com/Reedwarbler/GAPPadder