BOL: Related items

Miropeats: discovers regions of sequence similarity amongst any set of DNA sequences

Poonam Mahapatra — Mon, 26 Aug 2019 17:55:24 -0500

Miropeats discovers regions of sequence similarity amongst any set of DNA sequences and then presents this similarity information graphically. Sequence similarity searching is a very general tool that forms the basis of many different biological sequence analyses but it is limited by the verbosity of traditional alignment presentation styles. Miropeats enhances the utility of conventional DNA sequence comparisons when looking at long lengths of sequence similarity by summarizing extensive large scale sequence similarities on a single page of graphics. The latest version of Miropeats can be used as a general pairwise alignment program or in its traditional role sorting out a big mess of overlapping or similar regions.

Address of the bookmark: http://www.littlest.co.uk/software/bioinf/old_packages/miropeats/

RePS: Repeat-masked Phrap with scaffolding, a WGS sequence assembler

Jit — Sat, 04 Jan 2020 01:08:09 -0600

RePS (Repeat-masked Phrap with scaffolding) is a WGS sequence assembler that explicitly identifies exact k-mer repeats from the shotgun data and removes them prior to the assembly. The established software Phrap is used to compute meaningful error probabilities for each base. Clone-end-pairing information is used to construct scaffolds that order and orient the contigs. The updated version of RePS incorporates some of the ideas introduced by Phusion on clustering.
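The repeat-masking idea described above can be illustrated with a minimal Python sketch. It is not RePS code: the function names, the value of k, and the count threshold are all assumptions chosen for illustration, and here repeated k-mers are replaced with N rather than handled the way RePS handles them internally.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every exact k-mer across all shotgun reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def mask_repeats(read, counts, k, threshold):
    """Replace bases covered by any k-mer seen more than `threshold` times with N."""
    masked = list(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] > threshold:
            for j in range(i, i + k):
                masked[j] = "N"
    return "".join(masked)

# Hypothetical toy reads; "ACGT" occurs four times and is treated as a repeat.
reads = ["ACGTACGTTT", "ACGTGGGCCC", "TTACGTAAAC"]
counts = count_kmers(reads, k=4)
masked_reads = [mask_repeats(r, counts, k=4, threshold=2) for r in reads]
```

In a real data set the threshold would have to be chosen relative to the sequencing depth, since every unique k-mer is expected to appear roughly as often as the coverage.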

More at

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC186573/

Address of the bookmark: ftp://ftp.genomics.org.cn/pub/ricedb/Tools/RePS/RePS-IBM-AIX.tar.gz

CLARK: Fast, accurate and versatile sequence classification system

Jit — Sat, 15 Feb 2020 01:49:01 -0600

CLARK is a method based on supervised sequence classification using discriminative k-mers. On two distinct classification problems (see the article for details), namely (1) the taxonomic classification of metagenomic reads to known bacterial genomes and (2) the assignment of BAC clones and transcripts to chromosome arms/centromeres (in the absence of a finished assembly for the reference genome), CLARK outperforms the best state-of-the-art methods in classification speed and precision.
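A rough Python sketch of classification with discriminative k-mers follows. It only illustrates the general idea (keep k-mers that occur in exactly one target, then assign a read to the target with the most hits); the function names, the toy sequences, and the tie handling are assumptions, not CLARK's implementation.

```python
from collections import Counter

def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_discriminative_index(targets, k):
    """Map each k-mer found in exactly one target to that target's label."""
    owner = {}
    shared = set()
    for label, seq in targets.items():
        for km in kmers(seq, k):
            if km in shared:
                continue
            if km in owner and owner[km] != label:
                # Seen in a second target: no longer discriminative.
                del owner[km]
                shared.add(km)
            else:
                owner[km] = label
    return owner

def classify(read, index, k):
    """Assign the read to the target with the most discriminative k-mer hits."""
    hits = Counter(index[km] for km in kmers(read, k) if km in index)
    if not hits:
        return None  # no discriminative k-mer found
    return hits.most_common(1)[0][0]

# Hypothetical reference targets; "ACGT" is shared and therefore discarded.
targets = {"A": "ACGTACGGA", "B": "TTGCAACGT"}
index = build_discriminative_index(targets, k=4)
```

With this toy index, `classify("GTACGG", index, 4)` is assigned to target A, while a read consisting only of the shared k-mer `ACGT` is left unclassified.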

http://clark.cs.ucr.edu/Spaced/

Address of the bookmark: http://clark.cs.ucr.edu/Spaced/

flexidot: Highly customizable, ambiguity-aware dotplots for visual sequence analyses

Jit — Fri, 24 Apr 2020 08:39:28 -0500

FlexiDot is a cross-platform dotplot suite generating high quality self, pairwise and all-against-all visualizations. To improve dotplot suitability for comparison of consensus and error-prone sequences, FlexiDot harbors routines for strict and relaxed handling of mismatches and ambiguous residues. The custom shading modules facilitate dotplot interpretation and motif identification by adding information on sequence annotations and sequence similarities to the images. Combined with collage-like outputs, FlexiDot supports simultaneous visual screening of large sequence sets, allowing dotplot use for routine screening.
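To make the strict versus relaxed handling of ambiguous residues concrete, here is a small Python sketch of a word-based dotplot. It is not FlexiDot code: the function names, the parameters, and the rule that ambiguity codes never match in strict mode are assumptions made for this illustration. Sequences are assumed to be uppercase IUPAC nucleotide strings.

```python
# IUPAC nucleotide codes mapped to the bases they can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

def bases_match(a, b, relaxed):
    """Relaxed: codes match if they share a base. Strict: identical A/C/G/T only."""
    if relaxed:
        return bool(set(IUPAC[a]) & set(IUPAC[b]))
    return a == b and a in "ACGT"

def dotplot(seq1, seq2, wordsize, relaxed=False, max_mismatches=0):
    """Return (i, j) word start positions whose words match within the mismatch limit."""
    dots = []
    for i in range(len(seq1) - wordsize + 1):
        for j in range(len(seq2) - wordsize + 1):
            mismatches = sum(
                not bases_match(seq1[i + o], seq2[j + o], relaxed)
                for o in range(wordsize)
            )
            if mismatches <= max_mismatches:
                dots.append((i, j))
    return dots
```

For example, `dotplot("ACGT", "ACNT", 4)` finds no match in strict mode, whereas passing `relaxed=True` reports the word at `(0, 0)` because N is compatible with G.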

Address of the bookmark: https://github.com/molbio-dresden/flexidot

LoReTTA, a user-friendly tool for assembling viral genomes from PacBio sequence data

Neel — Wed, 23 Jun 2021 07:54:53 -0500

LoReTTA (Long Read Template-Targeted Assembler) is a tool designed for performing de novo assembly of long reads generated from viral genomes on the PacBio platform. LoReTTA exploits a reference genome to guide the assembly process, an approach that has been successful with short reads.

https://academic.oup.com/ve/article/7/1/veab042/6248116

Address of the bookmark: https://academic.oup.com/ve/article/7/1/veab042/6248116

ContigExtender: a new approach to improving de novo sequence assembly for viral metagenomics data

LEGE — Wed, 08 May 2024 07:32:45 -0500

ContigExtender was developed to extend contigs, complementing de novo assembly. ContigExtender employs a novel recursive Overlap Layout Candidates (r-OLC) strategy that explores multiple extending paths to achieve longer and highly accurate contigs. ContigExtender is effective for extending contigs significantly in in silico synthesized and real metagenomics datasets.
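As a much simplified illustration of overlap-based contig extension, the Python sketch below greedily extends the 3' end of a contig along a single path. This is deliberately not the r-OLC strategy, which explores multiple candidate paths; the function name, the parameters, and the round limit are assumptions made for the example.

```python
def extend_right(contig, reads, min_overlap, max_rounds=100):
    """Greedily extend the contig's 3' end using reads whose prefix overlaps its suffix."""
    for _ in range(max_rounds):  # round limit guards against cycling through repeats
        best_overlap, best_tail = 0, ""
        for read in reads:
            # Try the longest overlap first; the read must add at least one base.
            longest = min(len(contig), len(read) - 1)
            for ov in range(longest, min_overlap - 1, -1):
                if contig.endswith(read[:ov]):
                    if ov > best_overlap:
                        best_overlap, best_tail = ov, read[ov:]
                    break
        if not best_tail:
            break  # no read extends the contig any further
        contig += best_tail
    return contig

# Hypothetical toy data: two reads chain onto the seed contig, one does not overlap.
extended = extend_right("ACGTAC", ["GTACGGA", "GGATTC", "TTTT"], min_overlap=3)
```

A single greedy path like this one can be misled by repeats or sequencing errors, which is the kind of situation where exploring several extension paths becomes useful.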

More at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7953547/

Address of the bookmark: https://github.com/dengzac/contig-extender

Seal: SEquence ALignment evaluation suite

Jit — Wed, 03 Jan 2018 05:05:46 -0600

Seal is a comprehensive sequencing simulation and alignment tool evaluation suite. This software (implemented in Java) provides several utilities that can be used to evaluate alignment algorithms, including:

Reading a pre-existing reference genome from one or more FASTA files.
Alternatively, generating an artificial reference genome based on input parameters (length, repeat count, repeat length, repeat variability rate).
Simulating reads from random locations in the genome based on input parameters of read length, coverage, sequencing error rate, and indel rate.
Applying alignment tools to the genome and the reads through a standardized interface.
Parsing the output of the alignment tool and calculating the number of reads that were correctly or incorrectly mapped.
Computing run times and measures of accuracy.
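The read simulation and accuracy steps listed above might look roughly like the following Python sketch. Seal itself is implemented in Java, so this is not its code: the function names and parameters are assumptions, only substitution errors are modelled (indels are omitted for brevity), and the aligner under test is assumed to report one start position per read.

```python
import random

def simulate_reads(genome, read_length, coverage, error_rate, seed=0):
    """Sample reads from random genome positions and add substitution errors."""
    rng = random.Random(seed)
    n_reads = (coverage * len(genome)) // read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        bases = list(genome[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with a different base.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))  # keep the true origin for scoring
    return reads

def mapping_accuracy(true_positions, reported_positions):
    """Fraction of reads whose reported position equals the simulated origin."""
    correct = sum(t == r for t, r in zip(true_positions, reported_positions))
    return correct / len(true_positions)

# Hypothetical usage with a toy genome and error-free reads.
genome = "ACGT" * 25
reads = simulate_reads(genome, read_length=10, coverage=2, error_rate=0.0)
```

Keeping the true start position alongside each simulated read is what makes it possible to count correctly and incorrectly mapped reads afterwards.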

Seal has interfaces to evaluate the following software packages:

Bowtie
BWA
MAQ
mrFAST
mrsFAST
Novoalign
SHRiMP
SOAPv2

Address of the bookmark: http://compbio.case.edu/seal/

Pollux: platform independent error correction of single and mixed genomes

Jit — Fri, 19 May 2017 09:41:27 -0500

Pollux is a general-purpose error corrector that corrects errors introduced by Illumina, Ion Torrent, and Roche 454 sequencing technologies and can be applied to single- or mixed-genome data. In addition to correcting substitution errors, it locates and corrects insertion, deletion, and homopolymer errors while remaining sensitive to low coverage areas of sequencing projects. Using published data sets, the authors report correcting 94% of Illumina MiSeq errors, 88% of Ion Torrent PGM errors, and 85% of Roche 454 GS Junior errors. Introduced errors are 20 to 70 times rarer than successfully corrected errors. Furthermore, the authors show that the quality of assemblies improves when reads are corrected by Pollux.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-014-0435-6

Address of the bookmark: https://github.com/emarinier/pollux

proovread : large-scale high-accuracy PacBio correction through iterative short read consensus

Jit — Fri, 05 Jan 2018 04:12:20 -0600

proovread:

outperforms PacBioToCA/LSC in terms of accuracy and contiguity/sensitivity (http://dx.doi.org/10.1093/bioinformatics/btu392)
is easy to install/run/configure
supports various types of data:
- HiSeq/MiSeq (100-500bp)
- Unitigs
- 454, ...

proovread maps high-coverage data to PacBio reads (bwa mem, blasr, daligner) in multiple iterations.

Address of the bookmark: https://github.com/BioInf-Wuerzburg/proovread

Hercules: a profile HMM-based hybrid error correction algorithm for long reads

Jit — Mon, 20 Aug 2018 14:14:11 -0500

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several applications, including de novo assembly and fusion and structural variation detection, require long and accurate reads. In such cases researchers often combine both technologies, and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies.

The authors designed and developed Hercules, which they describe as the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses this to correct errors in these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2), and human brain cerebellum polyA RNA-seq, the authors show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and the highest accuracy when most of the basepairs of a long read are covered with short reads.

Hercules source code is available at https://github.com/BilkentCompGen/Hercules

Address of the bookmark: https://github.com/BilkentCompGen/Hercules