BOL: Related items

MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons

Jit — Mon, 18 Feb 2019 04:21:50 -0600

MACSE aligns coding NT sequences with respect to their AA translation while allowing NT sequences to contain multiple frameshifts and/or stop codons. MACSE is hence the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.
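
To see why frameshifts and stop codons matter for codon-aware alignment, here is a minimal Python sketch that translates a nucleotide sequence in a chosen frame and reports internal stop codons. It is only an illustration, not MACSE's algorithm; the function names `translate` and `internal_stops` are hypothetical, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(nt, frame=0):
    """Translate nt starting at offset frame (0, 1 or 2); unknown codons become 'X'."""
    nt = nt.upper()
    # Trailing bases that do not fill a whole codon are ignored.
    codons = (nt[i:i + 3] for i in range(frame, len(nt) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def internal_stops(nt, frame=0):
    """Return 0-based codon indices of stop codons that are not the final codon."""
    protein = translate(nt, frame)
    return [i for i, aa in enumerate(protein[:-1]) if aa == "*"]
```

A single-base deletion shifts every downstream codon, which is why a pseudogene aligned without frameshift handling loses its codon structure.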

For further details about the underlying algorithm see the original publication:
MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons.
Vincent Ranwez, Sébastien Harispe, Frédéric Delsuc, Emmanuel JP Douzery
PLoS One 2011, 6(9): e22594.

Address of the bookmark: https://bioweb.supagro.inra.fr/macse/index.php?menu=releases

ElasticBLAST

Abhi — Tue, 06 Sep 2022 18:14:57 -0500

ElasticBLAST is a new way to BLAST large numbers of queries, faster and on the cloud. Here are the top three reasons you should use ElasticBLAST:

1. ElasticBLAST can handle much LARGER queries!

ElasticBLAST can search query sets that have hundreds to millions of sequences and against BLAST databases of all sizes.

2. ElasticBLAST is FASTER

ElasticBLAST distributes your searches across multiple cloud instances to process them simultaneously. The ability to scale resources in this way allows you to process large numbers of queries in a shorter time than you could with BLAST+.

3. ElasticBLAST is EASY to run on the cloud

ElasticBLAST is easy to set up using our step-by-step instructions for Amazon Web Services (AWS) and Google Cloud Platform (GCP), and it allows you to leverage the power of the cloud. Once configured, it manages the software and database installation, handles partitioning of the BLAST workload among the various instances, and deallocates cloud resources when the searches are done.

ElasticBLAST also selects the instance (i.e., machine) type for you based on database size. Of course, you can also choose the instance type manually if you prefer.
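
The idea of partitioning a large query set among instances can be sketched in a few lines of Python. This is only an illustration under assumptions: the function `split_into_batches` and its `max_batch_len` parameter are hypothetical and do not reflect how ElasticBLAST is actually implemented.

```python
def split_into_batches(records, max_batch_len):
    """Group (name, sequence) records into batches whose total sequence length
    stays at or below max_batch_len; an oversized record gets its own batch."""
    batches, current, current_len = [], [], 0
    for name, seq in records:
        # Start a new batch when adding this record would exceed the limit.
        if current and current_len + len(seq) > max_batch_len:
            batches.append(current)
            current, current_len = [], 0
        current.append((name, seq))
        current_len += len(seq)
    if current:
        batches.append(current)
    return batches
```

Each resulting batch could then be searched independently, which is what makes the workload easy to spread across machines.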

Address of the bookmark: https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/

Nucl2Vec: Local alignment of DNA sequences using Distributed Vector Representation

Jit — Tue, 16 Mar 2021 05:45:44 -0500

We demonstrate a novel approach for local alignment of DNA reads with respect to a reference genome. For this process we have used the Skip-gram model for creating the encoding (Nucl2Vec) and k-nearest neighbor for the alignment. With our new approach we have reduced the computation cost of local alignment, while achieving accuracy comparable to the existing de facto standard tool, BWA-MEM.
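
As a rough sketch of the input a Skip-gram model would be trained on, the Python below splits a DNA sequence into overlapping k-mers and emits (center, context) pairs. The k-mer tokenization, the window size, and the function names `kmers` and `skipgram_pairs` are assumptions made for illustration; the paper's actual settings are not given here.

```python
def kmers(seq, k):
    """Return the overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs
```

A Skip-gram model learns a vector per token from such pairs; reads and reference segments embedded this way could then be compared with a k-nearest-neighbor search.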

https://prakharg24.github.io/papers/401851.full.pdf

Address of the bookmark: https://prakharg24.github.io/papers/401851.full.pdf

Harvest: a suite of core-genome alignment and visualization tools

Jit — Fri, 08 Dec 2017 07:16:03 -0600

Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes, including variant calls, recombination detection, and phylogenetic trees.

Tools

Parsnp - Core-genome alignment and analysis
Gingr - Interactive visualization of alignments, trees and variants
HarvestTools - Archiving and postprocessing

Address of the bookmark: https://harvest.readthedocs.io/en/latest/

xmatchview: Smith-Waterman alignment visualization

Rahul Nayak — Thu, 28 Dec 2017 09:00:58 -0600

xmatchview and xmatchview-conifer are imaging tools for comparing the synteny between DNA sequences. They allow users to align two DNA sequences in FASTA format using cross_match and display the alignment in a variety of image formats. Both tools are written in Python and run on Linux and Windows. They serve as visual tools for analyzing cross_match alignments. cross_match (Green, P. (1994) http://www.phrap.org) uses a sensitive implementation of the Smith-Waterman algorithm for comparing DNA sequences.
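
For readers unfamiliar with Smith-Waterman, here is a minimal score-only sketch in Python. It uses a linear gap penalty and illustrative scoring values, which are assumptions; cross_match's actual scoring scheme and implementation differ, and the function name `smith_waterman_score` is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Clamping at 0 is what makes the alignment local.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best
```

Keeping only two rows of the matrix is enough when only the score is needed; recovering the alignment itself would require storing the full matrix or traceback pointers.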

http://www.bcgsc.ca/platform/bioinfo/software/xmatchview

Address of the bookmark: https://github.com/warrenlr/xmatchview

minialign: fast and accurate alignment tool for PacBio and Nanopore long reads

Jit — Thu, 24 May 2018 08:33:26 -0500

Minialign is a little bit fast and moderately accurate nucleotide sequence alignment tool designed for PacBio and Nanopore long reads. It is built on three key algorithms: the minimizer-based index of the minimap overlapper, array-based seed chaining, and SIMD-parallel Smith-Waterman-Gotoh extension.
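
The minimizer idea behind such an index can be sketched as follows. This is a simplified Python illustration, not minialign's or minimap's code: it orders k-mers lexicographically (real tools typically use a hash and also consider the reverse strand), and the function name `minimizers` is hypothetical.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) minimizers: for each window of w consecutive
    k-mers, keep the lexicographically smallest (leftmost on ties)."""
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmer_list) - w + 1):
        window = kmer_list[start:start + w]
        offset = min(range(w), key=lambda i: window[i])
        selected.add((start + offset, window[offset]))
    return sorted(selected)
```

Because neighbouring windows often share the same smallest k-mer, the index stores far fewer seeds than one entry per k-mer while still guaranteeing a shared seed for sufficiently long exact matches.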

Address of the bookmark: https://github.com/ocxtal/minialign

Understanding BLASTn output format 6 !

Rahul Nayak — Wed, 27 Jun 2018 18:38:21 -0500

BLASTn output format 6

BLASTn maps DNA against DNA, for example, gene sequences against a reference genome:

blastn -query genes.ffn -subject genome.fna -outfmt 6

BLASTn tabular output format 6

Column headers:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

1.	qseqid	query (e.g., gene) sequence id
2.	sseqid	subject (e.g., reference genome) sequence id
3.	pident	percentage of identical matches
4.	length	alignment length
5.	mismatch	number of mismatches
6.	gapopen	number of gap openings
7.	qstart	start of alignment in query
8.	qend	end of alignment in query
9.	sstart	start of alignment in subject
10.	send	end of alignment in subject
11.	evalue	expect value
12.	bitscore	bit score
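
A small Python sketch for reading this default 12-column layout is shown below. The names `BLAST6_COLUMNS` and `parse_blast6_line` are hypothetical helpers for illustration, and the code assumes the default columns in the order listed above.

```python
# Column names and types for the default format 6 layout listed above.
BLAST6_COLUMNS = [
    ("qseqid", str), ("sseqid", str), ("pident", float), ("length", int),
    ("mismatch", int), ("gapopen", int), ("qstart", int), ("qend", int),
    ("sstart", int), ("send", int), ("evalue", float), ("bitscore", float),
]


def parse_blast6_line(line):
    """Convert one tab-separated format 6 line into a dict with typed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(BLAST6_COLUMNS):
        raise ValueError(f"expected {len(BLAST6_COLUMNS)} columns, got {len(values)}")
    return {name: cast(value) for (name, cast), value in zip(BLAST6_COLUMNS, values)}
```

With typed values, hits can be filtered directly, for example by keeping only rows where `pident` and `length` exceed chosen thresholds.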

Define your own output format

Pass your own list of format specifiers to the -outfmt option, for example:

-outfmt "6 qseqid sseqid pident qlen length mismatch gapopen evalue bitscore"

supported format specifiers are:
qseqid    Query Seq-id
qgi   Query GI
qacc    Query accession
qaccver   Query accession.version
qlen    Query sequence length
sseqid    Subject Seq-id
sallseqid All subject Seq-id(s), separated by a ';'
sgi       Subject GI
sallgi    All subject GIs
sacc      Subject accession
saccver   Subject accession.version
sallacc   All subject accessions
slen      Subject sequence length
qstart    Start of alignment in query
qend    End of alignment in query
sstart    Start of alignment in subject
send      End of alignment in subject
qseq      Aligned part of query sequence
sseq      Aligned part of subject sequence
evalue    Expect value
bitscore  Bit score
score   Raw score
length    Alignment length
pident    Percentage of identical matches
nident    Number of identical matches
mismatch  Number of mismatches
positive  Number of positive-scoring matches
gapopen   Number of gap openings
gaps      Total number of gaps
ppos      Percentage of positive-scoring matches
frames    Query and subject frames separated by a '/'
qframe    Query frame
sframe    Subject frame
btop      Blast traceback operations (BTOP)
staxids   Subject Taxonomy ID(s), separated by a ';'
sscinames Subject Scientific Name(s), separated by a ';'
scomnames Subject Common Name(s), separated by a ';'
sblastnames Subject Blast Name(s), separated by a ';'   (in alphabetical order)
sskingdoms  Subject Super Kingdom(s), separated by a ';'     (in alphabetical order)
stitle    Subject Title
salltitles  All Subject Title(s), separated by a '<>'
sstrand   Subject Strand
qcovs   Query Coverage Per Subject
qcovhsp   Query Coverage Per HSP

default values are:
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"
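
A misspelled specifier is easy to miss, so one option is to build the -outfmt value in a script and check it against the specifier list above. The Python sketch below does this; `build_outfmt` and `blastn_command` are hypothetical helper names, and only the -query, -subject and -outfmt options shown earlier are used.

```python
# Format specifiers copied from the list above.
KNOWN_SPECIFIERS = set("""
qseqid qgi qacc qaccver qlen sseqid sallseqid sgi sallgi sacc saccver sallacc
slen qstart qend sstart send qseq sseq evalue bitscore score length pident
nident mismatch positive gapopen gaps ppos frames qframe sframe btop staxids
sscinames scomnames sblastnames sskingdoms stitle salltitles sstrand qcovs
qcovhsp
""".split())


def build_outfmt(fields):
    """Return the -outfmt value for tabular format 6 with the given specifiers."""
    unknown = [f for f in fields if f not in KNOWN_SPECIFIERS]
    if unknown:
        raise ValueError(f"unknown format specifier(s): {', '.join(unknown)}")
    return "6 " + " ".join(fields)


def blastn_command(query, subject, fields):
    """Return an argument list suitable for subprocess.run (no shell quoting needed)."""
    return ["blastn", "-query", query, "-subject", subject,
            "-outfmt", build_outfmt(fields)]
```

Passing the arguments as a list keeps the whole specifier string as a single argument, which is what the quotes achieve on the command line.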

rHAT: a seed-and-extension-based noisy long read alignment tool

Abhimanyu Singh — Sun, 23 Sep 2018 05:12:22 -0500

rHAT is a seed-and-extension-based noisy long read alignment tool. It is suitable for aligning third-generation sequencing reads, which are long and have a relatively high error rate, especially PacBio's Single Molecule Real-Time (SMRT) sequencing reads.

Address of the bookmark: https://github.com/dfguan/rHAT

Shouji: a fast and efficient pre-alignment filter for sequence alignment

Jit — Mon, 04 Nov 2019 07:09:45 -0600

The ability to generate massive amounts of sequencing data continues to overwhelm the processing capacity of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes.

We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern FPGA (field-programmable gate array) architectures to further boost the performance of our algorithm.

More at https://github.com/CMU-SAFARI/Shouji

Address of the bookmark: https://github.com/CMU-SAFARI/Shouji

Caretta – A multiple protein structure alignment and feature extraction suite

Rahul Nayak — Fri, 18 Dec 2020 02:09:44 -0600

Caretta is a multiple structure alignment suite meant for homologous but sequentially divergent protein families. It consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning.

Address of the bookmark: http://www.bioinformatics.nl/caretta/