BOL: Related items

Seal: SEquence ALignment evaluation suite

Jit — Wed, 03 Jan 2018 05:05:46 -0600

Seal is a comprehensive sequencing simulation and alignment tool evaluation suite. This software (implemented in Java) provides several utilities that can be used to evaluate alignment algorithms, including:

Reading a pre-existing reference genome from one or more FASTA files.
Alternatively, generating an artificial reference genome based on input parameters (length, repeat count, repeat length, repeat variability rate).
Simulating reads from random locations in the genome based on input parameters of read length, coverage, sequencing error rate, and indel rate.
Applying alignment tools to the genome and the reads through a standardized interface.
Parsing the output of the alignment tool and calculating the number of reads that were correctly or incorrectly mapped.
Computing run times and measures of accuracy.
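
The read simulation and scoring steps listed above can be sketched roughly as follows. This is a minimal illustration in Python, not Seal's actual Java implementation: the function names simulate_reads and mapping_accuracy are hypothetical, only substitution errors are modelled (indels are left out), and reads are drawn uniformly from the forward strand.

```python
import random

def simulate_reads(genome, read_length, coverage, error_rate, seed=0):
    """Sample reads from random genome positions and add substitution errors.

    Returns a list of (true_start, read_sequence) pairs.
    Hypothetical helper for illustration; not part of Seal.
    """
    rng = random.Random(seed)
    # Number of reads needed to reach the requested average coverage.
    n_reads = int(coverage * len(genome) / read_length)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        bases = list(genome[start:start + read_length])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Replace the base with a different nucleotide.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads

def mapping_accuracy(true_starts, reported_starts):
    """Fraction of reads whose reported position equals the true position."""
    correct = sum(1 for t, r in zip(true_starts, reported_starts) if t == r)
    return correct / len(true_starts)
```

Because each simulated read carries its true origin, the positions reported by an aligner can be compared against it directly, which is the idea behind counting correctly and incorrectly mapped reads.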

Seal has interfaces to evaluate the following software packages:

Bowtie
BWA
MAQ
mrFAST
mrsFAST
Novoalign
SHRiMP
SOAPv2

Address of the bookmark: http://compbio.case.edu/seal/

Breakpointer: using local mapping artifacts to support sequence breakpoint discovery from single-end reads

Jit — Tue, 12 Jun 2018 12:41:10 -0500

Breakpointer is a fast tool for locating sequence breakpoints from the alignment of single-end (SE) reads produced by next-generation sequencing (NGS). It uses a heuristic method to search for local mapping signatures created by insertions/deletions (indels) or more complex structural variants (SVs).

Address of the bookmark: https://github.com/ruping/Breakpointer

Dynamic Programming Alignment

Thu, 22 Aug 2013 09:38:28 -0500

lecture 9, Chem. C100, Spring 2013, UCLA

Basics of BLAST Programs !

BioStar — Fri, 26 Jul 2024 06:04:26 -0500

The Basic Local Alignment Search Tool (BLAST) is a powerful bioinformatics program used to compare an input sequence (such as DNA, RNA, or protein sequences) against a database of sequences to find regions of similarity. Developed by the National Center for Biotechnology Information (NCBI), BLAST is widely used for identifying species, finding functional and evolutionary relationships between sequences, and predicting the function of novel sequences.

Key Features of BLAST:
1. Sequence Comparison: BLAST searches for local alignments between the query sequence and sequences in a database. It identifies regions of similarity, which can help infer functional and evolutionary relationships.

2. Speed and Efficiency: BLAST uses heuristic algorithms, making it faster than exhaustive search methods, suitable for large-scale database searches.

3. Versatility: There are several versions of BLAST for different types of sequence comparisons:
- blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
- blastp: Compares a protein query sequence against a protein sequence database.
- blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: Compares a protein query sequence against a nucleotide sequence database translated in all reading frames.
- tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

4. Scoring and E-value: BLAST results are scored based on the quality and length of the alignments. The E-value (expect value) indicates the number of alignments one can expect to find by chance, with lower E-values representing more significant matches.

5. Output Formats: BLAST provides results in various formats, including plain text, HTML, XML, and JSON, making it adaptable for different types of analyses and integrations with other tools.
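
As a rough illustration of how the E-value relates to alignment score, the standard Karlin-Altschul relation for a normalized bit score S' is E = m * n * 2^(-S'), where m is the query length and n is the total database length. The sketch below applies that formula directly. The function name expected_hits is hypothetical, and real BLAST uses effective (edge-corrected) lengths, so reported values will differ somewhat.

```python
def expected_hits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplified sketch: uses raw lengths rather than the effective
    lengths that BLAST computes internally.
    """
    return query_length * database_length * 2.0 ** (-bit_score)
```

Each additional bit of score halves the expected number of chance hits, while a larger database raises it proportionally. This is why the same alignment can look less significant when searched against a bigger database.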

Applications of BLAST:
- Genomic Research: Identifying genes, understanding genetic diversity, and mapping genome sequences.
- Protein Function Prediction: Inferring the function of unknown proteins by comparing them to known protein sequences.
- Evolutionary Studies: Exploring evolutionary relationships between organisms by comparing their genetic material.
- Medical Research: Identifying pathogens, understanding disease mechanisms, and developing treatments by comparing sequences of interest.

Overall, BLAST is an essential tool in bioinformatics, offering a reliable and efficient way to analyze and interpret biological sequence data.

A web-based tool for sequence alignment statistics and innovative visualization

BioStar — Thu, 04 Apr 2024 01:44:50 -0500

AlignStatPlot is a new R package and online tool that is well documented and easy to use for MSA and post-MSA analysis. This tool performs both traditional and cutting-edge analyses on sequencing data and generates new visualisation methods for MSA results. When compared to currently available tools, AlignStatPlot provides a robust ability to handle and visualise diversity data, while the online version will save time and encourage researchers to focus on explaining their findings. It is a simple tool that can be used in conjunction with population genetics software. Source: "AlignStatPlot: An R package and online tool for robust sequence alignment statistics and innovative visualization of big data" (PDF).

Address of the bookmark: https://bioinformatics.um6p.ma/AlignStatPlot/

Ancestral sequence reconstruction (ASR) or ancestral gene/sequence reconstruction/resurrection tools to study molecular evolution

Jit — Tue, 30 May 2017 04:20:05 -0500

Ancestral sequence reconstruction (ASR) – also known as ancestral gene/sequence reconstruction/resurrection – is a technique used in the study of molecular evolution. The method consists of the synthesis of an ancestral gene and expression of the corresponding ancestral protein. The idea of protein 'resurrection' was suggested in 1963 by Pauling and Zuckerkandl. Some early efforts were made in the eighties-nineties, led by the laboratory of Steven A. Benner, showing the potential of this technique – one that only started to be fulfilled in the post-genomic era. Thanks to the improvement of algorithms and of better sequencing and synthesis techniques, the method was developed further in the early 2000s to allow the resurrection of a greater variety of and much more ancient genes. Over the last decade, ancestral protein resurrection has developed as a strategy to reveal the mechanisms and dynamics of protein evolution.

The following is a list of ancestral sequence reconstruction (ASR) tools:

inferCars

Reconstructs contiguous regions of an ancestral genome. Given information about adjacencies between conserved segments in each modern species, our goal is to infer segment order in the ancestral genome. To get a clean and precise statement of the problem, we formalize it using graph theory. We develop an algorithm that identifies a most parsimonious scenario for the history of each individual adjacency, although the whole-genome prediction is not guaranteed to optimize traditional measures like the number of breakpoints. We introduce weights to the graph edges to model the reliability of each adjacency.

ANGES:reconstructing ANcestral GEnomeS maps

A suite of Python programs that allows reconstructing ancestral genome maps from the comparison of the organization of extant-related genomes. ANGES can reconstruct ancestral genome maps for multichromosomal linear genomes and unichromosomal circular genomes. It implements methods inspired from techniques developed to compute physical maps of extant genomes.

Cocos

Constructs phylogenies of multi-domain proteins. Given a species tree and domain phylogenies, the procedure infers the composition of ancestral multi-domain proteins. Cocos implements and extends an algorithmic approach suggested by Behzadi and Vingron in an easy-to-use program. Such a method could be applied to the reconstruction of partially homologous units such as bacterial operons or protein complexes.

MySSP

Constructs an initial DNA sequence at the root of the tree and simulates evolution across the tree using a variety of common models of DNA evolution. MySSP is a program for the simulation of DNA sequence evolution across a phylogenetic tree. It is designed for large-scale studies, including simulation of multiple replicates and outputs sequences into NEXUS, MEGA, or FASTA formats. MySSP has a fairly simple graphical user interface (GUI) for basic use, but also has a specialized batch script interpreter to allow for more complicated or large-scale simulations.

PARANA: Parsimonious Ancestral Reconstruction And Network Analysis

Performs parsimony based inference of ancestral biological networks. Given multiple extant networks and phylogenetic information relating extant nodes, PARANA finds a parsimonious set of ancestral interaction events (edge gains and losses) which explain the extant networks. The framework adopted by PARANA is able to represent network evolution under models that support gene duplication and loss and independent interaction gain and loss. The method works on both directed and undirected networks and can incorporate asymmetric interaction gain and loss costs. In contrast to previous approaches, PARANA does not require knowing the relative ordering of unrelated duplication events and thus, works on phylogenetic trees even where branch lengths are not provided.

GapAdj: Gapped Adjacencies

A synteny-based method that is flexible enough to handle a model of evolution involving whole genome duplication events, in addition to rearrangements, gene insertions, and losses. Ancestral relationships between markers are defined in terms of Gapped Adjacencies, i.e. pairs of markers separated by up to a given number of markers. It improves on a previous method restricted to direct adjacencies, which showed high accuracy for adjacency prediction but had the drawback of being overly conservative, i.e. of generating a large number of contiguous ancestral regions (CARs).

ANCESTOR

A web server allowing one to easily and quickly perform the last three steps of the ancestral genome reconstruction procedure. Ancestors implements several alignment algorithms, an indel maximum likelihood solver and a context-dependent maximum likelihood substitution inference algorithm. The results presented by the server include the posterior probabilities for the last two steps of the ancestral genome reconstruction and the expected error rate of each ancestral base prediction.

ProCARs

Reconstructs ancestral gene orders as contiguous ancestral regions (CARs) with a progressive homology-based method. ProCARs runs from a phylogenetic tree (branch lengths are not needed) with a marked ancestor and a block file. This homology-based method is based on iteratively detecting and assembling ancestral adjacencies, while allowing some micro-rearrangements of synteny blocks at the extremities of the progressively assembled CARs. The method starts with a set of blocks as the initial set of CARs, and iteratively detects the potential ancestral adjacencies between extremities of CARs, while building up the CARs progressively by adding, at each step, new non-conflicting adjacencies that induce the least homoplasy. The species tree is used, in some additional internal steps, to compute a score for the remaining conflicting adjacencies, and to detect other reliable adjacencies, in order to reach completely assembled ancestral genomes.

FastML

A user-friendly tool for the reconstruction of ancestral sequences. FastML implements various novel features that differentiate it from existing tools: (i) FastML uses an indel-coding method, in which each gap, possibly spanning multiple sites, is coded as binary data. FastML then reconstructs ancestral indel states assuming a continuous time Markov process. FastML provides the most likely ancestral sequences, integrating both indels and characters; (ii) FastML accounts for uncertainty in ancestral states: it provides not only the posterior probabilities for each character and indel at each sequence position, but also a sample of ancestral sequences from this posterior distribution, and a list of the k-most likely ancestral sequences; (iii) FastML implements a large array of evolutionary models, which makes it generic and applicable for nucleotide, protein and codon sequences; and (iv) a graphical representation of the results is provided, including, for example, a graphical logo of the inferred ancestral sequences.

maxAlike

Reconstructs a genomic sequence for a specific taxon based on sequence homologs in other species. The input is a multiple sequence alignment and a phylogenetic tree that also contains the target species. For this target species, the algorithm computes nucleotide probabilities at each sequence position. Consensus sequences are then reconstructed based on a certain confidence level.

MLGO: Maximum Likelihood for Gene Order Analysis

A web tool for the reconstruction of phylogeny and/or ancestral genomes from gene-order data. MLGO was designed for analysis of large-scale genomic changes including not only rearrangements but also gene insertions, deletions and duplications. MLGO can be used to infer a phylogeny from genome rearrangement and gene order data, and can also estimate ancestral genomes, given an input tree. MLGO takes advantage of a binary encoding of gene-order data, supports a fairly general model of genomic evolution (rearrangements plus duplications, insertions, and losses of genomic regions), and fits within a maximum-likelihood framework.


FSA: Fast Statistical Alignment

Jit — Mon, 06 Feb 2017 04:26:01 -0600

FSA is a probabilistic multiple sequence alignment algorithm which uses a "distance-based" approach to aligning homologous protein, RNA or DNA sequences. Much as distance-based phylogenetic reconstruction methods like Neighbor-Joining build a phylogeny using only pairwise divergence estimates, FSA builds a multiple alignment using only pairwise estimations of homology. This is made possible by the sequence annealing technique for constructing a multiple alignment from pairwise comparisons, developed by Ariel Schwartz in "Posterior Decoding Methods for Optimization and Control of Multiple Alignments."

FSA brings the high accuracies previously available only for small-scale analyses of proteins or RNAs to large-scale problems such as aligning thousands of sequences or megabase-long sequences. FSA introduces several novel methods for constructing better alignments:

FSA uses machine-learning techniques to estimate gap and substitution parameters on the fly for each set of input sequences. This "query-specific learning" alignment method makes FSA very robust: it can produce superior alignments of sets of homologous sequences which are subject to very different evolutionary constraints.
FSA is capable of aligning hundreds or even thousands of sequences using a randomized inference algorithm to reduce the computational cost of multiple alignment. This randomized inference can be over ten times faster than a direct approach with little loss of accuracy.
FSA can quickly align very long sequences using the "anchor annealing" technique for resolving anchors and projecting them with transitive anchoring. It then stitches together the alignment between the anchors using the methods described above.
The included GUI, MAD (Multiple Alignment Display), can display the intermediate alignments produced by FSA, where each character is colored according to the probability that it is correctly aligned (see the picture and movie at the top of the page).

You can see more information on the FAQ.

Address of the bookmark: http://fsa.sourceforge.net/

FOGSAA: Fast Optimal Global Sequence Alignment Algorithm

Jit — Fri, 08 Dec 2017 14:41:08 -0600

Sequence alignment algorithms are widely used to infer similarity and points of difference between a pair of sequences. FOGSAA is a fast global alignment algorithm. It is essentially a branch-and-bound approach that starts branch expansion greedily, taking symbols from the given pair of sequences (protein or nucleotide), and produces an optimal alignment faster than conventional dynamic programming techniques. It is also better than heuristic methods with respect to alignment quality.
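
For context, the conventional dynamic programming baseline that FOGSAA is compared against can be sketched as below. This is the classic Needleman-Wunsch recurrence with a linear gap penalty, not FOGSAA's branch-and-bound method; the function name and the scoring values (match=1, mismatch=-1, gap=-2) are assumptions chosen for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score via Needleman-Wunsch.

    Fills the score table row by row, keeping only the previous row,
    so time is O(len(a) * len(b)) and memory is O(len(b)).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

The table has one cell per pair of prefixes, which is the quadratic cost that a branch-and-bound search tries to avoid by pruning branches that cannot beat the best alignment found so far.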

Address of the bookmark: http://www.isical.ac.in/~bioinfo_miu/FOGSAA.htm

Kalign: fast multiple sequence alignment program for biological sequences.

BioStar — Fri, 01 Nov 2019 00:20:41 -0500

Kalign is a fast multiple sequence alignment program for biological sequences.

Align sequences and output the alignment in MSF format:

kalign -i BB11001.tfa -f msf -o out.msf

Align sequences and output the alignment in clustal format:

kalign -i BB11001.tfa -f clu -o out.clu

Re-align sequences in an existing alignment:

kalign -i BB11001.msf -o out.afa

Reformat existing alignment:

kalign -i BB11001.msf -r afa -o out.afa

Address of the bookmark: https://github.com/TimoLassmann/kalign

Bioinformatics Algorithms

Jitendra Narayan — Tue, 16 Jul 2013 03:35:15 -0500

An algorithm is a computable set of steps to achieve a desired result.

We use algorithms every day. For example, a recipe for baking a cake is an algorithm. Most programs, with the exception of some artificial intelligence applications, consist of algorithms. Inventing elegant algorithms -- algorithms that are simple and require the fewest steps possible -- is one of the principal challenges in programming. An algorithm is a description of a procedure which terminates with a result. In other words an algorithm is a set of instructions, sometimes called a procedure or a function, that is used to perform a certain task. This can be a simple process, such as adding two numbers together, or a complex function, such as adding effects to an image. For example, in order to sharpen a digital photo, the algorithm would need to process each pixel in the image and determine which ones to change and how much to change them in order to make the image look sharper.

In mathematics, computer science, and related subjects, an algorithm is an effective method for solving a problem using a finite sequence of instructions. Algorithms are used for calculation, data processing, and many other fields.
Each algorithm is a list of well-defined instructions for completing a task. Starting from an initial state, the instructions describe a computation that proceeds through a well-defined series of successive states, eventually terminating in a final ending state. The transition from one state to the next is not necessarily deterministic; some algorithms, known as randomized algorithms, incorporate randomness.

History

The concept dates back to antiquity and became more precise with the use of variables in mathematics. Algorithms in the sense now used by computers appeared as soon as the first mechanical engines were invented.
The word algorithm comes from the name of the 9th century Persian Muslim mathematician Abu Abdullah Muhammad ibn Musa Al-Khwarizmi. The word algorism originally referred only to the rules of performing arithmetic using Hindu-Arabic numerals but evolved via European Latin translation of Al-Khwarizmi's name into algorithm by the 18th century. The use of the word evolved to include all definite procedures for solving problems or performing tasks.
The algorithm of Archimedes gives an approximation of pi.
Eratosthenes defined an algorithm for finding prime numbers.
Averroes (1126-1198) used algorithmic methods for calculations.
Adelard of Bath (12th century) introduced the term algorismus, from Al-Khwarizmi.
From the 1800s to the mid-1900s:

- George Boole (1847) invented binary algebra, the basis of computers. In effect, he unified logic and calculation in a common symbolism.

- Gottlob Frege (1879) introduced his formula language, a lingua characterica: a language written with special symbols, "for pure thought", that is, free from rhetorical embellishments... constructed from specific symbols that are manipulated according to definite rules.

- Giuseppe Peano (1888): his The principles of arithmetic, presented by a new method was the first attempt at an axiomatization of mathematics in a symbolic language.

- Alfred North Whitehead and Bertrand Russell, in their Principia Mathematica (1910-1913), further simplified and amplified the work of Frege.

- Kurt Gödel (1931) cites the paradox of the liar that completely reduces rules of recursion to numbers.

The concept of algorithm was formalized in 1936 through Alan Turing's Turing machines and Alonzo Church's lambda calculus, which in turn formed the foundation of computer science.
Stephen C. Kleene (1943) defined his now-famous thesis known as the "Church-Turing Thesis". In this context:

"Algorithmic theories... In setting up a complete algorithmic theory, what we do is to describe a procedure, performable for each set of values of the independent variables, which procedure necessarily terminates and in such manner that from the outcome we can read a definite answer, "yes" or "no," to the question, "is the predicate value true?""

Classification

Classification by purpose

Each algorithm has a goal. For example, the purpose of the Quicksort algorithm is to sort data in ascending or descending order. Because the number of possible goals is unlimited, algorithms have to be grouped by kind of purpose.

Classification by implementation

An algorithm may be implemented according to different basic principles.

Recursive or iterative

A recursive algorithm is one that calls itself repeatedly until a certain condition is met. It is a method common to functional programming.
Iterative algorithms use repetitive constructs like loops.
Some problems are better suited to one implementation or the other. For example, the Towers of Hanoi problem is most naturally expressed recursively. Every recursive version has an equivalent iterative version, and vice versa.
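
To illustrate the equivalence, here is a sketch of the Towers of Hanoi solved both ways. The function names and peg labels are illustrative assumptions; the iterative version simulates the recursion with an explicit stack, which is one of several possible iterative formulations.

```python
def hanoi_recursive(n, source="A", target="C", spare="B"):
    """Return the list of (from_peg, to_peg) moves for n disks."""
    if n == 0:
        return []
    return (hanoi_recursive(n - 1, source, spare, target)    # clear the way
            + [(source, target)]                             # move largest disk
            + hanoi_recursive(n - 1, spare, target, source)) # restack on top

def hanoi_iterative(n, source="A", target="C", spare="B"):
    """Same moves, using an explicit stack instead of the call stack."""
    moves = []
    # Each entry: (disks, from_peg, to_peg, via_peg, is_single_move)
    stack = [(n, source, target, spare, False)]
    while stack:
        k, src, dst, tmp, is_move = stack.pop()
        if is_move:
            moves.append((src, dst))
        elif k > 0:
            # Push in reverse order so the first step is popped first.
            stack.append((k - 1, tmp, dst, src, False))
            stack.append((0, src, dst, tmp, True))
            stack.append((k - 1, src, tmp, dst, False))
    return moves
```

Both functions produce the same sequence of 2^n - 1 moves. The recursive form mirrors the problem statement, while the iterative form makes the bookkeeping explicit.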

Logical or procedural

An algorithm may be viewed as controlled logical deduction.
A logic component expresses the axioms which may be used in the computation and a control component determines the way in which deduction is applied to the axioms.
This is the basis of the logic programming. In pure logic programming languages the control component is fixed and algorithms are specified by supplying only the logic component.

Serial or parallel

Algorithms are usually discussed with the assumption that computers execute one instruction of an algorithm at a time. This is a serial algorithm, as opposed to parallel algorithms, which take advantage of computer architectures to process several instructions at once. They divide the problem into sub-problems and pass them to several processors. Iterative algorithms are generally parallelizable. Sorting algorithms can be parallelized efficiently.

Deterministic or non-deterministic

Deterministic algorithms solve the problem with a predefined process, whereas non-deterministic algorithms must guess the best solution at each step through the use of heuristics.

Classification by design paradigm

A design paradigm is a domain in research or class of problems that requires a dedicated kind of algorithm:

Divide and conquer

A divide and conquer algorithm repeatedly reduces an instance of a problem to one or more smaller instances of the same problem (usually recursively), until the instances are small enough to solve easily. One example of divide and conquer is merge sort: the data is divided into segments, each segment is sorted, and the sorted segments are merged in the conquer phase to obtain the fully sorted data.
The binary search algorithm is an example of a variant of divide and conquer called decrease and conquer, which solves a single smaller subproblem of the same kind and uses its solution to solve the bigger problem.
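
Both examples can be sketched in a few lines of Python. The function names are illustrative, and the implementations favour clarity over performance.

```python
def merge_sort(items):
    """Divide and conquer: split in half, sort each half, merge the results."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left, right = merge_sort(items[:mid]), merge_sort(items[mid:])
    merged, i, j = [], 0, 0
    # Conquer phase: repeatedly take the smaller head element.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]

def binary_search(sorted_items, target):
    """Decrease and conquer: discard half of the range at every step.

    Returns the index of target, or -1 if it is absent.
    """
    lo, hi = 0, len(sorted_items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```

Merge sort recurses into both halves, whereas binary search continues into only one of them, which is the distinction between the two paradigms.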

Dynamic programming

The shortest path in a weighted graph can be found by using the shortest path to the goal from all adjacent vertices.
When the optimal solution to a problem can be constructed from optimal solutions to subproblems, using dynamic programming avoids recomputing solutions that have already been computed.
- The main difference from the "divide and conquer" approach is that subproblems are independent in divide and conquer, whereas subproblems overlap in dynamic programming.
- Dynamic programming and memoization go together. The difference with straightforward recursion is in caching or memoization of recursive calls. Where subproblems are independent, this is useless. By using memoization or maintaining a table of subproblems already solved, dynamic programming reduces the exponential nature of many problems to polynomial complexity.
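
The effect of memoization on overlapping subproblems can be seen with the Fibonacci numbers, a common textbook example that is not taken from the text above. The sketch counts how often each version is called; the calls dictionary is an illustrative device only.

```python
from functools import lru_cache

# Call counters, used only to make the difference in work visible.
calls = {"plain": 0, "memo": 0}

def fib_plain(n):
    """Straightforward recursion: recomputes the same subproblems many times."""
    calls["plain"] += 1
    return n if n < 2 else fib_plain(n - 1) + fib_plain(n - 2)

@lru_cache(maxsize=None)
def fib_memo(n):
    """Memoized recursion: each distinct subproblem is solved exactly once."""
    calls["memo"] += 1
    return n if n < 2 else fib_memo(n - 1) + fib_memo(n - 2)
```

Calling both with n = 20 returns the same value, but the plain version makes thousands of calls while the memoized version is entered once for each value from 0 to 20. This is the exponential-to-polynomial reduction described above.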

The greedy method

A greedy algorithm is similar to a dynamic programming algorithm, but the difference is that solutions to the subproblems do not have to be known at each stage. Instead, a "greedy" choice is made of whatever looks best at the moment.
A well-known greedy algorithm is Kruskal's algorithm for finding a minimum spanning tree.
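
A compact sketch of Kruskal's algorithm follows. The greedy choice is to always take the cheapest remaining edge that does not create a cycle. The input format (a list of (weight, u, v) tuples over vertices numbered from 0) is an assumption made for this example.

```python
def kruskal(num_vertices, edges):
    """Return the edges of a minimum spanning tree (or forest).

    edges: iterable of (weight, u, v) with 0 <= u, v < num_vertices.
    """
    parent = list(range(num_vertices))  # union-find forest

    def find(x):
        # Follow parent links to the set representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):  # cheapest edge first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:            # edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree
```

No decision is ever revisited: once an edge is accepted or rejected, the choice is final, which is what distinguishes the greedy method from dynamic programming.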

Linear programming

The problem is expressed as a set of linear inequalities, and an attempt is then made to maximize or minimize a linear function of the inputs. This can solve many problems such as the maximum flow for directed graphs, notably by using the simplex algorithm.
A more complex variant of linear programming is integer programming, where the solution space is restricted to integers.

Reduction also called transform and conquer

Solve a problem by transforming it into another problem. A simple example: the median of an unsorted list can be found by first sorting the list and then taking the middle element of the sorted list. The main goal of reduction is to find the simplest transformation possible.
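
The median example translates directly into code. In this sketch the function name is illustrative, and averaging the two middle values for even-length input is a common convention assumed here rather than something stated above.

```python
def median(values):
    """Find the median by reducing the problem to sorting."""
    ordered = sorted(values)  # the transformation step
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Even length: average the two middle elements (assumed convention).
    return (ordered[mid - 1] + ordered[mid]) / 2
```

The reduction is simple but not the cheapest: sorting costs O(n log n), while dedicated selection algorithms can find the median in linear time.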

Using graphs

Many problems, such as playing chess, can be modeled as problems on graphs, and are then solved with graph exploration algorithms.
This category also includes the search algorithms and backtracking.
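
As one example of graph exploration, breadth-first search finds a path with the fewest edges between two nodes. The adjacency-dictionary representation and the function name below are assumptions made for this sketch.

```python
from collections import deque

def bfs_shortest_path(graph, start, goal):
    """Return a shortest path (fewest edges) from start to goal, or None.

    graph: dict mapping each node to a list of neighbouring nodes.
    """
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parent links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None
```

Nodes are explored in order of their distance from the start, so the first time the goal is dequeued the recorded path is already a shortest one.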

The probabilistic and heuristic paradigm

Probabilistic

Those that make some choices randomly.

Genetic

Attempt to find solutions to problems by mimicking biological evolutionary processes, with a cycle of random mutations yielding successive generations of "solutions". Thus, they emulate reproduction and "survival of the fittest".

Heuristic

Whose general purpose is not to find an optimal solution, but an approximate solution where the time or resources to find a perfect solution are not practical.
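
The genetic approach described above can be sketched on a toy problem: evolving a bit string towards all ones. Everything here (the function name, the fitness function, truncation selection, and the parameter defaults) is an illustrative assumption. The sketch uses mutation only, without crossover, to stay close to the description of random mutations and survival of the fittest.

```python
import random

def evolve_bitstring(length=20, population_size=30, generations=40,
                     mutation_rate=0.05, seed=0):
    """Evolve bit strings towards all ones; return best fitness per generation."""
    rng = random.Random(seed)
    fitness = sum  # fitness = number of 1 bits in the candidate
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    history = [max(map(fitness, population))]
    for _ in range(generations):
        # Survival of the fittest: keep the better half unchanged.
        population.sort(key=fitness, reverse=True)
        survivors = population[:population_size // 2]
        # Reproduction: refill the population with mutated copies of survivors.
        children = []
        while len(survivors) + len(children) < population_size:
            parent = rng.choice(survivors)
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in parent]
            children.append(child)
        population = survivors + children
        history.append(max(map(fitness, population)))
    return history
```

Because the survivors are carried over unchanged, the best fitness never decreases from one generation to the next, although reaching the optimum is not guaranteed. This makes it a heuristic rather than an exact method.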

Classification by complexity

Some algorithms complete in linear time, some complete in an exponential amount of time, and some never complete.

Algorithm resources on the net:

Graph Algorithms in Bioinformatics

Bioinformatics Algorithms Description

Bioinformatics Algorithms Course Page

Bioinformatics Algorithm Demonstrations

Introduction to Bioinformatics Algorithms Lectures 1-2 by Dr. Max Alekseyev USC, 2009

Online Lectures on Bioinformatics

Sequence Alignment Algorithms

Algorithm for sequence alignment: dynamic programming

Network Protocol Analysis using Bioinformatics Algorithms