BOL: Edit distance application in bioinformatics !: Revision

Last updated 2682 days ago by Neel

Besides the Levenshtein distance, there are other popular measures of edit distance, each calculated using a different set of allowable edit operations. For instance:

the Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters;
the longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
the Hamming distance allows only substitution, so it applies only to strings of the same length;
the Jaro distance allows only transposition.
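To make the difference between operation sets concrete, here is a minimal pure-Perl sketch of two of the measures listed above. The function names `hamming_distance` and `lcs_distance` are hypothetical helpers written for this illustration and do not come from any CPAN module.

```perl
use strict;
use warnings;

# Hamming distance: substitutions only, so both strings must have equal length.
sub hamming_distance {
    my ($s, $t) = @_;
    die "strings must have equal length\n" unless length($s) == length($t);
    my $diff = 0;
    for my $i (0 .. length($s) - 1) {
        $diff++ if substr($s, $i, 1) ne substr($t, $i, 1);
    }
    return $diff;
}

# LCS distance: insertions and deletions only, which gives
# distance = length(s) + length(t) - 2 * length(LCS(s, t)).
sub lcs_distance {
    my ($s, $t) = @_;
    my @prev = (0) x (length($t) + 1);    # LCS lengths for the previous row
    for my $i (1 .. length($s)) {
        my @curr = (0);
        for my $j (1 .. length($t)) {
            if (substr($s, $i - 1, 1) eq substr($t, $j - 1, 1)) {
                $curr[$j] = $prev[$j - 1] + 1;
            }
            else {
                $curr[$j] = $prev[$j] > $curr[$j - 1] ? $prev[$j] : $curr[$j - 1];
            }
        }
        @prev = @curr;
    }
    return length($s) + length($t) - 2 * $prev[-1];
}

print hamming_distance("GATTACA", "GACTATA"), "\n";    # prints "2"
print lcs_distance("foo", "four"), "\n";               # prints "3"
```

Because substitution is not allowed, the LCS distance between "foo" and "four" is 3 (delete one "o", insert "u" and "r"), one more than the Levenshtein distance of 2.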

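The Text::Levenshtein module computes the Levenshtein distance between a string and one or more other strings:

    use Text::Levenshtein qw(distance);

    # Distance between two strings
    print distance("foo", "four");
    # prints "2"

    # Distance from one string to each string in a list
    my @words     = qw/ four foo bar /;
    my @distances = distance("foo", @words);

    print "@distances";
    # prints "2 0 3"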
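For readers who want to see where those numbers come from, the following is a minimal sketch of the standard dynamic-programming (Wagner-Fischer) computation of the Levenshtein distance. It is not the source of Text::Levenshtein; the function name `levenshtein` is a hypothetical helper, and only the core List::Util module is assumed.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Levenshtein distance with unit cost for insertion, deletion and substitution.
# Only two rows of the DP table are kept in memory at any time.
sub levenshtein {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));    # distance from "" to each prefix of $t
    for my $i (1 .. length($s)) {
        my @curr = ($i);             # distance from a prefix of $s to ""
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,            # deletion
                $curr[$j - 1] + 1,        # insertion
                $prev[$j - 1] + $cost,    # match or substitution
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

print join(" ", map { levenshtein("foo", $_) } qw(four foo bar)), "\n";
# prints "2 0 3"
```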

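Algorithm::LCSS exports three functions for finding common subsequences, each of which accepts either two array references or two strings:

    use Algorithm::LCSS qw( LCSS CSS CSS_Sorted );

    my $lcss_ary_ref = LCSS( \@SEQ1, \@SEQ2 );  # ref to array
    my $lcss_string  = LCSS( $STR1, $STR2 );    # string

    my $css_ary_ref = CSS( \@SEQ1, \@SEQ2 );    # ref to array of arrays
    my $css_str_ref = CSS( $STR1, $STR2 );      # ref to array of strings

    # Distinct variable names avoid redeclaring the CSS results above
    my $css_sorted_ary_ref = CSS_Sorted( \@SEQ1, \@SEQ2 );  # ref to array of arrays
    my $css_sorted_str_ref = CSS_Sorted( $STR1, $STR2 );    # ref to array of strings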

There are many different modules on CPAN for calculating the edit distance between two strings. Here's just a selection.

Text::LevenshteinXS and Text::Levenshtein::XS are both implementations of the Levenshtein algorithm that require a C compiler, but they will be a lot faster than the pure-Perl Text::Levenshtein.

The Damerau-Levenshtein edit distance is like the Levenshtein distance, but in addition to insertion, deletion and substitution, it also counts the transposition of two adjacent characters as a single edit. The module Text::Levenshtein::Damerau defaults to a pure-Perl implementation, but it will be a lot quicker if you have installed Text::Levenshtein::Damerau::XS.
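The effect of counting a transposition as one edit can be sketched in pure Perl. The code below implements the restricted variant usually called optimal string alignment, in which no substring is edited more than once; this is an assumption made for brevity and is not necessarily the variant that Text::Levenshtein::Damerau implements. The function name `osa_distance` is hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Optimal string alignment distance: Levenshtein plus adjacent transposition.
sub osa_distance {
    my ($s, $t) = @_;
    my @a = split //, $s;
    my @b = split //, $t;
    my ($m, $n) = (scalar @a, scalar @b);

    my @d;
    $d[$_][0] = $_ for 0 .. $m;    # delete every character of the prefix
    $d[0][$_] = $_ for 0 .. $n;    # insert every character of the prefix

    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $cost = $a[$i - 1] eq $b[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,            # deletion
                $d[$i][$j - 1] + 1,            # insertion
                $d[$i - 1][$j - 1] + $cost,    # match or substitution
            );
            # Two adjacent characters swapped: count as a single edit.
            if (   $i > 1 && $j > 1
                && $a[$i - 1] eq $b[$j - 2]
                && $a[$i - 2] eq $b[$j - 1])
            {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$m][$n];
}

print osa_distance("ACGT", "CAGT"), "\n";    # prints "1"
```

Swapping the first two bases of "ACGT" costs a single edit here, whereas the plain Levenshtein distance between the same two strings is 2.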

Text::WagnerFischer is an implementation of the Wagner-Fischer edit distance, which is similar to the Levenshtein distance but applies a different weight to each edit type.
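As an illustration of what per-operation weights change, here is a minimal pure-Perl sketch of a weighted edit distance. It does not reproduce the Text::WagnerFischer interface; the function `weighted_distance` and the hash keys `insert`, `delete` and `substitute` are hypothetical names chosen for this example.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Edit distance where each operation type has its own cost.
# $w is a hash reference: { insert => ..., delete => ..., substitute => ... }.
sub weighted_distance {
    my ($s, $t, $w) = @_;
    my ($ins, $del, $sub) = @{$w}{qw(insert delete substitute)};

    my @prev = map { $_ * $ins } 0 .. length($t);    # build prefixes of $t from ""
    for my $i (1 .. length($s)) {
        my @curr = ($i * $del);                       # erase a prefix of $s
        for my $j (1 .. length($t)) {
            my $same = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1);
            $curr[$j] = min(
                $prev[$j] + $del,
                $curr[$j - 1] + $ins,
                $prev[$j - 1] + ($same ? 0 : $sub),
            );
        }
        @prev = @curr;
    }
    return $prev[-1];
}

# Unit weights reproduce the Levenshtein distance.
print weighted_distance("foo", "four",
    { insert => 1, delete => 1, substitute => 1 }), "\n";    # prints "2"

# An expensive substitution makes delete-plus-insert the cheaper route.
print weighted_distance("foo", "four",
    { insert => 1, delete => 1, substitute => 3 }), "\n";    # prints "3"
```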

Text::Brew is an implementation of the Brew edit distance, which is another algorithm based on edit weights.

Text::Fuzzy provides a number of operations for partial or fuzzy matching of text based on edit distance. Text::Fuzzy::PP is a pure-Perl implementation of the same interface.

String::Similarity takes two strings and returns a value between 0 (meaning entirely different) and 1 (meaning identical). It is apparently based on edit distance.

Text::Dice calculates Dice's coefficient for two strings. This formula was originally developed to measure the similarity of two different populations in ecological research.
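Dice's coefficient for two sets X and Y is 2|X ∩ Y| / (|X| + |Y|). The sketch below applies it to strings by assuming that each string is represented as its multiset of adjacent character pairs (bigrams); that representation is an assumption of this example, and the functions `bigrams` and `dice_coefficient` are hypothetical helpers rather than the Text::Dice interface.

```perl
use strict;
use warnings;

# Split a string into its overlapping two-character substrings.
sub bigrams {
    my ($s) = @_;
    return map { substr($s, $_, 2) } 0 .. length($s) - 2;
}

# Dice's coefficient over bigram multisets: 2 * shared / (total bigrams).
# Returns 0 when neither string is long enough to contain a bigram.
sub dice_coefficient {
    my ($s, $t) = @_;
    my @x = bigrams($s);
    my @y = bigrams($t);
    return 0 unless @x + @y;

    my %count;
    $count{$_}++ for @x;

    my $shared = 0;
    for my $bg (@y) {
        if ($count{$bg}) {    # consume one matching bigram from $s
            $count{$bg}--;
            $shared++;
        }
    }
    return 2 * $shared / (@x + @y);
}

print dice_coefficient("night", "nacht"), "\n";    # prints "0.25"
```

"night" and "nacht" each have four bigrams and share only "ht", giving 2 * 1 / 8 = 0.25. Unlike an edit distance, a higher value here means the strings are more similar.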
