BOL: Related items

Pattern Matching Problem Solution with Perl

Jit — Tue, 09 Jun 2015 23:58:45 -0500

Problem at http://rosalind.info/problems/1c/

#Find all occurrences of a pattern in a string.
#Given: Strings Pattern and Genome.
#Return: All starting positions in Genome where Pattern appears as a substring. Use 0-based indexing.

use strict;
use warnings;

my $string="GATATATGCATATACTT";
my $subStr="ATAT";
my $kmer=length($subStr);

kmerMatch ($string, $subStr, $kmer);

sub kmerMatch { #Check the exact matching kmers with sliding window
my ($string, $myStr, $kmer)=@_;
for (my $aa=0; $aa<=(length($string)-$kmer); $aa++) {
    my $myWin=substr $string, $aa,$kmer;
    if ($myWin eq $myStr) {
        #print "$myWin eq $myStr\n";
        print $aa;
    }
}
}

BioScripts

Rahul Nayak — Sun, 28 Jun 2015 07:46:14 -0500

You are requested to please bookmark collection of bioinformatics tools, scripts, codes that can be pieced together in a very easy and flexible manner to perform both simple and complex bioinformatics tasks.

The next-generation sequencing included whole genome sequencing(WGS), transcriptome sequencing (whole cDNA sequencing, RNA-seq), digital gene expression sequencing (Tag-Seq), ChIP-Seq, and so on. And there are many sequencing platform to generate sequece, as well know Sanger/ABi(the frist generation), Solexa/illumina, SOLiD/ABi, 454/Roche. But thier sequence format is different, also they have different error type. High quality data is very important for further analysis or data mining. There are many pipeline for raw sequence quality analysis and control with few of process for reporting reads quality statistical details, trimming, filtering, and error correction. Please bookmarks them for the benefits of bioinformatics community.

https://code.google.com/p/biowiki/

https://code.google.com/p/ngs-pipeline/source/browse/#svn%2Ftrunk

NGSand Perl scripts https://code.google.com/hosting/search?q=NGS+perl&projectsearch=Search+projects

NGS and Python scripts https://code.google.com/hosting/search?q=NGS+Python&projectsearch=Search+projects

Address of the bookmark: https://code.google.com/hosting/search?q=bioinformatics&sa=Search

BioToolbox

Jit — Fri, 19 Feb 2016 09:14:44 -0600

This is a collection of libraries and high-quality end-user scripts for bioinformatic analysis, including working with gene annotation, collecting data scores from a variety of modern file formats, and conversion between file formats. The Bio::ToolBox libraries provide a unified, abstracted interface to multiple common gene annotation formats and the collection of data from multiple data files. They rely on BioPerl SeqFeature libraries and related adaptors to access binary file formats including Bam, BigWig, BigBed, and USeq. The Bio::ToolBox package includes scripts for setting up databases of annotation, collecting annotated features, collecting genomic data relative to features, manipulating and analyzing data, and data format conversion.

More at http://cpansearch.perl.org/src/TJPARNELL/

Address of the bookmark: http://cpansearch.perl.org/src/TJPARNELL/

How to install Perl modules on Mac OS X in easy steps !!

Jit — Thu, 20 Oct 2016 07:26:29 -0500

Today at work, I learned how to install Perl modules using CPAN. It’s a lot easier than I thought.

You see, for the past couple of years, I’ve been a bit frustrated because OS X does not come with a whole lot of Perl modules pre-installed, and for all I googled, I couldn’t find an “idiot’s” guide for moderately-savvy-but-not-expert users like myself to install modules and dependencies on demand.

The only instructions I could find point to Fink, which basically installs modules in a path that isn’t included in the Perl @INC variable, meaning you have to manually specify the full path to the modules in every script — which is not a lot of fun if you’re developing on OS X and deploying on Red Hat, for instance.

Moreover, Fink doesn’t seem to make every module available, and it’s not very easy to determine which Fink package you need to install if you need a particular module.

So, with a script that called on several apparently unavailable modules, and a deadline looming, I finally decided to suck it up and figure out how to use CPAN to install them:

1) Make sure you have the Apple Developer Tools (XCode) installed.

These are on one of your install discs, or available as a huge but free download from the Apple Developer Connection [free registration required] or the Mac App Store. I thought I had them, but apparently when we upgraded that computer to Tiger, they went missing.

If you don’t have this stuff installed, your installation will fail with errors about unavailable commands.

1.5) Install Command Line Tools (Recent XCode versions only)

(Thank you to Tom Marchioro for informing me about this step.)

Older versions of XCode installed the command line tools (which are required to properly install CPAN modules) by default, but apparently newer ones do not. To check whether you have the command line tools already installed, run the following from the Terminal:

$ which make

This command checks the system for the “make” tool. If it spits out something like /usr/bin/make you’re golden and can skip ahead to Step 2. If you just get a new prompt and no output, you’ll need to install the tools:

Launch XCode and bring up the Preferences panel.
Click on the Downloads tab
Click to install the Command Line Tools

If you like, you can run which make again to confirm that everything’s installed correctly.

2) Configure CPAN.

$ sudo perl -MCPAN -e shell

perl> o conf init

This will prompt you for some settings. You can accept the defaults for almost everything (just hit “return”). The two things you must fill in are the path to make (which should be /usr/bin/make or the value returned when you run which make from the command line) and your choice of CPAN mirrors (which you actually choose don’t really matter, but it won’t let you finish until you select at least one). If you use a proxy or a very restrictive firewall, you may have to configure those settings as well.

If you skip Step 2, you may get errors about make being unavailable.

3) Upgrade CPAN

$ sudo perl -MCPAN -e 'install Bundle::CPAN'

Don’t forget the sudo, or it’ll fail with permissions errors, probably when doing something relatively unimportant like installing man files.

This will spend a long time downloading, testing, and compiling various files and dependencies. Bear with it. It will prompt you a few times about dependencies. You probably want to enter “yes”. I agreed to everything it asked me, and everything turned out fine. YMMV of course. If everything installs properly, it’ll give you an “OK” at the end.

4) Install your modules. For each module….

$ sudo perl -MCPAN -e 'install Bundle::Name'

$ sudo perl -MCPAN -e 'install Module::Name'

This will install the module and its dependencies. Nice, eh? Again, don’t forget the sudo.

The first time you run this after upgrading CPAN, it may prompt you to configure again (see Step 2). If you accept its offer to try to configure itself automatically, it may just run through everything without a problem.

There are a couple of potential pitfalls with specific modules (such as theLWP::UserAgent / HEAD issue), but most have workarounds, and I haven’t run into anything that wasn’t easily recoverable.

And that’s it!

Did you find this useful? Is there anything I missed?

Edit distance application in bioinformatics !

Neel — Thu, 07 Dec 2017 08:46:51 -0600

There are other popular measures of edit distance, which are calculated using a different set of allowable edit operations. For instance,

the Damerau–Levenshtein distance allows insertion, deletion, substitution, and the transposition of two adjacent characters;
the longest common subsequence (LCS) distance allows only insertion and deletion, not substitution;
the Hamming distance allows only substitution, hence, it only applies to strings of the same length.
the Jaro distance allows only transposition.

use Text::Levenshtein qw(distance);

 print distance("foo","four");
 # prints "2"

 my @words     = qw/ four foo bar /;
 my @distances = distance("foo",@words);

 print "@distances";
 # prints "2 0 3"

use Algorithm::LCSS qw( LCSS CSS CSS_Sorted );
    my $lcss_ary_ref = LCSS( \@SEQ1, \@SEQ2 );  # ref to array
    my $lcss_string  = LCSS( $STR1, $STR2 );    # string
    my $css_ary_ref = CSS( \@SEQ1, \@SEQ2 );    # ref to array of arrays
    my $css_str_ref = CSS( $STR1, $STR2 );      # ref to array of strings
    my $css_ary_ref = CSS_Sorted( \@SEQ1, \@SEQ2 );  # ref to array of arrays
    my $css_str_ref = CSS_Sorted( $STR1, $STR2 );    # ref to array of strings

There are many different modules on CPAN for calculating the edit distance between two strings. Here's just a selection.

Text::LevenshteinXS and Text::Levenshtein::XS are both versions of the Levenshtein algorithm that require a C compiler, but will be a lot faster than this module.

The Damerau-Levenshtein edit distance is like the Levenshtein distance, but in addition to insertion, deletion and substitution, it also considers the transposition of two adjacent characters to be a single edit. The module Text::Levenshtein::Damerau defaults to using a pure perl implementation, but if you've installed Text::Levenshtein::Damerau::XS then it will be a lot quicker.

Text::WagnerFischer is an implementation of the Wagner-Fischer edit distance, which is similar to the Levenshtein, but applies different weights to each edit type.

Text::Brew is an implementation of the Brew edit distance, which is another algorithm based on edit weights.

Text::Fuzzy provides a number of operations for partial or fuzzy matching of text based on edit distance. Text::Fuzzy::PP is a pure perl implementation of the same interface.

String::Similarity takes two strings and returns a value between 0 (meaning entirely different) and 1 (meaning identical). Apparently based on edit distance.

Text::Dice calculates Dice's coefficient for two strings. This formula was originally developed to measure the similarity of two different populations in ecological research.

EFS: an ensemble feature selection tool implemented as R-package and web-application

Jit — Tue, 28 Jan 2020 05:12:23 -0600

The software EFS (Ensemble Feature Selection) makes use of multiple feature selection methods and combines their normalized outputs to a quantitative ensemble importance. Currently, eight different feature selection methods have been integrated in EFS, which can be used separately or combined in an ensemble.

https://biodatamining.biomedcentral.com/articles/10.1186/s13040-017-0142-8

Address of the bookmark: http://efs.heiderlab.de/

Five unique traits of effective computational biologist

Jitendra Narayan — Thu, 11 Jul 2013 13:12:51 -0500

Bioinformatics research is driven by large set of software, scripts, and tools to analyse gigantic biological data. Being a great biological programmer or bioinformatician involves more than writing code that works. The biological programmers who rise to the top ranks of their profession are not only good programmer but also expert in biological stuff. Moreover, In order to be a good and effective biological programmer, you need to possess a combination of traits that allow your computational as well as biological skill, experience, and knowledge to produce working code. There are some technically skilled biological programmers who will never be effective because they lack the other important traits needed. Here are top five traits that are necessary to become a great biological programmer.

1. Learn and get updated

Some of the bad biological programmers only learn new technical or non-technical things when it’s absolutely necessary. The good biological programmers learn new technical skills proactively. But great biological programmers not only learn new technical skills on their own but also learn non-technical skills, and have an open mind to sources of knowledge that others may shut out.

In other concrete term, the bad biological programmer learn Perl's regular expression when they started a project on comparative genomics; the good biological programmer learned it a year before because it looked interesting; and the great biological programmer also read about the BioPerl packages, genomics, DNA string, genomic theories, or some similar course of study so that they could understand the results and explain it biologically.

2. Not a merely coder!!!

I often encountered with biological programmer who call themself a hard-core computer programmer and avoid biology. I can almost guarantee that if you are one of them then you are not doing research but merely writing "dry" codes.

According to my supervisor most of the computational biologist, don't know what they are doing biologically. Even they struggle to explain their own programs output and results. Therefore, It is highly advisable to learn basic of biology which can assist you to explain the result and understand your discovery. Always remember you are a researcher not a coder.

3. Be Social with biologist

The computational biologist spends most of the time in from of computers, writing codes. They always think their job is to produce working codes, not technical research perfections. But, they are completely wrong. You should not forget that apart from your computational skills you also need some biologist, other than your supervisor, to explain and make you understand the complex biological mechanism.

I highly recommend your to interact with biotech researchers and learn how do they explain their one graph (which they generally produce after one year of work) biologically. Remember, the origin of your research project is complex biological phenomenon, which is more complex than that of your limited programming rules.

4. Do not search, research for answers

Researching for answers means more than typing several keywords into a search engine or posting a question at Stack Overflow or the BioStars forums. I have entered problems into search engines that generate no results, and every question I posted on Stack Overflow or the BioStars forums never got anything resembling an answer, yet I solved the issues and moved on. I’m not a magician — I just know how to find answers or discover root causes.

Many problems are situational, and if you depend on search engines and forums, you can waste a lot of time going down a rabbit hole and possibly never getting a solution. Learn to perform root cause analysis, learn enough about the underlying system to look for other clues and solutions, and learn to take a long distance view of an issue before deep diving into it.

5. Love and defend your research

You cannot rise to the top in this research profession without loving your work. There are some very good “it’s just a job” biological programmers (I’ve been one at times), but if that is your outlook, you won’t be willing to do whatever it takes to succeed. This idea gets a lot of folks in a huff, because they feel it is a personal insult. “I’m a good programmer, but I have other priorities and can’t make work my life.” I understand completely; I have other priorities too. As much as I hate to say it, when I am passionate about my work, I am willing (though not eager) to abandon my other priorities to finish the job. It is not an insult to say that if you aren’t willing to pull out all the stops you can’t be the best, it is a fact.

You must be passionate about more than programming — you must also be excited about your research, the tools and technology you are using, and so on. I have seen very good and even great biological programmers operating at mediocre levels because something was not a good fit, such as they hated the project or were using a technology they disliked. Therefore, like your research project and get excited about your discoveries. You have not only to discover but also defend your finding with scientific words.

Thanks to all of you for reading.

Perl one-liner for bioinformatician !!!

Abhimanyu Singh — Fri, 30 May 2014 05:49:07 -0500

With the emergence of NGS technologies, and sequencing data most of the bioinformaticians mung and wrangle around massive amounts of genomics text. There are several "standardized" file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times where knowing a little bit of Perl onliner is extremely helpful.

Perl one-liners are small and awesome Perl programs that fit in a single line of code and they do one thing really well. These things include changing line spacing, numbering lines, doing calculations, converting and substituting text, deleting and printing certain lines, parsing logs, editing files in-place, doing statistics, carrying out system administration tasks, updating a bunch of files at once, and many more. Perl one-liners will make you the shell warrior. Anything that took you minutes to solve, will now take you seconds!

perl -pe '$\="\n"'
#double space a file

perl -pe '$_ .= "\n" unless /^$/'
#double space a file except blank lines

perl -pe '$_.="\n"x7'
#7 space in a line.

perl -ne 'print unless /^$/'
#remove all blank lines

perl -lne 'print if length($_) < 20'
#print all lines with length less than 20.

perl -00 -pe ''
#If there are multiple spaces, delete all leaving one(make the file a single spaced file).

perl -00 -pe '$_.="\n"x4'
#Expand single blank lines into 4 consecutive blank lines

perl -pe '$_ = "$. $_"'
#Number all lines in a file

perl -pe '$_ = ++$a." $_" if /./'
#Number only non-empty lines in a file

perl -ne 'print ++$a." $_" if /./'
#Number and print only non-empty lines in a file

perl -pe '$_ = ++$a." $_" if /regex/'
#Number only lines that match a pattern

perl -ne 'print ++$a." $_" if /regex/'
#Number and print only lines that match a pattern

perl -ne 'printf "%-5d %s", $., $_ if /regex/'
#Left align lines with 5 white spaces if matches a pattern (perl -ne 'printf "%-5d %s", $., $_' : for all the lines)

perl -le 'print scalar(grep{/./}<>)'
#prints the total number of non-empty lines in a file

perl -lne '$a++ if /regex/; END {print $a+0}'
#print the total number of lines that matches the pattern

perl -alne 'print scalar @F'
#print the total number fields(words) in each line.

perl -alne '$t += @F; END { print $t}'
#Find total number of words in the file

perl -alne 'map { /regex/ && $t++ } @F; END { print $t }'
#find total number of fields that match the pattern

perl -lne '/regex/ && $t++; END { print $t }'
#Find total number of lines that match a pattern

perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m'
#will calculate the GCD of two numbers.

perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m'
#will calculate lcd of 20 and 35.

perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n'
#Generates 10 random numbers between 5 and 15.

perl -le 'print map { ("a".."z",”0”..”9”)[rand 36] } 1..8'
#Generates a 8 character password from a to z and number 0 – 9.

perl -le 'print map { ("a",”t”,”g”,”c”)[rand 4] } 1..20'
#Generates a 20 nucleotide long random residue.

perl -le 'print "a"x50'
#generate a string of ‘x’ 50 character long

perl -le 'print join ", ", map { ord } split //, "hello world"'
#Will print the ascii value of the string hello world.

perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)'
#converts ascii values into character strings.

perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"'
#Generates an array of odd numbers.

perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"'
#Generate an array of even numbers

perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file
#Convert the entire file into 13 characters offset(ROT13)

perl -nle 'print uc'
#Convert all text to uppercase:

perl -nle 'print lc'
#Convert text to lowercase:

perl -nle 'print ucfirst lc'
#Convert only first letter of first word to uppercas

perl -ple 'y/A-Za-z/a-zA-Z/'
#Convert upper case to lower case and vice versa

perl -ple 's/(\w+)/\u$1/g'
#Camel Casing

perl -pe 's|\n|\r\n|'
#Convert unix new lines into DOS new lines:

perl -pe 's|\r\n|\n|'
#Convert DOS newlines into unix new line

perl -pe 's|\n|\r|'
#Convert unix newlines into MAC newlines:

perl -pe '/regexp/ && s/foo/bar/'
#Substitute a foo with a bar in a line with a regexp.

Reference/Sources:

http://genomics-array.blogspot.in/2010/11/some-unixperl-oneliners-for.html

http://genomespot.blogspot.com/2013/08/a-selection-of-useful-bash-one-liners.html

http://biowize.wordpress.com/2012/06/15/command-line-magic-for-your-gene-annotations/

http://genomics-array.blogspot.com/2010/11/some-unixperl-oneliners-for.html

http://bioexpressblog.wordpress.com/2013/04/05/split-multi-fasta-sequence-file/

mojolicious: a next generation web framework for the Perl programming language.

Jit — Fri, 12 Jan 2018 16:48:10 -0600

Back in the early days of the web, many people learned Perl because of a wonderful Perl library called CGI. It was simple enough to get started without knowing much about the language and powerful enough to keep you going, learning by doing was much fun. While most of the techniques used are outdated now, the idea behind it is not. Mojolicious is a new endeavor to implement this idea using bleeding edge technologies.

Features

An amazing real-time web framework, allowing you to easily grow single file prototypes into well-structured MVC web applications.
- Powerful out of the box with RESTful routes, plugins, commands, Perl-ish templates, content negotiation, session management, form validation, testing framework, static file server, CGI/PSGI detection, first class Unicode support and much more for you to discover.
A powerful web development toolkit, that you can use for all kinds of applications, independently of the web framework.
- Full stack HTTP and WebSocket client/server implementation with IPv6, TLS, SNI, IDNA, HTTP/SOCKS5 proxy, UNIX domain socket, Comet (long polling), Promises/A+, keep-alive, connection pooling, timeout, cookie, multipart and gzip compression support.
- Built-in non-blocking I/O web server, supporting multiple event loops as well as optional pre-forking and hot deployment, perfect for building highly scalable web services.
- JSON and HTML/XML parser with CSS selector support.
Very clean, portable and object-oriented pure-Perl API with no hidden magic and no requirements besides Perl 5.24.0 (versions as old as 5.10.1 can be used too, but may require additional CPAN modules to be installed)
Fresh code based upon years of experience developing Catalyst, free and open source.
Hundreds of 3rd party extensions and high quality spin-off projects like the Minion job queue.

http://mojolicious.org/

Address of the bookmark: http://mojolicious.org/

VCFtools: perform common tasks with VCF files such as file validation, file merging, intersecting, complements

Rahul Nayak — Tue, 07 Aug 2018 10:01:46 -0500

VCFtools contains a Perl API (Vcf.pm) and a number of Perl scripts that can be used to perform common tasks with VCF files such as file validation, file merging, intersecting, complements, etc. The Perl tools support all versions of the VCF specification (3.2, 3.3, 4.0, 4.1 and 4.2), nevertheless, the users are encouraged to use the latest versions VCFv4.1 or VCFv4.2. The VCFtools in general have been used mainly with diploid data, but the Perl tools aim to support polyploid data as well. Run any of the Perl scripts with the --help switch to obtain more help.

Many of the Perl scripts require that the VCF files are compressed by bgzip and indexed by tabix (both tools are part of the tabix package, available for download here). The VCF files can be compressed and indexed using the following commands

bgzip my_file.vcf
tabix -p vcf my_file.vcf.gz

http://vcftools.sourceforge.net/perl_module.html

Address of the bookmark: http://vcftools.sourceforge.net/perl_module.html