BOL: Related items

Parallel Processing with Perl !

Rahul Nayak — Sat, 25 Aug 2018 11:32:40 -0500

Here is a small tutorial on how to make best use of multiple processors for bioinformatics analysis. One best way is using perl threads and forks. Knowing how these threads and forks work is very important before implementing them. Getting to know how these work would be really useful before reading this tutorial.

Many times in bioinformatics we need to deal with huge datasets which are more than 100GB size. The traditional way to analysis a file is using the while loop

while (FILE){

Do something;

}

This is very slow(since we are using only one processor) and if we have 500 million lines in the dataset it takes more than a day to iterate through the whole dataset. So how do we make best use of all our processors and get the work done quickly?

Here is a very simple and efficient technique with perl which i have been using. I am more inclined towards using perl fork than perl threads.

One of the oldest way to fork is

my $fork = fork();
if($fork){
push (@childs,$fork);
}
elseif($fork==0){
your code here;
exit(0);
}
else{die “Couldnt fork : $!”;}
## wait for the child process to finish
foreach(@childs){
my $tmp=waitid($_,0);
}

what a fork does is it creates a child process and takes the variables and code with it to analyze it separately (detached from the parent process) and thus a separate process is created( which usually runs on a separate processor). Thats it!! One big disadvantage of forking is its very difficult to share variables among the different processes. I will show you how to do it easily but still it has its own drawbacks.

Okie, now if you really do not want to use fork in your code, that’s okie too..There are many useful modules which do it for you very efficiently. One really useful module is Parallel::ForkManager. You can use Parallel::ForkManager to manage the number of forks you want to generate (number of processors you want to use).
Simple usage:
use Parallel::ForkManager;
my $max_processors=8;
my $fork= new Parallel::ForkManager($max_processors);
foreach (@dna) {
$fork->start and next; # do the fork
you code here;
$fork->finish; # do the exit in the child process
}
$pm->wait_all_children;

so you will be generating 8 forks which do the same thing for your each element of array. when one child finishes, Parallel::ForkManager generates a new one and thus you will be using all your processors to analyze the data. Now, if you have generated 8 child processes and want to write the data to one file. You need to lock the file to do this, because you will have problems with the buffering. You can lock the file using flock command.

open (my $QUAL, “myfile.txt”);
flock $QUAL, LOCK_EX or die “cant lock file $!”;
print $QUAL “$output”;
flock $QUAL, LOCK_UN or die “$!”;
close $QUAL;

I would not suggest using flock when dealing with multiple processes because it will decrease the processing efficiency( each child process must wait for the lock to be released by the other child process). Instead, I would suggest each fork writing to a separate file and after the processing just concatenating them.

Putting it all together, If you have 100GB data you can do this

step 1 : split the dataset equally according to number of processors you have. this may take a few hours(about 2-3 hrs for 100GB file)
You can use unix “split” command for this
for example:
my $number_split=int($number_of_entries_in_your_dataset/$max_processors);
my $split_Files=`split -l $number_split “your_file.fasta” “file_name”`;
step2: open you directory comtaining you split files and start Parallel::ForkManager.
For example:
opendir(DIRECTORY, $split_files_directory) or die $!; ### open the directory
my $fork= new Parallel::ForkManager($max_processors);
while (my $file = readdir(DIRECTORY)) { ### read the directory
if($file=~/^\./){next;}
print $file,”\n”;
########## Start fork ##########
my $pid= $super_fork->start and next;
Whatever you want to do with the split file ;
analyze my piece of $file;
######### end fork ###############
$super_fork->finish;
}
$super_fork->wait_all_children;

So basically each processor will be active with its piece of data (split file) and thus you have created 8 processes at one time which run without interfering with the other process. I again will not suggest writing output from each child process to one file(for reasons above). Write output from each fork to a separate file and finally concatenate them. Thats it, you have just increased your program speed by 8 times!! Isnt it easy?

Note:
You may worry about concatenation of the output each child generates, since it does take some time(remember 100GB). I think now you can use a mysql database LOAD DATA LOCAL INFILE command to load all the files into a single table(Should take about 3hrs for 100Gb dataset) and then export the whole table into one file. This should be faster than just concatenating them using “cat” command.(correct me if I am wrong)

Or much simpler way is to use pipes

cat output_dir/* | my_pipe or my_pipe <(file1) final_file;

Thats it guys!! Enjoy programming and please do comment. I am not a computer scientist so forgive me for any mistakes and if any please report them. Thank you.

Rosalind Problem Solution with Perl

Jit — Tue, 09 Jun 2015 23:35:18 -0500

Rosalind is a platform for learning bioinformatics and programming through problem solving. Take a tour to get the hang of how Rosalind works.

Bioinformatics Textbook Track

Find more about Rosalind puzzle at http://rosalind.info/problems/list-view/?location=bioinformatics-textbook-track

I will provide solution of all the Rosalind problem with Perl for community.

Check out the right sidebar for more links ...

Bpipe - a tool for running and managing bioinformatics pipelines

Radha Agarkar — Sat, 21 May 2016 22:42:16 -0500

Bpipe provides a platform for running big bioinformatics jobs that consist of a series of processing stages - known as 'pipelines'.

January 20th, 2016 - New! Bpipe 0.9.9 released!
Download latest, all
Documentation
Mailing List (Google Group)

Bpipe has been published in Bioinformatics! If you use Bpipe, please cite:

Sadedin S, Pope B & Oshlack A, Bpipe: A Tool for Running and Managing Bioinformatics Pipelines, Bioinformatics

Address of the bookmark: http://docs.bpipe.org/

Bioinformatics: Introduction to PERL

Archana Malhotra — Thu, 11 Jul 2013 09:49:37 -0500

This course is aimed at those new to programming and provides an introduction to programming using Perl. By the end of this course, attendees should be able to write simple Perl programs and to understand more complex Perl programs written by others. The course will be taught using the online Learning Perl materials created by Sofia Robb of the University of California Riverside. Further information is available.

Bioinformatics Scientist, Production Bioinformatics @ South San Francisco, CA

Thu, 19 Aug 2021 08:45:24 -0500

wist is looking for a Bioinformatics Scientist to join our Production Bioinformatics Team. You will work alongside research scientists, software engineers and data scientists to further deliver on our mission to expand access to best-in-class synthetic biology and next-generation sequencing applications. You will be developing and engineering tools to better evaluate and build hardened, production quality pipelines, optimize data quality, and automate lab and bioinformatics processes. Our ideal candidate is an organized problem solver with a background in developing and building novel production-quality bioinformatics tools and packages. Equally excellent communication skills and a proven ability to work independently are required.

More at https://boards.greenhouse.io/twistbioscience/jobs/3135495?gh_src=9ecc0b941us

Type Hinting

Pranjali Yadav — Fri, 09 Jan 2015 22:26:13 -0600

Python creator Guido van Rossum’s proposal for static type-checking annotations is inching closer to reality, and the feature has taken on a new name: type hinting.

Back in August, van Rossum published a proposal on the Python mailing list recommending type-checking annotations as a valuable feature for the next version of Python to improve the performance of editors and IDEs, linter capabilities, standard notation, and refactoring. Van Rossum’s latest proposal, posted late last month, outlined plans to publish a Python Enhancement Proposal (PEP) in early January to put the feature now known as type hinting on track for inclusion in Python 3.5, slated for release this September.

Reference

https://quip.com/r69HA9GhGa7J

Pattern Matching Problem Solution with Perl

Jit — Tue, 09 Jun 2015 23:58:45 -0500

Problem at http://rosalind.info/problems/1c/

#Find all occurrences of a pattern in a string.
#Given: Strings Pattern and Genome.
#Return: All starting positions in Genome where Pattern appears as a substring. Use 0-based indexing.

use strict;
use warnings;

my $string="GATATATGCATATACTT";
my $subStr="ATAT";
my $kmer=length($subStr);

kmerMatch ($string, $subStr, $kmer);

sub kmerMatch { #Check the exact matching kmers with sliding window
my ($string, $myStr, $kmer)=@_;
for (my $aa=0; $aa<=(length($string)-$kmer); $aa++) {
    my $myWin=substr $string, $aa,$kmer;
    if ($myWin eq $myStr) {
        #print "$myWin eq $myStr\n";
        print $aa;
    }
}
}

BioScripts

Rahul Nayak — Sun, 28 Jun 2015 07:46:14 -0500

You are requested to please bookmark collection of bioinformatics tools, scripts, codes that can be pieced together in a very easy and flexible manner to perform both simple and complex bioinformatics tasks.

The next-generation sequencing included whole genome sequencing(WGS), transcriptome sequencing (whole cDNA sequencing, RNA-seq), digital gene expression sequencing (Tag-Seq), ChIP-Seq, and so on. And there are many sequencing platform to generate sequece, as well know Sanger/ABi(the frist generation), Solexa/illumina, SOLiD/ABi, 454/Roche. But thier sequence format is different, also they have different error type. High quality data is very important for further analysis or data mining. There are many pipeline for raw sequence quality analysis and control with few of process for reporting reads quality statistical details, trimming, filtering, and error correction. Please bookmarks them for the benefits of bioinformatics community.

https://code.google.com/p/biowiki/

https://code.google.com/p/ngs-pipeline/source/browse/#svn%2Ftrunk

NGSand Perl scripts https://code.google.com/hosting/search?q=NGS+perl&projectsearch=Search+projects

NGS and Python scripts https://code.google.com/hosting/search?q=NGS+Python&projectsearch=Search+projects

Address of the bookmark: https://code.google.com/hosting/search?q=bioinformatics&sa=Search

BioToolbox

Jit — Fri, 19 Feb 2016 09:14:44 -0600

This is a collection of libraries and high-quality end-user scripts for bioinformatic analysis, including working with gene annotation, collecting data scores from a variety of modern file formats, and conversion between file formats. The Bio::ToolBox libraries provide a unified, abstracted interface to multiple common gene annotation formats and the collection of data from multiple data files. They rely on BioPerl SeqFeature libraries and related adaptors to access binary file formats including Bam, BigWig, BigBed, and USeq. The Bio::ToolBox package includes scripts for setting up databases of annotation, collecting annotated features, collecting genomic data relative to features, manipulating and analyzing data, and data format conversion.

More at http://cpansearch.perl.org/src/TJPARNELL/

Address of the bookmark: http://cpansearch.perl.org/src/TJPARNELL/

How to install Perl modules on Mac OS X in easy steps !!

Jit — Thu, 20 Oct 2016 07:26:29 -0500

Today at work, I learned how to install Perl modules using CPAN. It’s a lot easier than I thought.

You see, for the past couple of years, I’ve been a bit frustrated because OS X does not come with a whole lot of Perl modules pre-installed, and for all I googled, I couldn’t find an “idiot’s” guide for moderately-savvy-but-not-expert users like myself to install modules and dependencies on demand.

The only instructions I could find point to Fink, which basically installs modules in a path that isn’t included in the Perl @INC variable, meaning you have to manually specify the full path to the modules in every script — which is not a lot of fun if you’re developing on OS X and deploying on Red Hat, for instance.

Moreover, Fink doesn’t seem to make every module available, and it’s not very easy to determine which Fink package you need to install if you need a particular module.

So, with a script that called on several apparently unavailable modules, and a deadline looming, I finally decided to suck it up and figure out how to use CPAN to install them:

1) Make sure you have the Apple Developer Tools (XCode) installed.

These are on one of your install discs, or available as a huge but free download from the Apple Developer Connection [free registration required] or the Mac App Store. I thought I had them, but apparently when we upgraded that computer to Tiger, they went missing.

If you don’t have this stuff installed, your installation will fail with errors about unavailable commands.

1.5) Install Command Line Tools (Recent XCode versions only)

(Thank you to Tom Marchioro for informing me about this step.)

Older versions of XCode installed the command line tools (which are required to properly install CPAN modules) by default, but apparently newer ones do not. To check whether you have the command line tools already installed, run the following from the Terminal:

$ which make

This command checks the system for the “make” tool. If it spits out something like /usr/bin/make you’re golden and can skip ahead to Step 2. If you just get a new prompt and no output, you’ll need to install the tools:

Launch XCode and bring up the Preferences panel.
Click on the Downloads tab
Click to install the Command Line Tools

If you like, you can run which make again to confirm that everything’s installed correctly.

2) Configure CPAN.

$ sudo perl -MCPAN -e shell

perl> o conf init

This will prompt you for some settings. You can accept the defaults for almost everything (just hit “return”). The two things you must fill in are the path to make (which should be /usr/bin/make or the value returned when you run which make from the command line) and your choice of CPAN mirrors (which you actually choose don’t really matter, but it won’t let you finish until you select at least one). If you use a proxy or a very restrictive firewall, you may have to configure those settings as well.

If you skip Step 2, you may get errors about make being unavailable.

3) Upgrade CPAN

$ sudo perl -MCPAN -e 'install Bundle::CPAN'

Don’t forget the sudo, or it’ll fail with permissions errors, probably when doing something relatively unimportant like installing man files.

This will spend a long time downloading, testing, and compiling various files and dependencies. Bear with it. It will prompt you a few times about dependencies. You probably want to enter “yes”. I agreed to everything it asked me, and everything turned out fine. YMMV of course. If everything installs properly, it’ll give you an “OK” at the end.

4) Install your modules. For each module….

$ sudo perl -MCPAN -e 'install Bundle::Name'

$ sudo perl -MCPAN -e 'install Module::Name'

This will install the module and its dependencies. Nice, eh? Again, don’t forget the sudo.

The first time you run this after upgrading CPAN, it may prompt you to configure again (see Step 2). If you accept its offer to try to configure itself automatically, it may just run through everything without a problem.

There are a couple of potential pitfalls with specific modules (such as theLWP::UserAgent / HEAD issue), but most have workarounds, and I haven’t run into anything that wasn’t easily recoverable.

And that’s it!

Did you find this useful? Is there anything I missed?