BOL: Related items

Parallel Processing with Perl!

Rahul Nayak — Sat, 25 Aug 2018 11:32:40 -0500

Here is a small tutorial on how to make the best use of multiple processors for bioinformatics analysis. One good way is to use Perl threads and forks. It is important to understand how threads and forks work before implementing them, so reading up on them before going through this tutorial would be really useful.

Many times in bioinformatics we need to deal with huge datasets that are more than 100GB in size. The traditional way to analyze a file is with a while loop:

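open(my $fh, '<', $file) or die "Can't open $file: $!";   # $file is your input file
while (my $line = <$fh>) {
    # do something with $line
}
close $fh;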

This is very slow (since we are using only one processor), and if we have 500 million lines in the dataset it can take more than a day to iterate through the whole dataset. So how do we make the best use of all our processors and get the work done quickly?

Here is a very simple and efficient technique with Perl which I have been using. I am more inclined towards using Perl fork than Perl threads.

One of the oldest ways to fork is:

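my @children;
my $pid = fork();
die "Couldn't fork: $!" unless defined $pid;   # fork returns undef on failure
if ($pid) {
    # parent: remember the child's process ID
    push @children, $pid;
}
else {
    # child ($pid == 0): your code here
    exit(0);
}
## wait for the child processes to finish
foreach my $child (@children) {
    waitpid($child, 0);
}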

What a fork does is create a child process that gets its own copy of the parent's variables and code and works on them separately from the parent process. A separate process is thus created (which usually runs on a separate processor). That's it! One big disadvantage of forking is that it is very difficult to share variables among the different processes. I will show you how to do it easily, but it still has its own drawbacks.
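One common way to get a value back from a child is a pipe. The sketch below is only an illustration using core Perl, not something from the original post: the helper name run_in_children and the GC-count task are made up, and it assumes each result fits on a single line.

```perl
use strict;
use warnings;

# Run $work->($item) in one child per item and collect each result
# in the parent through a pipe. Results must be single-line strings here.
sub run_in_children {
    my ($work, @items) = @_;
    my @jobs;
    for my $item (@items) {
        pipe(my $reader, my $writer) or die "Couldn't create pipe: $!";
        my $pid = fork();
        die "Couldn't fork: $!" unless defined $pid;
        if ($pid == 0) {
            # child: compute, send the result to the parent, and exit
            close $reader;
            print {$writer} $work->($item), "\n";
            close $writer;
            exit(0);
        }
        # parent: keep the read end and remember the child
        close $writer;
        push @jobs, { pid => $pid, reader => $reader };
    }
    my @results;
    for my $job (@jobs) {
        my $line = readline($job->{reader});
        close $job->{reader};
        waitpid($job->{pid}, 0);          # reap the child
        chomp $line if defined $line;
        push @results, $line;
    }
    return @results;
}

# Hypothetical task: count G and C bases in a few short sequences
my @gc = run_in_children(sub { my $s = shift; ($s =~ tr/GCgc//) }, qw(ATGC GGCC ATAT));
print "@gc\n";
```

The drawback is that everything has to be turned into text and sent through the pipe, so this suits small results (counts, summaries) much better than large data structures.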

OK, if you really do not want to call fork directly in your code, that's fine too. There are many useful modules which do it for you very efficiently. One really useful module is Parallel::ForkManager. You can use Parallel::ForkManager to manage the number of forks you want to generate (the number of processors you want to use).
Simple usage:
use Parallel::ForkManager;
my $max_processors = 8;
my $pm = Parallel::ForkManager->new($max_processors);
foreach my $seq (@dna) {
    $pm->start and next;   # do the fork
    # your code here
    $pm->finish;           # do the exit in the child process
}
$pm->wait_all_children;
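To see what capping the number of forks involves, here is a rough sketch with only core fork and wait. It is not Parallel::ForkManager's actual implementation; the helper name run_limited and the sample task are invented for illustration.

```perl
use strict;
use warnings;

# Run $work->($item) in child processes, never more than $max at once.
# Returns the number of children that exited with status 0.
sub run_limited {
    my ($max, $work, @items) = @_;
    my ($running, $ok) = (0, 0);
    my $reap = sub {
        my $pid = wait();                 # block until any child exits
        if ($pid == -1) {                 # no children left to wait for
            $running = 0;
            return;
        }
        $running--;
        $ok++ if $? == 0;
    };
    for my $item (@items) {
        $reap->() if $running >= $max;    # pool is full: wait for a free slot
        my $pid = fork();
        die "Couldn't fork: $!" unless defined $pid;
        if ($pid == 0) {
            $work->($item);               # child does its share of the work
            exit(0);
        }
        $running++;
    }
    $reap->() while $running > 0;         # wait for the remaining children
    return $ok;
}

my $done = run_limited(2, sub { my $seq = shift; length $seq },
                       qw(ATGC GGCC ATAT TTAA CCGG));
print "$done children finished\n";
```

The module adds a lot on top of this (callbacks, passing data back, error handling), which is why using it is usually the better choice.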

With Parallel::ForkManager set up this way, up to 8 child processes run at a time, each doing the same thing for one element of the array. When one child finishes, Parallel::ForkManager starts a new one, so all your processors are kept busy analyzing the data. Now, suppose you have 8 child processes and want them all to write to one file. You need to lock the file to do this, because otherwise buffered output from different children can get interleaved. You can lock the file using the flock function:

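use Fcntl qw(:flock);   # provides LOCK_EX and LOCK_UN
open(my $QUAL, '>>', 'myfile.txt') or die "Can't open file: $!";   # append mode
flock($QUAL, LOCK_EX) or die "Can't lock file: $!";
print $QUAL $output;
flock($QUAL, LOCK_UN) or die "Can't unlock file: $!";
close $QUAL;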

I would not suggest using flock when dealing with multiple processes, because it decreases processing efficiency (each child process must wait for the lock to be released by another child process). Instead, I would suggest having each fork write to a separate file and just concatenating the files after the processing.

Putting it all together, if you have 100GB of data you can do this:
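A minimal sketch of the separate-files approach, using only core Perl. The input chunks, the part_NNN.txt naming, and the lowercase "analysis" are placeholders made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $out_dir = tempdir(CLEANUP => 1);      # scratch directory for the parts
my @chunks  = (['ATGC', 'GGCC'], ['ATAT'], ['TTAA', 'CCGG']);   # made-up input

my @pids;
for my $i (0 .. $#chunks) {
    my $pid = fork();
    die "Couldn't fork: $!" unless defined $pid;
    if ($pid == 0) {
        # child: write only to its own file, so no locking is needed
        my $part = sprintf("%s/part_%03d.txt", $out_dir, $i);
        open(my $out, '>', $part) or die "Can't write $part: $!";
        print {$out} lc($_), "\n" for @{ $chunks[$i] };
        close $out or die "Can't close $part: $!";
        exit(0);
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;

# parent: concatenate the parts in a fixed (sorted) order
my $final = "$out_dir/final.txt";
open(my $all, '>', $final) or die "Can't write $final: $!";
for my $part (sort { $a cmp $b } glob("$out_dir/part_*.txt")) {
    open(my $in, '<', $part) or die "Can't read $part: $!";
    print {$all} $_ while <$in>;
    close $in;
}
close $all or die "Can't close $final: $!";
```

Zero-padding the part number keeps the sorted file order equal to the chunk order, so the final file comes out in the same order as the input.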

Step 1: Split the dataset evenly according to the number of processors you have. This may take a few hours (about 2-3 hours for a 100GB file).
You can use the Unix "split" command for this. Note that split -l counts lines, so $number_split must be a number of lines per output file.
For example:
my $number_split = int($number_of_entries_in_your_dataset / $max_processors);
system('split', '-l', $number_split, 'your_file.fasta', 'file_name') == 0 or die "split failed: $?";
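Because split -l cuts on line counts, a multi-line FASTA record can end up divided between two files. As an alternative, here is a sketch that splits on record boundaries in Perl itself. The sub name split_fasta, the chunk_NNN.fasta naming, and the demo file are all made up for illustration, and it assumes every record starts with a ">" header line.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Split a FASTA file into $parts files without cutting a record in half.
# Records are dealt out round-robin; returns the list of files written.
sub split_fasta {
    my ($fasta, $parts, $prefix) = @_;
    my @names = map { sprintf("%s%03d.fasta", $prefix, $_) } 0 .. $parts - 1;
    my @out;
    for my $name (@names) {
        open(my $fh, '>', $name) or die "Can't write $name: $!";
        push @out, $fh;
    }
    open(my $in, '<', $fasta) or die "Can't read $fasta: $!";
    my $record = -1;
    while (my $line = <$in>) {
        $record++ if $line =~ /^>/;       # a header line starts a new record
        next if $record < 0;              # skip anything before the first header
        print { $out[$record % $parts] } $line;
    }
    close $in;
    for my $fh (@out) {
        close $fh or die "Can't close output file: $!";
    }
    return @names;
}

# Demo on a tiny made-up file with three records
my $dir = tempdir(CLEANUP => 1);
open(my $demo, '>', "$dir/demo.fasta") or die "Can't write demo file: $!";
print {$demo} ">s1\nATGC\nATGC\n>s2\nGGCC\n>s3\nTTAA\n";
close $demo;
my @files = split_fasta("$dir/demo.fasta", 2, "$dir/chunk_");
print scalar(@files), " files written\n";
```

Round-robin keeps the files balanced only when records are of similar size; for very uneven records you would want to balance by bytes instead.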
Step 2: Open the directory containing your split files and start Parallel::ForkManager.
For example:
opendir(my $dir, $split_files_directory) or die "Can't open $split_files_directory: $!";   ### open the directory
my $super_fork = Parallel::ForkManager->new($max_processors);
while (my $file = readdir($dir)) {   ### read the directory
    next if $file =~ /^\./;          # skip ".", ".." and hidden files
    print $file, "\n";
    ########## Start fork ##########
    $super_fork->start and next;
    # whatever you want to do with the split file:
    # analyze my piece of $file
    ######### end fork ###############
    $super_fork->finish;
}
closedir($dir);
$super_fork->wait_all_children;

So basically each processor stays busy with its piece of data (a split file), and you have 8 processes at a time which run without interfering with each other. Again, I do not suggest writing output from each child process to one file (for the reasons above). Write output from each fork to a separate file and finally concatenate them. That's it, you have just sped up the analysis step by up to 8 times! Isn't it easy?
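Keep in mind that the split and merge steps still run on one processor, so the end-to-end gain is smaller than the gain on the analysis step alone. A back-of-the-envelope sketch follows. The figures are rough readings of the numbers in this post (about a day of single-process analysis, 2-3 hours to split, about 3 hours to combine the outputs), and the sub name is made up.

```perl
use strict;
use warnings;

# Overall speedup when only the analysis step runs in parallel.
# All times are in hours; the split and merge steps are assumed to be serial.
sub estimated_speedup {
    my (%t) = @_;
    my $parallel_total = $t{split} + $t{analysis} / $t{workers} + $t{merge};
    return $t{analysis} / $parallel_total;
}

# Rough, assumed figures: 24 h analysis, 8 workers, 2.5 h split, 3 h merge
my $speedup = estimated_speedup(analysis => 24, workers => 8, split => 2.5, merge => 3);
printf "Estimated overall speedup: %.1fx\n", $speedup;
```

With these assumed figures the total comes to 2.5 + 3 + 3 = 8.5 hours instead of 24, so it pays to keep the split and merge steps as cheap as possible, especially if the split files can be reused for several analyses.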

Note:
You may worry about concatenating the output each child generates, since it does take some time (remember, 100GB). I think you can use the MySQL LOAD DATA LOCAL INFILE command to load all the files into a single table (this should take about 3 hours for a 100GB dataset) and then export the whole table into one file. This should be faster than just concatenating them using the "cat" command (correct me if I am wrong).

Or a much simpler way is to use pipes:

cat output_dir/* | my_pipe or my_pipe <(file1) final_file;

That's it, guys! Enjoy programming and please do comment. I am not a computer scientist, so forgive me for any mistakes, and if you find any, please report them. Thank you.

Python and BioPython Tutorial

Manshi Raghubanshi — Fri, 23 Aug 2013 06:47:40 -0500

A quickstart tutorial that lets you become familiar with the Python language. The exercises expect knowledge of basic programming concepts. A group of 2nd-year computer science students with no previous Python knowledge required 60-90 minutes to complete the exercises. With about 3 hours of time, the exercise is suitable for non-programmers as well.

Address of the bookmark: http://www.biotnet.org/training-materials/python-programmers

Research Associate Bioinformatics in IISc Recruitment 2020

Tue, 23 Jun 2020 21:53:34 -0500

Essential Qualifications: Ph.D. (Bioinformatics/ Biophysics/ Biotechnology or any other stream of biological/ physical sciences) with a minimum of two publications in reputed peer reviewed journals in the area of structural bioinformatics or biophysics or biomolecular modeling/ simulation.

Job description: Development of bioinformatics tools and algorithms/software for structure-based analysis of biomolecular systems. Programmatic access to major biomolecular databases using APIs. Knowledge-based prediction and analysis of biomolecular structure, function and interactions. Docking/simulations for inhibitor design.

Desirable Qualifications (Research Associate/s):
i) Strong computer programming skills (in Python/PERL/PHP or C++ or object oriented database management systems like MySQL etc or scripting languages under LINUX/UNIX environment).
ii) Extensive experience in computational analysis of biomolecular structure/interactions and usage of advanced biomolecular simulation software.
iii) Adequate knowledge of major databases, webservers and software in the area of biomolecular structure/function and drug design.
iv) Familiarity with Parallel Programming environments and experience in usage of high-end HPC clusters.

The candidates must highlight their experience in above mentioned fields/topics in their CV. Initial appointment will be for a period of 1 year, subject to extension after review of performance.

Emoluments: As per DST, GOI norms and commensurate with experience.

More at https://www.iisc.ac.in/positions-open/

Bioinformatics Scripts

Jit — Thu, 22 Jan 2015 22:29:39 -0600

Some of the useful bioinformatics scripts.

For example ... contig-stats.pl is a Perl script that will automatically describe features of a sequence assembly.

http://milkweedgenome.org/?q=scripts

Address of the bookmark: http://milkweedgenome.org/?q=scripts

Coding Ground

Jitendra Narayan — Tue, 17 Mar 2015 00:47:20 -0500

Online coding ground for most programming languages.

Code in almost all popular languages using Coding Ground. Edit, compile, execute and share your projects, 100% cloud.

http://www.tutorialspoint.com/codingground.htm

Address of the bookmark: http://www.tutorialspoint.com/codingground.htm

Many-Core Engine (MCE) for Perl example

Jit — Tue, 31 Jan 2017 05:37:50 -0600

MCE spawns a pool of workers and therefore does not fork a new process per each element of data. Instead, MCE follows a bank queuing model. Imagine the line being the data and bank-tellers the parallel workers. MCE enhances that model by adding the ability to chunk the next n elements from the input stream to the next available worker.
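To make the bank-queue idea concrete, here is a toy sketch in core Perl. It does not use MCE and is not how MCE is implemented. It only mimics the model: a fixed pool of workers, each repeatedly claiming the next chunk of chunk_size elements. The lock-protected counter file, the variable names, and the uppercase "work" are all assumptions for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_SET);
use File::Temp qw(tempfile tempdir);

my @data       = map { "seq$_" } 1 .. 10;   # made-up input: the "line" of customers
my $chunk_size = 3;
my $workers    = 2;                         # the "bank tellers"

# The queue position lives in a small file that workers lock in turn.
my ($qfh, $queue_file) = tempfile(UNLINK => 1);
print {$qfh} "0\n";
close $qfh;
my $out_dir = tempdir(CLEANUP => 1);

# Atomically claim the index of the next unprocessed chunk.
sub next_chunk {
    open(my $fh, '+<', $queue_file) or die "Can't open queue: $!";
    flock($fh, LOCK_EX) or die "Can't lock queue: $!";
    chomp(my $i = <$fh>);
    seek($fh, 0, SEEK_SET);
    truncate($fh, 0);
    print {$fh} $i + 1, "\n";
    close $fh;                              # flushes and releases the lock
    return $i;
}

my @pids;
for my $w (1 .. $workers) {
    my $pid = fork();
    die "Couldn't fork: $!" unless defined $pid;
    if ($pid == 0) {
        # worker: keep taking chunks until the data runs out
        while (1) {
            my $i     = next_chunk();
            my $first = $i * $chunk_size;
            last if $first > $#data;
            my $last = $first + $chunk_size - 1;
            $last = $#data if $last > $#data;
            my $name = sprintf("%s/chunk_%04d.txt", $out_dir, $i);
            open(my $out, '>', $name) or die "Can't write $name: $!";
            print {$out} uc($_), "\n" for @data[$first .. $last];
            close $out;
        }
        exit(0);
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;

# manager: gather the results in chunk order
my @results;
for my $file (sort { $a cmp $b } glob("$out_dir/chunk_*.txt")) {
    open(my $in, '<', $file) or die "Can't read $file: $!";
    chomp(my @lines = <$in>);
    push @results, @lines;
}
print scalar(@results), " items processed\n";
```

Only two processes are forked no matter how many elements there are, and a fast worker simply claims more chunks than a slow one, which is the point of the queuing model described above.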

CORE MODULES

Three modules make up the core engine for MCE.

MCE::Core: Provides the Core API for Many-Core Engine. The various MCE options are described here.
MCE::Signal: Temporary directory creation, cleanup, and signal handling.
MCE::Util: Utility functions for Many-Core Engine.

MCE EXTRAS

There are 4 add-on modules for use with MCE.

MCE::Candy: Provides a collection of sugar methods and output iterators for preserving output order.
MCE::Mutex: Provides a simple semaphore implementation supporting threads and processes.
MCE::Queue: Provides a hybrid queuing implementation for MCE supporting normal queues and priority queues from a single module. MCE::Queue exchanges data via the core engine to enable queuing to work for both children (spawned from fork) and threads.
MCE::Relay: Enables workers to receive and pass on information orderly with zero involvement by the manager process while running.

MCE MODELS

The models take Many-Core Engine to a new level for ease of use. Two options (chunk_size and max_workers) are configured automatically as well as spawning and shutdown.

MCE::Loop: Provides a parallel loop utilizing MCE for building creative loops.
MCE::Flow: A parallel flow model for building creative applications. This makes use of user_tasks in MCE. The author has full control when utilizing this model. MCE::Flow is similar to MCE::Loop, but allows for multiple code blocks to run in parallel with a slight change to syntax.
MCE::Grep: Provides a parallel grep implementation similar to the native grep function.
MCE::Map: Provides a parallel map model similar to the native map function.
MCE::Step: Provides a parallel step implementation utilizing MCE::Queue between user tasks. MCE::Step is a spin off from MCE::Flow with a touch of MCE::Stream. This model, introduced in 1.506, allows one to pass data from one sub-task into the next transparently.
MCE::Stream: Provides an efficient parallel implementation for chaining multiple maps and greps together through user_tasks and MCE::Queue. Like with MCE::Flow, MCE::Stream can run multiple code blocks in parallel with a slight change to syntax from MCE::Map and MCE::Grep.

MISCELLANEOUS

Miscellaneous additions included with the distribution.

MCE::Examples: Describes various demonstrations for MCE including a Monte Carlo simulation.
MCE::Subs: Exports functions mapped directly to MCE methods; e.g. mce_wid. The module allows 3 options; :manager, :worker, and :getter.

REQUIREMENTS

Perl 5.8.0 or later. PDL::IO::Storable is required in scripts running PDL.

SOURCE AND FURTHER READING

The source, cookbook, and examples are hosted at GitHub.

Python vs Perl

Rahul Agarwal — Thu, 11 Jul 2013 14:39:19 -0500

Why are bioinformaticians still using Perl when Python is easy to code, good at regular expressions, and faster than Perl?

BioPython Cookbook

Jitendra Narayan — Thu, 08 Aug 2013 06:43:02 -0500

If you are planning to start learning BioPython (it does not bite but swallows :P just kidding) then this online cookbook will be really helpful for you.

http://biopython.org/DIST/docs/tutorial/Tutorial.html

Post Doc Computational Biology, Bioinformatics - Network Biology & Data Science, NGS (m/f/d)

Sat, 15 Feb 2020 06:13:35 -0600

https://www.jobvector.de/jobs-stellenangebote/biologie-life-sciences/forschung-entwicklung/post-doc-computational-biology-bioinformatics-network-biology-data-science-ngs-129867.html?suid=e522e9793b41817e52ac58d6963b94e2519920df

Requirements
Doctoral degree in Bioinformatics, Computational Biology, (Bio)physics/-mathematics, Biochemistry/Biology or similar with strong quantitative and numeric focus
Ability to numerically process complex and large data sets
Good programming skills (R/Bioconductor and/or Python preferred, Linux is a plus)
Experience in analyzing next-generation sequencing data sets using network biology
Scientific publication record in applied bioinformatics
Familiarity with single cell NGS analyses and other –omics techniques is a plus, but not essential

Bioinformatics Head (Bioinformatics Manager III), Cancer Genomics Research Laboratory at Frederick National Laboratory

Wed, 18 Aug 2021 00:19:48 -0500

Frederick National Laboratory is seeking an enthusiastic, creative, and seasoned bioinformatics professional to join our leadership team and direct the exceptional Bioinformatics Group at the Cancer Genomics Research Laboratory (CGR). CGR has a diverse team of bioinformatics and computational scientists that support all areas of bioinformatics and data analysis (infrastructure, data QC, pipeline development and maintenance, data curation and sharing, methodology development, statistical analyses, machine learning approaches, and scientific interpretation).

More at https://leidosbiomed.csod.com/ats/careersite/jobdetails.aspx?site=4&c=leidosbiomed&id=2040