BOL: Related items

Parallel Processing with Perl!

Rahul Nayak — Sat, 25 Aug 2018 11:32:40 -0500

Here is a small tutorial on how to make the best use of multiple processors for bioinformatics analysis. One good way is to use Perl threads and forks. It is important to understand how threads and forks work before implementing them, so some familiarity with both will be useful before reading this tutorial.

In bioinformatics we often need to deal with huge datasets of more than 100GB. The traditional way to analyze a file is with a while loop:

while (my $line = <FILE>) {
    # do something with $line
}

This is very slow (since we are using only one processor): with 500 million lines in the dataset, it can take more than a day to iterate through the whole thing. So how do we make the best use of all our processors and get the work done quickly?

Here is a very simple and efficient technique with Perl which I have been using. I am more inclined towards using Perl fork than Perl threads.

One of the oldest ways to fork is

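my @childs;
my $fork = fork();
if (!defined $fork) {
    # fork returns undef on failure
    die "Couldn't fork: $!";
}
elsif ($fork == 0) {
    # child process: your code here
    exit(0);
}
else {
    # parent process: remember the child's PID
    push(@childs, $fork);
}
## wait for the child processes to finish
foreach (@childs) {
    my $tmp = waitpid($_, 0);
}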

What fork does is create a child process that gets its own copy of the parent's variables and code and runs separately from the parent process (the operating system will usually schedule it on a separate processor). That's it!! One big disadvantage of forking is that it is very difficult to share variables among the different processes, because each process only changes its own copy. I will show you how to do it easily, but it still has its own drawbacks.
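To make the point about variables concrete, here is a minimal sketch using only core Perl. It is not code from the original tutorial: the variable names and the use of a pipe to send a value back to the parent are assumptions chosen for illustration, and a pipe is only one of several ways to pass data between processes.

```perl
use strict;
use warnings;

my $counter = 1;

# A pipe lets the child send a result back to the parent.
pipe(my $reader, my $writer) or die "Couldn't create pipe: $!";

my $pid = fork();
die "Couldn't fork: $!" unless defined $pid;

if ($pid == 0) {
    # Child: this changes only the child's own copy of $counter.
    close $reader;
    $counter += 41;
    print {$writer} "$counter\n";
    close $writer;
    exit(0);
}

# Parent: read what the child sent, then reap the child.
close $writer;
my $from_child = <$reader>;
close $reader;
waitpid($pid, 0);
chomp $from_child;

print "parent counter: $counter\n";      # still 1
print "child counter:  $from_child\n";   # 42, received through the pipe
```

The parent's $counter is unchanged after the child exits; the only way it learns the child's value is by reading it from the pipe.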

OK, now if you really do not want to call fork directly in your code, that's OK too. There are many useful modules which do it for you very efficiently. One really useful module is Parallel::ForkManager. You can use it to limit the number of forks running at once (the number of processors you want to use).
Simple usage:
use Parallel::ForkManager;
my $max_processors = 8;
my $fork = Parallel::ForkManager->new($max_processors);
foreach my $seq (@dna) {
    $fork->start and next;   # do the fork; the parent moves on to the next element
    # your code here
    $fork->finish;           # do the exit in the child process
}
$fork->wait_all_children;
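For readers curious about the bookkeeping the module takes care of, the following is a rough sketch in core Perl of capping the number of children that run at once. It is an illustration under assumptions, not how Parallel::ForkManager is implemented internally; the sample data and the names %running, $started and $peak are hypothetical.

```perl
use strict;
use warnings;

my $max_processors = 8;
my @dna = map { "seq$_" } 1 .. 20;   # hypothetical work items
my %running;                         # PIDs of children still running
my $started = 0;
my $peak    = 0;

for my $seq (@dna) {
    # If the pool is full, block until one child exits.
    if (keys(%running) >= $max_processors) {
        my $done = wait();
        delete $running{$done};
    }

    my $pid = fork();
    die "Couldn't fork: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: stand-in for the real work on one element.
        my $length = length $seq;
        exit(0);
    }

    # Parent: record the new child and track the largest pool size seen.
    $running{$pid} = 1;
    $started++;
    $peak = keys(%running) if keys(%running) > $peak;
}

# Reap the children that are still running.
while (keys %running) {
    my $done = wait();
    delete $running{$done};
}

print "started $started children, at most $peak at once\n";
```

With 20 items and a limit of 8, the parent starts 20 children in total but never has more than 8 unreaped at the same time.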

With $max_processors set to 8, Parallel::ForkManager runs up to 8 child processes at a time, one for each element of the array. When one child finishes, Parallel::ForkManager starts a new one, and thus you will be using all your processors to analyze the data. Now, suppose you have generated 8 child processes and want to write their data to one file. You need to lock the file to do this, because otherwise you will have problems with buffering: output from different children can get interleaved. You can lock the file using the flock command.

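use Fcntl qw(:flock);   # provides LOCK_EX and LOCK_UN
open(my $QUAL, '>>', "myfile.txt") or die "can't open file: $!";   # open for appending
flock($QUAL, LOCK_EX) or die "can't lock file: $!";
print $QUAL $output;
flock($QUAL, LOCK_UN) or die "can't unlock file: $!";
close $QUAL;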

I would not suggest using flock when dealing with multiple processes because it decreases processing efficiency (each child process must wait for the lock to be released by another child process). Instead, I would suggest having each fork write to a separate file and simply concatenating the files after processing.
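A minimal sketch of the separate-file approach, using only core Perl, might look like this. The scratch directory, the part_N.txt file names and the sample data are assumptions made for the demo, and lowercasing a string stands in for the real analysis.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $out_dir = tempdir(CLEANUP => 1);      # scratch directory for the demo
my @chunks  = ("ACGT", "GGCC", "TTAA");   # hypothetical pieces of work
my @pids;

for my $i (0 .. $#chunks) {
    my $pid = fork();
    die "Couldn't fork: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: write the result to its own file, so no locking is needed.
        open(my $out, '>', "$out_dir/part_$i.txt") or die "Can't write: $!";
        print {$out} lc($chunks[$i]), "\n";
        close $out;
        exit(0);
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;

# Parent: concatenate the per-child files in a fixed order.
my $merged = '';
for my $i (0 .. $#chunks) {
    open(my $in, '<', "$out_dir/part_$i.txt") or die "Can't read: $!";
    local $/;            # slurp the whole file at once
    $merged .= <$in>;
    close $in;
}
print $merged;
```

Because each child owns its output file, the children never wait on one another, and the parent decides the final order when it concatenates.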

Putting it all together, if you have 100GB of data you can do this:

step 1: split the dataset equally according to the number of processors you have. This may take a few hours (about 2-3 hrs for a 100GB file).
You can use the unix "split" command for this.
For example:
# note: split -l counts lines, so the number of entries here means the number of lines
my $number_split = int($number_of_entries_in_your_dataset / $max_processors);
system("split", "-l", $number_split, "your_file.fasta", "file_name") == 0 or die "split failed: $?";
step 2: open the directory containing your split files and start Parallel::ForkManager.
For example:
opendir(DIRECTORY, $split_files_directory) or die $!;   ### open the directory
my $super_fork = Parallel::ForkManager->new($max_processors);
while (my $file = readdir(DIRECTORY)) {   ### read the directory
    if ($file =~ /^\./) { next; }         ### skip . and .. and hidden files
    print $file, "\n";
    ########## Start fork ##########
    my $pid = $super_fork->start and next;
    # Whatever you want to do with the split file:
    # analyze my piece of $file
    ######### end fork ###############
    $super_fork->finish;
}
$super_fork->wait_all_children;
closedir(DIRECTORY);

So basically each processor will be active with its piece of data (split file), and thus you have 8 processes running at one time without interfering with each other. Again, I would not suggest writing output from each child process to one file (for the reasons above). Write the output from each fork to a separate file and concatenate them at the end. That's it, you have just sped up your program by up to 8 times (assuming disk I/O is not the bottleneck)!! Isn't it easy?
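One detail of step 1 is worth checking with a little arithmetic: when the number of lines does not divide evenly by the number of processors, rounding down with int leaves a remainder, so split produces one extra small file. The sketch below uses a hypothetical count of 100 lines; rounding up with ceil from the core POSIX module is a suggested alternative, not something the original steps use.

```perl
use strict;
use warnings;
use POSIX qw(ceil);

my $number_of_lines = 100;   # hypothetical line count
my $max_processors  = 8;

# Lines per piece when rounding down versus rounding up.
my $floor_size = int($number_of_lines / $max_processors);
my $ceil_size  = ceil($number_of_lines / $max_processors);

# Number of pieces that splitting by that many lines would produce.
my $pieces_floor = ceil($number_of_lines / $floor_size);
my $pieces_ceil  = ceil($number_of_lines / $ceil_size);

print "int:  $floor_size lines per piece, $pieces_floor pieces\n";
print "ceil: $ceil_size lines per piece, $pieces_ceil pieces\n";
```

Here int gives 12 lines per piece and 9 pieces (8 full pieces plus 4 leftover lines), while ceil gives 13 lines per piece and exactly 8 pieces.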

Note:
You may worry about concatenating the output each child generates, since it does take some time (remember, 100GB). I think you can use the MySQL LOAD DATA LOCAL INFILE command to load all the files into a single table (this should take about 3 hrs for a 100GB dataset) and then export the whole table into one file. This should be faster than just concatenating them using the "cat" command (correct me if I am wrong).

Or a much simpler way is to use pipes:

cat output_dir/* | my_pipe or my_pipe <(file1) final_file;

That's it guys!! Enjoy programming and please do comment. I am not a computer scientist, so forgive me for any mistakes, and please report any you find. Thank you.

Computer Theory & Genetics: George Chao at TEDxUMNSalon

Thu, 15 Aug 2013 22:08:10 -0500

George Chao is an undergraduate senior studying Genetics and Computer Science at the University of Minnesota. Having started genetics research as soon as he entered the university, he has worked in labs spanning multiple disciplines as well as in Japan. His research has included developmental genetics in Drosophila, computational techniques for analyzing protein interactions, and helping to develop algorithms to analyze motion capture data of patients with neck pain. During this time, George steadily developed a fascination with bioinformatics, the use of computational techniques to learn from genetic data. He would like to pursue a research career applying bioinformatics in various fields.

The individuals involved with TEDxUMN have a passion for bringing together the great thinkers at the University of Minnesota and giving them the opportunity to share their ideas worth spreading and to discuss our shared future. We provide these great people the opportunity to share these ideas on a global stage and with an incredibly diverse audience. We believe in the power of ideas to change attitudes, lives and ultimately the world. Check out TEDxUMN at http://www.TEDxUMN.com/

In the spirit of ideas worth spreading, TEDx is a program of local, self-organized events that bring people together to share a TED-like experience. At a TEDx event, TEDTalks video and live speakers combine to spark deep discussion and connection in a small group. These local, self-organized events are branded TEDx, where x = independently organized TED event. The TED Conference provides general guidance for the TEDx program, but individual TEDx events are self-organized.* (*Subject to certain rules and regulations)

Senior Bioinformatics Scientist at Elucidata

Tue, 27 Nov 2018 04:05:57 -0600

Key Responsibilities
- Process and analyse metabolomics, transcriptomics, genomics, proteomics
and any other kind of biological data.
- Interpret the data in the context of relevant biological literature to generate
actionable insights.
- Communicate the findings from data and literature to biologists and use the
biological insights to derive next steps/analyses.
- Communicate work through blogs, meet-ups, research papers, posters, etc.
- Identify, troubleshoot, and implement improvements to existing pipelines
and algorithms.
- Identify and implement new tools and pipelines to use for different types of
biological data.
- Work in a multi-disciplinary team with biologists, data scientists and data
analysts.
- Help with any other requirements (from database design to generating
prototypes for the product team).

Requirements
- 3-5 years of relevant bioinformatics experience such as public data mining,
processing, analysing and visualising omics data, etc.
- Ph.D., Masters or Bachelors in Bioinformatics, Biotechnology,
Computational Biology, or related field.
- Understanding of molecular biology and biochemistry.
- Comfort and experience with biological research and data.
- Proficient in a programming language used for bioinformatics such as R or
Python.
- Excellent communication skills.
- Ability to summarise and simplify complex analyses for a non-technical
audience.
- Strong analytical skills, curiosity and a knack to solve difficult problems.
- Work well in multi-disciplinary teams with people of vastly different
backgrounds.
- Demonstrated success in collaboration and independent work.

More at https://angel.co/elucidata/jobs/460104-senior-bioinformatics-scientist

4 positions in high throughput computational metagenomics and systems biology of natural products

Tue, 20 Aug 2013 08:42:29 -0500

The Research and Innovation Centre at the Fondazione Edmund Mach (CRI-FEM) is a major international research institution with strong and expanding research interests in Fruit Genomics, Quality Health and Nutrition of Agricultural Products, Agro-ecosystems Sustainability, Biodiversity and Molecular Ecology.

CRI-FEM hosts GMPF, an International PhD Program in Genomics and Molecular Physiology of Fruit Crops, and Fox-Lab, an international initiative in forest and wood research.

4 positions in high throughput computational metagenomics and systems biology of natural products - deadline September 30th, 2013

To support interdisciplinary research, CRI-FEM has established the Computational Biology Centre (CBC).

The mission of CBC is to develop systems-level integrative approaches connecting genotype to phenotype with a special focus on genome-wide analyses and next generation sequencing technologies.

CRI-FEM is seeking to attract 4 high calibre scientists in the areas of high throughput computational metagenomics and systems biology of natural products.

The 4 positions are listed below:

http://www.fmach.it/eng/Servizi-Generali/Lavora-con-noi/Annunci-lavoro-e-borse-di-studio/Details-of-the-5-positions-in-high-throughput-computational-metagenomics-and-systems-biology-of-natural-products-deadline-September-30th-2013/Post-doc-in-Metagenomics-screening-and-characterization-of-bioactive-microbial-compounds-130_CRI_MSC

http://www.fmach.it/eng/Servizi-Generali/Lavora-con-noi/Annunci-lavoro-e-borse-di-studio/Details-of-the-5-positions-in-high-throughput-computational-metagenomics-and-systems-biology-of-natural-products-deadline-September-30th-2013/Post-doc-in-Modeling-transcriptional-control-programs-at-a-genome-wide-scale-131_CRI_TCP

http://www.fmach.it/eng/Servizi-Generali/Lavora-con-noi/Annunci-lavoro-e-borse-di-studio/Details-of-the-5-positions-in-high-throughput-computational-metagenomics-and-systems-biology-of-natural-products-deadline-September-30th-2013/Technologist-in-Purification-of-plant-bioactive-molecules-from-complex-matrixes-132_CRI_PBM

http://www.fmach.it/eng/Servizi-Generali/Lavora-con-noi/Annunci-lavoro-e-borse-di-studio/Details-of-the-5-positions-in-high-throughput-computational-metagenomics-and-systems-biology-of-natural-products-deadline-September-30th-2013/Researcher-in-Methods-for-algorithmic-and-integrative-genomics-for-metagenomics-134_CRI_AIG

For more information on the CBC or informal inquiries on the advertised positions please contact Dr Duccio Cavalieri (e-mail duccio.cavalieri@fmach.it).

BINC Exam merged with DBT- BET JRF Exam

Jit — Thu, 21 Feb 2019 09:37:36 -0600

More breaking news has been received from the Department of Biotechnology (DBT). As per a notification released by DBT, the Bioinformatics National Certification (BINC) Exam, conducted once per year by DBT, has now been merged with the DBT-BET JRF Exam.

Also, Bioinformatics Industrial Training Program (BIITP) is merged with the HRD Biotechnology Industrial Training Programme (BITP).

This comes as a surprise for a lot of participants, but we believe it is a good attempt to unify and create a national benchmark for talent, and we appreciate this endeavor from the Department of Biotechnology.

However, such last-minute announcements can create confusion. Candidates are therefore advised to go through the complete DBT-BET JRF 2019 notification via the link below. If you have any doubts, contact DBT JRF or Biotecnika for help and assistance.

Attention:-Bioinformatics Programs (BINC and BIITP)

1. Bioinformatics National Certification (BINC) has been merged with DBT-Junior
Research Fellow (BET Exam)

2. Bioinformatics Industrial Training Program (BIITP) is merged with HRD Biotechnology Industrial Training Programme (BITP).

Students of Bioinformatics who are interested in applying for a Fellowship or Industrial
Training may keep track of the advertisements of DBT-JRF (BET Exam) and BITP
of DBT.

More at http://www.bcil.nic.in/files/Attention_Bioinformatics_Programs_(BINC_and_BIITP).pdf

Dynamic Programming Alignment

Thu, 22 Aug 2013 09:38:28 -0500

lecture 9, Chem. C100, Spring 2013, UCLA

Bioinformatics web development course

Jit — Wed, 06 Nov 2019 20:42:48 -0600

This web development course, targeted at Biology and Bioinformatics students, aims to teach from scratch all the skills needed to set up a fully working Linux web server and to develop and deploy web applications for Bioinformatics.

No previous programming knowledge is assumed. By following this tutorial you will learn the fundamental concepts of programming by using scripting languages: variables, types, arrays, loops, conditional statements, functions, objects, regular expressions, file reading and manipulation, etc.

Address of the bookmark: http://www.cellbiol.com/bioinformatics_web_development/introduction/

PhD position in biochemistry towards bioinformatics at the Department of Biochemistry and Biophysics.

Tue, 03 Sep 2013 06:09:03 -0500

PhD position in biochemistry towards bioinformatics at the Department of Biochemistry and Biophysics. Reference number: SU FV-2293-13. Deadline for application: September 10, 2013.

Project title: Functional Inference from Domain Architecture and Orthology

Requirements

To be accepted as a PhD student, credits corresponding to four years of full-time studies at the undergraduate level are required, including credits corresponding to at least two years of full-time studies in chemistry, life sciences or physics, depending on the program. The credits should include courses at the advanced level (second cycle) corresponding to one year, and of these one semester should be a degree thesis. In order to facilitate the evaluation of merits and suitability for the PhD studies, the curriculum vitae (CV) should contain information about the extent and focus of the academic studies. The quantity (as part of an academic year) and the quality mark of courses in chemistry and physics are of particular interest. The title, number of credits and length in full-time months of the undergraduate thesis and project work should be specified.

Information

More information about the project can be provided by the project leader. General information about the PhD training program may be requested from the director of graduate studies, Stefan Nordlund, stefan@dbb.su.se or from Lena Mäler, Head of Department (prefekt), lenam@dbb.su.se

Further information on the web:

The Department of Biochemistry and Biophysics: www.dbb.su.se

Stockholm University: www.su.se/english

Faculty of Science: www.science.su.se/english

The handbook for postgraduate students: www.doktorandhandboken.nu/english

Application

The application should contain a personal letter (a letter of intent explaining why you are interested in the specific project, why you are interested in studying for a PhD, what you hope to accomplish during your PhD studies, and what skills you can bring to this project), a curriculum vitae, a list of two persons who may act as referees (with telephone numbers and e-mail addresses), copies of degree certificates and transcripts of academic records, and a copy of your undergraduate thesis and articles, if any.

In order to apply for this position, please use the Stockholm University web-based application form (where it is possible to select language).

Please submit your application no later than September 10, 2013.

Project leader: Erik Sonnhammer, Erik.Sonnhammer@sbc.su.se,
www.sonnhammer.sbc.su.se

Advertisement: http://www.su.se/english/about/vacancies/phd-studies/phd-position-in-biochemistry-towards-bioinformatics-1.143446

The Clark Lab

Fri, 07 Feb 2020 13:57:24 -0600

We study the process of adaptive evolution, during which species adopt novel traits to overcome challenges. We retrace the evolutionary histories of genomic elements to determine the changes underlying adaptation and to discover previously unknown genetic networks. These discoveries have already led to advances in human health, species conservation, and molecular biology.

More at http://clark.genetics.utah.edu/

Senior Bioinformatics Programmer and SRF at BIOTECH PARK Lucknow

Fri, 23 Aug 2013 04:55:51 -0500

BIOTECH PARK

Advt. No. 3 (8)/BP/13

A walk-in interview will be held in the Biotech Park Office at Sector G, Jankipuram, Kursi Road, Lucknow (U.P.) on August 27, 2013 at 11.00 a.m. for the following posts in a DBT-sponsored project tenable at Biotech Park. Interested candidates fulfilling the requisite qualifications, experience and age as given below may appear before the Selection Committee on the date of the interview. The selected candidate will have to join immediately.

INTERVIEW ON August 27, 2013 at 11.00 A.M.

2. SENIOR PROGRAMMER (ONE POST)

a) Educational Qualification: M.Sc. Bioinformatics with minimum 60% marks and two years of relevant experience, or B.Tech. Bioinformatics or Biotechnology with minimum 60% marks and two years of experience in Bioinformatics.

b) Job Requirement: Development of databases in a multi-user environment and application software, maintenance of website, drug designing and QSAR studies, etc.

c) Desirable: Knowledge of Bioinformatics tools, Windows, Linux, C++, Java / JavaScript, Visual Basic, CGI, DBMS/RDBMS and HTML. Experience in various domains of bioinformatics such as structure-based drug designing, Newtonian dynamics and QSAR studies.

d) Age: Below 35 years (as on the date of interview)

e) Emoluments: Rs. 12,000/- per month fixed.

Appointment will be made initially for one year extendable on satisfactory performance till the duration of the project.

3. SENIOR RESEARCH FELLOW: (ONE POST)

a) Educational Qualification: M.Sc. in Biotechnology/Botany with minimum 60% marks and knowledge of handling databases and database searching.

b) Essential Qualification: Expertise in Windows and Microsoft Excel.

c) Desirable: Good knowledge of statistical software packages like SPSS.

d) Age: Below 35 years (as on the date of interview)

e) Job Requirement: Management of database and website in a multi-user environment, computation of biological field data and generation of reports.

f) Emoluments:

18000+ HRA for NET/GATE qualified
14000+ HRA for others

The appointment will be made till the duration of project.

Note: All the candidates should report for interview on or before 10.45 A.M.

General Conditions

The aforesaid positions are purely temporary and do not give the incumbent any right whatsoever for appointment on regular basis.

Advertisement: http://www.biotechpark.org.in/html/jobs%20in%20Biotech%20Park/Job_2013_04.htm