BOL: Related items

Awk for Bioinformaticians and Computational Biologists

Poonam Mahapatra — Tue, 06 Feb 2018 14:54:35 -0600

Awk is a programming language that allows easy manipulation of structured data and is mostly used for pattern scanning and processing. It searches one or more files for lines that match the specified patterns and then performs the associated actions. The basic syntax is:

awk '/pattern1/ {Actions}
/pattern2/ {Actions}' file

Awk works as follows:
Awk reads the input files one line at a time.
Each line is tested against the patterns in the order given; when a pattern matches, the corresponding action is performed.
If no pattern matches, no action is performed.
In the syntax above, either the search pattern or the action may be omitted, but not both.
If the search pattern is omitted, Awk performs the action on every line of the input.
If the action is omitted, Awk prints every line that matches the pattern, which is the default action.
Empty braces with no action do nothing; the default print is not performed.
Statements within an action are separated by semicolons (or newlines).
Say you have data/test.tsv with the following contents:

$ cat data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3 ACTTATATATATATA
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
With no pattern and a bare print, Awk prints every line of the file.

$ awk '{print;}' data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3 ACTTATATATATATA
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
To print only the lines that match the pattern contig3:

$ awk '/contig3/' data/test.tsv
contig3 ACTTATATATATATA
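
The three rule forms described above (pattern only, action only, and a pattern with empty braces) can be compared side by side. This is a minimal sketch; the two-line sample and the variable names are made up for illustration and are not the article's data/test.tsv.

```shell
# Hypothetical two-line sample, held in a shell variable
data='contig1 ACTG
contig2 TTGA'

# Pattern only: matching lines are printed by the default action
pattern_only=$(printf '%s\n' "$data" | awk '/contig2/')

# Action only: the action runs on every line
action_only=$(printf '%s\n' "$data" | awk '{ print $1 }')

# Pattern with empty braces: the line matches, but nothing is printed
empty_action=$(printf '%s\n' "$data" | awk '/contig2/ {}')

echo "pattern only: $pattern_only"
echo "action only:  $action_only"
echo "empty action: [$empty_action]"
```
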
Awk has a number of built-in variables. For each record (by default, a line), it splits the record on whitespace and stores the pieces in the $n variables. If the line has 5 words, they are stored in $1, $2, $3, $4 and $5. $0 represents the whole line. NF is a built-in variable that holds the number of fields in the current record, so $NF refers to the last field.

$ awk '{print $1","$2;}' data/test.tsv
contig1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2,ACTTTATATATT
contig3,ACTTATATATATATA
contig4,ACTTATATATATATA
contig5,ACTTTATATATT

$ awk '{print $1","$NF;}' data/test.tsv
contig1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2,ACTTTATATATT
contig3,ACTTATATATATATA
contig4,ACTTATATATATATA
contig5,ACTTTATATATT
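
The default whitespace splitting can be overridden with the -F option. The sketch below assumes a tab-delimited file, as the .tsv extension suggests; the temporary file and its contents are invented for the example.

```shell
# Hypothetical tab-separated sample written to a temporary file
tmp=$(mktemp)
printf 'contig1\tACTGTC\ncontig2\tACTT\n' > "$tmp"

# -F '\t' sets the input field separator to a tab.
# NF is the number of fields; $NF is the last field.
out=$(awk -F '\t' '{ print NF, $1, length($NF) }' "$tmp")
echo "$out"

rm -f "$tmp"
```

Setting the separator explicitly matters once a column can contain spaces, since the default splitting would break such a column into several fields.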

Awk has two special patterns, specified by the keywords BEGIN and END. The syntax is as follows:

BEGIN { Actions before reading the file}
{Actions for every line in the file}
END { Actions after reading the file }

For example,
$ awk 'BEGIN{print "Header,Sequence"}{print $1","$2;}END{print "-------"}' data/test.tsv
Header,Sequence
contig1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2,ACTTTATATATT
contig3,ACTTATATATATATA
contig4,ACTTATATATATATA
contig5,ACTTTATATATT
-------
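
BEGIN and END are also handy for accumulating a summary over the whole file. As a rough sketch (the sample data and variable names are assumptions, not from the article), the following counts the contigs and sums their sequence lengths:

```shell
# Hypothetical sample written to a temporary file
tmp=$(mktemp)
printf 'contig1 ACTGTC\ncontig2 ACTT\ncontig3 ACTTAT\n' > "$tmp"

summary=$(awk '
  BEGIN { total = 0 }                            # runs once, before any input is read
  { total += length($2) }                        # runs for every line
  END { print NR " contigs, " total " bases" }   # runs once, after the last line
' "$tmp")
echo "$summary"

rm -f "$tmp"
```
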
We can also use the conditional (ternary) operator in a print statement, in the form print CONDITION ? PRINT_IF_TRUE_TEXT : PRINT_IF_FALSE_TEXT. For example, the code below labels each sequence according to whether its length is greater than 14:

$ awk '{print (length($2)>14) ? $0">14" : $0"<=14";}' data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG>14
contig2 ACTTTATATATT<=14
contig3 ACTTATATATATATA>14
contig4 ACTTATATATATATA>14
contig5 ACTTTATATATT<=14
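
The same length test can serve as a pattern, so that only the long sequences are printed, or sit inside a parenthesised conditional to emit a short label. A small sketch, using an invented two-line sample:

```shell
# Hypothetical sample: one 16-base and one 12-base sequence
tmp=$(mktemp)
printf 'contig1 ACTGTCTGTCACTGTG\ncontig2 ACTTTATATATT\n' > "$tmp"

# Expression used as a pattern: lines where it is true get the default print
long=$(awk 'length($2) > 14' "$tmp")

# Conditional operator wrapped in parentheses, so > is not read as output redirection
labelled=$(awk '{ print $1, (length($2) > 14 ? "long" : "short") }' "$tmp")

echo "$long"
echo "$labelled"
rm -f "$tmp"
```
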
We can also put 1 after the last block {} to print every line: 1 is a pattern that is always true, and with no action attached it triggers the default action {print $0} (equivalently {print}, since print with no argument prints $0). An earlier block can change $0 before it is printed. For example, to replace $0 with its first field on the third line (NR==3), we can use:

$ awk 'NR==3{$0=$1}1' data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
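
The same idiom works for editing a single field and then printing the modified line. The sketch below assumes lower-case input, which is not part of the article's data, and upper-cases the sequence column:

```shell
# Modify field 2 in place, then let the always-true pattern 1 print the line
out=$(printf 'contig1 actgtt\ncontig2 ttga\n' | awk '{ $2 = toupper($2) } 1')
echo "$out"
```

Assigning to a field makes Awk rebuild $0 with the output field separator (a single space by default), so the original spacing between columns is not preserved.
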
You can have as many blocks as you want, and they are executed on each line in the order they appear. For example, to print $1 three times (here we use printf instead of print, as printf does not append an end-of-line character; note that printf treats its first argument as a format string, so this form is only safe when $1 contains no % characters):

$ awk '{printf $1"\t"}{printf $1"\t"}{print $1}' data/test.tsv
contig1 contig1 contig1
contig2 contig2 contig2
contig3 contig3 contig3
contig4 contig4 contig4
contig5 contig5 contig5
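
A more defensive way to write such printf calls is to pass an explicit format string and supply the data as arguments. A minimal sketch with made-up input:

```shell
# %s prints a string, %d an integer; the newline must be written explicitly
out=$(printf 'contig1 ACTG\ncontig2 TTGA\n' | awk '{ printf "%s\t%d\n", $1, length($2) }')
echo "$out"
```
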
We can also skip the remaining blocks for a given line with the next keyword:

$ awk '{printf $1"\t"}NR==3{print "";next}{print $1}' data/test.tsv
contig1 contig1
contig2 contig2
contig3
contig4 contig4
contig5 contig5

$ awk 'NR==3{print "";next}{printf $1"\t"}{print $1}' data/test.tsv
contig1 contig1
contig2 contig2

contig4 contig4
contig5 contig5
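
A common use of next is to discard lines that should never reach the main block, such as comments or blank lines. The input below is a hypothetical example with one comment line and one empty line:

```shell
out=$(printf '#name seq\ncontig1 ACTG\n\ncontig2 TTGA\n' | awk '
  /^#/    { next }    # comment line: skip the remaining blocks
  NF == 0 { next }    # blank line: skip as well
  { print $1 }        # only reached by data lines
')
echo "$out"
```
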
You can also use getline to read the contents of another file in addition to the one being processed. For example, in the statement below, the while loop reads each line of test.tsv into k until there are no more lines to read:

$ awk 'BEGIN{while((getline k <"data/test.tsv")>0) print "BEGIN:"k}{print}' data/test.tsv
BEGIN:contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
BEGIN:contig2 ACTTTATATATT
BEGIN:contig3 ACTTATATATATATA
BEGIN:contig4 ACTTATATATATATA
BEGIN:contig5 ACTTTATATATT
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3 ACTTATATATATATA
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
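
The > 0 comparison in the loop matters because getline returns 1 when a line was read, 0 at end of file and -1 on error, so a bare while (getline k < file) would never terminate if the file could not be opened. The sketch below illustrates this with a hypothetical helper function, count_lines, and temporary paths invented for the example:

```shell
dir=$(mktemp -d)
printf 'contig1 ACTG\ncontig2 TTGA\n' > "$dir/seqs.txt"

# count_lines FILE: count the lines of FILE using getline inside BEGIN
count_lines() {
  awk -v f="$1" 'BEGIN {
    n = 0
    while ((getline line < f) > 0) n++   # stops on end of file (0) and on error (-1)
    close(f)                             # release the file once it has been read
    print n
  }'
}

present=$(count_lines "$dir/seqs.txt")
missing=$(count_lines "$dir/no_such_file.txt")
echo "present: $present, missing: $missing"

rm -rf "$dir"
```
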
You can also store data in memory in an associative array with the syntax VARIABLE_NAME[KEY]=VALUE, and later iterate over its keys with the for (INDEX in VARIABLE_NAME) statement (the iteration order is not guaranteed):

$ awk '{i[$1]=1}END{for (j in i) print j"<="i[j]}' data/test.tsv
contig1<=1
contig2<=1
contig3<=1
contig4<=1
contig5<=1
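
Arrays make it easy to count repeated values, for instance to find duplicated sequences. In this sketch the input is invented, and the result is piped through sort because the order of a for (key in array) loop is unspecified:

```shell
out=$(printf 'contig1 ACTG\ncontig2 TTGA\ncontig3 ACTG\n' | awk '
  { count[$2]++ }                                    # one counter per distinct sequence
  END { for (seq in count) print seq, count[seq] }   # emitted in arbitrary order
' | sort)
echo "$out"
```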

Prime Minister’s 100k Genome Project

Jitendra Narayan — Thu, 08 Aug 2013 09:40:39 -0500

Genomics England is set to sequence 100,000 patients in England over the next five years, a landmark project by the British government.

Genomics England will play a key role in building on the UK’s long track record as leader in medical science advances to push the boundaries by unlocking the power of DNA data. The UK will become the first ever country to introduce this technology in its mainstream health system – leading the global race for better tests, better drugs and above all better, more personalised care.

http://www.genomicsengland.co.uk/100k-genome-project/

Project-based approach to improve bioinformatics education with skilled and meaningful access to omics data

eliabrodsky — Wed, 11 Apr 2018 13:31:42 -0500

Pine Biotech has been collaborating with Loyola University of New Orleans on piloting a new approach to bioinformatics education using the intuitive and logic-driven bioinformatics platform T-BioInfo.

https://edu.t-bio.info/collaborative-model-bioinformatics-education-combining-biologically-inspired-bioinformatics-project-based-learning/

2013 NextGen Genomics & Bioinformatics Technologies (NGBT) Conference, New Delhi, INDIA

Thu, 08 Aug 2013 16:21:16 -0500

2013 NextGen Genomics & Bioinformatics Technologies (NGBT) Conference

SciGenom Research Foundation (SGRF) and Institute of Genomics and Integrative Biology (IGIB) are pleased to host the Next-Generation Sequencing and Bioinformatics for Genomics & Healthcare conference.

In the ten years since the first human reference genome was completed for US$3 billion, sequencing technologies have changed radically, leading to a great reduction in sequencing cost. Today a human genome can be sequenced for under US$5000 in less than two weeks. It is expected that by the end of 2015 the cost of sequencing a human genome will drop below a thousand dollars. Over the past five years, next-generation sequencing technologies have enabled a large number of genomic studies that impact human health and disease, and have made possible the growth of microbial, animal and plant genomics studies. While data production has increased at a rapid pace, challenges remain in analyzing and understanding the data. The conference will cover next-generation sequencing (NGS) technologies, bioinformatics for NGS and applications of NGS in many areas, including personalized medicine.

For more info : http://www.scigenomconferences.com/2013/default.php

Biologist versus computational biologist!

Abhimanyu Singh — Mon, 29 Oct 2018 04:23:24 -0500

This is how it works :)

Postdoctoral Associate - Bioinformatics at Duke University Medical Center

Sat, 10 Aug 2013 18:38:38 -0500

The Department of Biostatistics and Bioinformatics at Duke University Medical Center is seeking a Postdoctoral Associate for a one year appointment to work on several high-dimensional research projects. The specific goals of the project are to identify genes or molecular markers that are predictive of clinical outcomes in renal and prostate cancer.

Candidates must have a PhD degree in statistics, biostatistics or bioinformatics, and extensive experience in analyzing high-dimensional data (microarray, SNP, CNVs) and in validation approaches. In addition, candidates need experience in penalized regression methods and database manipulation, as well as strong programming skills (R) in order to conduct Monte Carlo studies and applications. Candidates must have excellent communication skills (verbal, written and presentation) and strong proficiency in Linux.

This position is available immediately and will be filled as soon as possible. Appointment could be extended beyond the first year based on additional funding.

For more information about the Department of Biostatistics and Bioinformatics, please visit our website: http://www.biostat.duke.edu.

For more info: http://biostat.duke.edu/sites/biostat.duke.edu/files/Halabi%20-%20Postdoc%20Job%20Posting%202013%20updated.pdf

Duke University is an Equal Opportunity/Affirmative Action Employer.

Bioinformatics for Precision Oncology - Online Training Program, Summer 2019

eliabrodsky — Wed, 05 Jun 2019 15:04:41 -0500

The Bioinformatics for Precision Oncology online course provides an opportunity to learn about bioinformatics methods used in precision oncology research and practice. As a subset of precision medicine, precision oncology deals with molecular factors involved in the biological processes that lead to cancer and can help diagnose, treat or prevent this disease. Oncology is driven by data, often generated using Next Generation Sequencing (NGS), which helps us study genomic and transcriptomic sub-cellular processes. Learn more and register: https://edu.t-bio.info/bioinformatics-training-precision-oncology/

What is the difference between BioRuby and BioGem?

Neel — Mon, 12 Aug 2013 09:27:57 -0500

I came across two different but related terms, BioRuby and BioGem. What is the difference between them? If both use the Ruby language for development, why were two different biological packages developed?

Computational Biology in the 21st Century: Making Sense out of Massive Data

Thu, 29 Aug 2013 08:32:26 -0500

Computational Biology in the 21st Century: Making Sense out of Massive Data
Air date: Wednesday, February 01, 2012, 3:00:00 PM
Category: Wednesday Afternoon Lectures
Description: The last two decades have seen an exponential increase in genomic and biomedical data, which will soon outstrip advances in computing power to perform current methods of analysis. Extracting new science from these massive datasets will require not only faster computers; it will require smarter algorithms. We show how ideas from cutting-edge algorithms, including spectral graph theory and modern data structures, can be used to attack challenges in sequencing, medical genomics and biological networks. The NIH Wednesday Afternoon Lecture Series includes weekly scientific talks by some of the top researchers in the biomedical sciences worldwide.
Author: Dr. Bonnie Berger
Runtime: 00:58:06
Permanent link: http://videocast.nih.gov/launch.asp?17563

Multiple PhD positions

Sun, 09 Feb 2020 03:10:28 -0600

14 PhD positions in the EU Horizon 2020 Marie Skłodowska-Curie Project PRECODE:
An international training network with a joint research programme to train a new generation of leading scientists in model systems and methods for the development of new therapies for pancreatic cancer (PaCa).

http://precode-project.eu/jobs-board/#1572451761376-39d75f63-c6fb