BOL: Related items

Awk for Bioinformaticians and Computational Biologists

Poonam Mahapatra — Tue, 06 Feb 2018 14:54:35 -0600

Awk is a programming language that makes it easy to manipulate structured data and is mostly used for pattern scanning and processing. It searches one or more files for lines that match the specified patterns and then performs the associated actions. The basic syntax is:

awk '/pattern1/ {Actions}
/pattern2/ {Actions}' file

Awk works as follows:
Awk reads the input files one line at a time.
For each line, it tests the patterns in the order given and, where a pattern matches, performs the corresponding action.
If no pattern matches, no action is performed.
In the syntax above, either the search pattern or the action can be omitted, but not both.
If the search pattern is omitted, Awk performs the action on every line of the input.
If the action is omitted, Awk prints every line that matches the pattern, which is the default action.
Empty braces with no statements do nothing; they do not trigger the default print.
Statements within an action are separated by semicolons (or newlines).
Say you have data/test.tsv with the following contents:

$ cat data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3 ACTTATATATATATA
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
With no pattern, the print action runs on every line, so Awk prints the whole file:

$ awk '{print;}' data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3 ACTTATATATATATA
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
With a pattern and no action, Awk prints only the lines that match, here contig3:

$ awk '/contig3/' data/test.tsv
contig3 ACTTATATATATATA
Awk has a number of built-in variables. It splits each record (by default, a line) into fields on whitespace and stores them in $1, $2, and so on; a line with 5 words fills $1 through $5. $0 holds the whole line. NF is a built-in variable holding the number of fields in the record, so $NF refers to the last field.

$ awk '{print $1","$2;}' data/test.tsv
contig1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2,ACTTTATATATT
contig3,ACTTATATATATATA
contig4,ACTTATATATATATA
contig5,ACTTTATATATT

$ awk '{print $1","$NF;}' data/test.tsv
contig1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2,ACTTTATATATT
contig3,ACTTATATATATATA
contig4,ACTTATATATATATA
contig5,ACTTTATATATT

Awk has two important special patterns, specified by the keywords BEGIN and END. The syntax is as follows:

BEGIN { Actions before reading the file}
{ Actions for every line in the file }
END { Actions after reading the file }

For example,
$ awk 'BEGIN{print "Header,Sequence"}{print $1","$2;}END{print "-------"}' data/test.tsv
Header,Sequence
contig1,ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2,ACTTTATATATT
contig3,ACTTATATATATATA
contig4,ACTTATATATATATA
contig5,ACTTTATATATT
-------
We can also use the conditional (ternary) operator inside a print statement, in the form print CONDITION ? TEXT_IF_TRUE : TEXT_IF_FALSE. For example, the code below flags sequences longer than 14 bases:

$ awk '{print (length($2)>14) ? $0">14" : $0"<=14";}' data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG>14
contig2 ACTTTATATATT<=14
contig3 ACTTATATATATATA>14
contig4 ACTTATATATATATA>14
contig5 ACTTTATATATT<=14
A bare 1 after the last {} block prints every line: 1 is a pattern that is always true, and with no action it triggers the default {print $0} (equivalently {print}, since print with no argument prints $0). Blocks placed before it can modify $0 first. For example, to replace $0 with its first field on the third line (NR==3), we can use:

$ awk 'NR==3{$0=$1}1' data/test.tsv
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
You can have as many blocks as you want, and they are executed on each line in the order they appear. The example below prints $1 three times, using printf for the first two because printf does not append an end-of-line character. (A field is used directly as the printf format string here; printf "%s\t", $1 is safer when the data may contain a % character.)

$ awk '{printf $1"\t"}{printf $1"\t"}{print $1}' data/test.tsv
contig1 contig1 contig1
contig2 contig2 contig2
contig3 contig3 contig3
contig4 contig4 contig4
contig5 contig5 contig5
We can also skip the remaining blocks for a given line by using the next keyword:

$ awk '{printf $1"\t"}NR==3{print "";next}{print $1}' data/test.tsv
contig1 contig1
contig2 contig2
contig3
contig4 contig4
contig5 contig5

$ awk 'NR==3{print "";next}{printf $1"\t"}{print $1}' data/test.tsv
contig1 contig1
contig2 contig2

contig4 contig4
contig5 contig5
You can also use getline to read another file in addition to the one being processed. In the statement below, the while loop in the BEGIN block reads each line of test.tsv into k until no lines remain, and the main block then prints the file as usual:

$ awk 'BEGIN{while((getline k <"data/test.tsv")>0) print "BEGIN:"k}{print}' data/test.tsv
BEGIN:contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
BEGIN:contig2 ACTTTATATATT
BEGIN:contig3 ACTTATATATATATA
BEGIN:contig4 ACTTATATATATATA
BEGIN:contig5 ACTTTATATATT
contig1 ACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG
contig2 ACTTTATATATT
contig3 ACTTATATATATATA
contig4 ACTTATATATATATA
contig5 ACTTTATATATT
You can also store data in memory in an associative array with the syntax VARIABLE_NAME[KEY]=VALUE, and later iterate over its keys with for (INDEX in VARIABLE_NAME). Note that the order in which for-in visits the keys is not guaranteed:

$ awk '{i[$1]=1}END{for (j in i) print j"<="i[j]}' data/test.tsv
contig1<=1
contig2<=1
contig3<=1
contig4<=1
contig5<=1

Learning Python Programming - a bioinformatician's perspective!

Rahul Nayak — Mon, 14 May 2018 16:33:03 -0500

Python is a general-purpose programming language that is open source, flexible, powerful and easy to use. One of its most important features is its rich set of utilities and libraries for data processing and analytics tasks. In the current era of big biological data, Python and Biopython are becoming more popular because of their easy-to-use features, which support big data processing.

In this tutorial series, I will explore features and packages of Python that are widely used in big data, NGS, and bioinformatics. I will also walk through a real biological example that shows NGS data processing with the help of Python packages and programming.

Python has a couple of points to recommend it to biologists and scientists specifically:

It's widely used in the scientific community
It has a couple of very well designed libraries for doing complex scientific computing (although we won't encounter them in this book)
It lends itself well to being integrated with other, existing tools
It has features which make it easy to manipulate strings of characters (for example, strings of DNA bases and protein amino acid residues, which we as biologists are particularly fond of)
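
As a small illustration of the last point, here is a minimal sketch, assuming Python 3. The sequence is a made-up example, and the variable names are only for illustration. It uses only built-in string methods to compute GC content and a reverse complement.

```python
# Illustrative only: the sequence below is made up, not taken from a real genome.
dna = "ACTGTCTGTCACTG"

# GC content: count G and C with str.count and divide by the sequence length.
gc_content = (dna.count("G") + dna.count("C")) / len(dna)

# Reverse complement: map each base to its partner with str.translate,
# then reverse the resulting string with a slice.
complement_table = str.maketrans("ACGT", "TGCA")
reverse_complement = dna.translate(complement_table)[::-1]

print(f"length={len(dna)} GC={gc_content:.2f}")
print(reverse_complement)
```

No external library is needed for this kind of task; packages such as Biopython offer richer sequence objects when more is required.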

In general, the following are some of the important features of Python that make it a perfect fit for rapid application development.

Python is an interpreted language, so programs do not need a separate compilation step. The interpreter parses the program code and runs it.
Python is dynamically typed, so variable types are determined automatically at runtime.
Python is strongly typed, so values are not silently converted between unrelated types; developers must convert them explicitly.
Programs tend to need less code to do more, which makes the language easier to adopt.
Python is portable, extensible and scalable.
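
The difference between dynamic and strong typing is easy to confuse, so here is a minimal sketch, assuming Python 3. The names `value` and `read_count` are hypothetical and used only for illustration.

```python
# Dynamic typing: a name has no fixed type; the value it refers to does.
value = 42
first_type = type(value).__name__      # type name of an integer value
value = "ACGT"
second_type = type(value).__name__     # the same name now refers to a string

# Strong typing: unrelated types are not silently mixed.
read_count = "150"                     # e.g. a number read from a text file
try:
    total = read_count + 50            # str + int raises TypeError
except TypeError:
    total = int(read_count) + 50       # an explicit conversion is required

print(first_type, second_type, total)
```

The explicit `int(...)` call is the manual conversion mentioned in the list above.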

There are two major Python versions, Python 2 and Python 3, and they are quite different. This tutorial uses Python 3 because it is more semantically correct and supports newer features.
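
To give a feel for the kind of difference involved, the sketch below shows two well-known Python 3 behaviours that differ from Python 2. It is an illustration rather than a complete list, and the variable names are hypothetical.

```python
# Python 3: "/" is always true division and "//" is floor division.
# (In Python 2, "/" on two integers performed floor division.)
mean_depth = 7 / 2
floor_depth = 7 // 2

# Python 3: print is a function, and str (text) is a different type from bytes.
sequence_text = "ACGT"
sequence_bytes = sequence_text.encode("ascii")
same = (sequence_text == sequence_bytes)   # text and bytes do not compare equal

print(mean_depth, floor_depth, same)
```

When reading sequence files in Python 3, opening them in text mode yields `str` values, while binary mode yields `bytes` that have to be decoded explicitly.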

I will post tutorials daily on this page. Check the sub-pages on the right side.

Golden Rules of Bioinformatics

Jitendra Narayan — Wed, 14 Aug 2013 21:11:33 -0500

All constants are variables.
Copy and paste is a genetic error.
First solve the problem, then write the code.
No matter what goes wrong, it will probably look right.
Any simple problem can become insoluble if enough meetings are held to discuss it. :P
Statistics is a systematic method of coming to the wrong conclusion with confidence.
A bug is an undocumented feature in programming languages.
A good biological programmer goes on summer holiday with a raincoat. [because: see 1]
Thank God Google knows that Python is not a python, and that multiplication and division are the same thing.
Don't be clever; complex biology will trick you.

The Best Bioinformatics / Computational Biology Quotes

Jit — Wed, 26 Feb 2014 17:50:59 -0600

Bioinformaticians are not anti-social; we are just genome-friendly.

Bioinformaticians would love to change the biological world, but they won't give us the genetic code :P

If at first you don't succeed; call it version 1.0

The glass is neither half-full nor half-empty: it actually has several genomes.

I'm BioGeek.

Fed up with LISP? Try God's script.

Idiot, Go ahead, make my data!

Thank god, my genome just compiled.

Error message: "Out of space on genome drive:"

Shut up, mobile elements, or I'll flush you out.

Never underestimate the internet bandwidth, u gotta incomplete.

Applied fuzzy logic to understand God's logic?

Warning! Overflow, delete chromosome!

Be nice to the BioGeek, for all you know they might be the next curator!

Beware of computational biologists: they screw genes and proteins.

Warning! Your genome is full of garbage, delete it!

Bad or missing mouse genome. Spank the cat? (Y/N)

Genomes make very fast, very accurate mistakes.

Let's BLAST it.

Some genomes never have transposons. They just develop random features.

Go watch CINEMA and have a BLAST.

Bioinformatics Jokes!!

Jitendra Prajapati — Fri, 21 Aug 2015 01:26:54 -0500

Why was the bioinformatician fired from his job?

A: He was getting too Sassy.

What did the bioinformatician say when he found out his team stopped using version control?

A: Y’all better Git!

Why did the computational biologist stay home from work?

A: He had a code!

Why was the bioinformatician's paper rejected?

A: Journal thought it seemed scripted.

How can you tell that a bioinformatician is working?

A: You can hear him Grunting!

Why are bioinformaticians always silent?

A: Because bioinformaticians calmly whisper, “SSH”

Why was the bioinformatician always so sleepy?

A: He/She wasn’t given any Java.

Why did the program/software hang?

A: Because genomes float.

Why was the class upset that its parent died?

A: Because it wouldn’t be getting the inheritance!

Why do bioinformaticians always work on the command line?

A: Because they don't want to scare you with the huge amount of data!

Why did the bioinformatician attend the gay pride parade?

A: They supported polymorphism.

Why do bioinformaticians prefer awk and Perl one-liners?

A: Because even the computer can't handle loading the data.

Why don’t bioinformaticians get along with others?

A: They’re too MEAN.

Why are computational biologists cool?

A: Because they are scripted!!

Why do they talk like this: $ unzip; strip; touch; finger; grep; mount; fsck; more; yes; fsck; fsck; umount; clean; sleep;

A: Ah, Ohhh, dude, these are *NIX commands

Did they really hack genome?

A: Yes, I guess so.

Bioinformatics Made Easy: search bioinformatics tools and run genomic analysis in the cloud

Rahul Nayak — Thu, 20 Aug 2015 02:21:20 -0500

InsideDNA makes hundreds of bioinformatics tools immediately available to run via an easy-to-use web interface and allows an accurate search across all functions, tools and pipelines.

With InsideDNA, you can upload and store your own genomic/genetic datasets in a limitless cloud space, and instantly analyze them with a powerful compute instance, without any tool installation or setup hassle.

More at https://insidedna.me/


Bioinformatician - Purdue Cancer Center

Wed, 03 Feb 2021 22:54:14 -0600

The Center for Cancer Research is an NCI-designated cancer center. The center is a catalyst for collaborative cancer research around Purdue University. In this role, the selected individual will have the opportunity to cooperate with Purdue faculty and students in performing cutting-edge research and analyses, with opportunities for professional development, and the possibility of co-authorship in faculty research publications.
Projects will be challenging, including various model organisms, and we are looking for an individual who is excited about interacting with multi-disciplinary cancer research groups and developing new tools, techniques, and workflows. Independently perform both routine and project-specific analyses, and advise faculty on the design of experiments, writing manuscripts for publication, and writing grant proposals. Interact and collaborate with bioinformatics services (i.e. Statistical Consulting Center to provide relevant services to the campus research community), where applicable. Support all of the bioinformatics activities of the Center for Cancer Research at Purdue University.
Required:

Master's degree in bioinformatics, computer science, molecular biology, or related field
One year of experience in analyzing RNA-Seq data
In lieu of a degree, consideration will be given to an equivalent combination of related education and required work experience.
Understanding of molecular biology, biochemistry, and genetics
Proficiency in writing scripts using Perl, Python, Java, or equivalent languages
Proficiency in R and UNIX/LINUX
Knowledge of genomics, alignment, annotation, bioinformatics, concepts of sequence assembly
Highly motivated and detail-oriented
Ability, interest, and curiosity to learn new skills
Must possess strong communication skills to work effectively with users across disciplines
Ability to work independently and as part of a multi-disciplinary team
Strong visual, verbal, and written communication skills
Excellent time organizational skills
Preferred:

Experience writing software or building software pipelines
Experience with oncology-specific public databases including TCGA
Experience with deploying and/or running software on high-performance computational systems
Statistical and experimental design knowledge
Additional Information:

This position is contingent on the availability of funding
Purdue will not sponsor employment authorization for this position
A background check will be required for employment in this position
FLSA: Exempt (Not Eligible For Overtime)
Retirement Eligibility: Defined Contribution Waiting Period
Purdue University is an EOE/AA employer. All individuals, including minorities, women, individuals with disabilities, and veterans are encouraged to apply

More at https://careers.purdue.edu/job/West-Lafayette-Bioinformatician-Purdue-Cancer-Center-IN-47906/686617600/

Senior Statistician - Manchester or Belfast UK

Fri, 03 Jul 2015 08:06:04 -0500

The Role

My client provides innovative biomarker discovery and development services to the pharmaceutical industry. They partner with the pharmaceutical industry to develop and implement biomarker strategies, providing a full range of biomarker services from pre-clinical biomarker discovery and assay development right through to the delivery of clinical tests in their CLIA lab.

As a Senior Statistician you would support this effort and be responsible for the management of technical experimental study design and data handling processes required for the discovery, development and commercial delivery of multiplex clinical diagnostic assays. You will:

Develop analytical experimental designs for multiplex clinical diagnostic assays in accordance with regulatory requirements (e.g. CLIA, FDA)
Lead and coordinate the evaluation of analytical studies including characterization, verification, and validation studies
Lead specification setting and specification alterations
Ensure DOE methodology is routinely used in analytical studies.
Work with the Operations Department to ensure robust, reproducible and precise assay development
Provide expertise in general aspects of Statistical Process Control
Provide statistical expertise for R&D, Quality, and Manufacturing
You will work in a fast-paced, project orientated environment and the ability to plan and execute objectives under tight timelines is a must. This is a unique opportunity suited for a qualified statistician with an interest in working to deliver first class data analysis support and solutions in a clinical setting.

Requirements

MSc or PhD in statistics or a related discipline
In-depth knowledge of DOE methods to analytically validate, monitor and troubleshoot multiplex clinical diagnostic assays, ideally in a commercial/industrial setting
Experienced in the analysis of statistical technology evaluation, independent data and dependent data analysis, medical diagnostic accuracy, statistical graphics and reproducible reporting.
Excellent interpersonal and communication skills (including written and spoken English)
Ability to independently manage multiple projects and to deliver results on time per project deadlines
Proficient programming and analysis skills in one or more statistical packages (e.g. R, Stata, SAS)
The following skills, while not mandatory, are highly desirable:

Development and validation of predictive models
Experience of clinical epidemiology, survival analysis, biomarker research, Bayesian methods, quantifying predictive accuracy.
Knowledge of regulatory standards for CLIA and/or FDA IVD tests
Reward

An attractive remuneration package will reflect the importance of this role and will include 6.8 weeks annual leave (pro rata, including fixed closure days), company pension scheme, enhanced sick pay and maternity entitlements, healthcare plan and opportunities for learning and development, as well as access to a company restaurant and parking facilities.

BioGeek Fun

Jit — Sun, 16 Mar 2014 06:33:31 -0500

1. A futuristic computational biology student was told to write "It is in my gene!!!" on the board 100 times as a punishment. Here's his response:

use warnings;
for (my $count = 1; $count <= 100; $count++) { print "It is in my gene!!!\n"; }

I guess he is going to be a real biogeek. Nice try, though. Smart kid.

2. In some perl script I found this
. . . . . .
. . . . . .
# It works for me; only God understands how it works
while (/(<\/[^>]+>)|(<[^>]+>)|(<[^>]+>)$|([^><]+)/go) {
            $startGene=$1;
            $beginChromosome=$2;

. . . . . .
.. . . . . .
}

3. One more interesting message found in Perl. It will surely tickle your funny bone :)
open(my $fh, "<", "gene.txt") or kill " Me if you think this is a mistake :$!";

4. From Perl

while () { # "The Mothership Connection is here!"
    print "$_\n"; # Printing the offspring :)
}

5. Perl message
if ($1) { print "Just found an error in the chromosome!!! Yahoo..."; } else { print "That is not an error, but a mutation, you moron!"; }

6. A genome database curator walks into a wine bar and asks the bartender:
CREATE TABLE gene IF NOT EXISTS SexOnTheBeach;

A Bioinformatician’s Lament

LEGE — Thu, 29 May 2025 01:33:31 -0500

"I have a presentation tomorrow," they say,

With hopeful eyes, like it’s all child's play.
As if results bloom overnight, full-grown—
Not wrangled from chaos, and error-prone.

Oh brave soul, sit, let’s walk through the tale,
Of pipelines broken and servers that fail.
The journey starts: “The data? It’s there—
Just fetch it from S3, easy, I swear.”

Now I summon awscli with dread,
Reset my keys, credentials fed.
Configure regions, IAM roles too—
All this, and still no peek at the view.

Next up, the tool: “It’s open source!”
On GitHub, rotting, no sign of remorse.
Python 2.7, some GCC trick—
The install alone might make you sick.

Finally, progress! The pipeline runs…
Till RAM collapses and error stuns.
Oh, and the metadata? A crime,
Merged cells, font soup, out of time.

Sample IDs—what a cryptic game:
Sample_1, S1, sample-1... the same?
Controls mislabeled, cases flipped,
No wonder my sanity's starting to slip.

Then QC plots, PCA joy—
Wait, that’s a tumor labeled as a boy?
Clusters cross, and axes lie,
And I still don’t know which sample’s "guy."

But the clock ticks on, and it’s half-past doom,
They want the final UMAP soon.
With pastel colors, labeled clear—
"Can we move that legend to right here?"

Tweak by tweak, I adjust each frame,
Resize Panel B, annotate a name.
Export the plot—it starts to gleam…
Then my laptop crashes. I scream.

This is the grind, the long-haul game,
Where science hides behind code and flame.
No “Export to Nature” button to press,
Just toil and logic and hope for success.

So next time you whisper that fated line—
“I have a talk, can you make it shine?”
Know: bioinformatics is craft, not a click,
It’s science with scars, not just a quick fix.

To all who debug at 3AM light,
Who ghostwrite figures through sleepless night—
You are the backbone, silent and true,
First-author-worthy, if only they knew.

"I have a presentation tomorrow," they say,

With hopeful eyes, as if it were all easy.
As if results would appear overnight,
Rather than being dug out of a maze of data.

Come, sit down, let me tell you a story,
Where pipelines break and even the servers get tired.
The story begins: “The data is there,
Just in an S3 bucket, somewhere close by.”

Now I summon awscli with dread,
Set the keys, add credentials, fill in the region.
All that effort, and still no data,
The whole day went into setup alone.

Then comes the tool: “It's open source!”
It's on GitHub, lying dormant since 2019.
It needs Python 2.7, an old compiler,
And the power of a little prayer besides.

At last the tool ran, and I felt a flicker of joy,
But as soon as it ran, the memory gave up.
And the metadata? An Excel disaster,
Merged cells, what more do you need to know?

Sample IDs? God only knows:
Sample_1, sample-1, S1, and control1.
Are these all the same sample?
You only find out after asking two or three times.

The count matrix is ready; now it is the turn of R or Python,
Run QC, plot the PCA, but something is badly off.
Tumor and normal have swapped places,
Again and again, the same old mess.

Finally it is time for modeling,
The labor of stats, plots, and differential expression.
But the clock has already struck 5, sir,
And they want a UMAP by 8, a neat and tidy answer.

So I sit up all night writing code,
Color palette, gene labels, legend placed outside.
Fonts, panels, axes, all fixed,
I hit export... and the laptop says, "Not now, buddy!"

That is why bioinformatics takes time,
It is not “just run Seurat” or “make a volcano plot.”
It is sysadmin work, data cleaning,
QC, debugging, and the true battle of science.

So take a few lessons from this lament today:
Don't ask for a miracle 24 hours in advance.
Good figures come from clean data.
Bioinformatics is not magic; it is science.
Talk to us early, and respect the process.

And a salute to all the bioinformaticians
Who stay up nights for other people's presentations:
You are the ghostwriters of figures,
You are the unnamed co-authors.
You deserve to be first authors,
And a long sleep, too.

Note: Written with the help of AI/LLM tools!