BOL: Most Commonly used Awk by Bioinformatician

Neel — Mon, 19 Aug 2013 01:12:38 -0500

Awk is a programming language designed for quickly manipulating whitespace-delimited (or otherwise delimited) text data. Although you can achieve all of its functionality with Perl, awk is simpler in many practical cases.

Why awk? You can replace a pipeline of 'stuff | grep | sed | cut...' with a single call to awk. For a simple script, much of the run time goes into starting each of these programs, so doing it all in one process is often faster. This is ideal for something like an openbox pipe menu where you want to generate something on the fly. You can use awk to make a neat one-liner for some quick job in the terminal, or build an awk section into a shell script. You can find a lot of online tutorials, but here I will only show a few examples which cover most of a bioinformatician's daily uses of awk.

choose rows where column 3 is larger than column 5:

awk '$3>$5' input.txt > output.txt
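To see the filter in action, here is a minimal sketch on made-up data (the three sample rows are assumptions, not from a real dataset). Only the first row has a third column larger than its fifth, so only that row should be printed. Awk compares the two fields numerically when both look like numbers.

```shell
# Hypothetical sample data: five whitespace-separated columns per row.
out=$(printf 'a 1 10 x 5\nb 2 3 y 7\nc 3 8 z 8\n' | awk '$3>$5')

# Show the rows that passed the filter.
printf '%s\n' "$out"
```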

extract columns 2, 4 and 5 (the second form writes tab-separated output):

awk '{print $2,$4,$5}' input.txt > output.txt

awk 'BEGIN{OFS="\t"}{print $2,$4,$5}' input.txt

show rows 20 through 80:

awk 'NR>=20&&NR<=80' input.txt > output.txt

calculate the average of column 2:

awk '{x+=$2}END{print x/NR}' input.txt
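Dividing by NR counts every line, so a header line would skew the average. A minimal sketch, assuming the input has exactly one header line (the sample data and the variable names sum and n are made up for illustration): it skips the header and counts only the rows it actually adds up.

```shell
# Hypothetical input: one header line followed by three data rows.
out=$(printf 'id score\na 2\nb 4\nc 9\n' |
  awk 'NR>1 {sum+=$2; n++} END {if (n) print sum/n}')

# Print the mean of column 2 over the data rows only.
printf '%s\n' "$out"
```

The `if (n)` guard avoids a division by zero when there are no data rows.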

regex (egrep):

awk '/^test[0-9]+/' input.txt

calculate the sum of columns 2 and 3 and append it to the end of the row, or replace the first column with it:

awk '{print $0,$2+$3}' input.txt

awk '{$1=$2+$3;print}' input.txt

join two files on column 1:

awk 'BEGIN{while((getline<"file1.txt")>0)l[$1]=$0}$1 in l{print $0"\t"l[$1]}' file2.txt > output.txt
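An alternative sketch of the same join uses the common NR==FNR idiom instead of getline: while awk reads the first file, NR (records read overall) equals FNR (records read in the current file), so the first block only stores file1.txt. The file contents below are made up for illustration, and the idiom assumes file1.txt is not empty.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
printf 'g1\tkinase\ng2\tligase\n' > "$dir/file1.txt"
printf 'g2\t5.1\ng3\t0.2\ng1\t7.4\n' > "$dir/file2.txt"

# First file: remember each line by its key in column 1.
# Second file: print the line plus the stored match, if the key was seen.
out=$(awk -F'\t' 'NR==FNR {l[$1]=$0; next} $1 in l {print $0 "\t" l[$1]}' \
  "$dir/file1.txt" "$dir/file2.txt")

printf '%s\n' "$out"
rm -r "$dir"
```

Rows of file2.txt without a partner in file1.txt (g3 here) are dropped, as in an inner join.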

count the number of occurrences of each value in column 2 (like sort | uniq -c):

awk '{l[$2]++}END{for (x in l) print x,l[x]}' input.txt
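The order in which `for (x in l)` visits the keys is not defined, so the output order can differ between awk implementations. A small sketch on made-up data that pipes the result through sort to get a stable order:

```shell
# Hypothetical input: read names in column 1, chromosome in column 2.
out=$(printf 'r1 chr1\nr2 chr2\nr3 chr1\nr4 chr1\n' |
  awk '{l[$2]++} END {for (x in l) print x, l[x]}' | sort)

# One line per distinct value of column 2, followed by its count.
printf '%s\n' "$out"
```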

apply "uniq" on column 2, only printing the first occurrence (uniq):

awk '!($2 in l){print;l[$2]=1}' input.txt
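A shorter form of the same idea is the widely used `!seen[$2]++` idiom: the counter is 0 (false) the first time a key appears, so the negation is true and the line is printed. The array name seen and the sample rows are assumptions for illustration. Unlike uniq, this also drops duplicates that are not adjacent.

```shell
# Hypothetical input with repeated values in column 2.
out=$(printf 'a x\nb y\nc x\nd z\ne y\n' | awk '!seen[$2]++')

# Only the first line for each distinct column-2 value survives.
printf '%s\n' "$out"
```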

count how often each word occurs (word frequency):

awk '{for(i=1;i<=NF;++i)c[$i]++}END{for (x in c) print x,c[x]}' input.txt
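A quick check of the word-count idea on two made-up lines, assuming words are separated by whitespace. The loop runs through field NF inclusive so that the last word of each line is counted, and sort gives a stable output order.

```shell
# Hypothetical two-line input; "the" appears three times.
out=$(printf 'the cat\nthe dog the\n' |
  awk '{for (i=1; i<=NF; i++) c[$i]++} END {for (x in c) print x, c[x]}' | sort)

# One line per distinct word with its frequency.
printf '%s\n' "$out"
```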

deal with simple CSV:

awk -F, '{print $1,$2}'
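"Simple" matters here: -F, splits on every comma, including commas inside quoted fields. A minimal sketch on made-up CSV lines shows both the normal case (with tab-separated output via OFS) and the quoted-field pitfall; the sample records are assumptions.

```shell
# Plain CSV: print the first two columns, tab-separated.
out=$(printf 'id,name,len\ns1,BRCA1,81189\n' |
  awk -F, 'BEGIN{OFS="\t"} {print $1,$2}')
printf '%s\n' "$out"

# Pitfall: a quoted field containing a comma is split in two,
# so this three-column record is seen as four fields.
nf=$(printf 's1,"a, b",3\n' | awk -F, '{print NF}')
printf '%s\n' "$nf"
```

For CSV with quoted fields, a dedicated CSV parser is usually the safer choice.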

substitution (sed is simpler in this case):

awk '{sub(/test/, "no", $0);print}' input.txt

OK, now here's where to read this stuff properly explained.

Two thorough tutorials:

http://www.gnu.org/software/gawk/manual/gawk.html

http://www.grymoire.com/Unix/Awk.html

A famous list of useful one-liners - though they're short, many are quite tricky:

http://www.pement.org/awk/awk1line.txt

And some nice explanations of those one-liners. After reading this you'll have a pretty good grasp!

http://www.catonmat.net/blog/awk-one-li … -part-one/

http://www.catonmat.net/blog/ten-awk-ti … -pitfalls/

Comment by Neel

Neel — Tue, 08 Mar 2016 11:09:38 -0600

Rename the sequences of a multi-FASTA file with awk

awk '/^>/{print ">chromosome" ++i; next}{print}' < file.fasta
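A small sketch of the rename pattern on a made-up two-record FASTA input (the sequence names and bases are assumptions). Header lines are replaced by a running chromosome number and sequence lines pass through unchanged; the parentheses around ++i are only there for readability.

```shell
# Hypothetical multi-FASTA input with two records.
out=$(printf '>seqA desc\nACGT\nGGCC\n>seqB\nTTAA\n' |
  awk '/^>/ {print ">chromosome" (++i); next} {print}')

# Headers become >chromosome1, >chromosome2, ...; sequences are untouched.
printf '%s\n' "$out"
```

Note that the original header text (including any description) is discarded.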

Comment by John Parker

John Parker — Sun, 21 Sep 2014 16:55:38 -0500

One of the best-known cheat sheets for awk users: www.catonmat.net/download/awk.cheat.sheet.txt

Comment by Rahul Nayak

Rahul Nayak — Sat, 31 May 2014 15:44:31 -0500

To deal with simple CSV: awk -F, '{print $1,$2}'

Comment by Shruti Paniwala

Shruti Paniwala — Fri, 25 Apr 2014 20:23:51 -0500

Some useful Unix one-liners: http://genomics-array.blogspot.in/2010/11/some-unixperl-oneliners-for.html

Comment by Alok Prajapati

Alok Prajapati — Thu, 24 Apr 2014 21:33:39 -0500

Most commonly used Unix/Linux commands for bioinformatics: http://rous.mit.edu/index.php/Unix_commands_applied_to_bioinformatics

Comment by Aaryan Lokwani

Aaryan Lokwani — Mon, 07 Apr 2014 01:29:59 -0500

I love awk, but I recommend you try bioawk. Bioawk is a modified version of awk that can parse some common sequence formats. https://github.com/lh3/bioawk

Comment by Rahul Nayak

Rahul Nayak — Mon, 10 Mar 2014 03:37:10 -0500

Print each line of a tab-delimited file where the 8th field is 10090:

awk -F "\t" '$8 == 10090 { print $0 }' myFile

Print fields 1, 2, 3 from a tab-delimited file where the 4th field contains a '99':

awk -F "\t" '$4 ~ /99/ {print $1"\t"$2"\t"$3}' myFile

Comment by Archana Malhotra

Archana Malhotra — Thu, 28 Nov 2013 18:44:02 -0600

Some commonly used bioinformatics one-liners by Stephen Turner: https://github.com/stephenturner/oneliners

Comment by Rahul Nayak

Rahul Nayak — Sun, 10 Nov 2013 12:49:25 -0600

Awk, Linux, R tutorial by EMBL

http://www.embl.de/~rausch/primer.pdf

Comment by Poonam Mahapatra

Poonam Mahapatra — Thu, 29 Aug 2013 08:38:32 -0500

To double-space a file:
$ awk '1; { print "" }'

To print the number of words in a file:
$ awk '{ total = total + NF }; END { print total+0 }'

Comment by Jitendra Narayan

Jitendra Narayan — Fri, 23 Aug 2013 10:21:00 -0500

There is BioAwk, specially designed for bioinformaticians; see the tools by ialbert: https://github.com/ialbert/bioawk-tools

Njoy

Comment by Archana Malhotra

Archana Malhotra — Fri, 23 Aug 2013 10:15:24 -0500

The Aakhsyan webpage explains commonly used sed and awk commands for bioinformatics at

http://raunakms.wordpress.com/2013/06/08/sed-and-awk-for-bioinformatics/

Handy one-liners at http://bioinformatics.whatheblog.com/2010/03/handy-one-liners-awk/

BioUnix toolbox for bioinformaticians

http://lh3lh3.users.sourceforge.net/biounix.shtml

Comment by Jitendra Narayan

Jitendra Narayan — Fri, 23 Aug 2013 10:10:32 -0500

This MIT wiki page shows how to perform some basic bioinformatics tasks using simple UNIX commands.

http://rous.mit.edu/index.php/Unix_commands_applied_to_bioinformatics