BOL: Most Commonly used Awk by Bioinformatician

Last updated 4112 days ago by Neel

Awk is a programming language designed for quickly manipulating whitespace-delimited text data. Although Perl can do everything awk does, awk is simpler in many practical cases.

Why awk? You can replace a pipeline of 'stuff | grep | sed | cut...' with a single call to awk. For a simple script, most of the time goes into loading these programs into memory, so doing it all with one program is much faster. This is ideal for something like an openbox pipe menu, where you want to generate something on the fly. You can use awk to write a neat one-liner for a quick job in the terminal, or build an awk section into a shell script. There are many tutorials online; here I will only show a few examples, which cover most of a bioinformatician's daily uses of awk.

choose rows where column 3 is larger than column 5:

awk '$3>$5' input.txt > output.txt

extract columns 2, 4 and 5 (the second form writes tab-separated output):

awk '{print $2,$4,$5}' input.txt > output.txt

awk 'BEGIN{OFS="\t"}{print $2,$4,$5}' input.txt

show rows 20 through 80:

awk 'NR>=20&&NR<=80' input.txt > output.txt

calculate the average of column 2:

awk '{x+=$2}END{print x/NR}' input.txt
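
Note that dividing by NR counts every line, including header and blank lines, and fails on empty input. A minimal sketch of a more defensive variant, assuming that only rows with a plain numeric second column should count (the sample data and the variable name avg are made up for illustration):

```shell
# Hypothetical sample data: a header line plus three data rows.
data='name score
a 10
b 20
c 30'

# Count only rows whose second column looks like a number,
# and print NA instead of dividing by zero when there are none.
avg=$(printf '%s\n' "$data" | awk '
  $2 ~ /^-?[0-9]+(\.[0-9]+)?$/ { sum += $2; n++ }
  END { if (n > 0) print sum / n; else print "NA" }
')
echo "$avg"
```

Here the header is skipped because "score" does not match the numeric pattern, so the average is taken over the three data rows only.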

regex (egrep):

awk '/^test[0-9]+/' input.txt

calculate the sum of column 2 and 3 and put it at the end of a row or replace the first column:

awk '{print $0,$2+$3}' input.txt

awk '{$1=$2+$3;print}' input.txt
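
One side effect worth knowing: assigning to a field makes awk rebuild the whole record using OFS, which is a single space by default, so tab-delimited input comes out space-delimited. A minimal sketch that keeps the tabs by setting FS and OFS together (the sample row and the variable name row are made up):

```shell
# Hypothetical tab-delimited row: a <TAB> 2 <TAB> 3
# FS and OFS are both set to a tab, so the rebuilt record stays tab-delimited.
row=$(printf 'a\t2\t3\n' | awk 'BEGIN { FS = OFS = "\t" } { $1 = $2 + $3; print }')
printf '%s\n' "$row"
```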

join two files on column 1:

awk 'BEGIN{while((getline<"file1.txt")>0)l[$1]=$0}$1 in l{print $0"\t"l[$1]}' file2.txt > output.txt
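
The same join is often written with the NR==FNR idiom instead of getline. The sketch below is one way to do it, assuming tab-delimited files; the file contents are invented for the demonstration and written to a temporary directory:

```shell
# Hypothetical input files, created in a temp directory for the demo.
tmp=$(mktemp -d)
printf 'g1\tkinase\ng2\tphosphatase\n' > "$tmp/file1.txt"
printf 'g2\t5.1\ng3\t0.4\ng1\t2.7\n' > "$tmp/file2.txt"

# NR == FNR holds only while the first file is being read: store its lines by key.
# For the second file, print rows whose key was stored, followed by the stored line.
# Caveat: if the first file is empty, NR == FNR is also true for the second file.
joined=$(awk -F'\t' '
  NR == FNR { l[$1] = $0; next }
  $1 in l   { print $0 "\t" l[$1] }
' "$tmp/file1.txt" "$tmp/file2.txt")
printf '%s\n' "$joined"
rm -r "$tmp"
```

Rows of file2.txt without a match in file1.txt (g3 here) are dropped, and the output keeps the row order of file2.txt.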

count the number of occurrences of each value in column 2 (like sort | uniq -c):

awk '{l[$2]++}END{for (x in l) print x,l[x]}' input.txt

apply "uniq" on column 2, only printing the first occurrence (uniq):

awk '!($2 in l){print;l[$2]=1}' input.txt
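
A shorter, widely used spelling of the same idea is '!seen[$2]++'. Unlike uniq, which only collapses adjacent duplicates, this works on unsorted input. A small sketch on made-up data (the array name seen and the variable name first are arbitrary):

```shell
# Hypothetical sample data with repeated values in column 2.
data='r1 A
r2 B
r3 A
r4 C
r5 B'

# seen[$2]++ evaluates to 0 (false) the first time a value appears,
# so the negation is true only for first occurrences, which are printed.
first=$(printf '%s\n' "$data" | awk '!seen[$2]++')
printf '%s\n' "$first"
```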

count how often each word occurs:

awk '{for(i=1;i<=NF;++i)c[$i]++}END{for (x in c) print x,c[x]}' input.txt
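
A worked example of per-word counting on made-up text. The loop runs through field NF inclusive so the last word on each line is counted, and the output is piped through sort because the order of "for (x in c)" is unspecified:

```shell
# Hypothetical two-line input text.
text='the cat
the dog and the cat'

# Tally every whitespace-separated field, then print word/count pairs.
counts=$(printf '%s\n' "$text" | awk '
  { for (i = 1; i <= NF; i++) c[$i]++ }
  END { for (w in c) print w, c[w] }
' | LC_ALL=C sort)
printf '%s\n' "$counts"
```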

deal with simple CSV:

awk -F, '{print $1,$2}'
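
The word "simple" matters here: -F, splits on every comma, including commas inside quoted fields. A small sketch of that limitation, using a made-up record with three logical columns:

```shell
# Hypothetical CSV record whose second column contains a quoted comma.
line='gene1,"kinase, putative",42'

# awk sees four fields, because the comma inside the quotes also splits.
nf=$(printf '%s\n' "$line" | awk -F, '{ print NF }')
echo "$nf"
```

For files like this, a CSV-aware tool is usually the safer choice.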

substitution (sed is simpler in this case):

awk '{sub(/test/, "no", $0);print}' input.txt

OK, now here's where to read this stuff properly explained.

Two thorough tutorials:

http://www.gnu.org/software/gawk/manual/gawk.html

http://www.grymoire.com/Unix/Awk.html

A famous list of useful one-liners - though they're short, many are quite tricky:

http://www.pement.org/awk/awk1line.txt

And some nice explanations of those one-liners. After reading this you'll have a pretty good grasp!

http://www.catonmat.net/blog/awk-one-li … -part-one/

http://www.catonmat.net/blog/ten-awk-ti … -pitfalls/

Comments

- Jitendra Narayan@admin
Jitendra Narayan 4108 days ago
This MIT wiki page demonstration shows how to perform some basic bioinformatics tasks using simple UNIX commands.
http://rous.mit.edu/index.php/Unix_commands_applied_to_bioinformatics
- Archana Malhotra@archana
Archana Malhotra 4108 days ago
Aakhsyan webpage explains commonly used sed and awk for bioinformatics at
http://raunakms.wordpress.com/2013/06/08/sed-and-awk-for-bioinformatics/
Handy OneLiner at http://bioinformatics.whatheblog.com/2010/03/handy-one-liners-awk/
BioUnix toolbox for bioinformatician
http://lh3lh3.users.sourceforge.net/biounix.shtml
- Jitendra Narayan@admin
Jitendra Narayan 4108 days ago
There is BioAwk, specially designed for bioinformaticians by ialbert: https://github.com/ialbert/bioawk-tools
Enjoy
- Poonam Mahapatra@poonam
Poonam Mahapatra 4102 days ago
To double-space a file:
$ awk '1; { print "" }'
To print the number of words in a file:
$ awk '{ total = total + NF }; END { print total+0 }'
- Rahul Nayak@rahul
Rahul Nayak 4029 days ago
Awk, Linux, R tutorial by EMBL
http://www.embl.de/~rausch/primer.pdf
- Archana Malhotra@archana
Archana Malhotra 4011 days ago
Some commonly used bioinformatics one-liners by Stephen Turner: https://github.com/stephenturner/oneliners
- Rahul Nayak@rahul
Rahul Nayak 3909 days ago
Print the lines of a tab-delimited file where the 8th field is 10090:

awk -F "\t" '$8 == 10090 { print $0 }' myFile

Print fields 1, 2, 3 from a tab-delimited file where the 4th field contains a '99':

awk -F "\t" '$4 ~ /99/ {print $1"\t"$2"\t"$3}' myFile
- Aaryan Lokwani@aaryan
Aaryan Lokwani 3881 days ago
I love awk, but I recommend trying bioawk. Bioawk is a modified version of awk that parses some common sequence formats. https://github.com/lh3/bioawk
- Alok Prajapati@Alok
Alok Prajapati 3863 days ago
Most commonly used Unix/Linux commands for bioinformatics: http://rous.mit.edu/index.php/Unix_commands_applied_to_bioinformatics
- Shruti Paniwala@shruti
Shruti Paniwala 3863 days ago
Some useful Unix one-liners: http://genomics-array.blogspot.in/2010/11/some-unixperl-oneliners-for.html
- Rahul Nayak@rahul
Rahul Nayak 3827 days ago
To deal with simple CSV: awk -F, '{print $1,$2}'
- John Parker@parker
John Parker 3714 days ago
One of the best-known cheat sheets for AWKians: www.catonmat.net/download/awk.cheat.sheet.txt
- Neel@neelam
Neel 3180 days ago
Rename the sequences in a multi-FASTA file with awk:
awk '/^>/{print ">chromosome" ++i; next}{print}' < file.fasta
