Awk is a programming language designed for quickly manipulating whitespace-delimited (space or tab) data. Although everything it does can also be done in Perl, awk is simpler in many practical cases.
Why awk? You can replace a pipeline of 'stuff | grep | sed | cut...' with a single call to awk. For a simple script, most of the time lag comes from loading these programs into memory, so doing everything in one process is usually faster. This is ideal for something like an Openbox pipe menu, where you want to generate output on the fly. You can use awk for a neat one-liner in the terminal, or build an awk section into a shell script. There are plenty of online tutorials; here I will only show a few examples that cover most of a bioinformatician's daily uses of awk.
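To make the pipeline point concrete, here is a small sketch that runs the same job twice: once with grep and cut, once with a single awk call. The three-line sample data and the variable names are made up for the demo.

```shell
# Hypothetical sample data: name, chromosome, score (space delimited).
data='geneA chr1 10
geneB chr2 25
geneC chr1 40'

# Pipeline version: keep chr1 rows, then cut out columns 1 and 3.
pipeline_out=$(printf '%s\n' "$data" | grep ' chr1 ' | cut -d ' ' -f 1,3)

# Single awk call doing the same filter and column selection.
awk_out=$(printf '%s\n' "$data" | awk '$2 == "chr1" {print $1, $3}')

printf '%s\n' "$awk_out"
```

Both versions should give the same two lines; the awk one also matches the chromosome column exactly instead of grepping the whole line.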
choose rows where column 3 is larger than column 5:
awk '$3>$5' input.txt > output.txt
extract columns 2, 4 and 5 (the second command keeps the output tab-delimited):
awk '{print $2,$4,$5}' input.txt > output.txt
awk 'BEGIN{OFS="\t"}{print $2,$4,$5}' input.txt
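The difference between the two commands above is only the output separator. A quick sketch on a made-up five-column tab-delimited row (the variable names are just for illustration):

```shell
# Hypothetical tab-delimited input with five columns.
row=$(printf 'a\tb\tc\td\te')

# Default OFS is a single space, so the output is space separated.
spaced=$(printf '%s\n' "$row" | awk '{print $2,$4,$5}')

# Setting OFS in BEGIN keeps the output tab separated.
tabbed=$(printf '%s\n' "$row" | awk 'BEGIN{OFS="\t"}{print $2,$4,$5}')

printf '%s\n%s\n' "$spaced" "$tabbed"
```

If downstream tools expect tabs, the BEGIN{OFS="\t"} form is the one to use.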
show rows 20 to 80 (inclusive):
awk 'NR>=20&&NR<=80' input.txt > output.txt
calculate the average of column 2:
awk '{x+=$2}END{print x/NR}' input.txt
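One caveat with the average: on an empty input NR is 0, so x/NR divides by zero. A minimal guarded variant, wrapped in a hypothetical helper called mean_col2, might look like this. Note that every line counts toward NR, including a header line if the file has one.

```shell
# Print the mean of column 2, or nothing if no rows were read.
# mean_col2 is a made-up helper name; it reads the given files or stdin.
mean_col2() {
    awk '{x += $2} END {if (NR > 0) print x / NR}' "$@"
}

# Three rows with values 2, 4 and 9 in column 2.
printf '%s\n' 'a 2' 'b 4' 'c 9' | mean_col2   # prints 5
```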
filter rows with a regex (like egrep):
awk '/^test[0-9]+/' input.txt
calculate the sum of columns 2 and 3, then either append it to the end of each row or replace the first column with it:
awk '{print $0,$2+$3}' input.txt
awk '{$1=$2+$3;print}' input.txt
join two files on column 1:
awk 'BEGIN{while((getline<"file1.txt")>0)l[$1]=$0}$1 in l{print $0"\t"l[$1]}' file2.txt > output.txt
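The same join is often written with the NR==FNR idiom instead of getline, which some people find easier to read. This is a sketch of that alternative, not a replacement for the command above; the file contents are made up, and it assumes tab-delimited input and a non-empty first file.

```shell
# Hypothetical demo files in a temporary directory.
tmp=$(mktemp -d)
printf 'id1\tred\nid2\tblue\n' > "$tmp/file1.txt"
printf 'id2\t7\nid3\t8\nid1\t9\n' > "$tmp/file2.txt"

# NR==FNR is true only while the first file is being read:
# store its lines by key, then look up each key from the second file.
joined=$(awk -F '\t' 'NR==FNR {l[$1]=$0; next} $1 in l {print $0 "\t" l[$1]}' \
    "$tmp/file1.txt" "$tmp/file2.txt")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Rows of file2.txt whose key is missing from file1.txt (id3 here) are dropped, as in the getline version.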
count the number of occurrences of each value in column 2 (like sort | uniq -c):
awk '{l[$2]++}END{for (x in l) print x,l[x]}' input.txt
apply "uniq" on column 2, only printing the first occurrence (uniq):
awk '!($2 in l){print;l[$2]=1}' input.txt
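You will often see the same thing written more compactly as !seen[$2]++. A small sketch comparing the two on made-up data (the array name seen is arbitrary):

```shell
# Hypothetical input: column 2 has repeated, non-adjacent values.
data='r1 x
r2 y
r3 x
r4 z
r5 y'

# Long form from above: print a row only if its key was not stored yet.
long_form=$(printf '%s\n' "$data" | awk '!($2 in l){print;l[$2]=1}')

# Short form: seen[$2]++ is 0 (false) the first time a key appears.
short_form=$(printf '%s\n' "$data" | awk '!seen[$2]++')

printf '%s\n' "$short_form"
```

Unlike uniq, neither form needs the duplicates to be adjacent, so the input does not have to be sorted first.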
count occurrences of each distinct word:
awk '{for(i=1;i<=NF;++i)c[$i]++}END{for (x in c) print x,c[x]}' input.txt
deal with simple CSV (no quoted fields containing commas):
awk -F, '{print $1,$2}' input.txt
substitution (sed is simpler in this case):
awk '{sub(/test/, "no", $0);print}' input.txt
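Keep in mind that sub only replaces the first match on each line; gsub replaces all of them. A quick sketch on a made-up line:

```shell
# Hypothetical input line with two occurrences of "test".
line='test one test two'

# sub: only the first match in the record is replaced.
first_only=$(printf '%s\n' "$line" | awk '{sub(/test/, "no"); print}')

# gsub: every match in the record is replaced.
all_matches=$(printf '%s\n' "$line" | awk '{gsub(/test/, "no"); print}')

printf '%s\n%s\n' "$first_only" "$all_matches"
```

When the third argument is left out, both functions work on $0, so this is equivalent to the explicit $0 form above.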
OK, now here's where to read this stuff properly explained.
Two thorough tutorials:
http://www.gnu.org/software/gawk/manual/gawk.html
http://www.grymoire.com/Unix/Awk.html
A famous list of useful one-liners - though they're short, many are quite tricky:
http://www.pement.org/awk/awk1line.txt
And some nice explanations of those one-liners. After reading this you'll have a pretty good grasp!
http://www.catonmat.net/blog/awk-one-li … -part-one/
http://www.catonmat.net/blog/ten-awk-ti … -pitfalls/
Comments
This MIT wiki page demonstrates how to perform some basic bioinformatics tasks using simple UNIX commands.
http://rous.mit.edu/index.php/Unix_commands_applied_to_bioinformatics
The Aakhsyan webpage explains commonly used sed and awk commands for bioinformatics:
http://raunakms.wordpress.com/2013/06/08/sed-and-awk-for-bioinformatics/
Handy one-liners at http://bioinformatics.whatheblog.com/2010/03/handy-one-liners-awk/
BioUnix toolbox for bioinformaticians
http://lh3lh3.users.sourceforge.net/biounix.shtml
There is BioAwk, specially designed for bioinformaticians by ialbert: https://github.com/ialbert/bioawk-tools
Enjoy
To double-space a file:
$ awk '1; { print "" }' input.txt
To print the number of words in a file:
$ awk '{ total = total + NF }; END { print total+0 }' input.txt
Awk, Linux, R tutorial by EMBL
http://www.embl.de/~rausch/primer.pdf
Some commonly used bioinformatics one-liners by Stephen Turner: https://github.com/stephenturner/oneliners
Print each line of a tab-delimited file where the 8th field is 10090:
awk -F "\t" '$8 == 10090 { print $0 }' myFile
Print fields 1, 2, 3 from a tab-delimited file where the 4th field contains a '99':
awk -F "\t" '$4 ~ /99/ {print $1"\t"$2"\t"$3}' myFile
I love awk, but I recommend you try bioawk. Bioawk is a modified version of awk that parses some common sequence formats: https://github.com/lh3/bioawk
Most commonly used Unix/Linux commands for bioinformatics: http://rous.mit.edu/index.php/Unix_commands_applied_to_bioinformatics
Some useful Unix one-liners: http://genomics-array.blogspot.in/2010/11/some-unixperl-oneliners-for.html
One of the best-known cheat sheets for AWKians: www.catonmat.net/download/awk.cheat.sheet.txt
Rename the sequences in a multi-FASTA file with awk:
awk '/^>/{print ">chromosome" ++i; next}{print}' < file.fasta
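Here is what that does on a tiny made-up FASTA (the sequence names and bases are invented). The logic is the same as above, with the increment parenthesized for readability.

```shell
# Hypothetical two-record FASTA; the second record spans two lines.
fasta='>scaffold_12 len=4
ACGT
>scaffold_7 len=6
GGCC
TT'

# Header lines are replaced with a running counter; sequence lines pass through.
renamed=$(printf '%s\n' "$fasta" | awk '/^>/{print ">chromosome" (++i); next}{print}')

printf '%s\n' "$renamed"
```

The original header text is discarded, so keep a copy of the input if you need to map the new names back to the old ones.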