<?xml version='1.0'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:atom="http://www.w3.org/2005/Atom" >
<channel>
	<title><![CDATA[BOL: April 2019]]></title>
	<link>https://bioinformaticsonline.com/blog/archive/biojoker/1554094800/1556686800?</link>
	<atom:link href="https://bioinformaticsonline.com/blog/archive/biojoker/1554094800/1556686800?" rel="self" type="application/rss+xml" />
	<description><![CDATA[]]></description>
	
	<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/blog/view/39307/awk-for-beginners</guid>
	<pubDate>Fri, 26 Apr 2019 16:19:41 -0500</pubDate>
	<link>https://bioinformaticsonline.com/blog/view/39307/awk-for-beginners</link>
	<title><![CDATA[AWK for beginners !]]></title>
	<description><![CDATA[<p>AWK is a standard tool on every POSIX-compliant UNIX system. It&rsquo;s like flex/lex, from the command-line, perfect for text-processing tasks and other scripting needs. It has a C-like syntax, but without mandatory semicolons (although, you should use them anyway, because they are required when you&rsquo;re writing one-liners, something AWK excels at), manual memory management, or static typing. It excels at text processing. You can call to it from a shell script, or you can use it as a stand-alone scripting language.</p><p>Why use AWK instead of Perl? Readability. AWK is easier to read than Perl. For simple text-processing scripts, particularly ones that read files line by line and split on delimiters, AWK is probably the right tool for the job.</p><div><pre><span>#!/usr/bin/awk -f</span>

<span># Comments are like this</span>


<span># AWK programs consist of a collection of patterns and actions.</span>
<span>pattern1</span> <span>{</span> <span>action</span><span>;</span> <span>}</span> <span># just like lex</span>
<span>pattern2</span> <span>{</span> <span>action</span><span>;</span> <span>}</span>

<span># There is an implied loop and AWK automatically reads and parses each</span>
<span># record of each file supplied. Each record is split by the FS delimiter,</span>
<span># which defaults to white-space (multiple spaces,tabs count as one)</span>
<span># You can assign FS either on the command line (-F C) or in your BEGIN</span>
<span># pattern</span>

<span># One of the special patterns is BEGIN. The BEGIN pattern is true</span>
<span># BEFORE any of the files are read. The END pattern is true after</span>
<span># an End-of-file from the last file (or standard-in if no files specified)</span>
<span># There is also an output field separator (OFS) that you can assign, which</span>
<span># defaults to a single space</span>

<span>BEGIN</span> <span>{</span>

    <span># BEGIN will run at the beginning of the program. It's where you put all</span>
    <span># the preliminary set-up code, before you process any text files. If you</span>
    <span># have no text files, then think of BEGIN as the main entry point.</span>

    <span># Variables are global. Just set them or use them, no need to declare..</span>
    <span>count</span> <span>=</span> <span>0</span><span>;</span>

    <span># Operators just like in C and friends</span>
    <span>a</span> <span>=</span> <span>count</span> <span>+</span> <span>1</span><span>;</span>
    <span>b</span> <span>=</span> <span>count</span> <span>-</span> <span>1</span><span>;</span>
    <span>c</span> <span>=</span> <span>count</span> <span>*</span> <span>1</span><span>;</span>
    <span>d</span> <span>=</span> <span>count</span> <span>/</span> <span>1</span><span>;</span> <span># integer division</span>
    <span>e</span> <span>=</span> <span>count</span> <span>%</span> <span>1</span><span>;</span> <span># modulus</span>
    <span>f</span> <span>=</span> <span>count</span> <span>^</span> <span>1</span><span>;</span> <span># exponentiation</span>

    <span>a</span> <span>+=</span> <span>1</span><span>;</span>
    <span>b</span> <span>-=</span> <span>1</span><span>;</span>
    <span>c</span> <span>*=</span> <span>1</span><span>;</span>
    <span>d</span> <span>/=</span> <span>1</span><span>;</span>
    <span>e</span> <span>%=</span> <span>1</span><span>;</span>
    <span>f</span> <span>^=</span> <span>1</span><span>;</span>

    <span># Incrementing and decrementing by one</span>
    <span>a</span><span>++</span><span>;</span>
    <span>b</span><span>--</span><span>;</span>

    <span># As a prefix operator, it returns the incremented value</span>
    <span>++</span><span>a</span><span>;</span>
    <span>--</span><span>b</span><span>;</span>

    <span># Notice, also, no punctuation such as semicolons to terminate statements</span>

    <span># Control statements</span>
    <span>if</span> <span>(</span><span>count</span> <span>==</span> <span>0</span><span>)</span>
        <span>print</span> <span>"Starting with count of 0"</span><span>;</span>
    <span>else</span>
        <span>print</span> <span>"Huh?"</span><span>;</span>

    <span># Or you could use the ternary operator</span>
    <span>print</span> <span>(</span><span>count</span> <span>==</span> <span>0</span><span>)</span> <span>?</span> <span>"Starting with count of 0"</span> <span>:</span> <span>"Huh?"</span><span>;</span>

    <span># Blocks consisting of multiple lines use braces</span>
    <span>while</span> <span>(</span><span>a</span> <span>&lt;</span> <span>10</span><span>)</span> <span>{</span>
        <span>print</span> <span>"String concatenation is done"</span> <span>" with a series"</span> <span>" of"</span>
            <span>" space-separated strings"</span><span>;</span>
        <span>print</span> <span>a</span><span>;</span>

        <span>a</span><span>++</span><span>;</span>
    <span>}</span>

    <span>for</span> <span>(</span><span>i</span> <span>=</span> <span>0</span><span>;</span> <span>i</span> <span>&lt;</span> <span>10</span><span>;</span> <span>i</span><span>++</span><span>)</span>
        <span>print</span> <span>"Good ol' for loop"</span><span>;</span>

    <span># As for comparisons, they're the standards:</span>
    <span># a &lt; b   # Less than</span>
    <span># a &lt;= b  # Less than or equal</span>
    <span># a != b  # Not equal</span>
    <span># a == b  # Equal</span>
    <span># a &gt; b   # Greater than</span>
    <span># a &gt;= b  # Greater than or equal</span>

    <span># Logical operators as well</span>
    <span># a &amp;&amp; b  # AND</span>
    <span># a || b  # OR</span>

    <span># In addition, there's the super useful regular expression match</span>
    <span>if</span> <span>(</span><span>"foo"</span> <span>~</span> <span>"^fo+$"</span><span>)</span>
        <span>print</span> <span>"Fooey!"</span><span>;</span>
    <span>if</span> <span>(</span><span>"boo"</span> <span>!~</span> <span>"^fo+$"</span><span>)</span>
        <span>print</span> <span>"Boo!"</span><span>;</span>

    <span># Arrays</span>
    <span>arr</span><span>[</span><span>0</span><span>]</span> <span>=</span> <span>"foo"</span><span>;</span>
    <span>arr</span><span>[</span><span>1</span><span>]</span> <span>=</span> <span>"bar"</span><span>;</span>

    <span># You can also initialize an array with the built-in function split()</span>

    <span>n</span> <span>=</span> <span>split</span><span>(</span><span>"foo:bar:baz"</span><span>,</span> <span>arr</span><span>,</span> <span>":"</span><span>);</span>

    <span># You also have associative arrays (actually, they're all associative arrays)</span>
    <span>assoc</span><span>[</span><span>"foo"</span><span>]</span> <span>=</span> <span>"bar"</span><span>;</span>
    <span>assoc</span><span>[</span><span>"bar"</span><span>]</span> <span>=</span> <span>"baz"</span><span>;</span>

    <span># And multi-dimensional arrays, with some limitations I won't mention here</span>
    <span>multidim</span><span>[</span><span>0</span><span>,</span><span>0</span><span>]</span> <span>=</span> <span>"foo"</span><span>;</span>
    <span>multidim</span><span>[</span><span>0</span><span>,</span><span>1</span><span>]</span> <span>=</span> <span>"bar"</span><span>;</span>
    <span>multidim</span><span>[</span><span>1</span><span>,</span><span>0</span><span>]</span> <span>=</span> <span>"baz"</span><span>;</span>
    <span>multidim</span><span>[</span><span>1</span><span>,</span><span>1</span><span>]</span> <span>=</span> <span>"boo"</span><span>;</span>

    <span># You can test for array membership</span>
    <span>if</span> <span>(</span><span>"foo"</span> <span>in</span> <span>assoc</span><span>)</span>
        <span>print</span> <span>"Fooey!"</span><span>;</span>

    <span># You can also use the 'in' operator to traverse the keys of an array</span>
    <span>for</span> <span>(</span><span>key</span> <span>in</span> <span>assoc</span><span>)</span>
        <span>print</span> <span>assoc</span><span>[</span><span>key</span><span>];</span>

    <span># The command line is in a special array called ARGV</span>
    <span>for</span> <span>(</span><span>argnum</span> <span>in</span> <span>ARGV</span><span>)</span>
        <span>print</span> <span>ARGV</span><span>[</span><span>argnum</span><span>];</span>

    <span># You can remove elements of an array</span>
    <span># This is particularly useful to prevent AWK from assuming the arguments</span>
    <span># are files for it to process</span>
    <span>delete</span> <span>ARGV</span><span>[</span><span>1</span><span>];</span>

    <span># The number of command line arguments is in a variable called ARGC</span>
    <span>print</span> <span>ARGC</span><span>;</span>

    <span># AWK has several built-in functions. They fall into three categories. I'll</span>
    <span># demonstrate each of them in their own functions, defined later.</span>

    <span>return_value</span> <span>=</span> <span>arithmetic_functions</span><span>(</span><span>a</span><span>,</span> <span>b</span><span>,</span> <span>c</span><span>);</span>
    <span>string_functions</span><span>();</span>
    <span>io_functions</span><span>();</span>
<span>}</span>

<span># Here's how you define a function</span>
<span>function</span> <span>arithmetic_functions</span><span>(</span><span>a</span><span>,</span> <span>b</span><span>,</span> <span>c</span><span>,</span>     <span>d</span><span>)</span> <span>{</span>

    <span># Probably the most annoying part of AWK is that there are no local</span>
    <span># variables. Everything is global. For short scripts, this is fine, even</span>
    <span># useful, but for longer scripts, this can be a problem.</span>

    <span># There is a work-around (ahem, hack). Function arguments are local to the</span>
    <span># function, and AWK allows you to define more function arguments than it</span>
    <span># needs. So just stick local variable in the function declaration, like I</span>
    <span># did above. As a convention, stick in some extra whitespace to distinguish</span>
    <span># between actual function parameters and local variables. In this example,</span>
    <span># a, b, and c are actual parameters, while d is merely a local variable.</span>

    <span># Now, to demonstrate the arithmetic functions</span>

    <span># Most AWK implementations have some standard trig functions</span>
    <span>localvar</span> <span>=</span> <span>sin</span><span>(</span><span>a</span><span>);</span>
    <span>localvar</span> <span>=</span> <span>cos</span><span>(</span><span>a</span><span>);</span>
    <span>localvar</span> <span>=</span> <span>atan2</span><span>(</span><span>b</span><span>,</span> <span>a</span><span>);</span> <span># arc tangent of b / a</span>

    <span># And logarithmic stuff</span>
    <span>localvar</span> <span>=</span> <span>exp</span><span>(</span><span>a</span><span>);</span>
    <span>localvar</span> <span>=</span> <span>log</span><span>(</span><span>a</span><span>);</span>

    <span># Square root</span>
    <span>localvar</span> <span>=</span> <span>sqrt</span><span>(</span><span>a</span><span>);</span>

    <span># Truncate floating point to integer</span>
    <span>localvar</span> <span>=</span> <span>int</span><span>(</span><span>5.34</span><span>);</span> <span># localvar =&gt; 5</span>

    <span># Random numbers</span>
    <span>srand</span><span>();</span> <span># Supply a seed as an argument. By default, it uses the time of day</span>
    <span>localvar</span> <span>=</span> <span>rand</span><span>();</span> <span># Random number between 0 and 1.</span>

    <span># Here's how to return a value</span>
    <span>return</span> <span>localvar</span><span>;</span>
<span>}</span>

<span>function</span> <span>string_functions</span><span>(</span>    <span>localvar</span><span>,</span> <span>arr</span><span>)</span> <span>{</span>

    <span># AWK, being a string-processing language, has several string-related</span>
    <span># functions, many of which rely heavily on regular expressions.</span>

    <span># Search and replace, first instance (sub) or all instances (gsub)</span>
    <span># Both return number of matches replaced</span>
    <span>localvar</span> <span>=</span> <span>"fooooobar"</span><span>;</span>
    <span>sub</span><span>(</span><span>"fo+"</span><span>,</span> <span>"Meet me at the "</span><span>,</span> <span>localvar</span><span>);</span> <span># localvar =&gt; "Meet me at the bar"</span>
    <span>gsub</span><span>(</span><span>"e+"</span><span>,</span> <span>"."</span><span>,</span> <span>localvar</span><span>);</span> <span># localvar =&gt; "m..t m. at th. bar"</span>

    <span># Search for a string that matches a regular expression</span>
    <span># index() does the same thing, but doesn't allow a regular expression</span>
    <span>match</span><span>(</span><span>localvar</span><span>,</span> <span>"t"</span><span>);</span> <span># =&gt; 4, since the 't' is the fourth character</span>

    <span># Split on a delimiter</span>
    <span>n</span> <span>=</span> <span>split</span><span>(</span><span>"foo-bar-baz"</span><span>,</span> <span>arr</span><span>,</span> <span>"-"</span><span>);</span> <span># a[1] = "foo"; a[2] = "bar"; a[3] = "baz"; n = 3</span>

    <span># Other useful stuff</span>
    <span>sprintf</span><span>(</span><span>"%s %d %d %d"</span><span>,</span> <span>"Testing"</span><span>,</span> <span>1</span><span>,</span> <span>2</span><span>,</span> <span>3</span><span>);</span> <span># =&gt; "Testing 1 2 3"</span>
    <span>substr</span><span>(</span><span>"foobar"</span><span>,</span> <span>2</span><span>,</span> <span>3</span><span>);</span> <span># =&gt; "oob"</span>
    <span>substr</span><span>(</span><span>"foobar"</span><span>,</span> <span>4</span><span>);</span> <span># =&gt; "bar"</span>
    <span>length</span><span>(</span><span>"foo"</span><span>);</span> <span># =&gt; 3</span>
    <span>tolower</span><span>(</span><span>"FOO"</span><span>);</span> <span># =&gt; "foo"</span>
    <span>toupper</span><span>(</span><span>"foo"</span><span>);</span> <span># =&gt; "FOO"</span>
<span>}</span>

<span>function</span> <span>io_functions</span><span>(</span>    <span>localvar</span><span>)</span> <span>{</span>

    <span># You've already seen print</span>
    <span>print</span> <span>"Hello world"</span><span>;</span>

    <span># There's also printf</span>
    <span>printf</span><span>(</span><span>"%s %d %d %d\n"</span><span>,</span> <span>"Testing"</span><span>,</span> <span>1</span><span>,</span> <span>2</span><span>,</span> <span>3</span><span>);</span>

    <span># AWK doesn't have file handles, per se. It will automatically open a file</span>
    <span># handle for you when you use something that needs one. The string you used</span>
    <span># for this can be treated as a file handle, for purposes of I/O. This makes</span>
    <span># it feel sort of like shell scripting, but to get the same output, the string</span>
    <span># must match exactly, so use a variable:</span>

    <span>outfile</span> <span>=</span> <span>"/tmp/foobar.txt"</span><span>;</span>

    <span>print</span> <span>"foobar"</span> <span>&gt;</span> <span>outfile</span><span>;</span>

    <span># Now the string outfile is a file handle. You can close it:</span>
    <span>close</span><span>(</span><span>outfile</span><span>);</span>

    <span># Here's how you run something in the shell</span>
    <span>system</span><span>(</span><span>"echo foobar"</span><span>);</span> <span># =&gt; prints foobar</span>

    <span># Reads a line from standard input and stores in localvar</span>
    <span>getline</span> <span>localvar</span><span>;</span>

    <span># Reads a line from a pipe (again, use a string so you close it properly)</span>
    <span>cmd</span> <span>=</span> <span>"echo foobar"</span><span>;</span>
    <span>cmd</span> <span>|</span> <span>getline</span> <span>localvar</span><span>;</span> <span># localvar =&gt; "foobar"</span>
    <span>close</span><span>(</span><span>cmd</span><span>);</span>

    <span># Reads a line from a file and stores in localvar</span>
    <span>infile</span> <span>=</span> <span>"/tmp/foobar.txt"</span><span>;</span>
    <span>getline</span> <span>localvar</span> <span>&lt;</span> <span>infile</span><span>;</span> 
    <span>close</span><span>(</span><span>infile</span><span>);</span>
<span>}</span>

<span># As I said at the beginning, AWK programs consist of a collection of patterns</span>
<span># and actions. You've already seen the BEGIN pattern. Other</span>
<span># patterns are used only if you're processing lines from files or standard</span>
<span># input.</span>
<span>#</span>
<span># When you pass arguments to AWK, they are treated as file names to process.</span>
<span># It will process them all, in order. Think of it like an implicit for loop,</span>
<span># iterating over the lines in these files. these patterns and actions are like</span>
<span># switch statements inside the loop. </span>

<span>/^fo+bar$/</span> <span>{</span>

    <span># This action will execute for every line that matches the regular</span>
    <span># expression, /^fo+bar$/, and will be skipped for any line that fails to</span>
    <span># match it. Let's just print the line:</span>

    <span>print</span><span>;</span>

    <span># Whoa, no argument! That's because print has a default argument: $0.</span>
    <span># $0 is the name of the current line being processed. It is created</span>
    <span># automatically for you.</span>

    <span># You can probably guess there are other $ variables. Every line is</span>
    <span># implicitly split before every action is called, much like the shell</span>
    <span># does. And, like the shell, each field can be access with a dollar sign</span>

    <span># This will print the second and fourth fields in the line</span>
    <span>print</span> <span>$</span><span>2</span><span>,</span> <span>$</span><span>4</span><span>;</span>

    <span># AWK automatically defines many other variables to help you inspect and</span>
    <span># process each line. The most important one is NF</span>

    <span># Prints the number of fields on this line</span>
    <span>print</span> <span>NF</span><span>;</span>

    <span># Print the last field on this line</span>
    <span>print</span> <span>$</span><span>NF</span><span>;</span>
<span>}</span>

<span># Every pattern is actually a true/false test. The regular expression in the</span>
<span># last pattern is also a true/false test, but part of it was hidden. If you</span>
<span># don't give it a string to test, it will assume $0, the line that it's</span>
<span># currently processing. Thus, the complete version of it is this:</span>

<span>$</span><span>0</span> <span>~</span> <span>/^fo+bar$/</span> <span>{</span>
    <span>print</span> <span>"Equivalent to the last pattern"</span><span>;</span>
<span>}</span>

<span>a</span> <span>&gt;</span> <span>0</span> <span>{</span>
    <span># This will execute once for each line, as long as a is positive</span>
<span>}</span>

<span># You get the idea. Processing text files, reading in a line at a time, and</span>
<span># doing something with it, particularly splitting on a delimiter, is so common</span>
<span># in UNIX that AWK is a scripting language that does all of it for you, without</span>
<span># you needing to ask. All you have to do is write the patterns and actions</span>
<span># based on what you expect of the input, and what you want to do with it.</span>

<span># Here's a quick example of a simple script, the sort of thing AWK is perfect</span>
<span># for. It will read a name from standard input and then will print the average</span>
<span># age of everyone with that first name. Let's say you supply as an argument the</span>
<span># name of a this data file:</span>
<span>#</span>
<span># Bob Jones 32</span>
<span># Jane Doe 22</span>
<span># Steve Stevens 83</span>
<span># Bob Smith 29</span>
<span># Bob Barker 72</span>
<span>#</span>
<span># Here's the script:</span>

<span>BEGIN</span> <span>{</span>

    <span># First, ask the user for the name</span>
    <span>print</span> <span>"What name would you like the average age for?"</span><span>;</span>

    <span># Get a line from standard input, not from files on the command line</span>
    <span>getline</span> <span>name</span> <span>&lt;</span> <span>"/dev/stdin"</span><span>;</span>
<span>}</span>

<span># Now, match every line whose first field is the given name</span>
<span>$</span><span>1</span> <span>==</span> <span>name</span> <span>{</span>

    <span># Inside here, we have access to a number of useful variables, already</span>
    <span># pre-loaded for us:</span>
    <span># $0 is the entire line</span>
    <span># $3 is the third field, the age, which is what we're interested in here</span>
    <span># NF is the number of fields, which should be 3</span>
    <span># NR is the number of records (lines) seen so far</span>
    <span># FILENAME is the name of the file being processed</span>
    <span># FS is the field separator being used, which is " " here</span>
    <span># ...etc. There are plenty more, documented in the man page.</span>

    <span># Keep track of a running total and how many lines matched</span>
    <span>sum</span> <span>+=</span> <span>$</span><span>3</span><span>;</span>
    <span>nlines</span><span>++</span><span>;</span>
<span>}</span>

<span># Another special pattern is called END. It will run after processing all the</span>
<span># text files. Unlike BEGIN, it will only run if you've given it input to</span>
<span># process. It will run after all the files have been read and processed</span>
<span># according to the rules and actions you've provided. The purpose of it is</span>
<span># usually to output some kind of final report, or do something with the</span>
<span># aggregate of the data you've accumulated over the course of the script.</span>

<span>END</span> <span>{</span>
    <span>if</span> <span>(</span><span>nlines</span><span>)</span>
        <span>print</span> <span>"The average age for "</span> <span>name</span> <span>" is "</span> <span>sum</span> <span>/</span> <span>nlines</span><span>;</span>
<span>}</span>
</pre><p><span>&nbsp;</span></p></div>]]></description>
	<dc:creator>BioJoker</dc:creator>
</item>

</channel>
</rss>