<?xml version='1.0'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:atom="http://www.w3.org/2005/Atom" >
<channel>
	<title><![CDATA[BOL: Rahul Nayak's pages]]></title>
	<link>https://bioinformaticsonline.com/pages/owner/rahul?</link>
	<atom:link href="https://bioinformaticsonline.com/pages/owner/rahul?" rel="self" type="application/rss+xml" />
	<description><![CDATA[]]></description>
	
	<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/40485/world-promising-health-companies</guid>
	<pubDate>Tue, 31 Dec 2019 19:10:13 -0600</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/40485/world-promising-health-companies</link>
	<title><![CDATA[World promising health companies !]]></title>
	<description><![CDATA[<p>The health care industry is expected to sustain stable growth over the next decade for a variety of reasons. Advances in medicine have prolonged the average lifespans of most people, requiring more health care treatments over longer terms. In years past, once people turned 65 and enrolled in Medicare, they were expected to live another 10 to 20 years.</p><p>Biohub&nbsp;is a joint collaborative effort by Berkeley, UCSF and Stanford for a&nbsp;medical&nbsp;science research center funded by a $600 million commitment from&nbsp;Facebook&nbsp;CEO and founder Mark Zuckerberg and his wife Priscilla Chan. It is trademarked as well as CZ&nbsp;Biohub. It is currently&nbsp;co-led by Stephen Quake and Joseph DeRisi.</p><p>More at&nbsp;<a href="https://www.czbiohub.org/">https://www.czbiohub.org/</a></p><p><span>Calico LLC is an American research and development biotech company founded on September 18, 2013 by Bill Maris and backed by Google with the goal of combating aging and associated diseases. In Google's 2013 Founders' Letter, Larry Page described Calico as a company focused on "health, well-being, and longevity".</span></p><p><span>More at&nbsp;<a href="https://www.calicolabs.com/">https://www.calicolabs.com/</a></span></p><p><span><span>UnitedHealth Group, Inc. (</span><a href="https://www.investopedia.com/markets/quote?tvwidgetsymbol=unh">UNH</a><span>) is the largest health care services company in the world, serving over 50 million individuals in the United States as of late 2018 and 5 million in Brazil. The company provides a wide range of health care products and services, such as health maintenance organizations (HMOs), point of service plans (POS),&nbsp;</span><a href="https://www.investopedia.com/terms/p/preferred-provider-organization.asp">preferred provider organizations (PPOs)</a><span>, and managed fee-for-service programs.</span></span></p><p><span>More at&nbsp;<a href="https://www.unitedhealthgroup.com/">https://www.unitedhealthgroup.com/</a></span></p>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/37590/parallel-processing-with-perl</guid>
	<pubDate>Sat, 25 Aug 2018 11:32:40 -0500</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/37590/parallel-processing-with-perl</link>
	<title><![CDATA[Parallel Processing with Perl !]]></title>
	<description><![CDATA[<p>Here is a small tutorial on how to make best use of multiple processors for bioinformatics analysis. One best way is using perl threads and forks. Knowing how these threads and forks work is very important before implementing them. Getting to know how these work would be really useful before reading this tutorial.</p><p>Many times in bioinformatics we need to deal with huge datasets which&nbsp; are more than 100GB size. The traditional way to analysis a file is using the while loop</p><p>while (FILE){</p><p>Do something;</p><p>}</p><p>This is very slow(since we are using only one processor) and if we have 500 million lines in the dataset it takes more than a day to iterate through the whole dataset. So how do we make best use of all our processors and get the work done quickly?</p><p>Here is a very simple and efficient technique with perl which i have been using. I am&nbsp; more inclined towards using perl fork than perl threads.</p><p>One of the oldest way to fork is</p><blockquote><p>my $fork = fork();<br />if($fork){&nbsp;&nbsp;&nbsp;<br />push (@childs,$fork);&nbsp;<br />}<br />elseif($fork==0){<br /><strong>your code here;</strong><br />exit(0);<br />}<br />else{die &ldquo;Couldnt fork : $!&rdquo;;}</p><p>## wait for the child process to finish<br />foreach(@childs){<br />my $tmp=waitid($_,0);<br />}</p></blockquote><p>what a fork does is it creates a child process and takes the variables and code with it to analyze it separately (detached from the parent process) and thus a separate process is created( which usually runs on a separate processor). Thats it!! One big disadvantage of forking is its very difficult to share variables among the different processes. I will show you how to do it easily but still it has its own drawbacks.</p><blockquote><p>Okie, now if you really do not want to use fork in your code, that&rsquo;s okie too..There are many useful modules which do it for you very efficiently. One really useful module is Parallel::ForkManager. You can use Parallel::ForkManager to manage the number of forks you want to generate (number of processors you want to use).</p><p><strong>Simple usage:</strong><br />use Parallel::ForkManager;<br />my $max_processors=8;<br />my $fork= new Parallel::ForkManager($max_processors);<br />foreach (@dna) {<br />$fork-&gt;start and next; # do the fork<br /><strong>you code here;</strong><br />$fork-&gt;finish; # do the exit in the child process<br />}<br />$pm-&gt;wait_all_children;</p></blockquote><p>so you will be generating 8 forks which do the same thing for your each element of array. when one child finishes, Parallel::ForkManager generates a new one and thus you will be using all your processors to analyze the data. Now, if you have generated 8 child processes and want to write the data to one file. You need to lock the file to do this, because you will have problems with the buffering. You can lock the file using flock command.</p><blockquote><p>open (my $QUAL, &ldquo;myfile.txt&rdquo;);<br />flock $QUAL, LOCK_EX or die &ldquo;cant lock file $!&rdquo;;<br />print $QUAL &ldquo;$output&rdquo;;<br />flock $QUAL, LOCK_UN or die &ldquo;$!&rdquo;;<br />close $QUAL;</p></blockquote><p>I would not suggest using flock when dealing with multiple processes because it will decrease the processing efficiency( each child process must wait for the lock to be released by the other child process). Instead, I would suggest each fork writing to a separate file and after the processing just concatenating them.</p><p><strong>Putting it all together, If you have 100GB data you can do this</strong></p><blockquote><p><strong>step 1</strong>&nbsp;: split the dataset equally according to number of processors you have. this may take a few hours(about 2-3 hrs for 100GB file)<br />You can use unix &ldquo;split&rdquo; command for this<br />for example:<br />my $number_split=int($number_of_entries_in_your_dataset/$max_processors);<br />my $split_Files=`split -l $number_split &ldquo;your_file.fasta&rdquo; &ldquo;file_name&rdquo;`;</p><p><strong>step2</strong>: open you directory comtaining you split files and start Parallel::ForkManager.<br /><strong>For example:</strong><br />opendir(DIRECTORY, $split_files_directory) or die $!; ### open the directory<br />my $fork= new Parallel::ForkManager($max_processors);<br />while (my $file = readdir(DIRECTORY)) { ### read the directory<br />if($file=~/^\./){next;}<br />print $file,&rdquo;\n&rdquo;;<br />########## Start fork ##########<br />my $pid= $super_fork-&gt;start and next;<br /><strong>Whatever you want to do with the split file ;</strong><br /><strong>analyze my piece of $file;</strong><br />######### end fork ###############<br />$super_fork-&gt;finish;<br />}<br />$super_fork-&gt;wait_all_children;</p></blockquote><p>So basically each processor will be active with its piece of data (split file) and thus you have created 8 processes at one time which run without interfering with the other process. I again will not suggest writing output from each child process to one file(for reasons above). Write output from each fork to a separate file and finally concatenate them. Thats it, you have just increased your program speed by 8 times!! Isnt it easy?</p><p><strong>Note:</strong><br />You may worry about concatenation of the output each child generates, since it does take some time(remember 100GB). I think now you can use a mysql database LOAD DATA LOCAL INFILE command to load all the files into a single table(Should take about 3hrs for 100Gb dataset) and then export the whole table into one file. This should be faster than just concatenating them using &ldquo;cat&rdquo; command.(correct me if I am wrong)</p><p>Or much simpler way is to use pipes</p><p>cat output_dir/* | my_pipe or my_pipe &lt;(file1) final_file;</p><p>Thats it guys!! Enjoy programming and please do comment. I am not a computer scientist so forgive me for any mistakes and if any please report them. Thank you.</p>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/37411/my-commonly-used-commands-in-bioinformatics</guid>
	<pubDate>Thu, 26 Jul 2018 04:58:45 -0500</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/37411/my-commonly-used-commands-in-bioinformatics</link>
	<title><![CDATA[My commonly used commands in Bioinformatics]]></title>
	<description><![CDATA[<p>FYI, I've found it useful to use MUMmer to extract the specific changes that Racon makes, so I can evaluate them individually:</p><pre><code>minimap -t 24 assembly.fasta long_reads.fastq.gz | racon -t 24 long_reads.fastq.gz - assembly.fasta racon_assembly.fasta
nucmer -p nucmer assembly.fasta racon_assembly.fasta
show-snps -C -T -r nucmer.delta
</code></pre><p>This reports Racon's changes in a table. You can exclude indels with the&nbsp;<code>-I</code>&nbsp;option in&nbsp;<code>show-snps</code>.&nbsp;</p><p>This process (Racon -&gt; MUMmer -&gt; SNP table) solves the problem I originally raised in this issue. So as far as I'm concerned, you can close this issue (or keep it open if you still want to implement some kind of variant table).</p>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/37317/interview-puzzles-for-bioinformatician</guid>
	<pubDate>Tue, 17 Jul 2018 05:26:18 -0500</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/37317/interview-puzzles-for-bioinformatician</link>
	<title><![CDATA[Interview Puzzles for Bioinformatician !]]></title>
	<description><![CDATA[<p>These are some of the most famous Interview Puzzles being asked in top tech companies.<br /><br />Here is a list of Top 25 puzzles which have been asked in top Tech Interview.</p><ol>
<li><span><a href="http://puzzlefry.com/puzzles/2-eggs-and-100-floor-google-classic-question/" target="_blank">2 Eggs and 100 Floor Classic Puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/gold-coins-puzzle/" target="_blank">Five pirates and gold coin Puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/gold-puzzle/" target="_blank">Six pirates and Gold Coin puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/probability-of-having-boy/" target="_blank">Probability of having boy</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/random-airplane-seats/" target="_blank">Random Airplane Seats</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/inverted-cards-puzzle/" target="_blank">Inverted playing card puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/flipping-coins/" target="_blank">Flipping Coins Puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/three-hat-colors/" target="_blank">Three hat colors Microsoft Puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/25-horses-5-tracks-puzzle/" target="_blank">25 horses 5 tracks Puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/gold-bar-puzzle-2/" target="_blank">Gold Bar Puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/crossing-the-bridge-puzzle/" target="_blank">Crossing the Bridge Puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/interview-questions/" target="_blank">Will you accept the bet?</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/the-line-of-persons-with-hats/" target="_blank">The Puzzle of 100 Hats</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/how-many-days/" target="_blank">Man fell in Well Puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/minimum-number-of-weigths/" target="_blank">Minimum Number of Weigths</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/one-bulb-with-3-switches/" target="_blank">One Bulb with 3 Switches</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/find-the-minimum-number-of-aircraft/" target="_blank">Find the minimum number of aircraft</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/burning-ropes-to-measure-time/" target="_blank">Burning ropes to measure time</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/connect-3-houses-with-3-wells/" target="_blank">Connect 3 houses with 3 wells</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/measure-9-minutes-from-2-hourglasses-puzzle/" target="_blank">Measure 9 minutes from 2 hourglasses puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/ant-and-triangle-problem/" target="_blank">Ant and Triangle Problem</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/the-man-in-the-elevator/" target="_blank">The Man in the Elevator</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/find-the-survivor/" target="_blank">Find the survivor</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/free-the-prisoners-puzzle/" target="_blank">Free the prisoners puzzle</a></span></li>
<li><span><a href="http://puzzlefry.com/puzzles/great-strategy-can-only-save-life/" target="_blank">GREAT STRATEGY CAN ONLY SAVE LIFE</a></span></li>
</ol><p><br /><span>Specially for Microsoft Interview Puzzles, you may refer,</span><br /><span><a href="http://puzzlefry.com/2015/08/top-15-famous-microsoft-interview-puzzles/" target="_blank">Top 15 Microsoft Interview Puzzles</a></span><br /><span><a href="http://puzzlefry.com/qa-tag/microsoft-interview-puzzles/" target="_blank">Microsoft Interview Puzzles</a></span><br /><br /><span>Other MOST COMMON Interview Puzzles-</span><br /><span><a href="http://puzzlefry.com/2015/08/top-25-tech-interview-puzzles-with-answers/" target="_blank">Top 25 Tech Interview&nbsp;</a></span><span><a href="http://puzzlefry.com/2015/08/top-25-tech-interview-puzzles-with-answers/" target="_blank">Logical Puzzles</a></span><br /><br /><span>Each of the puzzles got repeated a number of times in interviews&nbsp;</span><span>even for top tech companies&nbsp;</span></p>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/37198/understanding-blastn-output-format-6</guid>
	<pubDate>Wed, 27 Jun 2018 18:38:21 -0500</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/37198/understanding-blastn-output-format-6</link>
	<title><![CDATA[Understanding BLASTn output format 6 !]]></title>
	<description><![CDATA[<h3 id="sites-page-title-header" style="text-align: left;"><span>BLASTn output format 6</span></h3><div id="sites-canvas-main"><div id="sites-canvas-main-content"><div dir="ltr"><div><div><em>BLASTn</em> maps DNA against DNA, for example gene sequences against a reference genome<br /><br /><code><strong>blastn</strong>  -query <span>genes.ffn</span>  -subject <span>genome.fna</span>  -outfmt <strong>6</strong></code></div><h2>BLASTn tabular output format 6</h2>
<p><strong>Column headers:</strong><br /><code>qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore</code><br /></p>
<table border="1" cellspacing="0">
<tbody>
<tr>
<td> 1.</td>
<td> qseqid</td>
<td> query (e.g., gene) sequence id</td>
</tr>
<tr>
<td> 2.</td>
<td> sseqid</td>
<td> subject (e.g., reference genome) sequence id</td>
</tr>
<tr>
<td> 3.</td>
<td> pident</td>
<td> percentage of identical matches</td>
</tr>
<tr>
<td> 4.</td>
<td> length</td>
<td> alignment length</td>
</tr>
<tr>
<td> 5.</td>
<td> mismatch</td>
<td> number of mismatches</td>
</tr>
<tr>
<td> 6.</td>
<td> gapopen</td>
<td> number of gap openings</td>
</tr>
<tr>
<td> 7.</td>
<td> qstart</td>
<td> start of alignment in query</td>
</tr>
<tr>
<td> 8.</td>
<td> qend</td>
<td> end of alignment in query</td>
</tr>
<tr>
<td> 9.</td>
<td> sstart</td>
<td> start of alignment in subject</td>
</tr>
<tr>
<td> 10.</td>
<td> send</td>
<td> end of alignment in subject</td>
</tr>
<tr>
<td> 11.</td>
<td> evalue</td>
<td> <a href="http://www.metagenomics.wiki/tools/blast/evalue">expect value</a></td>
</tr>
<tr>
<td> 12.</td>
<td> bitscore</td>
<td> <a href="http://www.metagenomics.wiki/tools/blast/evalue"><strong>bit score</strong></a></td>
</tr>
</tbody>
</table>
<p><strong><br /></strong></p>
</div><h2><a name="TOC-Define-your-own-output-format" id="TOC-Define-your-own-output-format"></a>Define your own output format</h2><div><em>by adding the option -outfmt, as for example: </em><strong><br /></strong></div>
<p><code><strong>-outfmt</strong> <strong>"6</strong> <span>qseqid sseqid pident qlen length mismatch gapope evalue bitscore</span><strong>"</strong></code><br /><br /><em><strong>supported format specifiers are:</strong></em><br /><code>qseqid    </code>Query Seq-id<br /><code>qgi       </code>Query GI<br /><code>qacc      </code>Query accesion<br /><code>qaccver   </code>Query accesion.version<br /><code>qlen      </code>Query sequence length<br /><code>sseqid    </code>Subject Seq-id<br /><code>sallseqid </code>All subject Seq-id(s), separated by a ';'<br /><code>sgi       </code>Subject GI<br /><code>sallgi    </code>All subject GIs<br /><code>sacc      </code>Subject accession<br /><code>saccver   </code>Subject accession.version<br /><code>sallacc   </code>All subject accessions<br /><code>slen      </code>Subject sequence length<br /><code>qstart    </code>Start of alignment in query<br /><code>qend      </code>End of alignment in query<br /><code>sstart    </code>Start of alignment in subject<br /><code>send      </code>End of alignment in subject<br /><code>qseq      </code>Aligned part of query sequence<br /><code>sseq      </code>Aligned part of subject sequence<br /><code>evalue    </code>Expect value<br /><code>bitscore  </code>Bit score<br /><code>score     </code>Raw score<br /><code>length    </code>Alignment length<br /><code>pident    </code>Percentage of identical matches<br /><code>nident    </code>Number of identical matches<br /><code>mismatch  </code>Number of mismatches<br /><code>positive  </code>Number of positive-scoring matches<br /><code>gapopen   </code>Number of gap openings<br /><code>gaps      </code>Total number of gaps<br /><code>ppos      </code>Percentage of positive-scoring matches<br /><code>frames    </code>Query and subject frames separated by a '/'<br /><code>qframe    </code>Query frame<br /><code>sframe    </code>Subject frame<br /><code>btop      </code>Blast traceback operations (BTOP)<br /><code>staxids   </code>Subject Taxonomy ID(s), separated by a ';'<br /><code>sscinames </code>Subject Scientific Name(s), separated by a ';'<br /><code>scomnames </code>Subject Common Name(s), separated by a ';'<br /><code>sblastnames </code>Subject Blast Name(s), separated by a ';'   (in alphabetical order)<br /><code>sskingdoms  </code>Subject Super Kingdom(s), separated by a ';'     (in alphabetical order) <br /><code>stitle      </code>Subject Title<br /><code>salltitles  </code>All Subject Title(s), separated by a '&lt;&gt;'<br /><code>sstrand   </code>Subject Strand<br /><code>qcovs     </code>Query Coverage Per Subject<br /><code>qcovhsp   </code>Query Coverage Per HSP<br /><strong><br /><em>default values are:</em></strong><br /><code><code>-outfmt "</code>6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"</code></p>
</div></div></div>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/36842/gap-filling-or-contigs-extensions-tools</guid>
	<pubDate>Fri, 01 Jun 2018 08:07:32 -0500</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/36842/gap-filling-or-contigs-extensions-tools</link>
	<title><![CDATA[Gap filling or Contigs extensions tools !]]></title>
	<description><![CDATA[
<p>There are many tools to perform gap filling using Illumina short reads, for example "GapFiller: a de novo assembly approach to fill the gap within paired reads" or "Toward almost closed genomes with GapFiller". There are also some tools like GAPresolution that can help to perform local re-assemblies using 454 reads. We used GAPresolution but it is not a very good software, it is useful only in some specific situations.</p>

<p>Take a look at the PRICE software from the DeRisi lab. Its meant to do something very similar. http://derisilab.ucsf.edu/index.php?page=software</p>

<p>You could also look at SSPACE (http://www.baseclear.com/landingpages/basetools-a-wide-range-of-bioinformatics-solutions/sspacev12/), ATLAS tools (http://www.hgsc.bcm.tmc.edu/content/bcm-hgsc-software), and SCARPA (http://compbio.cs.toronto.edu/hapsembler/scarpa.html).</p>

<p>See the PAGIT protocol: http://www.sanger.ac.uk/resources/software/pagit/ </p>

<p>In particular, take a look at the IMAGE tool: http://genomebiology.com/2010/11/4/R41 </p>

<p>Also SOAPdenovo has ha function for scaffolding. Not sure about ABYSS</p>

<p>Here there is a useful explanation of several tools.</p>

<p>https://bioinformaticsonline.com/search?q=scaffolding&amp;entity_type=object&amp;entity_subtype=bookmarks&amp;offset=0&amp;search_type=entities</p>

<p>I could be wrong, but the above answers to your hypothetical scenario appear to miss the point that you aren't interested in assembling the full genome, just the 100 kb part you're interested in. I suggest the following algorithm:</p>

<p>1. Start with the initial assembly C0 of the contigs you have identified as overlapping your region of interest, and the set S of reads those contigs contain. Let C = C0.</p>

<p>2. Repeat:<br />a. Identify paired-end reads (not in C) for which one or both ends align within, or extending, contigs in C.<br />b. Identify unpaired reads that align extending these new paired-end reads.<br />c. Construct a new assembly C' from C and the new reads identified in (a) and (b).<br />d. Trim C' so it does not extend more than 100 kb to either end of C0. Set C = C'.<br />e. Let S' denote the reads that contribute to C'. If S' does not contain any reads not present in S, stop. Otherwise, Set S = S'.</p>

<p>3. If you don't have a complete assembly of the region of interest, generate an STS for each end of each contig, probe a library for clones including these STSes, subclone these clones into a paired-end sequencing vector, and generate paired-end reads for this library; then try steps (1) and (2) again, adding these new sequencing reads to what you had before.</p>

<p>4. If your average sequencing depth for the region of interest exceeds 25 or so without filling all gaps, it is likely that the remaining gaps represent sequences that are not getting cloned in your sequencing vectors. Try different sequencing vectors.</p>
]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/36603/learning-python-programming-a-bioinformatician-perspective</guid>
	<pubDate>Mon, 14 May 2018 16:33:03 -0500</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/36603/learning-python-programming-a-bioinformatician-perspective</link>
	<title><![CDATA[Learning Python Programming - a bioinformatician perspective !]]></title>
	<description><![CDATA[<p>Python Programming&nbsp;is a general purpose programming language that is open source, flexible, powerful and easy to use. One of the most important features of python is its rich set of utilities and libraries for data processing and analytics tasks. In the current era of big biological data, python and biopython is getting more popularity due to its easy-to-use features which supports big data processing.</p><p>In this tutorial series article, I will explore features and packages of python which are widely used in the big data, NGS, and bioinformatics. I will also walk through a real biological example which shows NGS data processing with the help of python packages and programming.</p><p>Python has a couple of points to recommend it to biologists and scientists specifically:</p><ul>
<li>It's widely used in the scientific community</li>
<li>It has a couple of very well designed libraries for doing complex scientific computing (although we won't encounter them in this book)</li>
<li>It lend itself well to being integrated with other, existing tools</li>
<li>It has features which make it easy to manipulate strings of characters (for example, strings of DNA bases and protein amino acid residues, which we as biologists are particularly fond of)</li>
</ul><p>In general, following are some of the important features of python which makes it a perfect fit for rapid application development.</p><ul>
<li>Python is interpreted language so the program does not need to be compiled. Interpreter parses the program code and generates the output.</li>
<li>Python is dynamically typed, so the variables types are defined automatically.</li>
<li>Python is strongly typed. So the developers need to cast the type manually.</li>
<li>Less code and more use makes it more acceptable.</li>
<li>Python is portable, extendable and scalable.</li>
</ul><p>There are two major Python versions, Python 2 and Python 3. Python 2 and 3 are quite different. This tutorial uses Python 3, because it more semantically correct and supports newer features.</p><p>I will post tutorial on daily basis on this page. Check the sub-pages on right side.</p>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/36197/bioinformatics-oneliner</guid>
	<pubDate>Tue, 10 Apr 2018 04:13:03 -0500</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/36197/bioinformatics-oneliner</link>
	<title><![CDATA[Bioinformatics OneLiner]]></title>
	<description><![CDATA[<p>To remove all line ends (\n) from a Unix text file:</p><pre>sed ':a;N;$!ba;s/\n//g' filename.txt &gt; newfilename_oneline.txt</pre><p>To get average for a column of numbers (here the second column $2):</p><pre>awk '{ sum += $2; n++ } END { if (n &gt; 0) print sum / n; }'</pre><p>To get sequence length for all sequences in a fasta file:</p><pre>awk '/^&gt;/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' \<br />filename.fasta</pre><p>To copy (move, rename, etc) files based on their list in a text file:</p><pre>cat file_list.txt | while read line; do cp "$line" complete_dataset/"$line"; done</pre><p>To split bam files into sets with mapped and unmapped reads:</p><pre>samtools view -F4 sample.bam &gt; sample.mapped.sam<br />samtools view -f4 sample.bam &gt; sample.unmapped.sam</pre><p>To gzip all your fastq files using gnu parallel and gzip:</p><pre>parallel gzip ::: *.fastq</pre><p>To gzip all your fastq files using pigz:</p><pre>pigz *.fastq</pre><p>To count all sequences in a fasta file:</p><pre>grep "^&gt;" yourfile.fasta -c</pre><p>To count all sequences in all fasta files in your current directory:</p><pre>for a in *.fasta; do ls $a; grep "^&gt;" -c $a; done</pre><p>To keep only one copy of duplicated lines:</p><pre>awk '!seen[$0]++'</pre><p>To sum assembly size from SPAdes contigs.fasta or scaffolds.fasta file:</p><pre>grep "^&gt;" scaffolds.fasta | cut -f 4 -d '_' | paste -sd+ | bc</pre><p>To remove everything after the first space at each line, e.g. to to simplify fasta headers:</p><pre>cut -d' ' -f1 &lt; your_file</pre><p>To count reads in a all .fastq.gz files in your current folder (fast, using gnu parallel):</p><pre>parallel "echo {} &amp;&amp; gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: *.gz</pre><p>To count reads in a all .fastq.gz files in your current folder:</p><pre>zcat *.gz | echo $((`wc -l`/4))</pre><p>To count reads in a all .fastq files in your current folder:</p><pre>cat *.fastq | echo $((`wc -l`/4))</pre><p>To count base pairs in a all .fastq.gz files in your current folder:</p><pre>zcat *.fastq.gz | paste - - - - | cut -f 2 | tr -d '\n' | wc -c </pre><p>To split multifasta file into many fasta files:</p><pre>awk '/^&gt;/ {OUT=substr($0,2) ".fa"}; {print &gt;&gt; OUT; close(OUT)}' Input_File</pre><p>To convert Illumina FASTQ 1.3 to 1.8:</p><pre>sed -e '4~4y/@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi/!"#$%&amp;'\''()*+,-.\/0123456789:;&lt;=&gt;?@ABCDEFGHIJ/' f.fastq</pre><p>To convert FASTQ to FASTA:</p><pre>sed -n '1~4s/^@/&gt;/p;2~4p' </pre><p>To get fastq read length distribution:</p><pre>cat reads.fastq | awk '{if(NR%4==2) print length($1)}' | sort | uniq -c</pre><p>To deinterleave interleaved fastq file:</p><pre>cat myf.fq | paste - - - - - - - - | tee &gt;(cut -f 1-4 | tr "\t" "\n" &gt; myfile_1.fq) | cut -f 5-8 | \<br />tr "\t" "\n" &gt; myf2.fq </pre><p>To filter and sort contig identifiers from SPAdes assembly (e.g. here lenght &gt;= 4000 + coverage &gt;=100):</p><pre>grep "^&gt;" scaffolds.fasta | sed s"/_/ /"g | awk '{ if ($4 &gt;= 4000 &amp;&amp; $6 &gt;= 100) print $0 }' | sort -k 4 -n | \<br />sed s"/ /_/"g</pre><p>To append something to all headers of your fasta files:</p><pre>sed 's/&gt;.*/&amp;YOURSTRING/' filename.fasta &gt; new_filename.fasta</pre><p>To replace/squeeze multiple adjacent spaces by only one space:&nbsp;</p><pre>tr -s " " &lt; file</pre><p>To filter fastq based on length (here larger than or equal to 21, but smaller than or equal to 25.</p><pre>cat your.fastq | paste - - - - | awk 'length($2)&nbsp; &gt;= 21 &amp;&amp; length($2) &lt;= 25' | sed 's/\t/\n/g' &gt; filtered.fastq</pre><p>To print difference between the last and first row in 5th column:</p><pre>awk '{if (!first){first=$5;}; last=$5;} END {print last-first}' myfile.txt</pre><p>To sample only 200 first bases from all sequences in a multifasta file (e.g. from assembly scaffolds.fasta file here):</p><pre>awk '/^&gt;/{ seqlen=0; print; next; } seqlen &lt; 200 { if (seqlen + length($0) &gt; 200) $0 = substr($0, 1, 200-seqlen);\<br /> seqlen += length($0); print }' scaffolds.fasta &gt; 200bp_scaffolds.fasta</pre><p>&nbsp;To pipe a compressed fasta file directly into makeblastdb.</p><pre>gunzip -c fasta.gz | makeblastdb -in -</pre><p>To remove sequences with duplicate fasta headers from a fasta file.</p><pre>awk '/^&gt;/{f=!d[$1];d[$1]=1}f' in.fasta &gt; out.fasta</pre>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/35525/linux-commands-cheat-sheet-for-bioinformatics-and-computational-biology-professionals</guid>
	<pubDate>Mon, 05 Feb 2018 18:50:41 -0600</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/35525/linux-commands-cheat-sheet-for-bioinformatics-and-computational-biology-professionals</link>
	<title><![CDATA[Linux Commands Cheat Sheet for Bioinformatics and Computational Biology Professionals]]></title>
	<description><![CDATA[<p><span>The purpose of this cheat sheet is to introduce biologist and bioinformatician to the frequently used tools for NGS analysis as well as giving experience in writing one-liners.</span></p><ul>
<li><span></span><span><strong>File System</strong></span><span><strong><br /> </strong></span><span>ls</span><span>&nbsp;&mdash; list items in current directory</span><span><br /> </span><span>ls -l</span><span>&nbsp;&mdash; list items in current directory and show in long format to see perimissions, size, and modification date</span><span><br /> </span><span>ls -a</span><span>&nbsp;&mdash; list all items in current directory, including hidden files</span><span><br /> </span><span>ls -F</span><span>&nbsp;&mdash; list all items in current directory and show directories with a slash and executables with a star</span><span><br /> </span><span>ls dir</span><span>&nbsp;&mdash; list all items in directory dir</span><span><br /> </span><span>cd dir</span><span>&nbsp;&mdash; change directory to dir</span><span><br /> </span><span>cd ..</span><span>&nbsp;&mdash; go up one directory</span><span><br /> </span><span>cd /</span><span>&nbsp;&mdash; go to the root directory</span><span><br /> </span><span>cd ~</span><span>&nbsp;&mdash; go to to your home directory</span><span><br /> </span><span>cd -</span><span>&nbsp;&mdash; go to the last directory you were just in</span><span><br /> </span><span>pwd</span><span>&nbsp;&mdash; show present working directory</span><span><br /> </span><span>mkdir dir</span><span>&nbsp;&mdash; make directory dir</span><span><br /> </span><span>rm file</span><span>&nbsp;&mdash; remove file</span><span><br /> </span><span>rm -r dir</span><span>&nbsp;&mdash; remove directory dir recursively</span><span><br /> </span><span>cp file1 file2</span><span>&nbsp;&mdash; copy file1 to file2</span><span><br /> </span><span>cp -r dir1 dir2</span><span>&nbsp;&mdash; copy directory dir1 to dir2 recursively</span><span><br /> </span><span>mv file1 file2</span><span>&nbsp;&mdash; move (rename) file1 to file2</span><span><br /> </span><span>ln -s file link</span><span>&nbsp;&mdash; create symbolic link to file</span><span><br /> </span><span>touch file</span><span>&nbsp;&mdash; create or update file</span><span><br /> </span><span>cat file</span><span>&nbsp;&mdash; output the contents of file</span><span><br /> </span><span>less file</span><span>&nbsp;&mdash; view file with page navigation</span><span><br /> </span><span>head file</span><span>&nbsp;&mdash; output the first 10 lines of file</span><span><br /> </span><span>tail file</span><span>&nbsp;&mdash; output the last 10 lines of file</span><span><br /> </span><span>tail -f file</span><span>&nbsp;&mdash; output the contents of file as it grows, starting with the last 10 lines</span><span><br /> </span><span>vim file</span><span>&nbsp;&mdash; edit file</span><span><br /> </span><span>alias name 'command'</span><span>&nbsp;&mdash; create an alias for a command</span><span><br /> </span></li>
<li><span></span><span><strong>System</strong></span><span><strong><br /> </strong></span><span>shutdown</span><span>&nbsp;&mdash; shut down machine</span><span><br /> </span><span>reboot</span><span>&nbsp;&mdash; restart machine</span><span><br /> </span><span>date</span><span>&nbsp;&mdash; show the current date and time</span><span><br /> </span><span>whoami</span><span>&nbsp;&mdash; who you are logged in as</span><span><br /> </span><span>finger user</span><span>&nbsp;&mdash; display information about user</span><span><br /> </span><span>man command</span><span>&nbsp;&mdash; show the manual for command</span><span><br /> </span><span>df</span><span>&nbsp;&mdash; show disk usage</span><span><br /> </span><span>du</span><span>&nbsp;&mdash; show directory space usage</span><span><br /> </span><span>free</span><span>&nbsp;&mdash; show memory and swap usage</span><span><br /> </span><span>whereis app</span><span>&nbsp;&mdash; show possible locations of app</span><span><br /> </span><span>which app</span><span>&nbsp;&mdash; show which app will be run by default</span><span><br /> </span></li>
<li><span></span><span><strong>Process Management</strong></span><span><strong><br /> </strong></span><span>ps</span><span>&nbsp;&mdash; display your currently active processes</span><span><br /> </span><span>top</span><span>&nbsp;&mdash; display all running processes</span><span><br /> </span><span>kill pid</span><span>&nbsp;&mdash; kill process id pid</span><span><br /> </span><span>kill -9 pid</span><span>&nbsp;&mdash; force kill process id pid</span><span><br /> </span></li>
<li><span></span><span><strong>Permissions</strong></span><span><strong><br /> </strong></span><span>ls -l</span><span>&nbsp;&mdash; list items in current directory and show permissions</span><span><br /> </span><span>chmod ugo file</span><span>&nbsp;&mdash; change permissions of file to ugo - u is the user's permissions, g is the group's permissions, and o is everyone else's permissions. The values of u, g, and o can be any number between 0 and 7.</span><span><br /> </span><span>7</span><span>&nbsp;&mdash; full permissions</span><span><br /> </span><span>6</span><span>&nbsp;&mdash; read and write only</span><span><br /> </span><span>5</span><span>&nbsp;&mdash; read and execute only</span><span><br /> </span><span>4</span><span>&nbsp;&mdash; read only</span><span><br /> </span><span>3</span><span>&nbsp;&mdash; write and execute only</span><span><br /> </span><span>2</span><span>&nbsp;&mdash; write only</span><span><br /> </span><span>1</span><span>&nbsp;&mdash; execute only</span><span><br /> </span><span>0</span><span>&nbsp;&mdash; no permissions</span><span><br /> </span><span>chmod 600 file</span><span>&nbsp;&mdash; you can read and write - good for files</span><span><br /> </span><span>chmod 700 file</span><span>&nbsp;&mdash; you can read, write, and execute - good for scripts</span><span><br /> </span><span>chmod 644 file</span><span>&nbsp;&mdash; you can read and write, and everyone else can only read - good for web pages</span><span><br /> </span><span>chmod 755 file</span><span>&nbsp;&mdash; you can read, write, and execute, and everyone else can read and execute - good for programs that you want to share</span><span><br /> </span></li>
<li><span></span><span><strong>Networking</strong></span><span><strong><br /> </strong></span><span>wget file</span><span>&nbsp;&mdash; download a file</span><span><br /> </span><span>curl file</span><span>&nbsp;&mdash; download a file</span><span><br /> </span><span>scp user@host:file dir</span><span>&nbsp;&mdash; secure copy a file from remote server to the dir directory on your machine</span><span><br /> </span><span>scp file user@host:dir</span><span>&nbsp;&mdash; secure copy a file from your machine to the dir directory on a remote server</span><span><br /> </span><span>scp -r user@host:dir dir</span><span>&nbsp;&mdash; secure copy the directory dir from remote server to the directory dir on your machine</span><span><br /> </span><span>ssh user@host</span><span>&nbsp;&mdash; connect to host as user</span><span><br /> </span><span>ssh -p port user@host</span><span>&nbsp;&mdash; connect to host on port as user</span><span><br /> </span><span>ssh-copy-id user@host</span><span>&nbsp;&mdash; add your key to host for user to enable a keyed or passwordless login</span><span><br /> </span><span>ping host</span><span>&nbsp;&mdash; ping host and output results</span><span><br /> </span><span>whois domain</span><span>&nbsp;&mdash; get information for domain</span><span><br /> </span><span>dig domain</span><span>&nbsp;&mdash; get DNS information for domain</span><span><br /> </span><span>dig -x host</span><span>&nbsp;&mdash; reverse lookup host</span><span><br /> </span><span>lsof -i tcp:1337</span><span>&nbsp;&mdash; list all processes running on port 1337</span><span><br /> </span></li>
<li><span></span><span><strong>Searching</strong></span><span><strong><br /> </strong></span><span>grep pattern files</span><span>&nbsp;&mdash; search for pattern in files</span><span><br /> </span><span>grep -r pattern dir</span><span>&nbsp;&mdash; search recursively for pattern in dir</span><span><br /> </span><span>grep -rn pattern dir</span><span>&nbsp;&mdash; search recursively for pattern in dir and show the line number found</span><span><br /> </span><span>grep -r pattern dir --include='*.ext</span><span>&nbsp;&mdash; search recursively for pattern in dir and only search in files with .ext extension</span><span><br /> </span><span>command | grep pattern</span><span>&nbsp;&mdash; search for pattern in the output of command</span><span><br /> </span><span>find file</span><span>&nbsp;&mdash; find all instances of file in real system</span><span><br /> </span><span>locate file</span><span>&nbsp;&mdash; find all instances of file using indexed database built from the updatedb command. Much faster than find</span><span><br /> </span><span>sed -i 's/day/night/g' file</span><span>&nbsp;&mdash; find all occurrences of day in a file and replace them with night - s means substitude and g means global - sed also supports regular expressions</span><span><br /> </span></li>
<li><span></span><span><strong>Compression</strong></span><span><strong><br /> </strong></span><span>tar cf file.tar files</span><span>&nbsp;&mdash; create a tar named file.tar containing files</span><span><br /> </span><span>tar xf file.tar</span><span>&nbsp;&mdash; extract the files from file.tar</span><span><br /> </span><span>tar czf file.tar.gz files</span><span>&nbsp;&mdash; create a tar with Gzip compression</span><span><br /> </span><span>tar xzf file.tar.gz</span><span>&nbsp;&mdash; extract a tar using Gzip</span><span><br /> </span><span>gzip file</span><span>&nbsp;&mdash; compresses file and renames it to file.gz</span><span><br /> </span><span>gzip -d file.gz</span><span>&nbsp;&mdash; decompresses file.gz back to file</span><span><br /> </span></li>
<li><span></span><span><strong>Shortcuts</strong></span><span><strong><br /> </strong></span><span>ctrl+a</span><span>&nbsp;&mdash; move cursor to beginning of line</span><span><br /> </span><span>ctrl+f</span><span>&nbsp;&mdash; move cursor to end of line</span><span><br /> </span><span>alt+f</span><span>&nbsp;&mdash; move cursor forward 1 word</span><span><br /> </span><span>alt+b</span><span>&nbsp;&mdash; move cursor backward 1 word</span><span><br /> </span></li>
<li></li>
</ul>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/35429/list-of-visualization-tools-for-genome-alignments</guid>
	<pubDate>Fri, 02 Feb 2018 13:25:33 -0600</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/35429/list-of-visualization-tools-for-genome-alignments</link>
	<title><![CDATA[List of visualization tools for genome alignments]]></title>
	<description><![CDATA[<p><span>Genome</span><span>&nbsp;browsers are useful not only for showing final results but also for improving analysis protocols, testing data quality, and generating result drafts. Its integration in analysis pipelines allows the optimization of parameters, which leads to better results. But sometime, we need publication ready figure of genomes. Following are the list of genome alignment visualization tools, which could be useful for analysis and&nbsp;interpretation of results:</span></p><p>ABySS Explorer</p><p>Interactive Java application that uses a novel graph-based representation to display a sequence assembly and associated metadata</p><p>http://www.bcgsc.ca/platform/bioinfo/software/abyss-explorer</p><p>BamView</p><p>Genome browser and annotation tool that allows visualization of sequence features, next-generation sequencing (NGS) data and the results of analyses within the context of the sequence, and also its six-frame translation</p><p>http://www.sanger.ac.uk/resources/software/artemis/</p><p>DNannotator&nbsp;</p><p>Annotation web toolkit for regional genomic sequences</p><p>http://bioapp.psych.uic.edu/DNannotator.htm</p><p>JVM&nbsp;</p><p>Java Visual Mapping tool for NGS reads</p><p>http://www.springer.com/cda/content/document/cda_downloaddocument/9789401792448-c2.pdf?SGWID=0-0-45-1487072-p176815501</p><p>LookSeq&nbsp;</p><p>Web-based visualization of sequences derived from multiple sequencing technologies. Low- or high-depth read pileups and easy visualization of putative single nucleotide and structural variation</p><p>http://lookseq.sourceforge.net</p><p>MagicViewer&nbsp;</p><p>Visualization of short read alignment, identification of genetic variation and association with annotation information of a reference genome</p><p>http://bioinformatics.zj.cn/magicviewer/</p><p>MapView&nbsp;</p><p>Alignments of huge-scale single-end and pair-end short reads</p><p>http://omictools.com/mapview-s1367.html</p><p>MultiPipMaker</p><p>Computes alignments of similar regions in two DNA sequences. The resulting alignments are summarized with a &lsquo;percent identity plot&rsquo; (pip)</p><p>http://pipmaker.bx.psu.edu/pipmaker/</p><p>PileLineGUI&nbsp;</p><p>Handling genome position files in NGS studies</p><p>http://sing.ei.uvigo.es/pileline/pilelinegui.html</p><p>SAMtools tview&nbsp;</p><p>Simple and fast text alignment viewer; NGS compatible</p><p>http://www.htslib.org/</p><p>SEWAL</p><p>Uses a locality-sensitive hashing algorithm to enumerate all unique sequences in an entire Illumina sequencing run</p><p>http://www.sourceforge.net/projects/sewal</p><p>STAR&nbsp;</p><p>A web-based integrated solution to management and visualization of sequencing data</p><p>http://wanglab.ucsd.edu/star/browser</p><p>SVA&nbsp;</p><p>Software for annotating and visualizing sequenced human genomes</p><p>http://www.svaproject.org</p><p>Viewer (IGV)&nbsp;</p><p>Visualization of large heterogeneous datasets, providing a smooth and intuitive user experience at all levels of genome resolution</p><p>https://www.broadinstitute.org/igv/</p><p>ZOOM Lite&nbsp;</p><p>NGS data mapping and visualization software</p><p>http://bioinfor.com/zoom/lite/</p>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>

</channel>
</rss>