<?xml version='1.0'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:atom="http://www.w3.org/2005/Atom" >
<channel>
	<title><![CDATA[BOL: Related items]]></title>
	<link>https://bioinformaticsonline.com/related/34525?offset=70</link>
	<atom:link href="https://bioinformaticsonline.com/related/34525?offset=70" rel="self" type="application/rss+xml" />
	<description><![CDATA[]]></description>
	
	<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/37414/arc-pipeline-which-facilitates-iterative-reference-guided-de-novo-assemblies</guid>
	<pubDate>Thu, 26 Jul 2018 09:20:26 -0500</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/37414/arc-pipeline-which-facilitates-iterative-reference-guided-de-novo-assemblies</link>
	<title><![CDATA[ARC: pipeline which facilitates iterative, reference guided de novo assemblies]]></title>
	<description><![CDATA[<p>ARC is a pipeline which facilitates iterative, reference guided&nbsp;<em>de novo</em>&nbsp;assemblies with the intent of:</p>
<ol>
<li>Reducing time in analysis and increasing accuracy of results by only considering those reads which should assemble together.</li>
<li>Reducing/removing reference bias as compared to mapping based approaches.</li>
</ol>
<p><span>The software is designed to work in situations where a whole-genome assembly is not the objective, but rather when the researcher wishes to assemble discrete 'targets' contained within next-generation shotgun sequence data. ARC simplifies the traditionally difficult problem of assembly by breaking the reads into small, manageable subsets which can then be assembled quickly and efficiently in parallel. Applications include those in which the researcher wishes to&nbsp;</span><em>de novo</em><span>&nbsp;assemble specific content and a set of semi-similar reference targets is available to initialize the assembly process.</span></p>
<p>https://ibest.github.io/ARC/</p><p>Address of the bookmark: <a href="https://ibest.github.io/ARC/" rel="nofollow">https://ibest.github.io/ARC/</a></p>]]></description>
	<dc:creator>Jit</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/39236/causel-an-epigenome-and-genome-editing-pipeline-for-establishing-function-of-noncoding-gwas-variants</guid>
	<pubDate>Tue, 09 Apr 2019 07:23:37 -0500</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/39236/causel-an-epigenome-and-genome-editing-pipeline-for-establishing-function-of-noncoding-gwas-variants</link>
	<title><![CDATA[CAUSEL: an epigenome- and genome-editing pipeline for establishing function of noncoding GWAS variants]]></title>
	<description><![CDATA[<p><span>The authors validated a widely accessible approach that can be used to establish functional causality for noncoding sequence variants identified by GWASs.</span></p>
<p><a href="https://www.nature.com/articles/nm.3975">https://www.nature.com/articles/nm.3975</a></p><p>Address of the bookmark: <a href="https://www.nature.com/articles/nm.3975" rel="nofollow">https://www.nature.com/articles/nm.3975</a></p>]]></description>
	<dc:creator>BioJoker</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/40409/haplotypo-a-variant-calling-pipeline-for-phased-genomes</guid>
	<pubDate>Thu, 19 Dec 2019 07:33:40 -0600</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/40409/haplotypo-a-variant-calling-pipeline-for-phased-genomes</link>
	<title><![CDATA[HaploTypo: a variant-calling pipeline for phased genomes]]></title>
	<description><![CDATA[<p>An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome.</p>
<div>Availability and Implementation</div>
<p>HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at&nbsp;<a href="https://github.com/gabaldonlab/haplotypo" target="">https://github.com/gabaldonlab/haplotypo</a>, and as a Docker image.</p><p>Address of the bookmark: <a href="https://github.com/gabaldonlab/haplotypo" rel="nofollow">https://github.com/gabaldonlab/haplotypo</a></p>]]></description>
	<dc:creator>Jit</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/41030/slr-superscaffolder-a-scaffold-assemble-pipeline-for-stlfr-reads</guid>
	<pubDate>Fri, 14 Feb 2020 14:23:30 -0600</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/41030/slr-superscaffolder-a-scaffold-assemble-pipeline-for-stlfr-reads</link>
	<title><![CDATA[SLR-superscaffolder: A scaffold assemble pipeline for stLFR reads.]]></title>
	<description><![CDATA[<p>This is a scaffold assembler designed for stLFR reads[1]. It uses the linked-read information from stLFR reads to assemble contigs into scaffolds.</p>
<p>Here is an illustration of this pipeline:</p>
<p>&nbsp;<img src="https://github.com/BGI-Qingdao/SLR-superscaffolder/raw/master/image.png" alt="image" style="border: 0px;"></p><p>Address of the bookmark: <a href="https://github.com/BGI-Qingdao/SLR-superscaffolder" rel="nofollow">https://github.com/BGI-Qingdao/SLR-superscaffolder</a></p>]]></description>
	<dc:creator>Jit</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/42038/pyparanoid-a-pipeline-for-rapid-identification-of-homologous-gene-families-in-a-set-of-genomes</guid>
	<pubDate>Thu, 13 Aug 2020 10:06:19 -0500</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/42038/pyparanoid-a-pipeline-for-rapid-identification-of-homologous-gene-families-in-a-set-of-genomes</link>
	<title><![CDATA[PyParanoid: a pipeline for rapid identification of homologous gene families in a set of genomes]]></title>
	<description><![CDATA[<p>PyParanoid is a pipeline for rapid identification of homologous gene families in a set of genomes - a central task of any comparative genomics analysis. The "gold standard" for identifying homologs is to use reciprocal best hits (RBHs), which depends on performing an all-vs-all sequence comparison, usually using BLAST, to determine homology. However, these methods are computationally expensive, requiring&nbsp;O(n<sup>2</sup>)&nbsp;resources to identify RBHs. This is problematic, as the modern deluge of sequencing data means that comparative genomics analyses could be performed on datasets of thousands of strains.</p><p>Address of the bookmark: <a href="https://github.com/ryanmelnyk/PyParanoid" rel="nofollow">https://github.com/ryanmelnyk/PyParanoid</a></p>]]></description>
	<dc:creator>BioStar</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/42946/aligngraph2-similar-genome-assisted-reassembly-pipeline-for-pacbio-long-reads</guid>
	<pubDate>Sun, 14 Mar 2021 09:42:47 -0500</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/42946/aligngraph2-similar-genome-assisted-reassembly-pipeline-for-pacbio-long-reads</link>
	<title><![CDATA[AlignGraph2: similar genome-assisted reassembly pipeline for PacBio long reads]]></title>
	<description><![CDATA[<p><span>AlignGraph2 is the second version of&nbsp;</span><a href="https://github.com/baoe/AlignGraph">AlignGraph</a><span>&nbsp;for PacBio long reads. It extends and refines contigs assembled from the long reads with a published genome similar to the sequenced genome.</span></p>
<p><span>More at&nbsp;https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbab022/6146772</span></p><p>Address of the bookmark: <a href="https://github.com/huangs001/AlignGraph2" rel="nofollow">https://github.com/huangs001/AlignGraph2</a></p>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/blog/view/43634/illumina-based-assembly-pipeline-steps</guid>
	<pubDate>Fri, 10 Dec 2021 06:22:54 -0600</pubDate>
	<link>https://bioinformaticsonline.com/blog/view/43634/illumina-based-assembly-pipeline-steps</link>
	<title><![CDATA[Illumina based assembly pipeline steps !]]></title>
	<description><![CDATA[<h3 id="illumina">Illumina<a href="https://nf-co.re/viralrecon#illumina"><span></span></a></h3><ol>
<li>Merge re-sequenced FastQ files (<a href="http://www.linfo.org/cat.html"><code>cat</code></a>)</li>
<li>Read QC (<a href="https://www.bioinformatics.babraham.ac.uk/projects/fastqc/"><code>FastQC</code></a>)</li>
<li>Adapter trimming (<a href="https://github.com/OpenGene/fastp"><code>fastp</code></a>)</li>
<li>Removal of host reads (<a href="http://ccb.jhu.edu/software/kraken2/"><code>Kraken 2</code></a>; <em>optional</em>)</li>
<li>Variant calling<ol>
<li>Read alignment (<a href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml"><code>Bowtie 2</code></a>)</li>
<li>Sort and index alignments (<a href="https://sourceforge.net/projects/samtools/files/samtools/"><code>SAMtools</code></a>)</li>
<li>Primer sequence removal (<a href="https://github.com/andersen-lab/ivar"><code>iVar</code></a>; <em>amplicon data only</em>)</li>
<li>Duplicate read marking (<a href="https://broadinstitute.github.io/picard/"><code>picard</code></a>; <em>optional</em>)</li>
<li>Alignment-level QC (<a href="https://broadinstitute.github.io/picard/"><code>picard</code></a>, <a href="https://sourceforge.net/projects/samtools/files/samtools/"><code>SAMtools</code></a>)</li>
<li>Genome-wide and amplicon coverage QC plots (<a href="https://github.com/brentp/mosdepth/"><code>mosdepth</code></a>)</li>
<li>Choice of multiple variant calling and consensus sequence generation routes (<a href="https://github.com/andersen-lab/ivar"><code>iVar variants and consensus</code></a>; <em>default for amplicon data</em> <em>||</em> <a href="http://samtools.github.io/bcftools/bcftools.html"><code>BCFTools</code></a>, <a href="https://github.com/arq5x/bedtools2/"><code>BEDTools</code></a>; <em>default for metagenomics data</em>)
<ul>
<li>Variant annotation (<a href="http://snpeff.sourceforge.net/SnpEff.html"><code>SnpEff</code></a>, <a href="http://snpeff.sourceforge.net/SnpSift.html"><code>SnpSift</code></a>)</li>
<li>Consensus assessment report (<a href="http://quast.sourceforge.net/quast"><code>QUAST</code></a>)</li>
<li>Lineage analysis (<a href="https://github.com/cov-lineages/pangolin"><code>Pangolin</code></a>)</li>
<li>Clade assignment, mutation calling and sequence quality checks (<a href="https://github.com/nextstrain/nextclade"><code>Nextclade</code></a>)</li>
<li>Individual variant screenshots with annotation tracks (<a href="https://asciigenome.readthedocs.io/en/latest/"><code>ASCIIGenome</code></a>)</li>
</ul>
</li>
<li>Intersect variants across callers (<a href="http://samtools.github.io/bcftools/bcftools.html"><code>BCFTools</code></a>)</li>
</ol></li>
<li><em>De novo</em> assembly<ol>
<li>Primer trimming (<a href="https://cutadapt.readthedocs.io/en/stable/guide.html"><code>Cutadapt</code></a>; <em>amplicon data only</em>)</li>
<li>Choice of multiple assembly tools (<a href="http://cab.spbu.ru/software/spades/"><code>SPAdes</code></a> <em>||</em> <a href="https://github.com/rrwick/Unicycler"><code>Unicycler</code></a> <em>||</em> <a href="https://github.com/GATB/minia"><code>minia</code></a>)
<ul>
<li>Blast to reference genome (<a href="https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch"><code>blastn</code></a>)</li>
<li>Contiguate assembly (<a href="https://www.sanger.ac.uk/science/tools/pagit"><code>ABACAS</code></a>)</li>
<li>Assembly report (<a href="https://github.com/BU-ISCIII/plasmidID"><code>PlasmidID</code></a>)</li>
<li>Assembly assessment report (<a href="http://quast.sourceforge.net/quast"><code>QUAST</code></a>)</li>
</ul>
</li>
</ol></li>
<li>Present QC and visualisation for raw read, alignment, assembly and variant calling results (<a href="http://multiqc.info/"><code>MultiQC</code></a>)</li>
</ol>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/44561/bactopia-a-flexible-pipeline-for-complete-analysis-of-bacterial-genomes</guid>
	<pubDate>Sat, 08 Jun 2024 16:25:08 -0500</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/44561/bactopia-a-flexible-pipeline-for-complete-analysis-of-bacterial-genomes</link>
	<title><![CDATA[Bactopia: a flexible pipeline for complete analysis of bacterial genomes]]></title>
	<description><![CDATA[<p>Bactopia is a flexible pipeline for complete analysis of bacterial genomes. The goal of Bactopia is to process your data with a broad set of tools, so that you can get to the fun part of analyses quicker!</p>
<p>Bactopia was inspired by&nbsp;<a href="https://staphopia.github.io/">Staphopia</a>, a workflow we (Tim Read and myself) released that is targeted towards&nbsp;<em>Staphylococcus aureus</em>&nbsp;genomes. Using what we learned from Staphopia and user feedback, Bactopia was developed from scratch with usability, portability, and speed in mind from the start.</p>
<p>Bactopia uses&nbsp;<a href="https://www.nextflow.io/">Nextflow</a>&nbsp;to manage the workflow, allowing for support of many types of environments (e.g. cluster or cloud). Bactopia allows for the usage of many public datasets as well as your own datasets to further enhance the analysis of your sequencing. Bactopia only uses software packages available from&nbsp;<a href="https://bioconda.github.io/">Bioconda</a>&nbsp;and&nbsp;<a href="https://conda-forge.org/">Conda-Forge</a>&nbsp;to make installation as simple as possible for&nbsp;<em>all</em>&nbsp;users.</p>
<p>To highlight the use of&nbsp;<a href="https://bactopia.github.io/latest/full-guide/">Bactopia</a>&nbsp;and&nbsp;<a href="https://bactopia.github.io/latest/bactopia-tools/">Bactopia Tools</a>, we performed an analysis of 1,664 public&nbsp;<em>Lactobacillus</em>&nbsp;genomes, focusing on&nbsp;<em>Lactobacillus crispatus</em>, a species that is a common part of the human vaginal microbiome. The results from this analysis are published in mSystems under the title:&nbsp;<em><a href="https://doi.org/10.1128/mSystems.00190-20">Bactopia: a flexible pipeline for complete analysis of bacterial genomes</a></em></p>
<p><a href="https://bactopia.github.io/latest/assets/bactopia-workflow.png"><img src="https://bactopia.github.io/latest/assets/bactopia-workflow.png" alt="Bactopia Workflow" style="border: 0px;"></a></p><p>Address of the bookmark: <a href="https://bactopia.github.io/latest/" rel="nofollow">https://bactopia.github.io/latest/</a></p>]]></description>
	<dc:creator>Abhi</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/37527/nanopack-visualizing-and-processing-long-read-sequencing-data</guid>
	<pubDate>Fri, 10 Aug 2018 18:41:34 -0500</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/37527/nanopack-visualizing-and-processing-long-read-sequencing-data</link>
	<title><![CDATA[NanoPack: visualizing and processing long-read sequencing data]]></title>
	<description><![CDATA[<p>The NanoPack tools are written in Python3 and released under the GNU GPL3.0 License. The source code can be found at&nbsp;<a href="https://github.com/wdecoster/nanopack" target="">https://github.com/wdecoster/nanopack</a>, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at&nbsp;<a href="http://nanoplot.bioinf.be/" target="">http://nanoplot.bioinf.be</a>&nbsp;and command line tools.</p>
<p>&nbsp;https://academic.oup.com/bioinformatics/article/34/15/2666/4934939</p><p>Address of the bookmark: <a href="https://github.com/wdecoster/nanoQC" rel="nofollow">https://github.com/wdecoster/nanoQC</a></p>]]></description>
	<dc:creator>Jit</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/37590/parallel-processing-with-perl</guid>
	<pubDate>Sat, 25 Aug 2018 11:32:40 -0500</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/37590/parallel-processing-with-perl</link>
	<title><![CDATA[Parallel Processing with Perl !]]></title>
	<description><![CDATA[<p>Here is a small tutorial on how to make the best use of multiple processors for bioinformatics analysis. One good way is to use Perl threads and forks. It is important to understand how threads and forks work before implementing them, so it helps to read up on them before following this tutorial.</p>
<p>In bioinformatics we often need to deal with huge datasets of more than 100GB. The traditional way to analyze a file is a while loop:</p>
<p>while (&lt;FILE&gt;) {</p><p>do something;</p><p>}</p>
<p>This is very slow (since we are using only one processor), and if we have 500 million lines in the dataset it can take more than a day to iterate through the whole dataset. So how do we make the best use of all our processors and get the work done quickly?</p>
<p>Here is a very simple and efficient technique with Perl which I have been using. I am more inclined towards using Perl fork than Perl threads.</p>
<p>One of the oldest ways to fork is:</p>
<blockquote><p>my @childs;<br />my $fork = fork();<br />if ($fork) {<br />push(@childs, $fork); # parent: remember the child's PID<br />}<br />elsif (defined $fork &amp;&amp; $fork == 0) { # child: fork returned 0<br /><strong>your code here;</strong><br />exit(0);<br />}<br />else { die "Couldn't fork: $!"; } # fork returned undef</p><p>## wait for the child processes to finish<br />foreach (@childs) {<br />my $tmp = waitpid($_, 0);<br />}</p></blockquote>
<p>What a fork does is create a child process that takes a copy of the variables and code with it and runs separately from the parent process; the operating system will usually schedule it on a separate processor. That's it!! One big disadvantage of forking is that it is very difficult to share variables among the different processes. I will show you how to do it easily, but it still has its own drawbacks.</p>
<blockquote><p>Okay, now if you really do not want to call fork directly in your code, that's okay too. There are many useful modules which do it for you very efficiently. One really useful module is Parallel::ForkManager. You can use Parallel::ForkManager to manage the number of forks you want to generate (the number of processors you want to use).</p><p><strong>Simple usage:</strong><br />use Parallel::ForkManager;<br />my $max_processors = 8;<br />my $pm = Parallel::ForkManager-&gt;new($max_processors);<br />foreach (@dna) {<br />$pm-&gt;start and next; # do the fork<br /><strong>your code here;</strong><br />$pm-&gt;finish; # do the exit in the child process<br />}<br />$pm-&gt;wait_all_children;</p></blockquote>
<p>So you will be running up to 8 forks at a time, each doing the same thing for one element of the array. When one child finishes, Parallel::ForkManager starts a new one, and thus you will be using all your processors to analyze the data. Now, suppose you have generated 8 child processes and want them all to write to one file. You need to lock the file to do this, because otherwise buffered output from different processes can become interleaved. You can lock the file using the flock command.</p>
<blockquote><p>use Fcntl qw(:flock); # imports LOCK_EX and LOCK_UN<br />open(my $QUAL, '&gt;&gt;', "myfile.txt") or die "can't open file: $!"; # open for appending<br />flock($QUAL, LOCK_EX) or die "can't lock file: $!";<br />print $QUAL "$output";<br />flock($QUAL, LOCK_UN) or die "$!";<br />close $QUAL;</p></blockquote>
<p>I would not suggest using flock when dealing with multiple processes because it will decrease the processing efficiency (each child process must wait for the lock to be released by the other child process). Instead, I would suggest that each fork write to a separate file, and that you concatenate the files after processing.</p>
<p><strong>Putting it all together: if you have 100GB of data you can do this</strong></p>
<blockquote><p><strong>step 1</strong>: split the dataset equally according to the number of processors you have. This may take a few hours (about 2-3 hrs for a 100GB file).<br />You can use the unix "split" command for this.<br />For example:<br />my $number_split = int($number_of_entries_in_your_dataset / $max_processors);<br />my $split_Files = `split -l $number_split "your_file.fasta" "file_name"`;</p><p><strong>step 2</strong>: open the directory containing your split files and start Parallel::ForkManager.<br /><strong>For example:</strong><br />opendir(DIRECTORY, $split_files_directory) or die $!; ### open the directory<br />my $super_fork = Parallel::ForkManager-&gt;new($max_processors);<br />while (my $file = readdir(DIRECTORY)) { ### read the directory<br />if ($file =~ /^\./) { next; } ### skip hidden entries, "." and ".."<br />print $file, "\n";<br />########## Start fork ##########<br />my $pid = $super_fork-&gt;start and next;<br /><strong>Whatever you want to do with the split file;</strong><br /><strong>analyze my piece of $file;</strong><br />######### end fork ###############<br />$super_fork-&gt;finish;<br />}<br />$super_fork-&gt;wait_all_children;</p></blockquote>
<p>So basically each processor will be busy with its own piece of data (a split file), and thus you have created 8 processes that run at the same time without interfering with each other. Again, I do not suggest writing output from each child process to one file (for the reasons above). Write the output from each fork to a separate file and finally concatenate them. That's it, you have just sped up your program by up to 8 times!! Isn't it easy?</p>
<p><strong>Note:</strong><br />You may worry about concatenating the output each child generates, since it does take some time (remember, 100GB). I think you can use the MySQL LOAD DATA LOCAL INFILE command to load all the files into a single table (should take about 3 hrs for a 100GB dataset) and then export the whole table into one file. This should be faster than just concatenating them using the "cat" command (correct me if I am wrong).</p>
<p>Or a much simpler way is to use pipes:</p>
<p>cat output_dir/* | my_pipe or my_pipe &lt;(file1) final_file;</p>
<p>That's it guys!! Enjoy programming and please do comment. I am not a computer scientist, so forgive me for any mistakes, and if you find any please report them. Thank you.</p>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>

</channel>
</rss>