<?xml version='1.0'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:atom="http://www.w3.org/2005/Atom" >
<channel>
	<title><![CDATA[BOL: Related items]]></title>
	<link>https://bioinformaticsonline.com/related/32709?offset=420</link>
	<atom:link href="https://bioinformaticsonline.com/related/32709?offset=420" rel="self" type="application/rss+xml" />
	<description><![CDATA[]]></description>
	
	<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/pages/view/11399/next-generation-sequencing-in-r-or-bioconductor-environment</guid>
	<pubDate>Mon, 02 Jun 2014 18:03:09 -0500</pubDate>
	<link>https://bioinformaticsonline.com/pages/view/11399/next-generation-sequencing-in-r-or-bioconductor-environment</link>
	<title><![CDATA[Next generation sequencing in R or bioconductor environment]]></title>
	<description><![CDATA[<p>Many R and Bioconductor packages are available for NGS data analysis. Some of them are described below.</p><h3><a name="TOC-Biostrings" id="TOC-Biostrings"></a>Biostrings</h3><p>The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed- and memory-efficient string containers, string matching algorithms, and other utilities for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages. <a href="http://bioconductor.org/packages/release/bioc/html/Biostrings.html">Documentation</a></p><div><div style="text-align: left;"><div style="color: #000000;"><h4><a name="TOC-IRanges-Overview" id="TOC-IRanges-Overview"></a>IRanges Overview</h4><p>IRanges provides the low-level infrastructure and containers for handling sets of integer ranges within Bioconductor's BioC-Seq domain. Its classes and methods support many higher-level packages such as GenomicRanges, ShortRead and Rsamtools. <a href="http://bioconductor.org/packages/release/bioc/html/IRanges.html">Documentation</a></p><div style="text-align: right;"><div style="text-align: left;"><h4><a name="TOC-GenomicRanges-Overview" id="TOC-GenomicRanges-Overview"></a>GenomicRanges Overview</h4><p>The <em>GenomicRanges</em> package serves as the foundation for representing genomic locations within the Bioconductor project. 
It is built upon the <em>IRanges</em> infrastructure and defines three major data containers - <em>GRanges, GRangesList</em> and <em>GappedAlignments</em> - which support other important BioC-Seq packages, including <em>ShortRead, Rsamtools, rtracklayer, GenomicFeatures</em> and <em>BSgenome</em>.&nbsp; Compared to the IRanges container, the GRanges/<em>GRangesList</em> classes are more flexible and can store additional information about sequence ranges, such as chromosome identifiers (sequence space), strand information and annotation data. <a href="http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html">Documentation</a></p></div></div></div></div><h3><a name="TOC-Motif-Discovery" id="TOC-Motif-Discovery"></a>Motif Discovery</h3><h4><a name="TOC-cosmo" id="TOC-cosmo"></a>cosmo</h4><p>The cosmo package searches a set of unaligned DNA sequences for a shared motif that may function as a transcription factor binding site. The algorithm extends the popular motif discovery tool MEME (Bailey and Elkan, 1995) in that it allows the search to be supervised by specifying a set of constraints that the motif to be discovered must satisfy. <a href="http://bioconductor.org/packages/release/bioc/html/cosmo.html">Documentation</a></p></div><div>
<div style="color: #0000ff;"><h4><a name="TOC-BCRANK" id="TOC-BCRANK"></a>BCRANK</h4><p>BCRANK is a method that takes a ranked list of genomic regions as input and outputs short DNA sequences that are overrepresented in some part of the list. The algorithm was developed for detecting transcription factor (TF) binding sites in a large number of enriched regions from high-throughput ChIP-chip or ChIP-seq experiments, but it can be applied to any ranked list of DNA sequences. <a href="http://bioconductor.org/packages/release/bioc/html/BCRANK.html">Documentation</a></p>
<p>rGADEM: <a href="http://bioconductor.org/packages/devel/bioc/html/rGADEM.html">Documentation</a></p><p>MotIV: <a href="http://bioconductor.org/packages/devel/bioc/html/MotIV.html">Documentation</a></p></div><h3><a name="TOC-ShortRead" id="TOC-ShortRead"></a>ShortRead</h3><p>The ShortRead package provides input, quality control, filtering, parsing, and manipulation functionality for short read sequences produced by high-throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primarily focused on Solexa/Illumina reads. <a href="http://bioconductor.org/packages/release/bioc/html/ShortRead.html">Documentation</a></p><h3><a name="TOC-Rsamtools" id="TOC-Rsamtools"></a>Rsamtools</h3><p>Rsamtools provides functions for parsing and inspecting samtools BAM formatted binary alignment data. SAM/BAM is quickly becoming a universal standard alignment format, and is now supported by a wide variety of alignment tools. <a href="http://bioconductor.org/help/bioc-views/2.7/bioc/html/Rsamtools.html">Documentation</a></p>
<p><a href="http://samtools.sourceforge.net/">Samtools Website</a><br /> <a href="http://bio-bwa.sourceforge.net/">BWA (Burrows-Wheeler Alignment) Website</a><br /><span style="color: #0000ff;"></span></p>
<div style="color: #000000;">&nbsp;</div></div><div>
<p><span style="color: #000000;">Additional tools for SNP analysis:&nbsp;</span></p>
<p><a href="http://bioconductor.org/help/bioc-views/release/bioc/html/snpMatrix.html">snpMatrix</a></p><h3><a name="TOC-BSgenome" id="TOC-BSgenome"></a>BSgenome</h3><p>BSgenome provides an object-oriented infrastructure for interacting with a Biostrings-based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism. <a href="http://bioconductor.org/packages/release/bioc/html/BSgenome.html">Documentation</a></p><h3><a name="TOC-rtracklayer" id="TOC-rtracklayer"></a>rtracklayer</h3><p>rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage. <a href="http://bioconductor.org/packages/release/bioc/html/rtracklayer.html">Documentation</a></p><h3><a name="TOC-biomaRt" id="TOC-biomaRt"></a>biomaRt</h3><p>The biomaRt package provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables online retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. The data are retrieved automatically via the Internet, so it is recommended that you cache them locally, or check versions, if your code would be adversely affected by updates to these data. <a href="http://bioconductor.org/packages/release/bioc/html/biomaRt.html">Documentation</a></p><h3><a name="TOC-ChIP-Seq-Analysis-Packages" id="TOC-ChIP-Seq-Analysis-Packages"></a>ChIP-Seq Analysis Packages</h3><p>Bioconductor provides various packages for analyzing and visualizing ChIP-Seq data. Only a small selection of these packages is introduced here. 
Additional useful introductions to this topic are: <a href="http://www.bioconductor.org/workshops/2009/SeattleJan09/ChIP-seq/">BioC ChIP-seq Case Study</a> and BioC <a href="http://www.bioconductor.org/help/course-materials/2009/SeattleNov09/ChIP-seq/">ChIP-Seq</a>.</p><h4><a name="TOC-chipseq" id="TOC-chipseq"></a>chipseq</h4><p>The chipseq package combines a variety of HT-Seq packages into a pipeline for ChIP-Seq data analysis. <a href="http://bioconductor.org/packages/release/bioc/html/chipseq.html">Documentation</a></p><h4><a name="TOC-BayesPeak" id="TOC-BayesPeak"></a>BayesPeak</h4><p>BayesPeak is a peak calling package for identifying DNA binding sites of proteins in ChIP-Seq experiments. Its algorithm uses hidden Markov models (HMM) and Bayesian statistical methods. The following sample code introduces the identification of peaks with the BayesPeak package as well as the incorporation of read coverage information obtained by the chipseq package. <a href="http://bioconductor.org/packages/release/bioc/html/BayesPeak.html">Documentation</a> [ <a href="http://www.biomedcentral.com/1471-2105/10/299">Publication</a> ]</p><h4><a name="TOC-PICS" id="TOC-PICS"></a>PICS</h4><p>The PICS package applies probabilistic inference to aligned-read ChIP-Seq data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. The following sample code uses the test data set from the above BayesPeak package in order to compare the results from both methods by identifying their consensus peak set. 
<a href="http://www.bioconductor.org/packages/release/bioc/html/PICS.html">Documentation</a> [ <a href="http://www.hubmed.org/display.cgi?uids=20528864">Publication</a> ]</p><h4><a name="TOC-ChIPpeakAnno" id="TOC-ChIPpeakAnno"></a>ChIPpeakAnno</h4><p>The ChIPpeakAnno package provides batch annotation of the peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around peaks, obtain enriched Gene Ontology (GO) terms, and find the nearest gene, exon, miRNA or custom features such as most conserved elements and other transcription factor binding sites supplied by users. The package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stat packages. <a href="http://bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html">Documentation</a></p><h4><a name="TOC-Additional-ChIP-Seq-Packages" id="TOC-Additional-ChIP-Seq-Packages"></a>Additional ChIP-Seq Packages</h4><p>DiffBind: <a href="http://www.bioconductor.org/packages/release/bioc/html/DiffBind.html">Documentation</a></p><p>MOSAICS: <a href="http://bioconductor.org/packages/devel/bioc/html/mosaics.html">Documentation</a></p><p>iSeq: <a href="http://bioconductor.org/packages/release/bioc/html/iSeq.html">Documentation</a></p><p>ChIPseqR: <a href="http://bioconductor.org/packages/release/bioc/html/ChIPseqR.html">Documentation</a></p><p>ChIPsim: <a href="http://bioconductor.org/packages/release/bioc/html/ChIPsim.html">Documentation</a></p><p>CSAR: <a href="http://www.bioconductor.org/packages/devel/bioc/html/CSAR.html">Documentation</a></p><p>ChIP-Seq Pipeline: <a href="http://www.bioconductor.org/packages/release/bioc/html/PICS.html">PICS</a>, rGADEM and MotIV (<a href="http://www.rglab.org/pics-and-bioconductor/">developer web site</a>)</p><p>SPP: <a href="http://compbio.med.harvard.edu/Supplements/ChIP-seq/">ChIP-seq processing pipeline</a></p><p><a href="http://compbio.med.harvard.edu/Supplements/ChIP-seq/tutorial.html">SPP 
Tutorial</a></p><p><a href="http://liulab.dfci.harvard.edu/MACS/index.html">MACS</a></p><p><a href="http://gmdd.shgmo.org/Computational-Biology/ChIP-Seq/download/SIPeS">SIPeS</a></p><h3><a name="TOC-RNA-Seq-Analysis" id="TOC-RNA-Seq-Analysis"></a>RNA-Seq Analysis</h3><h4><a name="TOC-Counting-Reads-that-Overlap-with-Annotation-Ranges-" id="TOC-Counting-Reads-that-Overlap-with-Annotation-Ranges-"></a>Counting Reads that Overlap with Annotation Ranges&nbsp;</h4><p>The GenomicRanges package provides support for importing into R short read alignment data in BAM format (via Rsamtools) and associating them with genomic feature ranges, such as exons or genes. This way one can quantify the number of reads aligning to annotated genomic regions. The package defines general purpose containers for storing genomic intervals as well as more specialized containers for storing alignments against a reference genome. The two main functions for read counting provided by this infrastructure are <span>countOverlaps <span style="color: #000000;"><span>and</span></span> summarizeOverlaps</span>. For their proper usage, it is important to read the corresponding <a href="http://www.bioconductor.org/packages/devel/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf">PDF manual</a>. <a href="http://bioconductor.org/packages/release/bioc/html/GenomicRanges.html">Documentation</a></p><h4><a name="TOC-Differential-Gene-Expression-Analysis-with-DESeq" id="TOC-Differential-Gene-Expression-Analysis-with-DESeq"></a>Differential Gene Expression Analysis with DESeq</h4><p>The DESeq package contains functions to call differentially expressed genes (DEGs) in count tables based on a model using the negative binomial distribution. 
It expects as input a data frame with the raw read counts per region/gene of interest (rows) for each test sample (columns).&nbsp; Such a count table can be imported into R or generated from BAM alignment files using the <span>countOverlaps</span> function as introduced above. <a href="http://www.bioconductor.org/packages/release/bioc/html/DESeq.html">Documentation</a></p><h4><a name="TOC-Differential-Gene-Expression-Analysis-with-edgeR" id="TOC-Differential-Gene-Expression-Analysis-with-edgeR"></a>Differential Gene Expression Analysis with edgeR</h4><p>The edgeR package uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes (DEGs) in count data.&nbsp;</p>
<p><a href="http://www.bioconductor.org/packages/release/bioc/html/edgeR.html">Documentation</a></p>
<p><span style="color: #000000;">A variety of additional R packages are available for normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG): <br /> </span></p><p><a href="http://bioconductor.org/packages/devel/bioc/html/easyRNASeq.html">easyRNASeq</a> (simplifies read counting per genome feature)</p><p><a href="http://www.bioconductor.org/packages/release/bioc/html/DEXSeq.html">DEXSeq</a> (Inference of differential exon usage);&nbsp;<a href="http://www.bioconductor.org/packages/release/data/experiment/html/parathyroidSE.html">parathyroidSE</a> explains how to generate exon read counts in R</p><p><a href="http://bioconductor.org/packages/release/bioc/html/DEGseq.html">DEGseq</a></p><p><a href="http://www.bioconductor.org/packages/release/bioc/html/baySeq.html">baySeq</a> (also see: <a href="http://www.bioconductor.org/packages/release/bioc/html/segmentSeq.html">segmentSeq</a>)</p><p><a href="http://bioconductor.org/packages/release/bioc/html/Genominator.html">Genominator</a> (<a href="http://www.hubmed.org/display.cgi?uids=20167110">Bullard et al. 2010</a>)</p><div style="text-align: right;"><div style="text-align: left;"><h4><a name="TOC-Detection-of-Alternative-Splice-Junctions" id="TOC-Detection-of-Alternative-Splice-Junctions"></a>Detection of Alternative Splice Junctions</h4>
<p><span style="color: #000000;">Another utility of RNA-Seq experiments is the analysis of splice junctions. The following software suggestions provide this utility:</span></p>
<p><a href="http://woldlab.caltech.edu/rnaseq/">ERANGE<br /> </a><a href="http://tophat.cbcb.umd.edu/">TopHat</a></p><p><a href="http://biogibbs.stanford.edu/%7Ekinfai/SpliceMap/">SpliceMap</a></p><p><a href="http://solidsoftwaretools.com/gf/project/splitseek/">SplitSeek</a></p><h3><a name="TOC-DNA-Methylation-Data-Analysis" id="TOC-DNA-Methylation-Data-Analysis"></a>DNA-Methylation Data Analysis</h3><div><ul>
<li><span style="font-size: 10pt;"><a href="http://www.bioconductor.org/help/course-materials/2012/BiocEurope2012/mattia_pelizzola_methylPipe.pdf">methylPipe</a></span></li>
<li><span style="font-size: 10pt;"><a href="http://www.bioconductor.org/packages/devel/bioc/html/bsseq.html">bsseq</a></span></li>
<li><a href="http://www.bioconductor.org/packages/devel/bioc/html/BiSeq.html">BiSeq</a></li>
<li>Much more under <a href="http://www.bioconductor.org/packages/devel/BiocViews.html#___DNAMethylation">BiocViews</a></li>
</ul></div></div></div><h3><a name="TOC-HT-Seq-Data-Visualization" id="TOC-HT-Seq-Data-Visualization"></a>HT-Seq Data Visualization</h3>
<p><a href="http://www.bioconductor.org/packages/release/bioc/html/ggbio.html">ggbio</a>: ggplot2 extension for genomics data (<a href="http://tengfei.github.com/ggbio/">online manual</a>) <a href="http://www.bioconductor.org/packages/devel/bioc/html/Gviz.html">Gviz</a>:&nbsp;Plotting data and annotation information along genomic coordinates <a href="http://bioconductor.org/packages/release/bioc/html/HilbertVis.html">HilbertVis</a>: Hilbert genome plots</p>
<p><a href="http://bioconductor.org/packages/release/bioc/html/GenomeGraphs.html">GenomeGraphs</a>: Plotting genomic information from Ensembl</p><p><a href="http://www.hubmed.org/display.cgi?uids=18507856">TileQC</a>: Flow Cell Quality Visualization</p><p><a href="http://bioconductor.org/packages/release/bioc/html/rtracklayer.html">rtracklayer</a>: R interface to genome browsers</p><p><a href="http://genoplotr.r-forge.r-project.org/">genoPlotR</a>: Plotting maps of genes and genomes</p><p><a href="http://bioconductor.org/packages/release/bioc/html/Genominator.html">Genominator</a>: Tools for storing, accessing, analyzing and visualizing genomic data.</p><p>&nbsp;</p><p>To install all packages</p><blockquote><p>source("http://bioconductor.org/biocLite.R")<br />biocLite()<br />biocLite(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt", "chipseq", "ChIPpeakAnno", "Rsamtools", "BayesPeak", "PICS", "GenomicRanges", "DESeq", "edgeR", "leeBamViews", "GenomicFeatures", "BSgenome.Celegans.UCSC.ce2"))</p></blockquote></div>]]></description>
	<dc:creator>John Parker</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/file/view/37581/comparativegenomics-exercise2</guid>
	<pubDate>Wed, 22 Aug 2018 22:10:56 -0500</pubDate>
	<link>https://bioinformaticsonline.com/file/view/37581/comparativegenomics-exercise2</link>
	<title><![CDATA[ComparativeGenomics Exercise2]]></title>
	<description><![CDATA[<p>COMPARATIVE MICROBIAL GENOMICS ANALYSIS WORKSHOP&nbsp; @&nbsp;cbs.dtu.dk</p><p>Free Bioinformatics workbench https://www.mn.uio.no/ifi/english/research/networks/clsi/earlier_seminars/2012/tammivesth_osloseminarfinal.pdf</p>]]></description>
	<dc:creator>Neel</dc:creator>
	<enclosure url="https://bioinformaticsonline.com/file/download/37581" length="139956" type="application/pdf" />
</item>

<item>
  <guid isPermaLink='true'>https://bioinformaticsonline.com/researchlabs/view/12870/nuclear-dynamics-lab</guid>
  <pubDate>Thu, 17 Jul 2014 15:03:27 -0500</pubDate>
  <link></link>
  <title><![CDATA[Nuclear Dynamics Lab]]></title>
  <description><![CDATA[
<p>The lab's focus is to elucidate the fundamental principles, new mechanisms, machineries and emergent properties involved in maintaining the genome and gene expression programmes, with the aim of improving lifelong health and well-being for all.</p>

<p>More at http://www.babraham.ac.uk/our-research/nuclear-dynamics/</p>
]]></description>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/blog/view/14011/dynamic-chromosome-breakpoints</guid>
	<pubDate>Wed, 13 Aug 2014 18:38:10 -0500</pubDate>
	<link>https://bioinformaticsonline.com/blog/view/14011/dynamic-chromosome-breakpoints</link>
	<title><![CDATA[Dynamic chromosome breakpoints !!!]]></title>
	<description><![CDATA[<p>Cell division involves the distribution of identical genetic material, DNA, to two daughter cells. During this process, the duplicated deoxyribonucleic acid (DNA) goes through condensation and decondensation. This is followed by nuclear envelope dissolution, mitotic spindle assembly, migration of the sister chromatid pairs to the metaphase plate, division and segregation of identical sets of chromosomes into daughter nuclei and nuclear envelope reformation.</p><p>Metaphase is a vital stage of cell division: the sister chromatids migrate to the centre, line up in a row, and are pulled apart by attached microtubules in such a way that half the DNA ends up in each daughter cell. However, before the mitotic spindle-mediated movement starts and pulls the DNA apart, the chromosomes are free to undergo <strong>recombination</strong>, which involves the exchange of genetic material either between multiple chromosomes or between different regions of the same chromosome.</p><p><img src="http://www.sciencelearn.org.nz/var/sciencelearn/storage/images/contexts/uniquely-me/sci-media/images/chromosomes-crossing-over/464438-1-eng-NZ/Chromosomes-crossing-over.jpg" alt="image" width="504" height="342" style="border: 0px; border: 0px;"></p><p>During recombination, each strand is broken at a precise position, segments are exchanged between the strands, and the resulting recombined molecules are sealed. The term &ldquo;<strong>chromosomal breakpoints</strong>&rdquo; refers to the places where the strands break. This process usually occurs with a high degree of accuracy and at high frequency in both eukaryotic and prokaryotic cells. Occasionally, however, this &ldquo;break and seal&rdquo; (or &ldquo;break and reattach&rdquo;) process goes wrong and the reattachment happens in the wrong place, which is usually disastrous (with few exceptions). These chromosome abnormalities involve the gain, loss or rearrangement of visible amounts of genetic material during cell division. 
These abnormalities are of two types. The first is numerical abnormalities, where severe disorders are caused by the loss or gain of whole chromosomes, which affects the copy number of hundreds or even thousands of genes. The second is structural abnormalities, which can be unbalanced or balanced. Unbalanced structural abnormalities are similar to numerical abnormalities in that genetic material is either gained or lost. Natural defects in chromosome segregation are linked to cancer and several genetic diseases (http://en.wikipedia.org/wiki/List_of_genetic_disorders). The enzymes involved in regulating cell division therefore remain attractive drug targets for many diseases.</p><p>&nbsp;</p><p>&nbsp;</p><p><img src="http://upload.wikimedia.org/wikipedia/commons/4/4a/Chromosomal_translocations.svg" alt="image" width="424" height="331" style="border: 0px; border: 0px;"></p><p>&nbsp;</p><p>Apart from causing certain chromosome abnormalities, the &ldquo;crossing over&rdquo; of segments of maternal and paternal chromosomes to form hybrid chromosomes has evolutionary importance and is considered a driver of genetic variation. Moreover, chromosome breakage in evolution is considered to be non-random (http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.0020014). In addition, the study of breakpoint regions and non-breakpoint (stable) regions of chromosomes indicates that the two kinds of region evolved in distinctly different ways (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2675965/). 
Such breakage may lead to genetic diseases, or may take part in chromosomal rearrangements and contribute to the development of new species.</p><p>I will try to explain the genome hotspots / Evolutionary Breakpoint Regions (EBRs) / fragile regions / weak fragments in my next blog post.</p><p><strong>Software for recombination detection:</strong></p><p><strong>RAT</strong> http://cbr.jic.ac.uk/dicks/software/RAT/</p><p><strong>Breakpointer</strong> https://github.com/ruping/Breakpointer</p><p><strong>RDP</strong> http://web.cbio.uct.ac.za/~darren/rdp.html</p><p><strong>RB-finder</strong> http://www.ncbi.nlm.nih.gov/pubmed/18707535</p><p><strong>LDhat2.0</strong> http://ldhat.sourceforge.net/LDhat2.0/instructions.shtml</p><p><strong>Reference:</strong></p><p>http://www.nature.com/scitable/topicpage/genetic-recombination-514#</p><p>Image: Wikipedia , sciencelearn.org.nz</p><p><strong>Recommended Articles:</strong></p><p>http://www.friendshipcircle.org/blog/2012/05/22/13-chromosomal-disorders-youve-never-heard-of/</p><p>http://web.udl.es/usuaris/e4650869/docencia/segoncicle/genclin98/recursos_classe_%28pdf%29/revisionsPDF/chromosyndromes.pdf</p><p>http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2775595/table/T2/</p><p>http://learn.genetics.utah.edu/content/disorders/chromosomal/</p><p>http://www.ncert.nic.in/html/learning_basket/biology/cc&amp;cd.pdf</p>]]></description>
	<dc:creator>Jitendra Narayan</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/blog/view/35621/bbtools-for-bioinformatician</guid>
	<pubDate>Thu, 15 Feb 2018 16:45:52 -0600</pubDate>
	<link>https://bioinformaticsonline.com/blog/view/35621/bbtools-for-bioinformatician</link>
	<title><![CDATA[BBTools for bioinformatician !]]></title>
	<description><![CDATA[<p><span></span><br /><strong>BBMap.sh</strong><br /><br /></p><ul>
<li><strong>Mapping Nanopore reads</strong></li>
</ul><p><br /><span>BBMap.sh has a length cap of 6kbp. Reads longer than this will be broken into 6kbp pieces and mapped independently.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ mapPacBio.sh -Xmx20g k=7 in=reads.fastq ref=reference.fa maxlen=1000 minlen=200 idtag ow int=f qin=33 out=mapped1.sam minratio=0.15 ignorequality slow ordered maxindel1=40 maxindel2=400</pre></div><p><br /><span>The "maxlen" flag shreds them to a max length of 1000; you can set that up to 6000. But I found 1000 gave a higher mapping rate.&nbsp;&nbsp;</span><br /><br /></p><ul>
<li><strong>Using Paired-end and single-end reads at the same time</strong></li>
</ul><p><br /><span>BBMap itself can only run in single-ended or paired-end mode in a single run, but the bbwrap.sh wrapper can handle both at once, like this:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ bbwrap.sh in1=read1.fq,singletons.fq in2=read2.fq,null out=mapped.sam append</pre></div><p><span>This will write all the reads to the same output file but only print the headers once. I have not tried that for bam output, only sam output.</span><br /><br /><span>Note about alignment stats: For paired reads, you can find the total percent mapped by adding the read 1 percent (where it says "mapped: N%") and read 2 percent, then dividing by 2. The different columns tell you the count/percent of each event. Considering the cigar strings from alignment, "Match Rate" is the number of symbols indicating a reference match (=) and error rate is the number indicating substitution, insertion, or deletion (X, I, D).</span><br /><br /></p><ul>
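</ul><p><span>The paired-read arithmetic in the note above can be sketched in the shell. This is only an illustration: "total_mapped_pct" is a hypothetical helper name (it is not part of BBTools), and the two percentages are made-up values that you would copy by hand from the "mapped: N%" lines of your own run.</span></p><div><div>Code:</div><pre dir="ltr">
```shell
# total_mapped_pct: hypothetical helper (not part of BBTools).
# Arguments: the read 1 and read 2 "mapped" percentages, without the % sign.
# Prints their mean, i.e. the overall percent mapped for paired reads.
total_mapped_pct() {
    awk -v r1="$1" -v r2="$2" 'BEGIN { printf "%.2f\n", (r1 + r2) / 2 }'
}

# Made-up example values, not real BBMap output.
total_mapped_pct 96.4 94.8
```
</pre></div><p><span>A plain mean is enough here because, in a paired run, read 1 and read 2 contribute the same number of reads.</span></p><ul>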
<li><strong>Exact matches when mapping small reads (e.g. miRNA)</strong></li>
</ul><p><br /><span>When mapping small RNAs with BBMap, use the following flags to report only perfect matches.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">ambig=all vslow perfectmode maxsites=1000</pre></div><p><span>It should be very fast in that mode (despite the vslow flag). Vslow mainly removes masking of low-complexity repetitive kmers, which is not usually a problem but can be with extremely short sequences like microRNAs.</span></p><ul>
<li><strong>Important note about BBMap alignments</strong></li>
</ul><p><br /><span>BBMap is always nondeterministic when run in paired-end mode with multiple threads, because the insert-size average is calculated on a per-thread basis, which affects mapping; and which reads are assigned to which thread is nondeterministic. The only way to avoid that would be to restrict it to a single thread (threads=1), or map the reads as single-ended and then fix pairing afterward:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">bbmap.sh in=reads.fq outu=unmapped.fq int=f
repair.sh in=unmapped.fq out=paired.fq fint outs=singletons.fq</pre></div><p><span>In this case you'd want to only keep the paired output.&nbsp;</span><br /><br /><span>BBSplit is based on BBMap, so it is also nondeterministic in paired mode with multiple threads. BBDuk and Seal (which can be used similarly to BBSplit) are always deterministic.&nbsp;</span><br /><br /><span>--------------------------------------------------------</span><br /><br /><strong>Reformat.sh</strong></p><ul>
<li><strong>Count k-mers/find unknown primers</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=reads.fq out=trimmed.fq ftr=19</pre></div><p><span>This will trim all but the first 20 bases (all bases after position 19, zero-based).</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ kmercountexact.sh in=trimmed.fq out=counts.txt fastadump=f mincount=10 k=20 rcomp=f</pre></div><p><span>This will generate a file containing the counts of all 20-mers that occurred at least 10 times, in a 2-column format that is easy to sort in Excel.&nbsp;</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">ACCGTTACCGTTACCGTTAC	100
AAATTTTTTTCCCCCCCCCC	85</pre></div><p><span>...etc. If the primers are 20bp long, they should be pretty obvious.&nbsp;&nbsp;</span></p><ul>
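</ul><p><span>The two-column counts file can also be ranked on the command line instead of in Excel. A minimal sketch, assuming the tab-separated layout shown above; "top_kmers" is a hypothetical helper name (not a BBTools command) that relies only on the standard sort and head utilities.</span></p><div><div>Code:</div><pre dir="ltr">
```shell
# top_kmers: hypothetical helper (not a BBTools command).
# Reads "kmer TAB count" lines on stdin and prints the N most frequent
# k-mers, highest count first (N defaults to 10).
top_kmers() {
    sort -k2,2nr | head -n "${1:-10}"
}

# Example with the two sample lines from the text. On real data, run:
#   cat counts.txt | top_kmers 20
printf 'AAATTTTTTTCCCCCCCCCC\t85\nACCGTTACCGTTACCGTTAC\t100\n' | top_kmers 1
```
</pre></div><p><span>Sorting numerically on the second column puts the most frequent 20-mers, which are the likely primers, at the top of the list.</span></p><ul>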
<li><strong>Convert SAM format from 1.4 to 1.3 (required for many programs)</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=reads.sam out=out.sam sam=1.3</pre></div><ul>
<li><strong>Removing N basecalls</strong></li>
</ul><p><br /><span>You can use BBDuk or Reformat with "qtrim=rl trimq=1". That will only trim trailing and leading bases with Q-score below 1, which means Q0, which means N (in either fasta or fastq format). The BBMap package automatically changes q-scores of Ns that are above 0 to 0 and called bases with q-scores below 2 to 2, since occasionally some Illumina software versions produce odd things like a handful of Q0 called bases or Ns with Q&gt;0, neither of which makes any sense in the Phred scale.</span></p><ul>
<li><strong>Sampling reads</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=reads.fq out=sampled.fq sample=3000</pre></div><div><div>Code:</div><pre dir="ltr">To sample 10% of the reads:
reformat.sh in1=reads1.fq in2=reads2.fq out1=sampled1.fq out2=sampled2.fq samplerate=0.1

or more concisely:
reformat.sh in=reads#.fq out=sampled#.fq samplerate=0.1

and for exact sampling:
reformat.sh in=reads#.fq out=sampled#.fq samplereadstarget=100k</pre></div><ul>
<li><strong>Changing fasta headers</strong></li>
</ul><p><br /><span>Remove anything after the first space in fasta header.&nbsp;</span><br /><br /></p><div><div>Code:</div><pre dir="ltr"> reformat.sh in=sequences.fasta out=renamed.fasta trd</pre></div><p><span>"trd" stands for "trim read description" and will truncate everything after the first whitespace.</span></p><ul>
<li><strong>Extract reads from a sam file</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=reads.sam out=reads.fastq</pre></div><ul>
<li><strong>Verify pairing and optionally de-interleave the reads</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=reads.fastq verifypairing</pre></div><ul>
<li><strong>Verify pairing if the reads are in separate files</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in1=r1.fq in2=r2.fq vpair</pre></div><p><span>If that completes successfully and says the reads were correctly paired, then you can simply de-interleave reads into two files like this:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=reads.fastq out1=r1.fastq out2=r2.fastq</pre></div><ul>
<li><strong>Base quality histograms</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=reads.fq qchist=qchist.txt</pre></div><p><span>That stands for "quality count histogram".&nbsp;</span></p><ul>
<li><strong>Filter SAM/BAM file by read length</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=x.sam out=y.sam minlength=50 maxlength=200</pre></div><ul>
<li><strong>Filter SAM/BAM file to detect/filter spliced reads</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=mapped.bam out=filtered.bam maxdellen=50</pre></div><p><span>You can set "maxdellen" to whatever length deletion event you consider the minimum to signify splicing, which depends on the organism.</span><br /><span>-------------------------------------------------------------</span><br /><strong>Repair.sh</strong></p><ul>
<li><strong>"Re-pair" out-of-order reads from paired-end data files</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ repair.sh in1=r1.fq.gz in2=r2.fq.gz out1=fixed1.fq.gz out2=fixed2.fq.gz outsingle=singletons.fq.gz</pre></div><p><span>--------------------------------------------------------------</span><br /><strong>BBMerge.sh</strong><br /><br /><span>BBMerge now has a new flag - "outa" or "outadapter". This allows you to automatically detect the adapter sequence of reads with short insert sizes, in case you don't know what adapters were used. It works like this:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ bbmerge.sh in=reads.fq outa=adapters.fa reads=1m</pre></div><p><span>Of course, it will only work for paired reads! The output fasta file will look like this:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">&gt;Read1_adapter
GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG
&gt;Read2_adapter
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCTCGTATGCCGTCTTCTGCTTG</pre></div><p><span>If you have multiplexed things with different barcodes in the adapters, the part with the barcode will show up as Ns, like this:</span><br /><br /><span>GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG&nbsp;&nbsp;</span><br /><br /><span>Note: For BBMerge with micro-RNA, you need to add the flag&nbsp;</span><strong>mininsert=17</strong><span>. The default is 35, which is too long for micro-RNA libraries.&nbsp;</span></p><ul>
<li><strong>Identifying adapters</strong></li>
</ul><p><span>If you have paired reads, and enough of the reads have inserts shorter than read length, you can identify adapter sequences with BBMerge, like this (they will be printed to adapters.fa):</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ bbmerge.sh in1=r1.fq in2=r2.fq outa=adapters.fa</pre></div><p><br /><span>-----------------------------------------------------------------</span><br /><br /><strong>BBDuk.sh</strong><br /><br /><span>Note: BBDuk is strictly deterministic on a per-read basis; however, by default it reorders the reads when run multithreaded. You can add the flag "ordered" to keep output reads in the same order as input reads.</span></p><ul>
<li><strong>Finding reads with a specific sequence at the beginning of read</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ bbduk.sh -Xmx1g in=reads.fq outm=matched.fq outu=unmatched.fq restrictleft=25 k=25 literal=AAAAACCCCCTTTTTGGGGGAAAAA</pre></div><p><span>In this case, all reads starting with "AAAAACCCCCTTTTTGGGGGAAAAA" will end up in "matched.fq" and all other reads will end up in "unmatched.fq". Specifically, the command means "look for 25-mers in the leftmost 25 bp of the read", which will require an exact prefix match, though you can relax that if you want.</span><br /><br /><span>So you could bin all the reads with your known sequence, then look at the remaining reads to see what they have in common. You can do the same thing with the tail of the read using "restrictright" instead, though you can't use both restrictions at the same time.&nbsp;&nbsp;</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ bbduk.sh in=reads.fq outm=matched.fq literal=NNNNNNCCCCGGGGGTTTTTAAAAA k=25 copyundefined</pre></div><p><span>With the "copyundefined" flag, a copy of each reference sequence will be made representing every valid combination of defined letter. So instead of increasing memory or time use by 6^75, it only increases them by 4^6 or 4096 which is completely reasonable, but it only allows substitutions at predefined locations. You can use the "copyundefined", "hdist", and "qhdist" flags together for a lot of flexibility - for example, hdist=2 qhdist=1 and 3 Ns in the reference would allow a hamming distance of 6 with much lower resource requirements than hdist=6. Just be sure to give BBDuk as much memory as possible.</span></p><ul>
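</ul><p><span>To see where the 4^6 = 4096 figure comes from, the expansion of N positions can be sketched in plain shell. The function expand_ns is a made-up name for illustration and is not how BBDuk stores the copies internally; it simply enumerates every A/C/G/T substitution for each N. It relies on "local", which bash and dash support.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr"># Hypothetical sketch: print every sequence obtained by replacing
# each N in the argument with A, C, G and T.
expand_ns() {
    local seq prefix suffix base
    seq=$1
    case $seq in
        *N*)
            prefix=${seq%%N*}   # text before the first N
            suffix=${seq#*N}    # text after the first N
            for base in A C G T; do
                expand_ns "$prefix$base$suffix"
            done
            ;;
        *)
            printf '%s\n' "$seq"
            ;;
    esac
}

# Two Ns give 4^2 = 16 copies; the six Ns in the literal above give 4^6 = 4096.
expand_ns NNCCCC | wc -l</pre></div><ul>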
<li><strong>Removing illumina adapters (if exact adapters not known)</strong></li>
</ul><p><br /><span>If you're not sure which adapters are used, you can add "ref=truseq.fa.gz,truseq_rna.fa.gz,nextera.fa.gz" and get them all (this will increase the amount of overtrimming, though it should still be negligible).&nbsp;</span></p><ul>
<li><strong>Removing illumina control sequences/phiX reads</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality</pre></div><ul>
<li><strong>Identify certain reads that contain a specific sequence</strong></li>
</ul><div><div>Code:</div><pre dir="ltr">$ bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq literal=ACGTACGTACGTACGTAC k=18 mm=f hdist=2</pre></div><p><span>Make sure "k" is set to the exact length of the sequence. "hdist" controls the number of substitutions allowed. "outm" gets the reads that match. By default this also looks for the reverse-complement; you can disable that with "rcomp=f".&nbsp;&nbsp;</span></p><ul>
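</ul><p><span>Since "k" has to equal the length of the literal, one option is to let the shell compute it. The snippet below is a sketch that only builds and prints the command line from the example above rather than running it; the variable names are arbitrary.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">literal=ACGTACGTACGTACGTAC
k=${#literal}    # length of the literal, 18 here

# Print the command instead of executing it, so it can be inspected first.
cmd="bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq literal=$literal k=$k mm=f hdist=2"
printf '%s\n' "$cmd"</pre></div><ul>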
<li><strong>Extract sequences that share kmers with your sequences with BBDuk</strong></li>
</ul><div><div>Code:</div><pre dir="ltr">$ bbduk.sh in=a.fa ref=b.fa out=c.fa mkf=1 mm=f k=31</pre></div><p><span>This will print to C all the sequences in A that share 100% of their 31-mers with sequences in B.&nbsp;</span><br /><br /></p><ul>
<li><strong>Extract sequences that contain N's with BBDuk</strong></li>
</ul><div><div>Code:</div><pre dir="ltr">bbduk.sh in=reads.fq out=readsWithoutNs.fq outm=readsWithNs.fq maxns=0</pre></div><p><span>If you have, say, 100bp reads and only want to separate reads containing all 100 Ns, change that to "maxns=99".</span><br /><br /><strong>General notes for BBDuk.sh</strong><span>&nbsp;</span><br /><br /><span>BBDuk can operate in one of 4 kmer-matching modes:</span><br /><span>Right-trimming (ktrim=r), left-trimming (ktrim=l), masking (ktrim=n), and filtering (default). But it can only do one at a time because all kmers are stored in a single table. It can still do non-kmer-based operations such as quality trimming at the same time.</span><br /><br /><span>BBDuk2 can do all 4 kmer operations at once and is designed for integration into automated pipelines where you do contaminant removal and adapter-trimming in a single pass to minimize filesystem I/O. Personally, I never use BBDuk2 from the command line. Both have identical capabilities and functionality otherwise, but the syntax is different.</span><br /><br /><span>------------------------------------------------------------------</span><br /><br /><strong>Randomreads.sh</strong></p><ul>
<li><strong>Generate random reads in various formats</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000</pre></div><p><span>You can specify paired reads, an insert size distribution, read lengths (or length ranges), and so forth. But because I developed it to benchmark mapping algorithms, it is specifically designed to give excellent control over mutations. You can specify the number of SNPs, insertions, deletions, and Ns per read, either exactly or probabilistically; the lengths of these events are individually customizable; the quality values can alternatively be set to allow errors to be generated on the basis of quality; there's a PacBio error model; and all of the reads are annotated with their genomic origin, so you will know the correct answer when mapping.</span><br /><br /><span>Bear in mind that 50% of the reads are going to be generated from the plus strand and 50% from the minus strand. So, either a read will match the reference perfectly, OR its reverse-complement will match perfectly.</span><br /><br /><span>You can generate the same set of reads with and without SNPs by fixing the seed to a positive number, like this:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ randomreads.sh maxsnps=0 adderrors=false out=perfect.fastq reads=1000 minlength=18 maxlength=55 seed=5

$ randomreads.sh maxsnps=2 snprate=1 adderrors=false out=2snps.fastq reads=1000 minlength=18 maxlength=55 seed=5</pre></div><p><span>[As of BBMap v. 36.59] randomreads.sh gained the ability to simulate metagenomes.&nbsp;</span><br /><br /><span>coverage=X will automatically set "reads" to a level that will give X average coverage (decimal point is allowed).</span><br /><br /><span>metagenome will assign each scaffold a random exponential variable, which decides the probability that a read is generated from that scaffold. So, if you concatenate together 20 bacterial genomes, you can run randomreads and get a metagenomic-like distribution. It could also be used for RNA-seq when using a transcriptome reference.</span><br /><br /><span>The coverage is decided on a per-reference-sequence level, so if a bacterial assembly has more than one contig, you may want to glue them together first with fuse.sh before concatenating them with the other references.&nbsp;</span><br /><br /></p><ul>
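</ul><p><span>As a back-of-the-envelope check on what "coverage=X" implies, the usual relation is reads = coverage x genome length / read length. The awk sketch below applies that textbook formula and rounds up; it is an assumption that randomreads.sh rounds or counts pairs the same way, and reads_for_coverage is a hypothetical helper name.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr"># Hypothetical helper: reads needed for a target average coverage.
# Arguments: coverage, genome length in bp, read length in bp.
reads_for_coverage() {
    awk -v cov="$1" -v glen="$2" -v rlen="$3" \
        'BEGIN { n = cov * glen / rlen; r = int(n); if (n > r) r++; printf "%d\n", r }'
}

# 2.5x coverage of a 5 Mbp genome with 100 bp reads
reads_for_coverage 2.5 5000000 100</pre></div><ul>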
<li><strong>Simulate a jump library</strong></li>
</ul><p><br /><span>You can simulate a 4000bp jump library from your existing data like this.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ cat assembly1.fa assembly2.fa &gt; combined.fa
$ bbmap.sh ref=combined.fa
$ randomreads.sh reads=1000000 length=100 paired interleaved mininsert=3500 maxinsert=4500 bell perfect=1 q=35 out=jump.fq.gz</pre></div><p><span>--------------------------------------------------------------</span><br /><strong>Shred.sh</strong><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ shred.sh in=ref.fasta out=reads.fastq length=200</pre></div><p><span>The difference is that RandomReads will make reads in a random order from random locations, ensuring flat coverage on average, but it won't ensure 100% coverage unless you generate many fold depth. Shred, on the other hand, gives you exactly 1x depth and exactly 100% coverage (and is not capable of modelling errors). So, the use-cases are different.&nbsp;</span><br /><span>---------------------------------------------------------------</span><br /><strong>Demuxbyname.sh</strong></p><ul>
<li><strong>Demultiplex fastq files when the tag is present in the fastq read header (illumina)</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...
outu=filename</pre></div><p><span>"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.</span><br /><br /><span>In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.</span><br /><br /><span>----------------------------------------------------------------</span><br /><br /><strong>Readlength.sh</strong></p><ul>
<li><strong>Plotting the length distribution of reads</strong></li>
</ul><div><div>Code:</div><pre dir="ltr">$ readlength.sh in=file out=histogram.txt bin=10 max=80000</pre></div><p><span>That will plot the result in bins of size 10, with everything above 80k placed in the same bin. The defaults are set for relatively short sequences so if they are many megabases long you may need to add the flag "-Xmx8g" and increase "max=" to something much higher.</span><br /><br /><span>Alternatively, if these are assemblies and you're interested in continuity information (L50, N50, etc), you can run stats on each or statswrapper on all of them:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">stats.sh in=file</pre></div><p><span>or</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">statswrapper.sh in=file,file,file,file&hellip;</pre></div><p><span>----------------------------------------------------------------</span><br /><strong>Filterbyname.sh</strong><br /><br /><span>By default, "filterbyname" discards reads with names in your name list, and keeps the rest. 
To include them and discard the others, do this:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ filterbyname.sh in=003.fastq out=filter003.fq names=names003.txt include=t</pre></div><p><span>----------------------------------------------------------------</span><br /><strong>getreads.sh</strong><br /><br /><span>If you only know the number(s) of the fasta/fastq record(s) in a file (records start at 0) then you can use the following command to extract those reads in a new file.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ getreads.sh in= id=&lt;number,number,number...&gt; out=</pre></div><p><span>The first read (or pair) has ID 0, the second read (or pair) has ID 1, etc.</span><br /><br /><span>Parameters:</span><br /><span>in= Specify the input file, or stdin.</span><br /><span>out= Specify the output file, or stdout.</span><br /><span>id= Comma delimited list of numbers or ranges, in any order.</span><br /><span>For example: id=5,93,17-31,8,0,12-13&nbsp;</span><br /><span>----------------------------------------------------------------</span><br /><strong>Splitsam.sh</strong></p><ul>
<li><strong>Splits a sam file into forward and reverse reads</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">splitsam.sh mapped.sam plus.sam minus.sam unmapped.sam
reformat.sh in=plus.sam out=plus.fq
reformat.sh in=minus.sam out=minus.fq rcomp</pre></div><p><span>----------------------------------------------------------------</span><br /><strong>BBSplit.sh</strong><br /><br /><span>BBSplit now has the ability to output paired reads in dual files using the # symbol. For example:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ bbsplit.sh ref=x.fa,y.fa in1=read1.fq in2=read2.fq basename=o%_#.fq</pre></div><p><span>will produce ox_1.fq, ox_2.fq, oy_1.fq, and oy_2.fq</span><br /><br /><span>You can use the # symbol for input also, like "in=read#.fq", and it will get expanded into 1 and 2.&nbsp;&nbsp;</span><br /><br /><strong>Added feature:&nbsp;</strong><span>One can specify a directory for the "ref=" argument. If anything in the list is a directory, it will use all fasta files in that directory. They need a fasta extension, like .fa or .fasta, but can be compressed with an additional .gz after that. This is useful because BBSplit splits the input into one output file per reference file.</span><br /><br /><br /><strong>NOTE: 1</strong><span>&nbsp;By default BBSplit uses fairly strict mapping parameters; you can get the same sensitivity as BBMap by adding the flags "minid=0.76 maxindel=16k minhits=1". With those parameters it is extremely sensitive.</span><br /><br /><strong>NOTE: 2</strong><span>&nbsp;BBSplit has different ambiguity settings for dealing with reads that map to multiple genomes. In any case, if the alignment score is higher to one genome than another, it will be associated with that genome only (this considers the combined scores of read pairs - pairs are always kept together). 
But when a read or pair has two identically-scoring mapping locations, on different genomes, the behavior is controlled by the "ambig2" flag - "ambig2=toss" will discard the read, "all" will send it to all output files, and "split" will send it to a separate file for ambiguously-mapped reads (one per genome to which it maps).</span><br /><br /><strong>NOTE: 3</strong><span>&nbsp;Zero-count lines are suppressed by default, but they should be printed if you include the flag "nzo=f" (nonzeroonly=false).&nbsp;</span><br /><br /><strong>NOTE: 4</strong><span>&nbsp;BBSplit needs multiple reference files as input; one per organism, or one for target and another for everything else. It only outputs one file per reference file.</span><br /><br /><span>Seal.sh, on the other hand, which is similar, can use a single concatenated file, as it (by default) will output one file per reference sequence within a concatenated set of references.&nbsp;</span><br /><span>--------------------------------------------------------------</span><br /><strong>Pileup.sh</strong></p><ul>
<li><strong>To generate transcript coverage stats</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ pileup.sh in=mapped.sam normcov=normcoverage.txt normb=20 stats=stats.txt</pre></div><p><span>That will generate coverage per transcript, with 20 lines per transcript, each line showing the coverage for that fraction of the transcript. "stats" will contain other information like the fraction of bases in each transcript that was covered.&nbsp;</span></p><ul>
<li><strong>To calculate physical coverage stats (region covered by paired-end reads)&nbsp;</strong></li>
</ul><p><span>BBMap has a "physcov" flag that allows it to report physical rather than sequenced coverage. It can be used directly in BBMap, or with pileup, if you already have a sam file. For example:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ pileup.sh in=mapped.sam covstats=coverage.txt physcov=t</pre></div><ul>
<li><strong>Calculating coverage of the genome</strong></li>
</ul><p><br /><span>The program will take sam or bam input, sorted or unsorted.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ pileup.sh in=mapped.sam out=stats.txt hist=histogram.txt</pre></div><p><span>stats.txt will contain the average depth and percent covered of each reference sequence; the histogram will contain the exact number of bases at each coverage level. You can also get per-base coverage or binned coverage if you want to plot the coverage. It also generates median and standard deviation, and so forth.</span><br /><br /><span>It's also possible to generate coverage directly from BBMap, without an intermediate sam file, like this:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ bbmap.sh in=reads.fq ref=reference.fasta nodisk covstats=stats.txt covhist=histogram.txt</pre></div><p><span>We use this a lot in situations where all you care about is coverage distributions, which is somewhat common in metagenome assemblies. It also supports most of the flags that pileup.sh supports, though the syntax is slightly different to prevent collisions. In each case you can see all the possible flags by running the shellscript with no arguments.</span></p><ul>
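</ul><p><span>To make the terms "average depth", "percent covered" and "histogram" concrete, here is a toy awk sketch that derives them from a list of per-base depths (one integer per line). The input layout and the name coverage_summary are assumptions for illustration only; the real pileup.sh output files have their own columns.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr"># Hypothetical helper: summarize per-base depths read from stdin.
coverage_summary() {
    awk '{ total += $1; hist[$1]++
           if ($1 > 0) covered++
           if ($1 > max) max = $1 }
         END {
             if (NR == 0) exit                      # nothing to summarize
             printf "avg_depth\t%.2f\n", total / NR
             printf "percent_covered\t%.1f\n", 100 * covered / NR
             # Histogram: number of bases at each depth from 0 to the maximum.
             for (d = 0; max >= d; d++) printf "%d\t%d\n", d, hist[d] + 0
         }'
}

# Four invented positions with depths 0, 2, 2 and 5.
printf '0\n2\n2\n5\n' | coverage_summary</pre></div><ul>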
<li><strong>To bin aligned reads</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ pileup.sh in=mapped.sam out=stats.txt bincov=coverage.txt binsize=1000</pre></div><p><span>That will give coverage within each bin. For read density regardless of read length, add the "startcov=t" flag.&nbsp;&nbsp;</span><br /><br /><span>--------------------------------------------------------------</span><br /><strong>Dedupe.sh</strong><br /><br /><span>Dedupe ensures that there is at most one copy of any input sequence, optionally allowing contaminants (substrings) to be removed, and a variable hamming or edit distance to be specified. Usage:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ dedupe.sh in=assembly1.fa,assembly2.fa out=merged.fa</pre></div><p><span>That will absorb exact duplicates and containments. You can use "hdist" and "edist" flags to allow mismatches, or get a complete list of flags by running the shellscript with no arguments.&nbsp;&nbsp;</span><br /><br /><span>Dedupe&nbsp;</span><span style="text-decoration: underline;">will merge assemblies</span><span>, but it&nbsp;</span><span style="text-decoration: underline;">will not produce consensus sequences or join overlapping reads</span><span>; it only removes sequences that are fully contained within other sequences (allowing the specified number of mismatches or edits).</span><br /><br /><span>Dedupe can remove duplicate reads from multiple files simultaneously, if they are comma-delimited (e.g. in=file1.fastq,file2.fastq,file3.fastq). And if you set the flag "uniqueonly=t" then ALL copies of duplicate reads will be removed, as opposed to the default behavior of leaving one copy of duplicate reads.</span><br /><br /><span>However, it does not care which file a read came from; in other words, it can't remove only reads that are duplicates across multiple files but leave the ones that are duplicates within a file. 
That can still be accomplished, though, like this:</span><br /><br /><span>1) Run dedupe on each sample individually, so that there is at most one copy of each read per sample.</span><br /><span>2) Run dedupe again on all of the samples together, with "uniqueonly=t". The only remaining duplicate reads will be the ones duplicated between samples, so that's all that will be removed.&nbsp;&nbsp;</span><br /><br /><span>--------------------------------------------------------------</span></p><ul>
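</ul><p><span>The logic of the two-step Dedupe procedure above can be illustrated with sort and uniq on plain text, where each line stands in for a read sequence. This is only an analogy under that assumption (it is not what dedupe.sh does internally, and the sample data are invented): step 1 collapses duplicates within each sample, so anything still duplicated in step 2 must come from different samples.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr"># Invented toy data: one "read" per line.
sample_a='AAA
AAA
CCC
GGG'
sample_b='CCC
TTT
TTT'

# Step 1: at most one copy of each read per sample.
dedup_a=$(printf '%s\n' "$sample_a" | sort -u)
dedup_b=$(printf '%s\n' "$sample_b" | sort -u)

# Step 2: pool the samples and keep only reads seen once, which drops
# every copy of a read shared between samples (like "uniqueonly=t").
printf '%s\n%s\n' "$dedup_a" "$dedup_b" | sort | uniq -u</pre></div><p><span>Here CCC occurs in both samples and is removed entirely, while AAA and TTT, which were duplicated only within a single sample, each survive as one copy.</span></p><ul>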
<li><strong>Generate ROC curves from any aligner</strong></li>
</ul><p><br /><strong>1) Index the reference<br /><br /></strong></p><div><div>Code:</div><pre dir="ltr">$ bbmap.sh ref=reference.fasta</pre></div><p><br /><strong>2) Generate random reads</strong><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ randomreads.sh reads=100000 length=100 out=synth.fastq maxq=35 midq=25 minq=15</pre></div><p><strong>3) Map to produce a sam file</strong><br /><br /><span>...substitute this command with the appropriate one from your aligner of choice</span></p><div><div>Code:</div><pre dir="ltr">$ bbmap.sh in=synth.fastq out=mapped.sam</pre></div><p><strong>4) Generate ROC curve</strong><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ samtoroc.sh in=mapped.sam reads=100000</pre></div><p><span>--------------------------------------------------------------</span></p><ul>
<li><strong>Calculate heterozygous rate for sequence data</strong></li>
</ul><p><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ kmercountexact.sh in=reads.fq khist=histogram.txt peaks=peaks.txt</pre></div><p><span>You can examine the histogram manually, or use the "peaks" file which tells you the number of unique kmers in each peak on the histogram. For a diploid, the first peak will be the het peak, the second will be the homozygous peak, and the rest will be repeat peaks. The peak caller is not perfect, though, so particularly with noisy data I would only rely on it for the first two peaks, and try to quantify the higher-order peaks manually if you need to (which you generally don't).</span><br /><br /><span>-----------------------------------------------------------------</span></p><ul>
<li><strong>Compare mapped reads between two files</strong></li>
</ul><p><br /><span>To see how many mapped reads (can be mapped concordant or discordant, doesn't matter) are shared between the two alignment files and how many mapped reads are unique to one file or the other.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ reformat.sh in=file1.sam out=mapped1.sam mappedonly
$ reformat.sh in=file2.sam out=mapped2.sam mappedonly</pre></div><p><span>That gets you the mapped reads only. Then:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ filterbyname.sh in=mapped1.sam names=mapped2.sam out=shared.sam include=t</pre></div><p><span>...which gets you the set intersection;</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ filterbyname.sh in=mapped1.sam names=mapped2.sam out=only1.sam include=f
$ filterbyname.sh in=mapped2.sam names=mapped1.sam out=only2.sam include=f</pre></div><p><span>...which get you the set subtractions.&nbsp;&nbsp;</span><br /><br /><span>--------------------------------------------------------------</span><br /><br /><strong>BBrename.sh</strong></p><div><div>Code:</div><pre dir="ltr">$ bbrename.sh in=old.fasta out=new.fasta</pre></div><p><span>That will rename the reads as 1, 2, 3, 4, ... 222.</span><br /><br /><span>You can also give a custom prefix if you want. The input has to be text format, not .doc.&nbsp;&nbsp;</span><br /><br /><span>---------------------------------------------------------------------</span><br /><br /><strong>BBfakereads.sh</strong></p><ul>
<li><strong>Generating &ldquo;fake&rdquo; paired end reads from a single end read file</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ bbfakereads.sh in=reads.fastq out1=r1.fastq out2=r2.fastq length=100</pre></div><p><span>That will generate fake pairs from the input file, with whatever length you want (maximum of input read length). We use it in some cases for generating a fake LMP library for scaffolding from a set of contigs. Read 1 will be from the left end, and read 2 will be reverse-complemented and from the right end; both will retain the correct original qualities. " /1" and " /2" will be appended to the read names.&nbsp;&nbsp;</span><br /><br /><span>------------------------------------------------------------------</span><br /><strong>Randomreads.sh</strong></p><ul>
<li><strong>Generate random reads</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ randomreads.sh ref=genome.fasta out=reads.fq len=100 reads=10000</pre></div><p><span>"seed=-1" will use a random seed; any other value will use that specific number as the seed.</span><br /><br /><span>--------------------------------------------------------------------</span></p><ul>
<li><strong>Generate saturation curves to assess sequencing depth</strong></li>
</ul><p>&nbsp;</p><div><div>Code:</div><pre dir="ltr">$ bbcountunique.sh in=reads.fq out=histogram.txt</pre></div><p><span>It works by pulling kmers from each input read, and testing whether it has been seen before, then storing it in a table.</span><br /><br /><span>The bottom line, "first", tracks whether the first kmer of the read has been seen before (independent of whether it is read 1 or read 2).</span><br /><br /><span>The top line, "pair", indicates whether a combined kmer from both read 1 and read 2 has been seen before. The other lines are generally safe to ignore but they track other things, like read1- or read2-specific data, and random kmers versus the first kmer.</span><br /><br /><span>It plots a point every X reads (configurable, default 25000).</span><br /><br /><span>In noncumulative mode (default), a point indicates "for the last X reads, this percentage had never been seen before". In this mode, once the line hits zero, sequencing more is not useful.</span><br /><br /><span>In cumulative mode, a point indicates "for all reads, this percentage had never been seen before", but still only one point is plotted per X reads.</span><br /><br /><span>-----------------------------------------------------------------</span><br /><strong>CalcTrueQuality.sh</strong><br /><br /><a href="http://seqanswers.com/forums/showthread.php?p=170904" target="_blank">http://seqanswers.com/forums/showthread.php?p=170904</a><br /><br /><span>In light of the quality-score issues with the NextSeq platform, and the possibility of future Illumina platforms (HiSeq 3000 and 4000) also using quantized quality scores, I developed it for recalibrating the scores to ensure accuracy and restore the full range of values.</span><br /><br /><span>-----------------------------------------------------------------</span><br /><br /><strong>BBMapskimmer.sh</strong><br /><br /><span>BBMap is designed to find the best mapping, and heuristics will cause it to ignore mappings that are valid 
but substantially worse. Therefore, I made a different version of it, BBMapSkimmer, which is designed to find all of the mappings above a certain threshold. The shellscript is bbmapskimmer.sh and the usage is similar to bbmap.sh or mapPacBio.sh. For primers, which I assume will be short, you may wish to use a lower than default K of, say, 10 or 11, and add the "slow" flag.</span><br /><br /><span>--------------------------------------------------------------</span><br /><br /><strong>msa.sh and cutprimers.sh</strong><br /><br /><span>Quoted from Brian's response directly.</span><br /><br /><span>I also wrote another pair of programs specifically for working with primer pairs, msa.sh and cutprimers.sh. msa.sh will forcibly align a primer sequence (or a set of primer sequences) against a set of reference sequences to find the single best matching location per reference sequence - in other words, if you have 3 primers and 100 ref sequences, it will output a sam file with exactly 100 alignments - one per ref sequence, using the primer sequence that matched best. Of course you can also just run it with 1 primer sequence.</span><br /><br /><span>So you run msa twice - once for the left primer, and once for the right primer - and generate 2 sam files. Then you feed those into cutprimers.sh, which will create a new fasta file containing the sequence between the primers, for each reference sequence. 
We used these programs to synthetically cut V4 out of full-length 16S sequences.</span><br /><br /><span>I should say, though, that the primer sites identified are based on the normal BBMap scoring, which is not necessarily the same as where the primers would bind naturally, though with highly conserved regions there should be no difference.</span><br /><br /><span>------------------------------------------------------</span><br /><strong>testformat.sh</strong><br /><br /><strong>Identify type of Q-score encoding in sequence files</strong><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ testformat.sh in=seq.fq.gz
sanger    fastq    gz    interleaved    150bp</pre></div><p><span>--------------------------------------------------</span><br /><strong>kcompress.sh</strong><br /><br /><span>Newest member of BBTools. Identify constituent k-mers.&nbsp;</span><br /><a href="http://seqanswers.com/forums/showthread.php?t=63258" target="_blank">http://seqanswers.com/forums/showthread.php?t=63258</a><br /><br /><span>----------------------------------------------------</span><br /><strong>commonkmers.sh</strong><br /><br /><span>Find all k-mers for a given sequence.</span></p><div><div>Code:</div><pre dir="ltr">$ commonkmers.sh in=reads.fq out=kmers.txt k=4 count=t display=999</pre></div><p><span>Will produce output that looks like</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">MISEQ05:239:000000000-A74HF:1:2110:14788:23085	ATGA=8	ATGC=6	GTCA=6	AAAT=5	AAGC=5	AATG=5	AGCA=5	ATAA=5	ATTA=5	CAAA=5	CATA=5	CATC=5	CTGC=5	AACC=4	AACG=4	AAGA=4	ACAT=4	ACCA=4	AGAA=4	ATCA=4	ATGG=4	CAAG=4	CCAA=4	CCTC=4	CTCA=4	CTGA=4	CTTC=4	GAGC=4	GGTA=4	GTAA=4	GTTA=4	AAAA=3	AAAC=3	AAGT=3	ACCG=3	ACGG=3	ACTG=3	AGAT=3	AGCT=3	AGGA=3	AGTA=3	AGTC=3	CAGC=3	CATG=3	CGAG=3	CGGA=3	CGTC=3	CTAA=3	CTCC=3	CTTA=3	GAAA=3	GACA=3	GACC=3	GAGA=3	GCAA=3	GGAC=3	TCAA=3	TGCA=3	AAAG=2	AACA=2	AATA=2	AATC=2	ACAA=2	ACCC=2	ACCT=2	ACGA=2	ACGC=2	AGAC=2	AGCG=2	AGGC=2	CAAC=2	CAGG=2	CCGC=2	GCCA=2	GCTA=2	GGAA=2	GGCA=2	TAAA=2	TAGA=2	TCCA=2	TGAA=2	AAGG=1	AATT=1	ACGT=1	AGAG=1	AGCC=1	AGGG=1	ATAC=1	ATAG=1	ATTG=1	CACA=1	CACG=1	CAGA=1	CCAC=1	CCCA=1	CCGA=1	CCTA=1	CGAC=1	CGCA=1	CGCC=1	CGCG=1	CGTA=1	CTAC=1	GAAC=1	GCGA=1	GCGC=1	GTAC=1	GTGA=1	TTAA=1</pre></div><p><span>-----------------------------------------------------</span><br /><strong>Mutate.sh</strong><br /><br /><span>Simulate multiple mutants from a known reference (e.g.&nbsp;</span><em>E. coli</em><span>).</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">$ mutate.sh in=e_coli.fasta out=mutant.fasta id=99 
$ randomreads.sh ref=mutant.fasta out=reads.fq.gz reads=5m length=150 paired adderrors</pre></div><p><span>That will create a mutant version of E. coli with 99% identity to the original, and then generate 5 million simulated read pairs from the new genome. You can repeat this multiple times; each mutant will be different.</span><br /><br /><span>------------------------------------</span><br /><br /><strong>partition.sh</strong><br /><br /><span>You can split a large dataset into smaller subsets with partition.sh (the example below splits the data into 8 chunks).</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8</pre></div><p><span>-----------------------------------</span><br /><strong>clumpify.sh</strong><br /><br /><span>If you are concerned about file size and want the files to be as small as possible, give Clumpify a try. It can losslessly reduce file size by around 30% by reordering the reads. I've found that this also typically accelerates subsequent analysis pipelines by a similar factor (up to 30%). Usage:</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz</pre></div><div><div>Code:</div><pre dir="ltr">clumpify.sh in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz out1=clumped_R1.fastq.gz out2=clumped_R2.fastq.gz</pre></div><ul>
<li><strong>Clumpify.sh can now mark/remove sequence duplicates (optical/PCR/otherwise) from NGS data</strong></li>
</ul><p><br /><span>This does NOT require alignments so it should prove more useful compared to Picard MarkDuplicates. Relevant options for clumpify.sh command are listed below.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">dedupe=f optical=f (default)
Nothing happens with regard to duplicates.

dedupe=t optical=f
All duplicates are detected, whether optical or not.  All copies except one are removed for each duplicate.

dedupe=f optical=t
Nothing happens.

dedupe=t optical=t
Only optical duplicates (those with an X or Y coordinate within dist) are detected.  All copies except one are removed for each duplicate.

The allduplicates flag removes all copies of each duplicate, rather than leaving a single copy.  Like optical, it has no effect unless dedupe=t.

Note: If you set "dupedist" to anything greater than 0, "optical" gets enabled automatically.</pre></div><p><span>-------------------------------------</span><br /><strong>fuse.sh</strong><br /><br /><span>Fuse will automatically reverse-complement read 2. The amount of N padding can be adjusted as necessary. This will, for example, create a full-size amplicon that can be used for alignments.</span><br /><br /></p><div><div>Code:</div><pre dir="ltr">fuse.sh in1=r1.fq in2=r2.fq pad=130 out=fused.fq fusepairs</pre></div>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/17926/orange-bioinformatics-2534</guid>
	<pubDate>Mon, 06 Oct 2014 12:51:37 -0500</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/17926/orange-bioinformatics-2534</link>
	<title><![CDATA[Orange-Bioinformatics 2.5.34]]></title>
	<description><![CDATA[<p>Orange Bioinformatics extends <a href="http://orange.biolab.si/">Orange</a>, a data mining software package, with common functionality for bioinformatics. The provided functionality can be accessed as a Python library or through a visual programming interface (Orange Canvas). The latter is also suitable for non-programmers.</p>
<p>Orange Bioinformatics provides access to publicly available data, such as GEO data sets, Biomart, GO, KEGG, Atlas, ArrayExpress, and the PIPAx database. Its analytics include gene selection, quality control, and scoring of distances between experiments with multiple factors. All features can be combined with powerful visualization, network exploration and data mining techniques from the Orange data mining framework.</p><p>Address of the bookmark: <a href="https://pypi.python.org/pypi/Orange-Bioinformatics/2.5.34" rel="nofollow">https://pypi.python.org/pypi/Orange-Bioinformatics/2.5.34</a></p>]]></description>
	<dc:creator>Robert M Willioms</dc:creator>
</item>

<item>
  <guid isPermaLink='true'>https://bioinformaticsonline.com/opportunity/view/18385/biinformamatics-lead-at-google-life-sciences</guid>
  <pubDate>Fri, 17 Oct 2014 02:24:55 -0500</pubDate>
  <link></link>
  <title><![CDATA[Bioinformatics Lead at Google Life Sciences]]></title>
  <description><![CDATA[
<p>Google Life Sciences is recruiting a technical lead with experience in bioinformatics and clinical bioinformatics, including for biomarker discovery projects such as the Baseline study.</p>

<p>Responsibilities</p>

<p>Lead teams of scientists in structuring, prototyping, and executing large-scale bioinformatics and other analyses.<br />Develop novel bioinformatics, statistical, data processing, pathway, data mining and other algorithms to identify biological signals and their clinical correlates across many kinds of individual and population data.<br />Develop novel platform-level analytical tools for sequence-based assays (assembly, annotation, variant calling and interpretation, phasing, genome structure, etc.), expression assays (RNA-seq and microarray), proteomics, and metabolomics.<br />Develop statistical models that robustly correlate complex laboratory-derived information with phenotypic and clinical information.<br />Create scientifically rigorous visualizations, communications, and presentations of results.</p>

<p>Reference @ https://www.google.com/about/careers/search#!t=jo&amp;jid=62095001</p>
]]></description>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/blog/view/44758/the-ifs-and-buts-of-ngs-quality-control-and-trimming</guid>
	<pubDate>Thu, 02 Jan 2025 20:11:07 -0600</pubDate>
	<link>https://bioinformaticsonline.com/blog/view/44758/the-ifs-and-buts-of-ngs-quality-control-and-trimming</link>
	<title><![CDATA[The &quot;Ifs&quot; and &quot;Buts&quot; of NGS Quality Control and Trimming]]></title>
	<description><![CDATA[<p>Next-Generation Sequencing (NGS) has revolutionized biological research, providing vast amounts of data for a wide range of applications. However, the reliability of NGS analyses heavily depends on the quality of raw sequencing data. Quality control (QC) and trimming are critical preprocessing steps that can make or break your downstream analyses. In this blog, we explore the "ifs" (why you should perform QC and trimming) and the "buts" (challenges or considerations) of this vital step in NGS workflows.</p><h3><strong>The "Ifs" of NGS QC and Trimming</strong></h3><ol>
<li>
<p><strong>Ensures Data Integrity</strong><br />If you want to minimize errors in downstream analyses, QC and trimming remove low-quality reads and bases, ensuring high-confidence data. This step is essential for reliable variant calling, assembly, and other applications.</p>
</li>
<li>
<p><strong>Removes Contaminants</strong><br />If adapter sequences or contaminants are present in the raw reads, trimming can eliminate them. This prevents issues like misalignment or incorrect biological interpretations, ensuring cleaner data for analysis.</p>
</li>
<li>
<p><strong>Improves Mapping and Assembly</strong><br />If your goal is better alignment to a reference genome or improved de novo assembly, trimming low-quality bases and adapters is critical. High-quality reads map more efficiently and generate more accurate assemblies.</p>
</li>
<li>
<p><strong>Reduces Computational Load</strong><br />If you want to save computational resources, trimming reduces the dataset size, which speeds up processing and analysis. Clean datasets mean less computational time spent on processing low-quality data.</p>
</li>
<li>
<p><strong>Prepares for Standardized Analyses</strong><br />If your project involves multiple datasets, QC and trimming ensure uniformity across them. This standardization makes comparisons valid and reproducible, particularly in large collaborative studies.</p>
</li>
</ol><h3><strong>The "Buts" of NGS QC and Trimming</strong></h3><ol>
<li>
<p><strong>Risk of Over-Trimming</strong><br />But excessive trimming can lead to the loss of informative sequences, reducing read depth and potentially discarding biologically relevant data. This is especially critical in studies with limited sequencing depth.</p>
</li>
<li>
<p><strong>Bias Introduction</strong><br />But trimming algorithms might introduce biases, especially if they inadvertently remove sequences with specific biological patterns. This can skew results and compromise biological insights.</p>
</li>
<li>
<p><strong>Loss of Context in Paired-End Reads</strong><br />But trimming one read in a pair more than the other can lead to loss of pairing information. This complicates downstream analyses that rely on paired-end data, such as structural variant detection.</p>
</li>
<li>
<p><strong>Time and Resource Intensive</strong><br />But running QC and trimming for large datasets can be computationally expensive and time-consuming. As sequencing depth increases, preprocessing becomes a bottleneck in the analysis pipeline.</p>
</li>
<li>
<p><strong>Variable Standards</strong><br />But the criteria for trimming (e.g., quality threshold, minimum read length) can vary between tools and datasets. This variability may affect reproducibility and comparability of results across studies.</p>
</li>
</ol><h3><strong>Balancing the "Ifs" and "Buts"</strong></h3><p>To maximize the benefits of QC and trimming while mitigating the challenges, consider the following best practices:</p><ul>
<li>
<p><strong>Use QC Tools Wisely:</strong> Start with tools like <strong>FastQC</strong> to identify quality issues in your raw data. Visualizing quality metrics helps tailor your trimming parameters.</p>
</li>
<li>
<p><strong>Choose Reliable Trimming Tools:</strong> Tools like <strong>Trimmomatic</strong>, <strong>Cutadapt</strong>, and <strong>BBduk</strong> offer adaptive and customizable trimming options. Select one that aligns with your dataset and project goals.</p>
</li>
<li>
<p><strong>Set Reasonable Parameters:</strong> Avoid over-trimming by setting quality thresholds and minimum read lengths that balance data retention and quality improvement.</p>
</li>
<li>
<p><strong>Test Downstream Effects:</strong> Validate the impact of QC and trimming on downstream analyses, such as alignment efficiency, variant calling accuracy, or assembly quality.</p>
</li>
<li>
<p><strong>Document Your Workflow:</strong> Maintain detailed records of the parameters and tools used for QC and trimming. This ensures reproducibility and enables better troubleshooting.</p>
</li>
</ul><h3><strong>Conclusion</strong></h3><p>NGS quality control and trimming are essential steps to ensure reliable and accurate data for analysis. While the "ifs" highlight the clear benefits of these steps, the "buts" remind us of the potential pitfalls. By adopting best practices and carefully balancing these considerations, you can optimize your preprocessing workflow and unlock the full potential of your sequencing data.</p>]]></description>
	<dc:creator>BioStar</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/bookmarks/view/19633/vital-it</guid>
	<pubDate>Thu, 18 Dec 2014 10:46:59 -0600</pubDate>
	<link>https://bioinformaticsonline.com/bookmarks/view/19633/vital-it</link>
	<title><![CDATA[Vital-IT]]></title>
	<description><![CDATA[<p>Vital-IT is a <strong>bioinformatics competence center</strong> that supports and collaborates with life scientists in Switzerland and beyond. The <a href="http://www.vital-it.ch/about/team.php">multi-disciplinary team</a> provides expertise and training, and maintains a high-performance computing (HPC) and storage infrastructure, to help develop, maintain and extend life science and medical research (<a href="http://www.vital-it.ch/about/activities.php">activities</a>).</p><p>Address of the bookmark: <a href="http://www.vital-it.ch/" rel="nofollow">http://www.vital-it.ch/</a></p>]]></description>
	<dc:creator>Abhi</dc:creator>
</item>

<item>
  <guid isPermaLink='true'>https://bioinformaticsonline.com/researchlabs/view/19648/mit-computational-biology-group</guid>
  <pubDate>Thu, 18 Dec 2014 14:47:01 -0600</pubDate>
  <link></link>
  <title><![CDATA[MIT Computational Biology Group]]></title>
  <description><![CDATA[
<p>My research group consists primarily of computer science graduate students and postdocs with expertise in algorithms, statistical inference and machine learning, who share a passion for understanding fundamental biological problems.</p>

<p>We work in a highly interdisciplinary environment at the interface of Computer Science and Biology. Since its inception, our lab has eagerly engaged in collaborative research partnerships with biological and experimental collaborators, facilitated by our affiliation with the Broad Institute and the Computational and Systems Biology initiative (CSBi) at MIT, our participation in the Epigenome Roadmap, ENCODE, and modENCODE consortia, and by several other ongoing collaborations at MIT, Harvard, and the Harvard Medical School affiliated hospitals.</p>

<p>http://compbio.mit.edu/</p>
]]></description>
</item>

</channel>
</rss>