<?xml version='1.0'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:atom="http://www.w3.org/2005/Atom" >
<channel>
	<title><![CDATA[BOL: Estimate Genome Size]]></title>
	<link>https://bioinformaticsonline.com/snippets/view/35642/estimate-genome-size?</link>
	<atom:link href="https://bioinformaticsonline.com/snippets/view/35642/estimate-genome-size?" rel="self" type="application/rss+xml" />
	<description><![CDATA[]]></description>
	
	<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/35642/estimate-genome-size</guid>
	<pubDate>Thu, 22 Feb 2018 03:28:26 -0600</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/35642/estimate-genome-size</link>
	<title><![CDATA[Estimate Genome Size]]></title>
	<description><![CDATA[<code># Count k-mer occurrence using Jellyfish 2.2.6
jellyfish count -t 8 -C -m 19 -s 5G -o 19mer_out --min-qual-char=? sread_1.fastq sread_2.fastq

# points for a histogram
jellyfish histo -o 19mer_out.histo 19mer_out

#Plot results using R
##load the data into dataframe19
dataframe19 &lt;- read.table(&quot;19mer_out.histo&quot;) 
##plots the data points 1 through 200 in the dataframe19 using a line
plot(dataframe19[1:200,], type=&quot;l&quot;)
##plot the data points from 2 through 100
points(dataframe19[2:100,])

#calculate the total k-mers in the distribution
#Assuming the total number of data points is 9325
sum(as.numeric(dataframe19[2:9325,1]*dataframe19[2:9325,2]))

#peak position and genome size
#plotted graph we can get an idea where the peak position lies
#see actual point 
data[10:20,] #If peak more likely to be between 10-20

#If 12 is the peak
sum(as.numeric(dataframe19[2:9325,1]*dataframe19[2:9325,2]))/12

#Compare the peak shape with Poisson distribution in R
singleC &lt;- sum(as.numeric(dataframe19[2:28,1]*dataframe19[2:28,2]))/12
poisdtb &lt;- dpois(1:100,12)*singleC
plot(poisdtb, type=&#039;l&#039;, lty=2, col=&quot;green&quot;)
lines(dataframe19[1:100,12] * singleC, type = &quot;l&quot;, col=3)#, Ity=2)
lines(dataframe19[1:100,],type= &quot;l&quot;)


#ALTERNATE  WAY
#https://github.com/dib-lab/khmer/blob/master/scripts/normalize-by-median.py
python normalize-by-median.py -x 1e8 -k 20 -C 20 -R report.txt reads.fa
#https://github.com/dib-lab/khmer-recipes/blob/master/003-estimate-genome-size/estimate-genome-size.py
python estimate-genome-size.py -C 20 -k 20 reads.fa.keep report.txt</code>]]></description>
	<dc:creator>Rahul Nayak</dc:creator>
</item>
<item>
	<guid isPermaLink='true'>https://bioinformaticsonline.com/snippets/view/35642/estimate-genome-size#item-annotation-3586</guid>
	<pubDate>Sat, 23 Feb 2019 15:06:22 -0600</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/35642/estimate-genome-size#item-annotation-3586</link>
	<title><![CDATA[Comment by Abhimanyu Singh]]></title>
	<description><![CDATA[<p># count k-mers (see jellyfish documentation for options)</p>
<p>gzip -dc reads1.fastq.gz reads2.fastq.gz | jellyfish count -m 31 -o fastq.counts -C -s 10000000000 -U 500 -t 30 /dev/fd/0</p>
<p># generate a histogram</p>
<p>jellyfish histo fastq.counts_0 &gt; fastq.counts_0.histo</p>
<p># generate a pdf graph of the histogram</p>
<p>jellyplot.pl fastq.counts_0.histo</p>
<p># look at fastq.counts_0.histo.pdf and identify the approximate peak</p>
<p># use find_valleys.pl to help pinpoint the actual peak</p>
<p>find_valleys.pl fastq.counts_0.histo</p>
<p># estimate the size and coverage</p>
<p>estimate_genome_size.pl --kmer=31 --peak=42 --fastq=reads1.fastq.gz reads2.fastq.gz</p>]]></description>
	<dc:creator>Abhimanyu Singh</dc:creator>
</item>

</channel>
</rss>