<?xml version='1.0'?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:georss="http://www.georss.org/georss" xmlns:atom="http://www.w3.org/2005/Atom" >
<channel>
	<title><![CDATA[BOL: Owner]]></title>
	<link>https://bioinformaticsonline.com/snippets/owner/surabhi?offset=10</link>
	<atom:link href="https://bioinformaticsonline.com/snippets/owner/surabhi?offset=10" rel="self" type="application/rss+xml" />
	<description><![CDATA[]]></description>
	
	<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43558/perl-onliner-to-check-the-ids-in-two-files</guid>
	<pubDate>Thu, 21 Oct 2021 07:21:10 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43558/perl-onliner-to-check-the-ids-in-two-files</link>
	<title><![CDATA[Perl onliner to check the ids in two files !]]></title>
	<description><![CDATA[<code>perl -lane &#039;BEGIN{open(A,&quot;ids2.txt&quot;); while(&lt;A&gt;){chomp; $k{$_}++}} if (defined($k{$F[0]})) {print &quot;$_\t$F[0]\t1&quot;} else {print &quot;$_\tNA\t0&quot;}; &#039; ids1.txt &gt; aaa.xls</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43556/simulate-the-reads</guid>
	<pubDate>Wed, 20 Oct 2021 04:52:40 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43556/simulate-the-reads</link>
	<title><![CDATA[Simulate the reads !]]></title>
	<description><![CDATA[<code># make reference for randomreads.sh
# randomreads.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.13 pbmax=0.17 \
reads=100 paired=f \
gaussianlength=t \
minlength=1000 midlength=20000 maxlength=100000 \
out=/dev/null




# make 60x haploid coverage for Illumina reads
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
coverage=30 paired=t maxinsert=550 mininsert=450 \
out1=illumina1.fastq.gz out2=illumina2.fastq.gz &gt; random_reads_illumina.log 2&gt;&amp;1




# interleave the paired-end reads
# reformat.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/reformat.sh \
in=illumina1.fastq.gz in2=illumina2.fastq.gz out=illumina.int.fastq 2&gt;/dev/null




# use KmerGenie 1.7051 to get an idea of k-mer with that produces longest N50
# http://kmergenie.bx.psu.edu/
mkdir -p /genetics/elbers/test/fly2/kmergenie-illumina-raw-reads

cd /genetics/elbers/test/fly2/kmergenie-illumina-raw-reads
/genetics/elbers/kmergenie-1.7051/kmergenie ../illumina.int.fastq \
&gt; kmergenie-illumina-raw-reads.log 2&gt;&amp;1
rm ../illumina.int.fastq

k=`grep &quot;^best k:&quot; \
kmergenie-illumina-raw-reads.log | grep -Po &quot;\d+&quot;` 
echo &quot;best k=${k}&quot;




# make 30x haploid coverage for PacBio CLR reads
# error rate from 13 - 15 % minimum 1000bp midlength 20000bp maximum 30000bp
cd /genetics/elbers/test/fly2

/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ow=t seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.13 pbmax=0.15 \
coverage=15 paired=f \
gaussianlength=t \
minlength=1000 midlength=20000 maxlength=30000 \
out=pacbio.fastq.gz &gt; random_reads_pacbio.log 2&gt;&amp;1



# make 30x haploid coverage for PacBio reads for Hifi reads
# error rate from 1 - 0.1 % minimum 9000bp midlength 10000bp max 12000bp
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ow=t seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.001 pbmax=0.01 \
coverage=15 paired=f \
gaussianlength=t \
minlength=9000 midlength=10000 maxlength=12000 \
out=hifi.fastq.gz &gt; random_reads_pacbio_hifi.log 2&gt;&amp;1</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43413/get-the-linux-system-information</guid>
	<pubDate>Thu, 30 Sep 2021 06:37:45 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43413/get-the-linux-system-information</link>
	<title><![CDATA[Get the Linux system information !]]></title>
	<description><![CDATA[<code>#!/bin/bash

# while-menu-dialog: a menu driven system information program

DIALOG_CANCEL=1
DIALOG_ESC=255
HEIGHT=0
WIDTH=0

display_result() {
  dialog --title &quot;$1&quot; \
    --no-collapse \
    --msgbox &quot;$result&quot; 0 0
}

while true; do
  exec 3&gt;&amp;1
  selection=$(dialog \
    --backtitle &quot;System Information&quot; \
    --title &quot;Menu&quot; \
    --clear \
    --cancel-label &quot;Exit&quot; \
    --menu &quot;Please select:&quot; $HEIGHT $WIDTH 4 \
    &quot;1&quot; &quot;Display System Information&quot; \
    &quot;2&quot; &quot;Display Disk Space&quot; \
    &quot;3&quot; &quot;Display Home Space Utilization&quot; \
    2&gt;&amp;1 1&gt;&amp;3)
  exit_status=$?
  exec 3&gt;&amp;-
  case $exit_status in
    $DIALOG_CANCEL)
      clear
      echo &quot;Program terminated.&quot;
      exit
      ;;
    $DIALOG_ESC)
      clear
      echo &quot;Program aborted.&quot; &gt;&amp;2
      exit 1
      ;;
  esac
  case $selection in
    1 )
      result=$(echo &quot;Hostname: $HOSTNAME&quot;; uptime)
      display_result &quot;System Information&quot;
      ;;
    2 )
      result=$(df -h)
      display_result &quot;Disk Space&quot;
      ;;
    3 )
      if [[ $(id -u) -eq 0 ]]; then
        result=$(du -sh /home/* 2&gt; /dev/null)
        display_result &quot;Home Space Utilization (All Users)&quot;
      else
        result=$(du -sh $HOME 2&gt; /dev/null)
        display_result &quot;Home Space Utilization ($USER)&quot;
      fi
      ;;
  esac
done</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43412/bash-script-for-getopts</guid>
	<pubDate>Wed, 29 Sep 2021 04:53:14 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43412/bash-script-for-getopts</link>
	<title><![CDATA[Bash script for getopts]]></title>
	<description><![CDATA[<code>#using : after a switch variable means it requires some input (ie, t: requires something after t to validate while h requires nothing.
while getopts “ht:r:p:v” OPTION
do
     case $OPTION in
         h)
             usage
             exit 1
             ;;
         t)
             TEST=$OPTARG
             ;;
         r)
             SERVER=$OPTARG
             ;;
         p)
             PASSWD=$OPTARG
             ;;
         v)
             VERBOSE=1
             ;;
         ?)
             usage
             exit
             ;;
     esac
done

if [[ -z $TEST ]] || [[ -z $SERVER ]] || [[ -z $PASSWD ]]
then
     usage
     exit 1
fi</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43408/command-line-to-create-blast-uniref-database</guid>
	<pubDate>Tue, 28 Sep 2021 05:46:20 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43408/command-line-to-create-blast-uniref-database</link>
	<title><![CDATA[Command line to create blast uniref database !]]></title>
	<description><![CDATA[<code>#The NCBI BLAST+ distribution does not include &#039;blastpgp&#039;, it has been replaced by the &#039;psiblast&#039; program. The &#039;blastpgp&#039; program is available in the legacy NCBI BLAST package (no longer supported), which is available from the NCBI&#039;s FTP site: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.26/.

wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
gunzip -v uniref90.fasta.gz
bin/pfilt uniref90.fasta &gt; uniref90filt
formatdb -t uniref90filt -i uniref90filt

#When using NCBI BLAST+ the &#039;formatdb&#039; command should be replaced by the equivalent &#039;makeblastdb&#039; command:

makeblastdb -dbtype prot -in uniref90filt -out uniref90filt</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43405/blastpgp-arguments</guid>
	<pubDate>Tue, 28 Sep 2021 05:30:02 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43405/blastpgp-arguments</link>
	<title><![CDATA[blastpgp arguments !]]></title>
	<description><![CDATA[<code>blastpgp   arguments:

  -d  Database [String]
    default = nr
  -i  Query File [File In]
    default = stdin
  -A  Multiple Hits window size (zero for single hit algorithm) [Integer]
    default = 40
  -f  Threshold for extending hits [Integer]
    default = 0
  -e  Expectation value (E) [Real]
    default = 10.0
  -m  alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = Tabular output, 
9 = Tabular output with comments [Integer]
    default = 0
  -o  Output File for Alignment [File Out]  Optional
    default = stdout
  -y  Dropoff (X) for blast extensions in bits (default if zero) [Real]
    default = 7.0
  -P  0 for multiple hits 1-pass, 1 for single hit 1-pass, 2 for 2-pass [Integer]
    default = 0
  -F  Filter query sequence with SEG [String]
    default = F
  -G  Cost to open a gap [Integer]
    default = 11
  -E  Cost to extend a gap [Integer]
    default = 1
  -X  X dropoff value for gapped alignment (in bits) [Integer]
    default = 15
  -N  Number of bits to trigger gapping [Real]
    default = 22.0
  -g  Gapped [T/F]
    default = T
  -S  Start of required region in query [Integer]
    default = 1
  -H  End of required region in query (-1 indicates end of query) [Integer]
    default = -1
  -a  Number of processors to use [Integer]
    default = 1
  -I  Show GI&#039;s in deflines [T/F]
    default = F
  -h  e-value threshold for inclusion in multipass model [Real]
    default = 0.005
  -c  Constant in pseudocounts for multipass version [Integer]
    default = 9
  -j  Maximum number of passes to use in  multipass version [Integer]
    default = 1
  -J  Believe the query defline [T/F]
    default = F
  -Z  X dropoff value for final gapped alignment (in bits) [Integer]
    default = 25
  -O  SeqAlign file (&#039;Believe the query defline&#039; must be TRUE) [File Out]  Optional
  -M  Matrix [String]
    default = BLOSUM62
  -v  Number of database sequences to show one-line descriptions for (V) [Integer]
    default = 500
  -b  Number of database sequence to show alignments for (B) [Integer]
    default = 250
  -C  Output File for PSI-BLAST Checkpointing [File Out]  Optional
  -R  Input File for PSI-BLAST Restart [File In]  Optional
  -W  Word size, default if zero [Integer]
    default = 0
  -z  Effective length of the database (use zero for the real size) [Real]
    default = 0
  -K  Number of best hits from a region to keep [Integer]
    default = 0
  -s  Compute locally optimal Smith-Waterman alignments [T/F]
    default = F
  -Y  Effective length of the search space (use zero for the real size) [Real]
    default = 0
  -p  program option for PHI-BLAST [String]
    default = blastpgp
  -k  Hit File for PHI-BLAST [File In]
    default = hit_file
  -T  Produce HTML output [T/F]
    default = F
  -Q  Output File for PSI-BLAST Matrix in ASCII [File Out]  Optional
  -B  Input Alignment File for PSI-BLAST Restart [File In]  Optional
  -l  Restrict search of database to list of GI&#039;s [String]  Optional
  -U  Use lower case filtering of FASTA sequence [T/F]  Optional
    default = F
  -t  Use composition based statistics [T/F]
    default = T
  -L  Cost to decline alignment (disabled when 0) [Integer]
    default = 0</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43404/perl-script-for-smith-waterman-algorithm</guid>
	<pubDate>Tue, 28 Sep 2021 05:19:18 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43404/perl-script-for-smith-waterman-algorithm</link>
	<title><![CDATA[Perl script for Smith-Waterman Algorithm]]></title>
	<description><![CDATA[<code># Smith-Waterman  Algorithm

# usage statement
die &quot;usage: $0 &lt;sequence 1&gt; &lt;sequence 2&gt;\n&quot; unless @ARGV == 2;

# get sequences from command line
my ($seq1, $seq2) = @ARGV;

# scoring scheme
my $MATCH    =  1; # +1 for letters that match
my $MISMATCH = -1; # -1 for letters that mismatch
my $GAP      = -1; # -1 for any gap

# initialization
my @matrix;
$matrix[0][0]{score}   = 0;
$matrix[0][0]{pointer} = &quot;none&quot;;
for(my $j = 1; $j &lt;= length($seq1); $j++) {
    $matrix[0][$j]{score}   = 0;
    $matrix[0][$j]{pointer} = &quot;none&quot;;
}
for (my $i = 1; $i &lt;= length($seq2); $i++) {
    $matrix[$i][0]{score}   = 0;
    $matrix[$i][0]{pointer} = &quot;none&quot;;
}

# fill
my $max_i     = 0;
my $max_j     = 0;
my $max_score = 0;

for(my $i = 1; $i &lt;= length($seq2); $i++) {
    for(my $j = 1; $j &lt;= length($seq1); $j++) {
        my ($diagonal_score, $left_score, $up_score);
        
        # calculate match score
        my $letter1 = substr($seq1, $j-1, 1);
        my $letter2 = substr($seq2, $i-1, 1);       
        if ($letter1 eq $letter2) {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH;
        }
        else {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH;
        }
        
        # calculate gap scores
        $up_score   = $matrix[$i-1][$j]{score} + $GAP;
        $left_score = $matrix[$i][$j-1]{score} + $GAP;
        
        if ($diagonal_score &lt;= 0 and $up_score &lt;= 0 and $left_score &lt;= 0) {
            $matrix[$i][$j]{score}   = 0;
            $matrix[$i][$j]{pointer} = &quot;none&quot;;
            next; # terminate this iteration of the loop
        }
        
        # choose best score
        if ($diagonal_score &gt;= $up_score) {
            if ($diagonal_score &gt;= $left_score) {
                $matrix[$i][$j]{score}   = $diagonal_score;
                $matrix[$i][$j]{pointer} = &quot;diagonal&quot;;
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = &quot;left&quot;;
            }
        } else {
            if ($up_score &gt;= $left_score) {
                $matrix[$i][$j]{score}   = $up_score;
                $matrix[$i][$j]{pointer} = &quot;up&quot;;
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = &quot;left&quot;;
            }
        }
        
        # set maximum score
        if ($matrix[$i][$j]{score} &gt; $max_score) {
            $max_i     = $i;
            $max_j     = $j;
            $max_score = $matrix[$i][$j]{score};
        }
    }
}

# trace-back

my $align1 = &quot;&quot;;
my $align2 = &quot;&quot;;

my $j = $max_j;
my $i = $max_i;

while (1) {
    last if $matrix[$i][$j]{pointer} eq &quot;none&quot;;
    
    if ($matrix[$i][$j]{pointer} eq &quot;diagonal&quot;) {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= substr($seq2, $i-1, 1);
        $i--; $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq &quot;left&quot;) {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= &quot;-&quot;;
        $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq &quot;up&quot;) {
        $align1 .= &quot;-&quot;;
        $align2 .= substr($seq2, $i-1, 1);
        $i--;
    }   
}

$align1 = reverse $align1;
$align2 = reverse $align2;
print &quot;$align1\n&quot;;
print &quot;$align2\n&quot;;</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43403/oneliner-to-convert-lower-case-to-sequence-masked-with-ns</guid>
	<pubDate>Tue, 28 Sep 2021 04:47:05 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43403/oneliner-to-convert-lower-case-to-sequence-masked-with-ns</link>
	<title><![CDATA[Oneliner to convert lower-case to sequence masked with Ns]]></title>
	<description><![CDATA[<code>perl -pe &#039;/^[^&gt;]/ and $_=~ s/[a-z]/N/g&#039; genomic.fna &gt; genomic.N-masked.fna

awk &#039;{if(/^[^&gt;]/)gsub(/[a-z]/,&quot;N&quot;);print $0}&#039; genomic.fna &gt; genomic.N-masked.fna</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43327/list-of-string-comparison-algorithms</guid>
	<pubDate>Fri, 27 Aug 2021 07:27:29 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43327/list-of-string-comparison-algorithms</link>
	<title><![CDATA[List of string comparison algorithms !]]></title>
	<description><![CDATA[<code>String comparison:

    Levenshtein Distance
    Damerau-Levenshtein Distance
    Jaro Distance
    Jaro-Winkler Distance
    Match Rating Approach Comparison
    Hamming Distance

More at https://jellyfish.readthedocs.io/en/latest/comparison.html</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>
<item>
	<guid isPermaLink="true">https://bioinformaticsonline.com/snippets/view/43000/python-script-to-download-covid-genome</guid>
	<pubDate>Fri, 26 Mar 2021 07:01:29 -0500</pubDate>
	<link>https://bioinformaticsonline.com/snippets/view/43000/python-script-to-download-covid-genome</link>
	<title><![CDATA[Python script to download covid genome !]]></title>
	<description><![CDATA[<code>#!/usr/bin/env python3

# these are the publicly available &quot;complete&quot; sequences
# https://www.gisaid.org/ has more (1200?), but they require you to sign up

import requests
import yaml

seqs = yaml.load(requests.get(&quot;https://www.ncbi.nlm.nih.gov/core/assets/genbank/files/ncov-sequences.yaml&quot;).text)
seqs = seqs[&#039;genbank-sequences&#039;]
print(&quot;got %d sequences&quot; % len(seqs))

from Bio import Entrez
allseq = {}
for x in seqs:
  if &#039;gene-region&#039; in x and x[&#039;gene-region&#039;] == &quot;complete&quot;:
    nm = x[&#039;accession&#039;]
    print(&quot;downloading&quot;, nm)
    dna = Entrez.efetch(db=&#039;nucleotide&#039;,id=nm, rettype = &#039;fasta&#039;, retmode= &#039;text&#039;).read().split(&quot;\n&quot;)[1:]
    allseq[nm] = &#039;&#039;.join(dna)

import json
with open(&quot;data/allseq.json&quot;, &quot;w&quot;) as f:
  json.dump(allseq, f)</code>]]></description>
	<dc:creator>Surabhi Chaudhary</dc:creator>
</item>

</channel>
</rss>