BOL: All

Command line to create a BLAST UniRef database !

Surabhi Chaudhary — Tue, 28 Sep 2021 05:46:20 -0500

#The NCBI BLAST+ distribution does not include 'blastpgp', it has been replaced by the 'psiblast' program. The 'blastpgp' program is available in the legacy NCBI BLAST package (no longer supported), which is available from the NCBI's FTP site: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/release/2.2.26/.

wget ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref90/uniref90.fasta.gz
gunzip -v uniref90.fasta.gz
bin/pfilt uniref90.fasta > uniref90filt
formatdb -t uniref90filt -i uniref90filt

#When using NCBI BLAST+ the 'formatdb' command should be replaced by the equivalent 'makeblastdb' command:

makeblastdb -dbtype prot -in uniref90filt -out uniref90filt

blastpgp arguments !

Surabhi Chaudhary — Tue, 28 Sep 2021 05:30:02 -0500

blastpgp   arguments:

  -d  Database [String]
    default = nr
  -i  Query File [File In]
    default = stdin
  -A  Multiple Hits window size (zero for single hit algorithm) [Integer]
    default = 40
  -f  Threshold for extending hits [Integer]
    default = 0
  -e  Expectation value (E) [Real]
    default = 10.0
  -m  alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = query-anchored no identities and blunt ends,
6 = flat query-anchored, no identities and blunt ends,
7 = XML Blast output,
8 = Tabular output, 
9 = Tabular output with comments [Integer]
    default = 0
  -o  Output File for Alignment [File Out]  Optional
    default = stdout
  -y  Dropoff (X) for blast extensions in bits (default if zero) [Real]
    default = 7.0
  -P  0 for multiple hits 1-pass, 1 for single hit 1-pass, 2 for 2-pass [Integer]
    default = 0
  -F  Filter query sequence with SEG [String]
    default = F
  -G  Cost to open a gap [Integer]
    default = 11
  -E  Cost to extend a gap [Integer]
    default = 1
  -X  X dropoff value for gapped alignment (in bits) [Integer]
    default = 15
  -N  Number of bits to trigger gapping [Real]
    default = 22.0
  -g  Gapped [T/F]
    default = T
  -S  Start of required region in query [Integer]
    default = 1
  -H  End of required region in query (-1 indicates end of query) [Integer]
    default = -1
  -a  Number of processors to use [Integer]
    default = 1
  -I  Show GI's in deflines [T/F]
    default = F
  -h  e-value threshold for inclusion in multipass model [Real]
    default = 0.005
  -c  Constant in pseudocounts for multipass version [Integer]
    default = 9
  -j  Maximum number of passes to use in  multipass version [Integer]
    default = 1
  -J  Believe the query defline [T/F]
    default = F
  -Z  X dropoff value for final gapped alignment (in bits) [Integer]
    default = 25
  -O  SeqAlign file ('Believe the query defline' must be TRUE) [File Out]  Optional
  -M  Matrix [String]
    default = BLOSUM62
  -v  Number of database sequences to show one-line descriptions for (V) [Integer]
    default = 500
  -b  Number of database sequence to show alignments for (B) [Integer]
    default = 250
  -C  Output File for PSI-BLAST Checkpointing [File Out]  Optional
  -R  Input File for PSI-BLAST Restart [File In]  Optional
  -W  Word size, default if zero [Integer]
    default = 0
  -z  Effective length of the database (use zero for the real size) [Real]
    default = 0
  -K  Number of best hits from a region to keep [Integer]
    default = 0
  -s  Compute locally optimal Smith-Waterman alignments [T/F]
    default = F
  -Y  Effective length of the search space (use zero for the real size) [Real]
    default = 0
  -p  program option for PHI-BLAST [String]
    default = blastpgp
  -k  Hit File for PHI-BLAST [File In]
    default = hit_file
  -T  Produce HTML output [T/F]
    default = F
  -Q  Output File for PSI-BLAST Matrix in ASCII [File Out]  Optional
  -B  Input Alignment File for PSI-BLAST Restart [File In]  Optional
  -l  Restrict search of database to list of GI's [String]  Optional
  -U  Use lower case filtering of FASTA sequence [T/F]  Optional
    default = F
  -t  Use composition based statistics [T/F]
    default = T
  -L  Cost to decline alignment (disabled when 0) [Integer]
    default = 0
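
For anyone moving an old blastpgp command line over to psiblast, the renaming can be sketched in Python. This is a minimal sketch, not a complete converter: the helper name `blastpgp_to_psiblast` and the mapping table are illustrative, only a subset of flags is covered, and anything unmapped is rejected rather than guessed. Check the result against `psiblast -help` before running it.

```python
# Illustrative subset only: legacy blastpgp flag -> BLAST+ psiblast option.
BLASTPGP_TO_PSIBLAST = {
    "-d": "-db",
    "-i": "-query",
    "-o": "-out",
    "-e": "-evalue",
    "-j": "-num_iterations",
    "-h": "-inclusion_ethresh",
    "-a": "-num_threads",
    "-M": "-matrix",
    "-G": "-gapopen",
    "-E": "-gapextend",
    "-v": "-num_descriptions",
    "-b": "-num_alignments",
    "-Q": "-out_ascii_pssm",
}


def blastpgp_to_psiblast(args):
    """Translate ['-d', 'db', '-i', 'q.fa', ...] into a psiblast argument list."""
    if len(args) % 2 != 0:
        raise ValueError("expected flag/value pairs")
    command = ["psiblast"]
    for flag, value in zip(args[0::2], args[1::2]):
        if flag not in BLASTPGP_TO_PSIBLAST:
            # Refuse to guess: unmapped flags need a manual look.
            raise ValueError(f"no mapping known for {flag}")
        command += [BLASTPGP_TO_PSIBLAST[flag], value]
    return command


if __name__ == "__main__":
    old = ["-d", "uniref90filt", "-i", "query.fasta", "-j", "3", "-h", "0.001"]
    print(" ".join(blastpgp_to_psiblast(old)))
```

The function only builds the argument list; pass it to `subprocess.run` once the options have been checked.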

Perl script for Smith-Waterman Algorithm

Surabhi Chaudhary — Tue, 28 Sep 2021 05:19:18 -0500

# Smith-Waterman Algorithm
use strict;
use warnings;

# usage statement
die "usage: $0 <sequence 1> <sequence 2>\n" unless @ARGV == 2;

# get sequences from command line
my ($seq1, $seq2) = @ARGV;

# scoring scheme
my $MATCH    =  1; # +1 for letters that match
my $MISMATCH = -1; # -1 for letters that mismatch
my $GAP      = -1; # -1 for any gap

# initialization
my @matrix;
$matrix[0][0]{score}   = 0;
$matrix[0][0]{pointer} = "none";
for(my $j = 1; $j <= length($seq1); $j++) {
    $matrix[0][$j]{score}   = 0;
    $matrix[0][$j]{pointer} = "none";
}
for (my $i = 1; $i <= length($seq2); $i++) {
    $matrix[$i][0]{score}   = 0;
    $matrix[$i][0]{pointer} = "none";
}

# fill
my $max_i     = 0;
my $max_j     = 0;
my $max_score = 0;

for(my $i = 1; $i <= length($seq2); $i++) {
    for(my $j = 1; $j <= length($seq1); $j++) {
        my ($diagonal_score, $left_score, $up_score);
        
        # calculate match score
        my $letter1 = substr($seq1, $j-1, 1);
        my $letter2 = substr($seq2, $i-1, 1);       
        if ($letter1 eq $letter2) {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MATCH;
        }
        else {
            $diagonal_score = $matrix[$i-1][$j-1]{score} + $MISMATCH;
        }
        
        # calculate gap scores
        $up_score   = $matrix[$i-1][$j]{score} + $GAP;
        $left_score = $matrix[$i][$j-1]{score} + $GAP;
        
        if ($diagonal_score <= 0 and $up_score <= 0 and $left_score <= 0) {
            $matrix[$i][$j]{score}   = 0;
            $matrix[$i][$j]{pointer} = "none";
            next; # terminate this iteration of the loop
        }
        
        # choose best score
        if ($diagonal_score >= $up_score) {
            if ($diagonal_score >= $left_score) {
                $matrix[$i][$j]{score}   = $diagonal_score;
                $matrix[$i][$j]{pointer} = "diagonal";
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = "left";
            }
        } else {
            if ($up_score >= $left_score) {
                $matrix[$i][$j]{score}   = $up_score;
                $matrix[$i][$j]{pointer} = "up";
            }
            else {
                $matrix[$i][$j]{score}   = $left_score;
                $matrix[$i][$j]{pointer} = "left";
            }
        }
        
        # set maximum score
        if ($matrix[$i][$j]{score} > $max_score) {
            $max_i     = $i;
            $max_j     = $j;
            $max_score = $matrix[$i][$j]{score};
        }
    }
}

# trace-back

my $align1 = "";
my $align2 = "";

my $j = $max_j;
my $i = $max_i;

while (1) {
    last if $matrix[$i][$j]{pointer} eq "none";
    
    if ($matrix[$i][$j]{pointer} eq "diagonal") {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= substr($seq2, $i-1, 1);
        $i--; $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq "left") {
        $align1 .= substr($seq1, $j-1, 1);
        $align2 .= "-";
        $j--;
    }
    elsif ($matrix[$i][$j]{pointer} eq "up") {
        $align1 .= "-";
        $align2 .= substr($seq2, $i-1, 1);
        $i--;
    }   
}

$align1 = reverse $align1;
$align2 = reverse $align2;
print "$align1\n";
print "$align2\n";
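
The same fill and trace-back logic can be sketched in Python for comparison. This is a rough port under the same scoring scheme (+1 match, -1 mismatch, -1 gap) with the same tie-breaking as the Perl script (diagonal first, then up, then left). The function name `smith_waterman` and its return shape are choices made for this sketch.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (aligned_seq1, aligned_seq2, best_score) for one best local alignment."""
    rows, cols = len(seq2) + 1, len(seq1) + 1
    # Row 0 and column 0 stay at score 0 with pointer "none" (initialization).
    score = [[0] * cols for _ in range(rows)]
    pointer = [["none"] * cols for _ in range(rows)]
    best_score, best_i, best_j = 0, 0, 0

    # Fill: rows follow seq2, columns follow seq1, as in the Perl version.
    for i in range(1, rows):
        for j in range(1, cols):
            same = seq1[j - 1] == seq2[i - 1]
            diagonal = score[i - 1][j - 1] + (match if same else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            top = max(diagonal, up, left)
            if top <= 0:
                continue  # cell keeps score 0 and pointer "none"
            score[i][j] = top
            if diagonal == top:
                pointer[i][j] = "diagonal"
            elif up >= left:
                pointer[i][j] = "up"
            else:
                pointer[i][j] = "left"
            if top > best_score:
                best_score, best_i, best_j = top, i, j

    # Trace-back from the highest-scoring cell until a "none" pointer.
    out1, out2 = [], []
    i, j = best_i, best_j
    while pointer[i][j] != "none":
        if pointer[i][j] == "diagonal":
            out1.append(seq1[j - 1])
            out2.append(seq2[i - 1])
            i, j = i - 1, j - 1
        elif pointer[i][j] == "left":
            out1.append(seq1[j - 1])
            out2.append("-")
            j -= 1
        else:  # "up"
            out1.append("-")
            out2.append(seq2[i - 1])
            i -= 1
    return "".join(reversed(out1)), "".join(reversed(out2)), best_score


if __name__ == "__main__":
    top_line, bottom_line, best = smith_waterman("AAAGGG", "TTTGGGCC")
    print(top_line)
    print(bottom_line)
```

Like the Perl script, this reports only the first cell that reaches the maximum score, so equally good alternative alignments are not listed.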

One-liners to convert lower-case (soft-masked) bases to Ns

Surabhi Chaudhary — Tue, 28 Sep 2021 04:47:05 -0500

perl -pe '/^[^>]/ and $_=~ s/[a-z]/N/g' genomic.fna > genomic.N-masked.fna

awk '{if(/^[^>]/)gsub(/[a-z]/,"N");print $0}' genomic.fna > genomic.N-masked.fna
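
A Python equivalent of the two one-liners, as a minimal sketch: header lines (starting with ">") are left alone and every lower-case letter on sequence lines becomes N. The function name `mask_lowercase` is made up for this example.

```python
import re


def mask_lowercase(lines):
    """Replace lower-case letters with N on every non-header FASTA line."""
    masked = []
    for line in lines:
        if line.startswith(">"):
            masked.append(line)  # keep the defline untouched
        else:
            masked.append(re.sub(r"[a-z]", "N", line))
    return masked


if __name__ == "__main__":
    record = [">chr1 example", "ACGTacgtNN", "acgT"]
    print("\n".join(mask_lowercase(record)))
```

For a whole genome, read the input line by line and write each masked line straight out instead of collecting a list.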

Trim the reads in a loop using Trimmomatic !

Neel — Thu, 23 Sep 2021 13:13:38 -0500

for infile in *_1.fastq.gz
do
   base=$(basename "${infile}" _1.fastq.gz)
   # Trimmomatic applies the steps in the order given, so clip adapters first
   trimmomatic PE "${infile}" "${base}_2.fastq.gz" \
                "${base}_1.trim.fastq.gz" "${base}_1un.trim.fastq.gz" \
                "${base}_2.trim.fastq.gz" "${base}_2un.trim.fastq.gz" \
                ILLUMINACLIP:NexteraPE-PE.fa:2:40:15 SLIDINGWINDOW:4:20 MINLEN:25
done
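
The file-name bookkeeping in the loop can be checked in Python before anything is run. This is a minimal sketch: the helper `trimmomatic_pe_command` is hypothetical and only builds the argument list, using the file-name pattern and trimming steps from the loop above. Adapter clipping is listed first because Trimmomatic applies steps in the order given.

```python
from pathlib import Path

SUFFIX = "_1.fastq.gz"


def trimmomatic_pe_command(forward, adapters="NexteraPE-PE.fa"):
    """Build the 'trimmomatic PE' argument list for one forward-read file."""
    name = Path(forward).name
    if not name.endswith(SUFFIX):
        raise ValueError(f"{forward} does not end with {SUFFIX}")
    base = name[: -len(SUFFIX)]  # same result as: basename file _1.fastq.gz
    return [
        "trimmomatic", "PE",
        str(forward), f"{base}_2.fastq.gz",
        f"{base}_1.trim.fastq.gz", f"{base}_1un.trim.fastq.gz",
        f"{base}_2.trim.fastq.gz", f"{base}_2un.trim.fastq.gz",
        f"ILLUMINACLIP:{adapters}:2:40:15",
        "SLIDINGWINDOW:4:20",
        "MINLEN:25",
    ]


if __name__ == "__main__":
    # Print the commands for every forward-read file in the current directory.
    for path in sorted(Path(".").glob("*" + SUFFIX)):
        print(" ".join(trimmomatic_pe_command(path)))
```

Printing the commands first is a cheap dry run; once they look right, each list can be passed to `subprocess.run(cmd, check=True)`.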

Commands to Remove White Space In Text Or String Using Awk And Sed In Linux

Neel — Wed, 22 Sep 2021 08:01:34 -0500

text=" ATGGTV AGTGACCTAGAGTGATGA G   GGRTTT"

echo "$text" | sed 's/ //g'
OR
echo "$text" | awk '{ gsub(/ /,""); print }'

Return: ATGGTVAGTGACCTAGAGTGATGAGGGRTTT

#Leading spaces
echo "$text" | sed 's/^ *//'

#Trailing spaces
echo "$text" | sed 's/ *$//'

#Multiple spaces (squeeze runs of spaces into one)
sed 's/[ ]\+/ /g' /tmp/test.txt

echo "$text" | awk '{ gsub(/[ ]+/," "); print }'

awk '{ gsub(/[ ]+/," "); print }' /tmp/test.txt
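
The same clean-ups in Python, as a minimal sketch using the example string from above. The function names are made up for this example. Only the space character is handled, to match the sed/awk patterns; tabs and newlines are left alone.

```python
import re


def remove_all_spaces(text):
    """Like: sed 's/ //g'"""
    return text.replace(" ", "")


def strip_leading_spaces(text):
    """Remove every space at the start of the string."""
    return text.lstrip(" ")


def strip_trailing_spaces(text):
    """Remove every space at the end of the string."""
    return text.rstrip(" ")


def squeeze_spaces(text):
    """Like: awk '{ gsub(/[ ]+/," "); print }'"""
    return re.sub(r" +", " ", text)


if __name__ == "__main__":
    text = " ATGGTV AGTGACCTAGAGTGATGA G   GGRTTT"
    print(remove_all_spaces(text))
    print(squeeze_spaces(text))
```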

Install Packages in Python

LEGE — Fri, 17 Sep 2021 02:02:14 -0500

#Create a conda environment.
#Install a Python package in the terminal using conda.

$ conda create -n myenv

$ conda create -n myenv python=3.7

$ conda env create -f environment.yml

#List Available Conda Environments

$ conda env list

#Activate an Environment for Use
$ conda activate myenv

#Update Conda Environments Using a YAML File
$ conda activate genome-analytics-python
$ conda env update -f environment.yml

#Adding a Package to your YAML File

name: genome-analytics-python
channels:
  - conda-forge
  - defaults

dependencies:
  - python=3.7
  - matplotlib
  # Core scientific python
  - numpy

#List Installed Dependencies Within an Environment
(myenv) $ conda list
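
To confirm from inside the activated environment that the dependencies can actually be imported, a small check like the following can help. This is a minimal sketch: `missing_modules` is a hypothetical helper, it expects top-level import names, and it assumes the import name equals the conda package name, which holds for numpy and matplotlib but not for every package.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    wanted = ["numpy", "matplotlib"]  # the dependencies listed in the YAML above
    absent = missing_modules(wanted)
    if absent:
        print("Not importable:", ", ".join(absent))
    else:
        print("All listed modules are importable.")
```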

Run multiple bash commands in screen !

LEGE — Thu, 16 Sep 2021 15:04:21 -0500

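#Reattach to an existing screen session
screen -r 123

#Inside screen, run the script (bash.sh) on a compute node
srun --partition=compute --nodes=1 --ntasks-per-node=40 --pty bash.sh

#Detach and leave the job running: press Ctrl+A, then D
#Reattach later with 'screen -r 123' to check the status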

Install Nextflow on Linux !

LEGE — Wed, 15 Sep 2021 20:47:11 -0500

# Make sure that Java v8+ is installed:
java -version

# Install Nextflow
curl -fsSL get.nextflow.io | bash

# Add Nextflow binary to your PATH:
mv nextflow ~/bin/
# OR system-wide installation:
# sudo mv nextflow /usr/local/bin
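
The Java check can also be scripted. The sketch below parses the first line printed by `java -version` and returns the major version, handling both the old "1.8.0_292" style and the newer "11.0.11" style. The function names are made up for this example, and the regular expression assumes the usual `version "..."` layout of that line.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from 'java -version' output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("no version string found")
    major = int(match.group(1))
    # Java 8 and older report themselves as 1.x (for example 1.8.0_292).
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run 'java -version' (it prints to stderr) and parse the result."""
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    return java_major_version(result.stderr)


if __name__ == "__main__":
    print(java_major_version('openjdk version "1.8.0_292"'))
```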

Tadpole is 250x faster than SPAdes assembler !

LEGE — Thu, 02 Sep 2021 08:30:43 -0500

lege@jit-Lenovo-ideapad-320-15ISK:~/Downloads/MyTools/Vir$ tadpole.sh 

Written by Brian Bushnell
Last modified July 16, 2018

Description:  Uses kmer counts to assemble contigs, extend sequences, 
or error-correct reads.  Tadpole has no upper bound for kmer length,
but some values are not supported.  Specifically, it allows 1-31,
multiples of 2 from 32-62, multiples of 3 from 63-93, etc.
Please read bbmap/docs/guides/TadpoleGuide.txt for more information.

Usage:
Assembly:     tadpole.sh in=<reads> out=<contigs>
Extension:    tadpole.sh in=<reads> out=<extended> mode=extend
Correction:   tadpole.sh in=<reads> out=<corrected> mode=correct

Extension and correction may be done simultaneously.  Error correction on 
multiple files may be done like this:

tadpole.sh in=libA_r1.fq,libA_merged.fq in2=libA_r2.fq,null extra=libB_r1.fq out=ecc_libA_r1.fq,ecc_libA_merged.fq out2=ecc_libA_r2.fq,null mode=correct

Extending contigs with reads could be done like this:

tadpole.sh in=contigs.fa out=extended.fa el=100 er=100 mode=extend extra=reads.fq k=62


Input parameters:
in=<file>           Primary input file for reads to use as kmer data.
in2=<file>          Second input file for paired data.
extra=<file>        Extra files for use as kmer data, but not for error-
                    correction or extension.
reads=-1            Only process this number of reads, then quit (-1 means all).
NOTE: in, in2, and extra may also be comma-delimited lists of files.

Output parameters:
out=<file>          Write contigs (in contig mode) or corrected/extended 
                    reads (in other modes).
out2=<file>         Second output file for paired output.
outd=<file>         Write discarded reads, if using junk-removal flags.
dot=<file>          Write a contigs connectivity graph (partially implemented)
dump=<file>         Write kmers and their counts.
fastadump=t         Write kmers and counts as fasta versus 2-column tsv.
mincounttodump=1    Only dump kmers with at least this depth.
showstats=t         Print assembly statistics after writing contigs.

Prefiltering parameters:
prefilter=0         If set to a positive integer, use a countmin sketch
                    to ignore kmers with depth of that value or lower.
prehashes=2         Number of hashes for prefilter.
prefiltersize=0.2   (pff) Fraction of memory to use for prefilter.
minprobprefilter=t  (mpp) Use minprob for the prefilter.
prepasses=1         Use this many prefiltering passes; higher is more thorough
                    if the filter is very full.  Set to 'auto' to iteratively 
                    prefilter until the remaining kmers will fit in memory.
onepass=f           If true, prefilter will be generated in same pass as kmer
                    counts.  Much faster but counts will be lower, by up to
                    prefilter's depth limit.

Hashing parameters:
k=31                Kmer length (1 to infinity).  Memory use increases with K.
prealloc=t          Pre-allocate memory rather than dynamically growing; 
                    faster and more memory-efficient.  A float fraction (0-1)
                    may be specified; default is 1.
minprob=0.5         Ignore kmers with overall probability of correctness below this.
minprobmain=t       (mpm) Use minprob for the primary kmer counts.
threads=X           Spawn X hashing threads (default is number of logical processors).
rcomp=t             Store and count each kmer together and its reverse-complement.
coremask=t          All kmer extensions share the same hashcode.
fillfast=t          Speed up kmer extension lookups.

Assembly parameters:
mincountseed=3      (mcs) Minimum kmer count to seed a new contig or begin extension.
mincountextend=2    (mce) Minimum kmer count to continue extension of a read or contig.
                    It is recommended that mce=1 for low-depth metagenomes.
mincountretain=0    (mincr) Discard kmers with count below this.
maxcountretain=INF  (maxcr) Discard kmers with count above this.
branchmult1=20      (bm1) Min ratio of 1st to 2nd-greatest path depth at high depth.
branchmult2=3       (bm2) Min ratio of 1st to 2nd-greatest path depth at low depth.
branchlower=3       (blc) Max value of 2nd-greatest path depth to be considered low.
minextension=2      (mine) Do not keep contigs that did not extend at least this much.
mincontig=auto      (minc) Do not write contigs shorter than this.
mincoverage=1       (mincov) Do not write contigs with average coverage below this.
trimends=0          (trim) Trim contig ends by this much.  Trimming by K/2 
                    may yield more accurate genome size estimation.
contigpasses=16     Build contigs with decreasing seed depth for this many iterations.
contigpassmult=1.7  Ratio between seed depth of two iterations.
ownership=auto      For concurrency; do not touch.
processcontigs=f    Explore the contig connectivity graph. (partially implemented)

Processing modes:
mode=contig         contig: Make contigs from kmers.
                    extend: Extend sequences to be longer, and optionally
                            perform error correction.
                    correct: Error correct only.
                    insert: Measure insert sizes.
                    discard: Discard low-depth reads, without error correction.

Extension parameters:
extendleft=100      (el) Extend to the left by at most this many bases.
extendright=100     (er) Extend to the right by at most this many bases.
ibb=t               (ignorebackbranches) Do not stop at backward branches.
extendrollback=3    Trim a random number of bases, up to this many, on reads
                    that extend only partially.  This prevents the creation
                    of sharp coverage discontinuities at branches.

Error-correction parameters:
ecc=f               Error correct via kmer counts.
reassemble=t        If ecc is enabled, use the reassemble algorithm.
pincer=f            If ecc is enabled, use the pincer algorithm.
tail=f              If ecc is enabled, use the tail algorithm.
eccfull=f           If ecc is enabled, use tail over the entire read.
aggressive=f        (aecc) Use aggressive error correction settings.
                    Overrides some other flags like errormult1 and deadzone.
conservative=f      (cecc) Use conservative error correction settings.
                    Overrides some other flags like errormult1 and deadzone.
rollback=t          Undo changes to reads that have lower coverage for
                    any kmer after correction.
markbadbases=0      (mbb) Any base fully covered by kmers with count below 
                    this will have its quality reduced.
markdeltaonly=t     (mdo) Only mark bad bases adjacent to good bases.
meo=t               (markerrorreadsonly) Only mark bad bases in reads 
                    containing errors.
markquality=0       (mq) Set quality scores for marked bases to this.
                    A level of 0 will also convert the base to an N.
errormult1=16       (em1) Min ratio between kmer depths to call an error.
errormult2=2.6      (em2) Alternate ratio between low-depth kmers.
errorlowerconst=3   (elc) Use mult2 when the lower kmer is at most this deep.
mincountcorrect=3   (mcc) Don't correct to kmers with count under this.
pathsimilarityfraction=0.45(psf) Max difference ratio considered similar.
                           Controls whether a path appears to be continuous.
pathsimilarityconstant=3   (psc) Absolute differences below this are ignored.
errorextensionreassemble=5 (eer) Verify this many kmers before the error as
                           having similar depth, for reassemble.
errorextensionpincer=5     (eep) Verify this many additional bases after the
                           error as matching current bases, for pincer.
errorextensiontail=9       (eet) Verify additional bases before and after 
                           the error as matching current bases, for tail.
deadzone=0          (dz) Do not try to correct bases within this distance of
                    read ends.
window=12           (w) Length of window to use in reassemble mode.
windowcount=6       (wc) If more than this many errors are found within a
                    window, halt correction in that direction.
qualsum=80          (qs) If the sum of the qualities of corrected bases within
                    a window exceeds this, halt correction in that direction.
rbi=t               (requirebidirectional) Require agreement from both 
                    directions when correcting errors in the middle part of 
                    the read using the reassemble algorithm.
errorpath=1         (ep) For debugging purposes.

Junk-removal parameters (to only remove junk, set mode=discard):
tossjunk=f          Remove reads that cannot be used for assembly.
                    This means they have no kmers above depth 1 (2 for paired
                    reads) and the outermost kmers cannot be extended.
                    Pairs are removed only if both reads fail.
tossdepth=-1        Remove reads containing kmers at or below this depth.
                    Pairs are removed if either read fails.
lowdepthfraction=0  (ldf) Require at least this fraction of kmers to be
                    low-depth to discard a read; range 0-1. 0 still
                    requires at least 1 low-depth kmer.
requirebothbad=f    (rbb) Only discard pairs if both reads are low-depth.
tossuncorrectable   (tu) Discard reads containing uncorrectable errors.
                    Requires error-correction to be enabled.

Shaving parameters:
shave=t             Remove dead ends (aka hair).
rinse=t             Remove bubbles.
wash=               Set shave and rinse at the same time.
maxshavedepth=1     (msd) Shave or rinse kmers at most this deep.
exploredist=300     (sed) Quit after exploring this far.
discardlength=150   (sdl) Discard shavings up to this long.
Note: Shave and rinse can produce substantially better assemblies
for low-depth data, but they are very slow for large metagenomes.

Overlap parameters (for overlapping paired-end reads only):
merge=f             Attempt to merge overlapping reads prior to 
                    kmer-counting, and again prior to correction.  Output
                    will still be unmerged pairs.
ecco=f              Error correct via overlap, but do not merge reads.
testmerge=t         Test kmer counts around the read merge junctions.  If
                    it appears that the merge created new errors, undo it.

Java Parameters:
-Xmx                This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
                    -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.
-eoom               This flag will cause the process to exit if an out-of-memory exception occurs.  Requires Java 8u92+.
-da                 Disable assertions.
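
The kmer-length rule from the Description (1-31, multiples of 2 from 32-62, multiples of 3 from 63-93, etc.) can be checked before launching a run. This is a minimal sketch: the function name `tadpole_k_supported` is made up, and continuing the pattern past 93 (multiples of 4 from 94-124, and so on) is an assumption read into the "etc."; it is not confirmed by the help text.

```python
def tadpole_k_supported(k):
    """Return True if k fits the pattern described in the Tadpole help text."""
    if k < 1:
        return False
    # Blocks of 31: 1-31 -> step 1, 32-62 -> step 2, 63-93 -> step 3, ...
    # Steps beyond 3 are an extrapolation of the "etc." in the description.
    step = (k + 30) // 31
    return k % step == 0


if __name__ == "__main__":
    for k in (31, 33, 62, 64, 93):
        print(k, tadpole_k_supported(k))
```

The k=62 used in the contig-extension example above passes this check.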