BOL: Owner

Rename the screen name in Linux !

Shruti Paniwala — Mon, 30 May 2022 20:44:38 -0500

screen -S 22798.pts-5.hn1 -X sessionname node33_COMPUTE

R script to visualize fastANI core-genome comparison

Shruti Paniwala — Fri, 30 Apr 2021 02:29:19 -0500

#######
# Purpose: Visualize fastANI core-genome comparison
# Usage: Rscript     
# Output: .pdf
# Uses genoPlotR package: http://genoplotr.r-forge.r-project.org

#Parse command line arguments
query_fasta=commandArgs(TRUE)[1]
subject_fasta=commandArgs(TRUE)[2]
fastANI_visual_file=commandArgs(TRUE)[3]

library(genoPlotR)

#Read fastANI output
comparison <- try(read_comparison_from_blast(fastANI_visual_file))

#Read sequences into genoPlotR objects
Query <- try(read_dna_seg_from_file(query_fasta))
Ref <- try(read_dna_seg_from_file(subject_fasta))

plotTitle = paste(query_fasta, subject_fasta, sep=" v/s ")

pdf( paste(fastANI_visual_file,".pdf",sep="") )

plot_gene_map(dna_segs=list(Query, Ref), comparisons=list(comparison), main=plotTitle, scale=FALSE, scale_cex=1, n_scale_ticks=4)

dev.off()

Install kraken on linux

Shruti Paniwala — Mon, 04 May 2020 04:16:39 -0500

$ conda install -c bioconda kraken
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/anaconda3

  added / updated specs:
    - kraken


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    jellyfish-1.1.12           |       h6bb024c_1         3.1 MB  bioconda
    kraken-1.1                 |       h470a237_2         114 KB  bioconda
    ------------------------------------------------------------
                                           Total:         3.2 MB

The following NEW packages will be INSTALLED:

  jellyfish          bioconda/linux-64::jellyfish-1.1.12-h6bb024c_1
  kraken             bioconda/linux-64::kraken-1.1-h470a237_2
  perl               pkgs/main/linux-64::perl-5.26.2-h14c3975_0


Proceed ([y]/n)? y


Downloading and Extracting Packages
kraken-1.1           | 114 KB    | ################################################## | 100% 
jellyfish-1.1.12     | 3.1 MB    | ################################################## | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

$conda install -c bioconda kraken-all
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /home/anaconda3

  added / updated specs:
    - kraken-all


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    kraken-0.10.6_eaf8fb68     |                2          98 KB  bioconda
    kraken-all-0.10.6_eaf8fb68 |                0          857 B  bioconda
    perl-threaded-5.26.0       |                0           4 KB  bioconda
    ------------------------------------------------------------
                                           Total:         103 KB

The following NEW packages will be INSTALLED:

  kraken-all         bioconda/linux-64::kraken-all-0.10.6_eaf8fb68-0
  perl-threaded      bioconda/linux-64::perl-threaded-5.26.0-0

The following packages will be SUPERSEDED by a higher-priority channel:

  kraken             bioconda/label/cf201901::kraken-1.1-h~ --> bioconda::kraken-0.10.6_eaf8fb68-2


Proceed ([y]/n)? y


Downloading and Extracting Packages
kraken-0.10.6_eaf8fb | 98 KB     | ################################################## | 100% 
kraken-all-0.10.6_ea | 857 B     | ################################################## | 100% 
perl-threaded-5.26.0 | 4 KB      | ################################################## | 100% 
Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Samtools commands for bioinformatician !

Shruti Paniwala — Sun, 15 Dec 2019 10:31:22 -0600

## count mapped reads
samtools view -c -F 260 mapping_file.bam


### converting sam file into fasta
samtools fasta reads_mapped.sam > reads.fasta

### converting sam file into bam
# -b : output is bam
# -S : input is sam
# -o : output file name
samtools view -b -S -o sal_sej.bam sal_sej.sam

### viewing bam files (view command)
samtool view sal_sej.bam | less


### sort reads by flag specified and show them
# -f INT show flag matches
samtools -f 4 sal_sej.bam | less

# -F INT show reads excepting flag matches
samtools -F 4 sal_sej.bam | less


### count reads (-c) by flag specified
samtool view -c -f 4 sal_sej.bam


### count reads by quality value specified (-q) (>=)
# -q : minimal quality value
samtools view -q 42 -c sal_sej.bam



### sorting bam file by genome position
samtools sort sal_sej.bam > sal_sej_sorted.bam

### indexing sorted file
# ouput is file.bai
# always index sorted files
samtools index sal_sej_sorted.bam.bam

### identifying genome variants (mpileup command)
# -g : output is bcf (binary call format) file
# -f : use reference genome given
samtools mpileup -g -f sal_ref_sej.fasta sal_sej_sorted.bam.bam > sal_vars.bcf


### calling snp and indels
# -c : find snp
# -v : output only potential variants
bcftools view -c -v sal_vars.bcf > sal_vars.vcf
### note: new command is "call". type "bcftools" to see current version and commands

### calling snp and indels with no frequency threshold
# omit -v parameter
bcftools call -c sal_vars.bcf > sal_vars.vcf

### normalize (realign) indels
# -f : reference fasta, needed to left align and normalize
bcftools norm -f ./input_data/ref.fasta \
vars.vcf \
-o vars_indels_realigned.vcf

### show alignment
# first arg is sorted bam file
# second arg is reference 
samtools tview sal_sej_sorted.bam.bam sal_ref_sej.fasta

### results into txt file
samtools depth /path/to/sorted_bam.bam > /path/to/coverage_results.txt
### note: returns count of depth at each position. input should be sorted

### count depth at each position and put it into a txt file
# -a : at all positions
samtools depth -a sorted_dupremoved.bam > depth.txt

### one liner to count mean depth
samtools depth -a sorted_dupremoved.bam | awk '{c++;s+=$3}END{print s/c}'

### one liner to count coverage breadth
samtools depth -a sorted_dupremoved.bam | awk '{c++; if($3>0) total+=1}END{print (total/c)*100}'

### flagstat
# output file has two columns: QC-passed reads and QC-failed reads
# and rows: total reads, duplicates, mapped.
samtools flagstat sorted_dupremoved.bam > flagstat.txt

####################################
# try these recipies: 
### convert BAM to FASTA:
samtools view filename.bam | awk '{OFS="\t"; print ">"$1"\n"$10}' - > filename.fasta

## converting BAM to SAM:
samtools view -h -o out.sam in.bam
## "samtools view -h  > "

QV calculation in Bash !

Shruti Paniwala — Fri, 13 Dec 2019 21:14:16 -0600

# $1 = vcf file
# $2 = input bam file
# $3 = output QV file

module load samtools

NUM_BP=`samtools depth $2 | perl -e '$c = 0; while(<>){chomp; @s = split(/\t/); if(scalar(@s) >= 3){$c++;}} print "$c\n";'`
echo "num bp: "$NUM_BP

NUM_SNP=`cat $1 |grep -v "#" | awk -F "\t" '{if (!match($NF, "0/1")) print $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$8}' | tr ';' ' ' | sed s/AB=//g | awk -v WEIGHT=0 '{if ($6 >= WEIGHT) print $0}' | awk -v SUM=0 '{if (length($4) == length($5)) { SUM+=length($4); } else if (length($4) < length($5)) { SUM+=length($5)-length($4); } else { SUM+=length($4)-length($5)}} END { print SUM}'`
echo "num snp: "$NUM_SNP

perl -e 'chomp(@ARGV); $ns = $ARGV[0]; $nb = $ARGV[1]; print (-10 * log($ns/$nb)/log(10)); print "\n";' $NUM_SNP $NUM_BP > $3
cat $3

Perl script to find coding regions in DNA sequences

Shruti Paniwala — Mon, 11 Jun 2018 08:03:03 -0500

#!/usr/bin/perl -w

use strict;


# if the number of input arguments is lower than 2
# return a message showing the error

if (scalar(@ARGV) < 2) {
  print "dnaloglkh.pl codontable DNAsequence\n";
  exit(1);
}

# create two vars contaning the filenames

my $filecodontable = $ARGV[0];
my $filesequence = $ARGV[1];


# open the first file: table of codon usage frequencies

if (!open(PROPCODONS,"< $filecodontable")) {
  print "dnaloglkh.pl: the file $filecodontable can not be opened\n";
  exit(1);
}


# load the frequencies of codon usage into the hash table %pcodons
# this hash is indexed by a triplet of nucleotides or codon

my %pcodons;
my ($codon,$freq);
my $line;

# for each line in the file (tableofcodons) do

while ($line = ) {

	# extract the two values or columns with a regular expression
	# A group of letters and a decimal number
	
	($codon,$freq) = ($line =~ m/(\w+) (\d+\.\d+)/);
  
	# load the hash table
	$pcodons{$codon}=$freq;
}

close(PROPCODONS);

##########################################################
# open the DNA sequence (second file)

if (!open(SEQUENCE,"< $filesequence")) {
	print "dnaloglkh.pl: the file $filesequence can not be opened\n";
	exit(1);
}


# FASTA format: 
# >header containing a description of the sequence
# AGACTGACTTAC
# AGACTGACTTAC
# AGACTGAC

my $iden = ;  # reading the header of the sequence
my @seql = ;  # load the complete sequence into an array
my $seq  = "";          # use a var to save the whole sequence inside

close(SEQUENCE);

# concat every segment (line) in the array into the var $seq

my $i=0;

while ($i < scalar(@seql)) {
	chomp($seql[$i]);
	$seq = $seq . $seql[$i];
	$i = $i + 1;
}

$seq = "\U$seq";

# we are going to use the array @codons to store all of the codons in
# the input DNA sequence. To split the sequence in segments of length
# three, a regular expression will be used searching for all of the
# possible groups of three symbols (option g)

my @codons = ($seq =~ m/.../g);

# the likelihood ratio is for a current triplet, the ratio between 
# the probability computed from the table of codon frequencies divided
# by the random probability (uniform), that is, 1/64. After this,
# the log must be applied to obtain the log-likelihood ratio.
# It is recommended the use of logs to manage calculations involving
# small numbers between 0 and 1. Log of the product is equal to the
# the sum of logs and the log of the division is the substraction of logs

my $r = 0.0;

$i=0;
while ($i < scalar(@codons)) {
  $r = $r + log($pcodons{$codons[$i]}) - log(1/64);
  $i = $i + 1;
}

printf "%.2f\n",$r;


__END__

To test the program, save this two files in your current working directory: exon.fa and intron.fa. Both files contain the sequence of a real exon and intron, respectively. Extremely different values of coding potential should be obtained with the two sequences: high values (positive) for coding protein sequences and low values (zero or even negative) for intronic regions. 

 exon.fa 

>HUMHBB.2ex
GCTGCTGGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACT
CCTGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGCCT
TTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAGTGAGCT
GCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGG


 intron.fa.

>HUMHBB.2pin
ATAACAATTGTTTTCTTTTGTTTAATTCTTGCTTTCTTTTTTTTTCTTCTCCGCAATTTTT
ACTATTATACTTAATGCCTTAACATTGTGTATAACAAAAGGAAATATCTCTGAGATACATT
AAGTAACTTAAAAAAAAACTTTACACAGTCTGCCTAGTACATTACTATTTGGAATATATGT
GTGCTTATTTGCATATTCATAATCTCCCTACTTTATTTTC

Extracting FASTA sequences based on position with perl script !!

Shruti Paniwala — Wed, 01 Mar 2017 17:10:11 -0600

#!/usr/bin/env perl
 
#Uses: perl sub-seq.pl input.txt range
 
use strict;
use warnings;
 
my $end   = pop;
my $start = pop;
local $/ = '>';
 
while (<>) {
    chomp;
    next unless /(.+)/;
    my ($header) = "$/$1_$start-$end\n";
    my $seq = ${^POSTMATCH};
    $seq =~ s/\s//g;
    print $header;
    print +( substr $seq, $start - 1, $end ) . "\n";
}

Perl script to insert the DNA string in genome

Shruti Paniwala — Mon, 23 Jan 2017 10:04:55 -0600

#!/usr/bin/perl

use warnings;
use strict;
use Bio::SeqIO;
use Bio::Seq;

my $file = $ARGV[0]; # input fasta file (genome file)
my $out = $ARGV[1]; # output fasta file

my $chr="test"; #insertion chromosome
my $pos=10; # position of the insertion
my $seqI = "AAAA"; #sequence of the insertion

my $seq_in  = Bio::SeqIO->new( -format => 'fasta',-file => $file);
my $seq_out = Bio::SeqIO->new( -format => 'fasta',-file => ">".$out);
while( my $seq = $seq_in->next_seq() ) {    
    if($seq->primary_id eq $chr){
        my $length = length($seq->seq);    
        my $upstream=substr($seq->seq, 0, $pos);
        my $downstream=substr($seq->seq, $pos,$length);        
        my $seq_obj = Bio::Seq->new(-seq => $upstream.$seqI.$downstream,-display_id => $seq->primary_id,-alphabet => "dna" );
            $seq_out->write_seq($seq_obj);
    }
    else{
        $seq_out->write_seq($seq);
    }
}

Blast script to index and extract sequence !!

Shruti Paniwala — Thu, 14 Jul 2016 09:52:17 -0500

# look at the file
$ head EC4115.fa 
>NC_011353.1 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome.
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTC
TGACAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC

# generate the blast database
$ makeblastdb -dbtype nucl -out EC -in EC4115.fa -parse_seqids

# retreive an entry by id
$ blastdbcmd -db EC -entry 'NC_011353.1' | head
>lcl|NC_011353.1 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTCTCTGACAGCAGCTTCTGAACTG
GTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGAC
....

# query the blast database by id and coordinates
$ blastdbcmd -db EC -range 100-105 -entry 'NC_011353.1'
>lcl|NC_011353.1:100-105 Escherichia coli O157:H7 str. EC4115 chromosome, complete genome
TTAAAA