BOL: All

Count number of lines in each file in Linux !

Neel — Fri, 18 Feb 2022 22:43:57 -0600

for FILE in *.rd; do wc -l "$FILE"; done > allReads.hits
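
A slightly more defensive variant of the loop above, as a sketch: it skips the literal pattern when no file matches and prints only the count and the bare file name. The function name count_lines and the demo file names are made up for illustration.

```shell
# Sketch: print "<line count> <file name>" for every *.rd file in a directory.
# count_lines and the demo file names are illustrative, not from the original post.
count_lines() {
    dir="$1"
    for file in "$dir"/*.rd; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$file" ] || continue
        # Reading from stdin makes wc print only the number.
        printf '%s %s\n' "$(wc -l < "$file" | tr -d ' ')" "${file##*/}"
    done
}

# Demo on throwaway data
demo_dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$demo_dir/sample one.rd"
printf 'x\n' > "$demo_dir/sample_two.rd"
count_lines "$demo_dir" > "$demo_dir/allReads.hits"
cat "$demo_dir/allReads.hits"
```

Quoting "$file" is what keeps the name with a space in it intact.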

Install GATK 4 using conda !

Jit — Sun, 13 Feb 2022 20:35:54 -0600

#GATK is a toolkit developed by the Broad Institute, focused primarily on variant discovery and genotyping. It is open source, hosted on GitHub, and available under a BSD 3-clause license. First we download and unzip GATK from GitHub. The creators of GATK recommend running it through conda, a package, environment, and dependency manager that creates an isolated virtual environment from which to run software. Next we tell conda to create a virtual environment for GATK, using the yaml file included with GATK as the recipe. We do this with the command conda env create, passing -f to point at the yaml file and -p to specify where the environment should be stored. We also make a symlink so the downloaded executable is available directly from our bin folder. To run GATK we first start the virtual environment with the command source activate; we can then run the program by providing the path to the executable. To exit the virtual environment, run the command source deactivate.

# download and unzip
cd ~/workspace/bin
wget https://github.com/broadinstitute/gatk/releases/download/4.0.2.1/gatk-4.0.2.1.zip
unzip gatk-4.0.2.1.zip

# make sure ubuntu user can create their own conda environments
sudo chown -R ubuntu:ubuntu /home/ubuntu/.conda

# create conda environment for gatk
cd gatk-4.0.2.1/
conda env create -f gatkcondaenv.yml -p ~/workspace/bin/conda/gatk

# make symlink
ln -s ~/workspace/bin/gatk-4.0.2.1/gatk ~/workspace/bin/gatk

# test installation
source activate ~/workspace/bin/conda/gatk
~/workspace/bin/gatk

# to exit the virtual environment
source deactivate
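
Newer conda releases (4.4 and later) favour conda activate / conda deactivate over source activate / source deactivate. A minimal sketch of the same activation step in that style, assuming a bash session and the environment path used above; the function name activate_gatk is hypothetical.

```shell
# Sketch: activate the GATK environment with the newer `conda activate` syntax.
# activate_gatk is a hypothetical helper; the default path is the one used above.
activate_gatk() {
    env_path="${1:-$HOME/workspace/bin/conda/gatk}"
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found on PATH; install or load conda first" >&2
        return 1
    fi
    # Load conda's shell functions into this (bash) session, then activate.
    eval "$(conda shell.bash hook)"
    conda activate "$env_path"
}

# Usage (not run here):
#   activate_gatk && ~/workspace/bin/gatk
#   conda deactivate
```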

Command line to print disk usage on Linux terminal !

Jit — Thu, 10 Feb 2022 21:21:49 -0600

#Print disk usage - perl
perl -e'%h=map{/.\s/;99**(ord$&&7)-$`,$_}`du -h`;die@h{sort%h}'

#Bash
du -k * | sort -nr | cut -f2 | xargs -d '\n' du -sh

#Basic (sizes in megabytes, with a grand total)
du -scBM | sort -n

#More
du -s * | sort -rn | cut -f2- | xargs -d "\n" du -sh
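
With GNU coreutils, sort -h understands the human-readable suffixes that du -h prints, so the two-pass xargs pipelines above can be collapsed into a single pass. A sketch, assuming GNU du and sort; the function name largest_first and the demo files are illustrative.

```shell
# Sketch: list the entries of a directory by size, largest first.
# Assumes GNU coreutils (sort -h); largest_first is an illustrative name.
largest_first() {
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$1" && du -sh -- * | sort -rh )
}

# Demo on throwaway data: one ~2 MB file and one tiny file
demo_dir=$(mktemp -d)
head -c 2097152 /dev/urandom > "$demo_dir/big.bin"
printf 'hi\n' > "$demo_dir/small.txt"
largest_first "$demo_dir"
```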

Commands to get the detail of disk usage on Linux !

Jit — Wed, 09 Feb 2022 21:44:37 -0600

#A simplistic approach would be

du -shc /home/*
du -shc /home/jnarayan

#To sort it:
du -smc /home/* | sort -n

#There is also a well-known Perl script, durep, that has the option of mailing disk usage reports per user:
http://www.ubuntugeek.com/create-disk-usage-reports-with-durep.html

BBmap the reads with all alignments !

Jit — Mon, 07 Feb 2022 08:29:42 -0600

bbmap.sh in=../reference/reference.numbered.fa ambig=all vslow perfectmode maxsites=100000 out=fetch_Ids_for_barcode.sam

Bash script to split multifasta file !

Neel — Wed, 02 Feb 2022 03:53:30 -0600

#Using awk, we can easily split a file (multi.fa) into chunks of size N (here, N=500), by using the following one-liner:

awk 'BEGIN {n=0;} /^>/ {if(n%500==0){file=sprintf("chunk%d.fa",n);} print >> file; n++; next;} { print >> file; }' < multi.fa

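#OR, to get roughly 10 chunks whatever the number of sequences, derive the chunk size from the record count:

awk -v chunksize=$(grep -c "^>" multi.fasta) 'BEGIN{n=0; chunksize=int(chunksize/10)+1 } /^>/ {if(n%chunksize==0){file=sprintf("chunk%d.fa",n);} print >> file; n++; next;} { print >> file; }' < multi.fasta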

#Another great solution is GenomeTools (gt), available at http://genometools.org/, which offers the following simple command:

gt splitfasta -numfiles 10 multi.fasta
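
To make the awk approach above concrete, here is a sketch wrapped in a function and run on a five-record toy file. It differs from the one-liners in two ways that are my own choices: chunks are numbered 1, 2, 3, ... rather than by record index, and each finished chunk is closed so that very large inputs do not exhaust open file handles. The name split_fasta and the demo data are illustrative.

```shell
# Sketch: split a multi-FASTA file into chunks of at most `size` records.
# split_fasta, the chunk naming, and the demo data are illustrative.
split_fasta() {
    input="$1"; size="$2"; prefix="$3"
    awk -v size="$size" -v prefix="$prefix" '
        /^>/ {
            if (n % size == 0) {
                if (file) close(file)      # finished chunk: release the handle
                file = sprintf("%s%d.fa", prefix, n / size + 1)
            }
            n++
        }
        file { print > file }              # header and sequence lines alike
    ' "$input"
}

# Demo: 5 records, 2 per chunk
demo_dir=$(mktemp -d)
printf '>s1\nACGT\n>s2\nGG\nTT\n>s3\nA\n>s4\nC\n>s5\nG\n' > "$demo_dir/multi.fa"
split_fasta "$demo_dir/multi.fa" 2 "$demo_dir/chunk"
grep -c '^>' "$demo_dir"/chunk*.fa
```

Unlike the >> form, this version overwrites existing chunk files, so re-running it does not append duplicate records.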

Install VarScan on Ubuntu / Linux !

Abhi — Wed, 02 Feb 2022 02:38:25 -0600

#VarScan is a Java program designed to call variants in sequencing data. It was developed at the Genome Institute at Washington University and is hosted on GitHub. To use VarScan we simply download the distributed jar file into ~/workspace/bin. As with the other Java programs already installed in this section, we invoke VarScan via java -jar.

# Install VarScan
cd ~/workspace/bin
curl -L -k -o VarScan.v2.4.2.jar https://github.com/dkoboldt/varscan/releases/download/2.4.2/VarScan.v2.4.2.jar
java -jar ~/workspace/bin/VarScan.v2.4.2.jar
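
Because a jar cannot be symlinked into place as an executable the way the binary tools are, one option is a tiny wrapper script. This is only a sketch of that idea, not part of the VarScan distribution; make_jar_wrapper and the wrapper name varscan are hypothetical.

```shell
# Sketch: write a wrapper so a jar can be called like an ordinary command.
# make_jar_wrapper and the wrapper name "varscan" are hypothetical.
make_jar_wrapper() {
    jar_path="$1"; wrapper_path="$2"
    # "$@" is written literally into the wrapper and forwards all arguments.
    printf '#!/bin/sh\nexec java -jar "%s" "$@"\n' "$jar_path" > "$wrapper_path"
    chmod +x "$wrapper_path"
}

# Demo in a throwaway directory (the jar itself is not needed to write the wrapper)
demo_dir=$(mktemp -d)
make_jar_wrapper "$demo_dir/VarScan.v2.4.2.jar" "$demo_dir/varscan"
cat "$demo_dir/varscan"
```

Pointed at ~/workspace/bin/VarScan.v2.4.2.jar and written to ~/workspace/bin/varscan, the wrapper would let ~/workspace/bin/varscan stand in for the java -jar call above.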

Install StringTie on Ubuntu / Linux !

Abhi — Wed, 02 Feb 2022 02:36:02 -0600

#StringTie is a software program for transcript assembly and quantification of RNA-seq data. Binary distributions are available, so to install it we just download the distribution and extract it. As with our other programs, we also make a symlink to make it easier to find.

# download and extract
cd ~/workspace/bin
wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-1.3.0.Linux_x86_64.tar.gz
tar -xzvf stringtie-1.3.0.Linux_x86_64.tar.gz

# make symlink
ln -s ~/workspace/bin/stringtie-1.3.0.Linux_x86_64/stringtie ~/workspace/bin/stringtie

# test installation
~/workspace/bin/stringtie -h
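
The download, extract, symlink pattern recurs throughout these notes. One thing to watch: a plain ln -s fails if the link already exists, for example after upgrading to a newer release. A sketch of a re-runnable version that uses ln -sfn to replace the link; link_tool and the dummy tool directories are illustrative.

```shell
# Sketch: (re)point a symlink at an executable, refusing non-executable targets.
# link_tool and the dummy "tool" releases are illustrative.
link_tool() {
    target="$1"; link="$2"
    [ -x "$target" ] || { echo "not executable: $target" >&2; return 1; }
    # -f replaces an existing link, -n avoids following it if it is a directory link
    ln -sfn "$target" "$link"
}

# Demo: two fake releases of a tool, then switch the link from one to the other
demo_dir=$(mktemp -d)
for v in 1.0 1.1; do
    mkdir -p "$demo_dir/tool-$v"
    printf '#!/bin/sh\necho "tool %s"\n' "$v" > "$demo_dir/tool-$v/tool"
    chmod +x "$demo_dir/tool-$v/tool"
done
link_tool "$demo_dir/tool-1.0/tool" "$demo_dir/tool"
link_tool "$demo_dir/tool-1.1/tool" "$demo_dir/tool"   # re-pointing works
readlink "$demo_dir/tool"
```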

Install R in Ubuntu / Linux !

Abhi — Wed, 02 Feb 2022 02:34:51 -0600

#R is a feature-rich interpreted programming language originally released in 1995. It is heavily used in the bioinformatics community, largely because of the numerous R libraries available on Bioconductor. R takes several minutes to compile, so we’ll use a copy that has already been set up. To install R ourselves, we would first download and extract the source code. Next we’d configure the installation with --with-x=no, which tells R to build without X11, a windowing system for displays. We’d also specify --prefix, the location where the R framework will go, including the additional R libraries we’ll download later. From there we’d run make and make install to build the software and copy the files to their proper location, then create symlinks for the executables. Finally we’d install the devtools and BiocManager packages from the command line to make installing additional packages easier. The commands below are exactly what was run to set up the R we will be using, apart from the installation location.

## download and extract
cd ~/workspace/bin
wget https://cran.r-project.org/src/base/R-3/R-3.5.1.tar.gz
tar -zxvf R-3.5.1.tar.gz

## configure the installation, build the code
cd R-3.5.1
./configure --prefix=/home/ubuntu/workspace/bin --with-x=no
make
make install

## make symlinks
ln -s ~/workspace/bin/R-3.5.1/bin/Rscript ~/workspace/bin/Rscript
ln -s ~/workspace/bin/R-3.5.1/bin/R ~/workspace/bin/R

## test installation
cd ~/workspace/bin
~/workspace/bin/Rscript --version

## install additional packages
~/workspace/bin/R --vanilla -e 'install.packages(c("devtools", "BiocManager", "dplyr", "tidyr", "ggplot2"), repos="http://cran.us.r-project.org")'

Install Gffcompare on Ubuntu / Linux !

Abhi — Wed, 02 Feb 2022 02:34:18 -0600

#Gffcompare is a program used to perform operations on General Feature Format (GFF) and Gene Transfer Format (GTF) files. It has a binary distribution compatible with the Linux system we’re using, so we will just download it, extract it, and make a symlink.

# download and extract
cd ~/workspace/bin
wget http://ccb.jhu.edu/software/stringtie/dl/gffcompare-0.9.8.Linux_x86_64.tar.gz
tar -xzvf gffcompare-0.9.8.Linux_x86_64.tar.gz

# make symlink
ln -s ~/workspace/bin/gffcompare-0.9.8.Linux_x86_64/gffcompare ~/workspace/bin/gffcompare

# test installation
~/workspace/bin/gffcompare