BOL: All site pages

Linux Commands Cheat Sheet for Bioinformatics and Computational Biology Professionals

Rahul Nayak — Mon, 05 Feb 2018 18:50:41 -0600

The purpose of this cheat sheet is to introduce biologists and bioinformaticians to the frequently used tools for NGS analysis and to give them experience in writing one-liners.

File System
ls — list items in current directory
ls -l — list items in current directory and show in long format to see permissions, size, and modification date
ls -a — list all items in current directory, including hidden files
ls -F — list all items in current directory and show directories with a slash and executables with a star
ls dir — list all items in directory dir
cd dir — change directory to dir
cd .. — go up one directory
cd / — go to the root directory
cd ~ — go to your home directory
cd - — go to the last directory you were just in
pwd — show present working directory
mkdir dir — make directory dir
rm file — remove file
rm -r dir — remove directory dir recursively
cp file1 file2 — copy file1 to file2
cp -r dir1 dir2 — copy directory dir1 to dir2 recursively
mv file1 file2 — move (rename) file1 to file2
ln -s file link — create symbolic link to file
touch file — create or update file
cat file — output the contents of file
less file — view file with page navigation
head file — output the first 10 lines of file
tail file — output the last 10 lines of file
tail -f file — output the contents of file as it grows, starting with the last 10 lines
vim file — edit file
alias name='command' — create an alias for a command
System
shutdown — shut down machine
reboot — restart machine
date — show the current date and time
whoami — who you are logged in as
finger user — display information about user
man command — show the manual for command
df — show disk usage
du — show directory space usage
free — show memory and swap usage
whereis app — show possible locations of app
which app — show which app will be run by default
Process Management
ps — display your currently active processes
top — display all running processes
kill pid — kill process id pid
kill -9 pid — force kill process id pid
Permissions
ls -l — list items in current directory and show permissions
chmod ugo file — change permissions of file to ugo - u is the user's permissions, g is the group's permissions, and o is everyone else's permissions. The values of u, g, and o can be any number between 0 and 7.
7 — full permissions
6 — read and write only
5 — read and execute only
4 — read only
3 — write and execute only
2 — write only
1 — execute only
0 — no permissions
chmod 600 file — you can read and write - good for files
chmod 700 file — you can read, write, and execute - good for scripts
chmod 644 file — you can read and write, and everyone else can only read - good for web pages
chmod 755 file — you can read, write, and execute, and everyone else can read and execute - good for programs that you want to share
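
As a rough illustration of how the three digits are built: each digit is the sum of read (4), write (2), and execute (1). The sketch below applies 640 to a scratch file. The scratch file from mktemp and the variable names are assumptions made only for this demonstration.

```shell
# Each permission digit is a sum: read = 4, write = 2, execute = 1.
tmpfile=$(mktemp)        # hypothetical scratch file, removed at the end
chmod 640 "$tmpfile"     # user: 4+2 = rw-, group: 4 = r--, others: 0 = ---

# Keep only the type and permission characters from the long listing.
perm_string=$(ls -l "$tmpfile" | cut -c1-10)
echo "$perm_string"      # -rw-r-----

rm -f "$tmpfile"
```

Reading the output left to right: the leading dash marks a regular file, then come three characters each for user, group, and everyone else.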
Networking
wget url — download the file at url
curl -O url — download the file at url (without -O, curl writes it to standard output)
scp user@host:file dir — secure copy a file from remote server to the dir directory on your machine
scp file user@host:dir — secure copy a file from your machine to the dir directory on a remote server
scp -r user@host:dir dir — secure copy the directory dir from remote server to the directory dir on your machine
ssh user@host — connect to host as user
ssh -p port user@host — connect to host on port as user
ssh-copy-id user@host — add your key to host for user to enable a keyed or passwordless login
ping host — ping host and output results
whois domain — get information for domain
dig domain — get DNS information for domain
dig -x host — reverse lookup host
lsof -i tcp:1337 — list all processes running on port 1337
Searching
grep pattern files — search for pattern in files
grep -r pattern dir — search recursively for pattern in dir
grep -rn pattern dir — search recursively for pattern in dir and show the line number found
grep -r pattern dir --include='*.ext' — search recursively for pattern in dir and only search in files with .ext extension
command | grep pattern — search for pattern in the output of command
find / -name file — find all instances of file by walking the real file system
locate file — find all instances of file using indexed database built from the updatedb command. Much faster than find
sed -i 's/day/night/g' file — find all occurrences of day in a file and replace them with night - s means substitute and g means global - sed also supports regular expressions
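
To make the effect of the g flag concrete, here is a small sketch that runs the same substitution with and without it. The sample text and variable names are made up for the example, and the input comes from a pipe, so no file is modified.

```shell
sample="day after day"

# Without g, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$sample" | sed 's/day/night/')

# With g, every match on the line is replaced.
all_matches=$(printf '%s\n' "$sample" | sed 's/day/night/g')

echo "$first_only"    # night after day
echo "$all_matches"   # night after night
```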
Compression
tar cf file.tar files — create a tar named file.tar containing files
tar xf file.tar — extract the files from file.tar
tar czf file.tar.gz files — create a tar with Gzip compression
tar xzf file.tar.gz — extract a tar using Gzip
gzip file — compresses file and renames it to file.gz
gzip -d file.gz — decompresses file.gz back to file
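
The tar commands above can be tried end to end with a short round trip. This is only a sketch: the directory and file names are hypothetical, and everything happens in a temporary directory created by mktemp.

```shell
workdir=$(mktemp -d)                      # scratch area for the demo
mkdir "$workdir/reads" "$workdir/restored"
printf 'ACGT\n' > "$workdir/reads/sample.txt"

# Create a gzip-compressed archive; -C switches directory before adding files.
tar czf "$workdir/reads.tar.gz" -C "$workdir" reads

# Extract the archive into a different directory.
tar xzf "$workdir/reads.tar.gz" -C "$workdir/restored"

restored=$(cat "$workdir/restored/reads/sample.txt")
echo "$restored"                          # ACGT

rm -rf "$workdir"
```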
Shortcuts
ctrl+a — move cursor to beginning of line
ctrl+e — move cursor to end of line
alt+f — move cursor forward 1 word
alt+b — move cursor backward 1 word

List of visualization tools for genome alignments

Rahul Nayak — Fri, 02 Feb 2018 13:25:33 -0600

Genome browsers are useful not only for showing final results but also for improving analysis protocols, testing data quality, and generating result drafts. Their integration in analysis pipelines allows the optimization of parameters, which leads to better results. Sometimes, however, we need publication-ready figures of genomes. The following is a list of genome alignment visualization tools that could be useful for analysis and interpretation of results:

ABySS Explorer

Interactive Java application that uses a novel graph-based representation to display a sequence assembly and associated metadata

http://www.bcgsc.ca/platform/bioinfo/software/abyss-explorer

BamView

Genome browser and annotation tool that allows visualization of sequence features, next-generation sequencing (NGS) data and the results of analyses within the context of the sequence, and also its six-frame translation

http://www.sanger.ac.uk/resources/software/artemis/

DNannotator

Annotation web toolkit for regional genomic sequences

http://bioapp.psych.uic.edu/DNannotator.htm

JVM

Java Visual Mapping tool for NGS reads

http://www.springer.com/cda/content/document/cda_downloaddocument/9789401792448-c2.pdf?SGWID=0-0-45-1487072-p176815501

LookSeq

Web-based visualization of sequences derived from multiple sequencing technologies. Low- or high-depth read pileups and easy visualization of putative single nucleotide and structural variation

http://lookseq.sourceforge.net

MagicViewer

Visualization of short read alignment, identification of genetic variation and association with annotation information of a reference genome

http://bioinformatics.zj.cn/magicviewer/

MapView

Alignments of huge-scale single-end and paired-end short reads

http://omictools.com/mapview-s1367.html

MultiPipMaker

Computes alignments of similar regions in two DNA sequences. The resulting alignments are summarized with a ‘percent identity plot’ (pip)

http://pipmaker.bx.psu.edu/pipmaker/

PileLineGUI

Handling genome position files in NGS studies

http://sing.ei.uvigo.es/pileline/pilelinegui.html

SAMtools tview

Simple and fast text alignment viewer; NGS compatible

http://www.htslib.org/

SEWAL

Uses a locality-sensitive hashing algorithm to enumerate all unique sequences in an entire Illumina sequencing run

http://www.sourceforge.net/projects/sewal

STAR

A web-based integrated solution to management and visualization of sequencing data

http://wanglab.ucsd.edu/star/browser

SVA

Software for annotating and visualizing sequenced human genomes

http://www.svaproject.org

Integrative Genomics Viewer (IGV)

Visualization of large heterogeneous datasets, providing a smooth and intuitive user experience at all levels of genome resolution

https://www.broadinstitute.org/igv/

ZOOM Lite

NGS data mapping and visualization software

http://bioinfor.com/zoom/lite/

Comprehensive list of visualization tools for biological pathways

Neel — Tue, 30 Jan 2018 06:01:31 -0600

The study of biological pathways is key to understanding the different processes inside a cell: proteins exert their function not in isolation but in a tightly controlled network of interactions and reactions. Activation of a pathway typically leads to a change of state in the cell. Pathways come in different flavors, depending on their functions in the cell. The three main types are metabolic pathways, gene regulatory pathways, and signaling pathways. Biological pathways and networks are not only an appropriate way to visualize molecular reactions; they have also become a leading method in -omics data analysis and visualization.

The following is a comprehensive list of visualization tools for biological pathways:

BiNA

Drawings of metabolic networks supporting hiding of cofactors and drawing of chemical structures

http://bina.unipax.info/

BioTapestry

Interactive tool for building, visualizing and sharing gene regulatory network models over the web

http://www.biotapestry.org/

Caleydo

Visual analysis framework targeted at biomolecular data. Visualization of interdependencies between multiple datasets

http://www.caleydo.org/

CellDesigner

A modeling tool for biochemical networks

http://www.celldesigner.org/

Edinburgh Pathway Editor

Edit and draw pathway diagrams

http://epe.sourceforge.net/SourceForge/EPE.html

GenMAPP

Visualization of gene expression and other genomic data on maps representing biological pathways and groupings of genes

http://www.genmapp.org/

Ingenuity IPA

Data integration platform and manually annotated pathways

http://tinyurl.com/IngenuityPath

JDesigner

Graphical modeling environment for biochemical reaction networks

http://jdesigner.sourceforge.net/Site/JDesigner.html

KaPPA View

Plant pathways

http://kpv.kazusa.or.jp/

KEGG Atlas

Interactive Kyoto Encyclopedia of Genes and Genomes pathways

http://www.genome.jp/kegg/

Omix

Visualizing multi-omics data in metabolic networks

https://www.omix-visualization.com

PathVisio

Biological pathway analysis software that allows drawing, editing and analysis of biological pathways

http://www.pathvisio.org/

VitaPad

Application to visualize biological pathways and map experimental data to them

http://tinyurl.com/vitapad/

Web tools for pathways

ArrayXPath

Mapping and visualizing microarray gene-expression data and integrated biological pathway resources using SVG

http://tinyurl.com/ArrayXPath/

GEPAT

Integrated analysis of transcriptome data in genomic, proteomic and metabolic contexts

http://gepat.sourceforge.net/

iPath

Web-based tool for the visualization, analysis and customization of pathway maps

http://pathways.embl.de/

Kegg-Based Viewer

KEGG-based pathway visualization tool for complex high-throughput data

http://www.g-language.org/data/marray/

MapMan

User-driven tool that displays large datasets onto diagrams of metabolic pathways or other processes

http://mapman.gabipd.org/web/guest/mapman

MetPA

Analysis and visualization of metabolomic data within the biological context of metabolic pathways

http://metpa.metabolomics.ca

Omics Viewer

Data mapping on BioCyc pathways (collection of 5500 pathway/genome databases)

http://www.biocyc.org/

Pathway Explorer

Interactive Java drawing tool for the construction of biological pathway diagrams in a visual way and the annotation of the components and interactions between them

http://genome.tugraz.at/pathwayexplorer/pathwayexplorer_description.shtml

Pathway projector

Zoomable pathway browser using KEGG atlas and Google Maps API

http://www.g-language.org/PathwayProjector/

PATIKA

Integrated environment composed of a central database and a visual editor, built around an extensive ontology and an integration framework

http://www.cs.bilkent.edu.tr/~patikaweb/

Reactome SkyPainter

Visualization of over-represented pathways and reactions from gene lists

http://www.reactome.org/skypainter-2

WikiPathways

Wiki-based, open, public platform dedicated to the curation of biological pathways by and for the scientific community

http://www.wikipathways.org/

List of visualization tools for network biology

Jit — Mon, 29 Jan 2018 05:12:24 -0600

Network analysis is any structured technique used to mathematically analyze a circuit (a “network” of interconnected components). Network analysis provides the ability to quantify associations between individuals, which makes it possible to infer details about the network as a whole at the species and/or population level. A few tools published in BMC are listed at https://bmcbioinformatics.biomedcentral.com/articles/sections/networks-analysis

The following is a list of standalone applications for network analysis:

Arena 3D

3D visualization of multi-layer networks

http://www.arena3d.org

Biana

Data integration and network management

http://sbi.imim.es/web/BIANA.php

BioLayout Express 3D

2D/3D network visualization

http://www.biolayout.org/

BiologicalNetworks

Efficient integrated multi-level analysis of microarray, sequence, regulatory and other data

http://www.biologicalnetworks.org

BioMiner

Modeling, analyzing and visualizing biochemical pathways and networks

http://www.zbi.uni-saarland.de/chair/projects/BioMiner

Cell Illustrator

Petri nets for modeling and simulating biological networks

http://www.cellillustrator.com

COPASI

Analysis of biochemical networks and their dynamics

http://www.copasi.org/

Cytoscape

Network visualization and analysis. Over 200 plugins [60]

http://www.cytoscape.org/

Dizzy

Chemical kinetics stochastic simulation software

http://magnet.systemsbiology.net/software/Dizzy/

DyCoNet

Gephi plugin that can be used to identify dynamic communities in networks

https://github.com/juliemkauffman/DyCoNet

GENeVis

Network and pathway visualization

http://tinyurl.com/genevis/

GEPHI

Interactive visualization and exploration for any network and complex system, dynamic and hierarchical graph.

https://gephi.org

Igraph

Collection of network analysis tools with the emphasis on efficiency, portability and ease of use

http://igraph.sourceforge.net

Medusa

Semantic and multi-edged simple networks

https://sites.google.com/site/medusa3visualization/

NAViGaTOR

Visualizing and analyzing protein-protein interaction networks

http://tinyurl.com/navigator1/

N-Browse

Interactive graphical browser for biological networks

http://www.gnetbrowse.org/

NeAT

Topological and clustering analysis of networks

http://rsat.ulb.ac.be/neat/

Ondex

Data integration and visualization of large networks

http://www.ondex.org/

Osprey

Visualization and annotation of biological networks

http://biodata.mshri.on.ca/osprey/servlet/Index

Pajek

Analysis and visualization of large networks and social network analysis

http://vlado.fmf.uni-lj.si/pub/networks/pajek/

PathwayAssist

Navigation and analysis of biological pathways, gene regulation networks and protein interaction maps.

http://www.ariadnegenomics.com/downloads/

PIVOT

Layout algorithms for visualizing protein interactions and families

http://acgt.cs.tau.ac.il/pivot/

ProCope

Prediction and evaluation of protein complexes from purification data experiments

http://www.bio.ifi.lmu.de/Complexes/ProCope/

ProViz

Visualization and exploration of interaction networks. Gene Ontology and PSI-MI formats supported

http://cbi.labri.fr/eng/proviz.htm

SpectralNET

Network analysis and visualizations. Scatter plots and dimensionality reduction algorithms

https://www.broadinstitute.org/software/spectralnet

Tulip

Enables the development of algorithms, visual encodings, interaction techniques, data models and domain-specific visualizations

http://tulip.labri.fr/TulipDrupal/

VANESA

Automatic reconstruction and analysis of biological networks and Petri nets based on life-science database information

http://agbi.techfak.uni-bielefeld.de/vanesa/

VANTED

Network reconstruction, data visualization, integration of various data types, network simulation

http://tinyurl.com/vanted/

yEd

Manual creation of diagrams and import of external data

http://tinyurl.com/yEdGraph/

Web tools for network analysis

APID

Unified protein-protein interactions from BIND, BioGRID, DIP, HPRD, IntAct and MINT

http://bioinfow.dep.usal.es/apid/

Arcadia

Translates text-based descriptions of biological networks (SBML files) into standardized diagrams (Systems Biology Graphical Notation Process Description maps)

http://arcadiapathways.sourceforge.net/

AVIS

Viewer for signaling networks

http://actin.pharm.mssm.edu/AVIS2

bioPIXIE

Discovery of biological networks from diverse functional genomic data

http://pixie.princeton.edu/pixie

CellPublisher

Interactive representations of biochemical processes

http://cellpublisher.gobics.de/

Graphle

Distributed network exploration and visualization of interactive large, dense graphs

http://tinyurl.com/graphle/

GraphWeb

Web server for graph-based analysis of biological networks

http://biit.cs.ut.ee/graphweb/

Hubba

Web-based service to explore the essential nodes in a network

http://hub.iis.sinica.edu.tw/Hubba

NetworkBLAST

Analysis of protein interaction networks across species to infer protein complexes that are conserved in evolution

http://www.cs.tau.ac.il/~bnet/networkblast.htm

Pathview

Tool set for pathway-based data integration and visualization

http://Pathview.r-forge.r-project.org/

PINA

Integrated platform for protein interaction network construction, filtering, analysis, visualization and management

http://cbg.garvan.unsw.edu.au/pina/home.do

ReMatch

Web-based tool for integration of user-given stoichiometric metabolic models into a database collected from public data sources

http://www.cs.helsinki.fi/group/sysfys/software/rematch/

SNOW

Gene mapping on a reference or human protein-protein interaction network that SNOW hosts

http://snow.bioinfo.cipf.es

STITCH

Resource to explore known and predicted interactions of chemicals and proteins

http://stitch.embl.de/

STRING

Protein interaction networks and integration of data such as genomic context, high-throughput experiments, conserved coexpression and previous knowledge derived from the literature

http://string-db.org

TVNViewer

An interactive visualization tool for exploring networks that change over time or space

http://www.sailing.cs.cmu.edu/main/?page_id=545

tYNA

System for managing, comparing and mining multiple networks

http://tyna.gersteinlab.org/tyna/

VisANT

Visualization, mining, analysis and modeling of biological networks, metabolic networks and ecosystems

http://visant.bu.edu/

sam to bam conversion !!

Jit — Fri, 26 Jan 2018 02:36:18 -0600

To convert a SAM file to BAM, run the following commands:

Code:

$ samtools view -b -S file.sam > file.bam

Then you will need to use

Code:

$ samtools sort -o file-sorted.bam file.bam

followed by

Code:

$ samtools index file-sorted.bam

in order to get an indexed file.

If you just type

Code:

$ samtools

or samtools followed by the name of one of the samtools commands, you will get a few lines of help giving the correct syntax for that command.
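
The three steps can also be chained so that no unsorted intermediate BAM is written. The function below is only a sketch: the name sam_to_sorted_bam is hypothetical, and it assumes samtools 1.3 or newer, where sort takes -o for the output file and a lone dash means standard input.

```shell
# Hypothetical helper: convert, sort, and index a SAM file in one call.
# Assumes samtools 1.3 or newer is on the PATH.
sam_to_sorted_bam() {
    input_sam=$1
    # file.sam becomes file-sorted.bam
    output_bam=${input_sam%.sam}-sorted.bam

    # Stream BAM records straight into sort, then index only if that worked.
    samtools view -b "$input_sam" | samtools sort -o "$output_bam" - &&
        samtools index "$output_bam"
}
```

Calling sam_to_sorted_bam file.sam would be expected to leave file-sorted.bam and its index next to the input.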

PerlOneLiner for Bioinformatician

Shruti Paniwala — Mon, 15 Jan 2018 04:57:40 -0600

FILE SPACING
------------

# Double space a file
perl -pe '$\="\n"'
perl -pe 'BEGIN { $\="\n" }'
perl -pe '$_ .= "\n"'
perl -pe 's/$/\n/'
perl -nE 'say'

# Double space a file, except the blank lines
perl -pe '$_ .= "\n" unless /^$/'
perl -pe '$_ .= "\n" if /\S/'

# Triple space a file
perl -pe '$\="\n\n"'
perl -pe '$_.="\n\n"'

# N-space a file
perl -pe '$_.="\n"x7'

# Add a blank line before every line
perl -pe 's//\n/'

# Remove all blank lines
perl -ne 'print unless /^$/'
perl -lne 'print if length'
perl -ne 'print if /\S/'

# Remove all consecutive blank lines, leaving just one
perl -00 -pe ''
perl -00pe0

# Compress/expand all blank lines into N consecutive ones
perl -00 -pe '$_.="\n"x4'

# Fold a file so that every set of 10 lines becomes one tab-separated line
perl -lpe '$\ = $. % 10 ? "\t" : "\n"'
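
None of the one-liners above name their input. Like most filters, perl reads standard input unless file names follow the script. A minimal sketch of both styles, using the blank-line remover on made-up sequence lines (the temporary file and variable names are assumptions for the demo):

```shell
# Style 1: read from standard input through a pipe.
from_stdin=$(printf 'ACGT\n\nTTGA\n' | perl -ne 'print if /\S/')

# Style 2: name the input file after the one-liner.
seq_file=$(mktemp)                         # hypothetical scratch file
printf 'ACGT\n\nTTGA\n' > "$seq_file"
from_file=$(perl -ne 'print if /\S/' "$seq_file")
rm -f "$seq_file"

echo "$from_stdin"    # prints ACGT and TTGA on consecutive lines
```

Adding -i (for example perl -i -ne ...) edits the named file in place instead of printing to the terminal.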

LINE NUMBERING
--------------

# Number all lines in a file
perl -pe '$_ = "$. $_"'

# Number only non-empty lines in a file
perl -pe '$_ = ++$a." $_" if /./'

# Number and print only non-empty lines in a file (drop empty lines)
perl -ne 'print ++$a." $_" if /./'

# Number all lines but print line numbers only for non-empty lines
perl -pe '$_ = "$. $_" if /./'

# Number only lines that match a pattern, print others unmodified
perl -pe '$_ = ++$a." $_" if /regex/'

# Number and print only lines that match a pattern
perl -ne 'print ++$a." $_" if /regex/'

# Number all lines, but print line numbers only for lines that match a pattern
perl -pe '$_ = "$. $_" if /regex/'

# Number all lines in a file using a custom format (emulate cat -n)
perl -ne 'printf "%-5d %s", $., $_'

# Print the total number of lines in a file (emulate wc -l)
perl -lne 'END { print $. }'
perl -le 'print $n=()=<>'
perl -le 'print scalar(()=<>)'
perl -le 'print scalar(@foo=<>)'
perl -ne '}{print $.'
perl -nE '}{say $.'

# Print the number of non-empty lines in a file
perl -le 'print scalar(grep{/./}<>)'
perl -le 'print ~~grep{/./}<>'
perl -le 'print~~grep/./,<>'
perl -E 'say~~grep/./,<>'

# Print the number of empty lines in a file
perl -lne '$a++ if /^$/; END {print $a+0}'
perl -le 'print scalar(grep{/^$/}<>)'
perl -le 'print ~~grep{/^$/}<>'
perl -E 'say~~grep{/^$/}<>'

# Print the number of lines in a file that match a pattern (emulate grep -c)
perl -lne '$a++ if /regex/; END {print $a+0}'
perl -nE '$a++ if /regex/; END {say $a+0}'

CALCULATIONS
------------

# Check if a number is a prime
perl -lne '(1x$_) !~ /^1?$|^(11+?)\1+$/ && print "$_ is prime"'

# Print the sum of all the fields on a line
perl -MList::Util=sum -alne 'print sum @F'

# Print the sum of all the fields on all lines
perl -MList::Util=sum -alne 'push @S,@F; END { print sum @S }'
perl -MList::Util=sum -alne '$s += sum @F; END { print $s }'

# Shuffle all fields on a line
perl -MList::Util=shuffle -alne 'print "@{[shuffle @F]}"'
perl -MList::Util=shuffle -alne 'print join " ", shuffle @F'

# Find the minimum element on a line
perl -MList::Util=min -alne 'print min @F'

# Find the minimum element over all the lines
perl -MList::Util=min -alne '@M = (@M, @F); END { print min @M }'
perl -MList::Util=min -alne '$min = min @F; $rmin = $min unless defined $rmin && $min > $rmin; END { print $rmin }'

# Find the maximum element on a line
perl -MList::Util=max -alne 'print max @F'

# Find the maximum element over all the lines
perl -MList::Util=max -alne '@M = (@M, @F); END { print max @M }'

# Replace each field with its absolute value
perl -alne 'print "@{[map { abs } @F]}"'

# Find the total number of fields (words) on each line
perl -alne 'print scalar @F'

# Print the total number of fields (words) on each line followed by the line
perl -alne 'print scalar @F, " $_"'

# Find the total number of fields (words) on all lines
perl -alne '$t += @F; END { print $t}'

# Print the total number of fields that match a pattern
perl -alne 'map { /regex/ && $t++ } @F; END { print $t }'
perl -alne '$t += /regex/ for @F; END { print $t }'
perl -alne '$t += grep /regex/, @F; END { print $t }'

# Print the total number of lines that match a pattern
perl -lne '/regex/ && $t++; END { print $t }'

# Print the number PI to n decimal places
perl -Mbignum=bpi -le 'print bpi(n)'

# Print the number PI to 39 decimal places
perl -Mbignum=PI -le 'print PI'

# Print the number E to n decimal places
perl -Mbignum=bexp -le 'print bexp(1,n+1)'

# Print the number E to 39 decimal places
perl -Mbignum=e -le 'print e'

# Print UNIX time (seconds since Jan 1, 1970, 00:00:00 UTC)
perl -le 'print time'

# Print GMT (Greenwich Mean Time) and local computer time
perl -le 'print scalar gmtime'
perl -le 'print scalar localtime'

# Print local computer time in H:M:S format
perl -le 'print join ":", (localtime)[2,1,0]'

# Print yesterday's date
perl -MPOSIX -le '@now = localtime; $now[3] -= 1; print scalar localtime mktime @now'

# Print date 14 months, 9 days and 7 seconds ago
perl -MPOSIX -le '@now = localtime; $now[0] -= 7; $now[4] -= 14; $now[3] -= 9; print scalar localtime mktime @now'

# Prepend timestamps to stdout (GMT, localtime)
tail -f logfile | perl -ne 'print scalar gmtime," ",$_'
tail -f logfile | perl -ne 'print scalar localtime," ",$_'

# Calculate factorial of 5
perl -MMath::BigInt -le 'print Math::BigInt->new(5)->bfac()'
perl -le '$f = 1; $f *= $_ for 1..5; print $f'

# Calculate greatest common divisor (GCD)
perl -MMath::BigInt=bgcd -le 'print bgcd(@list_of_numbers)'

# Calculate GCD of numbers 20 and 35 using Euclid's algorithm
perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m'

# Calculate least common multiple (LCM) of numbers 35, 20 and 8
perl -MMath::BigInt=blcm -le 'print blcm(35,20,8)'

# Calculate LCM of 20 and 35 using Euclid's formula: n*m/gcd(n,m)
perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m'

# Generate 10 random numbers between 5 and 15 (excluding 15)
perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n'

# Find and print all permutations of a list
perl -MAlgorithm::Permute -le '$l = [1,2,3,4,5]; $p = Algorithm::Permute->new($l); print @r while @r = $p->next'

# Generate the power set
perl -MList::PowerSet=powerset -le '@l = (1,2,3,4,5); for (@{powerset(@l)}) { print "@$_" }'

# Convert an IP address to unsigned integer
perl -le '$i=3; $u += ($_<<8*$i--) for "127.0.0.1" =~ /(\d+)/g; print $u'
perl -le '$ip="127.0.0.1"; $ip =~ s/(\d+)\.?/sprintf("%02x", $1)/ge; print hex($ip)'
perl -le 'print unpack("N", 127.0.0.1)'
perl -MSocket -le 'print unpack("N", inet_aton("127.0.0.1"))'

# Convert an unsigned integer to an IP address
perl -MSocket -le 'print inet_ntoa(pack("N", 2130706433))'
perl -le '$ip = 2130706433; print join ".", map { (($ip>>8*($_))&0xFF) } reverse 0..3'
perl -le '$ip = 2130706433; $, = "."; print map { (($ip>>8*($_))&0xFF) } reverse 0..3'

STRING CREATION AND ARRAY CREATION
----------------------------------

# Generate and print the alphabet
perl -le 'print a..z'
perl -le 'print ("a".."z")'
perl -le '$, = ","; print ("a".."z")'
perl -le 'print join ",", ("a".."z")'

# Generate and print all the strings from "a" to "zz"
perl -le 'print ("a".."zz")'
perl -le 'print "aa".."zz"'

# Create a hex lookup table
@hex = (0..9, "a".."f")

# Convert a decimal number to hex using @hex lookup table
perl -le '$num = 255; @hex = (0..9, "a".."f"); while ($num) { $s = $hex[($num%16)&15].$s; $num = int $num/16 } print $s'
perl -le '$hex = sprintf("%x", 255); print $hex'
perl -le '$num = "ff"; print hex $num'

# Generate a random 8 character password
perl -le 'print map { ("a".."z")[rand 26] } 1..8'
perl -le 'print map { ("a".."z", 0..9)[rand 36] } 1..8'

# Create a string of specific length
perl -le 'print "a"x50'

# Create a repeated list of elements
perl -le '@list = (1,2)x20; print "@list"'

# Create an array from a string
@months = split ' ', "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec"
@months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/

# Create a string from an array
@stuff = ("hello", 0..9, "world"); $string = join '-', @stuff

# Find the numeric values for characters in the string
perl -le 'print join ", ", map { ord } split //, "hello world"'

# Convert a list of numeric ASCII values into a string
perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)'
perl -le '@ascii = (99, 111, 100, 105, 110, 103); print map { chr } @ascii'

# Generate an array with odd numbers from 1 to 100
perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"'
perl -le '@odd = grep { $_ & 1 } 1..100; print "@odd"'

# Generate an array with even numbers from 1 to 100
perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"'

# Find the length of the string
perl -le 'print length "one-liners are great"'

# Find the number of elements in an array
perl -le '@array = ("a".."z"); print scalar @array'
perl -le '@array = ("a".."z"); print $#array + 1'

TEXT CONVERSION AND SUBSTITUTION
--------------------------------

# ROT13 a string
'y/A-Za-z/N-ZA-Mn-za-m/'

# ROT 13 a file
perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file

# Base64 encode a string
perl -MMIME::Base64 -e 'print encode_base64("string")'
perl -MMIME::Base64 -0777 -ne 'print encode_base64($_)' file

# Base64 decode a string
perl -MMIME::Base64 -le 'print decode_base64("base64string")'
perl -MMIME::Base64 -ne 'print decode_base64($_)' file

# URL-escape a string
perl -MURI::Escape -le 'print uri_escape($string)'

# URL-unescape a string
perl -MURI::Escape -le 'print uri_unescape($string)'

# HTML-encode a string
perl -MHTML::Entities -le 'print encode_entities($string)'

# HTML-decode a string
perl -MHTML::Entities -le 'print decode_entities($string)'

# Convert all text to uppercase
perl -nle 'print uc'
perl -ple '$_=uc'
perl -nle 'print "\U$_"'

# Convert all text to lowercase
perl -nle 'print lc'
perl -ple '$_=lc'
perl -nle 'print "\L$_"'

# Uppercase only the first letter of each line (lowercase the rest)
perl -nle 'print ucfirst lc'
perl -nle 'print "\u\L$_"'

# Invert the letter case
perl -ple 'y/A-Za-z/a-zA-Z/'

# Camel case each line
perl -ple 's/(\w+)/\u$1/g'
perl -ple 's/(?<!\x27)(\w+)/\u$1/g'

# Strip leading whitespace (spaces, tabs) from the beginning of each line
perl -ple 's/^[ \t]+//'
perl -ple 's/^\s+//'

# Strip trailing whitespace (space, tabs) from the end of each line
perl -ple 's/[ \t]+$//'

# Strip whitespace from the beginning and end of each line
perl -ple 's/^[ \t]+|[ \t]+$//g'

# Convert UNIX newlines to DOS/Windows newlines
perl -pe 's|\n|\r\n|'

# Convert DOS/Windows newlines to UNIX newlines
perl -pe 's|\r\n|\n|'

# Convert UNIX newlines to Mac newlines
perl -pe 's|\n|\r|'

# Substitute (find and replace) "foo" with "bar" on each line
perl -pe 's/foo/bar/'

# Substitute (find and replace) all "foo"s with "bar" on each line
perl -pe 's/foo/bar/g'

# Substitute (find and replace) "foo" with "bar" on lines that match "baz"
perl -pe '/baz/ && s/foo/bar/'

# Binary patch a file (find and replace a given array of bytes as hex numbers)
perl -pi -e 's/\x89\xD8\x48\x8B/\x90\x90\x48\x8B/g' file

SELECTIVE PRINTING AND DELETING OF CERTAIN LINES
------------------------------------------------

# Print the first line of a file (emulate head -1)
perl -ne 'print; exit'

# Print the first 10 lines of a file (emulate head -10)
perl -ne 'print if $. <= 10'
perl -ne '$. <= 10 && print'
perl -ne 'print if 1..10'

# Print the last line of a file (emulate tail -1)
perl -ne '$last = $_; END { print $last }'
perl -ne 'print if eof'

# Print the last 10 lines of a file (emulate tail -10)
perl -ne 'push @a, $_; @a = @a[@a-10..$#a]; END { print @a }'

# Print only lines that match a regular expression
perl -ne '/regex/ && print'

# Print only lines that do not match a regular expression
perl -ne '!/regex/ && print'

# Print the line before a line that matches a regular expression
perl -ne '/regex/ && $last && print $last; $last = $_'

# Print the line after a line that matches a regular expression
perl -ne 'if ($p) { print; $p = 0 } $p++ if /regex/'

# Print lines that match regex AAA and regex BBB in any order
perl -ne '/AAA/ && /BBB/ && print'

# Print lines that don't match regexes AAA and BBB
perl -ne '!/AAA/ && !/BBB/ && print'

# Print lines that match regex AAA followed by regex BBB followed by CCC
perl -ne '/AAA.*BBB.*CCC/ && print'

# Print lines that are 80 chars or longer
perl -ne 'print if length >= 80'

# Print lines that are less than 80 chars in length
perl -ne 'print if length < 80'

# Print only line 13
perl -ne '$. == 13 && print && exit'

# Print all lines except line 27
perl -ne '$. != 27 && print'
perl -ne 'print if $. != 27'

# Print only lines 13, 19 and 67
perl -ne 'print if $. == 13 || $. == 19 || $. == 67'
perl -ne 'print if int($.) ~~ (13, 19, 67)'

# Print all lines between two regexes (including lines that match regex)
perl -ne 'print if /regex1/../regex2/'

# Print all lines from line 17 to line 30
perl -ne 'print if $. >= 17 && $. <= 30'
perl -ne 'print if 17..30'
perl -ne 'print if grep { $_ == $. } 17..30'

# Print the longest line
perl -ne '$l = $_ if length($_) > length($l); END { print $l }'

# Print the shortest line
perl -ne '$s = $_ if $. == 1; $s = $_ if length($_) < length($s); END { print $s }'

# Print all lines that contain a number
perl -ne 'print if /\d/'

# Print all lines that contain only a number
perl -ne 'print if /^\d+$/'

# Print all lines that contain only alphabetic characters
perl -ne 'print if /^[[:alpha:]]+$/'

# Print every second line, starting with the first line
perl -ne 'print if $. % 2'

# Print every second line, starting with the second line
perl -ne 'print if $. % 2 == 0'

# Print all lines that repeat
perl -ne 'print if ++$a{$_} == 2'

# Print all unique lines
perl -ne 'print unless $a{$_}++'

# Print the first field (word) of every line (emulate cut -f 1 -d ' ')
perl -alne 'print $F[0]'

HANDY REGULAR EXPRESSIONS
-------------------------

# Match something that looks like an IP address
/^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/
/^(\d{1,3}\.){3}\d{1,3}$/

# Test if a number is in range 0-255
/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/

# Match an IP address
# The pattern contains | alternations, so use / rather than | as the qr delimiter
my $ip_part = qr/([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])/;
if ($ip =~ /^($ip_part\.){3}$ip_part$/) {
    print "valid ip\n";
}
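
The difference between the loose "looks like an IP address" patterns and the range-checked one is easiest to see by running both against a few strings. The sketch below is only an illustration. It applies the same patterns with grep -E instead of Perl (so \d is written as [0-9]), and the function names looks_like_ip and is_valid_ip are hypothetical.

```shell
# Octet pattern from the text above: matches 0-255.
octet='([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])'

# Loose check: four dot-separated groups of 1-3 digits (accepts 999.1.1.1).
looks_like_ip() {
    printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Strict check: every group must also fall in the range 0-255.
is_valid_ip() {
    printf '%s\n' "$1" | grep -Eq "^(${octet}\.){3}${octet}\$"
}

# Classify a few sample strings (made up for illustration).
for ip in 192.168.1.1 999.1.1.1 1.2.3; do
    if is_valid_ip "$ip"; then
        verdict="valid"
    elif looks_like_ip "$ip"; then
        verdict="looks like an IP, but out of range"
    else
        verdict="not an IP"
    fi
    printf '%s\t%s\n' "$ip" "$verdict"
done
```

The loose pattern is handy for quickly pulling candidates out of a log. The strict one is the better choice when the value will actually be used as an address.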

# Check if the string looks like an email address
/\S+@\S+\.\S+/

# Check if the string is a decimal number
/^\d+$/
/^[+-]?\d+$/
/^[+-]?\d+\.?\d*$/

# Check if the string is a hexadecimal number
/^0x[0-9a-f]+$/i

# Check if the string is an octal number
/^0[0-7]+$/

# Check if the string is binary
/^[01]+$/

# Check if a word appears twice in the string
/(word).*\1/

# Increase all numbers by one in the string
$str =~ s/(\d+)/$1+1/ge

# Extract HTTP User-Agent string from the HTTP headers
/^User-Agent: (.+)$/

# Match printable ASCII characters
/[ -~]/

# Match unprintable ASCII characters
/[^ -~]/

# Match text between two HTML tags
m|<strong>([^<]*)</strong>|
m|<strong>(.*?)</strong>|

# Replace all <b> tags with <strong>
$html =~ s|<(/)?b>|<$1strong>|g

# Extract all matches from a regular expression
my @matches = $text =~ /regex/g;

PERL TRICKS
-----------

# Print the version of a Perl module
perl -MModule -le 'print $Module::VERSION'
perl -MLWP::UserAgent -le 'print $LWP::UserAgent::VERSION'

Converting FASTQ to FASTA

Neel — Fri, 12 Jan 2018 03:49:09 -0600

There are several ways to convert FASTQ sequences to FASTA. Some methods are listed below.
Using SED
sed can selectively print lines from a file, so if you print the first and second lines of every four, you get the sequence header and the sequence needed for FASTA format. The first~step address form used here is a GNU sed extension.
sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
Using PASTE
You can linearize every 4 lines into one tab-separated row with paste, then print the first and second fields:
cat INFILE.fastq | paste - - - - | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > OUTFILE.fasta
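
To make the paste approach concrete, here is a minimal sketch run on a hypothetical two-read FASTQ file. The read names, the sequences, and the function name fastq_to_fasta_paste are made up for illustration. paste - - - - joins each group of four lines into one tab-separated row, so the header and the sequence become fields 1 and 2.

```shell
# Hypothetical two-read FASTQ, written to a temp file for illustration only.
fq=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n' > "$fq"

# Join each 4-line record into one row, keep header and sequence,
# turn the leading @ into >, then put the two fields back on separate lines.
fastq_to_fasta_paste() {
    paste - - - - < "$1" | cut -f 1,2 | sed 's/^@/>/' | tr '\t' '\n'
}

fastq_to_fasta_paste "$fq"
```

Anchoring the substitution with ^ matters: an unanchored s/@/>/g would also rewrite any @ that appears later in a read header.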
EMBOSS:seqret
seqret is a general-purpose EMBOSS program for reading and writing sequences. One of its uses is FASTQ to FASTA conversion:
seqret -sequence reads.fastq -outseq reads.fasta
Using AWK
awk can be used for the conversion as follows:
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
FASTX-toolkit
fastq_to_fasta, part of the FASTX-toolkit, scales well to very large datasets.
fastq_to_fasta -h
usage: fastq_to_fasta [-h] [-r] [-n] [-v] [-z] [-i INFILE] [-o OUTFILE]
# Remember to use -Q33 for illumina reads!
version 0.0.6
   [-h]         = This helpful help screen.
   [-r]         = Rename sequence identifiers to numbers.
   [-n]         = keep sequences with unknown (N) nucleotides.
                  Default is to discard such sequences.
   [-v]         = Verbose - report number of sequences.
                  If [-o] is specified, report will be printed to STDOUT.
                  If [-o] is not specified (and output goes to STDOUT),
                  report will be printed to STDERR.
   [-z]         = Compress output with GZIP.
   [-i INFILE]  = FASTA/Q input file. default is STDIN.
   [-o OUTFILE] = FASTA output file. default is STDOUT.
Bioawk
Another option is to convert FASTQ to FASTA with bioawk:
bioawk -c fastx '{print ">"$name"\n"$seq}' input.fastq > output.fasta
Seqtk
From the same developer, there is another option using a tool called seqtk
seqtk seq -a input.fastq > output.fasta
Note that you can use either compressed or uncompressed files for this tool

BBSplit: Read Binning Tool for Metagenomes and Contaminated Libraries

Poonam Mahapatra — Wed, 03 Jan 2018 00:25:27 -0600

BBSplit internally uses BBMap to map reads to multiple genomes at once, and determine which genome they match best. This is different than with ordinary mapping. If a genome (say, human) contains an exact repeat somewhere, reads mapping to it will be mapped ambiguously. But if you want to determine whether reads are mouse or human, it does not matter whether they map ambiguously within human, only whether they are ambiguous between human and mouse. BBSplit tracks this additional ambiguity information and decides how to use it based on the “ambig2” flag. The normal use of BBSplit is like Seal, either quantifying how many reads go to each reference, or splitting the reads into multiple output files, one per reference. BBSplit can only be run using references indexed with BBSplit, as they contain additional information regarding which sequences came from which reference file.
BBSplit is a tool that bins reads by mapping to multiple references simultaneously, using BBMap. The reads go to the bin of the reference they map to best. There are also disambiguation options, such that reads that map to multiple references can be binned with all of them, none of them, one of them, or put in a special "ambiguous" file for each of them. Paired reads will always be kept together.

For example, if you had a library that was contaminated with E. coli and Salmonella, you could do this:

bbsplit.sh in=reads.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu=clean.fq int=t

This will produce 3 output files:
out_ecoli.fq (ecoli reads)
out_salmonella.fq (salmonella reads)
clean.fq (unmapped reads)

In this case, "int=t" means that the input file is paired and interleaved. For single-end reads you would leave that out. For paired reads in 2 files, you would do this:
bbsplit.sh in1=reads1.fq in2=reads2.fq ref=ecoli.fa,salmonella.fa basename=out_%.fq outu1=clean1.fq outu2=clean2.fq
BBSplit is available here:
https://sourceforge.net/projects/bbmap/
The sensitivity can be raised to be equivalent to BBMap with these flags: "minratio=0.56 minhits=1 maxindel=16000"
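
When BBSplit is used to quantify how many reads go to each reference, one simple way to get the numbers from the split files is to count FASTQ records. The sketch below assumes uncompressed FASTQ with exactly four lines per read and the output file names from the example above. count_fastq_reads is a hypothetical helper, not part of BBTools.

```shell
# Count reads in an uncompressed FASTQ file, assuming 4 lines per record.
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Report a read count for each output file from the example, if it exists.
for f in out_ecoli.fq out_salmonella.fq clean.fq; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_fastq_reads "$f")"
    fi
done
```

For interleaved paired-end output, each pair contributes two records, so divide the count by two to get the number of pairs.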

Bioinformatics Web Application Development with Perl

Jit — Tue, 26 Dec 2017 18:14:11 -0600

Perl's second wave of adoption came from the growth of the world wide web. Dynamic web pages—the precursor to modern web applications—were easy to create with Perl and CGI. Thanks to Perl's ubiquity as a language for system administrators and its power to manipulate text, it was the default choice for web programming. Its presence everywhere made it popular and, in some ways, the duct tape of the Internet.
Web Application Development
The old days of CGI programs, and the simple development style they represented, now seem clunky. Web pages have become web applications. Development has moved from generating static HTML to both client-side and server-side programming, with rich client interfaces and powerful backends.
Perl is still well suited for developing modern web apps. The language grows more powerful and easier to use every year, the available libraries are wonderful and keep getting better, and the inventions and discoveries available in modern Perl are unsurpassed.
In particular, a modern Perl developer can do amazing things with modern Perl tools. If you still think of Perl web development as a cgi-bin directory full of messy scripts that spew warnings to STDERR, you're a decade out of date. Better yet, you can replace that mess piecemeal, thanks to the new tools and techniques of modern Perl. See, for example, the ever-growing list of technologies Built in Perl.
Modern Perl Web Frameworks
While the old wave of web development may have made the CGI.pm module central, modern Perl web programming follows a stricter separation of business logic, URL and request routing, and output. The days of slinging a string here, an array there, a Perl hash yonder, declaring every variable at the top of the program, and maybe making a subroutine are gone. The Perl world has seen the value of abstraction and ways to mechanize away boilerplate. Perl has dozens of frameworks and toolkits designed to make web development and deployment simpler.
Any of a dozen of these frameworks will help you do great things, but three in particular stand out. You can build web sites and web applications of tremendous value with all three. These are neither the only good possibilities (think of POE or Jifty or Continuity or...) nor the only mechanisms for web programming with Perl (see Mechanize or LWP or Mojo::UserAgent for more). Yet if you want three good options to choose between, start here.
Catalyst
The Catalyst framework is a flexible and powerful system for building small to large web apps. It uses the Moose object system to provide great APIs for extension and further development. It's the most mature of the modern top Perl web frameworks, yet it retains its flexibility and vibrancy. In particular, its plugin and extension ecosystem allows it to evolve to provide new and essential features.
Catalyst has embraced the Plack/PSGI standard for Perl web deployment and recent versions are exploring high-scalability, event-based request handling models.
Dancer
The Dancer framework is deliberately minimal in syntax and scope, but it also has a vibrant plugin ecosystem. Dancer particularly excels for smaller sites and applications, though good programmers can build larger things with it.
The first version of Dancer was easy to use. Dancer 2 continues that ease while improving the internals and robustness of applications.
Mojolicious
The Mojolicious (Mojo) framework has a real-time design based on high performance event handling. Its focus is solving new and interesting problems in simple and effective ways, and the project has produced a lot of new code that does old things in better ways.
In particular, Mojolicious goes to great lengths to support new web standards, such as CSS3, WebSockets, and HTTP/2.
Where Catalyst embraces the CPAN fully, Mojolicious by design provides most of what an average app might need in a single download. It's still fully compatible with the CPAN, but the intention is to provide good working defaults in a package that's easy to start with. Mojo's fans are quick to praise it as fun to develop.
A modern Perl web developer should be familiar with at least one of these frameworks.
Modern Perl Storage Mechanisms
Perl's venerable DBI module has been the focal point of database access since its invention. Its design allows it to provide the same interface to huge relational databases and flat files alike through its DBD extension mechanism. Yet the DBI by itself isn't the be-all, end-all of data storage and access in Perl.
DBIx::Class
DBIx::Class sits on top of DBI to provide an API to your database based on the concept of queries and results. This is often sufficient to remove all but the most complicated of SQL from your code, leaving you to manipulate your business models instead of the small details of how a relational database works. The power and maintainability you receive is well worth the small cost of the learning curve.
Even better, DBIC can manage (and even generate) your database schema for you.
Recent versions of DBIC have demonstrated that a well-written ORM can perform much better than even clever hand-written code. Because it builds on the Perl DBI, it scales everywhere from SQLite to PostgreSQL, MySQL, Oracle, and more.
Rose::DB
The lesser-known but no less powerful Rose::DB::Object builds on Rose::DB to provide an object-relational mapper for Perl. While its high level features most directly compare to those of DBIx::Class, it's often measurably faster.
NoSQL on the CPAN
Of course the CPAN has modules for almost any NoSQL database or job queue or persistence mechanism you could name, and several you have never heard of. Everything you need is a quick cpan or cpanm away!
Modern Perl Deployment Strategies
In the early days of the web, deploying a Perl web application meant putting one or more .cgi or .pl files in a special directory and hoping that your system administrator had everything configured correctly. The execution model was often slow and cumbersome, and accessing shared resources such as databases was often tricky.
Modern Perl has better choices. While deployment strategies are the source of many arguments, the return on your investment from learning the modern way is impressive.
Plack/PSGI
The PSGI specification (as exemplified by Plack) describes a strategy for building Perl web apps independent of server and with the possibility to share custom processing behaviors.
In other words, it's a standard for writing Perl apps to take advantage of the huge ecosystem of Perl development available on the CPAN without tying yourself to a server like Apache, Apache 2, nginx, or anything else.
Any good modern Perl web framework (including those listed here) supports PSGI. Several deployment mechanisms exist to meet various business needs which also support PSGI. In particular, you can deploy the same application with a local testing server on your own machine as you can to your production server or servers without changing your application at all.
mod_perl
The older but still viable mod_perl Apache httpd module embeds Perl into the web server. This was the first widespread persistence mechanism for Perl web applications themselves and it's still popular to this day, though PSGI compliance is often the choice for new development. (PSGI handlers to use mod_perl as the backend are available.)
Modern Perl developers should familiarize themselves with PSGI and the wealth of available Plack middleware.
Perl Web Development
Of course no discussion of Perl web development would be complete without mentioning the strength of the CPAN. Almost any project will benefit from the wealth of freely available libraries built to solve real problems. These distributions run the gamut from full-blown web frameworks and content management systems to APIs for web services, development tools, testing systems, and interfaces to document formats and external resources.
For example, if you need to write a web service which accepts JSON data and produces Excel spreadsheets, you can glue together a few CPAN distributions and get the job done early. If you need to consume XML from a remote service and emit a PDF, you're in luck.
Perl's prowess as a general purpose programming language as well as its flexibility and power in managing text and gluing systems together make it a wonderful fit for web development. The community's adoption of modern Perl standards such as PSGI and Plack only enhance your power.
Web application development in Perl is still viable, and modern Perl tools and techniques and libraries make it more powerful and pleasant than ever.

Run the miniasm assembler on nanopore reads!

Jit — Mon, 18 Dec 2017 04:07:50 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.
Find the details of repeats within the reads:
fq2fa ONT_A.fastq ONT_A.fasta

minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf

awk '{if($1==$6){print}}' AONT.paf > AONTself.paf

awk '$5=="-"' AONTself.paf | awk '{print $1}'| sort|uniq > invertedrepeat.list
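
The two awk filters above rely on the PAF column layout: column 1 is the query name, column 5 the relative strand, and column 6 the target name. As a minimal sketch, the same selection can be tried on a few hypothetical PAF rows. All values below are made up, and inverted_repeat_reads is an illustrative name.

```shell
# Hypothetical PAF rows (tab-separated, first 12 columns); values are made up.
paf=$(mktemp)
printf 'readA\t5000\t100\t900\t-\treadA\t5000\t4000\t4800\t700\t800\t60\n' >  "$paf"
printf 'readA\t5000\t100\t900\t+\treadB\t6000\t200\t1000\t750\t800\t60\n'  >> "$paf"
printf 'readC\t7000\t0\t500\t+\treadC\t7000\t3000\t3500\t480\t500\t60\n'   >> "$paf"

# Names of reads that map to themselves on the reverse strand,
# i.e. candidate inverted repeats (palindromes) within a single read.
inverted_repeat_reads() {
    awk -F '\t' '$1 == $6 && $5 == "-" { print $1 }' "$1" | sort -u
}

inverted_repeat_reads "$paf"
```

Here only readA is reported: the readA/readB row is not a self-hit, and the readC self-hit is on the forward strand, which indicates a direct repeat rather than an inverted one.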
Generate a few palindrome and repeat plots (showing only repeats longer than 10, 20 and 30 kb):
minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps

minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps

minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps
Assemble with miniasm:
miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta

minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf

awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf

minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps
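
The grep/awk step that turns the assembly graph into FASTA works because each GFA segment (S) line carries the unitig name in column 2 and its sequence in column 3. Below is a minimal sketch on a hypothetical two-line GFA. The unitig name, the sequence, and the function name gfa_to_fasta are made up for illustration.

```shell
# Hypothetical miniasm-style GFA: one segment (S) line and one read-layout (a) line.
gfa=$(mktemp)
printf 'S\tutg000001l\tACGTACGT\tLN:i:8\n'   >  "$gfa"
printf 'a\tutg000001l\t0\tread1:1-8\t+\t8\n' >> "$gfa"

# Keep only S lines; column 2 is the segment name, column 3 its sequence.
gfa_to_fasta() {
    awk -F '\t' '$1 == "S" { print ">" $2; print $3 }' "$1"
}

gfa_to_fasta "$gfa"
```

Testing $1 == "S" in awk is slightly stricter than grep '^S', which would also keep any other record whose first field merely begins with S.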
Enjoy the assembly!