BOL: Related items

Check Linux server configuration !!

Rahul Nayak — Tue, 06 May 2014 01:10:57 -0500

Bioinformaticians rely on servers for computational analysis, and it is often necessary to check a server's details before running programs or tools on it. The basic commands below gather the key system and server information.

To check which version of the operating system is installed on the server, you can use any of the following commands:-
=================================================================
1.cat /etc/issue
[root@localhost ~]# cat /etc/issue
Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Kernel \r on an \m

2.cat /etc/redhat-release
[root@localhost ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.5 (Tikanga)

3.lsb_release -a
[root@localhost ~]# lsb_release -a
LSB Version:    :core-3.1-ia32:core-3.1-noarch:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 5.5 (Tikanga)
Release:        5.5
Codename:       Tikanga
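
On more recent distributions the same information is usually also available in /etc/os-release, a shell-style KEY=value file. This is an assumption about your server; the RHEL 5.5 host shown above predates that file. A minimal sketch, where the function name os_pretty_name is made up for illustration:

```shell
# Print the human-readable OS name from an os-release style file.
# os_pretty_name is a hypothetical helper; the path defaults to /etc/os-release.
os_pretty_name() {
    local file="${1:-/etc/os-release}"
    # Source the file in a subshell so its variables do not leak into this shell.
    ( . "$file" && printf '%s\n' "${PRETTY_NAME:-unknown}" )
}

if [ -r /etc/os-release ]; then
    os_pretty_name
else
    echo "/etc/os-release not found; use /etc/issue or /etc/redhat-release instead"
fi
```

Passing a path as the first argument lets you inspect a copy of the file taken from another machine.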

To check whether the operating system is 32 or 64bit:-
================================
# uname -i
[root@localhost ~]# uname -i
i386
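(i386 means the server is running a 32-bit operating system)

[root@localhost ~]# uname -i
x86_64
(x86_64 means the server is running a 64-bit operating system. If uname -i prints "unknown", which happens on some distributions, uname -m reports the machine architecture instead.)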
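
The mapping from architecture string to word size can be sketched as a small shell function. The helper name arch_bits and the list of architecture names are illustrative assumptions, not an exhaustive table:

```shell
# Classify a machine string (as printed by `uname -m`) as 32-bit or 64-bit.
# arch_bits is a hypothetical helper; extend the patterns for other hardware.
arch_bits() {
    case "$1" in
        i386|i486|i586|i686)          echo "32-bit" ;;
        x86_64|amd64|aarch64|ppc64le) echo "64-bit" ;;
        *)                            echo "unknown" ;;
    esac
}

# Classify the machine this script is running on.
arch_bits "$(uname -m)"

# Where getconf is installed, LONG_BIT gives the word size (32 or 64) directly.
if command -v getconf >/dev/null 2>&1; then
    getconf LONG_BIT
fi
```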

To see the processor/CPU information:-
=============================
# cat /proc/cpuinfo
[root@localhost ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Intel(R) Xeon(R) CPU            5130 @ 2.00GHz
stepping        : 6
cpu MHz         : 1995.087
cache size      : 4096 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 10
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc pni monitor ds_cpl vmx tm2 ssse3 cx16 xtpr lahf_lm
bogomips        : 3990.17
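(The processor field numbers the logical CPUs starting from zero, so the number of such blocks equals the number of logical CPUs. An output that ends at processor 0 means a single logical CPU; on a multi-core machine the same block repeats for processor 1, 2, and so on.)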
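
Counting the processor entries gives the number of logical CPUs without reading the whole listing. A minimal sketch, assuming the usual x86 layout of /proc/cpuinfo; count_logical_cpus is a hypothetical helper name:

```shell
# Count the "processor : N" entries in a cpuinfo-style file.
# count_logical_cpus is a hypothetical helper; the path defaults to /proc/cpuinfo.
count_logical_cpus() {
    local file="${1:-/proc/cpuinfo}"
    grep -c '^processor' "$file"
}

if [ -r /proc/cpuinfo ]; then
    count_logical_cpus
fi
```

Where GNU coreutils is installed, nproc prints the number of processing units available to the current process, which is normally the same figure.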

To check memory information:-
===========================
# free -m
[root@localhost ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          5066       3513       1552          0        612       2319
-/+ buffers/cache:        582       4484
Swap:         1983          0       1983

# cat /proc/meminfo
[root@localhost ~]# cat /proc/meminfo
MemTotal:      5187752 kB
MemFree:       1639300 kB
Buffers:        627024 kB
Cached:        2374944 kB
SwapCached:          0 kB
Active:        2458788 kB
Inactive:       920964 kB
HighTotal:     4325164 kB
HighFree:      1561936 kB
LowTotal:       862588 kB
LowFree:         77364 kB
SwapTotal:     2031608 kB
SwapFree:      2031608 kB
Dirty:             704 kB
Writeback:           0 kB
AnonPages:      377892 kB
Mapped:          35328 kB
Slab:           153036 kB
PageTables:       6316 kB
NFS_Unstable:        0 kB
Bounce:              0 kB
CommitLimit:   4625484 kB
Committed_AS:   977132 kB
VmallocTotal:   116728 kB
VmallocUsed:      4492 kB
VmallocChunk:   112124 kB
HugePages_Total:     0
HugePages_Free:      0
HugePages_Rsvd:      0
Hugepagesize:     2048 kB
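
The two views agree: free -m reports the MemTotal figure from /proc/meminfo converted from kB to MB. The conversion can be sketched as follows, with mem_total_mb as a hypothetical helper name:

```shell
# Convert the MemTotal line of a meminfo-style file from kB to whole MB.
# mem_total_mb is a hypothetical helper; the path defaults to /proc/meminfo.
mem_total_mb() {
    local file="${1:-/proc/meminfo}"
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }' "$file"
}

if [ -r /proc/meminfo ]; then
    mem_total_mb
fi
```

For the sample above, 5187752 kB divided by 1024 gives 5066 MB, the same total that free -m shows.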

To check the model and serial number of the server:-
=======================================
[root@localhost ~]# dmidecode | egrep -i "product name|Serial number"
Product Name: PowerEdge R710
Serial Number: AB8CDE1


To check the host name:-
=====================
[root@localhost ~]# uname -n
localhost

[root@localhost ~]# hostname
localhost

To check the kernel version:-
========================
[root@localhost ~]# uname -r
2.6.18-238.9.1.el5PAE

Bioinformatics WalkIn at NII

Fri, 04 Sep 2015 21:48:15 -0500

ADVERTISEMENT OF WALK-IN-INTERVIEW

NAME OF THE POST : Bioinformatician (Part time 3 days in a week) (One Position only)

DURATION : One Year

NAME OF THE PROJECT : Next generation sequencing facility

EDUCATIONAL QUALIFICATIONS : At least a Masters degree in Bioinformatics and Bachelors degree in any stream of life sciences

REQUIREMENTS :

Around 5 years of experience and a proven track record in next generation sequence data analysis (supported by publications in peer-reviewed journals), with the ability to analyze transcriptomics, ChIP-seq, and small RNA-seq data.

Should have the ability to analyze raw primary data generated by Illumina next generation sequencing platforms and create / troubleshoot custom analysis pipelines.

Should have the ability to handle all downstream secondary and tertiary data analysis using commercially available as well as open source software (transcriptomics, ChIP-seq, small RNA-seq).

Apart from these, the applicant should have knowledge of the following:

Programming: Perl and Python
Operating system: Linux and Windows
NGS Analysis tools: Maq, BWA, Bowtie, SAM tools, BEDTools, MACS, Galaxy, FastQC, Bismark, MEDIPS, Tophat, Cufflinks, AvadisNGS, CLC Genomics Workbench, BaseSpace, Trinity
Statistics: Microsoft Excel and R
Database: MySQL
Genome Browser: UCSC, Ensembl, IGV, IGB
Motif Analysis Tools: MEME Suite, Transfac and RSAT
Functional Annotation Tools: DAVID, GeneCodis, Gene Cards
Networking Tools: Cytoscape

EMOLUMENTS : The incumbent will be paid a fee of Rs. 2000/- per sitting/ per day.

SCIENTIST NAME : Dr. Arnab Mukhopadhyay,

Staff Scientist V, Next generation sequencing facility

SCIENTIST’S E-MAIL ID : arnab@nii.ac.in

WALK IN INTERVIEW ON : 18th September, 2015

REGISTRATION OF CANDIDATES: 10.30 AM to 11.00 AM

PLEASE NOTE-
1. CANDIDATE MAY FILL UP APPLICATION IN THE PRESCRIBED FORMAT ALONG WITH NECESSARY DOCUMENTS FOR VERIFICATION.
2. APPLICATIONS CONTAINING INCOMPLETE INFORMATION SHALL NOT BE ENTERTAINED.
3. DATE OF PASSING THE EXAMINATIONS MUST BE INDICATED CLEARLY.
4. ONLY REGISTERED CANDIDATES WILL BE INTERVIEWED.
5. NO TA/DA WILL BE PAID FOR ATTENDING THE INTERVIEW

PRESCRIBED FORM
1. NAME
2. FATHER’S NAME
3. MOTHER’S NAME
4. DATE OF BIRTH
5. SEX (MALE/FEMALE)
6. CATEGORY (SC/ ST/ OBC/ PH)
7. ADDRESS a. (CORRESPONDENCE) b. (PERMANENT)
8. E MAIL, TELEPHONE NO. & MOBILE No (if any)
9. ACADEMIC & PROFESSIONAL QUALIFICATIONS (columns: NAME OF EXAMINATION PASSED WITH SUBJECTS | YEAR OF PASSING | BOARD/ UNIVERSITY | PERCENTAGE/ DIVISION | REMARKS)
10. PAST EXPERIENCE & PRESENT EMPLOYMENT, IF ANY
11. CANDIDATES SHOULD STATE CLEARLY WHETHER THEY HAVE BEEN AWARDED PH.D DEGREE OR THESIS HAS BEEN SUBMITTED.
12. HAVE YOU APPLIED FOR A POSITION EARLIER IN THE INSTITUTE? IF SO:- (1) THE DETAILS OF THE PROJECT AND PROJECT INVESTIGATOR (2) IF CALLED FOR INTERVIEW, RESULTS THEREOF

More at http://www1.nii.res.in/sites/default/files/walkininterview-18sept2015.pdf

Linux Commands Cheat Sheet for Bioinformatics and Computational Biology Professionals

Rahul Nayak — Mon, 05 Feb 2018 18:50:41 -0600

The purpose of this cheat sheet is to introduce biologists and bioinformaticians to the tools most frequently used for NGS analysis and to give them practice in writing one-liners.

File System
ls — list items in current directory
ls -l — list items in current directory and show in long format to see permissions, size, and modification date
ls -a — list all items in current directory, including hidden files
ls -F — list all items in current directory and show directories with a slash and executables with a star
ls dir — list all items in directory dir
cd dir — change directory to dir
cd .. — go up one directory
cd / — go to the root directory
cd ~ — go to your home directory
cd - — go to the last directory you were just in
pwd — show present working directory
mkdir dir — make directory dir
rm file — remove file
rm -r dir — remove directory dir recursively
cp file1 file2 — copy file1 to file2
cp -r dir1 dir2 — copy directory dir1 to dir2 recursively
mv file1 file2 — move (rename) file1 to file2
ln -s file link — create symbolic link to file
touch file — create or update file
cat file — output the contents of file
less file — view file with page navigation
head file — output the first 10 lines of file
tail file — output the last 10 lines of file
tail -f file — output the contents of file as it grows, starting with the last 10 lines
vim file — edit file
alias name='command' — create an alias for a command (bash syntax; no spaces around the = sign)
System
shutdown — shut down machine
reboot — restart machine
date — show the current date and time
whoami — who you are logged in as
finger user — display information about user
man command — show the manual for command
df — show disk usage
du — show directory space usage
free — show memory and swap usage
whereis app — show possible locations of app
which app — show which app will be run by default
Process Management
ps — display your currently active processes
top — display all running processes
kill pid — kill process id pid
kill -9 pid — force kill process id pid
Permissions
ls -l — list items in current directory and show permissions
chmod ugo file — change permissions of file to ugo - u is the user's permissions, g is the group's permissions, and o is everyone else's permissions. The values of u, g, and o can be any number between 0 and 7.
7 — full permissions
6 — read and write only
5 — read and execute only
4 — read only
3 — write and execute only
2 — write only
1 — execute only
0 — no permissions
chmod 600 file — you can read and write - good for files
chmod 700 file — you can read, write, and execute - good for scripts
chmod 644 file — you can read and write, and everyone else can only read - good for web pages
chmod 755 file — you can read, write, and execute, and everyone else can read and execute - good for programs that you want to share
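
Each digit in the table above is the sum of 4 (read), 2 (write) and 1 (execute). The following sketch decodes a numeric mode into the rwx string that ls -l displays; digit_to_rwx and mode_to_rwx are hypothetical helper names used only for illustration:

```shell
# Decode one octal digit (0-7) into its rwx triplet using bit tests.
digit_to_rwx() {
    local out=""
    if [ $(( $1 & 4 )) -ne 0 ]; then out="${out}r"; else out="${out}-"; fi
    if [ $(( $1 & 2 )) -ne 0 ]; then out="${out}w"; else out="${out}-"; fi
    if [ $(( $1 & 1 )) -ne 0 ]; then out="${out}x"; else out="${out}-"; fi
    printf '%s' "$out"
}

# Decode a numeric mode such as 755, one digit at a time (user, group, others).
mode_to_rwx() {
    local result=""
    local d
    for d in $(printf '%s' "$1" | sed 's/./& /g'); do
        result="${result}$(digit_to_rwx "$d")"
    done
    printf '%s\n' "$result"
}

mode_to_rwx 644   # prints rw-r--r--
mode_to_rwx 755   # prints rwxr-xr-x
```
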
Networking
wget file — download a file
curl file — download a file
scp user@host:file dir — secure copy a file from remote server to the dir directory on your machine
scp file user@host:dir — secure copy a file from your machine to the dir directory on a remote server
scp -r user@host:dir dir — secure copy the directory dir from remote server to the directory dir on your machine
ssh user@host — connect to host as user
ssh -p port user@host — connect to host on port as user
ssh-copy-id user@host — add your key to host for user to enable a keyed or passwordless login
ping host — ping host and output results
whois domain — get information for domain
dig domain — get DNS information for domain
dig -x host — reverse lookup host
lsof -i tcp:1337 — list all processes running on port 1337
Searching
grep pattern files — search for pattern in files
grep -r pattern dir — search recursively for pattern in dir
grep -rn pattern dir — search recursively for pattern in dir and show the line number found
grep -r pattern dir --include='*.ext' — search recursively for pattern in dir and only search in files with .ext extension
command | grep pattern — search for pattern in the output of command
find dir -name file — find all instances of file under dir by walking the real file system
locate file — find all instances of file using the indexed database built by the updatedb command. Much faster than find, but only as current as the last updatedb run
sed -i 's/day/night/g' file — find all occurrences of day in a file and replace them with night - s means substitute, g means global and -i edits the file in place - sed also supports regular expressions
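
A short session that exercises the search commands on throwaway files. It is a sketch that assumes GNU grep (for --include); the directory and file names are made up for the demonstration:

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
printf 'day one\nnight two\nday three\n' > "$workdir/notes.txt"
printf 'day in another file\n' > "$workdir/skip.log"

# Recursive search with line numbers, restricted to *.txt files.
grep -rn --include='*.txt' 'day' "$workdir"

# Replace every "day" with "night"; writing to a new file keeps the original intact.
sed 's/day/night/g' "$workdir/notes.txt" > "$workdir/notes_night.txt"
cat "$workdir/notes_night.txt"
```

Leaving out -i and redirecting to a new file is a cautious way to check a substitution before applying it in place.
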
Compression
tar cf file.tar files — create a tar named file.tar containing files
tar xf file.tar — extract the files from file.tar
tar czf file.tar.gz files — create a tar with Gzip compression
tar xzf file.tar.gz — extract a tar using Gzip
gzip file — compresses file and renames it to file.gz
gzip -d file.gz — decompresses file.gz back to file
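
A round trip through tar and gzip, sketched on a throwaway directory. The paths are illustrative; the -C option (change directory) and t (list contents) are standard tar options that do not appear in the list above:

```shell
# Build a small directory to archive.
workdir=$(mktemp -d)
mkdir "$workdir/reads"
printf '>seq1\nACGT\n' > "$workdir/reads/sample.fa"

# Create a gzip-compressed archive; -C keeps the stored paths relative to $workdir.
tar czf "$workdir/reads.tar.gz" -C "$workdir" reads

# List the archive contents without extracting.
tar tzf "$workdir/reads.tar.gz"

# Extract into a separate directory and compare with the original file.
mkdir "$workdir/restored"
tar xzf "$workdir/reads.tar.gz" -C "$workdir/restored"
cmp "$workdir/reads/sample.fa" "$workdir/restored/reads/sample.fa" && echo "identical"
```
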
Shortcuts
ctrl+a — move cursor to beginning of line
ctrl+e — move cursor to end of line
alt+f — move cursor forward 1 word
alt+b — move cursor backward 1 word

Next generation sequencing in R or bioconductor environment

John Parker — Mon, 02 Jun 2014 18:03:09 -0500

Many R and Bioconductor packages are available for NGS data analysis; some of them are described below.

Biostrings

The Biostrings package from Bioconductor provides an advanced environment for efficient sequence management and analysis in R. It contains many speed and memory effective string containers, string matching algorithms, and other utilities, for fast manipulation of large sets of biological sequences. The objects and functions provided by Biostrings form the basis for many other sequence analysis packages. Documentation

IRanges Overview

IRanges provides the low-level infrastructure and containers for handling sets of integer ranges within Bioconductor's BioC-Seq domain. Its classes and methods provide support for many more high-level packages like GenomicRanges, ShortRead, Rsamtools, etc. Documentation

GenomicRanges Overview

The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project. It is built upon the IRanges infrastructure and defines three major data containers - GRanges, GRangesList and GappedAlignments - which are supporting other important BioC-Seq packages including ShortRead, Rsamtools, rtracklayer, GenomicFeatures and BSgenome. Compared to the IRanges container, the GRanges/GRangesList classes are more flexible and extensible to store additional information about sequence ranges, such as chromosome identifiers (sequence space), strand information and annotation data. Documentation

Motif Discovery

cosmo

The cosmo package searches a set of unaligned DNA sequences for a shared motif that may function as a transcription factor binding site. The algorithm extends the popular motif discovery tool MEME (Bailey and Elkan, 1995) in that it allows the search to be supervised by specifying a set of constraints that the motif to be discovered must satisfy. Documentation

BCRANK

BCRANK is a method that takes a ranked list of genomic regions as input and outputs short DNA sequences that are overrepresented in some part of the list. The algorithm was developed for detecting transcription factor (TF) binding sites in a large number of enriched regions from high-throughput ChIP-chip or ChIP-seq experiments, but it can be applied to any ranked list of DNA sequences. Documentation

rGADEM: Documentation

MotIV: Documentation

ShortRead

The ShortRead package provides input, quality control, filtering, parsing, and manipulation functionality for short read sequences produced by high throughput sequencing technologies. While support is provided for many sequencing technologies, this package is primarily focused on Solexa/Illumina reads. Documentation

Rsamtools

Rsamtools provides functions for parsing and inspecting samtools BAM formatted binary alignment data. SAM/BAM is quickly becoming a universal standard alignment format, and is now supported by a wide variety of alignment tools. Documentation

Samtools Website
BWA (Burrows-Wheeler Alignment) Website

Additional tools for SNP analysis:

snpMatrix

BSgenome

BSgenome provides an object oriented infrastructure for interacting with a Biostring based genome sequence. BSgenome packages exist for many common genomes, and can be created to represent custom genomes. See the "How to forge a BSgenome data package" Vignette for instructions to create a new BSgenome package if a prebuilt package does not exist for your organism. Documentation

rtracklayer

rtracklayer provides an interface for exporting annotation feature data to various genome browsers and file formats (such as GFF). See the Small RNA Profiling exercise for an example of using rtracklayer to visualize alignment coverage. Documentation

biomaRt

The biomaRt package provides an interface to a growing collection of databases implementing the BioMart software suite (http://www.biomart.org). The package enables online retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas. This data is retrieved automatically via the Internet, so it's recommended that you cache the data locally, or check versions if your code will be adversely affected by updates to these data. Documentation

ChIP-Seq Analysis Packages

Bioconductor provides various packages for analyzing and visualizing ChIP-Seq data. Only a small selection of these packages is introduced here. Additional useful introductions to this topic are: BioC ChIP-seq Case Study and BioC ChIP-Seq.

chipseq

The chipseq package combines a variety of HT-Seq packages to a pipeline for ChIP-Seq data analysis. Documentation

BayesPeak

BayesPeak is a peak calling package for identifying DNA binding sites of proteins in ChIP-Seq experiments. Its algorithm uses hidden Markov models (HMM) and Bayesian statistical methods. The following sample code introduces the identification of peaks with the BayesPeak package as well as the incorporation of read coverage information obtained by the chipseq package. Documentation [ Publication ]

PICS

The PICS package applies probabilistic inference to aligned-read ChIP-Seq data in order to identify regions bound by transcription factors. PICS identifies enriched regions by modeling local concentrations of directional reads, and uses DNA fragment length prior information to discriminate closely adjacent binding events via a Bayesian hierarchical t-mixture model. The following sample code uses the test data set from the above BayesPeak package in order to compare the results from both methods by identifying their consensus peak set. Documentation [ Publication ]

ChIPpeakAnno

The ChIPpeakAnno package provides batch annotation of the peaks identified from either ChIP-seq or ChIP-chip experiments. It includes functions to retrieve the sequences around peaks, obtain enriched Gene Ontology (GO) terms, find the nearest gene, exon, miRNA or custom features such as most conserved elements and other transcription factor binding sites supplied by users. The package leverages the biomaRt, IRanges, Biostrings, BSgenome, GO.db, multtest and stat packages. Documentation

Additional ChIP-Seq Packages

DiffBind: Documentation

MOSAICS: Documentation

iSeq: Documentation

ChIPseqR: Documentation

ChiPsim: Documentation

CSAR: Documentation

ChIP-Seq Pipeline: PICS, rGADEM and MotIV (developer web site)

SPP: ChIP-seq processing pipeline

SPP Tutorial

MACS

SIPeS

RNA-Seq Analysis

Counting Reads that Overlap with Annotation Ranges

The GenomicRanges package provides support for importing into R short read alignment data in BAM format (via Rsamtools) and associating them with genomic feature ranges, such as exons or genes. This way one can quantify the number of reads aligning to annotated genomic regions. The package defines general purpose containers for storing genomic intervals as well as more specialized containers for storing alignments against a reference genome. The two main functions for read counting provided by this infrastructure are countOverlaps and summarizeOverlaps. For their proper usage, it is important to read the corresponding PDF manual. Documentation

Differential Gene Expression Analysis with DESeq

The DESeq package contains functions to call differentially expressed genes (DEGs) in count tables based on a model using the negative binomial distribution. It expects as input a data frame with the raw read counts per region/gene of interest (rows) for each test sample (columns). Such a count table can be imported into R or generated from BAM alignment files using the countOverlaps function as introduced above. Documentation

Differential Gene Expression Analysis with edgeR

The edgeR package uses empirical Bayes estimation and exact tests based on the negative binomial distribution to call differentially expressed genes (DEGs) in count data.

Documentation

A variety of additional R packages are available for normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG):

easyRNASeq (simplifies read counting per genome feature)

DEXSeq (Inference of differential exon usage); parathyroidSE explains how to generate exon read counts in R

DEGseq

baySeq (also see: segmentSeq)

Genominator (Bullard et al. 2010)

Detection of Alternative Splice Junctions

Another utility of RNA-Seq experiments is the analysis of splice junctions. The following software suggestions provide this utility:

ERANGE
TopHat

SpliceMap

SplitSeek

DNA-Methylation Data Analysis

methylPipe
bsseq
BiSeq
Much more under BiocViews

HT-Seq Data Visualization

ggbio: ggplot2 extension for genomics data (online manual)

Gviz: Plotting data and annotation information along genomic coordinates

HilbertVis: Hilbert genome plots

GenomeGraphs: Plotting genomic information from Ensembl

TileQC: Flow Cell Quality Visualization

rtracklayer: R interface to genome browsers

genoPlotR: Plotting maps of genes and genomes

Genominator: Tools for storing, accessing, analyzing and visualizing genomic data.

To install all packages

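# Legacy installer: biocLite() only works with older R/Bioconductor releases.
# On current releases it has been replaced by the BiocManager package: install.packages("BiocManager"), then BiocManager::install(c("ShortRead", "Biostrings"))
source("http://bioconductor.org/biocLite.R")
biocLite()
biocLite(c("ShortRead", "Biostrings", "IRanges", "BSgenome", "rtracklayer", "biomaRt", "chipseq", "ChIPpeakAnno", "Rsamtools", "BayesPeak", "PICS", "GenomicRanges", "DESeq", "edgeR", "leeBamViews", "GenomicFeatures", "BSgenome.Celegans.UCSC.ce2"))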

Regular Expression Cheat Sheet

Jitendra Narayan — Tue, 09 Jul 2013 17:38:42 -0500

Regular expressions are the soul of the Perl language, and for a bioinformatician they are a magic wand for taming gigantic string data. We did not find any good, user-friendly regular expression cheat sheet, so we wrote our own. The Regular Expressions Cheat Sheet is a quick reference guide for regular expressions, including symbols, ranges, grouping, assertions and some sample patterns to get you started.

Perl Module Installation

Jitendra Narayan — Fri, 12 Jul 2013 11:19:41 -0500

Nice step-wise information on Perl module installation.

Address of the bookmark: http://bioinformaticsonline.com/blog/view/710/how-to-install-perl-modules-manually-using-cpan-command-and-other-quick-ways

Citrus Perl

Jitendra Narayan — Wed, 14 Aug 2013 14:57:44 -0500

Citrus Perl is a binary distribution of Perl created for GUI application developers. The distribution includes wxPerl, the Perl wrapper for wxWidgets. Where supported by the operating system wxWidgets is available as a package for the 2.8.x stable branch and the 2.9.x development branch.

Address of the bookmark: http://www.citrusperl.com/

Perl and BioPerl Tutorials

Jitendra Narayan — Wed, 28 Aug 2013 05:51:38 -0500

This bookmark is created to store the useful Perl and BioPerl tutorial links at one place. Feel free to share and add more useful tutorial links here ....

Address of the bookmark: http://cbb.sjtu.edu.cn/course/database/beginning.pdf

Perl one-liner for bioinformatician !!!

Abhimanyu Singh — Fri, 30 May 2014 05:49:07 -0500

With the emergence of NGS technologies and sequencing data, most bioinformaticians munge and wrangle massive amounts of genomics text. Although there are several "standardized" file formats (FASTQ, SAM, VCF, etc.) and tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times when knowing a few Perl one-liners is extremely helpful.

Perl one-liners are small and awesome Perl programs that fit in a single line of code and they do one thing really well. These things include changing line spacing, numbering lines, doing calculations, converting and substituting text, deleting and printing certain lines, parsing logs, editing files in-place, doing statistics, carrying out system administration tasks, updating a bunch of files at once, and many more. Perl one-liners will make you the shell warrior. Anything that took you minutes to solve, will now take you seconds!

perl -pe '$\="\n"'
#double space a file

perl -pe '$_ .= "\n" unless /^$/'
#double space a file except blank lines

perl -pe '$_.="\n"x7'
#add 7 blank lines after every line.

perl -ne 'print unless /^$/'
#remove all blank lines

perl -lne 'print if length($_) < 20'
#print all lines with length less than 20.

perl -00 -pe ''
#If there are multiple consecutive blank lines, collapse them into one (make the file a single spaced file).

perl -00 -pe '$_.="\n"x4'
#Expand single blank lines into 4 consecutive blank lines

perl -pe '$_ = "$. $_"'
#Number all lines in a file

perl -pe '$_ = ++$a." $_" if /./'
#Number only non-empty lines in a file

perl -ne 'print ++$a." $_" if /./'
#Number and print only non-empty lines in a file

perl -pe '$_ = ++$a." $_" if /regex/'
#Number only lines that match a pattern

perl -ne 'print ++$a." $_" if /regex/'
#Number and print only lines that match a pattern

perl -ne 'printf "%-5d %s", $., $_ if /regex/'
#Print the line number left-aligned in a 5 character wide field, followed by the line, if it matches a pattern (perl -ne 'printf "%-5d %s", $., $_' : for all the lines)

perl -le 'print scalar(grep{/./}<>)'
#prints the total number of non-empty lines in a file

perl -lne '$a++ if /regex/; END {print $a+0}'
#print the total number of lines that matches the pattern

perl -alne 'print scalar @F'
#print the total number fields(words) in each line.

perl -alne '$t += @F; END { print $t}'
#Find total number of words in the file

perl -alne 'map { /regex/ && $t++ } @F; END { print $t }'
#find total number of fields that match the pattern

perl -lne '/regex/ && $t++; END { print $t }'
#Find total number of lines that match a pattern

perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m'
#will calculate the GCD of two numbers.

perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m'
#will calculate the LCM (least common multiple) of 20 and 35.

perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n'
#Generates 10 random integers from 5 to 14 (the upper bound, 15, is never returned).

perl -le 'print map { ("a".."z","0".."9")[rand 36] } 1..8'
#Generates an 8 character password from the letters a to z and the digits 0 – 9.

perl -le 'print map { ("a","t","g","c")[rand 4] } 1..20'
#Generates a random 20 nucleotide long sequence.

perl -le 'print "a"x50'
#generate a string of 50 ‘a’ characters

perl -le 'print join ", ", map { ord } split //, "hello world"'
#Will print the ascii value of the string hello world.

perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)'
#converts ascii values into character strings.

perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"'
#Generates an array of odd numbers.

perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"'
#Generate an array of even numbers

perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file
#Convert the entire file into 13 characters offset(ROT13)

perl -nle 'print uc'
#Convert all text to uppercase:

perl -nle 'print lc'
#Convert text to lowercase:

perl -nle 'print ucfirst lc'
#Lowercase each line, then convert only its first letter to uppercase

perl -ple 'y/A-Za-z/a-zA-Z/'
#Convert upper case to lower case and vice versa

perl -ple 's/(\w+)/\u$1/g'
#Capitalize the first letter of every word (title casing)

perl -pe 's|\n|\r\n|'
#Convert unix new lines into DOS new lines:

perl -pe 's|\r\n|\n|'
#Convert DOS newlines into unix new line

perl -pe 's|\n|\r|'
#Convert unix newlines into MAC newlines:

perl -pe '/regexp/ && s/foo/bar/'
#Substitute a foo with a bar in a line with a regexp.

Reference/Sources:

http://genomics-array.blogspot.in/2010/11/some-unixperl-oneliners-for.html

http://genomespot.blogspot.com/2013/08/a-selection-of-useful-bash-one-liners.html

http://biowize.wordpress.com/2012/06/15/command-line-magic-for-your-gene-annotations/

http://bioexpressblog.wordpress.com/2013/04/05/split-multi-fasta-sequence-file/

Pattern Matching Problem Solution with Perl

Jit — Tue, 09 Jun 2015 23:58:45 -0500

Problem at http://rosalind.info/problems/1c/

#Find all occurrences of a pattern in a string.
#Given: Strings Pattern and Genome.
#Return: All starting positions in Genome where Pattern appears as a substring. Use 0-based indexing.

use strict;
use warnings;

my $string="GATATATGCATATACTT";
my $subStr="ATAT";
my $kmer=length($subStr);

kmerMatch ($string, $subStr, $kmer);

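sub kmerMatch { # Slide a window of length $kmer along $string and report exact matches
    my ($string, $myStr, $kmer)=@_;
    my @positions;   # 0-based start positions of every match
    for (my $aa=0; $aa<=(length($string)-$kmer); $aa++) {
        my $myWin=substr $string, $aa,$kmer;
        if ($myWin eq $myStr) {
            push @positions, $aa;
        }
    }
    # Print the positions separated by spaces; "1 3 9" for the strings above.
    print join(" ", @positions), "\n";
}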