BOL: Related items

Perl one-liners for bioinformaticians

Abhimanyu Singh — Fri, 30 May 2014 05:49:07 -0500

With the emergence of NGS technologies and the flood of sequencing data, most bioinformaticians munge and wrangle massive amounts of genomics text. Although there are several "standardized" file formats (FASTQ, SAM, VCF, etc.) and tools for manipulating them (FASTX-Toolkit, samtools, vcftools, etc.), there are still times when knowing a few Perl one-liners is extremely helpful.

Perl one-liners are small Perl programs that fit in a single line of code and do one thing really well. Typical jobs include changing line spacing, numbering lines, doing calculations, converting and substituting text, deleting and printing certain lines, parsing logs, editing files in place, computing statistics, carrying out system administration tasks, updating a bunch of files at once, and many more. Perl one-liners will make you a shell warrior: anything that took you minutes to solve will now take you seconds!

perl -pe '$\="\n"'
#double space a file

perl -pe '$_ .= "\n" unless /^$/'
#double space a file except blank lines

perl -pe '$_.="\n"x7'
#Insert 7 blank lines after every line (change 7 to any N)

perl -ne 'print unless /^$/'
#remove all blank lines

perl -lne 'print if length($_) < 20'
#print all lines with length less than 20.

perl -00 -pe ''
#Collapse every run of consecutive blank lines into a single blank line (-00 reads the file in paragraph mode)

perl -00 -pe '$_.="\n"x4'
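#Make paragraph spacing uniform: every run of blank lines becomes one blank line plus 4 extra newlines (5 blank lines in total); change 4 to get a different spacing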

perl -pe '$_ = "$. $_"'
#Number all lines in a file

perl -pe '$_ = ++$a." $_" if /./'
#Number only non-empty lines in a file

perl -ne 'print ++$a." $_" if /./'
#Number and print only non-empty lines in a file

perl -pe '$_ = ++$a." $_" if /regex/'
#Number only lines that match a pattern

perl -ne 'print ++$a." $_" if /regex/'
#Number and print only lines that match a pattern

perl -ne 'printf "%-5d %s", $., $_ if /regex/'
#Print only lines that match a pattern, each prefixed with its line number left-aligned in a 5-character column (perl -ne 'printf "%-5d %s", $., $_' does the same for all lines)

perl -le 'print scalar(grep{/./}<>)'
#prints the total number of non-empty lines in a file

perl -lne '$a++ if /regex/; END {print $a+0}'
#Print the total number of lines that match the pattern

perl -alne 'print scalar @F'
#Print the number of fields (words) in each line

perl -alne '$t += @F; END { print $t}'
#Find total number of words in the file

perl -alne 'map { /regex/ && $t++ } @F; END { print $t+0 }'
#Find the total number of fields that match the pattern (the +0 prints 0 instead of an empty line when nothing matches)

perl -lne '/regex/ && $t++; END { print $t+0 }'
#Find the total number of lines that match a pattern (same result as the $a++ version above)

perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m'
#Calculate the GCD (greatest common divisor) of two numbers, here 20 and 35, with Euclid's algorithm; prints 5

perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m'
#Calculate the LCM (least common multiple) of 20 and 35 as a*b/GCD; prints 140

perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n'
#Generates 10 random integers from 5 to 14; the upper bound 15 is never returned (use rand($max-$min+1) to include it)

perl -le 'print map { ("a".."z","0".."9")[rand 36] } 1..8'
#Generates an 8-character password from the letters a to z and the digits 0 to 9

perl -le 'print map { ("a","t","g","c")[rand 4] } 1..20'
#Generates a random DNA sequence 20 nucleotides long

perl -le 'print "a"x50'
#Generate a string of 50 "a" characters (x is Perl's repetition operator)

perl -le 'print join ", ", map { ord } split //, "hello world"'
#Print the ASCII value of each character in the string "hello world"

perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)'
#Convert a list of ASCII values into a character string (prints "coding")

perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"'
#Generates an array of odd numbers.

perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"'
#Generate an array of even numbers

perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file
#Apply ROT13 to the entire file (rotate every letter by 13 positions)

perl -nle 'print uc'
#Convert all text to uppercase:

perl -nle 'print lc'
#Convert text to lowercase:

perl -nle 'print ucfirst lc'
#Uppercase the first letter of each line and lowercase the rest

perl -ple 'y/A-Za-z/a-zA-Z/'
#Swap case: convert upper case to lower case and vice versa

perl -ple 's/(\w+)/\u$1/g'
#Capitalize the first letter of every word (title case); the remaining letters are left unchanged

perl -pe 's|\n|\r\n|'
#Convert unix new lines into DOS new lines:

perl -pe 's|\r\n|\n|'
#Convert DOS newlines into unix new line

perl -pe 's|\n|\r|'
#Convert unix newlines into classic Mac OS (pre-OS X) newlines:

perl -pe '/regexp/ && s/foo/bar/'
#Replace the first foo with bar, but only on lines that match regexp (use s/foo/bar/g to replace every occurrence)

Reference/Sources:

http://genomics-array.blogspot.in/2010/11/some-unixperl-oneliners-for.html

http://genomespot.blogspot.com/2013/08/a-selection-of-useful-bash-one-liners.html

http://biowize.wordpress.com/2012/06/15/command-line-magic-for-your-gene-annotations/

http://genomics-array.blogspot.com/2010/11/some-unixperl-oneliners-for.html

http://bioexpressblog.wordpress.com/2013/04/05/split-multi-fasta-sequence-file/

MIX: Combining multiple assemblies from NGS data

Rahul Nayak — Tue, 08 May 2018 04:58:05 -0500

Mix is a tool that combines two or more draft assemblies without relying on a reference genome, with the goal of reducing contig fragmentation and thus speeding up genome finishing. The proposed algorithm builds an extension graph where vertices represent extremities of contigs and edges represent existing alignments between these extremities. These alignment edges are used for contig extension. The resulting output assembly corresponds to a path in the extension graph that maximizes the cumulative contig length.

The Mix algorithm, approach, and results were published in BMC Bioinformatics: http://www.biomedcentral.com/1471-2105/14/S15/S16.

Address of the bookmark: https://github.com/cbib/MIX

Stephen Friend: The hunt for "unexpected genetic heroes"

Sat, 31 May 2014 14:31:47 -0500

What can we learn from people with the genetics to get sick who don't? With most inherited diseases, only some family members will develop the disease, while others who carry the same genetic risks dodge it. Stephen Friend suggests we start studying those family members who stay healthy. Hear about the Resilience Project, a massive effort to collect genetic materials that may help decode inherited disorders.

vcfR: a package to manipulate and visualize VCF data in R

Jit — Thu, 25 Oct 2018 09:05:59 -0500

VcfR is an R package intended to allow easy manipulation and visualization of variant call format (VCF) data. Functions are provided to rapidly read from and write to VCF files. Once VCF data is read into R, a parser function extracts matrices from the VCF data for use with typical R functions. This information can then be used for quality control or other purposes. Additional functions provide visualization of genomic data. Once processing is complete, data may be written to a VCF file or converted into other popular R objects (e.g., genlight, DNAbin). VcfR provides a link between VCF data and the R environment, connecting familiar software with genomic data.

Address of the bookmark: https://github.com/knausb/vcfR

INSPIRE Faculty Scheme: a component of “Assured Opportunity for Research Career (AORC)” under INSPIRE.

Sat, 19 Jul 2014 14:59:30 -0500

Ministry of Science and Technology, Department of Science and Technology

7th ADVERTISEMENT – 2014 (2)

The Department of Science and Technology, Government of India, launched the “Innovation in Science Pursuit for Inspired Research (INSPIRE)” [http://www.inspire-dst.gov.in] program in 2008.

The program aims to attract talent to the study of science and to careers in research. INSPIRE includes many components. The importance of an assured career opportunity in the R&D sector has been recognized.

The INSPIRE Faculty Scheme opens up an “Assured Opportunity for Research Career (AORC)” for young researchers in the age group of 27-32 years. It offers contractual research awards to young achievers, giving them an opportunity to carry out independent research in the near term and to emerge as future leaders in the long term.

Eligibility

Essential

Indian citizens and people of Indian origin, including those with NRI/PIO status, with a PhD (in science, mathematics, engineering, pharmacy, medicine, or agriculture-related subjects) from any recognized university in the world.

Those who have submitted their PhD theses and are awaiting award of the degree are also eligible. However, the award will be conveyed only after confirmation that the PhD degree has been awarded.

The upper age limit as on 1st July 2014 is 32 years for considering support for a period of 5 years. However, for SC and ST candidates, the upper age limit is 35 years.

Publication(s) in highly reputed Journals demonstrating research potential of the candidate.

Desirable

Candidates who are within top 1% at the School Leaving Examination, IIT-JEE rank, 1st Rank Holder either in graduation or post-graduation level university examination (which are used presently for identifying INSPIRE Scholars at under-graduate level and INSPIRE Fellows for doctoral degree)

More at http://www.inspire-dst.gov.in/faculty_scheme.html

Tools for RNA classification

Abhi — Tue, 08 Nov 2022 03:39:11 -0600

barrnap - https://github.com/tseemann/barrnap

CPAT - https://github.com/liguowang/cpat, http://lilab.research.bcm.edu/ (web server)

CPC2 - https://github.com/gao-lab/CPC2_standalone, http://cpc2.gao-lab.org/ (web server)

Infernal - http://eddylab.org/infernal/, https://github.com/EddyRivasLab/infernal

NCBI RefSeq - https://www.ncbi.nlm.nih.gov/refseq/

Rfam - http://rfam.xfam.org/, https://docs.rfam.org/en/latest/index.html

SILVA - https://www.arb-silva.de/

RNAmmer - http://www.cbs.dtu.dk/services/RNAmmer/ (web server, standalone download link)

List of bioinformatics workflow management tools !

Rahul Nayak — Sat, 20 Mar 2021 00:15:25 -0500

Here is a list of workflow managers:

BigDataScript – A cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities. [ paper-2014 | web ]
Bpipe – A small language for defining pipeline stages and linking them together to make pipelines. [ web ]
Common Workflow Language – a specification for describing analysis workflows and tools that are portable and scalable across a variety of software and hardware environments, from workstations to cluster, cloud, and high performance computing (HPC) environments. [ web ]
Cromwell – A Workflow Management System geared towards scientific workflows. [ web ]
Galaxy – a popular open-source, web-based platform for data intensive biomedical research. Has several features, from data analysis to workflow management to visualization tools. [ paper-2018 | web ]
Nextflow (recommended) – A fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel and scalable pipelines in a portable manner. [ paper-2018 | web ]
Ruffus – Computation Pipeline library for python widely used in science and bioinformatics. [ paper-2010 | web ]
SeqWare – Hadoop Oozie-based workflow system focused on genomics data analysis in cloud environments. [ paper-2010 | web ]
Snakemake – A workflow management system in Python that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment. [ paper-2018 | web ]
Workflow Description Language (WDL) – Workflow standard developed by the Broad Institute. [ web ]

NCBI Webinar

Jit — Sun, 08 Jun 2014 02:47:01 -0500

In less than two weeks, NCBI will offer a webinar entitled "Introducing 3 NCBI Resources to Navigate Testing for Disease Linked Variants: MedGen, GTR and ClinVar". This webinar will delve into the lifecycle of genetic testing and teach attendees how to navigate the NIH Genetic Testing Registry, ClinVar, and MedGen resources. These resources can be used to prepare for clinical cases, access detailed information about orderable genetic tests, interpret test results, and more.

More at https://attendee.gotowebinar.com/register/8452228815737989634

ECTOOLS: Long Read Correction and other Correction tools

Jit — Fri, 05 Jan 2018 04:02:22 -0600

Long Read Correction and other Correction tools

This package is a loose collection of scripts. To run the correction
routine see the section below. Descriptions of the other scripts
are at the bottom of this file.

Contact: gurtowsk@cshl.edu

In short, the correction algorithm takes as input the unitigs from a short read assembly and uses them to correct long read data. More background information on the algorithm can be found at:
http://schatzlab.cshl.edu/presentations/2013-06-18.PBUserMeeting.pdf

Address of the bookmark: https://github.com/jgurtowski/ectools

Internship program with ArrayGen Technologies

Sun, 22 Jun 2014 23:18:31 -0500

Internship Program for Bioinformatics / Biotechnology Professionals. Currently, we offer positions to outstanding students interested in Next Generation Sequencing (NGS) data analysis. Applications are accepted throughout the year. Accepted students will be listed on the website with their schedules. Accepted students can attend our future workshops and trainings free of charge at the specified venue.

Interested candidates may email their resume along with a cover letter to careers@arraygen.com

Official website: http://www.arraygen.com/