BOL: Related items

Linux Sort Commands for Bioinformatics

Rahul Nayak — Sat, 31 May 2014 15:41:16 -0500

Almost all scripting languages, such as Perl and Python, have a built-in sort, but none of them is as flexible as the sort command. When it comes to space efficiency, GNU sort stands at the top: it can sort a 20 GB file with less than 2 GB of memory. It is not trivial to implement so powerful a sort yourself.

sort a space-delimited file based on its first column, then the second if the first is the same, and so on:
sort input.txt

sort a huge file, capping the memory buffer with -S and placing temporary files in a chosen directory with -T (GNU sort ONLY):
sort -S 1500M -T $HOME/tmp input.txt > sorted.txt

sort starting from the third column, skipping the first two columns (the obsolete "sort +2" form is not accepted by default by current GNU sort):
sort -k3 input.txt

sort the second column as numbers, descending order; if identical, sort the 3rd as strings, ascending order:
sort -k2,2nr -k3,3 input.txt

sort starting from the 4th character at column 2, as numbers:
sort -k2.4n input.txt
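To see how the key options above behave, here is a minimal sketch on a tiny made-up file. The sample_data function and its gene names, counts, and chromosomes are hypothetical and only there for illustration. LC_ALL=C is set so that the string comparison is byte-wise and does not depend on your locale.

```shell
# Hypothetical sample data: gene name, read count, chromosome (space-delimited).
sample_data() {
cat <<'EOF'
geneA 10 chr2
geneB 200 chr1
geneC 10 chr1
geneD 35 chrX
EOF
}

# Column 2 as numbers, descending; ties broken by column 3 as strings, ascending.
sorted=$(sample_data | LC_ALL=C sort -k2,2nr -k3,3)
printf '%s\n' "$sorted"
# Prints:
# geneB 200 chr1
# geneD 35 chrX
# geneC 10 chr1
# geneA 10 chr2
```

The two rows with a count of 10 are ordered by the third column (chr1 before chr2), which is the tie-breaking behaviour described above.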

More Linux sort command information

If you have any sort commands you'd like to share, please add them to our comments section below. For more help, you can also type:

man sort

or

sort --help

on your Unix/Linux system.

INC-Seq: accurate single molecule reads using nanopore sequencing

Jit — Mon, 27 Nov 2017 10:38:56 -0600

INC-Seq reads enabled accurate species-level classification, identification of species at 0.1 % abundance and robust quantification of relative abundances, providing a cheap and effective approach for pathogen detection and microbiome profiling on the MinION system.

Address of the bookmark: https://github.com/CSB5/INC-Seq

HISAT2: a fast and sensitive alignment program for mapping next-generation sequencing reads

Rahul Nayak — Tue, 08 May 2018 04:27:22 -0500

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs [Sirén et al. 2014], we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).
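As a rough sanity check of the figures quoted above, the number of local indexes times the region size should come out near the size of a human genome. This is only back-of-the-envelope arithmetic: it takes 1 Kbp as 1000 bp and ignores any overlap between neighbouring local indexes, both of which are assumptions.

```shell
# Rough coverage check using the figures quoted in the description.
local_indexes=55000
region_bp=56000   # 56 Kbp per local index, assuming 1 Kbp = 1000 bp
total_bp=$((local_indexes * region_bp))
echo "$total_bp"
# Prints 3080000000, i.e. about 3.08 Gbp
```

That is on the order of the roughly 3.1 Gbp human genome, so the two numbers are consistent with each other.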

more at https://ccb.jhu.edu/software/hisat2/index.shtml

Address of the bookmark: https://github.com/infphilo/hisat2

Search Shell Command History

Rahul Nayak — Thu, 12 Jun 2014 17:43:34 -0500

We use a couple of hundred commands on a daily basis, and most of them are repeated several times. So the question is: how do I search old command history under the bash shell and modify or reuse it?

Nowadays almost all modern shells let you search the command history, if the user has enabled it. Use the history command to display the history list with line numbers. Lines listed with a * have been modified by the user.

Shell history search command

Type history at a shell prompt:
$ history

It will display the list of all previously used command lines, each with a serial number.

To search particular command, enter:
$ history | grep command-name
$ history | grep -Ei 'scp|ssh|ftp'
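If you run this kind of search often, a small wrapper saves typing. The sketch below is only an illustration: hgrep is a hypothetical helper name, not a shell builtin, and it reads the saved history file rather than the in-memory list, so commands from the current bash session show up only after the shell exits or after you run history -a.

```shell
# hgrep: hypothetical helper, not a shell builtin.
# Case-insensitive extended-regex search over the saved history file.
hgrep() {
    # Fall back to bash's default history file if HISTFILE is unset.
    grep -Ei -- "$1" "${HISTFILE:-$HOME/.bash_history}"
}
```

For example, hgrep 'scp|ssh|ftp' lists every saved command line mentioning scp, ssh or ftp.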

Emacs Line-Edit Mode Command History Searching

To get the previous command containing a string, hit [CTRL]+[r] followed by the search string:

(reverse-i-search):

To get the previous command, hit [CTRL]+[p]. You can also use the up arrow key.

CTRL-p

To get the next command, hit [CTRL]+[n]. You can also use the down arrow key.

CTRL-n

fc command

Apart from the history command, there is the fc command to extract commands from history. The name fc stands for either "find command" or "fix command".

For example, to list the last 10 commands, enter:
$ fc -l -10
To list commands 130 through 150, enter:
$ fc -l 130 150
To list all commands since the last command beginning with ssh, enter:
$ fc -l ssh
To edit commands 1 through 5 using the vi text editor, enter:
$ fc -e vi 1 5

Delete command history

The -c option causes the history list to be cleared by deleting all of the entries:
$ history -c

BLASR: Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement (BLASR): Theory and Application

Jit — Wed, 23 May 2018 06:54:32 -0500

BLASR (Basic Local Alignment with Successive Refinement) is a method for mapping Single Molecule Sequencing (SMS) reads that are thousands to tens of thousands of bases long, where the divergence between the read and the genome is dominated by insertion and deletion errors.

Here is how I use blasr to align PacBio reads to the contigs (target.fasta). The “target.fasta.sa” file is the suffix array of “target.fasta”, generated by sawriter.

blasr query.fa ./target.fasta -sa ./target.fasta.sa -bestn 40 -maxScore -500 -m 4 -nproc 24 -out target.m4 -maxLCPLength 15

The output format option “-m 4” generates the alignment coordinates. It is not fully documented, but I can explain it to you.
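Once the alignments are written, a quick filter is often the next step. The sketch below is a guess at how that could look, not part of blasr itself: filter_m4 is a hypothetical helper, and it assumes the -m 4 output is whitespace-separated with the columns in the order qName tName score percentSimilarity qStrand qStart qEnd qLength tStrand tStart tEnd tLength mapQV. Check that order against your own output before relying on it.

```shell
# filter_m4: hypothetical helper, not part of blasr.
# ASSUMED column order (verify against your own -m 4 output):
# qName tName score percentSimilarity qStrand qStart qEnd qLength
# tStrand tStart tEnd tLength mapQV
# Keeps alignments whose percent similarity is at least $1 and prints
# read name, contig name, percent similarity, and aligned query span.
filter_m4() {
    awk -v min="$1" '$4 + 0 >= min + 0 { print $1, $2, $4, $7 - $6 }'
}
```

For example, filter_m4 85 < target.m4 would list only the alignments with at least 85 % similarity under that assumed layout.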

I use a server with 24 cores and 48 GB of RAM for the alignment. It took about 2 to 3 hours to align 3G of PacBio reads to 10^6 short-read contigs with a mean length of 3.5 kbp.

Address of the bookmark: http://bix.ucsd.edu/projects/blasr/

Bioinformatician’s Pocket Reference !!

RAJESH DETROJA — Sun, 08 Jun 2014 09:56:58 -0500

It is amusing how the brains of bioinformaticians work! Spending days learning a new programming language feels like more fun than having a 5-minute discussion with our neighbours (unless under special circumstances!) in our own mother tongue. Today every bioinformatician keeps more than a few languages and core IT toolkits on their plate. It has become mandatory to be able to mould different code snippets into our own custom workflows, and thus keeping syntax at our fingertips has become essential. Although Google is the best way to get a syntax problem solved, it is not a bad idea to keep reference sheets on our smartphones or to stick some printed sheets on the back of the door, the old-fashioned way!

Address of the bookmark: http://infoplatter.wordpress.com/2014/04/06/bioinformaticians-pocket-reference/

npScarf: real-time scaffolder using SPAdes contigs and Nanopore sequencing reads

Shruti Paniwala — Mon, 11 Jun 2018 05:14:57 -0500

npScarf (jsa.np.npscarf) is a program that connects contigs from a draft genome to generate sequences that are closer to finished. These pipelines can run on a single laptop for microbial datasets. In real-time mode, it can be integrated with simple structural analyses such as gene ordering and plasmid forming.

Address of the bookmark: http://japsa.readthedocs.io/en/latest/tools/jsa.np.npscarf.html

Assistant Professor in Medical Bioinformatics

Tue, 24 Jun 2014 01:46:36 -0500

Advt. No : ME-I/A-IV/03/14
No. of Posts: 01 (SC)
Pay Scale:
Pay Band of Rs. 15600-39100 + Rs. 6000/- GP + NPA @ 25% of Basic Pay + Learning Resource Allowance @ Rs. 20,000/- P.A. + Conveyance Allowance @ Rs. 1650/- P.M. + Academic Allowance @ Rs. 2500/- P.M. and other admissible allowances.
Qualifications:
Area of Specialization:
Bioinformatics/Computational Biology/Genomics/Proteomics/Structural Biology
1. Postgraduate qualification, e.g. Master’s Degree in Biotechnology/Bioinformatics/Biophysics.
2. A Doctorate Degree from a recognized University/Institute in a basic or allied Medical Science subject, e.g. Medical Biotechnology/Biophysics/Bioinformatics/X-ray Crystallography/Immunology/Structural Biology etc.
Experience:
1. Minimum three years of teaching and/or research experience in a recognized medical/research Institution in an allied medical subject after obtaining the doctorate degree, preferably in Medical Molecular Biology/Biophysics/Structural Biology/Genomics and Clinical Proteomics/Computational Biology.
2. Minimum two publications, with at least one in an international journal and at least one as first author.
Desirable:
Consistently excellent scholastic/academic record, demonstrated ability to write grant proposals successfully, and Post Doctoral training in a frontier area of medical Bioinformatics Research of direct relevance to clinical diagnosis or patient care (preferably from a recognized top-ranking medical institution abroad).
Send your applications to O/O, Deputy Registrar, Recruitment & Establishment Cell, University of Health Sciences, Rohtak by 08.7.2014
For more details, please visit website: http://pgimsrohtak.nic.in/2014%20AP%20Advt.pdf
Last Apply Date: 08 Jul 2014

Hercules: a profile HMM-based hybrid error correction algorithm for long reads

Jit — Mon, 20 Aug 2018 14:14:11 -0500

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several studies require long and accurate reads, including de novo assembly, fusion and structural variation detection. In such cases researchers often combine both technologies, and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies.

Results: We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses this to correct errors in these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2), and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and the highest accuracy when most of the basepairs of a long read are covered with short reads.

Availability: Hercules source code is available at https://github.com/BilkentCompGen/Hercules

Address of the bookmark: https://github.com/BilkentCompGen/Hercules

Postdoc position at Centre Méditerranéen de Médecine Moléculaire

Sun, 06 Jul 2014 11:23:06 -0500

The research group of Dr. Michele Trabucchi at the Centre Méditerranéen de Médecine Moléculaire (C3M) at INSERM U1065 (University of Nice Sophia-Antipolis, France) is seeking candidates for a Postdoctoral Fellow position starting in October 2014, funded for 3 years by FRM (Fondation pour la Recherche Médicale).
The broad interest of the lab is in understanding the expression control and function of small RNAs in activated myeloid cells (visit our webpage to check research interests and publications of the group : http://www.unice.fr/c3m/EN/Equipe10.html ).

The work will focus on the functional studies of small RNAs by using next-generation sequencing approaches.

Candidates should hold a Ph.D. degree and have a strong background in bioinformatics.
The University of Nice Sophia-Antipolis provides a wide range of facilities and training essential for biomedical research.
Interested applicants should send a PDF with a cover letter stating research interests and qualifications, an updated CV, a summary of previous research experience and contact information for two references to Michele Trabucchi ( mtrabucchi@unice.fr )