BOL: Related items

Linux Sort Commands for Bioinformatics

Rahul Nayak — Sat, 31 May 2014 15:41:16 -0500

Almost all scripting languages, such as Perl and Python, have a built-in sort, but none of them is as flexible as the sort command. When it comes to space efficiency, GNU sort stands at the top: it can sort a 20 GB file with less than 2 GB of memory. Implementing such a powerful sort yourself is not trivial.

sort a space-delimited file based on its first column, then the second if the first is the same, and so on:
sort input.txt

sort a huge file (GNU sort ONLY):
sort -S 1500M -T $HOME/tmp input.txt > sorted.txt

sort starting from the third column, skipping the first two columns (the obsolete +2 form is rejected by current GNU sort, so use -k3):
sort -k3 input.txt

sort the second column as numbers, descending order; if identical, sort the 3rd as strings, ascending order:
sort -k2,2nr -k3,3 input.txt

sort starting from the 4th character at column 2, as numbers:
sort -k2.4n input.txt
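
As a small worked example of the key syntax above, the following sketch builds a throwaway three-column file and sorts it by the second column as numbers in descending order, breaking ties on the third column as strings. The file name and its contents are made up for illustration.

```shell
set -eu

# Hypothetical demo data: name, score, label (space-delimited).
tmpdir=$(mktemp -d)
printf '%s\n' 'geneA 10 zeta' 'geneB 200 alpha' 'geneC 10 beta' 'geneD 9 gamma' > "$tmpdir/input.txt"

# LC_ALL=C forces byte-wise string comparison, so the tie-break on
# column 3 does not depend on the locale of the machine.
sorted=$(LC_ALL=C sort -k2,2nr -k3,3 "$tmpdir/input.txt")
printf '%s\n' "$sorted"

rm -rf "$tmpdir"
```

The two rows with score 10 come out as geneC (beta) before geneA (zeta), which shows the second key only being consulted when the first key ties.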

More Linux sort command information

If you have any sort commands you'd like to share, please add them to our comments section below. For more help, you can also type:

man sort

or

sort --help

on your Unix/Linux system.

RNA-seq Analysis Workshop Course Materials

Jit — Tue, 03 Jul 2018 08:14:14 -0500

RNAseq can be roughly divided into two "types":

Reference genome-based: an assembled genome exists for the species on which the RNAseq experiment is performed. Reads can be aligned against the reference genome, which significantly improves our ability to reconstruct transcripts. This category obviously includes humans and most model organisms but excludes the majority of truly biologically interesting species (e.g., Hyacinth macaw).

Reference genome-free: no genome assembly is available for the species of interest. In this case one needs to assemble the reads into transcripts using de novo approaches. This type of RNAseq is as much an art as a science because assembly is heavily parameter-dependent and difficult to do well.

In this lesson we will focus on the reference genome-based type of RNAseq. http://chagall.med.cornell.edu/RNASEQcourse/

Address of the bookmark: http://chagall.med.cornell.edu/RNASEQcourse/

Convert EnsEMBL GTF to Annotation table (Geneid, GeneSymbol, GeneWiseChrLocation, GeneClass, Strand)

EagleEye — Fri, 24 Jun 2016 18:08:49 -0500

Bash Script source:

https://gist.github.com/santhilalsubhash/367befcf5216be4b1fd9

Information:

This script converts an EnsEMBL GTF file (Ex: https://gist.githubusercontent.com/santhilalsubhash/1e7cca357e52a181dc25/raw/cfb803e07900a2baefbb6534f1299fd30cb57a29/sample.GTF) to annotation table format. It generates two files:
1) Transcript wise chromosome location with information about transcripts (Ex: https://gist.githubusercontent.com/santhilalsubhash/c7dec516e0338503a4b6/raw/de0af1a39f0005c4ce7321c5ae57fc8b4a14c7f4/sample.GTF_enst_annotation.txt)
2) Gene wise chromosome location with information about genes (Ex: https://gist.githubusercontent.com/santhilalsubhash/c92006c5080f0333bec2/raw/d16e0b2440d73b09b486d3c9751cdb248a73fa0b/sample.GTF_ensg_annotation.txt)

Note: You can download GTF files from http://www.ensembl.org/info/data/ftp/index.html
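
To give an idea of what such a conversion involves, here is a minimal sketch of the gene-wise table only. It is not the linked script. It assumes the GTF contains "gene" feature lines with gene_id, gene_name and gene_biotype attributes, as recent EnsEMBL releases do; the two-line sample file is hypothetical, and missing attributes are reported as NA.

```shell
set -eu

# Hypothetical two-line GTF excerpt in the EnsEMBL attribute style (tab-separated).
tmpdir=$(mktemp -d)
gtf="$tmpdir/sample.gtf"
printf '1\thavana\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_name "DDX11L1"; gene_biotype "transcribed_unprocessed_pseudogene";\n' > "$gtf"
printf '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; transcript_id "ENST00000456328"; gene_name "DDX11L1"; gene_biotype "transcribed_unprocessed_pseudogene";\n' >> "$gtf"

# Columns written: Geneid, GeneSymbol, GeneWiseChrLocation, GeneClass, Strand.
table=$(awk -F '\t' '
  # Return the quoted value of one attribute key from column 9, or NA.
  function attr(s, key,    re, m) {
    re = key " \"[^\"]*\""
    if (match(s, re)) {
      m = substr(s, RSTART, RLENGTH)
      sub(/^[^"]*"/, "", m)
      sub(/"$/, "", m)
      return m
    }
    return "NA"
  }
  # Skip header comments and keep only gene-level records.
  $0 !~ /^#/ && $3 == "gene" {
    printf "%s\t%s\t%s:%s-%s\t%s\t%s\n", attr($9, "gene_id"), attr($9, "gene_name"), $1, $4, $5, attr($9, "gene_biotype"), $7
  }
' "$gtf")
printf '%s\n' "$table"

rm -rf "$tmpdir"
```

Changing the feature test from "gene" to "transcript" and adding transcript_id would give a transcript-wise table along the same lines.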

Postdoc position at Centre Méditerranéen de Médecine Moléculaire - Nice - France

Wed, 04 Jun 2014 07:20:57 -0500

The research group of Dr. Michele Trabucchi at the Centre Méditerranéen de Médecine Moléculaire (C3M) at INSERM U1065 (University of Nice Sophia-Antipolis, France) is seeking candidates for a postdoctoral fellow position starting in October 2014, funded for 3 years by FRM (Fondation pour la Recherche Médicale).
The broad interest of the lab is in understanding the expression control and function of small RNAs in activated myeloid cells (visit our webpage to check research interests and publications of the group : http://www.unice.fr/c3m/EN/Equipe10.html ).

The work will focus on the functional studies of small RNAs by using next-generation sequencing approaches.

Candidates should hold a Ph.D. degree and have a strong background in bioinformatics.
The University of Nice Sophia-Antipolis provides a wide range of facilities and training essential for biomedical research.

Interested applicants should send a PDF with a cover letter stating research interests and qualifications, an updated CV, a summary of previous research experience and contact information for two references to Michele Trabucchi ( mtrabucchi@unice.fr )

Homepage: http://www.unice.fr/c3m/EN/Equipe10.html

ComparativeGenomics Exercise2

Neel — Wed, 22 Aug 2018 22:10:56 -0500

COMPARATIVE MICROBIAL GENOMICS ANALYSIS WORKSHOP @ cbs.dtu.dk

Free Bioinformatics workbench https://www.mn.uio.no/ifi/english/research/networks/clsi/earlier_seminars/2012/tammivesth_osloseminarfinal.pdf

Ten recommendations for creating usable bioinformatics command line software

RAJESH DETROJA — Sun, 08 Jun 2014 10:06:26 -0500

Bioinformatics software varies greatly in quality. In terms of usability, the command line interface is the first experience a user will have of a tool. Unfortunately, this is often also the last time a tool will be used. Here I present ten recommendations for command line software author’s tools to follow, which I believe would greatly improve the uptake and usability of their products, waste less user’s time, and improve the quality of scientific analyses.

Address of the bookmark: http://www.gigasciencejournal.com/content/2/1/15?utm_content=buffer25ee0&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer

BASE: a practical de novo assembler for large genomes using long NGS reads

Rahul Nayak — Fri, 19 Oct 2018 07:25:21 -0500

BASE is a new de novo assembler. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs.

Address of the bookmark: https://github.com/dhlbh/BASE

Canu genome assembly parameters

Rahul Nayak — Mon, 07 Jan 2019 08:40:37 -0600

Choose the appropriate parameters for Canu and run it. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Since we compute on a single Amazon server, disable the option to compute on a cluster with useGrid=false. For your own project, these settings should be discussed with a local computing guru. Parameters in square brackets [] are optional; the symbol | stands for "or".

usage:   canu [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
               -maxThreads=2 \
               useGrid=false \
              [other-options] \
               read_file.fastq.gz

A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found below. However, there are still many parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or merge them together, we can alter the error correction process.

canu -p test_asmbl \
     -d asm_test3 \
     genomeSize=2m \
     -maxThreads=2 useGrid=false \
     -pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz

There is a brilliant section in the documentation about parameter tweaking.

The output directory will contain many files. The most interesting ones are:

*.correctedReads.fasta.gz : file containing the input sequences after they have been corrected, trimmed and split based on consensus evidence.
*.trimmedReads.fastq : file containing the sequences after correction and final trimming
*.layout : file containing information about read inclusion in the final assembly
*.gfa : file containing the assembly graph produced by Canu
*.contigs.fasta : file containing everything that could be assembled and is part of the primary assembly

The basic stats of the assembly can be read from the reports generated by the assembler, or calculated using standard UNIX command line tools.
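
As one possible way to do this with standard tools, the sketch below computes the number of contigs, the total assembled length and the N50 from a FASTA file using awk and sort. The toy FASTA file is hypothetical and stands in for a real *.contigs.fasta file.

```shell
set -eu

# Hypothetical toy assembly standing in for a Canu *.contigs.fasta file.
tmpdir=$(mktemp -d)
fasta="$tmpdir/toy.contigs.fasta"
printf '>tig1\nACGTACGTAC\n>tig2\nACGTAC\nGT\n>tig3\nACGT\n' > "$fasta"

# Step 1: print one length per contig (sequence lines may be wrapped).
# Step 2: sort the lengths in descending order.
# Step 3: N50 is the contig length at which the running sum of the
#         sorted lengths first reaches half of the total length.
stats=$(awk '/^>/ { if (n) print len; n++; len = 0; next }
             { len += length($0) }
             END { if (n) print len }' "$fasta" |
        sort -nr |
        awk '{ lens[NR] = $1; total += $1 }
             END {
               run = 0
               for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run * 2 >= total) { n50 = lens[i]; break }
               }
               printf "contigs=%d total_bp=%d N50=%d\n", NR, total, n50
             }')
printf '%s\n' "$stats"

rm -rf "$tmpdir"
```

For the toy file the contig lengths are 10, 8 and 4 bases (22 in total), so the running sum reaches half of the total at the second contig and the N50 is 8.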

More at https://canu.readthedocs.io/en/latest/faq.html

Workshop On Molecular Modeling and Dynamics Simulation Analyses

Fri, 04 Jul 2014 13:38:13 -0500

August 1-2, 2014

Organised By

Centre of Excellence in Bioinformatics
Bioinformatics Infrastructure Facility
Department of Biochemistry
University of Lucknow
Lucknow-226007

Course Contents

Molecular Modeling
Homology Modeling
Molecular Docking
Post-structural Analyses

Molecular Dynamics (MD) Simulation
Linux Introduction
Gromacs Installation

MD Simulation of Protein ligand complex
Analyses of MD Trajectories
Visualization of Dynamic complexes

Important Dates

Registration Begins June 25, 2014
Registration Closes July 25, 2014

Brochure : www.lkouniv.ac.in/conference/Brochure_August,%202014.pdf

Simka and SimkaMin are comparative metagenomics methods dedicated to NGS datasets

Neel — Sat, 06 Jul 2019 13:56:10 -0500

Simka is a de novo comparative metagenomics tool. Simka represents each dataset as a k-mer spectrum and computes several classical ecological distances between them.
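
To make the idea concrete, here is a toy sketch of comparing two datasets through their k-mer counts using the Bray-Curtis dissimilarity, one classical ecological distance. It is not Simka's implementation, which is designed for large NGS datasets; the two one-read "datasets", the file names and the choice of k=3 are assumptions made purely for illustration.

```shell
set -eu

k=3
# Hypothetical toy "datasets": one read per line.
tmpdir=$(mktemp -d)
printf 'ACGTAC\n' > "$tmpdir/a.txt"
printf 'ACGTTT\n' > "$tmpdir/b.txt"

bray_curtis=$(awk -v k="$k" '
  # The first line of each input file marks the start of a new dataset.
  FNR == 1 { file++ }
  {
    # Count every k-mer of the read in the spectrum of its dataset.
    for (i = 1; i + k - 1 <= length($0); i++) {
      kmer = substr($0, i, k)
      seen[kmer] = 1
      if (file == 1) a[kmer]++; else b[kmer]++
    }
  }
  END {
    # Bray-Curtis: 1 - 2 * sum(min(a, b)) / (sum(a) + sum(b)).
    for (kmer in seen) {
      x = a[kmer] + 0; y = b[kmer] + 0
      shared += (x < y) ? x : y
      total += x + y
    }
    printf "%.2f\n", 1 - 2 * shared / total
  }
' "$tmpdir/a.txt" "$tmpdir/b.txt")
printf '%s\n' "$bray_curtis"

rm -rf "$tmpdir"
```

Each toy read yields four 3-mers, two of which (ACG and CGT) are shared, so the dissimilarity is 1 - 2*2/8 = 0.50. A value of 0 would mean identical spectra and 1 would mean no k-mer in common.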

Developer: Gaëtan Benoit, PhD, former member of the Genscale team at Inria.

Contact: claire dot lemaitre at inria dot fr

Simka and SimkaMin are comparative metagenomics methods dedicated to NGS datasets. https://gatb.inria.fr/software/simka/

Address of the bookmark: https://github.com/GATB/simka