BOL: Related items

Linux for bioinformatician !!!

Rahul Nayak — Thu, 13 Mar 2014 16:59:26 -0500

Linux, a free operating system for computers, provides several powerful admin tools and utilities that help you manage your systems effectively and handle huge amounts of genomic and biological data with ease. The field of bioinformatics relies heavily on Linux-based computers and software, although most bioinformatics programs can also be compiled to run on other operating systems. If you don’t know what these not-so-user-friendly tools are and how to use them, you could spend a lot of time trying to perform even basic admin tasks. The focus of this Linux series is to help you understand system administration as well as the basic tools, which will help you become an effective bioinformatician and computational biologist.

To learn about Linux and its importance to bioinformaticians, please read the article "An introduction to Linux for bioinformatics" by Paul Stothard.

Linux cheat sheet at http://bioinformaticsonline.com/file/view/87/linux-cheat-sheet

Please browse the right-hand side for further useful Linux pages.

SLURM Commands

Shruti Paniwala — Wed, 06 Jul 2022 07:40:07 -0500

The following table shows SLURM commands on the SOE cluster.

Command	Description
sbatch	Submit batch scripts to the cluster
scancel	Signal jobs or job steps that are under the control of SLURM
sinfo	View information about SLURM nodes and partitions
squeue	View information about jobs located in the SLURM scheduling queue
smap	Graphically view information about SLURM jobs, partitions, and set configuration parameters
sqlog	View information about running and finished jobs
sacct	View resource accounting information for finished and running jobs
sstat	View resource accounting information for running jobs

For more information, run man on the commands above. See some examples below.

1. Info about the partitions and nodes
List all the partitions available to you and the nodes therein:

sinfo

Nodes in state idle can accept new jobs.

Show a partition configuration, for example, for SOE_main:

scontrol show partition=SOE_main

Show current info about a specific node (replace NODE_NAME with the name of the node):

scontrol show node=NODE_NAME

You can also specify a group of nodes in the command above. For example, if your MPI job is running across soenode05,06,35,36, you can execute the command below to get the info on the nodes you are interested in:

scontrol show node=soenode[05-06,35-36]

An informative parameter in the output to look at would be CPULoad. It allows you to see how your application utilizes the CPUs on the running nodes.
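
To look at CPULoad for several nodes at once, the scontrol output can be filtered. The sketch below assumes the one-record-per-line output produced by scontrol's -o (--oneliner) option; the function name node_cpu_load is made up for this example and is not a SLURM command.

```shell
# Read `scontrol -o show node=...` output on stdin and
# print one "node_name cpu_load" pair per node record.
node_cpu_load() {
    awk '{
        name = ""; load = ""
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^NodeName=/) name = substr($i, 10)   # text after "NodeName="
            if ($i ~ /^CPULoad=/)  load = substr($i, 9)    # text after "CPULoad="
        }
        if (name != "") print name, load
    }'
}

# Intended use on the cluster (the node list is quoted so the shell
# does not try to expand the brackets):
#   scontrol -o show node='soenode[05-06,35-36]' | node_cpu_load
```

A load close to the number of cores requested on a node suggests the application is keeping those cores busy; a much lower value may mean the job is waiting on I/O or using fewer threads than it asked for.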

2. Submit scripts
The header in a submit script specifies the job name, partition (queue), time limit, memory allocation, number of nodes, number of cores, and the files that collect standard output and error at run time, for example:

#!/bin/bash

#SBATCH --job-name=OMP_run     # job name, "OMP_run"
#SBATCH --partition=SOE_main   # partition (queue)
#SBATCH -t 0-2:00              # time limit: (D-HH:MM) 
#SBATCH --mem=32000            # memory per node in MB 
#SBATCH --nodes=1              # number of nodes
#SBATCH --ntasks-per-node=16   # number of cores
#SBATCH --output=slurm.out     # file to collect standard output
#SBATCH --error=slurm.err      # file to collect standard errors
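
The time limit in the header above is written as D-HH:MM. As a quick sanity check when choosing a limit, a small helper can convert such a value to minutes. This is only a sketch: the function name slurm_time_to_minutes is hypothetical, and it handles only the D-HH:MM form used here, not the other time formats sbatch accepts.

```shell
# Convert a SLURM time limit written as D-HH:MM into minutes.
# Only the D-HH:MM form is handled; other sbatch time formats are not.
slurm_time_to_minutes() {
    echo "$1" | awk -F'[-:]' '{ print $1 * 1440 + $2 * 60 + $3 }'
}

slurm_time_to_minutes 0-2:00    # the limit requested in the header above
```

For the header above this gives 120 minutes, that is, two hours.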

If the time limit is not specified in the submit script, SLURM will assign the default run time of 3 days, which means SLURM will terminate the job after 72 hours. The maximum allowed run time is two weeks, 14-0:00.
If the memory limit is not requested, SLURM will assign the default of 16 GB. The maximum allowed memory per node is 128 GB. To see how much RAM per node your job is using, you can run the sacct or sstat commands to query MaxRSS for the job on the node; see the examples below.
Depending on the type of application you need to run, the submit script may contain commands to create a temporary space on a computational node; see the discussion about using the file systems on the cluster.
The script then sets the environment specific to the application and starts the application on one or multiple nodes; see the sbatch sample scripts in the directory /usr/local/Samples on soemaster1.hpc.rutgers.edu.
You can submit your job to the cluster with the sbatch command:

sbatch myscript.sh
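
As an illustration of what myscript.sh might contain after the header, here is a minimal sketch for an OpenMP run. It is not one of the cluster's sample scripts: the scratch location (TMPDIR or /tmp) and the application name my_openmp_app are assumptions, and the application line is left commented out. SLURM_JOB_ID, SLURM_NTASKS_PER_NODE and SLURM_SUBMIT_DIR are environment variables that SLURM sets for a running job.

```shell
#!/bin/bash

#SBATCH --job-name=OMP_run     # job name
#SBATCH --partition=SOE_main   # partition (queue)
#SBATCH -t 0-2:00              # time limit: (D-HH:MM)
#SBATCH --mem=32000            # memory per node in MB
#SBATCH --nodes=1              # number of nodes
#SBATCH --ntasks-per-node=16   # number of cores
#SBATCH --output=slurm.out     # file to collect standard output
#SBATCH --error=slurm.err      # file to collect standard errors

# Temporary space on the compute node. The base directory is an
# assumption; use the location recommended for the cluster.
SCRATCH=$(mktemp -d "${TMPDIR:-/tmp}/job_${SLURM_JOB_ID:-manual}_XXXXXX")
trap 'rm -rf "$SCRATCH"' EXIT   # remove the directory when the script ends

# One OpenMP thread per requested core; falls back to 1 outside SLURM.
export OMP_NUM_THREADS="${SLURM_NTASKS_PER_NODE:-1}"

cd "$SCRATCH" || exit 1
echo "Job ${SLURM_JOB_ID:-manual}: $OMP_NUM_THREADS threads in $SCRATCH"

# Start the application (hypothetical binary in the submit directory):
# "$SLURM_SUBMIT_DIR"/my_openmp_app > "$SLURM_SUBMIT_DIR"/result.txt
```

The trap removes the scratch directory when the script exits normally; if the job is cancelled, leftover directories may still need the site's own cleanup.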

3. Query job information
List all currently submitted jobs in running and pending states for a user (replace USERNAME with the user's login name):

squeue -u USERNAME

The squeue command can be run with format options to expose specific information, for example, when pending job #706 is scheduled to start running:

squeue -j 706 --format="%S"

START_TIME
2015-04-30T09:54:32

More info can be shown by placing additional format options, for example:

squeue -j 706 --format="%i %P %j %u %T %l %C %S"

JOBID PARTITION   NAME    USER STATE   TIMELIMIT  CPUS START_TIME
706   SOE_main  Par_job_3 mike PENDING 3-00:00:00 64   2015-04-30T09:54:32
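
The same format options are handy in scripts. The sketch below counts jobs per state from a list of state names, one per line, as printed by squeue with the %T format and the -h (no header) option. The function name count_job_states is made up for this example.

```shell
# Read job states (one per line) on stdin and print "STATE count" lines.
count_job_states() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Intended use on the cluster:
#   squeue -h -u "$USER" --format="%T" | count_job_states
```

Keeping the counting separate from the squeue call means the function can be tried on any list of states before it is pointed at the live queue.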

To see when all the jobs pending in the queue are scheduled to start:

squeue --start

List all running and completed jobs for a user, or show the record for a single job (USERNAME and JOBID are placeholders):

sqlog -u USERNAME

sqlog -j JOBID

The following abbreviations are used for the job states:

       CA   CANCELLED      Job was cancelled.

       CD   COMPLETED      Job completed normally.

       CG   COMPLETING     Job is in the process of completing.

       F    FAILED         Job terminated abnormally.

       NF   NODE_FAIL      Job terminated due to node failure.

       PD   PENDING        Job is pending allocation.

       R    RUNNING        Job currently has an allocation.

       S    SUSPENDED      Job is suspended.

       TO   TIMEOUT        Job terminated upon reaching its time limit.

You can specify the fields you would like to see in the output of sqlog:

sqlog --format=list

The command below, for example, provides Job ID, user name, exit state, start date-time, and end date-time for job #2831:

sqlog -j 2831 --format=jid,user,state,start,end

List status info for a currently running job (replace JOBID with the job's ID):

sstat -j JOBID

Formatted output can be used to show only specific information, for example, the maximum resident memory (MaxRSS) used on a node:

sstat --format="JobID,MaxRSS" -j JOBID

To get statistics on completed jobs by job ID:

sacct --format="JobID,JobName,MaxRSS,Elapsed" -j JOBID

To view the same information for all jobs of a user (replace USERNAME with the user's login name):

sacct --format="JobID,JobName,MaxRSS,Elapsed" -u USERNAME

To print a list of fields that can be specified with the --format option:

sacct --helpformat

For example, to get Job ID, Job name, Exit state, start date-time, and end date-time for job #2831:

sacct -j 2831 --format="JobID,JobName,State,Start,End"
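
The MaxRSS values reported by sstat and sacct are easier to compare with the --mem request (given in MB) once they are converted to MB. The helper below is a sketch under two assumptions: values carry a K, M, or G suffix, and a bare number is in bytes. The function name rss_to_mb is hypothetical.

```shell
# Read MaxRSS values (one per line, e.g. 2048K, 512M, 16G) on stdin
# and print each one in MB. Blank lines are skipped.
rss_to_mb() {
    awk 'NF == 0 { next }
    {
        v = $1
        u = substr(v, length(v))            # last character: unit suffix, if any
        if (u ~ /[KMG]/) {
            n = substr(v, 1, length(v) - 1)
        } else {
            n = v; u = "B"                  # assumption: bare number means bytes
        }
        if (u == "K")      mb = n / 1024
        else if (u == "M") mb = n
        else if (u == "G") mb = n * 1024
        else               mb = n / 1048576
        printf "%.1f\n", mb
    }'
}

# Intended use on the cluster (--noheader drops the column title):
#   sacct -j 2831 --format=MaxRSS --noheader | rss_to_mb
```

If the converted value is far below the requested memory, the next submission can ask for less; if it is close to the request, raising --mem guards against the job running out of memory.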

Another useful command to gain information about a running job is scontrol (replace JOBID with the job's ID):

scontrol show job=JOBID

4. Cancel a job
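To cancel one job (replace JOBID with the job's ID):

scancel JOBID

To cancel one job and delete the TMP directory created by the submit script on a node:

sdel JOBID

To cancel all the jobs for a user (replace USERNAME with the user's login name):

scancel -u USERNAME

To cancel one or more jobs by name (replace JOB_NAME with the job name):

scancel --name JOB_NAME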

MGRA: Breakpoint graphs and ancestral genome reconstructions

Jit — Tue, 25 Jul 2017 08:48:25 -0500

MGRA (Multiple Genome Rearrangements and Ancestors) is a tool for reconstruction of ancestor genomes and evolutionary history of extant genomes.

It takes as an input a set of genomes represented as sequences of genes (or synteny blocks) and produces such sequences for ancestral genomes at the internal nodes of the phylogenetic tree.

The phylogenetic tree may be specified completely or partially; in the latter case, MGRA can reconstruct conserved ancestral regions (CARs) of the ancestral genome of interest.

Since version 2, MGRA supports gene insertions and deletions in addition to genome rearrangements and allows the input genomes to have different gene content.

It can also reconstruct the most plausible phylogenetic tree based on rearrangement characters.

Address of the bookmark: http://mgra.cblab.org/

Genomicus: genome browser that enables users to navigate in genomes in several dimensions

Jit — Sat, 18 Nov 2017 16:10:16 -0600

Genomicus is a genome browser that enables users to navigate in genomes in several dimensions: linearly along chromosome axes, transversally across different species, and chronologically along evolutionary time.

Once a query gene has been entered, it is displayed in its genomic context in parallel to the genomic context of all its orthologous and paralogous copies in all the other sequenced metazoan genomes. Moreover, Genomicus stores and displays the predicted ancestral genome structure in all the ancestral species within the phylogenetic range of interest.

All the data on extant species displayed in this browser are from Ensembl.

Address of the bookmark: http://genomicus.biologie.ens.fr/genomicus-90.01/cgi-bin/search.pl

Scripts for the analysis of HGT in genome sequence data.

Jit — Wed, 29 Nov 2017 16:44:10 -0600

Scripts for the analysis of HGT in genome sequence data

Address of the bookmark: https://github.com/reubwn/hgt

kSNP3.0: SNP detection and phylogenetic analysis of genomes without genome alignment or reference genome

Jit — Fri, 08 Dec 2017 16:48:40 -0600

Sept. 20, 2017 Version 3.1 released. Major upgrade. Version 3.1 fixes the problems with SNP annotation that arose when NCBI discontinued use of GI numbers. Please read carefully the Preface (page 3) and the File of annotated genomes section (pages 9-10) in the version 3.1 User Guide. Thanks to Tom Slezak for revising the get_genbank_file3 script and to Tod Stuber (USDA) for testing version 3.1 even though he doesn't need the annotation feature. All users are encouraged to upgrade to version 3.1.

Address of the bookmark: https://sourceforge.net/projects/ksnp/files/

String graph based genome assembly software and tools !

Rahul Nayak — Tue, 19 Dec 2017 17:17:38 -0600

In graph theory, a string graph is an intersection graph of curves in the plane; each curve is called a "string". In genome assembly, the term refers to a different structure, the fragment assembly string graph, which was first proposed by E. W. Myers in a 2005 publication. A recent Genome Research paper describing an innovative approach for assembling large genomes from NGS data caught our attention for several reasons: i) it gives a different, "string graph" perspective on the long-standing genome assembly problem; ii) the paper is coauthored by Jared Simpson, the developer of the ABySS assembler, and Richard Durbin; iii) the Simpson-Durbin algorithm does not rely on de Bruijn graphs and instead employs a different graph construction approach called the ‘string graph’.

The following genome assembly tools are based on string graphs:

1. SGA (String Graph Assembler) https://github.com/jts/sga

Assembles large genomes from high coverage short read data. SGA is designed as a modular set of programs, which are used to form an assembly pipeline. SGA implements a set of assembly algorithms based on the FM-index. As the FM-index is a compressed data structure, the algorithms are very memory efficient. The SGA assembly has three distinct phases. The first phase corrects base calling errors in the reads. The second phase assembles contigs from the corrected reads. The third phase uses paired end and/or mate pair data to build scaffolds from the contigs. The output of this software is a PDF report that allows the properties of the genome and data quality to be visually explored. By providing more information to the user at the start of an assembly project, this software will help increase awareness of the factors that make a given assembly easy or difficult, assist in the selection of software and parameters and help to troubleshoot an assembly if it runs into problems.

2. SAGE: String-overlap Assembly of GEnomes https://github.com/lucian-ilie/SAGE2

SAGE is a tool for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graph and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers.

3. FSG: Fast String Graph

The new integrated assembler has been assessed on a standard benchmark, showing that fast string graph (FSG) is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads. Moreover, we have studied the effect of coverage rates on the running times.

4. BASE https://github.com/dhlbh/BASE

It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove the branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.

5. Fermi https://github.com/lh3/fermi/

Fermi is a de novo assembler with a particular focus on assembling Illumina short sequence reads from a mammal-sized genome. In addition to the role of a typical assembler, fermi also aims to preserve heterozygotes which are often collapsed by other assemblers. Its ultimate goal is to find a minimal set of unitigs to represent all the information in raw reads.

If you want to learn about string graph assemblers, please read the following papers:

i) The Fragment Assembly String Graph - E. W. Myers

This paper describes the String Graph concept.

ii) Efficient construction of an assembly string graph using the FM-index - Jared T. Simpson and Richard Durbin

This is the earlier paper from Simpson and Durbin.

iii) Efficient de novo assembly of large genomes using compressed data structures - Jared T. Simpson and Richard Durbin

AliTV—interactive visualization of whole genome comparisons

Jit — Wed, 10 Jan 2018 07:08:17 -0600

AliTV provides interactive visualization of whole genome alignments. AliTV reads multiple whole genome alignments or automatically generates alignments from the provided data. Optional feature annotations and phylogenetic information are supported. The user-friendly, web-browser-based, and highly customizable interface allows rapid exploration and manipulation of the visualized data as well as the export of publication-ready high-quality figures. AliTV is freely available at https://github.com/AliTVTeam/AliTV

https://alitvteam.github.io/AliTV/

Address of the bookmark: https://github.com/AliTVTeam/AliTV

GenomeTools: The versatile open source genome analysis software

Jit — Wed, 07 Feb 2018 10:44:18 -0600

The GenomeTools genome analysis system is a free collection of bioinformatics tools (in the realm of genome informatics) combined into a single binary named gt. It is based on a C library named “libgenometools” which consists of several modules.

If you are interested in gene prediction, have a look at GenomeThreader.

Address of the bookmark: http://genometools.org/

AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references

Manisha Mishra — Tue, 17 Apr 2018 16:21:20 -0500

AlignGraph is software that extends and joins contigs or scaffolds by reassembling them with the help of a reference genome of a closely related organism.

Using AlignGraph

AlignGraph --read1 reads_1.fa --read2 reads_2.fa --contig contigs.fa --genome genome.fa --distanceLow distanceLow --distanceHigh distanceHigh --extendedContig extendedContigs.fa --remainingContig remainingContigs.fa [--kMer k --insertVariation insertVariation --coverage coverage --part p --fastMap --ratioCheck --iterativeMap --misassemblyRemoval --resume]

Address of the bookmark: https://github.com/baoe/AlignGraph