BOL: Related items

Monitor running jobs on Linux server

Jitendra Narayan — Fri, 06 Jun 2014 16:18:43 -0500

As a bioinformatician, you run lots of programs on your servers, and sometimes a shared server is also used by your colleagues. When the server is busy, you need to check which programs are running and monitor them. The "top" command comes in handy when you need to find out whether things are still running, how long they’ve been running, or how much memory is being used.

‘top’ is very simple to run: type

%% top

You’ll get a screen that looks like this, and is updated regularly:

Simple, right? Heh.

First! Note that you can use ‘q’ or ‘CTRL-C’ to exit from ‘top’.

Now let’s read and understand each line in turn.

The first line:

top - 23:00:48 up 39 days, 2 user, load average: 0.00, 0.00, 0.00

The first line tells you the current time, how long the machine has been up, how many users are logged in, and the compute load on the machine averaged over the last 1, 5, and 15 minutes. If you run something for a long time, you’ll see these numbers go up. Right now, the machine is basically just sitting there, so these are all close to 0.
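If you want the same three numbers inside a script, here is a minimal sketch, assuming Python 3 on Linux. The helper name `describe_load` and the idea of dividing by the core count are illustrative assumptions, not something `top` does for you.

```python
import os

def describe_load(load_averages, n_cores):
    """Divide the 1-, 5- and 15-minute load averages by the number of cores."""
    return [round(load / n_cores, 2) for load in load_averages]

# os.getloadavg() returns the same three values top prints on its first line.
one, five, fifteen = os.getloadavg()
cores = os.cpu_count() or 1
print("load averages:", one, five, fifteen)
print("per-core load:", describe_load((one, five, fifteen), cores))
```

A per-core value close to or above 1.0 suggests the machine is fully occupied.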

The second line:

Tasks: 239 total,   1 running, 238 sleeping,   0 stopped,   0 zombie

This line tells you how many processes exist and what state they are in. On a laptop it’s not so interesting, because you really are the only one using the machine.

Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

This line contains the CPU load. The first two numbers are how busy the system is doing computation (“us” stands for “user”) and how busy the system is doing system-y things like accessing disks or network (“sy” stands for “system”). We’ll talk more about this later.
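To pull these numbers out programmatically, a small sketch in Python 3 follows. It assumes the older `0.0%us` layout shown above (newer versions of top print the fields slightly differently), and `parse_cpu_line` is a hypothetical helper name.

```python
import re

def parse_cpu_line(line):
    """Turn a top 'Cpu(s):' line into a {field: percent} dictionary."""
    # Each field looks like '0.0%us': a number, a percent sign, a two-letter code.
    return {code: float(value)
            for value, code in re.findall(r"([\d.]+)%\s*([a-z]{2})", line)}

sample = "Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st"
cpu = parse_cpu_line(sample)
print("busy (user + system):", cpu["us"] + cpu["sy"])
print("idle:", cpu["id"])
```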

Mem:   49457320k total,    3492174k used, 14535596k free,    1435148k buffers

This should be easy to understand – how much memory you’re using!
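As a rough sketch (Python 3, with the hypothetical helper name `parse_memory_line`), the line can be split into labelled values and turned into a percentage. The same function also works on the Swap line.

```python
import re

def parse_memory_line(line):
    """Parse a top 'Mem:' or 'Swap:' line into {label: kilobytes}."""
    return {label: int(kb) for kb, label in re.findall(r"(\d+)k\s+(\w+)", line)}

sample = "Mem:   49457320k total,    3492174k used, 14535596k free,    1435148k buffers"
mem = parse_memory_line(sample)
# Values are in kilobytes; dividing by 1024 twice gives GiB.
print("used: %.1f%% of %.1f GiB" % (100.0 * mem["used"] / mem["total"],
                                    mem["total"] / 1024.0 ** 2))
```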

Swap:   539356k total,   28332k used,   836562k free,    29862014k cached

Swap is just on-disk memory that can be used to “swap” out programs from main memory. Again, we’ll talk about this later. Then comes the process list:

PID USER      PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
1 root      39 19 0 0 0 S 0.0 0.0   246:57.22 kipmi0
2 root      RT   0     0    0    0 S 0.0 0.0   0:00.00 migration/0

And... finally! What’s actually running! The two most important numbers are the %CPU and %MEM towards the right, as well as the COMMAND. This tells you how compute- and memory-intensive your program is. Right now, nothing’s running so the numbers aren’t very interesting, but just wait until we run something...
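If you would rather sort or filter this table in a script, here is a minimal sketch in Python 3. `parse_process_rows` is a hypothetical helper, and it assumes the column layout shown above.

```python
def parse_process_rows(text):
    """Parse top's process table (header line plus rows) into dictionaries."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # COMMAND is the last column and may contain spaces, so limit the split.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows

sample = """\
PID USER      PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
1 root      39 19 0 0 0 S 0.0 0.0   246:57.22 kipmi0
2 root      RT   0     0    0    0 S 0.0 0.0   0:00.00 migration/0
"""
rows = parse_process_rows(sample)
# Show the most CPU-hungry processes first.
for proc in sorted(rows, key=lambda p: float(p["%CPU"]), reverse=True):
    print(proc["PID"], proc["USER"], proc["%CPU"], proc["%MEM"], proc["COMMAND"])
```

On a live system you could feed it the output of `top -b -n 1` (batch mode, one iteration) after skipping the summary lines at the top.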

Public Databases for Bioinformatics !

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access at: dhara:5000
Has btsync server (try it - it's much better than Dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 intel cores
256GB RAM
500GB of SSD storage 
36TB RAID6 local storage
4 Intel Phis (space for 4 more GPUs)


durga: Montgomery and sensitive data
24 intel cores
256GB RAM
500GB of SSD RAID0 storage 
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 cores
256GB RAM 
500GB of SSD RAID0 storage 
36TB RAID6 local storage

vayu: Kundaje GPU server
4 cores
64GB RAM 
200GB of SSD storage 
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD cores
128GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD cores
256GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you don't want this to count towards your quota you must chown

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on RAID, and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       i.e. if the hard drives fail or you delete something and notice 
       within 24 hours we can recover. Otherwise not. (vs home which is 
       properly backed up )  
intermediate analysis products that would be hard to recover should be stored here 
       e.g. stochastic analysis results that need to be kept so that paper 
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP
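# the same idea in python:
A rough Python 3 equivalent of the one-liner above, as a sketch only. The names `gzip_file` and `gzip_all` and the default of 4 workers are illustrative assumptions. It uses a thread pool because the zlib compression work runs outside Python's global lock, so several files can be compressed at once.

```python
import glob
import gzip
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def gzip_file(path):
    """Compress path to path + '.gz' and remove the original, like gzip does."""
    with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return path + ".gz"

def gzip_all(pattern, workers=4):
    """Compress every file matching a glob pattern, several at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(gzip_file, glob.glob(pattern)))

if __name__ == "__main__":
    # '*.FILESTOGZIP' is the placeholder pattern from the command above.
    print(gzip_all("*.FILESTOGZIP"))
```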

# fork example in python:
(for more detailed examples look at 
 https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py)

import os
import time
import random

import multiprocessing

class ProcessSafeOPStream( object ):
    def __init__( self, writeable_obj ):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name
        return
    
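    def write( self, data ):
        # hold the lock while writing so lines from different processes
        # are not interleaved; 'with' releases it even if write() fails
        with self.lock:
            self.writeable_obj.write( data )
            self.writeable_obj.flush()
        return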
    
    def close( self ):
        self.writeable_obj.close()

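def worker(queue, ofp):
    # Forked children inherit the parent's random state; re-seed so they
    # do not all draw the same numbers. Try without this (newer Python 3
    # releases re-seed automatically after a fork).
    random.seed()
    while True:
        i = queue.get()
        if i == 'FINISHED': return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        print(i, x)
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue, followed by one 'FINISHED' sentinel per worker
todo = multiprocessing.Queue()
for i in range(NSIMS): todo.put(i)
for i in range(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream( open("output.txt", "w") )

pids = []
for i in range(NPROC):
    pid = os.fork()
    if pid == 0:
        # child: work through the queue, then exit immediately
        worker(todo, ofp)
        os._exit(0)
    else:
        # parent: remember the child so we can wait for it
        pids.append(pid)

# wait for every child to finish
for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print("FINISHED")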
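
# the same pattern with a process pool:
For comparison, here is a sketch of the same pattern using `multiprocessing.Pool` in Python 3, which handles the forking, the task queue and the sentinels for you. The function names `simulate` and `run_simulations` are hypothetical, and the "fork" start method is requested explicitly, so this assumes Linux or another Unix.

```python
import multiprocessing
import random
import time

def simulate(i):
    """Stand-in for an expensive function; returns the task index and a result."""
    x = random.random()
    time.sleep(x / 1000)
    return i, x

def run_simulations(nsims, nproc, out_path):
    """Run nsims tasks on nproc forked workers; only the parent writes results."""
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(nproc) as pool, open(out_path, "w") as ofp:
        # imap_unordered yields results as they finish. Because a single
        # process does all the writing, no lock around the file is needed.
        for i, x in pool.imap_unordered(simulate, range(nsims)):
            ofp.write("%i\t%s\n" % (i, x))

if __name__ == "__main__":
    run_simulations(nsims=100, nproc=4, out_path="output.txt")
    print("FINISHED")
```

The trade-off is that results travel back to the parent before being written, which is fine for small results but may not suit workers that produce a lot of output.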

For use case 1 we obtained the following ENCODE and ROADMAP datasets https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz.

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as well as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.

Address of the bookmark: http://mitra.stanford.edu/

Assistant Professor in Bioinformatics at Dr. D. Y. Patil Biotechnology & Bioinformatics Institute

Tue, 03 Jun 2014 19:54:15 -0500

Dr. D. Y. Patil Biotechnology & Bioinformatics Institute
Tathawade, Pune 411033.

Assistant Professor in Bioinformatics

Essential :
First Class Master’s Degree in the appropriate branch of Life Sciences / Technology (Tech.)
OR
Ph.D in Life Sciences or in the respective subject area of specialization
OR
Good Academic record with at least 55% marks (or an equivalent grade) at the Master’s Degree level, in the relevant subject or an equivalent degree from an Indian / Foreign University.
Besides fulfilling the above qualifications, candidates should have cleared the eligibility test (NET) for lecturers conducted by the UGC, CSIR or similar test accredited by the UGC and as per the requirements of UGC guidelines.

Desirable :
Teaching, research, industrial and/or professional experience in a reputed organization.
Papers presented at conferences and/or published in refereed journals.

Note : Applications are invited in the prescribed application form.
Kindly send your application to “Registrar, Dr. D. Y. Patil Vidyapeeth, Pune, Sant Tukaram Nagar, Pimpri, Pune – 411018, Maharashtra, India.” It should reach the University office within 15 days of publication.

More Info: http://www.dpu.edu.in/BiotechResearchPositions.aspx

Ribbon: Visualizing complex genome alignments and structural variation:

Jit — Wed, 29 Nov 2017 07:40:22 -0600

Ribbon can be used for long reads, short reads, paired-end reads, and assembly/genome alignments. Instructions for each data format are available by clicking on "instructions" in each tab on the right.

Local installation:

You can install Ribbon locally from Github by following the instructions here: https://github.com/MariaNattestad/Ribbon

Address of the bookmark: http://genomeribbon.com/

Faculty post at Zhejiang University

Tue, 10 Jun 2014 03:40:40 -0500

Zhejiang University (ZJU) is seeking faculty candidates for its newly launched, highly competitive and well-funded “Hundred Talents Program”. This search covers all colleges and departments at ZJU. Applicants, expected to be about 35 years old, should hold a PhD degree, and postdoctoral experience is preferred for applicants in most fields. Applicants should have demonstrated commitment to excellence in teaching and research at a level comparable to the academic achievement of an assistant professor or associate professor at world-renowned universities. Successful candidates must work full-time and are expected to establish an internationally competitive and independent research program in cutting-edge areas of the relevant field at ZJU.

As one of the leading research-intensive universities in China, ZJU is located in the beautiful city of Hangzhou. Successful candidates will be employed as Principal Investigators and are qualified to supervise doctoral students. ZJU will offer an internationally competitive salary and the opportunity to purchase a university apartment at a price much lower than the market price, and will provide office and laboratory space as well as internationally competitive research startup packages.

Qualified applicants are strongly encouraged to submit their applications electronically to tr@zju.edu.cn. Applicants should include the following materials in pdf format: a comprehensive CV, a statement of research and teaching plan, and a list of 3 to 5 references with detailed contact information.

Contact: Talents Office, ZJU

Tel: +86-571-88981345, +86-571-88981390

Fax: +86-571-88981976

E-mail: tr@zju.edu.cn

jobTree based python wrapper to run the genome simulation tool suite Evolver

Jit — Fri, 08 Dec 2017 16:26:32 -0600

evolverSimControl (eSC) can be used to simulate multi-chromosome genome evolution on an arbitrary phylogeny (Newick format). In addition to simply running evolver, eSC also automatically creates statistical summaries of the simulation as it runs including text and image files. Also included are convenience scripts to: check on a running simulation and see detailed status and logging information; extract fasta sequence files from the leaf nodes of a completed simulation; extract pairwise multiple alignment files (.maf) from leaf and branch nodes from a completed simulation and with the help of mafJoin, join them together into a single maf covering the entire simulation.

Address of the bookmark: https://github.com/dentearl/evolverSimControl

XAMPP: Starting Apache fail Ubuntu

Ram Yash Pal — Sat, 07 Jun 2014 05:52:35 -0500

Once you install XAMPP on Linux, the most common problem you face is Apache failing to start. To fix the issue, use the following commands to first stop the system services and then start XAMPP again.

sudo /etc/init.d/apache2 stop

sudo /etc/init.d/mysql stop

sudo /etc/init.d/proftpd stop

sudo /opt/lampp/lampp start
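
Before stopping services, you may want to check which ports are already taken. The sketch below assumes the failure comes from a port conflict with the system services, and it assumes the usual default ports (80, 3306, 21). `port_in_use` is a hypothetical helper written in Python 3.

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is taken.
        return sock.connect_ex((host, port)) == 0

# Default ports are an assumption: Apache 80, MySQL 3306, ProFTPD 21.
for name, port in (("Apache", 80), ("MySQL", 3306), ("ProFTPD", 21)):
    print(name, port, "in use" if port_in_use(port) else "free")
```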

PhpMyAdmin “Wrong permissions on configuration file, should not be world writable!”

Once XAMPP is installed, the phpMyAdmin configuration file may have been left world-writable, which triggers this error. Try the following step:

Just chmod the file to 0755:

sudo chmod 0755 config.inc.php
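
The same check and fix can be scripted. This is a minimal sketch in Python 3; the helper names `is_world_writable` and `fix_permissions` are illustrative, and you still need sufficient privileges to change the file.

```python
import os
import stat

def is_world_writable(path):
    """Return True if 'other' users have write permission on path."""
    return bool(os.stat(path).st_mode & stat.S_IWOTH)

def fix_permissions(path, mode=0o755):
    """Apply the mode from the chmod command above if the file is world-writable."""
    if is_world_writable(path):
        os.chmod(path, mode)
        return True
    return False
```

For example, `fix_permissions("config.inc.php")` returns True if it had to change the mode and False if the file was already fine.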

String graph based genome assembly software and tools !

Rahul Nayak — Tue, 19 Dec 2017 17:17:38 -0600

In genome assembly, a string graph is an overlap graph of the reads from which contained reads and transitive (redundant) edges have been removed. (In graph theory the same name is used for something unrelated: an intersection graph of curves in the plane, where each curve is called a "string".) The assembly string graph was first proposed by E. W. Myers in a 2005 publication. A recent Genome Research paper describing an innovative approach for assembling large genomes from NGS data caught our attention for several reasons: i) it gives a different, "string graph" perspective on the long-standing genome assembly problem; ii) the paper is coauthored by Jared Simpson, the developer of the ABySS assembler, and Richard Durbin; iii) the Simpson-Durbin algorithm does not rely on de Bruijn graphs, and instead employs a different graph construction approach called a ‘string graph’.
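
To make the assembly idea concrete, here is a toy sketch in Python 3: build an overlap graph from suffix-prefix overlaps, then drop edges that are implied by a two-step path. This is a simplification under stated assumptions (forward strand only, no contained reads, purely graph-based transitive reduction) and is not the algorithm of any of the tools listed here. The reads and function names are made up for illustration.

```python
def overlap(a, b, min_len):
    """Length of the longest proper suffix of a that is a prefix of b, or 0."""
    for n in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def overlap_graph(reads, min_len):
    """Directed edges a -> b for every pair of distinct reads that overlap."""
    return {(a, b) for a in reads for b in reads
            if a != b and overlap(a, b, min_len) > 0}

def transitive_reduction(edges):
    """Drop a -> c whenever a -> b and b -> c are both present."""
    nodes = {node for edge in edges for node in edge}
    return {(a, c) for (a, c) in edges
            if not any((a, b) in edges and (b, c) in edges for b in nodes)}

# Three reads sampled from the made-up sequence ACGTACGGTT.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
edges = overlap_graph(reads, min_len=2)
reduced = transitive_reduction(edges)
print(sorted(edges))
print(sorted(reduced))
```

The full overlap graph has three edges; after reduction only the chain ACGTAC -> GTACGG -> ACGGTT remains, which spells out the original sequence.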

Following are the genome assembly tools based on string graph:

1. SGA (String Graph Assembler) https://github.com/jts/sga

Assembles large genomes from high coverage short read data. SGA is designed as a modular set of programs, which are used to form an assembly pipeline. SGA implements a set of assembly algorithms based on the FM-index. As the FM-index is a compressed data structure, the algorithms are very memory efficient. The SGA assembly has three distinct phases. The first phase corrects base calling errors in the reads. The second phase assembles contigs from the corrected reads. The third phase uses paired end and/or mate pair data to build scaffolds from the contigs. The output of this software is a PDF report that allows the properties of the genome and data quality to be visually explored. By providing more information to the user at the start of an assembly project, this software will help increase awareness of the factors that make a given assembly easy or difficult, assist in the selection of software and parameters and help to troubleshoot an assembly if it runs into problems.
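
For readers who have not met the FM-index, here is a toy sketch in Python 3 of its core operation, backward search over the Burrows-Wheeler transform. It is purely illustrative and is not SGA's implementation: a real index stores sampled counts in compressed form instead of scanning the string, and the function names here are hypothetical.

```python
def bwt_from_suffix_array(text):
    """Return the suffix array and Burrows-Wheeler transform of text + '$'."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    return sa, "".join(text[i - 1] for i in sa)

def count_occurrences(bwt, pattern):
    """Count occurrences of pattern in the original text by backward search."""
    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for c in sorted(set(bwt)):
        first[c] = total
        total += bwt.count(c)

    def occ(c, i):
        # Number of c in bwt[:i]; a real FM-index looks this up in O(1).
        return bwt[:i].count(c)

    lo, hi = 0, len(bwt)  # half-open range of matching sorted suffixes
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo, hi = first[c] + occ(c, lo), first[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

sa, bwt = bwt_from_suffix_array("ACAC")
print(bwt, count_occurrences(bwt, "AC"))
```

The pattern is processed from its last character to its first, narrowing a range of sorted suffixes at each step, so the work depends on the pattern length rather than the text length.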

2. SAGE: String-overlap Assembly of GEnomes https://github.com/lucian-ilie/SAGE2

SAGE is a tool for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graph and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers.

3. FSG: Fast String Graph

The new integrated assembler has been assessed on a standard benchmark, showing that fast string graph (FSG) is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages in running FSG on multiple threads. Moreover, we have studied the effect of coverage rates on the running times.

4. BASE https://github.com/dhlbh/BASE

BASE enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove the branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.

5. Fermi https://github.com/lh3/fermi/

Fermi is a de novo assembler with a particular focus on assembling Illumina short sequence reads from a mammal-sized genome. In addition to the role of a typical assembler, fermi also aims to preserve heterozygotes which are often collapsed by other assemblers. Its ultimate goal is to find a minimal set of unitigs to represent all the information in raw reads.

If you want to learn about string graph assemblers, please read the following papers:

i) The Fragment Assembly String Graph - E. W. Myers

This paper describes the String Graph concept.

ii) Efficient construction of an assembly string graph using the FM-index - Jared T. Simpson and Richard Durbin

This is the earlier paper from Simpson and Durbin, on constructing the string graph with the FM-index.

iii) Efficient de novo assembly of large genomes using compressed data structures - Jared T. Simpson and Richard Durbin

Faculty Positions at Central University of Punjab

Mon, 07 Jul 2014 23:33:33 -0500

Faculty Positions: Rolling/Open Advertisement Advt.No: T-10 (2013)

Pay Scale: Pay Band Rs.15600-39100 with AGP of Rs.6,000/-

Essential Qualifications for Professors, Associate Professors, and Assistant Professors: As per “UGC REGULATIONS ON MINIMUM QUALIFICATIONS FOR APPOINTMENT OF TEACHERS AND OTHER ACADEMIC STAFF IN UNIVERSITIES AND COLLEGES AND MEASURES FOR THE MAINTENANCE OF STANDARDS IN HIGHER EDUCATION 2010” and the 2nd Amendments to the regulation issued in June 2013.

For details: http://www.ugc.ac.in/oldpdf/regulations/revised_finalugcregulationfinal10.pdf http://www.ugc.ac.in/pdfnews/8539300_English.pdf and University rules.

Procedure to apply:

Application forms, along with the API form, complete in all respects and accompanied by the necessary documents and an application fee of Rs. 500/- (Rs. 250/- for Scheduled Caste/Scheduled Tribe/Persons with disabilities), should be sent to:

Registrar, Central University of Punjab, City Campus, Mansa Road, Bathinda-151001

For more info visit: http://www.centralunipunjab.com/Teaching/Final%20Details-t10-2013.pdf, http://www.centralunipunjab.com/Teaching/Advertisement-t10-2013.jpg

Last Apply Date: 31 Dec 2014

GenomeTools: The versatile open source genome analysis software

Jit — Wed, 07 Feb 2018 10:44:18 -0600

The GenomeTools genome analysis system is a free collection of bioinformatics tools (in the realm of genome informatics) combined into a single binary named gt. It is based on a C library named “libgenometools” which consists of several modules.

If you are interested in gene prediction, have a look at GenomeThreader.

Address of the bookmark: http://genometools.org/