BOL: Related items

Python and BioPython Tutorial

Manshi Raghubanshi — Fri, 23 Aug 2013 06:47:40 -0500

A quickstart tutorial for becoming familiar with the Python language. The exercises assume knowledge of basic programming concepts. A group of 2nd-year computer science students with no previous Python knowledge needed 60-90 minutes to complete the exercises. With about 3 hours available, the exercises are suitable for non-programmers as well.

Address of the bookmark: http://www.biotnet.org/training-materials/python-programmers

Public Databases for Bioinformatics

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access at: dhara:5000
Has a btsync server (try it; it's much better than Dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 Intel cores
256GB RAM
500GB of SSD storage
36TB RAID6 local storage
4 Intel Phis (space for 4 more GPUs)

durga: Montgomery and sensitive data
24 Intel cores
256GB RAM
500GB of SSD RAID0 storage
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 cores
256GB RAM
500GB of SSD RAID0 storage
36TB RAID6 local storage

vayu: Kundaje GPU server
4 cores
64GB RAM
200GB of SSD storage
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD cores
128GB RAM
200GB of SSD storage
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD cores
256GB RAM
200GB of SSD storage
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you don't want this to count towards your quota, you must chown it

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on RAID, and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       i.e. if the hard drives fail, or you delete something and notice
       within 24 hours, we can recover it. Otherwise not (vs. home, which is
       properly backed up).
intermediate analysis products that would be hard to recover should be stored here
       e.g. stochastic analysis results that need to be kept so that paper
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP
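
A rough Python equivalent of the one-liner above, as a minimal sketch rather than a drop-in replacement. The helper names gzip_file and gzip_all are hypothetical. It uses a thread pool instead of separate processes, on the assumption that CPython's zlib module releases the GIL while compressing. Like gzip, it removes each original once the .gz file has been written.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gzip_file(path):
    """Compress `path` to `path.gz`, remove the original, return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    src.unlink()
    return dst


def gzip_all(paths, workers=4):
    """Compress several files concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gzip_file, paths))
```

For example, gzip_all(Path(".").glob("*.FILESTOGZIP")) would compress every matching file in the current directory.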

# fork example in Python
# (for more detailed examples look at
#  https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py)

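import os
import time
import random

import multiprocessing

class ProcessSafeOPStream(object):
    """Wrap a writable file so that writes from forked workers do not interleave."""
    def __init__(self, writeable_obj):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name

    def write(self, data):
        # hold the lock for both the write and the flush, so each record
        # reaches the file in one piece; 'with' releases the lock on errors too
        with self.lock:
            self.writeable_obj.write(data)
            self.writeable_obj.flush()

    def close(self):
        self.writeable_obj.close()

def worker(queue, ofp):
    # re-seed in each child; try without this to see every forked
    # worker start from the same inherited random state
    random.seed()
    while True:
        i = queue.get()
        if i == 'FINISHED': return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        # flush explicitly: os._exit() in the child skips the normal stdout flush
        print(i, x, flush=True)
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue
todo = multiprocessing.Queue()
for i in range(NSIMS): todo.put(i)
# one sentinel per worker, so that every worker sees 'FINISHED' and exits
for i in range(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream(open("output.txt", "w"))

pids = []
for i in range(NPROC):
    pid = os.fork()
    if pid == 0:
        # child: process tasks, then exit without running parent cleanup code
        worker(todo, ofp)
        os._exit(0)
    else:
        pids.append(pid)

# parent: wait for every child before closing the shared output file
for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print("FINISHED")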
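
The same kind of work can also be written with a process pool, which hands out tasks and collects results so that only the parent writes the output file and no lock is needed. This is a sketch under stated assumptions, not the code used on these servers: simulate and run_simulations are hypothetical names, and the "fork" start method is assumed (Unix only, as with os.fork above).

```python
import multiprocessing
import random
import time


def simulate(i):
    """Stand-in for an expensive function; returns (task index, value)."""
    # A private generator seeded from the task index: workers do not
    # depend on random state inherited from the parent at fork time.
    rng = random.Random(i)
    x = rng.random()
    time.sleep(x / 1000)
    return i, x


def run_simulations(nsims, nproc, out_path):
    """Run `nsims` tasks on `nproc` forked workers; only the parent writes."""
    ctx = multiprocessing.get_context("fork")  # assumption: Unix-like system
    with ctx.Pool(processes=nproc) as pool, open(out_path, "w") as ofp:
        # results arrive in completion order, not submission order
        for i, x in pool.imap_unordered(simulate, range(nsims), chunksize=16):
            ofp.write("%i\t%s\n" % (i, x))
    return out_path
```

Calling run_simulations(10000, 25, "output.txt") mirrors the sizes used in the fork example. The trade-off is that results pass back through the parent, which is fine for small records like these.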

For use case 1 we obtained the following ENCODE and ROADMAP datasets https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz.

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as well as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.

Address of the bookmark: http://mitra.stanford.edu/

Research Project Posts for CSIR Project, Delhi

Tue, 27 Aug 2013 04:31:41 -0500

Positions Open For Temporary Research Project Posts for CSIR Project, Delhi
CSIR is looking for bright young candidates to get involved in building algorithms and platforms for large biological data analyses in the areas of comparative genomics, computational workflows, disease association studies, simulating virtual organelles, etc. Anyone who fulfills the eligibility criteria mentioned below may appear for a walk-in interview on 3rd September 2013 at CSIR Headquarters, Anusandhan Bhawan, 2 Rafi Marg, Delhi – 110001.
See the link below for details, or download the PDF:

http://www.csir.res.in/External/Heads/aboutcsir/announcements/ProjectPost_130813.pdf

Bioinformatics Algorithm Demonstrations and Tutorials

Jitendra Narayan — Thu, 29 Aug 2013 09:23:51 -0500

Abstract

This project presents demonstrations of selected computer science algorithms important in bioinformatics, implemented in the spreadsheet program Microsoft Excel. Spreadsheets provide an interesting platform for demonstration of algorithms, since various steps of the calculations can be exposed in a manner that is easily comprehensible to users with little programming experience. The algorithms demonstrated include two approaches to approximate string matching (dynamic programming and Shift-AND numeric approximate matching), Hierarchical Clustering (used in phylogenetic studies and microarray analysis of gene expression), a Naive Bayes Classifier for simulated microarray gene expression data, and a simple Neural Network. These demonstrations are designed to serve as instructional aids in bioinformatics courses.

Tutorial @ http://www.cybertory.org/downloads/bae/BioinformaticsAlgorithmsInExcel.zip

One of the best resources for online bioinformatics learning is https://stepic.org/Bioinformatics-Algorithms-2 (enjoy the online learning).

Reference: cybertory

"Please add your favourite bioinformatics algorithm and tutorial links in the comment section below, for the benefit of the bioinformatics and computational biology community."

Address of the bookmark: http://www.cybertory.org/downloads/bae/BioinformaticsAlgorithmsExcelDoc.pdf

jobTree based python wrapper to run the genome simulation tool suite Evolver

Jit — Fri, 08 Dec 2017 16:26:32 -0600

evolverSimControl (eSC) can be used to simulate multi-chromosome genome evolution on an arbitrary phylogeny (Newick format). In addition to running evolver, eSC automatically creates statistical summaries of the simulation as it runs, including text and image files. Also included are convenience scripts to: check on a running simulation and see detailed status and logging information; extract fasta sequence files from the leaf nodes of a completed simulation; extract pairwise multiple alignment files (.maf) from leaf and branch nodes of a completed simulation and, with the help of mafJoin, join them together into a single maf covering the entire simulation.

Address of the bookmark: https://github.com/dentearl/evolverSimControl

4273π: Bioinformatics education on low cost ARM hardware

Jitendra Narayan — Mon, 02 Sep 2013 07:02:43 -0500

Are you teaching bioinformatics at a university and finding it complicated by typical computer classroom settings? As well as running software locally and online, students should gain experience of systems administration. Don't worry, there is a new OS to the rescue: 4273π, an operating system image for the Raspberry Pi based on Raspbian Linux. It provides an attractive, general-purpose computing environment, within which the course 4273π Bioinformatics for Biologists is embedded.

Though far slower than current desktop and laptop computers, the Raspberry Pi is notably faster than the Cray 1 supercomputer, a marvel of computer speed in its day. The Raspberry Pi approach includes all the benefits of having students work on laptops, but at lower cost. In addition, the Raspberry Pi is a new and exciting computer system, which in itself can add interest to the course.

As the Raspbian operating system, Raspberry Pi firmware and hardware and 4273π Bioinformatics for Biologists teaching material develop, further releases of 4273π will be made available. It is anticipated that there will be a minimum of two releases per year during the next four years.

4273π is a means to teach bioinformatics, including systems administration tasks, to undergraduates at low cost.

Descriptive paper @ http://www.biomedcentral.com/1471-2105/14/243

Image source: BMC Bioinformatics

Magic-BLAST: a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome.

Jit — Tue, 26 Dec 2017 22:23:39 -0600

Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score that simultaneously takes into account the two reads of a pair and, in the case of RNA-seq, locates the candidate introns and adds up the scores of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.

Magic-BLAST incorporates within the NCBI BLAST code framework ideas developed in the NCBI Magic pipeline, in particular hit extensions by local walk and jump (http://www.ncbi.nlm.nih.gov/pubmed/26109056), and recursive clipping of mismatches near the edges of the reads, which avoids accumulating artefactual mismatches near splice sites and is needed to distinguish short indels from substitutions near the edges.
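
The difference between per-exon scoring and a composite score can be illustrated with a toy example. This is not Magic-BLAST's actual scoring function: the loci, the scores, and all function names are made up for illustration, and intron or pairing penalties are ignored.

```python
# Hypothetical per-exon alignment scores for one read pair at two candidate loci.
candidates = {
    "locus_A": {"read1": [60], "read2": []},        # one strong hit, mate unaligned
    "locus_B": {"read1": [45, 40], "read2": [50]},  # spliced read plus its mate
}


def best_exon_score(hit):
    """Per-exon view: a locus is only as good as its single best exon hit."""
    return max(score for exons in hit.values() for score in exons)


def composite_score(hit):
    """Composite view: add up every exon score of both reads of the pair."""
    return sum(sum(exons) for exons in hit.values())


def pick_by_best_exon(cands):
    """Return the locus name with the highest single-exon score."""
    return max(cands, key=lambda name: best_exon_score(cands[name]))


def pick_by_composite(cands):
    """Return the locus name with the highest summed score."""
    return max(cands, key=lambda name: composite_score(cands[name]))
```

With these made-up numbers, per-exon scoring prefers locus_A, while the composite score prefers locus_B, where the spliced read and its mate both align.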

Address of the bookmark: https://ncbi.github.io/magicblast/

Huber Lab

Mon, 09 Sep 2013 21:57:03 -0500

The Huber group develops computational and statistical methods to design and analyse novel experimental approaches in genetics and cell biology.

Future projects and goals

Large-scale systematic maps of gene-gene and gene-environment interactions by automated phenotyping, using image analysis, machine learning, sparse model building and causal inference.
DNA-, RNA- and ChIP-Seq and their applications to gene expression regulation: statistical and computational foundations.
Cancer genomics, genomes as biomarkers, cancer phylogeny.
Image analysis for systems biology: measuring the dynamics of cell cycle and of cell migration of individual cells under normal conditions and many different perturbations (RNAi, drugs).

More @ http://www.embl.de/research/units/genome_biology/huber/index.html

Carefully opt for human reference genome

biogeek — Tue, 18 Feb 2020 07:43:32 -0600

Heng Li describes several issues with commonly distributed human reference genome files and suggests using the following compressed FASTA file as the hg38/GRCh38 human reference genome.

If you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

There are several other versions of GRCh37/GRCh38. What's wrong with them? The linked post collects the potential issues.

More at http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Address of the bookmark: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Asst. PROF IN BIOINFORMATICS at JAIPUR NATIONAL UNIVERSITY

Thu, 12 Sep 2013 07:18:02 -0500

JAIPUR NATIONAL UNIVERSITY, SCHOOL OF LIFE SCIENCES (SIILAS CAMPUS) URGENTLY REQUIRES

Asst. PROF IN BIOINFORMATICS.

QUALIFICATION: AS PER UGC

DESIRABLE: 1 YEAR EXPERIENCE IN ACADEMICS

CONTACT immediately

Prof D.S.Bhatia
Director
9351288070

Last date: within 7 days of publication.

Find more @ http://jnujaipur.ac.in/downloads/AdvtDec2012.jpg