BOL: Related items

Public Databases for Bioinformatics !

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access at: dhara:5000
Has btsync server (try it - it's much better than Dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 intel cores
256GB RAM
500GB of SSD storage 
36TB RAID6 local storage
4 Intel Phis (space for 4 more GPUs)


durga: Montgomery and sensitive data
24 intel cores
256GB RAM
500GB of SSD RAID0 storage 
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 core
256GB RAM 
500GB of SSD RAID0 storage 
36TB RAID6 local storage

vayu: Kundaje GPU server
4 core
64GB RAM 
200GB of SSD storage 
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD core
128GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD core
256GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you don't want this to count towards your quota, you must chown it

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on RAID, and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       i.e. if the hard drives fail, or you delete something and notice
       within 24 hours, we can recover it. Otherwise not (unlike home,
       which is properly backed up).
intermediate analysis products that would be hard to recover should be stored here 
       e.g. stochastic analysis results that need to be kept so that paper 
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*
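
The layout above can be summarised as a small lookup. This is only a sketch: the category names (home, shared, lab, scratch, persistent, web) and the function storage_path are hypothetical labels introduced here for illustration and are not part of any tool on the servers. The paths themselves are the ones listed above.

```python
# Hypothetical category names mapped to the directories listed above.
STORAGE_PATHS = {
    "home": "/users/{user}",                 # nightly full backups
    "shared": "/mnt/data",                   # globally accessible data
    "lab": "/mnt/lab_data/{lab}",            # lab project data
    "scratch": "/srv/scratch/{user}",        # fast local storage, not backed up
    "persistent": "/srv/persistent/{user}",  # fast local storage, synced nightly
    "web": "/srv/www/{lab}",                 # web accessible, not backed up
}


def storage_path(kind, user=None, lab=None):
    """Return the directory for a kind of data, filling in user or lab."""
    template = STORAGE_PATHS[kind]
    if "{user}" in template and user is None:
        raise ValueError("a user name is required for %r" % kind)
    if "{lab}" in template and lab is None:
        raise ValueError("a lab name is required for %r" % kind)
    return template.format(user=user, lab=lab)
```

For example, storage_path("scratch", user="alice") gives /srv/scratch/alice, which is where most analysis should run.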

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP
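
If the parallel command is not available, roughly the same thing can be done from Python. The following is a minimal sketch using only the standard library. The function names gzip_file and gzip_all and the default of 4 workers are assumptions made for this example. Threads are used on the assumption that most of the time is spent inside the compression library and in file I/O.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gzip_file(path):
    """Compress path to path + '.gz' and remove the original, as gzip does."""
    path = Path(path)
    target = path.with_name(path.name + ".gz")
    with open(path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    path.unlink()
    return target


def gzip_all(paths, max_workers=4):
    """Compress every file in paths concurrently; return the new .gz paths."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() keeps the results in the same order as the input paths
        return list(pool.map(gzip_file, paths))
```

Usage would look like gzip_all(glob.glob("*.FILESTOGZIP")) after importing glob.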

# fork example in python:
(for more detailed examples look at 
 https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py)

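import os
import time
import random

import multiprocessing

class ProcessSafeOPStream(object):
    """Wrap a writeable file object so that forked processes can share it."""
    def __init__(self, writeable_obj):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name

    def write(self, data):
        # hold the lock for both the write and the flush so that lines
        # from different processes are never interleaved
        with self.lock:
            self.writeable_obj.write(data)
            self.writeable_obj.flush()

    def close(self):
        self.writeable_obj.close()

def worker(queue, ofp):
    # Try without this: forked workers may inherit the parent's RNG state
    # and all draw the same sequence (recent Python 3 versions re-seed
    # automatically after fork)
    random.seed()
    while True:
        i = queue.get()
        if i == 'FINISHED': return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        # flush explicitly: os._exit() below skips the normal buffer flush
        print(i, x, flush=True)
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue, followed by one 'FINISHED' sentinel per worker
todo = multiprocessing.Queue()
for i in range(NSIMS): todo.put(i)
for i in range(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream(open("output.txt", "w"))

pids = []
for i in range(NPROC):
    pid = os.fork()
    if pid == 0:
        # child: process items until a sentinel arrives, then exit at once
        worker(todo, ofp)
        os._exit(0)
    else:
        pids.append(pid)

# wait for every child to exit before closing the shared file
for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print("FINISHED")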

For use case 1 we obtained the following ENCODE and ROADMAP datasets https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as well as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.
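
The files listed above can be fetched with a few lines of Python. This is a hedged sketch and not the procedure used by the paper's authors: the names DATASETS, local_name and fetch, and the default target directory, are assumptions made for this example. Only URLs quoted above are included. The narrowPeak list for use case 2 lives in the urls.txt file referenced above and is not reproduced here.

```python
import os
from urllib.parse import urlparse
from urllib.request import urlretrieve

ENCODE = "https://www.encodeproject.org/files"

# URLs copied from the data descriptions above, grouped by use case.
DATASETS = {
    "use_case_1": [
        ENCODE + "/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz",
        ENCODE + "/ENCFF546PJU/@@download/ENCFF546PJU.bam",
        ENCODE + "/ENCFF059BEU/@@download/ENCFF059BEU.bam",
        "http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/"
        "hg38-human/hg38.blacklist.bed.gz",
        "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz",
    ],
    "use_case_3": [
        ENCODE + "/ENCFF591XCX/@@download/ENCFF591XCX.bam",
        ENCODE + "/ENCFF736LHE/@@download/ENCFF736LHE.bigWig",
        ENCODE + "/ENCFF177HHM/@@download/ENCFF177HHM.bam",
        "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/"
        "release_29/gencode.v29.annotation.gtf.gz",
    ],
}


def local_name(url):
    """Return the file name part of a URL, e.g. 'hg38.fa.gz'."""
    return os.path.basename(urlparse(url).path)


def fetch(urls, dest_dir="data"):
    """Download each URL into dest_dir, skipping files already present."""
    os.makedirs(dest_dir, exist_ok=True)
    paths = []
    for url in urls:
        target = os.path.join(dest_dir, local_name(url))
        if not os.path.exists(target):
            urlretrieve(url, target)
        paths.append(target)
    return paths
```

Calling fetch(DATASETS["use_case_1"]) would download the five files for use case 1. The BAM and genome files are large, so a directory under /srv/scratch/$USER is a more suitable dest_dir than the home directory.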

Address of the bookmark: http://mitra.stanford.edu/

04- Informatics Approach to Cancer - Interview with Dr. Joel Saltz

Mon, 07 Oct 2013 14:35:43 -0500

For additional information visit http://www.cancerquest.org/joel-saltz-interview. Dr. Joel Saltz is a Professor in the Departments of Pathology, Biostatistics and Bioinformatics, and Mathematics and Computer Science at Emory University. Dr. Saltz's research on bioinformatics spans several disciplines. One project involves applying computer analysis to medical imaging to yield better results for patients. As an example, a computer program may be able to help doctors detect small cancers in a CT scan or mammogram. In this interview segment, Dr. Saltz discusses the informatics approach to cancer. To learn more about cancer and watch additional interviews, please visit the CancerQuest website at http://www.cancerquest.org.

Evolution and Cancer

Fri, 27 Sep 2013 11:28:49 -0500

Air date: Wednesday, January 04, 2012, 3:00:00 PM
Time displayed is Eastern Time, Washington DC Local
Category: Wednesday Afternoon Lectures

Description: There is a broad consensus that cancer is the result of somatic cells having serially gained, by a series of mutations, the ability to grow independently, to recruit resources from the circulation and the stroma, to invade local tissues, and to found anatomically distant metastases, ultimately killing the host. From the point of view of the cancer-causing somatic cell population, this is evolution driven by mutation and selection. Genomics has resulted in a parallel consensus that the central functions of all eukaryotes are highly conserved, not only at the level of individual protein functions, but also complex biological pathways and systems. These ideas motivated a comparison between results of molecular genetic studies of experimental evolution in yeast and the molecular genetic phenomena associated with tumorigenesis and tumor progression. We find some very striking similarities, including recurring genomic rearrangements, alterations of the regulation of specific growth-promoting genes, population-genetic features that affect the fitness trajectories of growth rate variants in evolving populations, and physiological and metabolic similarities derived from the conservation of the basic plan of growth and cell multiplication among all eukaryotes. It is hoped that some of the insights from yeast will aid the interpretation of sequence changes found in tumors, especially in the urgent necessity to distinguish 'driver' from 'passenger' mutations.

David Botstein's fundamental contributions to modern genetics include the development of genetic methods for understanding biological functions and the discovery of the functions of many yeast and bacterial genes. In 1980, Botstein and three colleagues proposed a method for mapping human genes that laid the groundwork for the Human Genome Project. The basic principle of the mapping scheme was to develop, by recombinant DNA techniques, random single-copy DNA probes capable of detecting DNA sequence polymorphisms when hybridized to restriction digests, or specific fragments, of an individual's DNA. The method was used in subsequent years to identify several human disease genes, such as Huntington's and BRCA1. Variations of this method enabled the sequencing phase of the Human Genome Project.

In the 1990s Botstein, having moved to Stanford University School of Medicine, collaborated with Patrick O. Brown of Stanford in exploiting DNA microarrays to study genome-wide gene expression patterns in yeast and in human cancers. This required developing a new statistical method and graphical interface, widely used today to interpret genomic data. Botstein also has helped to create, with Michael Ashburner and Gerald Rubin, a bioinformatics initiative to unify the representation of gene and gene product attributes across all species, called Gene Ontology.

He graduated from Harvard College and earned his doctorate from the University of Michigan. He worked at Massachusetts Institute of Technology from 1967 to 1988; served as vice president for science at Genentech from 1988 to 1990; chaired the Department of Genetics at the Stanford University School of Medicine from 1990 to 2003; and joined the Princeton University faculty in 2003. He has sat on numerous editorial boards and was the founding editor of Molecular Biology of the Cell. Among recent major awards, Botstein won the Peter Gruber Foundation Prize in Genetics in 2003, the Apple Science Innovator Award in 2008, and the Albany Medical Center Prize in 2010.

The NIH Wednesday Afternoon Lecture Series includes weekly scientific talks by some of the top researchers in the biomedical sciences worldwide. For more information, visit: The NIH Director's Wednesday Afternoon Lecture Series

Author: Dr. David Botstein, Princeton University
Runtime: 00:59:58
Permanent link: http://videocast.nih.gov/launch.asp?17046

Ribbon: Visualizing complex genome alignments and structural variation:

Jit — Wed, 29 Nov 2017 07:40:22 -0600

Ribbon can be used for long reads, short reads, paired-end reads, and assembly/genome alignments. Instructions for each data format are available by clicking on "instructions" in each tab on the right.

Local installation:

You can install Ribbon locally from GitHub by following the instructions here: https://github.com/MariaNattestad/Ribbon

Address of the bookmark: http://genomeribbon.com/

Pre- or postdoctoral research fellowship in Structural Bioinformatics in Padova

Wed, 02 Oct 2013 15:12:22 -0500

University of Padova (URL: http://protein.bio.unipd.it/)

A research fellowship is available at the BioComputing Laboratory, University of Padova (URL: http://protein.bio.unipd.it/). A highly motivated and creative candidate is sought to work on structural bioinformatics. Specifically, the project entails the development of novel methods, tools and databases for the analysis of protein structures.

The BioComputing Laboratory is a group of a dozen people working on several aspects of prediction of protein structure & function employing techniques at the intersection between biology, medicine, chemistry, physics & computer science. Our aim is to integrate the development of novel methods and their application to biologically relevant problems.

We are looking for candidates with a solid Bioinformatics background, programming experience (Python, Perl, C++ and/or Java) and good knowledge of molecular biology (protein structure/function, signalling pathways). Candidates should have a degree with top marks, optionally hold a PhD, and be highly motivated to work on interdisciplinary research. Good knowledge of English, an open-minded spirit, and being collaborative and creative are crucial.

The fellowship, which should start in late 2013, is initially for one year. It will be commensurate with experience, can be extended depending on performance and may lead to a PhD degree. The successful candidate will be located at the BioComputing Laboratory, University of Padova. Travel support for conferences and/or research visits abroad may be provided.

To apply, please send your CV, a brief description of your research background and the names of two (or more) references to Prof. Silvio Tosatto (Email: silvio.tosatto@unipd.it).

Contact Person (Referent): Silvio Tosatto
Ref. E-Mail: silvio.tosatto@unipd.it
Tel: +39 049 827 6269
Fax: +39 049 827 6260
Group Web Page: http://protein.bio.unipd.it/

Mugsy: multiple whole genome alignment tool

Jit — Fri, 08 Dec 2017 17:41:14 -0600

Mugsy is a multiple whole genome aligner. Mugsy uses Nucmer for pairwise alignment, a custom graph based segmentation procedure for identifying collinear regions, and the segment-based progressive multiple alignment strategy from Seqan::TCoffee. Mugsy accepts draft genomes in the form of multi-FASTA files and does not require a reference genome.

To cite Mugsy, use:

Angiuoli SV and Salzberg SL. Mugsy: Fast multiple alignment of closely related whole genomes. Bioinformatics 2011 27(3):334-4

Address of the bookmark: http://mugsy.sourceforge.net/

Contract Faculty-Bioinformatics at Maulana Azad National Institute of Technology

Thu, 12 Dec 2013 20:46:52 -0600

Job Description: F.No.11/10(1)/929 Qualifications: Candidates should have a Ph.D. degree. If Ph.D. candidates are not available, at least a Post Graduate degree with GATE/NET qualification is a must. Walk-in-Interview on 19.12.2013 from 2.30 P.M. to 5.30 P.M. at Maulana Azad National Institute of Technology, Bhopal. For more details, please visit the website: http://www.manit.ac.in/manitbhopal/Year2013/Recruitment/Contract_faculty/contract%20faculty%202013-2014.pdf

For more @ http://www.manit.ac.in/manitbhopal/Year2013/Recruitment/Contract_faculty/contract%20faculty%202013-2014.pdf

Web address @ http://www.manit.ac.in

Magic-BLAST: a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome.

Jit — Tue, 26 Dec 2017 22:23:39 -0600

Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score, taking into account simultaneously the two reads of a pair, and in case of RNA-seq, locating the candidate introns and adding up the score of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.

Magic-BLAST incorporates within the NCBI BLAST code framework ideas developed in the NCBI Magic pipeline, in particular hit extensions by local walk and jump (http://www.ncbi.nlm.nih.gov/pubmed/26109056), and recursive clipping of mismatches near the edges of the reads, which avoids accumulating artefactual mismatches near splice sites and is needed to distinguish short indels from substitutions near the edges.

Address of the bookmark: https://ncbi.github.io/magicblast/

Yau Group

Tue, 15 Oct 2013 13:05:15 -0500

Yau Group are a new research group based at the Wellcome Trust Centre for Human Genetics and the Department of Statistics at the University of Oxford.

Yau Group develops statistical and computational methods for the analysis of genomic datasets with a particular interest in cancer sequencing applications and the use of Bayesian Statistics.

Yau Group currently have projects in somatic mutation analysis of heterogeneous cancers, data fusion or integration techniques and single cell genomics.

More @ http://www.well.ox.ac.uk/~cyau/index.html

Carefully opt for human reference genome

biogeek — Tue, 18 Feb 2020 07:43:32 -0600

Heng Li posted several issues with the human reference genomes given in these resources and suggests the following compressed FASTA file to be used as the hg38/GRCh38 human reference genome.

If you map reads to GRCh38 or hg38, use the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
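
A quick way to see which kind of reference a downloaded file is would be to list its sequence names. The sketch below assumes a gzip-compressed FASTA file and assumes that ALT contigs follow the UCSC-style naming in which the sequence name ends in _alt. The function names sequence_names and alt_contigs are made up for this example. It is a sanity check only and does not cover the other issues discussed in the linked post.

```python
import gzip


def sequence_names(fasta_gz_path):
    """Yield the name (first word of each '>' header) of every sequence."""
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    yield fields[0]


def alt_contigs(fasta_gz_path):
    """Return sequence names ending in '_alt' (assumed ALT contig naming)."""
    return [name for name in sequence_names(fasta_gz_path)
            if name.endswith("_alt")]
```

Under these assumptions, a file with the ALT contigs removed would give an empty list from alt_contigs.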

There are several other versions of GRCh37/GRCh38. What’s wrong with them? Here is a collection of potential issues:

More at http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

Address of the bookmark: http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use