BOL: Related items

REST API

Neel — Mon, 04 Oct 2021 12:46:40 -0500

REST API

Sample Representational State Transfer (REST) clients are provided for several programming languages. For details on how to use a client, download it and run it without any arguments.

Language	Download	Requirements
Perl	psiblast.pl	LWP and XML::Simple
Python	psiblast.py	xmltramp2

For details, see the Environment setup for REST Web Services and Examples for Perl REST Web Services Clients pages.

BEAP: Blast Extension and Assembly Program

Shruti Paniwala — Mon, 11 Jun 2018 04:52:56 -0500

The Blast Extension and Assembly Program (BEAP) is a computer program that uses a short starting DNA fragment, often an EST or partial gene segment, as a "primer" to recursively BLAST nucleotide databases. The aim is to obtain all sequences that overlap, directly or indirectly, with the "primer" and thereby "extend" the original sequence toward a "full length" sequence for functional analysis, or at least to obtain neighboring regions of the segment for SNP discovery and linkage disequilibrium analysis. Confidence in the assembly of the resulting sequences is achieved by using a known genome, such as the human genome, as a reference. https://www.animalgenome.org/tools/beap/
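
The recursive extension idea can be illustrated with a toy sketch. This is not BEAP itself: exact suffix/prefix string matching stands in for BLAST, an in-memory list of fragments stands in for a nucleotide database, only rightward extension is shown, and the names `overlap_len`, `extend_right` and `min_overlap` are made up for illustration.

```python
def overlap_len(left, right, min_overlap):
    """Length of the longest suffix of `left` that is also a prefix of `right`.

    Returns 0 when no overlap of at least `min_overlap` characters exists.
    """
    max_len = min(len(left), len(right))
    for n in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def extend_right(seed, fragments, min_overlap=4):
    """Repeatedly extend `seed` to the right with overlapping fragments.

    Each round searches the remaining fragments for one whose start
    overlaps the current end of the contig; the search repeats on the
    extended contig until nothing new can be added.
    """
    contig = seed
    unused = list(fragments)
    extended = True
    while extended:
        extended = False
        for frag in unused:
            n = overlap_len(contig, frag, min_overlap)
            if n and n < len(frag):
                contig += frag[n:]       # append only the new bases
                unused.remove(frag)
                extended = True
                break                    # restart the search on the longer contig
    return contig


# Toy data (hypothetical sequences, not real ESTs)
seed = "ACGTAC"
fragments = ["GTACGGA", "GGATTC", "TTTTTT"]
print(extend_right(seed, fragments, min_overlap=3))
```

The second fragment only overlaps the contig after the first one has been added, which is the "indirect" overlap the description refers to.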

Address of the bookmark: https://www.animalgenome.org/tools/beap/

PhylomeDB

Poonam Mahapatra — Mon, 12 Aug 2013 11:55:39 -0500

PhylomeDB is a public database for complete collections of gene phylogenies (phylomes). It allows users to interactively explore the evolutionary history of genes through the visualization of phylogenetic trees and multiple sequence alignments.

Moreover, PhylomeDB provides genome-wide orthology and paralogy predictions, which are based on the analysis of the phylogenetic trees. The automated pipeline used to reconstruct trees aims at providing a high-quality phylogenetic analysis of different genomes, including Maximum Likelihood or Bayesian tree inference, alignment trimming and evolutionary model testing. PhylomeDB also includes a public download section with the complete set of trees, alignments and orthology predictions.

More at http://phylomedb.org/

BioDBnet

Jit — Thu, 02 Jun 2016 11:11:47 -0500

Database to Database Conversions

db2db converts identifiers from one database into identifiers or annotations from other databases. To use db2db, select the input type of your data; changing the input type automatically changes the output options to the ones specific to the selected input. Then select one or more output types and add your identifiers in the ID list box. Set the remove duplicate values option to 'No' if you do not want duplicates to be removed. Clicking on submit then returns a table of your inputs matched against all the selected outputs, in the exact order as entered. Results can be limited to a particular taxon by entering its Taxon ID. Performance will vary widely depending on the number of outputs and the options selected. Conversions to a single output with the default options should complete in a few seconds.
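
The shape of the result table described above can be sketched as follows. This is only an illustration of the input-order and duplicate-removal behaviour, not the bioDBnet implementation or its API: the function `db2db`, the in-memory `toy_mapping`, and all identifiers in it are hypothetical placeholders.

```python
def db2db(ids, mapping, outputs, remove_duplicates=True):
    """Return one row per input ID, in input order, one column per output type."""
    rows = []
    for identifier in ids:
        record = mapping.get(identifier, {})   # unknown IDs get empty columns
        row = [identifier]
        for out in outputs:
            values = list(record.get(out, []))
            if remove_duplicates:
                # dict.fromkeys drops repeats while keeping first-seen order
                values = list(dict.fromkeys(values))
            row.append(values)
        rows.append(row)
    return rows


# Placeholder data: "id_*" are inputs, "db_x"/"db_y" are output types
toy_mapping = {
    "id_1": {"db_x": ["X1", "X1", "X2"], "db_y": ["Y9"]},
    "id_2": {"db_x": ["X7"]},
}

for row in db2db(["id_2", "id_1", "id_3"], toy_mapping, ["db_x", "db_y"]):
    print(row)
```

Passing `remove_duplicates=False` keeps the repeated `X1` value, mirroring the 'No' setting of the duplicate option.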

Address of the bookmark: https://biodbnet-abcc.ncifcrf.gov/db/db2db.php

BioDownloader

Surabhi Chaudhary — Sat, 25 Feb 2017 17:52:33 -0600

BioDownloader is a program for downloading and/or updating files from ftp/http servers. The program has unique features that are specifically designed to deal with bioinformatics data files and servers:

optimized to work with vast amounts of data and very large file sets (~10,000 - 100,000 files)
allows the selective retrieval of only the required files (file masks, ls-lR parsing, recursive search, updates)
has a built-in repository containing the settings for the most common bioinformatics download needs
built-in wizard for batch post-processing of downloaded files (archive extraction, file conversion, etc.)
capable of performing multiple download or update tasks simultaneously

BioDownloader has a built-in repository containing the settings for common bioinformatics file-synchronization needs, including the Protein Data Bank (PDB) and National Center for Biotechnology Information (NCBI) databases. It can post-process downloaded files, including archive extraction and file conversions.
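
Selective retrieval with file masks can be sketched in a few lines. This is not BioDownloader's code: the function `select_files`, the include/exclude arguments, and the file names in the listing are assumptions chosen for illustration, and shell-style wildcards from the Python standard library stand in for whatever mask syntax the program uses.

```python
from fnmatch import fnmatchcase


def select_files(listing, include_masks, exclude_masks=()):
    """Keep the paths whose file name matches an include mask and no exclude mask."""
    selected = []
    for path in listing:
        name = path.rsplit("/", 1)[-1]   # match on the file name, not the directory
        wanted = any(fnmatchcase(name, mask) for mask in include_masks)
        unwanted = any(fnmatchcase(name, mask) for mask in exclude_masks)
        if wanted and not unwanted:
            selected.append(path)
    return selected


# Hypothetical server listing, e.g. as parsed from an ls-lR file
listing = [
    "pub/data/entry_0001.ent.gz",
    "pub/data/README",
    "pub/data/entry_0002.ent.gz",
    "pub/data/old/obsolete_0003.ent.gz",
]

print(select_files(listing, ["*.ent.gz"], exclude_masks=["obsolete_*"]))
```

Filtering the listing before any transfer starts is what keeps very large file sets manageable: only the matching paths are ever requested.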

http://dunbrack.fccc.edu/BioDownloader/

Address of the bookmark: http://dunbrack.fccc.edu/BioDownloader/

HIV genome database !

Rahul Nayak — Fri, 21 Jan 2022 05:40:15 -0600

HIV resources

https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html

Address of the bookmark: https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html

Public Databases for Bioinformatics !

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access at: dhara:5000
Has btsync server (try it - it's much better than dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 intel cores
256GB RAM
500GB of SSD storage 
36TB RAID6 local storage
4 Intel Phis (space for 4 more GPUs)


durga: Montgomery and sensitive data
24 intel cores
256GB RAM
500GB of SSD RAID0 storage 
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 core
256GB RAM 
500GB of SSD RAID0 storage 
36TB RAID6 local storage

vayu: Kundaje GPU server
4 core
64GB RAM 
200GB of SSD storage 
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD core
128GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD core
256GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you don't want this to count towards your quota you must chown

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on RAID, and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       i.e. if the hard drives fail or you delete something and notice 
       within 24 hours we can recover. Otherwise not. (vs. home, which is 
       properly backed up)
intermediate analysis products that would be hard to recover should be stored here 
       e.g. stochastic analysis results that need to be kept so that paper 
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP
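
The same pattern can be sketched in Python with only the standard library. This is an assumed equivalent, not part of the original notes: `gzip_file` and `gzip_many` are illustrative names, and a thread pool is used for simplicity (a process pool would be an alternative).

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gzip_file(path):
    """Compress `path` to `path.gz` and remove the original, like `gzip path`."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    src.unlink()
    return dst


def gzip_many(paths, workers=4):
    """Compress several files concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gzip_file, paths))
```

For example, `gzip_many(sorted(Path(".").glob("*.FILESTOGZIP")))` would compress every matching file in the current directory.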

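# fork example in python (Python 3):
# (for more detailed examples look at
#  https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py)

import os
import time
import random

import multiprocessing

class ProcessSafeOPStream(object):
    """Wrap a file object so that writes from several processes do not interleave."""
    def __init__(self, writeable_obj):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name

    def write(self, data):
        # hold the lock for both the write and the flush so that each
        # record reaches the file in one piece
        with self.lock:
            self.writeable_obj.write(data)
            self.writeable_obj.flush()

    def close(self):
        self.writeable_obj.close()

def worker(queue, ofp):
    # Try without this: forked children can inherit the parent's RNG state
    # (newer Python 3 releases reseed the random module after fork on their own)
    random.seed()
    while True:
        i = queue.get()
        if i == 'FINISHED': return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        # flush explicitly, because os._exit() below skips the normal flush at exit
        print(i, x, flush=True)
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue: one entry per simulation, then one 'FINISHED' sentinel per worker
todo = multiprocessing.Queue()
for i in range(NSIMS): todo.put(i)
for i in range(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream(open("output.txt", "w"))

# fork the workers: os.fork() returns 0 in the child and the child's pid in the parent
pids = []
for i in range(NPROC):
    pid = os.fork()
    if pid == 0:
        worker(todo, ofp)
        # leave the child immediately, without running the parent's remaining code
        os._exit(0)
    else:
        pids.append(pid)

# wait for every child to finish
for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print("FINISHED")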

For use case 1 we obtained the following ENCODE and ROADMAP datasets https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as well as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.
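
The ENCODE links listed for the use cases all share one layout, which can be captured in a small helper. The pattern is inferred only from the URLs quoted above, so treating it as valid for other files is an assumption; `encode_download_url` is a made-up name.

```python
ENCODE_FILES = "https://www.encodeproject.org/files"


def encode_download_url(accession, extension):
    """Build a download URL following the layout of the links quoted above."""
    return "{base}/{acc}/@@download/{acc}.{ext}".format(
        base=ENCODE_FILES, acc=accession, ext=extension
    )


# Accessions and extensions taken from the use case 1 list
for accession, extension in [("ENCFF446WOD", "bed.gz"), ("ENCFF546PJU", "bam")]:
    print(encode_download_url(accession, extension))
```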

Address of the bookmark: http://mitra.stanford.edu/

Updated ranking of institutes and countries based on developed biological databases

Rahul Nayak — Fri, 11 Jan 2019 09:35:26 -0600

An updated ranking of institutes and countries based on the biological databases they have developed is available at https://lnkd.in/fiVAdM6. India is maintaining 4th position, and the "Institute of Microbial Technology, Chandigarh" is in 3rd position (after EBI and NCBI). Reaching 3rd position in the world is a big achievement for any institute.

More at http://bigd.big.ac.cn/databasecommons/stat

DEG 5.0: a database of essential genes in both prokaryotes and eukaryotes

Rahul Nayak — Tue, 30 Mar 2021 11:47:28 -0500

Essential genes are those indispensable for the survival of an organism, and their functions are therefore considered a foundation of life. Determining the minimal gene set needed to sustain a life form, a fundamental question in biology, plays a key role in the emerging field of synthetic biology.

DEG is freely available at the website http://tubic.tju.edu.cn/deg or http://www.essentialgene.org.

Address of the bookmark: http://www.essentialgene.org/

ChIPBase: open database for studying the transcription factor binding sites and motifs

Abhi — Wed, 29 Dec 2021 05:36:03 -0600

ChIPBase v2.0 is an open database for studying the transcription factor binding sites and motifs, and decoding the transcriptional regulatory networks of lncRNAs, miRNAs, other ncRNAs and protein-coding genes from ChIP-seq data. Our database currently contains ~10,200 curated peak datasets derived from ChIP-seq methods in 10 species.

Address of the bookmark: https://rna.sysu.edu.cn/chipbase/
