BOL: Related items

dnaPipeTE: de-novo assembly & annotation Pipeline for Transposable Elements

Rahul Nayak — Sat, 02 Dec 2017 18:25:44 -0600

dnaPipeTE (de-novo assembly & annotation Pipeline for Transposable Elements) is a pipeline designed to find, annotate and quantify Transposable Elements in small samples of NGS datasets. It is very useful for quantifying the proportion of TEs in newly sequenced genomes, since it does not require genome assembly and works on small datasets (< 1X coverage).
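
To make the "< 1X" figure concrete: sequencing depth is roughly (number of reads × read length) / genome size. The sketch below applies that relation. The function name and the example numbers are illustrative assumptions, not dnaPipeTE options or defaults.

```python
import math

def reads_for_coverage(genome_size_bp, read_length_bp, coverage):
    """Number of reads needed to sample a genome at the given depth."""
    # coverage = n_reads * read_length / genome_size, solved for n_reads
    return math.ceil(coverage * genome_size_bp / read_length_bp)

# hypothetical example: 1 Gb genome, 100 bp reads, 0.25X sample
print(reads_for_coverage(1_000_000_000, 100, 0.25))
```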

dnaPipeTE is developed by Clément Goubert, Laurent Modolo and the TREEP team of the LBBE: http://lbbe.univ-lyon1.fr/-Equipe-Elements-transposables-.html?lang=en
You can find the original publication in GBE here: https://academic.oup.com/gbe/article/7/4/1192/533768

Figure: output examples of quantification and TE landscape (relative age) produced by dnaPipeTE

Address of the bookmark: https://github.com/clemgoub/dnaPipeTE

KOBAS: a web server for gene/protein functional annotation and functional gene set enrichment

Jit — Fri, 19 Oct 2018 09:36:11 -0500

KOBAS 3.0 is a web server for gene/protein functional annotation (Annotate module) and functional gene set enrichment (Enrichment module). The Annotate module accepts a gene list as input, either IDs or sequences, and generates annotations for each gene based on multiple databases of pathways, diseases, and Gene Ontology. The Enrichment module accepts either a gene list or gene expression data as input and generates enriched gene sets with their corresponding names, p-values or probabilities of enrichment, and enrichment scores based on the results of multiple methods.

Address of the bookmark: http://kobas.cbi.pku.edu.cn/

multiPhATE: bioinformatics pipeline for functional annotation of phage isolates

Abhimanyu Singh — Thu, 16 May 2019 00:17:39 -0500

multiPhATE (multiple-genome Phage Annotation Toolkit and Evaluator) is a throughput pipeline driver that invokes an annotation pipeline (PhATE) across a user-specified set of phage genomes. The tool incorporates a de novo phage gene-calling algorithm and assigns putative functions to gene calls using protein-, virus-, and phage-centric databases.

Address of the bookmark: https://github.com/carolzhou/multiPhATE

EUKulele: Taxonomic annotation of the unsung eukaryotic microbes

Shruti Paniwala — Sat, 26 Dec 2020 12:10:17 -0600

EUKulele is an open-source software tool designed to assign taxonomy to microeukaryotes detected in meta-omic samples. It complements analysis approaches in other domains by accommodating assembly output and providing concrete metrics that report the taxonomic completeness of each sample.

Address of the bookmark: https://github.com/AlexanderLabWHOI/EUKulele

Microscope

Jitendra Narayan — Fri, 04 Mar 2016 05:26:31 -0600

Microscope Platform user documentation.

The MicroScope platform is available at this URL:

https://www.genoscope.cns.fr/agc/microscope

Address of the bookmark: http://microscope.readthedocs.org/en/latest/index.html

3D Genome Browser: explore chromatin interaction data, such as Hi-C, ChIA-PET, Capture Hi-C, PLAC-Seq, and more

BioStar — Fri, 22 Jan 2021 20:19:32 -0600

Besides visualizing chromatin interaction data, you can also seamlessly browse other omics data such as ChIP-Seq or RNA-Seq for the same genomic region, and gain a complete view of both the regulatory landscape and the 3D genome structure for any given gene. You can also check the expression of any queried gene across hundreds of tissue/cell types measured by the ENCODE consortium. Finally, please check out the virtual 4C page, where we provide multiple methods to link distal cis-regulatory elements with their potential target genes, including virtual 4C, ChIA-PET and cross-cell-type correlation of proximal and distal DHSs.

Address of the bookmark: http://3dgenome.fsm.northwestern.edu/index.html

Public Databases for Bioinformatics!

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access at: dhara:5000
Has btsync server (try it - it's much better than Dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 intel cores
256GB RAM
500GB of SSD storage 
36TB RAID6 local storage
4 Intel Phis (space for 4 more GPUs)


durga: Montgomery and sensitive data
24 intel cores
256GB RAM
500GB of SSD RAID0 storage 
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 core
256GB RAM 
500GB of SSD RAID0 storage 
36TB RAID6 local storage

vayu: Kundaje GPU server
4 core
64GB RAM 
200GB of SSD storage 
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD core
128GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD core
256GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you don't want this to count towards your quota you must chown

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on raid and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       i.e. if the hard drives fail, or you delete something and notice
       within 24 hours, we can recover it. Otherwise not. (vs home, which is
       properly backed up)
intermediate analysis products that would be hard to recover should be stored here 
       e.g. stochastic analysis results that need to be kept so that paper 
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP
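
The same pattern can be sketched in Python with a process pool. This is only an illustration: the helper names gzip_file and gzip_all are hypothetical, and unlike the gzip command this sketch leaves the original file in place.

```python
import gzip
import shutil
from multiprocessing import Pool

def gzip_file(path):
    """Compress one file to path + '.gz' and return the new path."""
    out_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path

def gzip_all(paths, nproc=4):
    """Compress many files, spreading them over nproc worker processes."""
    with Pool(nproc) as pool:
        return pool.map(gzip_file, paths)

if __name__ == "__main__":
    import glob
    # hypothetical pattern, mirroring *.FILESTOGZIP above
    print(gzip_all(glob.glob("*.FILESTOGZIP")))
```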

# fork example in python (Python 3):
# for more detailed examples look at
# https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py

import os
import time
import random

import multiprocessing

class ProcessSafeOPStream(object):
    # wraps a file object so that writes from several processes
    # do not interleave
    def __init__(self, writeable_obj):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name

    def write(self, data):
        # write and flush under the lock so each record lands intact
        with self.lock:
            self.writeable_obj.write(data)
            self.writeable_obj.flush()

    def close(self):
        self.writeable_obj.close()

def worker(queue, ofp):
    # Try without this: forked children start from the parent's random
    # state, so without re-seeding they can all draw the same numbers
    random.seed()
    while True:
        i = queue.get()
        if i == 'FINISHED':
            return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        # flush explicitly, because os._exit() skips buffer cleanup
        print(i, x, flush=True)
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue, followed by one 'FINISHED' sentinel per worker
todo = multiprocessing.Queue()
for i in range(NSIMS): todo.put(i)
for i in range(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream(open("output.txt", "w"))

# fork the workers; each child drains the queue and then exits
pids = []
for i in range(NPROC):
    pid = os.fork()
    if pid == 0:
        worker(todo, ofp)
        os._exit(0)
    else:
        pids.append(pid)

# wait for every child before closing the output file
for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print("FINISHED")
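
A higher-level variant of the same job, sketched with multiprocessing.Pool instead of raw os.fork. Results come back to the parent, which does all the writing, so no lock is needed. The names simulate and run_simulations, the output file name and the task counts are assumptions made for illustration.

```python
import random
import time
from multiprocessing import Pool

def simulate(i):
    """Stand-in for an expensive function; returns (index, value)."""
    rng = random.Random()  # fresh generator seeded from OS entropy
    x = rng.random()
    time.sleep(x / 1000)
    return i, x

def run_simulations(nsims, nproc, out_path):
    """Run nsims tasks on nproc workers and write one line per result."""
    with Pool(nproc) as pool, open(out_path, "w") as ofp:
        # imap_unordered yields results as workers finish them
        for i, x in pool.imap_unordered(simulate, range(nsims)):
            ofp.write("%i\t%s\n" % (i, x))

if __name__ == "__main__":
    run_simulations(nsims=1000, nproc=4, out_path="output_pool.txt")
    print("FINISHED")
```

Lines in the output file appear in completion order, not in index order, just as in the fork example.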

For use case 1 we obtained the following ENCODE and ROADMAP datasets https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz.

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as well as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.
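
A minimal sketch of fetching files like the ones listed above, assuming network access. The helpers local_name and fetch are hypothetical; the URL is one quoted in the text, and the actual download is left commented out because the file is large.

```python
import os
import urllib.request
from urllib.parse import urlparse

def local_name(url):
    """Derive a local file name from the last path component of a URL."""
    return os.path.basename(urlparse(url).path)

def fetch(url, dest_dir="."):
    """Download url into dest_dir unless the file is already there."""
    target = os.path.join(dest_dir, local_name(url))
    if not os.path.exists(target):
        urllib.request.urlretrieve(url, target)
    return target

HG38_URL = "http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"
print(local_name(HG38_URL))
# fetch(HG38_URL)  # uncomment to download (large file)
```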

Address of the bookmark: http://mitra.stanford.edu/

coursera genome assembly tutorial

Jit — Sat, 25 Nov 2017 08:57:25 -0600

Solutions to Coursera Genome Sequencing (Bioinformatics II)

Address of the bookmark: https://github.com/iansealy/coursera-assembly

Bandage: interactive visualization of de novo genome assemblies

Shruti Paniwala — Mon, 04 Dec 2017 10:09:37 -0600

Bandage (a Bioinformatics Application for Navigating De novo Assembly Graphs Easily) is a tool for visualizing assembly graphs with connections. Users can zoom in to specific areas of the graph and interact with it by moving nodes, adding labels, changing colors and extracting sequences. BLAST searches can be performed within the Bandage graphical user interface and the hits are displayed as highlights in the graph. By displaying connections between contigs, Bandage presents new possibilities for analyzing de novo assemblies that are not possible through investigation of contigs alone.

Availability and implementation: Source code and binaries are freely available at https://github.com/rrwick/Bandage. Bandage is implemented in C++ and supported on Linux, OS X and Windows. A full feature list and screenshots are available at http://rrwick.github.io/Bandage.

Address of the bookmark: http://rrwick.github.io/Bandage/

Mugsy: multiple whole genome alignment tool

Jit — Fri, 08 Dec 2017 17:41:14 -0600

Mugsy is a multiple whole genome aligner. Mugsy uses Nucmer for pairwise alignment, a custom graph-based segmentation procedure for identifying collinear regions, and the segment-based progressive multiple alignment strategy from Seqan::TCoffee. Mugsy accepts draft genomes in the form of multi-FASTA files and does not require a reference genome.

To cite Mugsy, use:

Angiuoli SV and Salzberg SL. Mugsy: Fast multiple alignment of closely related whole genomes. Bioinformatics 2011 27(3):334-4

Address of the bookmark: http://mugsy.sourceforge.net/