BOL: Related items

Public Databases for Bioinformatics !

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access atx: dhara:5000
Has btsync server (try it - its much better than dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 intel cores
256GB RAM
500GB of SSD storage 
36TB RAID6 local storage
4 Intel Phi's (space for 4 more GPU's)


durga: Montgomery and sensitive data
24 intel cores
256GB RAM
500GB of SSD RAID0 storage 
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 core
256GB RAM 
500GB of SSD RAID0 storage 
36TB RAID6 local storage

vayu: Kundaje GPU server
4 core
64GB RAM 
200GB of SSD storage 
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD core
128GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD core
256GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you dont want this to count towards your quote you must chown

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on raid and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       ie if the hard drives fail or you delete something and notice 
       within 24 hours we can recover. Otherwise not. (vs home which is 
       properly backed up )  
intermediate analysis products that would be hard to recover should be stored here 
       e.g. stochastic analysis results that need to be kept so that paper 
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP

# fork example in python:
(for more detailed examples look at 
 https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py)

import os
import time
import random

import multiprocessing

class ProcessSafeOPStream( object ):
    def __init__( self, writeable_obj ):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name
        return
    
    def write( self, data ):
        self.lock.acquire()
        self.writeable_obj.write( data )
        self.writeable_obj.flush()
        self.lock.release()
        return
    
    def close( self ):
        self.writeable_obj.close()

def worker(queue, ofp):
    # Try without this
    random.seed()
    while True:
        i = queue.get()
        if i == 'FINISHED': return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        print i, x
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue
todo = multiprocessing.Queue()
for i in xrange(NSIMS): todo.put(i)
for i in xrange(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream( open("output.txt", "w") )

pids = []
for i in xrange(NPROC):
    pid = os.fork()
    if pid == 0:
       worker(todo, ofp)
       os._exit(0)
    else:
       pids.append(pid)  

for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print "FINISHED"

For use case 1 we obtained the following ENCODE and ROADMAP datasets https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as we as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.

Address of the bookmark: http://mitra.stanford.edu/

DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication

Jit — Tue, 14 Nov 2017 10:26:16 -0600

We developed a prokaryotic genome annotation pipeline, DFAST, that also supports genome submission to public sequence databases. DFAST was originally started as an on-line annotation server, and to date, over 7,000 jobs have been processed since its first launch in 2016. Here, we present a newly implemented background annotation engine for DFAST, which is also available as a standalone command-line program. The new engine can annotate a typical-sized bacterial genome within 10 minutes, with rich information such as pseudogenes, translation exceptions, and orthologous gene assignment between given reference genomes. In addition, the modular framework of DFAST allows users to customize the annotation workflow easily and will also facilitate extensions for new functions and incorporation of new tools in the future.

Availability and Implementation

The software is implemented in Python 3 and runs in both Python 2.7 and 3.4– on Macintosh and Linux systems. It is freely available at https://github.com/nigyta/dfast_core/ under the GPLv3 license with external binaries bundled in the software distribution. An on-line version is also available at https://dfast.nig.ac.jp/.

Address of the bookmark: https://dfast.nig.ac.jp/

mScaffolder: A comparative genome scaffolding tool

Jit — Fri, 15 Jun 2018 04:48:01 -0500

A comparative genome scaffolding tool based on MUMmer

mScaffolder scaffolds a genome using an existing high quality genome as the reference. It aligns the two genomes using nucmer utility from MUMmer and then orders and orients the contigs of the candidate genome guided by their alignments to the reference genome. Please send your questions and comments to mchakrab@uci.edu.

Citation https://www.nature.com/articles/s41588-017-0010-y

Address of the bookmark: https://github.com/mahulchak/mscaffolder

Genome U-Plot: a whole genome visualization

Rahul Nayak — Fri, 13 Jul 2018 19:50:41 -0500

Genome U-Plot for producing clear and intuitive graphs that allows researchers to generate novel insights and hypotheses by visualizing SVs such as deletions, amplifications, and chromoanagenesis events. The main features of the Genome U-Plot are its layered layout, its high spatial resolution and its improved aesthetic qualities.

https://github.com/gaitat/GenomeUPlot

Address of the bookmark: https://github.com/gaitat/GenomeUPlot

GRSR: a tool for deriving genome rearrangement scenarios from multiple unichromosomal genome sequences

Jit — Fri, 28 Sep 2018 09:35:10 -0500

GRSR is a Tool for Deriving Genome Rearrangement Scenarios for Multiple Uni-chromosomal Genomes. This tool will do the following steps:

Step 1. Run mugsy to get multiple sequence alignment results.
Step 2 & 3. Extraction of the Coordinates of Core Blocks, Construction of Synteny Blocks and Generating Signed Permutations.
Step 4. Generate pairwise genome rearrangement scenarios and find repeats at the breakpoints of each rearrangement events.

https://github.com/DanwangJessica/GRSR

Address of the bookmark: https://github.com/DanwangJessica/GRSR

Cogent: a tool for reconstructing the coding genome using high-quality full-length transcriptome sequences.

Jit — Tue, 18 Jun 2019 05:33:04 -0500

Cogent is a tool that identifies gene families and reconstructs the coding genome using high-quality transcriptome data without a reference genome, and can be used to check assemblies for the presence of these known coding sequences.

Cogent is a tool for reconstructing the coding genome using high-quality full-length transcriptome sequences. It is designed to be used on Iso-Seq data and in cases where there is no reference genome or the ref genome is highly incomplete.

See a recent presentation on Cogent being applied to the Cuttlefish Iso-Seq data.

Cogent preliminary draft paper (updated 2016Dec version), Supplementary

Please see wiki for details on usage.

Address of the bookmark: https://github.com/Magdoll/Cogent

5700 year-old human genome !

Jit — Thu, 19 Dec 2019 11:22:18 -0600

A Landmark in genomics, scientists have done something that hasn't been done ever.

Scientists have reconstructed the genome of an ancient human who lived nearly 5,700 years ago in Southern Denmark from the birch pitch- an ancient tar-like substance.

By sequencing the sample, researchers not only discovered the ancient human DNA but also microbial DNA reflecting the oral microbiome of the person who chewed the pitch, along with plant and animal DNA that could be the recent meal she might have consumed.

The DNA sample is comparable in quality to well-preserved teeth and skull bones. The DNA suggests that the chewer was a female, most likely with dark skin, dark brown hair and blue eyes.

https://www.nature.com/articles/s41467-019-13549-9

Artistic reconstruction. (Tom Björklund)

More at https://gizmodo.com/scientists-reconstruct-lola-after-finding-her-dna-in-1840481633

Genomicus: genome browser that enables users to navigate in genomes in several dimensions

Jit — Mon, 28 Feb 2022 23:27:37 -0600

Genomicus is a genome browser that enables users to navigate in genomes in several dimensions: linearly along chromosome axes, transversaly across different species, and chronologicaly along evolutionary time.

Once a query gene has been entered, it is displayed in its genomic context in parallel to the genomic context of all its orthologous and paralogous copies in all the other sequenced metazoan genomes. Moreover, Genomicus stores and displays the predicted ancestral genome structure in all the ancestral species within the phylogenetic range of interest.

All the data on extant species displayed in this browser are from Ensembl.

Summary statistics of Genomicus version 105.01: (view species tree in pdf or newick)


Number of extant species	200
Number of extant genes	4303993
Number of ancestral species	196
Number of ancestral genes	4624213
Number of ancestral synteny blocks	83342

Address of the bookmark: https://www.genomicus.bio.ens.psl.eu/genomicus-105.01/cgi-bin/search.pl

Perl one-liner for bioinformatician !!!

Abhimanyu Singh — Fri, 30 May 2014 05:49:07 -0500

With the emergence of NGS technologies, and sequencing data most of the bioinformaticians mung and wrangle around massive amounts of genomics text. There are several "standardized" file formats (FASTQ, SAM, VCF, etc.) and some tools for manipulating them (fastx toolkit, samtools, vcftools, etc.), there are still times where knowing a little bit of Perl onliner is extremely helpful.

Perl one-liners are small and awesome Perl programs that fit in a single line of code and they do one thing really well. These things include changing line spacing, numbering lines, doing calculations, converting and substituting text, deleting and printing certain lines, parsing logs, editing files in-place, doing statistics, carrying out system administration tasks, updating a bunch of files at once, and many more. Perl one-liners will make you the shell warrior. Anything that took you minutes to solve, will now take you seconds!

perl -pe '$\="\n"'
#double space a file

perl -pe '$_ .= "\n" unless /^$/'
#double space a file except blank lines

perl -pe '$_.="\n"x7'
#7 space in a line.

perl -ne 'print unless /^$/'
#remove all blank lines

perl -lne 'print if length($_) < 20'
#print all lines with length less than 20.

perl -00 -pe ''
#If there are multiple spaces, delete all leaving one(make the file a single spaced file).

perl -00 -pe '$_.="\n"x4'
#Expand single blank lines into 4 consecutive blank lines

perl -pe '$_ = "$. $_"'
#Number all lines in a file

perl -pe '$_ = ++$a." $_" if /./'
#Number only non-empty lines in a file

perl -ne 'print ++$a." $_" if /./'
#Number and print only non-empty lines in a file

perl -pe '$_ = ++$a." $_" if /regex/'
#Number only lines that match a pattern

perl -ne 'print ++$a." $_" if /regex/'
#Number and print only lines that match a pattern

perl -ne 'printf "%-5d %s", $., $_ if /regex/'
#Left align lines with 5 white spaces if matches a pattern (perl -ne 'printf "%-5d %s", $., $_' : for all the lines)

perl -le 'print scalar(grep{/./}<>)'
#prints the total number of non-empty lines in a file

perl -lne '$a++ if /regex/; END {print $a+0}'
#print the total number of lines that matches the pattern

perl -alne 'print scalar @F'
#print the total number fields(words) in each line.

perl -alne '$t += @F; END { print $t}'
#Find total number of words in the file

perl -alne 'map { /regex/ && $t++ } @F; END { print $t }'
#find total number of fields that match the pattern

perl -lne '/regex/ && $t++; END { print $t }'
#Find total number of lines that match a pattern

perl -le '$n = 20; $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $m'
#will calculate the GCD of two numbers.

perl -le '$a = $n = 20; $b = $m = 35; ($m,$n) = ($n,$m%$n) while $n; print $a*$b/$m'
#will calculate lcd of 20 and 35.

perl -le '$n=10; $min=5; $max=15; $, = " "; print map { int(rand($max-$min))+$min } 1..$n'
#Generates 10 random numbers between 5 and 15.

perl -le 'print map { ("a".."z",”0”..”9”)[rand 36] } 1..8'
#Generates a 8 character password from a to z and number 0 – 9.

perl -le 'print map { ("a",”t”,”g”,”c”)[rand 4] } 1..20'
#Generates a 20 nucleotide long random residue.

perl -le 'print "a"x50'
#generate a string of ‘x’ 50 character long

perl -le 'print join ", ", map { ord } split //, "hello world"'
#Will print the ascii value of the string hello world.

perl -le '@ascii = (99, 111, 100, 105, 110, 103); print pack("C*", @ascii)'
#converts ascii values into character strings.

perl -le '@odd = grep {$_ % 2 == 1} 1..100; print "@odd"'
#Generates an array of odd numbers.

perl -le '@even = grep {$_ % 2 == 0} 1..100; print "@even"'
#Generate an array of even numbers

perl -lpe 'y/A-Za-z/N-ZA-Mn-za-m/' file
#Convert the entire file into 13 characters offset(ROT13)

perl -nle 'print uc'
#Convert all text to uppercase:

perl -nle 'print lc'
#Convert text to lowercase:

perl -nle 'print ucfirst lc'
#Convert only first letter of first word to uppercas

perl -ple 'y/A-Za-z/a-zA-Z/'
#Convert upper case to lower case and vice versa

perl -ple 's/(\w+)/\u$1/g'
#Camel Casing

perl -pe 's|\n|\r\n|'
#Convert unix new lines into DOS new lines:

perl -pe 's|\r\n|\n|'
#Convert DOS newlines into unix new line

perl -pe 's|\n|\r|'
#Convert unix newlines into MAC newlines:

perl -pe '/regexp/ && s/foo/bar/'
#Substitute a foo with a bar in a line with a regexp.

Reference/Sources:

http://genomics-array.blogspot.in/2010/11/some-unixperl-oneliners-for.html

http://genomespot.blogspot.com/2013/08/a-selection-of-useful-bash-one-liners.html

http://biowize.wordpress.com/2012/06/15/command-line-magic-for-your-gene-annotations/

http://genomics-array.blogspot.com/2010/11/some-unixperl-oneliners-for.html

http://bioexpressblog.wordpress.com/2013/04/05/split-multi-fasta-sequence-file/

deepTools

Martin Jones — Sat, 08 Nov 2014 15:02:08 -0600

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. To do so, deepTools contains useful modules to process the mapped reads data to create coverage files in standard bedGraph and bigWig file formats. By doing so, deepTools allows the creation of normalized coverage files or the comparison between two files (for example, treatment and control). Finally, using such normalized and standardized files, multiple visualizations can be created to identify enrichments with functional annotations of the genome.

Publicaton: http://nar.oxfordjournals.org/content/early/2014/05/05/nar.gku365.full

Source Code and Wiki: https://github.com/fidelram/deepTools/wiki

Galaxy Tool Shed repository: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools

and example Galaxy workflows: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools_workflows