BOL: Related items

Troyanskaya Lab

Tue, 04 Feb 2020 06:40:36 -0600

The goal of our research is to interpret and distill this complexity through accurate analysis and modeling of molecular pathways, particularly those in which malfunctions lead to the manifestation of disease. We are inventing integrative methods for systems-level pathway modeling through integrative analysis of genome-scale datasets. We apply these approaches in studying challenging biological problems, such as how pathways function in diverse cell types and how they change dynamically.

https://function.princeton.edu/

Frequent parameters for bioinformatics tools !

BioStar — Tue, 27 Oct 2020 19:42:32 -0500

Third party executable parameters and options.

Trimmomatic

“ILLUMINACLIP:...:2:30:10”

“LEADING:15”

“TRAILING:15”

“SLIDINGWINDOW:4:20”

“MINLEN:20”

“TOPHRED33”

Filtlong

--min_length 500

--min_mean_q 85

--min_window_q 65

FastQ Screen

--aligner bowtie2 (bwa for PacBio)

--subset 1000 (for PacBio)

SPAdes

--careful

--disable-gzip-output

--cov-cutoff auto

--phred-offset 33

HGAP

Pbalign.task_options.min_accuracy: 70

Pbalign.task_options.no_split_subreads: false

Genomic_consensus.task_options.min_confidence: 40

falcon_ns.task_options.HGAP_GenomeLength_str: 6000000

Pbcoretools.task_options.read_length: 0

Genomic_consensus.task_options.use_score: 0

Pbalign.task_options.min_length: 50

Pbalign.task_options.algorithm_options: --minMatch 12 --bestn 10 --minPctSimilarity 70.0

Pbalign.task_options.hit_policy: randombest

Pbcoretools.task_options.other_filters: rq >= 0.7

Pbalign.task_options.concordant: false

Genomic_consensus.task_options.min_coverage: 5

falcon_ns.task_options.HGAP_SeedCoverage_str: 30

falcon_ns.task_options.HGAP_AggressiveAsm_bool: false

Genomic_consensus.task_options.algorithm: best

falcon_ns.task_options.HGAP_SeedLengthCutoff_str: -1

Genomic_consensus.task_options.diploid: false

MeDuSa

-random 100

Prokka

--usegenus

--force

--addgenes

--rfam

--rawproduct

cmsearch (taxonomy, 16S)

--rfam

--noali

blastn (taxonomy, 16S)

-evalue 1E-10

blastn (MLST)

-ungapped

-dust no

-evalue 1E-20

-word_size 32

-culling_limit 2

-perc_identity 95

blastp (VF)

-culling_limit 2

RGI (ABR)

--input_type contig

bowtie2 (mapping)

--sensitive

minimap2 (mapping)

-a

-x map-ont

samtools mpileup (SNP detection)

-uRI

bcftools call (SNP detection)

--variants-only

--skip-variants indels

--output-type v

--ploidy 1

-c

SnpSift filter (SNP detection)

"( QUAL >= 30 ) & (( na FILTER ) | (FILTER = 'PASS')) & ( DP >= 20 ) & ( MQ >= 20 )"
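
The filter expression above keeps a variant only when its quality is at least 30, its FILTER field is either missing or PASS, and both depth and mapping quality are at least 20. A rough Python sketch of that logic follows. It is not SnpSift itself: the function names and the two example records are made up for illustration, and it assumes that a FILTER of "." counts as missing and that DP and MQ are stored as key=value entries in the INFO column.

```python
def parse_info(info_field):
    """Turn a VCF INFO string such as 'DP=35;MQ=60' into a dict."""
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value
    return info


def passes_filter(vcf_line, min_qual=30.0, min_dp=20, min_mq=20.0):
    """Apply QUAL >= 30, FILTER missing or PASS, DP >= 20, MQ >= 20 to one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual, filt, info = fields[5], fields[6], parse_info(fields[7])
    if qual == "." or float(qual) < min_qual:
        return False
    # assumption: "." is how a missing FILTER value shows up in the file
    if filt not in (".", "PASS"):
        return False
    # assumption of this sketch: a record lacking DP or MQ is rejected
    if "DP" not in info or "MQ" not in info:
        return False
    return int(info["DP"]) >= min_dp and float(info["MQ"]) >= min_mq


# Hypothetical records: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
good = "chr1\t1042\t.\tA\tG\t88.0\tPASS\tDP=35;MQ=60"
low_depth = "chr1\t2077\t.\tC\tT\t45.0\t.\tDP=8;MQ=60"

print(passes_filter(good), passes_filter(low_depth))
```

The second record fails only on depth, which is the kind of call the DP threshold is meant to remove.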

SnpEff ann (SNP detection)

-nodownload

-no-intron

-no-downstream

-no SPLICE_SITE_REGION

-upDownStreamLen 250

bcftools consensus (phylogenetic tree)

--haplotype 1

fasttreemp

-nt

-boot 100

roary

-e

-n

-cd 100

-g 100000
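
One way to keep option sets like these in one place is to store them as data and assemble the command line from them. The sketch below does that for two of the tools listed above. The option values are copied from the list; the dictionary layout, the helper name `build_command` and the input file names are hypothetical, and nothing is executed.

```python
import shlex

# Option sets copied from the list above; everything else is illustrative.
TOOL_OPTIONS = {
    "filtlong": ["--min_length", "500", "--min_mean_q", "85", "--min_window_q", "65"],
    "minimap2": ["-a", "-x", "map-ont"],
}


def build_command(tool, inputs):
    """Return an argv list: executable, its stored options, then the input files."""
    return [tool] + TOOL_OPTIONS[tool] + list(inputs)


# Hypothetical input file names
filtlong_cmd = build_command("filtlong", ["reads.fastq.gz"])
minimap2_cmd = build_command("minimap2", ["reference.fasta", "reads.fastq.gz"])

# shlex.join quotes any argument that needs it, so the string is safe to paste into a shell
print(shlex.join(filtlong_cmd))
print(shlex.join(minimap2_cmd))
```

Keeping the arguments as a list, rather than a single string, means the same object can be handed to subprocess.run without going through a shell.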

Upset plots !

BioStar — Fri, 24 Mar 2023 22:30:23 -0500

UpSet plots are a type of visualization used to analyze the intersections of sets or categories. They are particularly useful for data with many categories whose overlaps need to be compared.

In an UpSet plot, each row of the central matrix represents a set, and each column represents one combination (intersection) of sets. Filled dots in a column, joined by a line, mark the sets that take part in that combination, and the bar above the column gives the number of elements in it. A second group of bars beside the rows gives the total size of each set.

UpSet plots help identify patterns and relationships between categories. They are often used in fields such as bioinformatics, for example to analyze gene expression data or to compare the results of different experimental conditions.
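
The numbers behind such a plot are simply counts per combination of sets. A minimal sketch of that bookkeeping in plain Python follows. It counts each element once, under the exact combination of sets it belongs to, which is one common convention and an assumption here. The gene and condition names are invented, and this is not the ComplexHeatmap code from the link below.

```python
def exclusive_intersections(sets):
    """Count elements per exact combination of set names they belong to."""
    membership = {}
    for name, members in sets.items():
        for element in members:
            membership.setdefault(element, set()).add(name)
    sizes = {}
    for names in membership.values():
        key = tuple(sorted(names))
        sizes[key] = sizes.get(key, 0) + 1
    return sizes


# Hypothetical gene lists from three conditions
gene_sets = {
    "condition_A": {"g1", "g2", "g3", "g4"},
    "condition_B": {"g3", "g4", "g5"},
    "condition_C": {"g4", "g6"},
}

sizes = exclusive_intersections(gene_sets)

# Largest combinations first, as the columns of the plot are often ordered
for combo, count in sorted(sizes.items(), key=lambda item: (-item[1], item[0])):
    print(" & ".join(combo), count)
```

Each printed line corresponds to one column of the plot, and the per-set totals (the side bars) are just the lengths of the input sets.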


Address of the bookmark: https://jokergoo.github.io/ComplexHeatmap-reference/book/upset-plot.html#example-with-the-genomic-regions

Interview with a bioinformatician series ...

Jitendra Narayan — Sat, 27 Dec 2014 13:30:15 -0600

The aim of this series is to interview some notable bioinformaticians and get their views on various aspects of bioinformatics research. Hopefully these answers will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

This series will be available at BOL every fortnight.

Simka and SimkaMin are comparative metagenomics methods dedicated to NGS datasets

Neel — Sat, 06 Jul 2019 13:56:10 -0500

Simka is a de novo comparative metagenomics tool. Simka represents each dataset as a k-mer spectrum and computes several classical ecological distances between them.
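
To make the idea concrete, here is a toy sketch of a k-mer spectrum and one distance computed on it. It is not Simka's implementation: Bray-Curtis is used only as one familiar ecological distance, and the reads, function names and k = 3 are chosen purely for illustration.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Count every k-mer occurring in a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(spectrum_a, spectrum_b):
    """Bray-Curtis dissimilarity: 1 - 2 * (sum of shared minimum counts) / (total counts)."""
    total = sum(spectrum_a.values()) + sum(spectrum_b.values())
    if total == 0:
        return 0.0
    shared = sum(min(spectrum_a[kmer], spectrum_b[kmer])
                 for kmer in spectrum_a.keys() & spectrum_b.keys())
    return 1.0 - 2.0 * shared / total


# Two tiny made-up "datasets"
sample_1 = ["ACGTAC", "GTACGT"]
sample_2 = ["ACGTAC", "TTTTTT"]

spectrum_1 = kmer_spectrum(sample_1, k=3)
spectrum_2 = kmer_spectrum(sample_2, k=3)
print(bray_curtis(spectrum_1, spectrum_2))  # 0.5
```

Half of the k-mer counts are shared between the two samples, which is why the dissimilarity comes out at 0.5. A real tool has to do the same counting for billions of k-mers across many datasets at once.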

Developer: Gaëtan Benoit, PhD, former member of the Genscale team at Inria.

Contact: claire dot lemaitre at inria dot fr

Simka and SimkaMin are comparative metagenomics methods dedicated to NGS datasets. https://gatb.inria.fr/software/simka/

Address of the bookmark: https://github.com/GATB/simka

Public Databases for Bioinformatics !

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access at: dhara:5000
Has btsync server (try it - it's much better than Dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 intel cores
256GB RAM
500GB of SSD storage 
36TB RAID6 local storage
4 Intel Phis (space for 4 more GPUs)


durga: Montgomery and sensitive data
24 intel cores
256GB RAM
500GB of SSD RAID0 storage 
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 core
256GB RAM 
500GB of SSD RAID0 storage 
36TB RAID6 local storage

vayu: Kundaje GPU server
4 core
64GB RAM 
200GB of SSD storage 
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD core
128GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD core
256GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you don't want this to count towards your quota you must chown

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on RAID, and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       i.e. if the hard drives fail, or you delete something and notice
       within 24 hours, we can recover it. Otherwise not (vs. home, which is
       properly backed up).
intermediate analysis products that would be hard to recover should be stored here 
       e.g. stochastic analysis results that need to be kept so that paper 
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP

# fork example in python (Python 3):
# (for more detailed examples look at
#  https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py)

import os
import time
import random

import multiprocessing

class ProcessSafeOPStream(object):
    """Wrap a file-like object so writes from several processes do not interleave."""
    def __init__(self, writeable_obj):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name

    def write(self, data):
        # hold the lock across write and flush so each record is written whole
        with self.lock:
            self.writeable_obj.write(data)
            self.writeable_obj.flush()

    def close(self):
        self.writeable_obj.close()

def worker(queue, ofp):
    # Try without this: forked children inherit the parent's RNG state, and on
    # older Pythons every worker then draws the same numbers
    random.seed()
    while True:
        i = queue.get()
        if i == 'FINISHED': return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        print(i, x)
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue, followed by one 'FINISHED' sentinel per worker
todo = multiprocessing.Queue()
for i in range(NSIMS): todo.put(i)
for i in range(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream(open("output.txt", "w"))

pids = []
for i in range(NPROC):
    pid = os.fork()
    if pid == 0:
        # child: process work items, then exit without running parent cleanup
        worker(todo, ofp)
        os._exit(0)
    else:
        pids.append(pid)

# wait for every child before closing the shared output file
for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print("FINISHED")

For use case 1 we obtained the following ENCODE and ROADMAP datasets https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz.

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as well as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.

Address of the bookmark: http://mitra.stanford.edu/

Ribbon: Visualizing complex genome alignments and structural variation:

Jit — Wed, 29 Nov 2017 07:40:22 -0600

Ribbon can be used for long reads, short reads, paired-end reads, and assembly/genome alignments. Instructions for each data format are available by clicking on "instructions" in each tab on the right.

Local installation:

You can install Ribbon locally from Github by following the instructions here: https://github.com/MariaNattestad/Ribbon

Address of the bookmark: http://genomeribbon.com/

Mash: fast genome and metagenome distance estimation using MinHash

Jit — Tue, 12 Dec 2017 17:30:12 -0600

Mash is normally distributed as a dependency-free binary for Linux or OSX (see https://github.com/marbl/Mash/releases). This source distribution is intended for other operating systems or for development. Mash requires C++11 to build, which is available in GCC >= 4.8 and OSX >= 10.7.

See http://mash.readthedocs.org for more information.
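
For readers new to the MinHash idea named in the title, the toy sketch below shows the general technique: hash every k-mer, keep only the smallest hash values as a sketch, and estimate the Jaccard similarity of two sequences from their sketches. This is an illustration under stated assumptions and not Mash's code: it uses a truncated SHA-1 from the standard library as a stand-in hash, ignores reverse complements, and the sequences, k and sketch size are made up. The distance formula -ln(2j / (1 + j)) / k is the one commonly quoted for Mash.

```python
import hashlib
import math


def kmer_set(seq, k):
    """All distinct k-mers of a sequence (no reverse-complement handling)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Stand-in 64-bit hash built from SHA-1, chosen only because it is in the stdlib."""
    return int.from_bytes(hashlib.sha1(kmer.encode("ascii")).digest()[:8], "big")


def bottom_sketch(seq, k, size):
    """Keep the `size` smallest hash values among the sequence's k-mers."""
    return sorted(hash64(kmer) for kmer in kmer_set(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Fraction of the `size` smallest merged hashes that occur in both sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    in_both = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in in_both) / len(merged)


def mash_style_distance(jaccard, k):
    """Convert a Jaccard estimate to a distance; 1.0 when nothing is shared."""
    if jaccard <= 0.0:
        return 1.0
    if jaccard >= 1.0:
        return 0.0
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


# Two made-up sequences differing near the 3' end
genome_a = "ACGTTGCATGCCGATAGCTAGGCTAACG"
genome_b = "ACGTTGCATGCCGATAGCTAGGCTTTCG"
K, SIZE = 5, 8

j = jaccard_estimate(bottom_sketch(genome_a, K, SIZE),
                     bottom_sketch(genome_b, K, SIZE), SIZE)
print(j, mash_style_distance(j, K))
```

The point of the sketch size is the trade-off it controls: a fixed, small number of hashes per genome keeps comparisons cheap, at the cost of a noisier estimate.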

Address of the bookmark: https://github.com/marbl/Mash/releases

GIGGLE: a search engine for large-scale integrated genome analysis

Jit — Wed, 10 Jan 2018 03:10:45 -0600

GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.

https://www.nature.com/articles/nmeth.4556

Address of the bookmark: https://github.com/ryanlayer/giggle

GenomeTools: The versatile open source genome analysis software

Jit — Wed, 07 Feb 2018 10:44:18 -0600

The GenomeTools genome analysis system is a free collection of bioinformatics tools (in the realm of genome informatics) combined into a single binary named gt. It is based on a C library named “libgenometools” which consists of several modules.

If you are interested in gene prediction, have a look at GenomeThreader.

Address of the bookmark: http://genometools.org/