BOL: Related items

Comparative genomics educational material and papers bookmarks

Jit — Wed, 09 Nov 2016 16:23:30 -0600

Alignment of the porcine genome against seven other mammalian genomes (Supplementary Information) identified homologous synteny blocks (HSBs). Using porcine HSBs and stringent filtering criteria, 192 pig-specific evolutionary breakpoint regions (EBRs) were located. The number of porcine EBRs is comparable to the number of bovine-lineage-specific EBRs (100) reported earlier using a slightly lower resolution (500 kilobases (kb)), indicating that both lineages evolved with an average rate of ~2.1 large-scale rearrangements per million years after the divergence from a common cetartiodactyl ancestor ~60 Myr ago². This rate compares to ~1.9 rearrangements per million years within the primate lineage (Supplementary Table 11). A total of 20 and 18 cetartiodactyl EBRs (shared by pigs and cattle) were detected using the pig and human genomes as a reference, respectively.

Address of the bookmark: http://www.nature.com/nature/journal/v491/n7424/abs/nature11622.html

The Sheppard Lab

Fri, 09 Aug 2024 02:48:34 -0500

Ineos Oxford Institute of Antimicrobial Research – Department of Biology – University of Oxford

Our research centres on the use of genetics/genomics and phenotypic studies to address complex questions in the ecology, epidemiology and evolution of microbes. Our most recent interest focuses upon comparative genome analysis to describe the core and flexible genome of pathogenic bacteria (Campylobacter, Acinetobacter, Escherichia coli, Helicobacter, Staphylococcus and Streptococcus suis) and how this is related to population genetic structuring, the maintenance of species, and the evolution of host/niche adaptation and virulence.

More at https://sheppardlab.com/research/

Ancestral sequence reconstruction (ASR) or ancestral gene/sequence reconstruction/resurrection tools to study molecular evolution

Jit — Tue, 30 May 2017 04:20:05 -0500

Ancestral sequence reconstruction (ASR) – also known as ancestral gene/sequence reconstruction/resurrection – is a technique used in the study of molecular evolution. The method consists of the synthesis of an ancestral gene and expression of the corresponding ancestral protein. The idea of protein 'resurrection' was suggested in 1963 by Pauling and Zuckerkandl. Some early efforts were made in the 1980s and 1990s, led by the laboratory of Steven A. Benner, showing the potential of this technique – one that only started to be fulfilled in the post-genomic era. Thanks to improved algorithms and better sequencing and synthesis techniques, the method was developed further in the early 2000s to allow the resurrection of a greater variety of genes, including much more ancient ones. Over the last decade, ancestral protein resurrection has developed as a strategy to reveal the mechanisms and dynamics of protein evolution.

The following is a list of ancestral sequence reconstruction (ASR) tools:

inferCars

Reconstructs contiguous regions of an ancestral genome. Given information about adjacencies between conserved segments in each modern species, our goal is to infer segment order in the ancestral genome. To get a clean and precise statement of the problem, we formalize it using graph theory. We develop an algorithm that identifies a most parsimonious scenario for the history of each individual adjacency, although the whole-genome prediction is not guaranteed to optimize traditional measures like the number of breakpoints. We introduce weights to the graph edges to model the reliability of each adjacency.
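To illustrate what a most parsimonious scenario for a single adjacency can look like, here is a minimal sketch using Fitch's small-parsimony pass on a presence/absence character. This is not the inferCars algorithm itself (which also weights graph edges by reliability); the tree topology, species names, and function name are hypothetical.

```python
def fitch(tree, leaf_states):
    """Bottom-up Fitch pass on a rooted binary tree given as nested tuples.

    Returns (candidate state set at this node, number of changes below it).
    """
    if isinstance(tree, str):                 # leaf: a species name
        return {leaf_states[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, leaf_states)
    rset, rcost = fitch(right, leaf_states)
    common = lset & rset
    if common:                                # children agree: no extra change
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1     # children disagree: one change

# Hypothetical topology; 1 = the adjacency is observed in that species.
tree = (("human", "chimp"), ("mouse", "rat"))
has_adjacency = {"human": 1, "chimp": 1, "mouse": 1, "rat": 0}

root_states, changes = fitch(tree, has_adjacency)
print(root_states, changes)
```

Here the adjacency is inferred as present at the root with a single loss, on the branch leading to the one species that lacks it.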

ANGES: reconstructing ANcestral GEnomeS maps

A suite of Python programs that allows reconstructing ancestral genome maps from the comparison of the organization of extant-related genomes. ANGES can reconstruct ancestral genome maps for multichromosomal linear genomes and unichromosomal circular genomes. It implements methods inspired by techniques developed to compute physical maps of extant genomes.

Cocos

Constructs phylogenies of multi-domain proteins. With a given species tree and domain phylogenies, the procedure infers the composition of ancestral multi-domain proteins. Cocos implements and extends an algorithmic approach suggested by Behzadi and Vingron in an easy-to-use program. Such a method could be applied to the reconstruction of partial homologous units such as bacterial operons or protein complexes.

MySSP

Constructs an initial DNA sequence at the root of the tree and simulates evolution across the tree using a variety of common models of DNA evolution. MySSP is a program for the simulation of DNA sequence evolution across a phylogenetic tree. It is designed for large-scale studies, including simulation of multiple replicates and outputs sequences into NEXUS, MEGA, or FASTA formats. MySSP has a fairly simple graphical user interface (GUI) for basic use, but also has a specialized batch script interpreter to allow for more complicated or large-scale simulations.
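The general idea of simulating sequence evolution down a tree can be sketched as follows. This is not MySSP code: the Jukes-Cantor (JC69) model is chosen here only because it is the simplest common model, and the tree encoding, branch lengths, and function names are assumptions made for illustration.

```python
import math
import random

BASES = "ACGT"

def evolve_jc69(seq, branch_length, rng):
    """Evolve seq along one branch under JC69.

    branch_length is in expected substitutions per site. Under JC69 a site
    differs from its ancestor with probability 3/4 * (1 - exp(-4d/3)).
    """
    p_diff = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_diff:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(node, parent_seq, rng, leaves):
    """node is (leaf name, branch length) or (list of child nodes, branch length)."""
    label, branch_length = node
    seq = evolve_jc69(parent_seq, branch_length, rng)
    if isinstance(label, str):
        leaves[label] = seq
    else:
        for child in label:
            simulate(child, seq, rng, leaves)
    return leaves

rng = random.Random(1)                         # fixed seed for repeatability
root_seq = "".join(rng.choice(BASES) for _ in range(60))
tree = ([("A", 0.1), ([("B", 0.05), ("C", 0.05)], 0.05)], 0.0)  # hypothetical tree
leaves = simulate(tree, root_seq, rng, {})

# Write the simulated leaf sequences in FASTA format.
for name, seq in sorted(leaves.items()):
    print(">%s\n%s" % (name, seq))
```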

PARANA: Parsimonious Ancestral Reconstruction And Network Analysis

Performs parsimony-based inference of ancestral biological networks. Given multiple extant networks and phylogenetic information relating extant nodes, PARANA finds a parsimonious set of ancestral interaction events (edge gains and losses) which explain the extant networks. The framework adopted by PARANA is able to represent network evolution under models that support gene duplication and loss and independent interaction gain and loss. The method works on both directed and undirected networks and can incorporate asymmetric interaction gain and loss costs. In contrast to previous approaches, PARANA does not require knowing the relative ordering of unrelated duplication events and thus works on phylogenetic trees even where branch lengths are not provided.

GapAdj: Gapped Adjacencies

A synteny-based method that is flexible enough to handle a model of evolution involving whole genome duplication events, in addition to rearrangements, gene insertions, and losses. Ancestral relationships between markers are defined in terms of Gapped Adjacencies, i.e. pairs of markers separated by up to a given number of markers. It improves on a previous method restricted to direct adjacencies, which revealed a high accuracy for adjacency prediction, but with the drawback of being overly conservative, i.e. of generating a large number of contiguous ancestral regions (CARs).
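The definition of a gapped adjacency can be made concrete with a short sketch. It only enumerates the pairs for one marker order; it ignores marker orientation and is not the GapAdj reconstruction procedure. The function name and the `max_gap` parameter are hypothetical.

```python
def gapped_adjacencies(markers, max_gap):
    """Return unordered marker pairs separated by at most max_gap other markers.

    max_gap=0 yields only the direct adjacencies.
    """
    pairs = set()
    for i, a in enumerate(markers):
        # partners sit 1 .. max_gap + 1 positions to the right of position i
        for j in range(i + 1, min(i + max_gap + 2, len(markers))):
            pairs.add(tuple(sorted((a, markers[j]))))
    return pairs

order = [1, 2, 3, 4]
print(sorted(gapped_adjacencies(order, 0)))   # direct adjacencies only
print(sorted(gapped_adjacencies(order, 1)))   # also pairs with one marker between
```

Raising `max_gap` adds candidate pairs, which is what lets a method based on them be less conservative than one limited to direct adjacencies.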

ANCESTOR

A web server allowing one to easily and quickly perform the last three steps of the ancestral genome reconstruction procedure. Ancestors implements several alignment algorithms, an indel maximum likelihood solver and a context-dependent maximum likelihood substitution inference algorithm. The results presented by the server include the posterior probabilities for the last two steps of the ancestral genome reconstruction and the expected error rate of each ancestral base prediction.

ProCARs

Reconstructs ancestral gene orders as contiguous ancestral regions (CARs) with a progressive homology-based method. ProCARs runs from a phylogenetic tree (branch lengths are not needed) with a marked ancestor and a block file. This homology-based method is based on iteratively detecting and assembling ancestral adjacencies, while allowing some micro-rearrangements of synteny blocks at the extremities of the progressively assembled CARs. The method starts with a set of blocks as the initial set of CARs, and detects iteratively the potential ancestral adjacencies between extremities of CARs, while building up the CARs progressively by adding, at each step, new non-conflicting adjacencies that induce the least homoplasy. The species tree is used, in some additional internal steps, to compute a score for the remaining conflicting adjacencies, and to detect other reliable adjacencies, in order to reach completely assembled ancestral genomes.

FastML

A user-friendly tool for the reconstruction of ancestral sequences. FastML implements various novel features that differentiate it from existing tools: (i) FastML uses an indel-coding method, in which each gap, possibly spanning multiple sites, is coded as binary data. FastML then reconstructs ancestral indel states assuming a continuous time Markov process. FastML provides the most likely ancestral sequences, integrating both indels and characters; (ii) FastML accounts for uncertainty in ancestral states: it provides not only the posterior probabilities for each character and indel at each sequence position, but also a sample of ancestral sequences from this posterior distribution, and a list of the k-most likely ancestral sequences; (iii) FastML implements a large array of evolutionary models, which makes it generic and applicable for nucleotide, protein and codon sequences; and (iv) a graphical representation of the results is provided, including, for example, a graphical logo of the inferred ancestral sequences.
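The indel-coding idea in (i) can be sketched as follows: every distinct gap in the alignment becomes one binary character, and each sequence gets a 1 if it carries exactly that gap. This is a simplified illustration, not FastML's implementation (which, for example, may treat overlapping gaps differently); the function names and the toy alignment are hypothetical.

```python
def gap_intervals(seq):
    """Return half-open (start, end) intervals of maximal runs of '-' in seq."""
    intervals, start = [], None
    for pos, ch in enumerate(seq):
        if ch == "-" and start is None:
            start = pos
        elif ch != "-" and start is not None:
            intervals.append((start, pos))
            start = None
    if start is not None:                      # gap runs to the end of the row
        intervals.append((start, len(seq)))
    return intervals

def code_indels(alignment):
    """Map an alignment {name: row} to (gap characters, binary matrix)."""
    per_seq = {name: set(gap_intervals(row)) for name, row in alignment.items()}
    characters = sorted(set().union(*per_seq.values()))
    matrix = {name: [int(c in gaps) for c in characters]
              for name, gaps in per_seq.items()}
    return characters, matrix

alignment = {
    "s1": "ACG---TT",
    "s2": "ACG---TT",
    "s3": "ACGTACTT",
    "s4": "A-GTACT-",
}
characters, matrix = code_indels(alignment)
print(characters)
for name in sorted(matrix):
    print(name, matrix[name])
```

The three-site gap shared by s1 and s2 is a single character, so it counts as one event rather than three.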

maxAlike

Reconstructs a genomic sequence for a specific taxon based on sequence homologs in other species. The input is a multiple sequence alignment and a phylogenetic tree that also contains the target species. For this target species, the algorithm computes nucleotide probabilities at each sequence position. Consensus sequences are then reconstructed based on a certain confidence level.
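The last step, turning per-position nucleotide probabilities into a consensus at a chosen confidence level, might look like the sketch below. It assumes the probabilities are already computed and writes `N` wherever no base reaches the threshold; how maxAlike itself reports low-confidence positions is not stated here, so that choice and all names are assumptions.

```python
def consensus(position_probs, confidence):
    """Build a consensus from a list of {base: probability} dicts.

    A base is emitted only if its probability reaches the confidence level;
    otherwise the position is written as 'N'.
    """
    out = []
    for probs in position_probs:
        base, p = max(probs.items(), key=lambda kv: kv[1])
        out.append(base if p >= confidence else "N")
    return "".join(out)

# Hypothetical probabilities for three alignment positions.
probs = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.40, "C": 0.35, "G": 0.15, "T": 0.10},
    {"A": 0.02, "C": 0.02, "G": 0.95, "T": 0.01},
]
print(consensus(probs, 0.9))   # prints ANG
print(consensus(probs, 0.3))   # prints AAG
```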

MLGO: Maximum Likelihood for Gene Order Analysis

A web tool for the reconstruction of phylogeny and/or ancestral genomes from gene-order data. MLGO was designed for analysis of large-scale genomic changes including not only rearrangements but also gene insertions, deletions and duplications. MLGO can be used to infer a phylogeny from genome rearrangement and gene order data, and can also obtain an estimation of ancestral genomes, given an input tree. MLGO takes advantage of binary encoding of gene-order data, supports a fairly general model of genomic evolution (rearrangements plus duplications, insertions, and losses of genomic regions), and fits into the maximum likelihood framework.
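One common way to binary-encode gene-order data is to record, for each genome, which adjacencies between gene extremities are present. The sketch below shows that idea for linear genomes with signed, non-duplicated genes; it is an illustration under those assumptions, not MLGO's exact encoding, and the function names and toy genomes are hypothetical.

```python
def adjacencies(genome):
    """Adjacencies of a linear signed gene order, as pairs of gene extremities.

    Gene +g is read tail -> head, gene -g is read head -> tail.
    """
    def ends(g):
        name = abs(g)
        if g > 0:
            return (name, "t"), (name, "h")    # (left end, right end)
        return (name, "h"), (name, "t")
    adj = set()
    for a, b in zip(genome, genome[1:]):
        adj.add(frozenset((ends(a)[1], ends(b)[0])))
    return adj

def binary_encode(genomes):
    """Return (sorted adjacency columns, {genome name: 0/1 row})."""
    per_genome = {name: adjacencies(order) for name, order in genomes.items()}
    columns = sorted(set().union(*per_genome.values()), key=sorted)
    rows = {name: [int(c in adj) for c in columns]
            for name, adj in per_genome.items()}
    return columns, rows

genomes = {"G1": [1, 2, 3], "G2": [1, -2, 3], "G3": [1, 2, -3]}
columns, rows = binary_encode(genomes)
for name in sorted(rows):
    print(name, rows[name])
```

Each row is then an ordinary binary character sequence, so it can be handed to a likelihood method built for presence/absence data.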

Sidow Lab

Fri, 07 Dec 2018 09:06:30 -0600

We study mechanisms of cancer evolution by using state-of-the-art genomic approaches at the bench and in analysis. Accurate genome reconstruction is our other major area of interest. We also collaborate on important questions for which our expertise in genomics and computation is relevant. Arend's biosketch highlights some of our past contributions.

http://www.sidowlab.org/

TMRCA Calculator

BioStar — Wed, 03 Feb 2021 05:07:30 -0600

This program calculates the probability that two people have a certain number of generations between them, based on the standard infinite alleles formula of Walsh. It calculates both the probability of being at an exact number of generations back to the Most Recent Common Ancestor (MRCA) of a certain pair of people and the cumulative probability that the actual number of generations is less than a certain value. Note that the convention using generations is changed from an earlier version of this calculator which used "transmission events". It can list both result types in a table or graph. In either case the horizontal axis stops at the point where the cumulative probability reaches 95% or 10 generations, whichever is longer, or an absolute max of 50,000. Beyond 90% the calculation becomes inaccurate.
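As a rough sketch of the kind of calculation described, the code below assumes the infinite-alleles likelihood in its commonly stated form: after t generations a marker still matches with probability exp(-2μt), and the posterior over t is proportional to the binomial likelihood of the observed mismatches under a flat prior. The calculator's actual mutation rate, prior, and cutoffs are not given in the text, so every parameter value and function name here is an assumption.

```python
import math

def tmrca_posterior(n_markers, n_mismatches, mu, max_generations):
    """Discrete posterior over generations to the MRCA (flat prior on 1..max)."""
    weights = []
    for t in range(1, max_generations + 1):
        q = math.exp(-2.0 * mu * t)            # chance one marker still matches
        weights.append(q ** (n_markers - n_mismatches)
                       * (1.0 - q) ** n_mismatches)
    total = sum(weights)
    return [w / total for w in weights]

def generations_at(cumulative_target, posterior):
    """Smallest generation count whose cumulative probability reaches the target."""
    running = 0.0
    for t, p in enumerate(posterior, start=1):
        running += p
        if running >= cumulative_target:
            return t
    return len(posterior)

# Hypothetical inputs: 37 markers, mutation rate 0.004 per marker per generation.
exact_match = tmrca_posterior(37, 0, 0.004, 2000)
two_off = tmrca_posterior(37, 2, 0.004, 2000)
print(generations_at(0.95, exact_match), generations_at(0.95, two_off))
```

The per-generation list corresponds to the "exact number of generations" output and its running sum to the cumulative output; truncating at `max_generations` slightly affects the normalization.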

Address of the bookmark: https://clandonaldusa.org/index.php/tmrca-calculator

SRBreak: A Read-Depth and Split-Read Framework to Identify Breakpoints of Different Events Inside Simple Copy-Number Variable Regions

Jit — Tue, 15 May 2018 04:42:11 -0500

SRBreak is a read-depth and split-read package written in R for identifying copy-number variants in next-generation sequencing datasets. Note: SRBreak was designed to work with multiple samples. It can work with >= 2 samples, but we suggest that users use >= 5 samples, as in the work tested in our paper.

Address of the bookmark: https://github.com/hoangtn/SRBreak

mojolicious: a next generation web framework for the Perl programming language.

Jit — Fri, 12 Jan 2018 16:48:10 -0600

Back in the early days of the web, many people learned Perl because of a wonderful Perl library called CGI. It was simple enough to get started without knowing much about the language and powerful enough to keep you going, learning by doing was much fun. While most of the techniques used are outdated now, the idea behind it is not. Mojolicious is a new endeavor to implement this idea using bleeding edge technologies.

Features

- An amazing real-time web framework, allowing you to easily grow single file prototypes into well-structured MVC web applications.
  - Powerful out of the box with RESTful routes, plugins, commands, Perl-ish templates, content negotiation, session management, form validation, testing framework, static file server, CGI/PSGI detection, first class Unicode support and much more for you to discover.
- A powerful web development toolkit, that you can use for all kinds of applications, independently of the web framework.
  - Full stack HTTP and WebSocket client/server implementation with IPv6, TLS, SNI, IDNA, HTTP/SOCKS5 proxy, UNIX domain socket, Comet (long polling), Promises/A+, keep-alive, connection pooling, timeout, cookie, multipart and gzip compression support.
  - Built-in non-blocking I/O web server, supporting multiple event loops as well as optional pre-forking and hot deployment, perfect for building highly scalable web services.
  - JSON and HTML/XML parser with CSS selector support.
- Very clean, portable and object-oriented pure-Perl API with no hidden magic and no requirements besides Perl 5.24.0 (versions as old as 5.10.1 can be used too, but may require additional CPAN modules to be installed).
- Fresh code based upon years of experience developing Catalyst, free and open source.
- Hundreds of 3rd party extensions and high quality spin-off projects like the Minion job queue.

Address of the bookmark: http://mojolicious.org/

Biotite: A general framework for computational biology

Jit — Mon, 17 Dec 2018 18:52:27 -0600

The package is open source and freely available at GitHub (https://github.com/biotite-dev/biotite). It is simple to use, especially for beginners in programming, and computationally efficient because it builds on NumPy and Cython. Biotite consists of four subpackages: sequence, structure, database, and application. The sequence and structure subpackages serve the analysis of sequence and structure data, respectively; database downloads files from external databases such as RCSB PDB; and application provides interfaces to external software.

The Biotite package bundles popular tasks in computational biology into a unifying framework, which is easy to use on the one hand, but is also computationally efficient due to intensive usage of NumPy and Cython. This package focuses on working with sequence and structure data and supports various file formats and analysis and manipulation functions.

Address of the bookmark: https://github.com/biotite-dev/biotite

RosettaAntibodyDesign (RAbD): A general framework for computational antibody design

BioStar — Sun, 20 Sep 2020 06:03:42 -0500

RosettaAntibodyDesign (RAbD) is a generalized framework for the design of antibodies, in which a user can easily tailor the run to their project needs. The algorithm is meant to sample the diverse sequence, structure, and binding space of an antibody-antigen complex. It can be used for a multitude of project types, from de novo design to redesigns that improve binding affinity, optimize stability, or manipulate function.

The framework is based on rigorous bioinformatic analysis and is rooted in our recent clustering of antibody CDR regions. It uses the North/Dunbrack CDR definition as outlined in the North/Dunbrack clustering paper.

More at

https://www.rosettacommons.org/docs/latest/application_documentation/antibody/RosettaAntibodyDesign

https://bio-jade.readthedocs.io/en/latest/installation.html

Address of the bookmark: https://www.rosettacommons.org/docs/latest/application_documentation/antibody/RosettaAntibodyDesign

Public Databases for Bioinformatics!

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access at: dhara:5000
Has btsync server (try it - it's much better than Dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 Intel cores
256GB RAM
500GB of SSD storage
36TB RAID6 local storage
4 Intel Phis (space for 4 more GPUs)

durga: Montgomery and sensitive data
24 Intel cores
256GB RAM
500GB of SSD RAID0 storage
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 cores
256GB RAM
500GB of SSD RAID0 storage
36TB RAID6 local storage

vayu: Kundaje GPU server
4 cores
64GB RAM
200GB of SSD storage
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD cores
128GB RAM
200GB of SSD storage
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD cores
256GB RAM
200GB of SSD storage
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you don't want this to count towards your quota you must chown

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on RAID, and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       i.e. if the hard drives fail, or you delete something and notice
       within 24 hours, we can recover it; otherwise not (unlike home,
       which is properly backed up)
intermediate analysis products that would be hard to recover should be stored here
       e.g. stochastic analysis results that need to be kept so that paper
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP

# fork example in Python 3
# (for more detailed examples look at
#  https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py)

import os
import time
import random

import multiprocessing

class ProcessSafeOPStream(object):
    """Wrap a writable file so that writes from several processes do not interleave."""
    def __init__(self, writeable_obj):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name

    def write(self, data):
        # hold the lock across write and flush so each record reaches the file whole
        with self.lock:
            self.writeable_obj.write(data)
            self.writeable_obj.flush()

    def close(self):
        self.writeable_obj.close()

def worker(queue, ofp):
    # Try without this: on Python versions before 3.7 every forked child
    # inherits the same random state and so draws the same numbers
    random.seed()
    while True:
        i = queue.get()
        # the sentinel tells this worker that the queue is drained
        if i == 'FINISHED': return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        # flush explicitly because the child leaves through os._exit
        print(i, x, flush=True)
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue, followed by one sentinel per worker
todo = multiprocessing.Queue()
for i in range(NSIMS): todo.put(i)
for i in range(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream(open("output.txt", "w"))

pids = []
for i in range(NPROC):
    pid = os.fork()
    if pid == 0:
        # child: process work items, then exit without running parent cleanup
        worker(todo, ofp)
        os._exit(0)
    else:
        pids.append(pid)

# parent: wait for every child before closing the shared file
for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print("FINISHED")
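For comparison, the same queue-of-tasks pattern can be written with `multiprocessing.Pool`, which handles forking, task distribution, and shutdown itself. This is a minimal sketch in Python 3, not code from the grit repository; the function names, the output file name, and the task counts are assumptions.

```python
import multiprocessing
import random
import time

def simulate(i):
    """Stand-in for an expensive function; returns (task id, result)."""
    x = random.random()
    time.sleep(x / 1000)
    return i, x

def run(nsims, nproc, path):
    """Run nsims tasks on nproc worker processes and write results to path."""
    with multiprocessing.Pool(processes=nproc) as pool, open(path, "w") as ofp:
        # Results arrive in completion order. Only the parent writes to the
        # file, so no lock around the output stream is needed.
        for i, x in pool.imap_unordered(simulate, range(nsims)):
            ofp.write("%i\t%s\n" % (i, x))

if __name__ == "__main__":
    run(nsims=100, nproc=4, path="output_pool.txt")
    print("FINISHED")
```

Because the workers return their results to the parent instead of writing directly, the process-safe stream wrapper is unnecessary here; the explicit fork version remains useful when children must write large outputs themselves.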

For use case 1 we obtained the following ENCODE and ROADMAP datasets: https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz.

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as well as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.

Address of the bookmark: http://mitra.stanford.edu/