BOL: Related items

Entire Human Genome Sequencing !

LEGE — Tue, 02 Apr 2024 01:19:29 -0500

Cost-effective whole human genome sequencing has revolutionized the landscape of genetic research and personalized medicine by making comprehensive genetic analysis accessible to a wider population. Through advancements in sequencing technologies, such as next-generation sequencing (NGS), costs have significantly decreased, enabling researchers and healthcare providers to analyze an individual's complete genetic makeup with greater efficiency and affordability. This has profound implications for disease diagnosis, prognosis, and treatment, as it allows for the identification of genetic predispositions and the customization of healthcare interventions based on an individual's unique genetic profile. Moreover, as the cost continues to decline, the potential for population-scale genomic studies and large-scale screening programs becomes increasingly feasible, promising to further enhance our understanding of human genetics and improve healthcare outcomes on a global scale.

Here are a few companies:

https://mynucleus.com/

https://myome.com/

https://nebula.org/whole-genome-sequencing-dna-test/

piRNA and Bioinformatics: Decoding the Guardians of the Genome

LEGE — Sat, 07 Dec 2024 02:15:11 -0600

In the symphony of small RNAs, PIWI-interacting RNAs (piRNAs) stand out as the protectors of genomic integrity. These small, non-coding RNAs play critical roles in silencing transposable elements, regulating gene expression, and maintaining germline stability. The rise of bioinformatics has revolutionized our understanding of piRNAs, enabling researchers to decipher their biogenesis, functions, and evolutionary significance.

What Are piRNAs?

piRNAs are the largest class of small non-coding RNAs, typically 24–32 nucleotides in length. Unlike microRNAs (miRNAs) and small interfering RNAs (siRNAs), piRNAs do not rely on Dicer enzymes for maturation. Instead, they are processed from long single-stranded precursors and associate with PIWI proteins, a subclass of the Argonaute protein family.
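The length range above is often used as a first-pass filter on small-RNA reads. A minimal sketch, assuming reads are already adapter-trimmed and held as plain strings; the function name and example reads are hypothetical, and length alone does not prove a read is a piRNA:

```python
def is_pirna_length(seq, min_len=24, max_len=32):
    """Return True if the read length falls in the 24-32 nt range quoted above."""
    return min_len <= len(seq) <= max_len


# Hypothetical trimmed reads of 22, 29 and 40 nucleotides.
reads = [
    "TGAGGTAGTAGGTTGTATAGTT",   # 22 nt, miRNA-sized
    "T" + "ACGT" * 7,           # 29 nt, inside the piRNA range
    "ACGT" * 10,                # 40 nt, too long
]

# Keep only reads whose length is compatible with piRNAs.
candidates = [r for r in reads if is_pirna_length(r)]
print(len(candidates), "of", len(reads), "reads kept")
```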

The primary functions of piRNAs include:

Silencing Transposable Elements: By targeting transposons, piRNAs prevent genomic instability, particularly in germline cells.
Regulating Gene Expression: piRNAs modulate gene expression at transcriptional and post-transcriptional levels.
Epigenetic Modulation: They guide epigenetic modifications, such as DNA methylation, to specific genomic loci.

Challenges in piRNA Research

Studying piRNAs is fraught with challenges, including:

Short Length: Their small size complicates sequencing and alignment.
Lack of Sequence Conservation: Unlike miRNAs, piRNAs exhibit limited sequence conservation across species.
Complex Biogenesis: The intricate pathways of piRNA generation require sophisticated computational tools to unravel.

Bioinformatics: Illuminating the World of piRNAs

Bioinformatics has emerged as an indispensable tool for studying piRNAs, facilitating their discovery, annotation, and functional analysis. Here's how bioinformatics is transforming piRNA research:

1. Identification and Annotation

The discovery of piRNAs relies on next-generation sequencing (NGS) data. Bioinformatics tools such as piRNApredictor and Piano identify piRNA clusters and predict potential targets. Databases like piRBase and piRNAdb curate information about known piRNAs, their sequences, and associated proteins.

2. Mapping and Alignment

piRNAs often originate from repetitive regions, making their alignment challenging. Tools like Bowtie and STAR handle the unique mapping requirements of piRNAs, enabling accurate identification of piRNA clusters in genomes.
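One common way to cope with reads that map to many repetitive loci is to split each read's weight evenly across its hits before looking for dense windows. The sketch below illustrates that idea only; it is not what Bowtie or STAR do internally, it assumes the alignments have already been parsed into (chromosome, position) pairs, and the window size and threshold are arbitrary placeholders:

```python
from collections import defaultdict


def windowed_coverage(alignments, window=1000):
    """Sum fractional read weights per (chromosome, window index).

    alignments maps a read id to a list of (chrom, pos) hits. Each read
    contributes a total weight of 1, split evenly across its hits.
    """
    cov = defaultdict(float)
    for hits in alignments.values():
        if not hits:
            continue
        weight = 1.0 / len(hits)
        for chrom, pos in hits:
            cov[(chrom, pos // window)] += weight
    return dict(cov)


def candidate_clusters(cov, min_weight=2.0):
    """Return windows whose summed weight reaches the (arbitrary) threshold."""
    return sorted(key for key, value in cov.items() if value >= min_weight)


# Hypothetical parsed alignments: r3 and r4 are multi-mappers.
alignments = {
    "r1": [("chr1", 100)],
    "r2": [("chr1", 250)],
    "r3": [("chr1", 900), ("chr2", 5000)],
    "r4": [("chr2", 5100), ("chr1", 300)],
}
cov = windowed_coverage(alignments)
print(candidate_clusters(cov))
```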

3. Functional Analysis

Bioinformatics approaches predict piRNA functions by analyzing their interactions with transposons, genes, and epigenetic marks. Algorithms such as TargetFinder and RIblast explore piRNA-mRNA interactions, shedding light on regulatory networks.
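At its simplest, target prediction looks for transcript sites complementary to part of the piRNA. The sketch below is a toy exact-match search, not the algorithm used by TargetFinder or RIblast; the choice of positions 2 to 10 as the "seed" is an assumption made for illustration, and the sequences are made up:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_seed_sites(pirna, transcript, seed_start=1, seed_end=10):
    """Return 0-based transcript positions that are perfectly complementary
    to pirna[seed_start:seed_end] (an assumed seed region)."""
    site = reverse_complement(pirna[seed_start:seed_end])
    positions = []
    start = transcript.find(site)
    while start != -1:
        positions.append(start)
        start = transcript.find(site, start + 1)
    return positions


# Hypothetical piRNA and a transcript built to contain one matching site.
pirna = "TACGTTAGCCATGGATCCGATTACGGA"
transcript = "AAAA" + reverse_complement(pirna[1:10]) + "CCCC"
print(find_seed_sites(pirna, transcript))
```

Real tools also score mismatches, G:U wobble pairs and binding energy, none of which this sketch attempts.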

4. Evolutionary Studies

piRNAs are evolutionarily diverse, reflecting their roles in species-specific genomic defense. Comparative genomics tools help trace the evolution of piRNA clusters and their associated PIWI proteins across species.

5. Epigenomic Insights

piRNAs are key players in epigenetic regulation. Bioinformatics pipelines integrate piRNA data with chromatin immunoprecipitation sequencing (ChIP-seq) and DNA methylation data to uncover their role in shaping the epigenome.

Case Study: piRNAs in Germline Integrity

One of the hallmark functions of piRNAs is the suppression of transposable elements in the germline. For example, in Drosophila melanogaster, piRNAs target retrotransposons like gypsy and copia. Bioinformatics analyses revealed that these piRNAs guide PIWI proteins to transposon-derived RNA, ensuring genome stability during gametogenesis.

Clinical Relevance of piRNAs

Recent studies suggest that piRNAs may serve as biomarkers for diseases such as cancer, infertility, and neurodegenerative disorders. For instance:

Cancer: Dysregulated piRNA expression has been linked to tumorigenesis, making them potential targets for cancer therapies.
Infertility: Aberrant piRNA pathways are implicated in male infertility due to their role in spermatogenesis.
Neurodegeneration: piRNAs may regulate neuronal gene expression, highlighting their potential in neurological research.

Future Directions

The integration of bioinformatics with emerging technologies offers exciting opportunities for piRNA research:

Single-Cell Sequencing: Unveiling cell-specific piRNA expression and function.
Machine Learning: Predicting piRNA functions and targets with greater accuracy.
CRISPR-Based Tools: Editing piRNA clusters to explore their roles in vivo.

Conclusion

piRNAs are the unsung guardians of the genome, safeguarding genetic material from transposable elements and contributing to gene regulation and epigenetic programming. Bioinformatics has opened the floodgates of discovery, unraveling the complexities of piRNAs and their myriad roles in biology and disease.

As we continue to decode the piRNA landscape, these small RNAs promise to unveil big secrets about genome stability, evolution, and human health, cementing their place as a fascinating frontier in molecular biology.

NVIDIA and Arc Institute Unveil Evo 2: A Breakthrough AI for DNA Design

BioStar — Fri, 21 Feb 2025 10:39:47 -0600

NVIDIA and the Arc Institute have introduced Evo 2, a groundbreaking AI model designed to understand, predict, and generate DNA sequences. This marks a major advancement in computational biology, offering scientists an unprecedented tool to decode the genetic blueprint of life and even design entirely new biological systems.

The Power of Evo 2: AI Meets DNA

Evo 2 is described by its developers as the largest AI model for biology to date, trained on 9.3 trillion DNA "letters" (nucleotides) carefully selected from genomes spanning the entire tree of life. This massive dataset is meant to let Evo 2 recognize patterns and relationships in genetic sequences at an unparalleled scale.

For the first time, scientists can design DNA with AI, moving beyond simple sequence analysis to active DNA generation. Evo 2 enables researchers to predict, modify, and even create entire genetic sequences, opening new possibilities in medicine, agriculture, and synthetic biology.

Decoding the Dark Genome

One of the biggest challenges in genetics is understanding the non-coding regions of DNA—vast stretches of the genome that do not code for proteins but play crucial roles in regulating gene expression. These regions control when and how genes are activated, influencing everything from development to disease.

Evo 2 is designed to decode these non-coding elements, helping researchers uncover their functions and use this knowledge to develop gene-based therapies, synthetic life forms, and precision agriculture solutions.

From Reading DNA to Writing It

To put Evo 2’s impact into perspective:

Previous AI models could "read" DNA like a book, analyzing genetic sequences and identifying patterns.
Evo 2 can "write" entirely new DNA, designing functional genes, chromosomes, and even full genomes from scratch.

This means scientists can now engineer biological systems with AI, designing new proteins, metabolic pathways, and genetic circuits to address real-world challenges.
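The difference between "reading" and "writing" can be illustrated with a deliberately tiny stand-in. The sketch below is a k-mer Markov chain, which is not how Evo 2 works (Evo 2 is a large neural network); it only shows that the same sequence model can score an existing sequence and sample a new one. All names and the training sequence are hypothetical:

```python
import math
import random
from collections import Counter, defaultdict


def train(seqs, k=3):
    """Count which base follows each k-mer context in the training sequences."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - k):
            counts[s[i:i + k]][s[i + k]] += 1
    return counts


def log_likelihood(model, seq, k=3, alpha=1.0):
    """'Read': score a sequence, with add-alpha smoothing over A, C, G, T."""
    total = 0.0
    for i in range(len(seq) - k):
        following = model.get(seq[i:i + k], Counter())
        prob = (following[seq[i + k]] + alpha) / (sum(following.values()) + 4 * alpha)
        total += math.log(prob)
    return total


def generate(model, seed, length, k=3, rng=None):
    """'Write': extend the seed one base at a time by sampling from the model."""
    rng = rng or random.Random(0)
    out = seed
    while len(out) < length:
        following = model.get(out[-k:])
        if not following:
            out += rng.choice("ACGT")  # unseen context: fall back to uniform
        else:
            bases, weights = zip(*following.items())
            out += rng.choices(bases, weights=weights)[0]
    return out


model = train(["ACGTACGTACGT"])
print(log_likelihood(model, "ACGTACGT"), log_likelihood(model, "AAAAAAAA"))
print(generate(model, "ACG", 12))
```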

A Step Toward Generative Biology

The Arc Institute describes Evo 2 as a major step toward "generative biology"—a revolutionary approach where AI is used to create novel biological structures rather than just analyzing existing ones. This could lead to breakthroughs such as:

New medicines: AI-generated enzymes and proteins tailored for targeted therapies.
Disease-resistant crops: Genetically optimized plants for higher yield and climate resilience.
Synthetic organisms: Custom-designed microbes for bioremediation, biofuel production, and industrial applications.

An Open-Source Revolution

Unlike many proprietary AI models, Evo 2 is open source, making its capabilities accessible to researchers worldwide. This democratization of AI-driven biology means that scientists from different disciplines can collaborate, experiment, and innovate, accelerating discoveries in genetic engineering and synthetic biology.

With Evo 2, the boundaries of what’s possible in DNA design, genetic engineering, and biological innovation are being redrawn. The future of life sciences is no longer just about understanding life’s code—it’s about writing it.

maftools : Summarize, Analyze and Visualize MAF Files

Neel — Wed, 23 Dec 2020 05:29:33 -0600

With advances in cancer genomics, the Mutation Annotation Format (MAF) has become widely accepted for storing detected somatic variants. The Cancer Genome Atlas (TCGA) project has sequenced over 30 different cancers, with a sample size of over 200 for each cancer type, and the resulting somatic variant data are stored in MAF. The maftools package attempts to summarize, analyze, annotate and visualize MAF files efficiently, whether they come from TCGA or from in-house studies, as long as the data are in MAF format.
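To give a feel for what "summarizing" a MAF file involves, here is a minimal sketch in plain Python. It is not maftools (which is an R/Bioconductor package) and does not reproduce its API; it only assumes a tab-delimited MAF with the standard Hugo_Symbol, Tumor_Sample_Barcode and Variant_Classification columns. The sample barcodes and rows are made up:

```python
import csv
import io
from collections import Counter


def summarize_maf(handle):
    """Count variants per gene, per sample and per classification."""
    # MAF is tab-delimited; lines starting with '#' (e.g. a version line) are comments.
    data_lines = (line for line in handle if not line.startswith("#"))
    per_gene, per_sample, per_class = Counter(), Counter(), Counter()
    for row in csv.DictReader(data_lines, delimiter="\t"):
        per_gene[row["Hugo_Symbol"]] += 1
        per_sample[row["Tumor_Sample_Barcode"]] += 1
        per_class[row["Variant_Classification"]] += 1
    return per_gene, per_sample, per_class


# A tiny in-memory MAF with hypothetical samples, standing in for a real file.
example = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
    "TP53\tSAMPLE-01\tMissense_Mutation\n"
    "TP53\tSAMPLE-02\tNonsense_Mutation\n"
    "KRAS\tSAMPLE-01\tMissense_Mutation\n"
)
per_gene, per_sample, per_class = summarize_maf(example)
print(per_gene.most_common())
```

For a real file, pass an open text handle instead of the StringIO object.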

Address of the bookmark: https://www.bioconductor.org/packages/release/bioc/vignettes/maftools/inst/doc/maftools.html

GSP4PDB: a web tool to visualize, search and explore protein-ligand structural patterns

Neel — Sun, 15 Mar 2020 03:41:12 -0500

GSP4PDB is a user-friendly and efficient application to search and discover new patterns of protein-ligand interaction.

GSP4PDB is part of the services provided by the Bioinformatic Group of the University of Talca.

http://gdblab.com/gsp4pdb/gsp4pdb2/

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3352-x

Address of the bookmark: http://gdblab.com/gsp4pdb/gsp4pdb2/

Public Databases for Bioinformatics !

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access at: dhara:5000
Has a btsync server (try it, it's much better than Dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 intel cores
256GB RAM
500GB of SSD storage 
36TB RAID6 local storage
4 Intel Phis (space for 4 more GPUs)


durga: Montgomery and sensitive data
24 intel cores
256GB RAM
500GB of SSD RAID0 storage 
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 core
256GB RAM 
500GB of SSD RAID0 storage 
36TB RAID6 local storage

vayu: Kundaje GPU server
4 core
64GB RAM 
200GB of SSD storage 
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD core
128GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD core
256GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you don't want this to count towards your quota you must chown

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on raid and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       i.e. if the hard drives fail or you delete something and notice 
       within 24 hours we can recover. Otherwise not. (vs home which is 
       properly backed up )  
intermediate analysis products that would be hard to recover should be stored here 
       e.g. stochastic analysis results that need to be kept so that paper 
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP

# fork example in python:
# (for more detailed examples look at
#  https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py)

import os
import time
import random

import multiprocessing

class ProcessSafeOPStream( object ):
    def __init__( self, writeable_obj ):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name
        return
    
    def write( self, data ):
        self.lock.acquire()
        self.writeable_obj.write( data )
        self.writeable_obj.flush()
        self.lock.release()
        return
    
    def close( self ):
        self.writeable_obj.close()

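def worker(queue, ofp):
    # re-seed in each child; without this every forked process inherits
    # the parent's random state and produces the same sequence (try it)
    random.seed()
    while True:
        i = queue.get()
        if i == 'FINISHED': return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        # flush explicitly: os._exit() below skips the normal stdout flush
        print(i, x, flush=True)
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue, followed by one 'FINISHED' sentinel per worker
todo = multiprocessing.Queue()
for i in range(NSIMS): todo.put(i)
for i in range(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream( open("output.txt", "w") )

pids = []
for i in range(NPROC):
    pid = os.fork()
    if pid == 0:
        # child: process work items, then exit without running parent cleanup
        worker(todo, ofp)
        os._exit(0)
    else:
        pids.append(pid)

# parent: wait for every child before closing the output file
for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print("FINISHED")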

For use case 1 we obtained the following ENCODE and ROADMAP datasets https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz.

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as well as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.
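The ENCODE links above all follow the same pattern, so they can be rebuilt from the accession and file extension. A small sketch, assuming that pattern holds for the listed files; the helper names are hypothetical, and the download function is defined but not called here:

```python
import urllib.request


def encode_download_url(accession, extension):
    """Build a download URL following the pattern of the links listed above."""
    return (
        "https://www.encodeproject.org/files/"
        f"{accession}/@@download/{accession}.{extension}"
    )


def fetch(accession, extension):
    """Download one file into the current directory (needs network access)."""
    filename = f"{accession}.{extension}"
    urllib.request.urlretrieve(encode_download_url(accession, extension), filename)
    return filename


# Accessions and extensions for use case 1, copied from the text.
use_case_1 = [("ENCFF446WOD", "bed.gz"), ("ENCFF546PJU", "bam"), ("ENCFF059BEU", "bam")]
urls = [encode_download_url(acc, ext) for acc, ext in use_case_1]
for url in urls:
    print(url)
```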

Address of the bookmark: http://mitra.stanford.edu/

Ribbon: Visualizing complex genome alignments and structural variation:

Jit — Wed, 29 Nov 2017 07:40:22 -0600

Ribbon can be used for long reads, short reads, paired-end reads, and assembly/genome alignments. Instructions for each data format are available by clicking on "instructions" in each tab on the right.

Local installation:

You can install Ribbon locally from GitHub by following the instructions here: https://github.com/MariaNattestad/Ribbon

Address of the bookmark: http://genomeribbon.com/

jobTree based python wrapper to run the genome simulation tool suite Evolver

Jit — Fri, 08 Dec 2017 16:26:32 -0600

evolverSimControl (eSC) can be used to simulate multi-chromosome genome evolution on an arbitrary phylogeny (Newick format). In addition to simply running evolver, eSC also automatically creates statistical summaries of the simulation as it runs including text and image files. Also included are convenience scripts to: check on a running simulation and see detailed status and logging information; extract fasta sequence files from the leaf nodes of a completed simulation; extract pairwise multiple alignment files (.maf) from leaf and branch nodes from a completed simulation and with the help of mafJoin, join them together into a single maf covering the entire simulation.
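Since the phylogeny is supplied in Newick format, it can help to check which leaf genomes a tree defines before starting a long simulation. The sketch below is a rough regex-based helper and is not part of eSC; it assumes unquoted labels and a tree with at least one pair of parentheses, and the example tree is made up:

```python
import re


def leaf_names(newick):
    """Return leaf labels from a Newick string.

    A leaf label is a name that directly follows '(' or ','. Internal node
    labels follow ')' and are therefore skipped. Quoted labels are not handled.
    """
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


# Hypothetical three-leaf phylogeny with branch lengths and internal labels.
tree = "((human:0.1,chimp:0.1)anc1:0.2,mouse:0.5)root;"
print(leaf_names(tree))
```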

Address of the bookmark: https://github.com/dentearl/evolverSimControl

Mash: fast genome and metagenome distance estimation using MinHash

Jit — Tue, 12 Dec 2017 17:30:12 -0600

Mash is normally distributed as a dependency-free binary for Linux or OSX (see https://github.com/marbl/Mash/releases). This source distribution is intended for other operating systems or for development. Mash requires C++11 to build, which is available in GCC >= 4.8 and OSX >= 10.7.

See http://mash.readthedocs.org for more information.
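The MinHash idea behind Mash can be sketched in a few lines: keep only the s smallest k-mer hashes of each sequence, estimate the Jaccard index from those sketches, and convert it to a distance. This is an illustration rather than Mash's implementation; it uses BLAKE2 from the standard library instead of Mash's hash function, ignores reverse-complement (canonical) k-mers, and the default k and s values are placeholders:

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Hash a k-mer to a 64-bit integer (stand-in for Mash's hash function)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def sketch(seq, k=21, s=1000):
    """Bottom-s sketch: the s smallest k-mer hashes."""
    return sorted(hash64(km) for km in kmers(seq, k))[:s]


def jaccard_estimate(sk_a, sk_b, s=1000):
    """Fraction of the s smallest hashes of the union found in both sketches."""
    a, b = set(sk_a), set(sk_b)
    merged = sorted(a | b)[:s]
    if not merged:
        return 0.0
    return sum(1 for h in merged if h in a and h in b) / len(merged)


def mash_distance(j, k=21):
    """Mash distance D = -(1/k) * ln(2j / (1 + j)); 1.0 when nothing is shared."""
    if j <= 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


seq = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGCCGATCGATTACG"
same = jaccard_estimate(sketch(seq, k=5), sketch(seq, k=5))
print(same, mash_distance(same, k=5))
```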

Address of the bookmark: https://github.com/marbl/Mash/releases

GIGGLE: a search engine for large-scale integrated genome analysis

Jit — Wed, 10 Jan 2018 03:10:45 -0600

GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.
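The core query, "which interval files share the most loci with my features", can be sketched with sorted lists and binary search. This is a simplified illustration and not GIGGLE's index or its significance statistics: it ranks by raw overlap counts, works on a single chromosome with half-open (start, end) intervals, and the file names and coordinates are made up:

```python
import bisect


def build_index(intervals):
    """Sort intervals by start and record the running maximum end."""
    ordered = sorted(intervals)
    starts = [start for start, _ in ordered]
    max_ends, current = [], float("-inf")
    for _, end in ordered:
        current = max(current, end)
        max_ends.append(current)
    return starts, max_ends


def overlaps_any(index, q_start, q_end):
    """True if the query overlaps at least one indexed interval."""
    starts, max_ends = index
    i = bisect.bisect_left(starts, q_end)  # intervals with start < q_end
    return i > 0 and max_ends[i - 1] > q_start


def rank_files(query, database):
    """Rank files by how many query intervals overlap them (most hits first)."""
    scores = []
    for name, intervals in database.items():
        index = build_index(intervals)
        hits = sum(overlaps_any(index, s, e) for s, e in query)
        scores.append((hits, name))
    return sorted(scores, key=lambda item: (-item[0], item[1]))


# Hypothetical database of two interval files and three query features.
database = {"fileA": [(100, 200), (500, 600)], "fileB": [(1000, 1100)]}
query = [(150, 160), (550, 700), (900, 950)]
ranking = rank_files(query, database)
print(ranking)
```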

https://www.nature.com/articles/nmeth.4556

Address of the bookmark: https://github.com/ryanlayer/giggle