BOL: Related items

Human Complete Genome

Shruti Paniwala — Wed, 06 Jul 2022 06:42:55 -0500

Telomere-to-telomere consortium

We have sequenced the CHM13hTERT human cell line with a number of technologies. Human genomic DNA was extracted from the cultured cell line. As the DNA is native, modified bases will be preserved. The data includes 30x PacBio HiFi, 120x coverage of Oxford Nanopore, 70x PacBio CLR, 50x 10X Genomics, as well as BioNano DLS and Arima Genomics HiC. Most raw data is available from this site, with the exception of the PacBio data which was generated by the University of Washington/PacBio and is available from NCBI SRA.

A UCSC browser is available for v2.0 (as well as legacy v1.0 and v1.1 versions). An interactive dotplot visualization of all genomic repeats is also available from resgen.io. Known issues identified in the assembly are tracked at CHM13 issues.

MORE at https://github.com/marbl/CHM13

Address of the bookmark: https://www.science.org/doi/10.1126/science.abj6987

Genome Context Viewer (GCV)

LEGE — Sun, 21 May 2023 19:33:43 -0500

The Genome Context Viewer (GCV) is a web-app that visualizes genomic context data provided by third party services. Specifically, it uses functional annotations as a unit of search and comparison. By adopting a common set of annotations, data-store operators can deploy federated instances of GCV, allowing users to compare genomes from different providers in a single interface.
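
As a rough illustration of using annotations as the unit of comparison, the sketch below treats each genomic region as a list of annotation labels and ranks candidate regions by how many labels they share with a query. The function names, the Jaccard score, and the example labels are all assumptions made for illustration; this is not GCV's actual search algorithm or API.

```python
def jaccard(labels_a, labels_b):
    """Jaccard similarity between two collections of annotation labels."""
    a, b = set(labels_a), set(labels_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_tracks(query, tracks):
    """Return (track name, score) pairs sorted from most to least similar."""
    scored = [(name, jaccard(query, labels)) for name, labels in tracks.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical annotation labels for a query region and two other regions.
    query = ["fam1", "fam2", "fam3"]
    tracks = {
        "provider_A:chr1": ["fam1", "fam2", "fam9"],
        "provider_B:chr4": ["fam7", "fam8"],
    }
    for name, score in rank_tracks(query, tracks):
        print(name, round(score, 2))
```

Because the comparison only looks at labels, regions from different providers can be scored together as long as they share the same annotation vocabulary.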

Address of the bookmark: https://github.com/legumeinfo/gcv

CGView.js is a Circular Genome Viewing tool

LEGE — Wed, 27 Mar 2024 11:16:24 -0500

CGView.js is a Circular Genome Viewing tool for visualizing and interacting with small genomes. This software is an adaptation of the Java program CGView.

CGView.js is the genome viewer of Proksee, an expert system for genome assembly, annotation and visualization.

Features

Circular and linear views of genomes
Capable of drawing genomes up to 10 Mbp with thousands of features and hundreds of contigs
Smooth zooming down to the sequence level
Easily generate features and plots directly from the sequence (e.g. ORFs, GC-content and GC-Skew)
Save high resolution PNG maps up to 8000x8000px
Fully documented API for interacting with CGView.js maps
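
To make the sequence-derived plots in the feature list concrete, here is a minimal Python sketch of GC content and GC skew computed over sliding windows. It uses the common definition of GC skew, (G - C) / (G + C); the function names, window sizes, and example sequence are assumptions for illustration and are not taken from the CGView.js API.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_skew(seq):
    """(G - C) / (G + C); 0.0 when the sequence has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding_windows(seq, window, step):
    """Yield (start, subsequence) for each full window along a linear sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


if __name__ == "__main__":
    genome = "ATGCGCGGGTTTAAACCCGGGATATCGCGC"  # made-up sequence
    for start, win in sliding_windows(genome, window=10, step=5):
        print(start, round(gc_content(win), 2), round(gc_skew(win), 2))
```

For a circular genome the windows would also need to wrap around the origin, which this sketch leaves out.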

Address of the bookmark: https://js.cgview.ca/

The Role of lncRNA in Bioinformatics: Unlocking the Secrets of the Genome

LEGE — Sat, 07 Dec 2024 02:09:47 -0600

In the intricate dance of molecular biology, long non-coding RNAs (lncRNAs) have emerged as key players, capturing the interest of researchers worldwide. These RNA molecules, once dismissed as "junk," have proven to be vital in the regulation of gene expression, cellular processes, and the progression of diseases. The intersection of lncRNA studies and bioinformatics is transforming our understanding of these enigmatic molecules, offering profound insights into their structure, function, and therapeutic potential.

What Are lncRNAs?

lncRNAs are RNA transcripts longer than 200 nucleotides that do not code for proteins. Despite their non-coding nature, they play diverse roles in gene regulation, including chromatin remodeling, transcriptional control, and post-transcriptional processing. Unlike messenger RNAs (mRNAs), lncRNAs often function as scaffolds, decoys, or guides in cellular machinery, influencing biological processes such as cell differentiation, immune response, and even cancer metastasis.

Challenges in lncRNA Research

Identifying and understanding lncRNAs pose unique challenges:

High Sequence Variability: Unlike protein-coding genes, lncRNAs exhibit low sequence conservation across species, making functional predictions difficult.
Low Expression Levels: lncRNAs are often expressed at low levels, complicating their detection in transcriptomic data.
Diverse Functions: The multifunctional nature of lncRNAs requires advanced computational tools to decipher their roles in complex networks.

Bioinformatics: A Crucial Ally in lncRNA Research

Bioinformatics bridges the gap between raw biological data and meaningful insights, making it indispensable in lncRNA research. Here’s how:

1. Identification and Annotation

High-throughput sequencing technologies like RNA-seq generate vast amounts of data. Bioinformatics tools help assemble and annotate lncRNAs from this data: HISAT2 aligns reads to the genome, and StringTie or Cufflinks assemble the alignments into transcripts. Additionally, databases like NONCODE, LNCipedia, and Ensembl provide curated repositories of lncRNA sequences and annotations.
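
Once transcripts are assembled, a first-pass filter for lncRNA candidates can be sketched as below. It applies the 200-nucleotide length cutoff given earlier and, as an added assumption, rejects transcripts containing a long open reading frame (the 100-codon threshold and all function names are illustrative choices, not part of any tool named above).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf_codons(seq):
    """Longest ATG..stop ORF in the three forward frames, in codons.

    The count includes the start codon and excludes the stop codon.
    ORFs that run off the end without a stop codon are ignored.
    """
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (pos - start) // 3)
                start = None
    return best


def is_lncrna_candidate(seq, min_length=200, max_orf_codons=100):
    """True if seq is longer than min_length and has no ORF of max_orf_codons or more."""
    return len(seq) > min_length and longest_orf_codons(seq) < max_orf_codons


if __name__ == "__main__":
    transcripts = {
        "tx_short": "A" * 150,
        "tx_noncoding": "A" * 250,
        "tx_coding": "ATG" + "AAA" * 120 + "TAA" + "A" * 50,
    }
    for name, seq in transcripts.items():
        print(name, is_lncrna_candidate(seq))
```

A filter like this only narrows the list; dedicated coding-potential tools and the curated databases remain the more reliable check.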

2. Functional Prediction

Bioinformatics algorithms predict the potential functions of lncRNAs by analyzing their interactions with DNA, RNA, and proteins. For example, LncRNA2Function infers likely roles from the protein-coding genes a lncRNA is co-expressed with, while RIblast predicts RNA-RNA interactions from sequence.

3. Network Construction

lncRNAs often act as regulatory hubs. Bioinformatics platforms such as Cytoscape enable the visualization of lncRNA-mediated networks, elucidating their roles in pathways like cell cycle regulation and apoptosis.
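
One simple way to build such a network is from co-expression: connect two transcripts when their expression profiles are strongly correlated. The sketch below computes Pearson correlations and keeps pairs above a threshold; the threshold of 0.9, the function names, and the toy expression values are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def coexpression_edges(expr, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    edges = []
    for a, b in combinations(sorted(expr), 2):
        r = pearson(expr[a], expr[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges


if __name__ == "__main__":
    # Toy expression values across four samples.
    expr = {
        "lncA": [1, 2, 3, 4],
        "geneB": [2, 4, 6, 8],
        "geneC": [4, 1, 3, 2],
    }
    for a, b, r in coexpression_edges(expr):
        print(a, b, round(r, 3), sep="\t")
```

The tab-separated edge list printed here is the kind of table that network tools such as Cytoscape can import for visualization.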

4. Epigenetic Studies

lncRNAs are known to interact with chromatin-modifying complexes, influencing gene expression epigenetically. Assays such as ChIP-seq and ATAC-seq, combined with computational pipelines, identify these interactions and map them to the genome.
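
Mapping such signals to the genome often comes down to asking which peaks overlap which lncRNA loci. The sketch below shows that interval test on half-open coordinates; the tuple layouts, names, and coordinates are assumptions for illustration, and real pipelines would normally use an indexed interval structure rather than this linear scan.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Half-open intervals [start, end) overlap if each starts before the other ends."""
    return a_start < b_end and b_start < a_end


def peaks_in_loci(loci, peaks):
    """Map each locus name to the peaks on the same chromosome that overlap it.

    loci:  iterable of (name, chrom, start, end)
    peaks: iterable of (chrom, start, end)
    """
    hits = {}
    for name, chrom, start, end in loci:
        hits[name] = [
            peak for peak in peaks
            if peak[0] == chrom and overlaps(start, end, peak[1], peak[2])
        ]
    return hits


if __name__ == "__main__":
    # Made-up coordinates for two loci and three peaks.
    loci = [("lnc1", "chr1", 100, 500), ("lnc2", "chr2", 100, 500)]
    peaks = [("chr1", 450, 600), ("chr1", 500, 700), ("chr2", 0, 50)]
    print(peaks_in_loci(loci, peaks))
```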

5. Clinical Applications

Bioinformatics aids in the discovery of lncRNA biomarkers for diseases like cancer and neurodegenerative disorders. Machine learning models analyze differential expression profiles, helping prioritize lncRNAs with therapeutic potential.
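
Before any machine learning model is involved, differential expression is often summarized as a log2 fold change between two groups. The sketch below is that simpler calculation, not a machine learning model; the pseudocount of 1, the function names, and the toy values are assumptions for illustration.

```python
from math import log2
from statistics import mean


def log2_fold_change(case, control, pseudocount=1.0):
    """log2 of (mean case + pseudocount) over (mean control + pseudocount)."""
    return log2((mean(case) + pseudocount) / (mean(control) + pseudocount))


def rank_by_effect(expression):
    """Sort transcripts by absolute log2 fold change, largest first.

    expression maps a transcript name to a (case values, control values) pair.
    """
    scored = [
        (name, log2_fold_change(case, control))
        for name, (case, control) in expression.items()
    ]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Toy expression values: (disease samples, healthy samples).
    expression = {
        "lnc_up": ([7, 7], [1, 1]),
        "lnc_flat": ([3, 3], [3, 3]),
        "lnc_down": ([1, 1], [3, 3]),
    }
    for name, lfc in rank_by_effect(expression):
        print(name, round(lfc, 2))
```

The pseudocount keeps the ratio defined when a transcript has zero counts in one group, which matters for lowly expressed lncRNAs.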

Case Study: lncRNAs in Cancer Research

lncRNAs such as HOTAIR and MALAT1 have been implicated in cancer progression. Bioinformatics analyses have revealed their roles in promoting metastasis and altering the tumor microenvironment. For example, transcriptome analysis in cancer patients identifies lncRNA expression signatures, enabling precision medicine approaches.

Future Directions

The fusion of bioinformatics with experimental biology is unlocking the secrets of lncRNAs. Advances in artificial intelligence, single-cell sequencing, and structural modeling promise to overcome current limitations. Here are some promising directions:

Integrative Analysis: Combining multi-omics data to understand the interplay of lncRNAs with other biomolecules.
CRISPR Screens: Leveraging bioinformatics to design CRISPR-based functional screens for lncRNAs.
Therapeutic Development: Using bioinformatics to design lncRNA-based therapeutics, including antisense oligonucleotides and RNA interference tools.
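
As one small example of the computational side of screen design, the sketch below scans a sequence for candidate guide sites. It assumes the widely used SpCas9 convention of a 20-nucleotide protospacer followed by an NGG PAM and searches the forward strand only; the function name and example sequence are illustrative, and a real design would also score off-targets and search the reverse strand.

```python
import re


def find_guides(seq, guide_len=20):
    """Return (position, protospacer, PAM) for each forward-strand NGG site."""
    seq = seq.upper()
    # A lookahead is used so that overlapping candidate sites are all reported.
    pattern = re.compile(r"(?=([ACGT]{%d})([ACGT]GG))" % guide_len)
    return [(m.start(), m.group(1), m.group(2)) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    locus = "ACGT" * 5 + "AGG" + "TT"  # made-up sequence
    for pos, guide, pam in find_guides(locus):
        print(pos, guide, pam)
```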

Conclusion

lncRNAs are the hidden gems of the genome, and bioinformatics is the key to unearthing their full potential. As research progresses, lncRNAs could pave the way for novel diagnostics, targeted therapies, and personalized medicine, revolutionizing our approach to complex diseases.

The journey into the world of lncRNAs is only beginning, and bioinformatics will continue to play a pivotal role in decoding these molecular mysteries. Whether you’re a researcher, clinician, or bioinformatics enthusiast, the study of lncRNAs offers a fascinating frontier of discovery.

Genome Simulation with SLiM and msprime

BioStar — Fri, 31 Jan 2025 12:47:43 -0600

Genome simulation is an essential tool in population genetics, enabling researchers to model evolutionary processes and study genetic variation. Two widely used simulation tools in this field are SLiM and msprime. Although they serve different purposes, they can be used together through the slendr framework, which makes it possible to compare their outputs.

Overview of SLiM and msprime

SLiM: Forward Genetic Simulator

SLiM is a free, open-source tool designed for forward genetic simulations. It allows researchers to model complex evolutionary scenarios, including selection, recombination, and demographic events, making it particularly useful for studying adaptation and selection in populations.

Key Features of SLiM:

Simulates population evolution forward in time
Supports custom evolutionary models using an embedded scripting language
Allows modeling of spatial and ecological dynamics
Provides high flexibility and extensibility for user-defined scenarios
Available on GitHub as an open-source project
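
To show what "forward in time" means, here is a toy Wright-Fisher simulation of a single allele under drift and selection, written in plain Python. It is not SLiM and does not use SLiM's scripting language; the haploid selection model, parameter names, and values are assumptions chosen to keep the sketch short.

```python
import random


def wright_fisher(pop_size, generations, start_freq, selection=0.0, seed=None):
    """Track one allele's frequency forward in time, generation by generation.

    Each generation, selection shifts the expected frequency and drift is
    modelled by sampling 2 * pop_size gene copies from that expectation.
    """
    rng = random.Random(seed)
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        weighted = freq * (1 + selection)
        expected = weighted / (weighted + (1 - freq))
        copies = sum(rng.random() < expected for _ in range(2 * pop_size))
        freq = copies / (2 * pop_size)
        trajectory.append(freq)
    return trajectory


if __name__ == "__main__":
    traj = wright_fisher(pop_size=100, generations=50, start_freq=0.1,
                         selection=0.05, seed=1)
    print([round(f, 2) for f in traj[::10]])
```

Every individual in every generation is simulated explicitly, which is what makes forward simulation flexible but comparatively slow.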

msprime: Ancestry and Mutation Simulator

msprime is an efficient, open-source tool that simulates ancestry and mutations using a coalescent framework. It is known for its high-speed performance and low memory requirements, making it a popular choice for large-scale genomic simulations.

Key Features of msprime:

Implements coalescent simulations for ancestry modeling
Efficiently simulates large population histories
Supports the addition of mutations to genealogies
Developed using an open-source community model
Often faster and more memory-efficient than alternative simulators
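
The coalescent idea can be illustrated without msprime itself: instead of simulating a whole population, one samples only the waiting times at which the sampled lineages merge going backward in time. The sketch below draws those times under the standard Kingman coalescent, with time measured in units of 2N generations; it is a toy illustration in plain Python and not msprime's implementation or API.

```python
import random


def coalescent_times(n_samples, seed=None):
    """Waiting times between successive coalescent events for n_samples lineages.

    With k lineages remaining, the next event arrives after an exponential
    waiting time with rate k * (k - 1) / 2.
    """
    rng = random.Random(seed)
    times = []
    k = n_samples
    while k > 1:
        rate = k * (k - 1) / 2
        times.append(rng.expovariate(rate))
        k -= 1
    return times


def tmrca(times):
    """Time to the most recent common ancestor: the sum of all waiting times."""
    return sum(times)


if __name__ == "__main__":
    times = coalescent_times(10, seed=1)
    print(len(times), "coalescent events, TMRCA =", round(tmrca(times), 3))
```

Only n - 1 random draws are needed for n samples regardless of population size, which hints at why coalescent simulators can be fast and light on memory.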

Using SLiM and msprime with slendr

Both SLiM and msprime can be integrated with slendr, a framework that facilitates structured population genetic simulations. This integration allows for seamless comparison of simulation outputs.

How They Work Together:

SLiM and msprime simulations can be analyzed within slendr.
The ts_read() function in slendr enables loading and comparing tree sequence outputs from both simulators.
This integration allows researchers to validate simulation results and gain deeper insights into evolutionary processes.

Performance Considerations

While SLiM offers powerful forward simulations with extensive customization, msprime is often preferred for its speed and memory efficiency when simulating ancestry and mutations. The choice between the two depends on the research goals:

For detailed evolutionary modeling with selection and recombination: Use SLiM.
For large-scale coalescent simulations with mutations: Use msprime.
For comparing different simulation models and their outputs: Use slendr to integrate SLiM and msprime results.

Conclusion

SLiM and msprime are valuable tools for genome simulation, each serving distinct but complementary purposes in population genetics research. By leveraging the strengths of both simulators with slendr, researchers can conduct robust and efficient evolutionary simulations, enhancing our understanding of genetic diversity and adaptation.

For more information, check out the official GitHub repositories for SLiM and msprime, and explore the slendr framework for streamlined simulation workflows.

Submit your SARS-CoV-2 sequence data to GenBank

Neel — Thu, 09 Apr 2020 18:28:25 -0500

Submit your SARS-CoV-2 sequence data to GenBank and SRA with our new submission landing page. Submission is simple and streamlined *and* there’s a rapid turnaround. https://submit.ncbi.nlm.nih.gov/sarscov2/

Quickly and easily add your SARS-CoV-2 sequence data to the growing public archive with new, special features and support from NCBI. The new SARS-CoV-2 sequence submission landing page will help you get started. GenBank submissions are accessioned and released in approximately 1-2 working days, and Sequence Read Archive (SRA) submissions are typically processed and released within hours. Submission is simple!

More information is available on NCBI Insights. https://ncbiinsights.ncbi.nlm.nih.gov/2020/04/09/sars-cov2-data-streamlined-submission-rapid-turnaround/

Encode sequencing data freely available to download and use for academic means

Rahul Agarwal — Thu, 13 Mar 2014 18:18:08 -0500

In ENCODE, regulatory elements are investigated via DNA hypersensitivity assays, assays of DNA methylation, and chromatin immunoprecipitation (ChIP) of proteins that interact with DNA, including modified histones and transcription factors, followed by sequencing (ChIP-Seq).

More information:

https://genome.ucsc.edu/ENCODE/pilot.html

Address of the bookmark: https://genome.ucsc.edu/ENCODE/

SAMHAR-COVID19 Hackathon

BioStar — Fri, 17 Apr 2020 06:47:10 -0500

Centre for Development of Advanced Computing (C-DAC) under the aegis of the National Supercomputing Mission (NSM), a Ministry of Electronics & Information Technology (MeitY) and Department of Science & Technology (DST) initiative, in association with NVIDIA & OpenACC, announces the SAMHAR-COVID19 Hackathon.

A pandemic such as the coronavirus outbreak can create huge challenges for government and public health officials, who must gather information quickly and coordinate a response. In such a situation, Artificial Intelligence (AI) can play a major role in predicting, minimizing, and stalling the spread of the virus.

C-DAC has embarked on the SAMHAR-COVID19 program (Supercomputing using AI, ML, Healthcare Analytics based Research for combating COVID19). The program gives researchers an opportunity to find solutions for identifying, tracking, and forecasting outbreaks of COVID19, and for facilitating drug discovery.

Participants can update submissions multiple times till the Registration End date.
Each entry can be submitted by a Team comprising a minimum of 3 and a maximum of 5 members (including the Team Lead).
Participants will have to share the complete work activities with C-DAC, and C-DAC will have the right to use the submitted application/solution for SAMHAR-COVID19 programs.
The Award will be given to the Selected/Winning Entry irrespective of the number of members in the Team (members may choose to distribute the amount among themselves).
The decision of the Eminent Jury on the I3 Award will be final and binding.
Award can be for the Team/Company/Institution, as submitted in the Application and cannot be changed later.
Submissions will be considered void if they are, in whole or in part, ineligible, incomplete, damaged, altered, counterfeit, obtained through fraud, or submitted late.

More at https://samhar-covid19hackathon.cdac.in/

Public Databases for Bioinformatics !

Jit — Tue, 23 Mar 2021 05:32:15 -0500

https://www.nature.com/articles/s41467-020-17155-y

Server Infrastructure:

File Server:

dhara: Synology 3614 Storage Appliance
4 Core Xeon
108TB disk storage
10Gb ethernet to SCG3
Access at: dhara:5000
Has btsync server (try it - it's much better than dropbox)

Compute Servers:

nandi: Kundaje and Phi Server
24 intel cores
256GB RAM
500GB of SSD storage 
36TB RAID6 local storage
4 Intel Phis (space for 4 more GPUs)


durga: Montgomery and sensitive data
24 intel cores
256GB RAM
500GB of SSD RAID0 storage 
60TB RAID6 local storage

mitra: Bassik and Web/DB Server
24 core
256GB RAM 
500GB of SSD RAID0 storage 
36TB RAID6 local storage

vayu: Kundaje GPU server
4 core
64GB RAM 
200GB of SSD storage 
8TB RAID10 local storage
4 Nvidia GTX 970 4GB GPUs

amold: Bickel and SGE server
32 AMD core
128GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

wotan: Bickel and SGE server
64 AMD core
256GB RAM 
200GB of SSD storage 
12TB RAID5 local storage

Filesystem:

/users/$USER
default home directory
full backups nightly 
nfs mount to dhara
should store code, papers, and other highly processed data here

/mnt/data/
globally accessible data
should store common data here
e.g. genomes and indexes, annotations, ENCODE data  
if you don't want this to count towards your quota you must chown

/mnt/lab_data/$LAB/
lab accessible data
should store lab project data here 
e.g. ATAC-seq prediction data, enhancer prediction, motif calls

/srv/scratch/$USER
fast local storage
not backed up, but on RAID, and data will never be deleted
most analysis should be performed here

/srv/persistent/$USER
fast local storage
synced nightly, but not backed up
       i.e. if the hard drives fail or you delete something and notice 
       within 24 hours, we can recover it. Otherwise not. (vs home, which is 
       properly backed up)
intermediate analysis products that would be hard to recover should be stored here 
       e.g. stochastic analysis results that need to be kept so that paper 
       results can be reproduced

/srv/www/$LABNAME/
web accessible from mitra.stanford.edu
*NOT BACKED UP*

Some parallel programming patterns:

# gzip a bunch of files
parallel gzip -- *.FILESTOGZIP

# fork example in python:
(for more detailed examples look at 
 https://github.com/nboley/grit/ grit/lib/multiprocessing_utils.py)

import os
import time
import random

import multiprocessing

class ProcessSafeOPStream(object):
    # wraps a file-like object so that writes from several processes
    # are serialized by a shared lock
    def __init__(self, writeable_obj):
        self.writeable_obj = writeable_obj
        self.lock = multiprocessing.Lock()
        self.name = self.writeable_obj.name

    def write(self, data):
        # flush inside the lock so each record reaches the file before
        # another process can write
        with self.lock:
            self.writeable_obj.write(data)
            self.writeable_obj.flush()

    def close(self):
        self.writeable_obj.close()

def worker(queue, ofp):
    # Try without this: a forked child starts from a copy of the parent's
    # random state unless it is re-seeded
    random.seed()
    while True:
        i = queue.get()
        if i == 'FINISHED': return
        # simulate an expensive function
        x = random.random()
        time.sleep(x/10)
        print(i, x, flush=True)
        ofp.write("%i\t%s\n" % (i, x))

NSIMS = 10000
NPROC = 25

# populate queue, followed by one 'FINISHED' sentinel per worker
todo = multiprocessing.Queue()
for i in range(NSIMS): todo.put(i)
for i in range(NPROC): todo.put('FINISHED')

ofp = ProcessSafeOPStream(open("output.txt", "w"))

pids = []
for i in range(NPROC):
    pid = os.fork()
    if pid == 0:
        # child process: consume the queue, then exit immediately
        worker(todo, ofp)
        os._exit(0)
    else:
        pids.append(pid)

# wait for every child to finish
for pid in pids:
    os.waitpid(pid, 0)

ofp.close()

print("FINISHED")

For use case 1 we obtained the following ENCODE and ROADMAP datasets https://www.encodeproject.org/files/ENCFF446WOD/@@download/ENCFF446WOD.bed.gz, https://www.encodeproject.org/files/ENCFF546PJU/@@download/ENCFF546PJU.bam, https://www.encodeproject.org/files/ENCFF059BEU/@@download/ENCFF059BEU.bam. Blacklisted regions were obtained from http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz. The human genome version hg38 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz.

For use case 2 we used the set of narrowPeak files summarized in https://github.com/wkopp/janggu_usecases/tree/master/extra/urls.txt (archived version v1.0.1). The human genome version hg19 was obtained from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz.

For use case 3 we used the ENCODE datasets https://www.encodeproject.org/files/ENCFF591XCX/@@download/ENCFF591XCX.bam, https://www.encodeproject.org/files/ENCFF736LHE/@@download/ENCFF736LHE.bigWig, https://www.encodeproject.org/files/ENCFF177HHM/@@download/ENCFF177HHM.bam as well as the GENCODE annotation v29 from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz.
