BOL: Related items

HIV genome database

Rahul Nayak — Fri, 21 Jan 2022 05:40:15 -0600

HIV resources

https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html

Address of the bookmark: https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html

Human Complete Genome

Shruti Paniwala — Wed, 06 Jul 2022 06:42:55 -0500

Telomere-to-telomere consortium

We have sequenced the CHM13hTERT human cell line with a number of technologies. Human genomic DNA was extracted from the cultured cell line. As the DNA is native, modified bases will be preserved. The data includes 30x PacBio HiFi, 120x coverage of Oxford Nanopore, 70x PacBio CLR, 50x 10X Genomics, as well as BioNano DLS and Arima Genomics HiC. Most raw data is available from this site, with the exception of the PacBio data which was generated by the University of Washington/PacBio and is available from NCBI SRA.

A UCSC browser is available for v2.0 (as well as legacy v1.0 and v1.1 versions). An interactive dotplot visualization of all genomic repeats is also available from resgen.io. Known issues identified in the assembly are tracked at CHM13 issues.

More at https://github.com/marbl/CHM13

Address of the bookmark: https://www.science.org/doi/10.1126/science.abj6987

NCBI Datasets pages

BioStar — Wed, 12 Jul 2023 06:29:31 -0500

Update: NCBI Assembly and Genome record pages now redirect to the new NCBI Datasets pages. Learn more: https://ncbiinsights.ncbi.nlm.nih.gov/2023/07/11/ncbi-datasets-genome-assembly-pages/

Effective July 10, 2023, NCBI’s Assembly and Genome record pages now redirect to new NCBI Datasets pages. As previously announced, these updates are part of our ongoing effort to modernize and improve your user experience. NCBI Datasets is a new resource that makes it easier to find and download genome data.  

The following pages have been updated:

The NCBI Assembly record pages now redirect to the new NCBI Datasets Genome record pages that describe assembled genomes and provide links to related NCBI tools such as Genome Data Viewer and BLAST. 
The NCBI Genome record pages now redirect to the NCBI Datasets Taxonomy record pages that provide a taxonomy-focused portal to genes, genomes, and additional NCBI resources.

During this transition, you will have the option to return to the legacy Genome and Assembly record pages. We will remove the legacy pages in early 2024. 

Entire Human Genome Sequencing

LEGE — Tue, 02 Apr 2024 01:19:29 -0500

Cost-effective whole human genome sequencing has revolutionized the landscape of genetic research and personalized medicine by making comprehensive genetic analysis accessible to a wider population. Through advancements in sequencing technologies, such as next-generation sequencing (NGS), costs have significantly decreased, enabling researchers and healthcare providers to analyze an individual's complete genetic makeup with greater efficiency and affordability. This has profound implications for disease diagnosis, prognosis, and treatment, as it allows for the identification of genetic predispositions and the customization of healthcare interventions based on an individual's unique genetic profile. Moreover, as the cost continues to decline, the potential for population-scale genomic studies and large-scale screening programs becomes increasingly feasible, promising to further enhance our understanding of human genetics and improve healthcare outcomes on a global scale.

Here are a few companies:

https://mynucleus.com/

https://myome.com/

https://nebula.org/whole-genome-sequencing-dna-test/

piRNA and Bioinformatics: Decoding the Guardians of the Genome

LEGE — Sat, 07 Dec 2024 02:15:11 -0600

In the symphony of small RNAs, PIWI-interacting RNAs (piRNAs) stand out as the protectors of genomic integrity. These small, non-coding RNAs play critical roles in silencing transposable elements, regulating gene expression, and maintaining germline stability. The rise of bioinformatics has revolutionized our understanding of piRNAs, enabling researchers to decipher their biogenesis, functions, and evolutionary significance.

What Are piRNAs?

piRNAs are the largest class of small non-coding RNAs, typically 24–32 nucleotides in length. Unlike microRNAs (miRNAs) and small interfering RNAs (siRNAs), piRNAs do not rely on Dicer enzymes for maturation. Instead, they are processed from long single-stranded precursors and associate with PIWI proteins, a subclass of the Argonaute protein family.
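The length range quoted above suggests a simple first-pass filter on small RNA reads. The sketch below only illustrates that idea: the function name and the example reads are hypothetical, and a length filter alone cannot distinguish piRNAs from other RNA fragments of similar size.

```python
# Hypothetical first-pass filter: keep reads whose length falls in the
# 24-32 nt range quoted above. Length alone does not prove a read is a piRNA.
PIRNA_MIN_LEN = 24
PIRNA_MAX_LEN = 32


def filter_by_pirna_length(reads, min_len=PIRNA_MIN_LEN, max_len=PIRNA_MAX_LEN):
    """Return the reads whose length lies in [min_len, max_len]."""
    return [r for r in reads if min_len <= len(r) <= max_len]


# Made-up example reads of length 21, 24, 28, 32 and 33.
example_reads = ["A" * 21, "T" * 24, "G" * 28, "C" * 32, "A" * 33]
candidates = filter_by_pirna_length(example_reads)
print([len(r) for r in candidates])
```

In practice such a filter would be only one step among several, applied after adapter trimming and before mapping.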

The primary functions of piRNAs include:

Silencing Transposable Elements: By targeting transposons, piRNAs prevent genomic instability, particularly in germline cells.
Regulating Gene Expression: piRNAs modulate gene expression at transcriptional and post-transcriptional levels.
Epigenetic Modulation: They guide epigenetic modifications, such as DNA methylation, to specific genomic loci.

Challenges in piRNA Research

Studying piRNAs is fraught with challenges, including:

Short Length: Their small size complicates sequencing and alignment.
Lack of Sequence Conservation: Unlike miRNAs, piRNAs exhibit limited sequence conservation across species.
Complex Biogenesis: The intricate pathways of piRNA generation require sophisticated computational tools to unravel.

Bioinformatics: Illuminating the World of piRNAs

Bioinformatics has emerged as an indispensable tool for studying piRNAs, facilitating their discovery, annotation, and functional analysis. Here's how bioinformatics is transforming piRNA research:

1. Identification and Annotation

The discovery of piRNAs relies on next-generation sequencing (NGS) data. Bioinformatics tools such as piRNApredictor and Piano predict which small RNA sequences are likely to be piRNAs, while cluster-detection tools such as proTRAC locate the genomic clusters they derive from. Databases like piRBase and piRNAdb curate information about known piRNAs, their sequences, and associated proteins.

2. Mapping and Alignment

piRNAs often originate from repetitive regions, making their alignment challenging. Tools like Bowtie and STAR handle the unique mapping requirements of piRNAs, enabling accurate identification of piRNA clusters in genomes.

3. Functional Analysis

Bioinformatics approaches predict piRNA functions by analyzing their interactions with transposons, genes, and epigenetic marks. Algorithms such as TargetFinder and RIblast explore piRNA-mRNA interactions, shedding light on regulatory networks.

4. Evolutionary Studies

piRNAs are evolutionarily diverse, reflecting their roles in species-specific genomic defense. Comparative genomics tools help trace the evolution of piRNA clusters and their associated PIWI proteins across species.

5. Epigenomic Insights

piRNAs are key players in epigenetic regulation. Bioinformatics pipelines integrate piRNA data with chromatin immunoprecipitation sequencing (ChIP-seq) and DNA methylation data to uncover their role in shaping the epigenome.

Case Study: piRNAs in Germline Integrity

One of the hallmark functions of piRNAs is the suppression of transposable elements in the germline. For example, in Drosophila melanogaster, piRNAs target retrotransposons like gypsy and copia. Bioinformatics analyses revealed that these piRNAs guide PIWI proteins to transposon-derived RNA, ensuring genome stability during gametogenesis.

Clinical Relevance of piRNAs

Recent studies suggest that piRNAs may serve as biomarkers for diseases such as cancer, infertility, and neurodegenerative disorders. For instance:

Cancer: Dysregulated piRNA expression has been linked to tumorigenesis, making them potential targets for cancer therapies.
Infertility: Aberrant piRNA pathways are implicated in male infertility due to their role in spermatogenesis.
Neurodegeneration: piRNAs may regulate neuronal gene expression, highlighting their potential in neurological research.

Future Directions

The integration of bioinformatics with emerging technologies offers exciting opportunities for piRNA research:

Single-Cell Sequencing: Unveiling cell-specific piRNA expression and function.
Machine Learning: Predicting piRNA functions and targets with greater accuracy.
CRISPR-Based Tools: Editing piRNA clusters to explore their roles in vivo.

Conclusion

piRNAs are the unsung guardians of the genome, safeguarding genetic material from transposable elements and contributing to gene regulation and epigenetic programming. Bioinformatics has opened the floodgates of discovery, unraveling the complexities of piRNAs and their myriad roles in biology and disease.

As we continue to decode the piRNA landscape, these small RNAs promise to unveil big secrets about genome stability, evolution, and human health, cementing their place as a fascinating frontier in molecular biology.

NVIDIA and Arc Institute Unveil Evo 2: A Breakthrough AI for DNA Design

BioStar — Fri, 21 Feb 2025 10:39:47 -0600

NVIDIA and the Arc Institute have introduced Evo 2, a groundbreaking AI model designed to understand, predict, and generate DNA sequences. This marks a major advancement in computational biology, offering scientists an unprecedented tool to decode the genetic blueprint of life and even design entirely new biological systems.

The Power of Evo 2: AI Meets DNA

Evo 2 is described by its developers as the largest AI model for biology to date. It was trained on 9.3 trillion DNA "letters" (nucleotides) selected from genomes spanning the entire tree of life, a dataset intended to let the model recognize patterns and relationships in genetic sequences at very large scale.

With Evo 2, scientists can move beyond sequence analysis to AI-assisted DNA generation. The model lets researchers predict, modify, and generate genetic sequences, opening new possibilities in medicine, agriculture, and synthetic biology.

Decoding the Dark Genome

One of the biggest challenges in genetics is understanding the non-coding regions of DNA—vast stretches of the genome that do not code for proteins but play crucial roles in regulating gene expression. These regions control when and how genes are activated, influencing everything from development to disease.

Evo 2 is designed to decode these non-coding elements, helping researchers uncover their functions and use this knowledge to develop gene-based therapies, synthetic life forms, and precision agriculture solutions.

From Reading DNA to Writing It

To put Evo 2’s impact into perspective:

Previous AI models could "read" DNA like a book, analyzing genetic sequences and identifying patterns.
Evo 2 can "write" entirely new DNA, designing functional genes, chromosomes, and even full genomes from scratch.

This means scientists can now engineer biological systems with AI, designing new proteins, metabolic pathways, and genetic circuits to address real-world challenges.

A Step Toward Generative Biology

The Arc Institute describes Evo 2 as a major step toward "generative biology"—a revolutionary approach where AI is used to create novel biological structures rather than just analyzing existing ones. This could lead to breakthroughs such as:

New medicines: AI-generated enzymes and proteins tailored for targeted therapies.
Disease-resistant crops: Genetically optimized plants for higher yield and climate resilience.
Synthetic organisms: Custom-designed microbes for bioremediation, biofuel production, and industrial applications.

An Open-Source Revolution

Unlike many proprietary AI models, Evo 2 is open source, making its capabilities accessible to researchers worldwide. This democratization of AI-driven biology means that scientists from different disciplines can collaborate, experiment, and innovate, accelerating discoveries in genetic engineering and synthetic biology.

With Evo 2, the boundaries of what’s possible in DNA design, genetic engineering, and biological innovation are being redrawn. The future of life sciences is no longer just about understanding life’s code—it’s about writing it.

Tiger genome sequenced

Rahul Agarwal — Tue, 17 Sep 2013 16:48:24 -0500

A team of fifteen scientists led by Dr Jong Bhak of the Genome Research Foundation, South Korea, decoded some 3 billion nucleotides (the organic molecules that form the basic building blocks of nucleic acids such as DNA) of the tiger genome. They identified 20,000 genes involved in various biological functions of the tiger.

The biggest and perhaps most fearsome of the world's big cats, the tiger, shares 95.6 percent of its DNA with humans' cute and furry companions, domestic cats.

The new research showed that big cats have genetic mutations that enabled them to be carnivores. The team also identified mutations that allow snow leopards to thrive at high altitudes.

References:

http://www.nbcnews.com/science/your-cat-ferocious-tigers-share-lot-95-6-percent-their-4B11182690

http://timesofindia.indiatimes.com/home/environment/flora-fauna/Gene-mapping-of-tiger-completed/articleshow/22671681.cms

Paper:

http://www.nature.com/ncomms/2013/130917/ncomms3433/full/ncomms3433.html

Opera: An optimal genome scaffolding program

Jit — Mon, 27 Nov 2017 10:18:20 -0600

Opera (Optimal Paired-End Read Assembler) is a scaffolding program for sequence assembly (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end or long reads to optimally order and orient contigs assembled from shotgun-sequencing reads.

An updated version called OPERA-LG has been re-engineered with features for the assembly of large and complex genomes.

Song Gao, Denis Bertrand, Burton K. H. Chia and Niranjan Nagarajan. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology, May 2016, doi: 10.1186/s13059-016-0951-y.

Song Gao, Wing-Kin Sung, Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology, Sept. 2011, doi:10.1089/cmb.2011.0170.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0951-y

Address of the bookmark: https://sourceforge.net/projects/operasf/

SPAdes hybrid genome assembly

Jit — Mon, 27 Nov 2017 08:05:40 -0600

When you have both Illumina and Nanopore data, SPAdes remains a good option for hybrid assembly. It was used to produce the B. fragilis assembly by Mick Watson’s group.

Running spades.py with no arguments will show you the options:

spades.py

This produces:

SPAdes genome assembler v3.10.1

Usage: /usr/local/SPAdes-3.10.1-Linux/bin/spades.py [options] -o <output_dir>

Basic options:
-o <output_dir>         directory to store all the resulting files (required)
--sc                    this flag is required for MDA (single-cell) data
--meta                  this flag is required for metagenomic sample data
--rna                   this flag is required for RNA-Seq data
--plasmid               runs plasmidSPAdes pipeline for plasmid detection
--iontorrent            this flag is required for IonTorrent data
--test                  runs SPAdes on toy dataset
-h/--help               prints this usage message
-v/--version            prints version

Input data:
--12 <filename>                  file with interlaced forward and reverse paired-end reads
-1 <filename>                    file with forward paired-end reads
-2 <filename>                    file with reverse paired-end reads
-s <filename>                    file with unpaired reads
--pe<#>-12 <filename>            file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-1 <filename>             file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-2 <filename>             file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-s <filename>             file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-<or>                     orientation of reads for paired-end library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--s<#> <filename>                file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)
--mp<#>-12 <filename>            file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-1 <filename>             file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-2 <filename>             file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-s <filename>             file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-<or>                     orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--hqmp<#>-12 <filename>          file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-1 <filename>           file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-2 <filename>           file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-s <filename>           file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-<or>                   orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--nxmate<#>-1 <filename>         file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--nxmate<#>-2 <filename>         file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--sanger <filename>              file with Sanger reads
--pacbio <filename>              file with PacBio reads
--nanopore <filename>            file with Nanopore reads
--tslr <filename>                file with TSLR-contigs
--trusted-contigs <filename>     file with trusted contigs
--untrusted-contigs <filename>   file with untrusted contigs

Pipeline options:
--only-error-correction runs only read error correction (without assembling)
--only-assembler        runs only assembling (without read error correction)
--careful               tries to reduce number of mismatches and short indels
--continue              continue run from the last available check-point
--restart-from <cp>     restart run with updated options and from the specified check-point ('ec', 'as', 'k<int>', 'mc')
--disable-gzip-output   forces error correction not to compress the corrected reads
--disable-rr            disables repeat resolution stage of assembling

Advanced options:
--dataset <filename>            file with dataset description in YAML format
-t/--threads <int>              number of threads
                                [default: 16]
-m/--memory <int>               RAM limit for SPAdes in Gb (terminates if exceeded)
                                [default: 250]
--tmp-dir <dirname>             directory for temporary files
                                [default: <output_dir>/tmp]
-k <int,int,...>                comma-separated list of k-mer sizes (must be odd and
                                less than 128) [default: 'auto']
--cov-cutoff <float>            coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset <33 or 64>       PHRED quality offset in the input reads (33 or 64)
                                [default: auto-detect]

As you can see this is also a “pipeline” of tools that can be switched on or off. SPAdes takes quite a long time, so for the purposes of this practical, something like this may suffice:

spades.py -t 4 \
          -m 32 \
          -k 31,51,71 \
          --only-assembler \
          -1 miseq.1.fastq -2 miseq.2.fastq \
          --nanopore minion.fastq \
          -o hybrid_assembly

In turn, these parameters mean:

use 4 threads
limit memory to 32 GB
use three k-mer values to build the de Bruijn graph(s): 31, 51 and 71
only run the assembler, not the read error correction step (for speed)
read 1 and read 2 of the MiSeq data
the Nanopore data
put the output in the folder “hybrid_assembly”
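If you want to launch the same run from a script, the command can be assembled as an argument list. The following is a minimal sketch, not part of SPAdes: the helper names build_spades_command and validate_kmers are hypothetical, and the only rule it checks is the one stated in the help text (k-mer sizes must be odd and less than 128). The flags are the ones used in the command above.

```python
import shlex


def validate_kmers(kmers):
    """Check the rule from the help text: each k must be odd and less than 128."""
    for k in kmers:
        if k <= 0 or k % 2 == 0 or k >= 128:
            raise ValueError(f"invalid k-mer size: {k}")
    return list(kmers)


def build_spades_command(read1, read2, nanopore, outdir,
                         threads=4, memory_gb=32, kmers=(31, 51, 71)):
    """Build the argument list for the hybrid assembly command shown above."""
    ks = validate_kmers(kmers)
    return [
        "spades.py",
        "-t", str(threads),
        "-m", str(memory_gb),
        "-k", ",".join(str(k) for k in ks),
        "--only-assembler",
        "-1", read1, "-2", read2,
        "--nanopore", nanopore,
        "-o", outdir,
    ]


cmd = build_spades_command("miseq.1.fastq", "miseq.2.fastq",
                           "minion.fastq", "hybrid_assembly")
# Print the command as a single shell-safe line for inspection.
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) once SPAdes is installed and the input files exist. Building a list rather than a single string avoids shell quoting problems with unusual file names.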

COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly

Jit — Wed, 06 Dec 2017 02:08:14 -0600

COPE (Connecting Overlapped Pair-End reads) is an efficient tool that connects overlapping paired-end reads using k-mer frequencies. The authors evaluated it on 30× simulated paired-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which is, respectively, 10 and 2% higher than the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs have fewer errors and give a 14-fold improvement in the N50 measurement compared with contigs produced from unconnected reads.
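For readers unfamiliar with the N50 measurement mentioned here, it is the contig length at which contigs of that length or longer contain at least half of the assembled bases. A minimal sketch of the calculation follows. The contig lengths are made up for illustration and are not taken from the COPE paper.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of all bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Made-up contig lengths (in bases), for illustration only.
example_contigs = [100, 80, 50, 30, 20, 10]
print(n50(example_contigs))
```

Here the total is 290 bases, and the two longest contigs (100 + 80 = 180) already cover more than half, so the N50 is 80.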

Address of the bookmark: ftp://ftp.genomics.org.cn/pub/cope