BOL: Related items

NVIDIA and Arc Institute Unveil Evo 2: A Breakthrough AI for DNA Design

BioStar — Fri, 21 Feb 2025 10:39:47 -0600

NVIDIA and the Arc Institute have introduced Evo 2, a groundbreaking AI model designed to understand, predict, and generate DNA sequences. This marks a major advancement in computational biology, offering scientists an unprecedented tool to decode the genetic blueprint of life and even design entirely new biological systems.

The Power of Evo 2: AI Meets DNA

Evo 2 is the largest AI model for biology ever created, trained on an astonishing 9.3 trillion DNA "letters" (nucleotides) carefully selected from genomes spanning the entire tree of life. This massive dataset ensures that Evo 2 can recognize patterns and relationships in genetic sequences at an unparalleled scale.

For the first time, scientists can design DNA with AI, moving beyond simple sequence analysis to active DNA generation. Evo 2 enables researchers to predict, modify, and even create entire genetic sequences, opening new possibilities in medicine, agriculture, and synthetic biology.

Decoding the Dark Genome

One of the biggest challenges in genetics is understanding the non-coding regions of DNA—vast stretches of the genome that do not code for proteins but play crucial roles in regulating gene expression. These regions control when and how genes are activated, influencing everything from development to disease.

Evo 2 is designed to decode these non-coding elements, helping researchers uncover their functions and use this knowledge to develop gene-based therapies, synthetic life forms, and precision agriculture solutions.

From Reading DNA to Writing It

To put Evo 2’s impact into perspective:

Previous AI models could "read" DNA like a book, analyzing genetic sequences and identifying patterns.
Evo 2 can "write" entirely new DNA, designing functional genes, chromosomes, and even full genomes from scratch.

This means scientists can now engineer biological systems with AI, designing new proteins, metabolic pathways, and genetic circuits to address real-world challenges.

A Step Toward Generative Biology

The Arc Institute describes Evo 2 as a major step toward "generative biology"—a revolutionary approach where AI is used to create novel biological structures rather than just analyzing existing ones. This could lead to breakthroughs such as:

New medicines: AI-generated enzymes and proteins tailored for targeted therapies.
Disease-resistant crops: Genetically optimized plants for higher yield and climate resilience.
Synthetic organisms: Custom-designed microbes for bioremediation, biofuel production, and industrial applications.

An Open-Source Revolution

Unlike many proprietary AI models, Evo 2 is open source, making its capabilities accessible to researchers worldwide. This democratization of AI-driven biology means that scientists from different disciplines can collaborate, experiment, and innovate, accelerating discoveries in genetic engineering and synthetic biology.

With Evo 2, the boundaries of what’s possible in DNA design, genetic engineering, and biological innovation are being redrawn. The future of life sciences is no longer just about understanding life’s code—it’s about writing it.

Detailed annotation of genes !

Jit — Fri, 11 Jan 2019 05:23:33 -0600

gene_info recalculated daily
---------------------------------------------------------------------------
tab-delimited
one line per GeneID
Column header line is the first line in the file.
Note: subsets of gene_info are available in the DATA/GENE_INFO
directory (described later)
---------------------------------------------------------------------------

tax_id:
the unique identifier provided by NCBI Taxonomy
for the species or strain/isolate

GeneID:
the unique identifier for a gene
ASN1: geneid

Symbol:
the default symbol for the gene
ASN1: gene->locus

LocusTag:
the LocusTag value
ASN1: gene->locus-tag

Synonyms:
bar-delimited set of unofficial symbols for the gene

dbXrefs:
bar-delimited set of identifiers in other databases
for this gene. The unit of the set is database:value.
Note that HGNC and MGI include 'HGNC' and 'MGI', respectively,
in the value part of their identifier. Consequently,
dbXrefs for these databases will appear like:
HGNC:HGNC:1100
This would be interpreted as database='HGNC', value='HGNC:1100'
Example for MGI:
MGI:MGI:104537
This would be interpreted as database='MGI', value='MGI:104537'
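
The database:value convention above means each unit should be split on its first colon only. A minimal sketch of that rule follows; the function name split_dbxrefs is hypothetical, the sample string simply combines the two identifiers quoted above for illustration, and skipping a lone '-' is an assumption carried over from the other columns.

```shell
#!/bin/sh
# Hypothetical helper: print one "database<TAB>value" line per dbXrefs unit.
# Units are separated by '|'. Only the FIRST ':' separates database from
# value, so 'HGNC:HGNC:1100' gives database='HGNC', value='HGNC:1100'.
split_dbxrefs() {
    printf '%s\n' "$1" | tr '|' '\n' | awk '
        $0 == "-" || $0 == "" { next }     # assumption: "-" means no cross-reference
        {
            i = index($0, ":")             # position of the first colon
            print substr($0, 1, i - 1) "\t" substr($0, i + 1)
        }'
}

# Illustration only: both example identifiers in one bar-delimited string.
split_dbxrefs 'HGNC:HGNC:1100|MGI:MGI:104537'
```

Splitting on every colon instead would break the HGNC and MGI values into three pieces, which is the pitfall the note above warns about.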

chromosome:
the chromosome on which this gene is placed.
for mitochondrial genomes, the value 'MT' is used.

map location:
the map location for this gene

description:
a descriptive name for this gene

type of gene:
the type assigned to the gene according to the list of options
provided in https://www.ncbi.nlm.nih.gov/IEB/ToolBox/CPP_DOC/lxr/source/src/objects/entrezgene/entrezgene.asn

Symbol from nomenclature authority:
when not '-', indicates that this symbol is from a
nomenclature authority

Full name from nomenclature authority:
when not '-', indicates that this full name is from a
nomenclature authority

Nomenclature status:
when not '-', indicates the status of the name from the
nomenclature authority (O for official, I for interim)

Other designations:
pipe-delimited set of some alternate descriptions that
have been assigned to a GeneID
'-' indicates none is being reported.

Modification date:
the last date a gene record was updated, in YYYYMMDD format

Feature type:
pipe-delimited set of annotated features and their classes or
controlled vocabularies, displayed as feature_type:feature_class
or feature_type:controlled_vocabulary, when appropriate; derived
from select feature annotations on RefSeq(s) associated with the
GeneID
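
Because the file is tab-delimited with one line per GeneID and a header on the first line, simple lookups can be done with awk. The sketch below assumes the columns appear in the order they are described above (tax_id first, GeneID second, Symbol third, type of gene tenth). The function name genes_for_taxon is hypothetical, and the toy records are made up and truncated to the first ten columns.

```shell
#!/bin/sh
# Hypothetical helper: list GeneID, Symbol and type of gene for one tax_id.
# Assumes the documented column order and a header on line 1.
genes_for_taxon() {
    awk -F '\t' -v taxon="$1" '
        NR == 1 { next }                   # skip the column header line
        $1 == taxon { print $2 "\t" $3 "\t" $10 }
    ' "$2"
}

# Toy input with made-up records, only to exercise the function.
tmp=$(mktemp)
{
    printf 'tax_id\tGeneID\tSymbol\tLocusTag\tSynonyms\tdbXrefs\tchromosome\tmap_location\tdescription\ttype_of_gene\n'
    printf '1111\t1\tGENE_A\t-\t-\t-\t1\t-\ttoy gene A\tprotein-coding\n'
    printf '2222\t2\tGENE_B\t-\t-\t-\tMT\t-\ttoy gene B\trRNA\n'
} > "$tmp"

genes_for_taxon 1111 "$tmp"
rm -f "$tmp"
```

For the real file, the second argument would be the downloaded and decompressed gene_info file, or one of the subsets in DATA/GENE_INFO.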

Address of the bookmark: ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/

Conserved Domain Database (CDD) version 3.11 released

Shikha Logwani — Wed, 19 Feb 2014 15:02:40 -0600

National Center for Biotechnology Information (NCBI) Conserved Domain Database (CDD) version 3.11 is now available with 596 new or updated NCBI-curated and 49,641 total domain models. The new version now contains the most recent Pfam release 27.

Updates to the Conserved Domain Database include:

Position-specific score matrices (PSSMs) have been recomputed for many models in CDD, and frequency tables have been added to the PSSMs;

The search databases distributed as part of this release can now be used with the more recent versions of RPS-BLAST (BLAST release 2.2.28 and up) using composition-based scoring. This abolishes the need to mask out compositionally biased regions in query sequences;

Domain annotation displays in CD-Search, BATCH CD-Search, and other services now all use a uniform display style. A new display option in CD-Search and BATCH CD-Search provides “standard” results, in addition to “concise” and “full” results. “Standard” results will provide, for each region on the query sequence, the best-scoring domain model (if any) from each of CDD’s database providers (Pfam, SMART, COG, TIGRFAMs, Protein Clusters, and the NCBI in-house curation project), but will suppress redundancy from within a single provider's results list.

You can access CDD at the Conserved Domains homepage and find updated content on the CDD FTP site.

Reference:

NCBI Website

NCBI Remap

Jit — Thu, 11 Feb 2016 11:02:26 -0600

NCBI Remap is conceptually similar to liftOver in that it manages conversions between a pair of genome assemblies, but it uses different methods to achieve these mappings. It is available through a simple web interface, or you can use the NCBI Remap API.

More at http://www.ncbi.nlm.nih.gov/genome/tools/remap

API http://www.ncbi.nlm.nih.gov/genome/tools/remap/docs/api

Address of the bookmark: http://www.ncbi.nlm.nih.gov/genome/tools/remap

Entrez Direct: E-utilities on the UNIX Command Line

Anjana — Wed, 19 Oct 2016 08:06:24 -0500

Entrez Direct (EDirect) is an advanced method for accessing the NCBI's suite of interconnected databases (publication, sequence, structure, gene, variation, expression, etc.) from a UNIX terminal window. Functions take search terms from command-line arguments. Individual operations are combined to build multi-step queries. Record retrieval and formatting normally complete the process.

EDirect also provides an argument-driven function that simplifies the extraction of data from document summaries or other results that are returned in structured XML format. This can eliminate the need for writing custom software to answer ad hoc questions. Queries can move seamlessly between EDirect commands and UNIX utilities or scripts to perform actions that cannot be accomplished entirely within Entrez.
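
As a rough illustration of how individual operations chain into a multi-step query, the sketch below pipes a search into retrieval and then into XML extraction. It assumes EDirect is installed and that NCBI is reachable over the network; the wrapper name pubmed_titles is hypothetical, and the function is only defined here, not called.

```shell
#!/bin/sh
# Sketch only: needs EDirect on the PATH and network access to NCBI.
# pubmed_titles is a hypothetical wrapper name; the query is the caller's.
# Step 1 (esearch): run the search and pass the result set downstream.
# Step 2 (efetch):  retrieve document summaries as structured XML.
# Step 3 (xtract):  pull the Id and Title elements out as tab-delimited text.
pubmed_titles() {
    esearch -db pubmed -query "$1" |
        efetch -format docsum |
        xtract -pattern DocumentSummary -element Id Title
}

# Example call (not run here):
#   pubmed_titles "magic-blast" | sort -n | head
```

The output is plain tab-delimited text, so it can be handed to sort, cut, awk or any other UNIX utility, as the paragraph above describes.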

Address of the bookmark: https://www.ncbi.nlm.nih.gov/books/NBK179288/

Magic-BLAST

Shruti Paniwala — Fri, 20 Mar 2020 15:18:36 -0500

Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score that takes both reads of a pair into account simultaneously and, in the case of RNA-seq, locates the candidate introns and adds up the scores of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.

Address of the bookmark: https://ncbi.github.io/magicblast/

New Release of RefSeq !

Abhi — Tue, 16 Jul 2024 10:09:21 -0500

Check out RefSeq release 225, now available online and from the FTP site. You can access RefSeq data through NCBI Datasets.

What’s included in this release?

As of July 8, 2024, this full release incorporates genomic, transcript, and protein data containing:

448,507,905 records
334,845,613 proteins
63,542,774 RNAs
Sequences from 152,668 organisms

The release is provided in several directories as a complete dataset and also as divided by logical groupings.

Run miniasm assembler on nanopore reads !

Jit — Mon, 18 Dec 2017 04:07:50 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

Find repeats within the reads:

fq2fa ONT_A.fastq ONT_A.fasta

minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf

awk '{if($1==$6){print}}' AONT.paf > AONTself.paf

awk '$5=="-"' AONTself.paf | awk '{print $1}'| sort|uniq > invertedrepeat.list
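
The two awk filters above rely on the PAF column layout: column 1 is the query name, column 5 the relative strand, and column 6 the target name. A minimal sketch that combines both filters and runs them on a toy PAF stream follows; the function name inverted_repeat_reads is hypothetical, and the read names and coordinates are made up.

```shell
#!/bin/sh
# Hypothetical helper combining the two awk filters above:
# keep hits where the query name (column 1) equals the target name (column 6)
# and the relative strand (column 5) is "-", then list each read name once.
inverted_repeat_reads() {
    awk '$1 == $6 && $5 == "-" { print $1 }' "$@" | sort -u
}

# Toy PAF records (made-up names and coordinates, 12 mandatory columns).
printf '%s\n' \
  'readA 5000 100 900 - readA 5000 3000 3800 700 800 60' \
  'readA 5000 100 900 + readB 6000 200 1000 750 800 60' \
  'readC 4000 10 500 + readC 4000 2000 2490 400 490 60' |
  tr ' ' '\t' | inverted_repeat_reads
```

Only readA is reported: it is the one read that maps to itself on the opposite strand, which is the signature of an inverted repeat or palindrome.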

Generate a few palindrome and repeat plots (highlighting only repeats larger than 10, 20 and 30 kb):

minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps

minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps

minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps

Assemble with miniasm:

miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta
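
The grep/awk step above works because each GFA segment ("S") line carries the unitig name in its second field and the sequence in its third. A small sketch of the same conversion on a toy GFA follows; the function name gfa_to_fasta is hypothetical, and the unitigs are made up.

```shell
#!/bin/sh
# Hypothetical helper: convert GFA segment ("S") lines to FASTA.
# Field 2 of an S line is the segment name, field 3 is its sequence.
gfa_to_fasta() {
    awk -F '\t' '$1 == "S" { print ">" $2; print $3 }' "$@"
}

# Toy GFA with made-up unitigs; the L (link) line is ignored by the filter.
printf 'S\tutg000001l\tACGTACGT\nL\tutg000001l\t+\tutg000002l\t-\t4M\nS\tutg000002l\tGGCC\n' |
    gfa_to_fasta
```

Matching on the first field being exactly "S" does the job of the grep in the original command, so other record types never reach the output.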

minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf

awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf

minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps

Enjoy the assembly !

BLASR: Mapping single molecule sequencing reads using Basic Local Alignment with Successive Refinement, theory and application

Jit — Wed, 23 May 2018 06:54:32 -0500

BLASR (Basic Local Alignment with Successive Refinement) is a tool for mapping Single Molecule Sequencing (SMS) reads that are thousands to tens of thousands of bases long, where the divergence between the read and the genome is dominated by insertion and deletion errors.

Here is how I use blasr to align PacBio reads to the contigs (target.fasta). The file “target.fasta.sa” is the suffix array of “target.fasta”, generated by sawriter.

blasr query.fa ./target.fasta -sa ./target.fasta.sa -bestn 40 -maxScore -500 -m 4 -nproc 24 -out target.m4 -maxLCPLength 15

The output format option “-m 4” generates the alignment coordinates. It is not fully documented, but I can explain it to you.

I use a server with 24 cores and 48 GB of RAM for the alignment. It took about 2 to 3 hours to align 3G of PacBio reads to 10^6 short-read contigs with a mean length of 3.5 kbp.

Address of the bookmark: http://bix.ucsd.edu/projects/blasr/

Porechop: tool for finding and removing adapters from Oxford Nanopore reads

Rahul Nayak — Tue, 29 May 2018 07:33:44 -0500

Porechop is a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.

Porechop also supports demultiplexing of Nanopore reads that were barcoded with the Native Barcoding Kit, PCR Barcoding Kit or Rapid Barcoding Kit.

Address of the bookmark: https://github.com/rrwick/Porechop