BOL: Related items

NVIDIA and Arc Institute Unveil Evo 2: A Breakthrough AI for DNA Design

BioStar — Fri, 21 Feb 2025 10:39:47 -0600

NVIDIA and the Arc Institute have introduced Evo 2, a groundbreaking AI model designed to understand, predict, and generate DNA sequences. This marks a major advancement in computational biology, offering scientists an unprecedented tool to decode the genetic blueprint of life and even design entirely new biological systems.

The Power of Evo 2: AI Meets DNA

Evo 2 is described as the largest AI model for biology to date, trained on 9.3 trillion DNA "letters" (nucleotides) selected from genomes spanning the entire tree of life. Training on a dataset of this size is intended to let Evo 2 recognize patterns and relationships in genetic sequences at very large scale.

Evo 2 moves beyond sequence analysis to active DNA generation. Researchers can use it to predict, modify, and generate genetic sequences, opening new possibilities in medicine, agriculture, and synthetic biology.

Decoding the Dark Genome

One of the biggest challenges in genetics is understanding the non-coding regions of DNA—vast stretches of the genome that do not code for proteins but play crucial roles in regulating gene expression. These regions control when and how genes are activated, influencing everything from development to disease.

Evo 2 is designed to decode these non-coding elements, helping researchers uncover their functions and use this knowledge to develop gene-based therapies, synthetic life forms, and precision agriculture solutions.

From Reading DNA to Writing It

To put Evo 2’s impact into perspective:

Previous AI models could "read" DNA like a book, analyzing genetic sequences and identifying patterns.
Evo 2 can "write" entirely new DNA, designing functional genes, chromosomes, and even full genomes from scratch.

This means scientists can now engineer biological systems with AI, designing new proteins, metabolic pathways, and genetic circuits to address real-world challenges.

A Step Toward Generative Biology

The Arc Institute describes Evo 2 as a major step toward "generative biology"—a revolutionary approach where AI is used to create novel biological structures rather than just analyzing existing ones. This could lead to breakthroughs such as:

New medicines: AI-generated enzymes and proteins tailored for targeted therapies.
Disease-resistant crops: Genetically optimized plants for higher yield and climate resilience.
Synthetic organisms: Custom-designed microbes for bioremediation, biofuel production, and industrial applications.

An Open-Source Revolution

Unlike many proprietary AI models, Evo 2 is open source, making its capabilities accessible to researchers worldwide. This democratization of AI-driven biology means that scientists from different disciplines can collaborate, experiment, and innovate, accelerating discoveries in genetic engineering and synthetic biology.

With Evo 2, the boundaries of what’s possible in DNA design, genetic engineering, and biological innovation are being redrawn. The future of life sciences is no longer just about understanding life’s code—it’s about writing it.

WhatsHap: fast and accurate read-based phasing

Jit — Mon, 28 May 2018 09:52:16 -0500

WhatsHap is software for phasing genomic variants using DNA sequencing reads, a task also called read-based phasing or haplotype assembly. It is especially suitable for long reads, but also works well with short reads.

Features

Very accurate results (Martin et al., WhatsHap: fast and accurate read-based phasing)

Works well with Illumina, PacBio, Oxford Nanopore and other types of reads

It phases SNVs, indels and even “complex” variants (such as TCG → AGAA)

Pedigree phasing mode uses reads from related individuals (such as trios) to improve results and to reduce coverage requirements (Garg et al., Read-Based Phasing of Related Individuals).

WhatsHap is easy to install

It is easy to use: Pass in a VCF and one or more BAM files, get out a phased VCF. Supports multi-sample VCFs.

It produces standard-compliant VCF output by default

If desired, get output that is compatible with ReadBackedPhasing

Open Source (MIT license)
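
To make the "phased VCF" output more concrete, here is a minimal Python sketch of reading a phased genotype from one VCF record. It relies only on general VCF conventions: a `|` separator in the GT field marks a phased genotype (`/` marks an unphased one), and a PS value names the phase set a variant belongs to. The example record, the `PhasedCall` type and the `parse_sample_field` helper are made up for illustration and are not part of WhatsHap.

```python
from typing import NamedTuple, Optional, Tuple


class PhasedCall(NamedTuple):
    alleles: Tuple[Optional[int], ...]  # allele indices in haplotype order, e.g. (0, 1)
    phased: bool                        # True if the GT separator was "|"
    phase_set: Optional[str]            # PS value, or None when absent or "."


def parse_sample_field(format_field: str, sample_field: str) -> PhasedCall:
    """Extract GT and PS from one sample column of a VCF record.

    Hypothetical helper for illustration; real pipelines would normally
    use a VCF library instead of splitting strings by hand.
    """
    entry = dict(zip(format_field.split(":"), sample_field.split(":")))

    gt = entry["GT"]
    phased = "|" in gt
    separator = "|" if phased else "/"
    # A "." allele means the call is missing.
    alleles = tuple(None if a == "." else int(a) for a in gt.split(separator))

    phase_set = entry.get("PS")
    if phase_set == ".":
        phase_set = None
    return PhasedCall(alleles, phased, phase_set)


# Invented example record: a heterozygous SNV phased into phase set 1000.
record = "chr1\t1000\t.\tA\tG\t50\tPASS\t.\tGT:PS\t0|1:1000"
columns = record.split("\t")
call = parse_sample_field(columns[8], columns[9])
print(call.alleles, call.phased, call.phase_set)
```

Two variants can be assigned to the same pair of haplotypes only when they share a phase set, so a downstream script would typically group calls by `phase_set` before comparing allele order.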

Address of the bookmark: https://whatshap.readthedocs.io/en/latest/

ALE: a Generic Assembly Likelihood Evaluation Framework for Assessing the Accuracy of Genome and Metagenome Assemblies

Neel — Tue, 26 Apr 2016 03:38:43 -0500

The Assembly Likelihood Evaluation (ALE) framework systematically evaluates the accuracy of an assembly in a reference-independent manner using rigorous statistical methods. The framework is comprehensive: it integrates read quality, mate-pair orientation and insert length (for paired-end reads), sequencing coverage, read alignment and k-mer frequency. ALE pinpoints synthetic errors in both single-genome and metagenomic assemblies, including single-base errors, insertions/deletions, genome rearrangements and chimeric assemblies presented in metagenomes. At the genome level with real-world data, ALE identifies three large misassemblies from the Spirochaeta smaragdinae finished genome, which were all independently validated by Pacific Biosciences sequencing. At the single-base level with Illumina data, ALE recovers 215 of 222 (97%) single nucleotide variants in a training set from a GC-rich Rhodobacter sphaeroides genome. Using real Pacific Biosciences data, ALE identifies 12 of 12 synthetic errors in a Lambda Phage genome, surpassing even Pacific Biosciences' own variant caller, EviCons. In summary, the ALE framework provides a comprehensive, reference-independent and statistically rigorous measure of single genome and metagenome assembly accuracy, which can be used to identify misassemblies or to optimize the assembly process.

More at http://www.ncbi.nlm.nih.gov/pubmed/23303509

Address of the bookmark: http://sc932.github.io/ALE/about.html

FQC Dashboard: Integrates FastQC results into a web-based, interactive, and extensible FASTQ quality control tool

Shruti Paniwala — Tue, 10 Nov 2020 01:30:22 -0600

FQC is software that facilitates quality control of FASTQ files by carrying out a QC protocol using FastQC, parsing results, and aggregating quality metrics into an interactive dashboard designed to richly summarize individual sequencing runs. The dashboard groups samples in dropdowns for navigation among the data sets, utilizes human-readable configuration files to manipulate the pages and tabs, and is extensible with CSV data.

Address of the bookmark: https://github.com/pnnl/fqc

LoFreq*: A sequence-quality aware, ultra-sensitive variant caller for NGS data

BioStar — Tue, 18 Feb 2020 03:24:22 -0600

LoFreq* (i.e. LoFreq version 2) is a fast and sensitive variant-caller for inferring SNVs and indels from next-generation sequencing data. It makes full use of base-call qualities and other sources of errors inherent in sequencing (e.g. mapping or base/indel alignment uncertainty), which are usually ignored by other methods or only used for filtering.
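
As a rough illustration of what "making full use of base-call qualities" can mean, the Python sketch below converts Phred scores into per-base error probabilities and asks how likely it is to see at least k mismatches at a site from sequencing error alone. This is a generic quality-aware test written for this note, not LoFreq's actual implementation; the function names and the example pileup are assumptions.

```python
from typing import List


def phred_to_error_prob(quality: int) -> float:
    """Standard Phred definition: Q = -10 * log10(p_error)."""
    return 10 ** (-quality / 10)


def prob_at_least_k_errors(error_probs: List[float], k: int) -> float:
    """P(X >= k), where X counts errors among independent bases.

    Each base i is wrong with its own probability error_probs[i], so X is
    a sum of non-identical Bernoulli trials. The distribution is built up
    one base at a time with dynamic programming.
    """
    # dist[j] = probability of exactly j errors among the bases seen so far
    dist = [1.0]
    for p in error_probs:
        updated = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            updated[j] += mass * (1 - p)   # this base is correct
            updated[j + 1] += mass * p     # this base is an error
        dist = updated
    return sum(dist[k:])


# Invented pileup: 100 reads cover the site, all with base quality 30,
# and 3 of them show a non-reference base.
qualities = [30] * 100
p_value = prob_at_least_k_errors(
    [phred_to_error_prob(q) for q in qualities], 3
)
print(f"P(>= 3 mismatches from error alone) = {p_value:.3g}")
```

A small probability suggests the mismatches are unlikely to be pure sequencing noise. A real caller would also account for mapping and alignment uncertainty and correct for multiple testing, as the description above indicates.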

https://github.com/CSB5/lofreq

http://csb5.github.io/lofreq/installation/

https://github.com/CSB5/lofreq/tree/master/dist

Address of the bookmark: http://csb5.github.io/lofreq/

Reference-free prediction of rearrangement breakpoint reads

Neel — Thu, 08 Mar 2018 05:05:25 -0600

SlideSort-BPR (breakpoint reads) is based on a fast algorithm for all-against-all comparisons of short reads and theoretical analyses of the number of neighboring reads. When applied to a dataset with a sequencing depth of 100×, it finds ∼88% of the breakpoints correctly with no false-positive reads. Moreover, evaluation on a real prostate cancer dataset shows that the proposed method predicts more fusion transcripts correctly than previous approaches, and yet produces fewer false-positive reads. To our knowledge, this is the first method to detect breakpoint reads without using a reference genome.

https://github.com/ewijaya/slidesort-bpr

Address of the bookmark: https://code.google.com/archive/p/slidesort-bpr/

SuRankCo: supervised ranking of contigs in de novo assemblies

Neel — Wed, 24 May 2017 04:46:52 -0500

SuRankCo is a machine learning based software to score and rank contigs from de novo assemblies of next generation sequencing data. It trains with alignments of contigs with known reference genomes and predicts scores and ranking for contigs which have no related reference genome yet.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0644-7

Address of the bookmark: https://sourceforge.net/projects/surankco/

Minipolish: A tool for Racon polishing of miniasm assemblies

BioStar — Tue, 03 Dec 2019 02:40:54 -0600

Miniasm is a great long-read assembly tool: straightforward, effective and very fast. However, it does not include a polishing step, so its assemblies have a high error rate – they are essentially made of stitched-together pieces of long reads.

Racon is a great polishing tool that can be used to clean up assembly errors. It's also very fast and well suited for long-read data. However, it operates on FASTA files, not the GFA graphs that miniasm makes.

That's where Minipolish comes in. With a single command, it will use Racon to polish up a miniasm assembly, while keeping the assembly in graph form.

It also takes care of some of the other nuances of polishing a miniasm assembly:

Adding read depth information to contigs
Fixing sequence truncation that can occur in Racon
Adding circularising links to circular contigs if not already present (so they display better in Bandage)
'Rotating' circular contigs between polishing rounds to ensure clean circularisation
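
The rotation step in the list above can be pictured with a short Python sketch. A circular contig stored as a linear string has an arbitrary start position, so shifting that start moves the old start/end junction into the interior of the sequence, where a later polishing round can presumably treat it like any other region. The `rotate_circular` function is a hypothetical illustration, not code from Minipolish.

```python
def rotate_circular(sequence: str, offset: int) -> str:
    """Return the same circular sequence written from a new start position.

    The bases and their cyclic order are unchanged; only the point at
    which the circle is cut into a linear string moves.
    """
    if not sequence:
        return sequence
    offset %= len(sequence)  # offsets past the end wrap around the circle
    return sequence[offset:] + sequence[:offset]


contig = "ACGTTGCA"
rotated = rotate_circular(contig, 3)
print(rotated)  # TTGCAACG
```

Rotating by a different offset between rounds means no single position stays at the contig ends for every round.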

Address of the bookmark: https://github.com/rrwick/Minipolish

Syri compares alignments between two chromosome-level assemblies and identifies synteny and structural rearrangements.

Shruti Paniwala — Wed, 01 Jun 2022 02:01:13 -0500

SyRI (Synteny and Rearrangement Identifier) takes alignments between two chromosome-level assemblies as input and identifies syntenic regions and structural rearrangements between them.

Address of the bookmark: https://github.com/schneebergerlab/syri

Tiger genome sequenced

Rahul Agarwal — Tue, 17 Sep 2013 16:48:24 -0500

Fifteen scientists led by Dr Jong Bhak of Genome Research Foundation, South Korea, decoded as many as 3 billion nucleotides (organic molecules that form the basic building blocks of nucleic acids, such as DNA). They identified 20,000 genes related to various functions of the tiger.

The biggest and perhaps most fearsome of the world's big cats, the tiger, shares 95.6 percent of its DNA with humans' cute and furry companions, domestic cats.

The new research showed that big cats have genetic mutations that enabled them to be carnivores. The team also identified mutations that allow snow leopards to thrive at high altitudes.

Reference:

http://www.nbcnews.com/science/your-cat-ferocious-tigers-share-lot-95-6-percent-their-4B11182690

http://timesofindia.indiatimes.com/home/environment/flora-fauna/Gene-mapping-of-tiger-completed/articleshow/22671681.cms

Paper:

http://www.nature.com/ncomms/2013/130917/ncomms3433/full/ncomms3433.html