BOL: Related items

NVIDIA and Arc Institute Unveil Evo 2: A Breakthrough AI for DNA Design

BioStar — Fri, 21 Feb 2025 10:39:47 -0600

NVIDIA and the Arc Institute have introduced Evo 2, a groundbreaking AI model designed to understand, predict, and generate DNA sequences. This marks a major advancement in computational biology, offering scientists an unprecedented tool to decode the genetic blueprint of life and even design entirely new biological systems.

The Power of Evo 2: AI Meets DNA

Evo 2 is described by its developers as the largest AI model for biology to date, trained on 9.3 trillion DNA "letters" (nucleotides) selected from genomes spanning the entire tree of life. A dataset of this size lets Evo 2 learn patterns and relationships in genetic sequences across all domains of life.

Evo 2 lets scientists move beyond sequence analysis to DNA generation at genome scale. Researchers can use it to predict, modify, and generate genetic sequences, opening new possibilities in medicine, agriculture, and synthetic biology.

Decoding the Dark Genome

One of the biggest challenges in genetics is understanding the non-coding regions of DNA—vast stretches of the genome that do not code for proteins but play crucial roles in regulating gene expression. These regions control when and how genes are activated, influencing everything from development to disease.

Evo 2 is designed to decode these non-coding elements, helping researchers uncover their functions and use this knowledge to develop gene-based therapies, synthetic life forms, and precision agriculture solutions.

From Reading DNA to Writing It

To put Evo 2’s impact into perspective:

Previous AI models could "read" DNA like a book, analyzing genetic sequences and identifying patterns.
Evo 2 can also "write" new DNA, generating sequences at the scale of genes, chromosomes, and even small genomes.

This means scientists can now engineer biological systems with AI, designing new proteins, metabolic pathways, and genetic circuits to address real-world challenges.

A Step Toward Generative Biology

The Arc Institute describes Evo 2 as a major step toward "generative biology"—a revolutionary approach where AI is used to create novel biological structures rather than just analyzing existing ones. This could lead to breakthroughs such as:

New medicines: AI-generated enzymes and proteins tailored for targeted therapies.
Disease-resistant crops: Genetically optimized plants for higher yield and climate resilience.
Synthetic organisms: Custom-designed microbes for bioremediation, biofuel production, and industrial applications.

An Open-Source Revolution

Unlike many proprietary AI models, Evo 2 is open source, making its capabilities accessible to researchers worldwide. This democratization of AI-driven biology means that scientists from different disciplines can collaborate, experiment, and innovate, accelerating discoveries in genetic engineering and synthetic biology.

With Evo 2, the boundaries of what’s possible in DNA design, genetic engineering, and biological innovation are being redrawn. The future of life sciences is no longer just about understanding life’s code—it’s about writing it.

LAMSA: fast split read alignment with long approximate matches

Jit — Tue, 15 May 2018 04:44:42 -0500

LAMSA (Long Approximate Matches-based Split Aligner) is a split alignment approach that combines high speed with good handling of structural variant (SV) events. It is well suited to aligning long reads (thousands of base pairs or more). LAMSA takes advantage of the rareness of SVs to implement a purpose-built two-step strategy. First, it splits the read into relatively long fragments and aligns them co-linearly, which resolves small variations and sequencing errors and mitigates the effect of repeats. Second, it uses the fragment alignments in a sparse dynamic programming (SDP)-based split alignment step to handle large or non-co-linear variants. We benchmarked LAMSA on simulated and real datasets with various read lengths and sequencing error rates; the results demonstrate that it is substantially faster than state-of-the-art long-read aligners while also handling various categories of SVs well. LAMSA is open source and free for non-commercial use. It was mainly designed by Bo Liu and Yan Gao and developed by Yan Gao at the Center for Bioinformatics, Harbin Institute of Technology, China.

Address of the bookmark: https://github.com/hitbc/LAMSA

nanofilt: Filtering and trimming of long read sequencing data

Jit — Mon, 30 Jul 2018 12:01:52 -0500

Filtering on quality and/or read length, and optional trimming after passing filters.
Reads from stdin, writes to stdout.

Intended to be used:

directly after fastq extraction
prior to mapping
in a stream between extraction and mapping
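
The streaming usage listed above can be illustrated with a small stand-in filter. The sketch below is not NanoFilt itself: `filter_fastq_by_length` is a hypothetical awk-based helper that applies only a minimum-length cutoff and assumes plain 4-line FASTQ records on stdin. It shows the read-from-stdin, write-to-stdout pattern that lets such a filter sit between extraction and mapping.

```shell
# Hypothetical stand-in for a streaming read filter (not NanoFilt's implementation).
# Keeps 4-line FASTQ records whose sequence has at least $1 bases.
filter_fastq_by_length() {
    awk -v min_len="$1" '
        NR % 4 == 1 { header = $0 }   # @read_name line
        NR % 4 == 2 { seq = $0 }      # sequence line
        NR % 4 == 3 { plus = $0 }     # separator line
        NR % 4 == 0 {                 # quality line: decide whether to emit the record
            if (length(seq) >= min_len + 0)
                print header "\n" seq "\n" plus "\n" $0
        }
    '
}
```

In a pipeline the helper would be placed between the two steps, for example `filter_fastq_by_length 500 < reads.fastq > filtered.fastq`, where both file names are placeholders.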

https://github.com/wdecoster/nanofilt

Address of the bookmark: https://github.com/wdecoster/nanofilt

TULIP - The Uncorrected Long read Integration Pipeline

Jit — Thu, 23 Nov 2017 09:30:01 -0600

Running TULIP (The Uncorrected Long-read Integration Process), version 0.4 late 2016 (European eel)

TULIP currently consists of two Perl scripts, tulipseed.perl and tulipbulb.perl. These are very much intended as prototypes, and additional components and/or implementations are likely to follow.
Tulipseed takes as input alignment files of long reads to sparse short seeds, and outputs a graph and scaffold structures. Tulipbulb adds long read sequencing data to these.

https://github.com/Generade-nl/TULIP

Address of the bookmark: https://github.com/Generade-nl/TULIP

TULIP - The Uncorrected Long read Integration Pipeline

Jit — Tue, 15 May 2018 09:06:37 -0500

TULIP currently consists of two Perl scripts, tulipseed.perl and tulipbulb.perl. These are very much intended as prototypes, and additional components and/or implementations are likely to follow. Tulipseed takes as input alignment files of long reads to sparse short seeds, and outputs a graph and scaffold structures.

Address of the bookmark: https://github.com/Generade-nl/TULIP

Run the miniasm assembler on nanopore reads!

Jit — Mon, 18 Dec 2017 04:07:50 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically produced by minimap) as input and outputs an assembly graph in the GFA format. Unlike mainstream assemblers, miniasm does not have a consensus step: it simply concatenates pieces of read sequences to generate the final unitig sequences, so the per-base error rate is similar to that of the raw input reads.

Find repeats within the reads:

fq2fa ONT_A.fastq ONT_A.fasta

minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf

awk '{if($1==$6){print}}' AONT.paf > AONTself.paf

awk '$5=="-"' AONTself.paf | awk '{print $1}'| sort|uniq > invertedrepeat.list
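
The two filtering steps above rely on the PAF column layout: column 1 is the query name, column 5 the relative strand, and column 6 the target name. A minimal sketch of the same selection in a single awk pass follows. `inverted_self_hits` and `demo_paf` are hypothetical helper names, and the three PAF lines are made-up records for illustration, not real minimap2 output.

```shell
# Hypothetical helper: list reads that match themselves on the reverse strand.
# Reads PAF on stdin; column 1 = query name, column 5 = strand, column 6 = target name.
inverted_self_hits() {
    awk '$1 == $6 && $5 == "-" { print $1 }' | sort -u
}

# Made-up PAF records: readA has a reverse-strand self-hit, readB only a
# forward-strand self-hit, and readC overlaps a different read.
demo_paf() {
    printf 'readA\t9000\t100\t4000\t-\treadA\t9000\t5000\t8900\t3500\t3900\t0\n'
    printf 'readB\t7000\t0\t3000\t+\treadB\t7000\t3500\t6500\t2800\t3000\t0\n'
    printf 'readC\t8000\t50\t2000\t-\treadA\t9000\t200\t2150\t1800\t1950\t0\n'
}

demo_paf | inverted_self_hits   # prints: readA
```

On real data the helper would be fed the all-vs-all file directly, for example `inverted_self_hits < AONT.paf > invertedrepeat.list`.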

Generate a few palindrome and repeat plots (showing only repeats longer than 10, 20, and 30 kb):

minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps

minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps

minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps

Assemble with miniasm:

miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta
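
The extraction step above relies on the GFA layout: each segment line starts with `S`, followed by the segment name and its sequence, which is why fields 2 and 3 are printed. As a quick sanity check of the graph, the sketch below counts segments and sums their sequence lengths. `gfa_segment_summary` and `demo_gfa` are hypothetical helper names, and the GFA records are toy data rather than actual miniasm output.

```shell
# Hypothetical helper: count GFA segments (S lines) and sum their sequence lengths.
gfa_segment_summary() {
    awk '$1 == "S" { n++; total += length($3) }
         END { printf "%d segments, %d bp\n", n, total }'
}

# Toy GFA: two segment lines and one non-segment line that must be ignored.
demo_gfa() {
    printf 'S\tutg000001l\tACGTACGTAC\tLN:i:10\n'
    printf 'a\tutg000001l\t0\tread1:1-10\t+\t10\n'
    printf 'S\tutg000002l\tGGCC\tLN:i:4\n'
}

demo_gfa | gfa_segment_summary   # prints: 2 segments, 14 bp
```

Applied to the assembly, this would be `gfa_segment_summary < AONT.gfa`.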

minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf

awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf

minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps

Enjoy the assembly!

Frequently used paired-end read (PE 2x100) mapping command lines

Jit — Tue, 15 May 2018 08:59:29 -0500

bowtie2 -x hs37m -X 650 -q -1 r1.fq -2 r2.fq -S r12.bowtie2.sam

bwa aln hs37m.fa r1.fq > r1.sai && bwa aln hs37m.fa r2.fq > r2.sai \
&& bwa sampe hs37m r1.sai r2.sai r1.fq r2.fq > r12.bwa.sam

bwa bwasw ../index/bwa/hs37m.fa r12.fq > r12.bwasw.sam

gsnap -A sam -d hs37m r1.fq r2.fq > r12.gsnap.sam

novoalign -r Random -o SAM -f r1.fq r2.fq -i 500 50 -d hs37m-k14s3.novo > r12.novo.sam

smalt map -f samsoft -i 650 -o r12.smalt-k20s13.sam hs37m-k20s13 r1.fq r2.fq

stampy.py -g hs37m -h hs37m -o r12.stampy.sam -M r1.fq,r2.fq

soap -D hs37m.fa.index -a r1.fq -b r2.fq -l 32 -g 3 -u dummy -2 dummy -o r12.soap

npScarf: real-time scaffolder using SPAdes contigs and Nanopore sequencing reads

Shruti Paniwala — Mon, 11 Jun 2018 05:14:57 -0500

npScarf (jsa.np.npscarf) is a program that connects contigs from a draft genome to generate sequences that are closer to finished. Its pipelines can run on a single laptop for microbial datasets. In real-time mode, it can be integrated with simple structural analyses such as gene ordering and plasmid formation.

Address of the bookmark: http://japsa.readthedocs.io/en/latest/tools/jsa.np.npscarf.html

PANDASEQ is a program to align Illumina reads, optionally with PCR primers embedded in the sequence, and reconstruct an overlapping sequence.

BioStar — Fri, 21 Sep 2018 10:19:52 -0500

Development packages for zlib and libbz2 are needed, as well as a standard compiler environment. On Ubuntu, these can be installed via:

sudo apt-get install build-essential libtool automake zlib1g-dev libbz2-dev pkg-config

On macOS, the Apple Developer Tools and Fink (or MacPorts or Homebrew) must be installed, then:

sudo fink install bzip2-dev pkgconfig

Address of the bookmark: https://github.com/neufeld/pandaseq

SViper: Swipe your Structural Variants called on long (ONT/PacBio) reads with short exact (Illumina) reads.

Neel — Sun, 22 Dec 2019 03:48:28 -0600

Call sviper

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants

This will output a polished_variants.vcf file that contains all the refined variants.

Sometimes it is helpful to look at the polished sequence, e.g. with the IGV browser. In that case you want SViper to output the polished and aligned sequences in a bam file via the option --output-polished-bam:

~$ ./sviper -s short-reads.bam -l long-reads.bam -r ref.fa -c variants.vcf -o polished_variants --output-polished-bam

Address of the bookmark: https://github.com/smehringer/SViper