BOL: Related items

MetaGraph: Ultra Scalable Framework for DNA Search, Alignment, Assembly

Abhi — Sat, 08 Jun 2024 16:15:25 -0500

The MetaGraph framework is designed to work with a wide range of input data sets, indexing from a few samples up to the contents of entire archives with hundreds of thousands of records. The indexing workflow always follows the same principle, transforming single input samples into error-removed, refined sample graphs, which are then merged into a joint metagraph index. Each input sample is annotated in the joint index as a subgraph. This graph index enriched with metadata can then be used for downstream applications such as sequence search or differential assembly.
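The workflow described above (clean each sample, merge, annotate, then search) can be illustrated with a toy sketch. This is not MetaGraph's API or data structure: it uses plain Python dicts and sets instead of a compressed graph, and the names `build_sample_graph`, `merge_samples`, `search` and the `min_count` threshold are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_sample_graph(reads, k, min_count=2):
    """Per-sample step: count k-mers and drop rare ones as likely errors."""
    counts = defaultdict(int)
    for read in reads:
        for km in kmers(read, k):
            counts[km] += 1
    return {km for km, c in counts.items() if c >= min_count}

def merge_samples(sample_graphs):
    """Joint step: map each k-mer to the set of sample labels that contain it."""
    index = defaultdict(set)
    for label, graph in sample_graphs.items():
        for km in graph:
            index[km].add(label)
    return index

def search(index, query, k):
    """Return, per sample label, the fraction of query k-mers found in it."""
    query_kmers = list(kmers(query, k))
    hits = defaultdict(int)
    for km in query_kmers:
        for label in index.get(km, ()):
            hits[label] += 1
    return {label: n / len(query_kmers) for label, n in hits.items()}

# Two tiny made-up samples; the k-mers seen only once in sample_A are filtered out.
samples = {
    "sample_A": build_sample_graph(["ACGTAC", "ACGTAC", "ACGTTT"], k=4),
    "sample_B": build_sample_graph(["CGTACG", "CGTACG"], k=4),
}
index = merge_samples(samples)
result = search(index, "ACGTACG", k=4)
print(result)
```

Each sample ends up as a label on the k-mers it contributed, which is the "sample as annotated subgraph" idea in miniature.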

Search link: https://metagraph.ethz.ch/search

Pre-print https://www.biorxiv.org/content/10.1101/2020.10.01.322164v4

Address of the bookmark: https://metagraph.ethz.ch/

BioGeek Fun

Jit — Sun, 16 Mar 2014 06:33:31 -0500

1. A futuristic computational biology student was told to write "It is in my gene!!!" on the board 100 times as a punishment. Here's his response:

use strict;
use warnings;
# Print the line 100 times, one per row
for my $count (1 .. 100) { print "It is in my gene!!!\n"; }

I guess he is going to be a real biogeek. Nice try though. Smart kid.

2. In some Perl script I found this:
. . . . . .
. . . . . .
# It works for me; only God understands how it works
while (/(<\/[^>]+>)|(<[^>]+>)|(<[^>]+>)$|([^><]+)/go) {
            $startGene=$1;
            $beginChromosome=$2;

. . . . . .
.. . . . . .
}

3. One more interesting message found in Perl. It will surely tickle your funny bone :)
open(my $fh, "<", "gene.txt") or kill " Me if you think this is a mistake :$!";

4. From Perl:

while () { # "The Mothership Connection is here!"
    print "$_\n"; # Printing the offspring :)
}

5. Perl message
if ($1) { print "Just found the error in the chromosome!!! Yahoo..."; } else { print "That is not an error, but a mutation, you moron!"; }

6. A genome database curator walked into a wine bar and asked the bartender:
CREATE TABLE gene IF NOT EXISTS SexOnTheBeach;

A Bioinformatician’s Lament

LEGE — Thu, 29 May 2025 01:33:31 -0500

"I have a presentation tomorrow," they say,

With hopeful eyes, like it’s all child's play.
As if results bloom overnight, full-grown—
Not wrangled from chaos, and error-prone.

Oh brave soul, sit, let’s walk through the tale,
Of pipelines broken and servers that fail.
The journey starts: “The data? It’s there—
Just fetch it from S3, easy, I swear.”

Now I summon awscli with dread,
Reset my keys, credentials fed.
Configure regions, IAM roles too—
All this, and still no peek at the view.

Next up, the tool: “It’s open source!”
On GitHub, rotting, no sign of remorse.
Python 2.7, some GCC trick—
The install alone might make you sick.

Finally, progress! The pipeline runs…
Till RAM collapses and error stuns.
Oh, and the metadata? A crime,
Merged cells, font soup, out of time.

Sample IDs—what a cryptic game:
Sample_1, S1, sample-1... the same?
Controls mislabeled, cases flipped,
No wonder my sanity's starting to slip.

Then QC plots, PCA joy—
Wait, that’s a tumor labeled as a boy?
Clusters cross, and axes lie,
And I still don’t know which sample’s "guy."

But the clock ticks on, and it’s half-past doom,
They want the final UMAP soon.
With pastel colors, labeled clear—
"Can we move that legend to right here?"

Tweak by tweak, I adjust each frame,
Resize Panel B, annotate a name.
Export the plot—it starts to gleam…
Then my laptop crashes. I scream.

This is the grind, the long-haul game,
Where science hides behind code and flame.
No “Export to Nature” button to press,
Just toil and logic and hope for success.

So next time you whisper that fated line—
“I have a talk, can you make it shine?”
Know: bioinformatics is craft, not a click,
It’s science with scars, not just a quick fix.

To all who debug at 3AM light,
Who ghostwrite figures through sleepless night—
You are the backbone, silent and true,
First-author-worthy, if only they knew.

"I have a presentation tomorrow," they say,

With hopeful eyes, as if it were all easy.
As if results appear overnight,
Rather than being dug out of a maze of data.

Come, sit, let me tell you a story,
Where pipelines break and even servers get tired.
The story begins: "The data is there,
Just in an S3 bucket, somewhere close by."

Now I call awscli with dread,
Set the keys, add the credentials, fill in the region.
So much effort, and still no data,
The whole day went into setup alone.

Then comes the tool: "It's open source!"
It is on GitHub, lying dry since 2019.
It needs Python 2.7, an old compiler,
And the power of a little prayer.

At last the tool ran, and there was a flicker of joy,
But as soon as it ran, the memory gave up.
And the metadata? An Excel disaster,
Merged cells, what more trouble could you want?

Sample IDs? Only God knows:
Sample_1, sample-1, S1, and control1,
Are these all the same sample?
You find out only after asking two or three times.

The count matrix is ready, now it is the turn of R or Python,
Run QC, plot the PCA, but something is badly wrong.
The swapping game of tumor and normal,
Again and again, the same old mess.

Finally comes the time for modeling,
The labor of stats, plots, and differential expression.
But the clock has already struck 5, sir,
And the UMAP is needed by 8, neat and clean.

So I sit up all night writing code,
Color palette, gene labels, the legend placed outside.
Fonts, panels, axes, all fixed,
I export... and the laptop says, "Not now, buddy!"

That is why bioinformatics takes time,
It is not "just run Seurat" or "make a volcano plot."
It is sysadmin work, data cleaning,
QC, debugging, and the true fight of science.

So take something from this lament today:
Do not ask for a miracle 24 hours beforehand.
Good figures come from clean data.
Bioinformatics is not magic, it is science.
Talk early, respect the process.

And a salute to all the bioinformaticians
Who stay awake at night for other people's presentations:
You are the ghostwriters of figures,
You are the unnamed co-authors.
You deserve to be first author,
And a long sleep too.

Note: Written with the help of AI/LLM tools!

New born babies get ready to know their whole genome soon!!!

Rahul Agarwal — Thu, 05 Sep 2013 07:24:02 -0500

The USA has launched pilot projects to examine the medical information of newborn babies. The projects are funded by the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) and the National Human Genome Research Institute (NHGRI), both parts of the National Institutes of Health.

Awards of $5 million to four grantees have been made in fiscal year 2013 under the Genomic Sequencing and Newborn Screening Disorders research program. The program will be funded at $25 million over five years, as funds are made available.

"Hundreds of US babies will be pioneers in genomic medicine through a US$25-million programme to sequence their genomes soon after they are born."

Source:

http://blogs.nature.com/news/2013/09/scientists-to-sequence-hundreds-of-newborns-genomes.html

http://www.genome.gov/27554919

Opera: An optimal genome scaffolding program

Jit — Mon, 27 Nov 2017 10:18:20 -0600

Opera (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end or long reads to optimally order and orient contigs assembled from shotgun-sequencing reads.

An updated version called OPERA-LG has been re-engineered with features for the assembly of large and complex genomes.

Song Gao, Denis Bertrand, Burton K. H. Chia and Niranjan Nagarajan. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology, May 2016, doi: 10.1186/s13059-016-0951-y.

Song Gao, Wing-Kin Sung, Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology, Sept. 2011, doi:10.1089/cmb.2011.0170.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0951-y

Address of the bookmark: https://sourceforge.net/projects/operasf/

SPAdes hybrid genome assembly

Jit — Mon, 27 Nov 2017 08:05:40 -0600

When you have both Illumina and Nanopore data, SPAdes remains a good option for hybrid assembly. SPAdes was used to produce the B. fragilis assembly by Mick Watson’s group.

Again, running spades.py will show you the options:

spades.py

This produces:

SPAdes genome assembler v3.10.1

Usage: /usr/local/SPAdes-3.10.1-Linux/bin/spades.py [options] -o <output_dir>

Basic options:
-o <output_dir>         directory to store all the resulting files (required)
--sc                    this flag is required for MDA (single-cell) data
--meta                  this flag is required for metagenomic sample data
--rna                   this flag is required for RNA-Seq data
--plasmid               runs plasmidSPAdes pipeline for plasmid detection
--iontorrent            this flag is required for IonTorrent data
--test                  runs SPAdes on toy dataset
-h/--help               prints this usage message
-v/--version            prints version

Input data:
--12 <filename>                 file with interlaced forward and reverse paired-end reads
-1 <filename>                   file with forward paired-end reads
-2 <filename>                   file with reverse paired-end reads
-s <filename>                   file with unpaired reads
--pe<#>-12 <filename>           file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-1 <filename>            file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-2 <filename>            file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-s <filename>            file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-<or>                    orientation of reads for paired-end library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--s<#> <filename>               file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)
--mp<#>-12 <filename>           file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-1 <filename>            file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-2 <filename>            file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-s <filename>            file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-<or>                    orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--hqmp<#>-12 <filename>         file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-1 <filename>          file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-2 <filename>          file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-s <filename>          file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-<or>                  orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--nxmate<#>-1 <filename>        file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--nxmate<#>-2 <filename>        file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--sanger <filename>             file with Sanger reads
--pacbio <filename>             file with PacBio reads
--nanopore <filename>           file with Nanopore reads
--tslr <filename>               file with TSLR-contigs
--trusted-contigs <filename>    file with trusted contigs
--untrusted-contigs <filename>  file with untrusted contigs

Pipeline options:
--only-error-correction runs only read error correction (without assembling)
--only-assembler        runs only assembling (without read error correction)
--careful               tries to reduce number of mismatches and short indels
--continue              continue run from the last available check-point
--restart-from <cp>     restart run with updated options and from the specified check-point ('ec', 'as', 'k<int>', 'mc')
--disable-gzip-output   forces error correction not to compress the corrected reads
--disable-rr            disables repeat resolution stage of assembling

Advanced options:
--dataset <filename>            file with dataset description in YAML format
-t/--threads <int>              number of threads
                                [default: 16]
-m/--memory <int>               RAM limit for SPAdes in Gb (terminates if exceeded)
                                [default: 250]
--tmp-dir <dirname>             directory for temporary files
                                [default: /tmp]
-k <int,int,...>                comma-separated list of k-mer sizes (must be odd and
                                less than 128) [default: 'auto']
--cov-cutoff <float>            coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset <33 or 64>       PHRED quality offset in the input reads (33 or 64)
                                [default: auto-detect]

As you can see this is also a “pipeline” of tools that can be switched on or off. SPAdes takes quite a long time, so for the purposes of this practical, something like this may suffice:

spades.py -t 4 \
          -m 32 \
          -k 31,51,71 \
          --only-assembler \
          -1 miseq.1.fastq -2 miseq.2.fastq \
          --nanopore minion.fastq \
          -o hybrid_assembly

In turn, these parameters mean

use 4 threads
max memory is 32Gb
use 3 kmer values to build the de bruijn graph(s) - 31, 51 and 71
only run the assembler, not the correction algorithm (for speed)
read 1 and read 2 of the MiSeq data
the nanopore data
put the output in folder “hybrid_assembly”
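For anyone scripting this step, the parameter list above can be assembled programmatically. The sketch below only builds the argument list from the flags already shown in the command and checks the k-mer rule quoted in the help text (odd and less than 128). The function name `spades_hybrid_command` and its defaults are hypothetical, and actually running the command assumes spades.py is on your PATH.

```python
import shlex

def spades_hybrid_command(read1, read2, nanopore, outdir,
                          threads=4, memory_gb=32, kmers=(31, 51, 71)):
    """Build the argument list for the hybrid SPAdes run described above."""
    for k in kmers:
        # The help text requires k-mer sizes to be odd and less than 128.
        if k % 2 == 0 or k >= 128:
            raise ValueError(f"invalid k-mer size: {k}")
    return [
        "spades.py",
        "-t", str(threads),                      # number of threads
        "-m", str(memory_gb),                    # RAM limit in Gb
        "-k", ",".join(str(k) for k in kmers),   # k-mer sizes
        "--only-assembler",                      # skip read error correction
        "-1", read1, "-2", read2,                # paired Illumina reads
        "--nanopore", nanopore,                  # long reads
        "-o", outdir,                            # output directory
    ]

cmd = spades_hybrid_command("miseq.1.fastq", "miseq.2.fastq",
                            "minion.fastq", "hybrid_assembly")
print(shlex.join(cmd))
# To launch it for real: import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a single string avoids shell-quoting problems if a file name ever contains spaces.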

COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly

Jit — Wed, 06 Dec 2017 02:08:14 -0600

COPE (Connecting Overlapped Pair-End reads) is an efficient tool that connects overlapping pair-end reads using k-mer frequencies. The authors evaluated the tool on 30× simulated pair-end reads from Arabidopsis thaliana with 1% base error. COPE connected over 99% of reads with 98.8% accuracy, which is, respectively, 10 and 2% higher than the recently published tool FLASH. When COPE is applied to real reads for genome assembly, the resulting contigs are found to have fewer errors and give a 14-fold improvement in the N50 measurement when compared with the contigs produced using unconnected reads.
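To make "connecting overlapping pair-end reads" concrete, here is a deliberately naive sketch. It is not COPE's algorithm: it ignores k-mer frequencies and simply picks the longest suffix/prefix overlap within a mismatch budget. The names `connect_pair`, `min_overlap` and `max_mismatch_rate`, and the example sequence, are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))

def connect_pair(read1, read2, min_overlap=5, max_mismatch_rate=0.1):
    """Merge read1 with the reverse complement of read2.

    Tries the longest overlap first and returns the merged sequence,
    or None when no overlap of at least min_overlap bases is acceptable.
    """
    mate = reverse_complement(read2)
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        tail = read1[-olen:]   # end of the forward read
        head = mate[:olen]     # start of the reoriented mate
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_rate * olen:
            return read1 + mate[olen:]
    return None

# A made-up 20 bp fragment sequenced from both ends with 14 bp reads.
fragment = "ACGTTGCAAGGCTTACCGGA"
read1 = fragment[:14]
read2 = reverse_complement(fragment[6:])
merged = connect_pair(read1, read2)
print(merged)
```

A purely overlap-based rule like this can be fooled by repeats and sequencing errors, which is the kind of ambiguity that frequency information is meant to help resolve.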

Address of the bookmark: ftp://ftp.genomics.org.cn/pub/cope

Tools for bacterial whole genome annotation

Radha Agarkar — Sat, 16 Dec 2017 17:37:47 -0600

RAST – Web tool (upload contigs), uses the subsystems in the SEED database and provides detailed annotation and pathway analysis. Takes several hours per genome but I think this is the best way to get a high quality annotation (if you have only a few genomes to annotate).

Prokka – Standalone command line tool, takes just a few minutes per genome. This is the best way to get good quality annotation in a flash, which is particularly useful if you have loads of genomes or need to annotate a pangenome or metagenome. Note however that the quality of functional information is not as good as RAST, and you will need several extra steps if you want to do functional profiling and pathway analysis of your genome(s)… which is in-built in RAST.

NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.

PGAP: NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology based methods. The first version of NCBI Prokaryotic Genome Automatic Annotation Pipeline (PGAAP; see Pubmed Article) developed in 2005 has been replaced with an upgraded version that is capable of processing a larger data volume. NCBI's annotation pipeline depends on several internal databases and is not currently available for download or use outside of the NCBI environment.

BEACON (automated tool for Bacterial GEnome Annotation ComparisON), a fast tool for an automated and a systematic comparison of different annotations of single genomes. The extended annotation assigns putative functions to many genes with unknown functions. BEACON is available under GNU General Public License version 3.0 and is accessible at: http://www.cbrc.kaust.edu.sa/BEACON/.

BlastKOALA: Assigns K numbers to the user's sequence data by BLAST searches against a nonredundant set of KEGG GENES. KOALA (KEGG Orthology And Links Annotation) is KEGG's internal annotation tool for K number assignment of KEGG GENES using SSEARCH computation. Annotate Sequence in KEGG Mapper and Pathogen Checker in KEGG Pathogen are special interfaces to this server and can be executed in an interactive mode. BlastKOALA is suitable for annotating fully sequenced genomes.

PAGIT: Provides a toolkit for improving the quality of genome assemblies created by assembly software. PAGIT combines four tools: (i) ABACAS, which classifies and orientates contigs and estimates the sizes of gaps between them; (ii) IMAGE, which uses paired-end reads to extend contigs and close gaps within the scaffolds; (iii) ICORN, for identifying and correcting small errors in consensus sequences; and (iv) RATT, to help with annotation. The software was mainly created to analyze parasite genomes of up to about 300 Mb.

MAKER: A portable and easily configurable genome annotation pipeline. MAKER allows smaller eukaryotic and prokaryotic genome projects to independently annotate their genomes and to create genome databases. It identifies repeats, aligns ESTs and proteins to a genome, produces ab-initio gene predictions and automatically synthesizes these data into gene annotations having evidence-based quality values. MAKER's inputs are minimal and its outputs can be directly loaded into a Generic Model Organism Database (GMOD). They can also be viewed in the Apollo genome browser; this feature of MAKER provides an easy means to annotate, view and edit individual contigs and BACs without the overhead of a database. MAKER is available for download and can be tested online via the MAKER Web Annotation Service (MWAS).

MyPro is a software pipeline for high-quality prokaryotic genome assembly and annotation. It was validated on 18 oral streptococcal strains to produce submission-ready, annotated draft genomes. MyPro installed as a virtual machine and supported by updated databases will enable biologists to perform quality prokaryotic genome assembly and annotation with ease.

GIGGLE: a search engine for large-scale integrated genome analysis

Jit — Wed, 10 Jan 2018 03:10:45 -0600

GIGGLE is a genomics search engine that identifies and ranks the significance of genomic loci shared between query features and thousands of genome interval files. GIGGLE (https://github.com/ryanlayer/giggle) scales to billions of intervals and is over three orders of magnitude faster than existing methods. Its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation.
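The core question (how many intervals in each file overlap my query intervals?) can be shown with a small sketch. This is not GIGGLE's index or its significance statistics: it is a plain sorted-list lookup with `bisect`, it assumes a single chromosome and half-open intervals, and the names `build_index`, `count_overlaps` and `rank_files` are hypothetical.

```python
from bisect import bisect_left, bisect_right

def build_index(files):
    """For each file, keep sorted start and end coordinates separately."""
    return {
        name: (sorted(s for s, _ in ivs), sorted(e for _, e in ivs))
        for name, ivs in files.items()
    }

def count_overlaps(starts, ends, q_start, q_end):
    """Count half-open intervals [s, e) that overlap [q_start, q_end).

    Intervals that start before q_end, minus those that already ended
    at or before q_start, are exactly the overlapping ones.
    """
    started = bisect_left(starts, q_end)
    ended = bisect_right(ends, q_start)
    return started - ended

def rank_files(index, queries):
    """Rank files by total number of overlaps with the query intervals."""
    totals = {
        name: sum(count_overlaps(starts, ends, qs, qe) for qs, qe in queries)
        for name, (starts, ends) in index.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Made-up interval files and query features.
files = {
    "fileA": [(10, 20), (30, 40), (50, 60)],
    "fileB": [(100, 200)],
}
index = build_index(files)
ranking = rank_files(index, [(15, 35), (55, 58)])
print(ranking)
```

Raw overlap counts favor large files, which is why a real search engine ranks by a significance measure rather than by counts alone.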

https://www.nature.com/articles/nmeth.4556

Address of the bookmark: https://github.com/ryanlayer/giggle

MUMmer4: A fast and versatile genome alignment system

Jit — Sat, 03 Feb 2018 04:59:17 -0600

MUMmer4 is a substantially improved version of MUMmer. It addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and it offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. The authors show that, as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes.
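A quick arithmetic check helps put the index widths in perspective. The snippet below only computes how many positions an index of a given bit width can address. The quoted 141 Tbp figure is numerically close to 2^47 bases; the text above does not say why that is half of 2^48, so treat the match as an observation rather than a stated derivation. The helper name `addressable_positions` is made up.

```python
def addressable_positions(bits):
    """Number of distinct positions an index of the given bit width can address."""
    return 2 ** bits

for bits in (32, 47, 48):
    n = addressable_positions(bits)
    # 1 Tbp = 1e12 base pairs
    print(f"{bits}-bit: {n:,} positions (about {n / 1e12:.3f} Tbp)")
```

For scale, 2^32 is roughly 4.3 billion positions, only slightly more than a single human-sized genome of about 3 Gbp.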

Address of the bookmark: https://mummer4.github.io/