BOL: Related items

LRSDAY: Long-read Sequencing Data Analysis for Yeasts

Poonam Mahapatra — Mon, 26 Aug 2019 18:07:33 -0500

Long-read sequencing technologies have become increasingly popular in genome projects due to their strengths in resolving complex genomic regions. As a leading model organism with small genome size and great biotechnological importance, the budding yeast, Saccharomyces cerevisiae, has many isolates currently being sequenced with long reads.

Address of the bookmark: https://github.com/yjx1217/LRSDAY

Comparative Genomics Data Set Including 240 Mammals Released !

Jit — Thu, 19 Nov 2020 06:45:39 -0600

The genomes of 130 mammals were sequenced by a large international consortium, and the data were analyzed together with 110 existing genomes to allow scientists to identify the important positions in the DNA. This report, published in Nature today, will help advance research on human disease mutations and inform how best to protect endangered species.

In addition to the knowledge of the human genome, all these genomes, widely sampled across mammals, can be used to research how particular organisms respond to different conditions. Some otters, for example, have a thick, water-resistant coat, and some rodents, but not all, have adapted to hibernation. These animal traits will help us to understand human traits, such as metabolic diseases.

With climate change and more animal ecosystems being threatened by human activity, the protection of endangered species is becoming increasingly important. Scientists have historically studied several individuals in various populations of a species to understand the genetic variation that occurs in that species. This is important for understanding how particular species can be protected. In this study, animals on the Red List of Endangered Species of the International Union for Conservation of Nature had fewer differences in their genomes, which is consistent with their endangered status.

Ref @ A comparative genomics multitool for scientific discovery and conservation https://www.nature.com/articles/s41586-020-2876-6

Data at http://zoonomiaproject.org/

AMR Database !

LEGE — Tue, 04 Jun 2024 13:37:21 -0500

ARG-ANNOT. PMID: 24145532
CARD. PMID: 23650175
MEGARes. PMID: 27899569
NCBI BioProject: PRJNA313047
plasmidfinder. PMID: 24777092
resfinder. PMID: 22782487
VFDB. PMID: 26578559
SRST2's version of ARG-ANNOT. PMID: 25422674
VirulenceFinder. PMID: 24574290

Address of the bookmark: https://github.com/sanger-pathogens/ariba/wiki/Task%3A-getref

What is Data Science? — A Bioinformatics Perspective

Abhi — Mon, 16 Jun 2025 01:44:34 -0500

In today’s era of big biology, we’re generating more data than ever before—genomes, transcriptomes, proteomes, metabolomes, microbiomes… you name it. But raw biological data doesn’t speak for itself. Making sense of it requires more than traditional biology. This is where data science steps in.

So, What Is Data Science?
At its core, data science is the interdisciplinary field that extracts knowledge and insights from data using programming, statistics, and domain expertise. In bioinformatics, data science enables us to turn gigabytes of sequence data into biological meaning.

Imagine trying to understand gene regulation in cancer by analyzing thousands of RNA-seq samples, or predicting antibiotic resistance from bacterial genomes—these challenges are not solvable through wet lab experiments alone. They require data-driven thinking.

Data Science Meets Bioinformatics
Bioinformatics is inherently a data science domain. From genomics to systems biology, every field in modern biology relies on data science techniques to:

Clean and process massive datasets

Discover patterns in high-dimensional data

Build predictive models (e.g., for disease classification)

Visualize complex biological networks and trends

Integrate diverse data types (e.g., transcriptomic + epigenomic data)

The Bioinformatics Toolkit
Here’s what data science typically looks like in bioinformatics:

Task | Data Science Role
Sequence alignment | Efficient algorithms, indexing, parallel processing
Gene expression analysis | Statistical modeling (e.g., DESeq2, limma)
Variant calling | Data filtering, probabilistic models
Clustering of cells in single-cell data | Unsupervised learning
Protein structure prediction | Deep learning models (e.g., AlphaFold)
Metagenomics | Data integration, classification, dimensionality reduction

Common tools include Python, R, Bioconductor, scikit-learn, Pandas, Seurat, and TensorFlow—often working together in reproducible workflows.

It's Not Just About Coding
A common misconception is that bioinformatics is just programming or scripting. But being a data scientist in bioinformatics also means:

Understanding experimental design

Asking biologically meaningful questions

Choosing the right statistical or machine learning models

Communicating findings effectively (e.g., plots, dashboards, papers)

In other words, data science in bioinformatics is where biology, statistics, and computer science converge.

Why It Matters
The real power of data science in bioinformatics is its ability to scale discovery.

Instead of studying one gene, we can study thousands.

Instead of analyzing one species, we can explore entire ecosystems.

Instead of waiting months for lab results, we can generate hypotheses in days.

From personalized medicine and cancer diagnostics to agricultural genomics and pandemic surveillance, data science is at the heart of the bioinformatics revolution.

Final Thoughts
If you’re a biologist who’s curious about code, or a data enthusiast fascinated by life sciences, bioinformatics is your playground—and data science is your toolkit.

In bioinformatics, data science isn’t just useful. It’s essential.

Tallymer: method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

Jit — Thu, 15 Feb 2018 10:21:02 -0600

Tallymer is based on enhanced suffix arrays. This gives a much larger flexibility concerning the choice of the k-mer size. Tallymer can process large data sizes of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole-genome shotgun sequences from maize (B73) (total size 10⁹ bp).
Tallymer was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available.
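
To make "k-mer frequencies" concrete, here is a minimal sketch in Perl that counts every overlapping k-mer in a short sequence. It uses a plain hash, not the enhanced suffix arrays Tallymer is built on, so it only illustrates what the resulting frequency table contains; the function name count_kmers is made up for this example.

```perl
use strict;
use warnings;

# Count every overlapping k-mer in a DNA string with a plain hash.
# This is NOT Tallymer's suffix-array method, only an illustration
# of what a k-mer frequency table looks like.
sub count_kmers {
    my ($seq, $k) = @_;
    my %count;
    $seq = uc $seq;
    for my $i (0 .. length($seq) - $k) {
        my $kmer = substr($seq, $i, $k);
        next if $kmer =~ /[^ACGT]/;    # skip k-mers containing N or other symbols
        $count{$kmer}++;
    }
    return \%count;
}

# Print the table for a toy sequence, sorted by k-mer.
my $counts = count_kmers('ACGTACGTAC', 4);
for my $kmer (sort keys %$counts) {
    print "$kmer\t$counts->{$kmer}\n";
}
```

A hash like this grows with the number of distinct k-mers, which is one reason index-based tools are preferred for inputs of several billion bases.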

A manual can be found here.

Address of the bookmark: https://www.zbh.uni-hamburg.de/forschung/arbeitsgruppe-genominformatik/software/tallymer.html

Kevlar: Reference-free variant discovery in large eukaryotic genomes

Jit — Tue, 28 Jan 2020 03:21:53 -0600

Welcome to kevlar, software for predicting de novo genetic variants without mapping reads to a reference genome! kevlar's k-mer abundance based method calls single nucleotide variants (SNVs), multinucleotide variants (MNVs), insertion/deletion variants (indels), and structural variants (SVs) simultaneously with a single simple model.

More at https://kevlar.readthedocs.io/en/latest/

https://www.cell.com/iscience/pdf/S2589-0042(19)30259-7.pdf

Address of the bookmark: https://github.com/kevlar-dev/kevlar

Ruby Language

Jitendra Narayan — Mon, 15 Jul 2013 01:34:26 -0500

Ruby was created by Yukihiro Matsumoto, who wished to create a new language that balanced functional programming with imperative programming.

Ruby is a dynamic, reflective, general purpose object-oriented programming language that combines syntax inspired by Perl with Smalltalk-like features. Ruby originated in Japan during the mid-1990s and was initially developed and designed by Yukihiro "Matz" Matsumoto. It was influenced primarily by Perl, Smalltalk, Eiffel, and Lisp.

Ruby supports multiple programming paradigms, including functional, object-oriented, imperative and reflective. It also has a dynamic type system and automatic memory management; it is therefore similar in varying respects to Python, Perl, Lisp, Dylan, Pike, and CLU.

The standard 1.8.7 implementation is written in C, as a single-pass interpreted language. There is currently no specification of the Ruby language, so the original implementation is considered to be the de facto reference. As of 2010, there are a number of complete or upcoming alternative implementations of the Ruby language, including YARV, JRuby, Rubinius, IronRuby, MacRuby and HotRuby, each of which takes a different approach, with IronRuby, JRuby and MacRuby providing just-in-time compilation and MacRuby also providing ahead-of-time compilation. The official 1.9 branch uses YARV, as will 2.0 (development), and will eventually supersede the slower Ruby MRI.

Ruby Quick Reference
http://www.zenspider.com/Languages/Ruby/QuickRef.html

Ruby Annotation
http://www.w3.org/TR/ruby/

Ruby in Linux Journals
http://www.linuxjournal.com/article/5915

Ruby Documentation: Programming Ruby
http://ruby-doc.org/docs/ProgrammingRuby/

The Top 10 Reasons The Ruby Programming Language Sucks

http://www.slideshare.net/vishnu/the-top-10-reasons-the-ruby-programming-language-sucks

Ruby: A Programmer's Best Friend
http://www.ruby-lang.org/en/

For Ruby Beginners
http://www.squidoo.com/ruby-programming-beginner

Ruby Programming
http://en.wikibooks.org/wiki/Ruby_Programming

Ruby CookBook
http://en.wikibooks.org/wiki/Cookbook:Table_of_Contents

Ruby Programming Challenge for Newbies
http://rubylearning.com/blog/ruby-programming-challenge-faq/

Common "issues" faced by Ruby Newbies by Chris Strom
http://japhr.blogspot.com/2009/10/newbie-feedback.html

Books
http://www.sapphiresteel.com/The-Book-Of-Ruby

Free Online Ruby Programming along with many Ruby newbies here
http://rubylearning.org/class/

Which math/statistics programming language/application do you most frequently use in bioinformatics?

John Parker — Thu, 04 Sep 2014 17:46:41 -0500

I'm doing a bit more statistical analysis on some bioinformatics things lately, and I'm curious if there are any programming languages that are particularly good for this NGS computation. What suggestions do you guys have? Are there any languages that have exceptionally good libraries?

Many-Core Engine (MCE) for Perl example

Jit — Tue, 31 Jan 2017 05:37:50 -0600

MCE spawns a pool of workers and therefore does not fork a new process per each element of data. Instead, MCE follows a bank queuing model. Imagine the line being the data and bank-tellers the parallel workers. MCE enhances that model by adding the ability to chunk the next n elements from the input stream to the next available worker.

CORE MODULES

Three modules make up the core engine for MCE.

MCE::Core: Provides the Core API for Many-Core Engine. The various MCE options are described here.
MCE::Signal: Temporary directory creation, cleanup, and signal handling.
MCE::Util: Utility functions for Many-Core Engine.

MCE EXTRAS

There are 4 add-on modules for use with MCE.

MCE::Candy: Provides a collection of sugar methods and output iterators for preserving output order.
MCE::Mutex: Provides a simple semaphore implementation supporting threads and processes.
MCE::Queue: Provides a hybrid queuing implementation for MCE supporting normal queues and priority queues from a single module. MCE::Queue exchanges data via the core engine to enable queuing to work for both children (spawned from fork) and threads.
MCE::Relay: Enables workers to receive and pass on information orderly with zero involvement by the manager process while running.

MCE MODELS

The models take Many-Core Engine to a new level for ease of use. Two options (chunk_size and max_workers) are configured automatically as well as spawning and shutdown.

MCE::Loop: Provides a parallel loop utilizing MCE for building creative loops.
MCE::Flow: A parallel flow model for building creative applications. This makes use of user_tasks in MCE. The author has full control when utilizing this model. MCE::Flow is similar to MCE::Loop, but allows for multiple code blocks to run in parallel with a slight change to syntax.
MCE::Grep: Provides a parallel grep implementation similar to the native grep function.
MCE::Map: Provides a parallel map model similar to the native map function.
MCE::Step: Provides a parallel step implementation utilizing MCE::Queue between user tasks. MCE::Step is a spin off from MCE::Flow with a touch of MCE::Stream. This model, introduced in 1.506, allows one to pass data from one sub-task into the next transparently.
MCE::Stream: Provides an efficient parallel implementation for chaining multiple maps and greps together through user_tasks and MCE::Queue. Like with MCE::Flow, MCE::Stream can run multiple code blocks in parallel with a slight change to syntax from MCE::Map and MCE::Grep.

MISCELLANEOUS

Miscellaneous additions included with the distribution.

MCE::Examples: Describes various demonstrations for MCE including a Monte Carlo simulation.
MCE::Subs: Exports functions mapped directly to MCE methods; e.g. mce_wid. The module allows 3 options; :manager, :worker, and :getter.

REQUIREMENTS

Perl 5.8.0 or later. PDL::IO::Storable is required in scripts running PDL.

SOURCE AND FURTHER READING

The source, cookbook, and examples are hosted at GitHub.

Parallel Processing with Perl !

Rahul Nayak — Sat, 25 Aug 2018 11:32:40 -0500

Here is a small tutorial on how to make the best use of multiple processors for bioinformatics analysis. One of the best ways is to use Perl threads and forks. It is important to know how threads and forks work before implementing them, so getting familiar with them before reading this tutorial would be really useful.

Many times in bioinformatics we need to deal with huge datasets that are more than 100 GB in size. The traditional way to analyze a file is with a while loop:

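open(my $fh, '<', 'your_file.txt') or die "Can't open file: $!";
while (my $line = <$fh>) {
    # do something with $line
}
close $fh;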

This is very slow (since we are using only one processor), and if we have 500 million lines in the dataset it can take more than a day to iterate through the whole dataset. So how do we make the best use of all our processors and get the work done quickly?

Here is a very simple and efficient technique with Perl which I have been using. I am more inclined towards using Perl fork than Perl threads.

One of the oldest ways to fork is:

my @childs;
my $fork = fork();
die "Couldn't fork: $!" unless defined $fork;   # fork returns undef on failure
if ($fork) {
    push(@childs, $fork);   # parent: remember the child's PID
}
else {
    # child: your code here
    exit(0);
}
## wait for the child processes to finish
foreach (@childs) {
    my $tmp = waitpid($_, 0);
}

What a fork does is create a child process that takes a copy of the variables and code with it and runs separately (detached from the parent process), so the operating system can schedule it on a separate processor. That's it! One big disadvantage of forking is that it is very difficult to share variables among the different processes. I will show you how to do it easily, but it still has its own drawbacks.
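
Here is a minimal sketch of that last point, using only core Perl. The child changes a variable, the parent never sees the change, and the one value we care about is sent back through a pipe. The helper name run_in_child is hypothetical, and the child leaves with POSIX::_exit instead of plain exit so that it skips END blocks and buffers copied from the parent.

```perl
use strict;
use warnings;
use POSIX ();

# Run $code in a forked child and return whatever it wrote to the pipe.
# run_in_child is a made-up helper name, not a library function.
sub run_in_child {
    my ($code) = @_;
    pipe(my $reader, my $writer) or die "Couldn't create pipe: $!";
    my $pid = fork();
    die "Couldn't fork: $!" unless defined $pid;
    if ($pid == 0) {                  # child
        close $reader;
        print {$writer} $code->();    # send the result to the parent
        close $writer;                # flushes the pipe
        POSIX::_exit(0);              # leave without running END blocks
    }
    close $writer;                    # parent keeps only the read end
    my $result = do { local $/; <$reader> };   # slurp everything the child sent
    close $reader;
    waitpid($pid, 0);
    return $result;
}

my $counter = 10;
my $from_child = run_in_child(sub { $counter += 5; return $counter });
print "child computed: $from_child\n";
print "parent still has: $counter\n";   # the child's change did not come back
```

The drawback is that everything has to be turned into text (or otherwise serialized) to cross the pipe, which gets clumsy for large or nested data.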

OK, now if you really do not want to call fork yourself in your code, that's OK too. There are many useful modules which do it for you very efficiently. One really useful module is Parallel::ForkManager. You can use Parallel::ForkManager to manage the number of forks you want to generate (the number of processors you want to use).
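
To see what "managing the number of forks" involves, here is a rough core-only sketch of the idea. It is not how Parallel::ForkManager is implemented, and the helper name run_limited is made up; it just keeps at most $max children alive and starts the next one when a slot frees up. Each child reports success or failure through its exit code.

```perl
use strict;
use warnings;
use POSIX ();

# Process @$items with at most $max children alive at once.
# Returns a hash ref of item => child exit status.
sub run_limited {
    my ($max, $items, $code) = @_;
    my (%item_of, %status);            # pid => item, item => exit status
    my $reap = sub {
        my $pid  = waitpid(-1, 0);     # block until any child exits
        my $item = delete $item_of{$pid};
        $status{$item} = $? >> 8;      # exit code of that child
    };
    for my $item (@$items) {
        $reap->() while keys(%item_of) >= $max;   # pool is full: wait for a slot
        my $pid = fork();
        die "Couldn't fork: $!" unless defined $pid;
        if ($pid == 0) {               # child: run the job, report via exit code
            POSIX::_exit( $code->($item) ? 0 : 1 );
        }
        $item_of{$pid} = $item;        # parent: remember which item this child has
    }
    $reap->() while %item_of;          # wait for the remaining children
    return \%status;
}

# Toy job: a sequence "passes" if it contains no N.
my @dna = qw(ACGT GGCC TTAA ACGN);
my $status = run_limited(2, \@dna, sub { $_[0] !~ /N/ });
print "$_ => exit $status->{$_}\n" for @dna;
```

The module takes care of this bookkeeping for you, which is why the usage that follows is so short.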
Simple usage:
use Parallel::ForkManager;
my $max_processors = 8;
my $fork = Parallel::ForkManager->new($max_processors);
foreach my $seq (@dna) {
    $fork->start and next; # do the fork; the parent moves on to the next element
    # your code here, working on $seq
    $fork->finish; # do the exit in the child process
}
$fork->wait_all_children;

So you will be generating up to 8 forks at a time, each doing the same thing for one element of the array. When one child finishes, Parallel::ForkManager starts a new one, and thus you will be using all your processors to analyze the data. Now, suppose you have generated 8 child processes and want them all to write their data to one file. You need to lock the file to do this, because otherwise the buffered output of the different processes gets mixed up. You can lock the file using the flock function.

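use Fcntl qw(:flock); # exports LOCK_EX and LOCK_UN
open(my $QUAL, '>>', 'myfile.txt') or die "can't open file: $!"; # open for appending
flock($QUAL, LOCK_EX) or die "can't lock file: $!";
print $QUAL $output;
flock($QUAL, LOCK_UN) or die "can't unlock file: $!";
close $QUAL;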

I would not suggest using flock when dealing with multiple processes, because it decreases processing efficiency (each child process must wait for the lock to be released by another child process). Instead, I would suggest that each fork write to a separate file and that you simply concatenate the files after processing.
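
A minimal sketch of that suggestion, using plain fork from core Perl so it runs without extra modules. The helper name process_chunks and the ".partN" file naming are made up for this example. Each child writes its own part file, and the parent concatenates the parts in order once all children are done.

```perl
use strict;
use warnings;
use POSIX ();

# Each child writes to its own part file; the parent concatenates them.
# $chunks is an array ref of array refs, $code transforms one element.
sub process_chunks {
    my ($chunks, $final_file, $code) = @_;
    my @pids;
    for my $i (0 .. $#$chunks) {
        my $pid = fork();
        die "Couldn't fork: $!" unless defined $pid;
        if ($pid) { push @pids, $pid; next; }       # parent: go fork the next child
        open(my $out, '>', "$final_file.part$i") or die "Can't write part $i: $!";
        print {$out} $code->($_), "\n" for @{ $chunks->[$i] };
        close $out;
        POSIX::_exit(0);                            # child ends here
    }
    waitpid($_, 0) for @pids;                       # wait for every child
    open(my $final, '>', $final_file) or die "Can't write $final_file: $!";
    for my $i (0 .. $#$chunks) {                    # concatenate in chunk order
        open(my $in, '<', "$final_file.part$i") or die "Can't read part $i: $!";
        print {$final} $_ while <$in>;
        close $in;
        unlink "$final_file.part$i";                # remove the part file
    }
    close $final;
}

# Toy run: three chunks, each handled by its own child.
my @chunks = ([qw(acgt ttga)], [qw(ggcc)], [qw(atat cgcg)]);
process_chunks(\@chunks, 'final_output.txt', sub { uc $_[0] });
```

No locking is needed because no two processes ever write to the same file, and the output order is decided by the parent, not by whichever child finishes first.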

Putting it all together, if you have 100 GB of data you can do this:

Step 1: split the dataset equally according to the number of processors you have. This may take a few hours (about 2-3 hrs for a 100 GB file).
You can use the Unix "split" command for this.
For example:
my $number_split = int($number_of_entries_in_your_dataset / $max_processors); # split -l counts lines, so this assumes one entry per line
my $split_Files = `split -l $number_split "your_file.fasta" "file_name"`;
Step 2: open your directory containing your split files and start Parallel::ForkManager.
For example:
opendir(DIRECTORY, $split_files_directory) or die $!; ### open the directory
my $super_fork = Parallel::ForkManager->new($max_processors);
while (my $file = readdir(DIRECTORY)) { ### read the directory
    if ($file =~ /^\./) { next; } ### skip ".", ".." and hidden files
    print $file, "\n";
    ########## Start fork ##########
    my $pid = $super_fork->start and next;
    ### whatever you want to do with the split file:
    ### analyze my piece of $file
    ######### end fork ###############
    $super_fork->finish;
}
$super_fork->wait_all_children;
closedir(DIRECTORY);
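
One detail in step 1 is worth checking: "split -l" counts lines, not entries, and rounding down with int() can leave a small extra file. Here is a small sketch of the arithmetic. The helper name lines_per_chunk is made up, and the number of lines per record is an assumption about your input (2 for single-line FASTA, 4 for FASTQ).

```perl
use strict;
use warnings;

# Lines per split file so that at most $n_procs files are produced and
# no record is cut in half. Assumes every record has exactly
# $lines_per_record lines.
sub lines_per_chunk {
    my ($n_lines, $n_procs, $lines_per_record) = @_;
    my $records   = $n_lines / $lines_per_record;
    my $per_chunk = int(($records + $n_procs - 1) / $n_procs);   # ceiling division
    return $per_chunk * $lines_per_record;
}

# 20 lines of single-line FASTA = 10 records, shared among 3 processors.
print lines_per_chunk(20, 3, 2), "\n";
```

For that toy case the result is 8 lines per file, so split produces files of 8, 8 and 4 lines. Dividing the 10 entries by 3 with int() would give 3, and "split -l 3" would cut two-line records in half.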

So basically each processor will be busy with its piece of data (a split file), and thus you have created 8 processes at one time which run without interfering with each other. Again, I do not suggest writing output from each child process to one file (for the reasons above). Write the output from each fork to a separate file and finally concatenate them. That's it, you have just increased your program speed by up to 8 times! Isn't it easy?

Note:
You may worry about concatenating the output each child generates, since it does take some time (remember, 100 GB). I think you can use the MySQL LOAD DATA LOCAL INFILE command to load all the files into a single table (this should take about 3 hrs for a 100 GB dataset) and then export the whole table into one file. This should be faster than just concatenating them with the "cat" command. (Correct me if I am wrong.)

Or a much simpler way is to use pipes:

cat output_dir/* | my_pipe or my_pipe <(file1) final_file;

That's it, guys! Enjoy programming and please do comment. I am not a computer scientist, so forgive me for any mistakes, and if you find any please report them. Thank you.