BOL: Related items

Bioinformatics tools and software

Jit — Tue, 05 Jul 2016 10:02:26 -0500

USEARCH >
Extreme high-throughput sequence analysis. Orders of magnitude faster than BLAST. MUSCLE >
Multiple sequence alignment. Faster and more accurate than CLUSTALW.

UPARSE >
OTU clustering for 16S and other marker genes. Highly accurate OTU sequences and improved diversity measures. UCHIME >
Chimeric sequence detection. PILER >
De novo genome repeat finder. PILER-CR >
Detection of CRISPR repeats in bacterial genomes. QSCORE >
Compare two multiple alignments for benchmarking. PALS >
Whole-genome alignment. PREFAB >
Protein Reference Alignment Database. MSA benchmark collection >
Selected multiple alignment benchmarks in a standardized FASTA format.

Address of the bookmark: http://drive5.com/software.html

Useful Bioinformatics Tools

Poonam Mahapatra — Mon, 29 Aug 2016 04:08:12 -0500

Collections of few handy tools for bioinformatician

http://molbiol-tools.ca/Convert.htm

Address of the bookmark: http://molbiol-tools.ca/Convert.htm

Statistical for biological research

Neel — Thu, 03 Nov 2016 04:59:48 -0500

There is no disputing the importance of statistical analysis in biological research, but too often it is considered only after an experiment is completed, when it may be too late.

This collection highlights important statistical issues that biologists should be aware of and provides practical advice to help them improve the rigor of their work.

Nature Methods' Points of Significance column on statistics explains many key statistical and experimental design concepts. Other resources include an online plotting tool and links to statistics guides from other publishers.

Address of the bookmark: http://www.nature.com/collections/qghhqm

Harvest

Jit — Tue, 31 Jan 2017 10:57:56 -0600

Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes, including variant calls, recombination detection, and phylogenetic trees.

Tools

Parsnp - Core-genome alignment and analysis
Gingr - Interactive visualization of alignments, trees and variants
HarvestTools - Archiving and postprocessing

Citation

Treangen TJ, Ondov BD, Koren S, Phillippy AM. The Harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biology, 15 (11), 1-15 [PDF]

Address of the bookmark: http://harvest.readthedocs.io/en/latest/index.html

Genomics for Bioinformatician

Jitendra Narayan — Sat, 20 Jul 2013 07:03:00 -0500

Genomics is the study of the genomes of organisms. The field includes intensive efforts to determine the entire DNA sequence of organisms and fine-scale genetic mapping efforts. The field also includes studies of intragenomic phenomena such as heterosis, epistasis, pleiotropy and other interactions between loci and alleles within the genome. In contrast, the investigation of the roles and functions of single genes is a primary focus of molecular biology or genetics and is a common topic of modern medical and biological research. Research of single genes does not fall into the definition of genomics unless the aim of this genetic, pathway, and functional information analysis is to elucidate its effect on, place in, and response to the entire genome's networks.

Genomics was established by Fred Sanger when he first sequenced the complete genomes of a virus and a mitochondrion. His group established techniques of sequencing, genome mapping, data storage, and bioinformatic analyses in the 1970-1980s. A major branch of genomics is still concerned with sequencing the genomes of various organisms, but the knowledge of full genomes has created the possibility for the field of functional genomics, mainly concerned with patterns of gene expression during various conditions. The most important tools here are microarrays and bioinformatics. Study of the full set of proteins in a cell type or tissue, and the changes during various conditions, is called proteomics. A related concept is materiomics, which is defined as the study of the material properties of biological materials (e.g. hierarchical protein structures and materials, mineralized biological tissues, etc.) and their effect on the macroscopic function and failure in their biological context, linking processes, structure and properties at multiple scales through a materials science approach. The actual term 'genomics' is thought to have been coined by Dr. Tom Roderick, a geneticist at the Jackson Laboratory (Bar Harbor, ME) over beer at a meeting held in Maryland on the mapping of the human genome in 1986.

The outcome of almost two years of intense discussions with literally hundreds of scientists and members of the public, has three major areas of focus: Genomics to Biology, Genomics to Health, and Genomics to Society.

Genomics to Biology:
The human genome sequence provides foundational information that now will allow development of a comprehensive catalog of all of the genome's components, determination of the function of all human genes, and deciphering of how genes and proteins work together in pathways and networks.

Genomics to Health:
Completion of the human genome sequence offers a unique opportunity to understand the role of genetic factors in health and disease, and to apply that understanding rapidly to prevention, diagnosis, and treatment. This opportunity will be realized through such genomics-based approaches as identification of genes and pathways and determining how they interact with environmental factors in health and disease, more precise prediction of disease susceptibility and drug response, early detection of illness, and development of entirely new therapeutic approaches.

Genomics to Society:
Just as the HGP has spawned new areas of research in basic biology and in health, it has created new opportunities in exploring the ethical, legal, and social implications (ELSI) of such work. These include defining policy options regarding the use of genomic information in both medical and non-medical settings and analysis of the impact of genomics on such concepts as race, ethnicity, kinship, individual and group identity, health, disease, and "normality" for traits and behaviors.

This vision for the future of genomics is not just about the NHGRI. It encompasses the whole field of genomics, including the work of all the other Institutes and Centers at the NIH and of a number of other federal agencies. All of the NIH Institutes are already taking full advantage of the sequence and will apply its data to the better understanding of both rare and common diseases, almost all of which have a genetic component. A recent example of the way that the HGP and the knowledge and new technologies it has spawned are already facilitating science is the extremely rapid sequencing by groups in Canada and at the Centers for Disease Control and Prevention (CDC) in Atlanta of the genome of the virus that causes Severe Acute Respiratory Syndrome (SARS). The sequencing of the SARS virus genome provides insight into this new and deadly disease at a speed never before possible in science. In turn, this should lead to the rapid development of diagnostic tests and, in time, vaccines and effective treatments.

Links for the addition material available on Net

Genomes and genomics:

Bioinformatics and Genomics:

Structural genomics tutorial:

Comparative Genomics Tutorial:

GENOME TUTORIAL:

Tools and resources for identifying protein families, domains and motifs

Bioinformatics Tools
Tips, Tutorials, and Terminology for Using Selected Resources in Genome Database Guide:

A Web-Based Comparative Genomics Tutorial for Investigating Microbial Genomes:

Free Online Tutorials Teach Anyone How to Use Genome Databases:

Circos to create concise, explanatory, unique and print-ready visualizations of your data:

Genomics and Comparative Genomics Learning Module:

Computational Challenges in Comparative Genomics

A Tutorial:

A Comparative Genomics Resource for Grains:

PLAZA: A Comparative Genomics Resource to Study Gene and Genome Evolution in Plants:

VISTA :

Software for Genomics

Artemis Artemis is a free genome viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence, and its six-frame translation.
Chromas It will display and prints chromatogram files from ABI automated DNA sequencers, and Staden SCF files which the analysis programs for ALF, Li-Cor and Visible Genetics OpenGene sequencers can create.
Glimmer A system for finding genes in microbial DNA, especially the genomes of bacteria and archaea.Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DN
Glimmer HMM A fast and accurate gene finder based on a GHMM architecture, developed specifically for eukaryotes. It incorporates splice site models adapted from the GeneSplicer program and uses interpolated Markov models for evaluating the coding regions.
Glimmer M A gene finder derived from Glimmer, but developed specifically for eukaryotes. It is based on a dynamic programming algorithm that considers all combinations of possible exons for inclusion in a gene model and chooses the best of these combinations. The d
MUMmer MUMmer is a system for rapidly aligning entire genomes, whether in complete or draft form.
pDRAW pDRAW32 is being developed as a free time hobby project. It is far from finished, but as it has reached a point where it could be helpful for many labs, it is now available to the scientific community.
Sequin Sequin is a stand-alone software tool developed by the NCBI for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence databases. It is capable of handling simple submissions that contain a single short mRNA sequence, and complex submissio
Staden The Staden Package consists of a series of tools for DNA sequence preparation (pregap4), assembly (gap4), editing (gap4) and DNA/protein sequence analysis (spin).

For more software @ http://bioinformaticsonline.com/bookmarks/view/926/list-of-popular-bioinformatics-softwaretools

Five points for bioinformatics software/tools

Jitendra Narayan — Mon, 05 Aug 2013 04:12:32 -0500

In the bioinformatics sector we mostly spend time on computational analysis of huge amounts of data and try to make sense of it, biologically. But, most of the newbie bioinformaticians are faced with dilemma when they receive biological sequence data for the first time. They mostly found confusing over open source, user friendly GUI, and commercial bioinformatics software. Don’t be surprise this is true and also not an easy task to decide, because analytical step is the most crucial part and believe to be the biggest bottleneck in publishing paper in high impact journals. Through this blog I would like to address the pros and cons of both kind of software/tools and try to assist (Hmmm not really, It looks convince) you to make decision on your software selections.

The most common newbie questions are:

Should I try to use these free open source programs? Why are we not trying GUI software for computational analysis? Should I use commercial bioinformatics programs/software?”

1. Let’s be open

We generally think free and cheap are useless. But this concept is not applicable when we discuss open source software. Mostly, the bioinformatics software is developed by highly competitive biological programmers who believe in open sharing of knowledge. They come under Open Bioinformatics Foundation or O|B|F which is a non-profit, volunteer run organization focused on supporting open source programming in bioinformatics. The best part about open source tools/software is that they’re free to download the source code and read exactly what the program does. If you are so inclined, you can view all of the parts of the program and see the logical flow of the pipeline. In addition, open source makes an excellent learning tool for any beginning bioinformatician. Moreover, you can modify existing open source programs to deal with cutting-edge problems or to customize your pipeline. Apart from your computational and analysis work, most of the reviewer also prefers the open source based results so that they can validate the results if validation required.

2. Code headache

As a bioinformatician you are supposed to know the basics of programming languages, and if you are not good at it, then please learn it as soon as possible because you are not a bio-analyst but biological programmers. The open source programs usually lack dedicated service and support teams (often because they were the product of an overworked doc/postdoc!) so you are responsible for troubleshooting your own errors most of the time. We commonly receive the HELP email to support and assist to setup the pipeline; you can also find this kind of request on any QA forum. I personally believe this coding horror brings the biggest downside of open-source programs; where you need some programming skills in order to implement the program in your pipeline. But, if you are not able to fix the pipeline and modify the open source code according to your requirements them you should re-think on your bioinformatician name tag!!!

3. Dive into the codes

Some of the biologist turn bioinformatician says “if you can do the same thing with commercial software then why to get migraine with weird codes”, well this statement looks to me that guys are keen to learn swimming but still don’t like to get wet. If you are still using paid software and doing your work by customer support and clicking some of the well-designed GUI button then perhaps you are not interested in learning and trying new and challenging bioinformatics works. You are missing the basic flavour of bioinformatics. Let’s dive into the coding world, I am sure your will enjoy it. I recommend your to swim freely in code’s sea, and enjoy the journey; do not merely watch it from the outside.

4. Paid does not mean better

The bioinformatics company which are specializes in bioinformatics solutions develop well designed/packed, user friendly software by using a large number of specialised scientist, programmers and support staff. They also provide good services to accomplice your biological analysis work. This means that if you hit a ‘snag’ with your data, help is likely only a phone call away! These companies price their products competitively against the cost of a dedicated bioinformatician. You may be able to afford the program, but not the additional staff! Additionally, most of the functionality that you need in your analysis is already coded into the program. Need to plot a graph? Just click this button right here. It is that easy. But, as a bioinformatician this is not generally well encouraged approach in biological analysis work, because the software is not available to everyone and your data can’t be validated. Moreover, there is very less chances that anyone will repeat your work or love to do similar kind of research (because not all the labs in the world are rich like yours).

5. Take a caution

In biological analysis work, in which you deal GB/TB of data are having maximum chances of getting errors, so please be careful and always cross check your data before coming to any conclusion. Even an error in two line code can alter your entire analysis and display weird results. Some of the scientist blindly believes on commercial software, which is entirely wrong. Using proprietary tools does not absolve you of the need to actually read and research the type of analysis that you are doing. This is particularly true in the case of genome assembly and annotation.

At the end, I would like to tell only one think that open source solutions allows you to do more cutting edge analysis than the commercial tools. So let’s go for it.

Disclaimer:

This is my personal view. I have nothing to do with any company or open source community. The views expressed on these pages are mine alone and not those of my current/past employers. I do reserve the right to remove comments left by spammers or off-topic comments.

Useful Bioinformatics Analysis Tools !

Neel — Thu, 23 Dec 2021 23:10:02 -0600

CoMeta

Classificier of reads from metagenomic sequencing experiments.

• Kawulok, J., Deorowicz, S., CoMeta: Classification of Metagenomes Using k-mers, PLOS ONE, 2015; 10(4):1–23,

CoMSA

Compressor of multiple sequence alignments of proteins.

• Deorowicz, S., Walczyszyn, J., Debudaj-Grabysz, A., CoMSA: compression of protein multiple sequence alignment files, Bioinformatics, 2019; 35(2):22–234,

DSRC

Compressor of sequencing reads.

• Roguski, L., Deorowicz, S., DSRC 2: Industry-oriented compression of FASTQ files, Bioinformatics, 2014; 30(15):2213–2215,
• Deorowicz, S., Grabowski, Sz., Compression of DNA sequences in FASTQ format, Bioinformatics, 2011; 27(6):860–862,

FAMSA

Multiple sequence alignment designed for huge families of proteins (even containing hundreds of thousands of sequences).

• Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., FAMSA: Fast and accurate multiple sequence alignment of huge protein families, Scientific Reports, 2016; 6(33964):

FaStore

Compressor of FASTQ files.

• Roguski, L., Ochoa, I., Hernaez, M., Deorowicz, S., FaStore - a space-saving solution for raw sequencing data, Bioinformatics, 2018; 34(16):2748–2756,

FQSqueezer

Experimental high-end compressor of FASTQ files.

• Deorowicz, S., FQSqueezer: k-mer-based compression of sequencing data, Scientific Reports, 2020; 10(578):

GDC

Compressor of collections of genome sequences.

• Deorowicz, S., Danek, A., Niemiec, M., GDC 2: Compression of large collections of genomes, Scientific Reports, 2015; 5(11565):1–12,
• Deorowicz, S., Grabowski, Sz., Robust relative compression of genomes with random access, Bioinformatics, 2011; 27(21):2979–2986,

GTC

Genotype databases compressor with support for fast queries.

• Danek, A., Deorowicz, S., GTC: how to maintain huge genotype collections in a compressed form, Bioinformatics, 2018; 34(11):1834–1840,

GTShark

Genotypes compressor.

• Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, Bioinformatics, 2019; 35(22):4791–4793,

KMC

Memory frugal k-mer counter.

•  Kokot, M., Długosz, M., Deorowicz, S., KMC 3: counting and manipulating k -mer statistics, Bioinformatics, 2017; 33(17):2759–2761,
•  Deorowicz, S., Kokot, M., Grabowski, Sz., Debudaj-Grabysz, A., KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2015; 31(10):1569–1576,
•  Deorowicz, S., Debudaj-Grabysz, A., Grabowski, Sz., Disk-based k-mer counting on a PC, BMC Bioinformatics, 2013; 14():Article no. 160,

Kmer-db

Tool for estimation of evolutionary distances in a collection of genomes.

• Deorowicz, S., Gudys, A., Dlugosz, M., Kokot, M., Danek, A., Kmer-db: instant evolutionary distance estimation, Bioinformatics, 2019; 35(1):133–136,

MuGI

Index allowing queries for a collection of multiple genome sequences.

• Danek, A., Deorowicz, S., Grabowski, Sz., Indexes of Large Genome Collections on a PC, PLOS ONE, 2014; 9(10):e109384,

ORCOM

Experimental compressor of sequencing reads.

• Grabowski, Sz., Deorowicz, S., Roguski, L., Disk-based compression of data from genome sequencing, Bioinformatics, 2014; 31(9):1389–1395,

PgSA

Index allowing queries for a collection of sequencing reads.

• Kowalski, T., Grabowski, Sz., Deorowicz, S., Indexing arbitrary-length k-mers in sequencing reads, PLOS ONE, 2015; 10(7):1–16,

QuickProbs

Multiple sequence alignment designed especially for GPU.

• Gudys, A., Deorowicz, S., QuickProbs 2: towards rapid construction of high-quality alignments of large protein families, Scientific Reports, 2017; 7(41553):
• Gudys, A., Deorowicz, S., QuickProbs – A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors, PLOS ONE, 2014; 9(2):e88901,

RECKONER

Read error corrector.

• Maciej Długosz, M., Deorowicz, S., RECKONER: read error corrector based on KMC, Bioinformatics, 2017; 33(7):1086–1089,

TGC

Compressor of collections of genomes given in Variant Call Format (VCF) files.

• Deorowicz, S., Danek, A., Grabowski, Sz., Genome compression: a novel approach for large collections, Bioinformatics, 2013; 29(20):2572–2578,

VCFShark

Compressor of VCF files.

• Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, biorxiv.org, 2020; ():

Whisper

Experimental mapper of whole genome sequencing data.

•  Deorowicz, S., Gudys, A., Whisper 2: indel-sensitive short read mapping, bioRxiv.org, 2019; :
•  Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Whisper: read sorting allows robust robust mapping of DNA sequencing data, Bioinformatics, 2019; 35(12):2043–2050,
•  Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Robust mapping of whole genome sequencing data, Poster at The Biology of Genomes Conference, 2017;

Important Bioinformatics Tools !

BioStar — Tue, 30 Jul 2024 05:03:29 -0500

1. Ktrim: An extra-fast, accurate adapter trimmer for sequencing data. It processes FASTQ files from multiple lanes with minimal mismatching and over-trimming of adapters.

2. BWA MEM: A reliable alignment tool (particularly for mapping ALT contigs and HLA genes, which are not fully addressed in BWA-MEM2).

3. Sambamba markdup: Quickly marks or removes duplicate reads using Picard's criteria.

4. ichorCNA: Estimates the tumor DNA fraction in cell-free DNA from ultra-low-pass whole genome sequencing (0.1x coverage) based on copy number alterations (CNA).

5. Fragle: A deep learning method for quantifying ctDNA levels from cell-free DNA fragmentomic profiles. It detects TF as low as ~1% ctDNA and works with targeted genomic panel sequencing data.

6. AlfredQC: A quality control tool for high-throughput sequencing data. It assesses metrics like read quality scores, GC content, and duplication rates, visualized through detailed plots and summary statistics.

7. Mosdepth: A fast tool for calculating sequencing coverage depth, offering a quicker alternative to samtools/sambamba depth by processing BAM and CRAM files.

8. Bedtools: A versatile toolkit for genomics, enabling operations like intersect, merge, count, and shuffle on genomic intervals across formats such as BAM, BED, GFF/GTF, and VCF.

9. Datamash: A command-line tool for basic numeric, textual, and statistical operations on input data streams. It supports operations such as grouping, sorting, transposing, and performing arithmetic calculations on tabular data.

10. gwf.app: A pragmatic alternative to Snakemake. Developed at Aarhus University, this flexible, generic workflow tool builds and runs large scientific workflows.

10 Books to Kickstart (and Level Up) Your Bioinformatics Journey

Neel — Tue, 12 Aug 2025 03:50:11 -0500

If you’re starting out in bioinformatics or looking to sharpen your computational biology skills, having the right learning resources makes all the difference.
Here’s my curated list of 10 must-read books — from beginner-friendly introductions to advanced computational genomics.

1️⃣ Data Analysis for the Life Sciences
A fantastic starting point to learn statistics, R programming, and exploratory data analysis in the context of biology. The best part? It’s available free online from HarvardX.

2️⃣ Practical Computing for Biologists
The very first book I picked up when I started learning computational biology. It’s beginner-friendly and focuses on essential computing skills every biologist needs.

3️⃣ A Primer for Computational Biology
An open-access, hands-on introduction to computational biology concepts and coding techniques. Perfect if you want to learn through real examples.

4️⃣ Computational Genomics with R
For those who already know R and want to dive deeper into genome-scale data analysis, from sequence alignment to gene expression.

5️⃣ The Biologist’s Guide to Computing
Bridges the gap between biological problems and computational thinking, making it easier for life scientists to approach programming and data analysis.

6️⃣ Bioinformatics Data Skills
A must-read to sharpen your bioinformatics toolkit — from command-line skills to reproducible research workflows. Ideal once you’ve covered the basics.

7️⃣ Bioinformatics Workbook
A practical tutorial series to help scientists design bioinformatics projects, analyze data, and understand best practices.

8️⃣ Modern Statistics for Modern Biology
An essential guide to modern statistical methods applied to biology, blending theory with hands-on examples in R.

9️⃣ Algorithms on Strings, Trees, and Sequences by Dan Gusfield
A classic reference for anyone wanting to understand the algorithms behind sequence alignment, genome assembly, and biological data structures.

Bioinformatics software for biologists in the genomics era

Poonam Mahapatra — Sun, 22 Dec 2013 17:31:05 -0600

The genome sequencing revolution is approaching a landmark figure of 1000 completely sequenced genomes. Coupled with fast-declining, per-base sequencing costs, this influx of DNA sequence data has encouraged laboratory scientists to engage large datasets in comparative sequence analyses for making evolutionary, functional and translational inferences. However, the majority of the scientists at the forefront of experimental research are not bioinformaticians, so a gap exists between the user-friendly software needed and the scripting/programming infrastructure often employed for the analysis of large numbers of genes, long genomic segments and groups of sequences. We see an urgent need for the expansion of the fundamental paradigms under which biologist-friendly software tools are designed and developed to fulfill the needs of biologists to analyze large datasets by using sophisticated computational methods. We argue that the design principles need to be sensitive to the reality that comparatively small teams of biologists have historically developed some of the most popular biological software packages in molecular evolutionary analysis. Furthermore, biological intuitiveness and investigator empowerment need to take precedence over the current supposition that biologists should re-tool and become programmers when analyzing genome scale datasets.

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/23/14/1713.full