BOL: Related items

List of popular bioinformatics software/tools

Jitendra Narayan — Tue, 16 Jul 2013 14:30:30 -0500

In the current genomic era, our day-to-day work involves handling huge genome sequences, expression data, and several other datasets. This link provides a comprehensive list of commonly used software/tools.

Address of the bookmark: http://samtools.sourceforge.net/swlist.shtml

Useful Bioinformatics Analysis Tools !

Neel — Thu, 23 Dec 2021 23:10:02 -0600

CoMeta

Classifier of reads from metagenomic sequencing experiments.

• Kawulok, J., Deorowicz, S., CoMeta: Classification of Metagenomes Using k-mers, PLOS ONE, 2015; 10(4):1–23,

CoMSA

Compressor of multiple sequence alignments of proteins.

• Deorowicz, S., Walczyszyn, J., Debudaj-Grabysz, A., CoMSA: compression of protein multiple sequence alignment files, Bioinformatics, 2019; 35(2):227–234,

DSRC

Compressor of FASTQ files.

• Roguski, L., Deorowicz, S., DSRC 2: Industry-oriented compression of FASTQ files, Bioinformatics, 2014; 30(15):2213–2215,
• Deorowicz, S., Grabowski, Sz., Compression of DNA sequences in FASTQ format, Bioinformatics, 2011; 27(6):860–862,

FAMSA

Multiple sequence alignment designed for huge families of proteins (even containing hundreds of thousands of sequences).

• Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., FAMSA: Fast and accurate multiple sequence alignment of huge protein families, Scientific Reports, 2016; 6(33964):

FaStore

Compressor of FASTQ files.

• Roguski, L., Ochoa, I., Hernaez, M., Deorowicz, S., FaStore - a space-saving solution for raw sequencing data, Bioinformatics, 2018; 34(16):2748–2756,

FQSqueezer

Experimental high-end compressor of FASTQ files.

• Deorowicz, S., FQSqueezer: k-mer-based compression of sequencing data, Scientific Reports, 2020; 10(578):

GDC

Compressor of collections of genome sequences.

• Deorowicz, S., Danek, A., Niemiec, M., GDC 2: Compression of large collections of genomes, Scientific Reports, 2015; 5(11565):1–12,
• Deorowicz, S., Grabowski, Sz., Robust relative compression of genomes with random access, Bioinformatics, 2011; 27(21):2979–2986,

GTC

Genotype databases compressor with support for fast queries.

• Danek, A., Deorowicz, S., GTC: how to maintain huge genotype collections in a compressed form, Bioinformatics, 2018; 34(11):1834–1840,

GTShark

Genotypes compressor.

• Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, Bioinformatics, 2019; 35(22):4791–4793,

KMC

Memory frugal k-mer counter.

• Kokot, M., Długosz, M., Deorowicz, S., KMC 3: counting and manipulating k-mer statistics, Bioinformatics, 2017; 33(17):2759–2761,
• Deorowicz, S., Kokot, M., Grabowski, Sz., Debudaj-Grabysz, A., KMC 2: Fast and resource-frugal k-mer counting, Bioinformatics, 2015; 31(10):1569–1576,
• Deorowicz, S., Debudaj-Grabysz, A., Grabowski, Sz., Disk-based k-mer counting on a PC, BMC Bioinformatics, 2013; 14:Article no. 160,

Kmer-db

Tool for estimation of evolutionary distances in a collection of genomes.

• Deorowicz, S., Gudys, A., Dlugosz, M., Kokot, M., Danek, A., Kmer-db: instant evolutionary distance estimation, Bioinformatics, 2019; 35(1):133–136,

MuGI

Index allowing queries for a collection of multiple genome sequences.

• Danek, A., Deorowicz, S., Grabowski, Sz., Indexes of Large Genome Collections on a PC, PLOS ONE, 2014; 9(10):e109384,

ORCOM

Experimental compressor of sequencing reads.

• Grabowski, Sz., Deorowicz, S., Roguski, L., Disk-based compression of data from genome sequencing, Bioinformatics, 2015; 31(9):1389–1395,

PgSA

Index allowing queries for a collection of sequencing reads.

• Kowalski, T., Grabowski, Sz., Deorowicz, S., Indexing arbitrary-length k-mers in sequencing reads, PLOS ONE, 2015; 10(7):1–16,

QuickProbs

Multiple sequence alignment designed especially for GPU.

• Gudys, A., Deorowicz, S., QuickProbs 2: towards rapid construction of high-quality alignments of large protein families, Scientific Reports, 2017; 7(41553):
• Gudys, A., Deorowicz, S., QuickProbs – A Fast Multiple Sequence Alignment Algorithm Designed for Graphics Processors, PLOS ONE, 2014; 9(2):e88901,

RECKONER

Read error corrector.

• Długosz, M., Deorowicz, S., RECKONER: read error corrector based on KMC, Bioinformatics, 2017; 33(7):1086–1089,

TGC

Compressor of collections of genomes given in Variant Call Format (VCF) files.

• Deorowicz, S., Danek, A., Grabowski, Sz., Genome compression: a novel approach for large collections, Bioinformatics, 2013; 29(20):2572–2578,

VCFShark

Compressor of VCF files.

• Deorowicz, S., Danek, A., GTShark: Genotype compression in large projects, biorxiv.org, 2020; ():

Whisper

Experimental mapper of whole genome sequencing data.

• Deorowicz, S., Gudys, A., Whisper 2: indel-sensitive short read mapping, bioRxiv.org, 2019,
• Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Whisper: read sorting allows robust mapping of DNA sequencing data, Bioinformatics, 2019; 35(12):2043–2050,
• Deorowicz, S., Debudaj-Grabysz, A., Gudys, A., Grabowski, Sz., Robust mapping of whole genome sequencing data, Poster at The Biology of Genomes Conference, 2017,

Interesting Bioinformatics Resources !

Abhi — Fri, 11 Nov 2022 06:30:46 -0600

1. A reproducible workflow https://www.youtube.com/watch?v=s3JldKoA0zw This two-minute video will change your mind about reproducible research.

2. Parallel sequencing lives, or what makes large sequencing projects successful https://academic.oup.com/gigascience/article/6/11/gix100/4557140?login=false

3. Common-sense approaches to sharing tabular data alongside publication https://www.sciencedirect.com/science/article/pii/S2666389921002300

4. A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker https://psyarxiv.com/8xzqy/

5. Practical Computational Reproducibility in the Life Sciences https://www.cell.com/cell-systems/fulltext/S2405-4712(18)30140-6

6. A video by Dr. Keith A. Baggerly from MD Anderson, The Importance of Reproducible Research in High-Throughput Biology https://www.youtube.com/watch?v=7gYIs7uYbMo Highly recommended.

7. Ten Simple Rules for Reproducible Computational Research http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

8. Good Enough Practices in Scientific Computing http://arxiv.org/abs/1609.00037

9. Best Practices for Scientific Computing https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

10. A Quick Guide to Organizing Computational Biology Projects http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.100042 A must read for computational biologists!

11. Reproducibility of computational workflows is automated using continuous analysis https://www.nature.com/articles/nbt.3780

12. Five selfish reasons to work reproducibly https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Exploring RNA Sequence Analysis: Tools for Every Bioinformatician

Neel — Fri, 13 Dec 2024 04:03:04 -0600

RNA sequence analysis has become an essential part of modern biological research. From RNA-seq pipelines to specialized tools for specific RNA types, here's a comprehensive guide to tools you can use to make sense of RNA data.

1. RNA-Seq Analysis Pipelines

RNA-seq is one of the most popular techniques for studying RNA. These tools streamline processing raw sequence data:

FastQC: For quality control of raw RNA-seq reads.
Trimmomatic: For trimming and filtering RNA-seq reads.
HISAT2/STAR: High-performance aligners for RNA-seq reads.
featureCounts: For quantifying gene expression.
DESeq2/edgeR: For differential expression analysis.

2. Transcriptome Assembly and Annotation

For analyzing transcriptomes from non-model organisms or assembling novel transcripts:

Trinity: For de novo transcriptome assembly.
StringTie: For transcript assembly and quantification from RNA-seq alignments.
TransDecoder: To predict coding regions within assembled transcripts.
TAU: Tools for annotating non-coding and coding RNAs.

3. Exploring Non-Coding RNA (ncRNA)

Non-coding RNAs play critical regulatory roles. Dedicated tools for studying them include:

Infernal: For identifying ncRNA sequences based on covariance models.
Rfam: Database and tools for ncRNA families.
miRDeep: For identifying microRNAs in RNA-seq datasets.

4. RNA Structure and Motif Analysis

Structural biology of RNA helps in understanding its function:

RNAfold (ViennaRNA): Predicts secondary structures from RNA sequences.
RNAstructure: Tools for RNA secondary structure prediction and analysis.
MEME Suite: For identifying motifs in RNA sequences.
IntaRNA: For RNA-RNA interaction prediction.

5. RNA Editing and Modifications

Epitranscriptomics is a growing field focusing on RNA modifications:

REDItools: For RNA editing analysis.
m6Aboost: For identifying m6A modifications in RNA.

6. Long-Read RNA Sequencing Analysis

Long-read technologies like Nanopore and PacBio are transforming RNA research:

FLAIR: For isoform-level analysis of long-read RNA-seq data.
NanoMod: For detecting modifications in RNA from Nanopore sequencing.

7. RNA-Protein Interactions

To study RNA-protein interactions and complexes:

RBPmap: For identifying RNA-binding protein motifs.
PARalyzer: For analyzing PAR-CLIP data.

8. Functional Enrichment Analysis

Understanding biological functions and pathways from RNA-seq data:

getENRICH: A tool designed for pathway enrichment analysis of non-model organisms (hypergeometric P-value calculation with FDR correction).
clusterProfiler: For GO and KEGG pathway enrichment analysis.
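
The "hypergeometric P-value calculation with FDR correction" mentioned above can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code of getENRICH or clusterProfiler; the function names and the example numbers are assumptions made up for this sketch, and the FDR step uses the Benjamini-Hochberg procedure as one common choice.

```python
from math import comb


def hypergeom_pvalue(k, K, n, N):
    """P(X >= k): chance of seeing at least k pathway genes in a list of n genes,
    when K of the N background genes belong to the pathway."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: 20 background genes, 5 in the pathway,
# 4 genes in our list, 2 of which are in the pathway.
p = hypergeom_pvalue(k=2, K=5, n=4, N=20)
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

In practice one P-value is computed per pathway and the whole vector is corrected together, because the adjustment depends on how many pathways were tested.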

9. Visualization and Data Sharing

Presenting and sharing RNA sequence analysis results effectively:

IGV: Genome browser for visualizing RNA-seq alignments.
Circos: Circular visualization of RNA-seq data.
DashBio: A Python library for creating bioinformatics visualizations.

Conclusion

The bioinformatics landscape for RNA sequence analysis is vast, with tools catering to specific needs. Whether you’re studying coding RNAs, non-coding RNAs, or exploring RNA-protein interactions, the right tools can transform your data into biological insights.

Chemical Elements of Bioinformatics

Rahul Agarwal — Tue, 03 Sep 2013 16:35:39 -0500

You must be familiar with the periodic table and its colour pattern, but this time you are going to be amazed by a new table of elements from Eagle Genomics. Just check it out and have fun :)

http://elements.eaglegenomics.com/

Bioinformatics tools for genome assembly !

BioStar — Mon, 24 Jul 2023 07:04:26 -0500

There are numerous genome assembly tools available, each with its strengths and weaknesses. Here is a list of some widely used genome assembly tools as of September 2021:

SPAdes: An assembler specifically designed for single-cell and multi-cell bacterial genomes, as well as small eukaryotic genomes.
ABySS: A parallelized assembler for large genomes that uses de Bruijn graphs.
Velvet: Another de Bruijn graph-based assembler optimized for short-read sequencing data.
SOAPdenovo: A de Bruijn graph-based assembler designed for short reads, widely used for assembling large and complex genomes.
MaSuRCA: A hybrid assembler that combines data from multiple sequencing technologies, such as Illumina and PacBio.
Canu: A long-read assembler optimized for PacBio and Oxford Nanopore sequencing data.
Flye: A long-read assembler suitable for bacterial and small eukaryotic genomes.
SMARTdenovo: An assembler designed for long reads, particularly suited for PacBio data.
hybridSPAdes: A mode of SPAdes that combines short reads with long reads, such as those from PacBio or Nanopore.
Minia: An assembler optimized for low memory consumption, suitable for small and medium-sized genomes.
Unicycler: A hybrid assembler that combines short and long reads for circular bacterial genome assembly.
wtdbg2: A fuzzy de Bruijn graph assembler for long reads, efficient for very large genomes.
Shasta: A long-read assembler that uses the Overlap-Layout-Consensus approach, suitable for PacBio and Nanopore data.
Sparc: A consensus tool for noisy long reads (PacBio or Nanopore), used to polish draft assemblies rather than to assemble from scratch.
CANA: An assembler for metagenomic data, particularly for complex and diverse microbial communities.
Ra Assembler: A metagenome assembler for long reads, designed for highly complex metagenomic samples.

Please note that the field of bioinformatics is constantly evolving, and new assembly tools may have emerged since this list was compiled. Additionally, the performance of these tools can vary depending on the characteristics of the sequencing data and the genome being assembled. When selecting an assembly tool, consider the specific requirements of your project, the available data types, and the computational resources at your disposal. Always refer to the respective tool's documentation and publications for the most up-to-date information and recommendations.

Data Visualization in Bioinformatics: Useful and Eye-Catching Plots for Data Analysis

LEGE — Sat, 14 Dec 2024 12:41:53 -0600

Data visualization is a cornerstone of bioinformatics, enabling researchers to interpret complex datasets effectively. With a plethora of data types—genomic sequences, expression profiles, protein interactions, and more—the right visualizations can make or break an analysis. This blog highlights some of the most useful and visually compelling plots for bioinformatics data analysis, along with tools to create them.

1. Heatmaps: Exploring Patterns in High-Dimensional Data

Heatmaps are a go-to visualization for representing high-dimensional datasets, such as gene expression or metabolomics data. They use color gradients to display data intensity, making patterns and clusters easily detectable.

Applications: Gene expression analysis, pathway enrichment, methylation studies.
Tools: Seaborn (Python), ComplexHeatmap (R), Morpheus (web-based).

Tip: Add dendrograms to visualize clustering of rows and columns for hierarchical relationships.

2. Volcano Plots: Highlighting Differential Features

Volcano plots are indispensable for identifying significantly differentially expressed genes or proteins. They plot the log2 fold change against –log10(p-value), making it easy to spot statistically significant changes.

Applications: RNA-seq, proteomics, and metabolomics.
Tools: ggplot2 (R), EnhancedVolcano (R), Plotly (Python).

Tip: Use color to highlight significant features and label key genes or proteins.
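
As a rough illustration of what goes onto the two axes, here is a minimal Python sketch that turns one feature into a volcano-plot point. The function name, the cutoffs (an absolute log2 fold change of 1 and a p-value of 0.05) and the example values are assumptions chosen for this sketch; real analyses usually take fold changes and adjusted p-values from a tool such as DESeq2 or edgeR.

```python
from math import log2, log10


def volcano_point(mean_control, mean_treated, p_value,
                  lfc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, label) for one gene or protein.

    x is the log2 fold change, y is -log10(p-value).
    Both means are assumed to be strictly positive.
    """
    x = log2(mean_treated / mean_control)
    y = -log10(p_value)
    if p_value < p_cutoff and x >= lfc_cutoff:
        label = "up"
    elif p_value < p_cutoff and x <= -lfc_cutoff:
        label = "down"
    else:
        label = "ns"  # not significant under the chosen cutoffs
    return x, y, label


# Hypothetical features: (mean in control, mean in treated, p-value)
features = [(10.0, 40.0, 0.001), (40.0, 10.0, 0.001), (10.0, 40.0, 0.5)]
points = [volcano_point(c, t, p) for c, t, p in features]
```

The label can then drive the point colour, which is what makes the "up" and "down" arms of the volcano stand out.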

3. PCA Plots: Reducing Complexity with Principal Component Analysis

Principal Component Analysis (PCA) plots are used to reduce dimensionality and uncover trends or clusters in data. They provide insights into sample variability and grouping.

Applications: Transcriptomics, metabolomics, microbiome studies.
Tools: scikit-learn + Matplotlib (Python), prcomp (R), ClustVis (web-based).

Tip: Annotate clusters with metadata to enhance interpretability.

4. Manhattan Plots: Genome-Wide Association Studies

Manhattan plots visualize p-values across the genome, making it easy to identify significant associations in genome-wide studies. They resemble city skylines, with the highest peaks indicating loci of interest.

Applications: GWAS, QTL mapping.
Tools: qqman (R), Matplotlib (Python).

Tip: Use alternating colors for chromosomes and highlight significant SNPs for clarity.
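
The layout of a Manhattan plot comes down to two small calculations: placing each SNP at a cumulative genome-wide x position, and converting its p-value to -log10(p). The Python sketch below shows both, plus an alternating colour index per chromosome. The function name and the toy chromosome lengths are assumptions for illustration only; packages such as qqman handle this internally.

```python
from math import log10


def manhattan_coordinates(snps, chrom_lengths):
    """Compute plotting coordinates for a Manhattan plot.

    snps: list of (chromosome, position, p_value) tuples.
    chrom_lengths: list of (chromosome, length) in plotting order.
    Returns a list of (x, y, colour_index) tuples.
    """
    offsets = {}
    colour_index = {}
    running = 0
    for i, (chrom, length) in enumerate(chrom_lengths):
        offsets[chrom] = running      # where this chromosome starts on the x axis
        colour_index[chrom] = i % 2   # alternate between two colours
        running += length
    return [(offsets[c] + pos, -log10(p), colour_index[c]) for c, pos, p in snps]


# Toy genome with two short chromosomes (made-up lengths and p-values).
lengths = [("chr1", 1000), ("chr2", 800)]
snps = [("chr1", 100, 1e-3), ("chr2", 50, 1e-8)]
coords = manhattan_coordinates(snps, lengths)
```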

5. Circular Plots (Circos): Visualizing Genomic Relationships

Circular plots are ideal for visualizing relationships across the genome, such as structural variations, gene duplications, or synteny.

Applications: Comparative genomics, structural variation studies.
Tools: Circos (standalone), RCircos (R), pyCircos (Python).

Tip: Keep the plot clean and avoid overcrowding to maintain readability.

6. Sankey Diagrams: Tracking Data Flows

Sankey diagrams visualize flows or relationships between categories, often used to track changes in gene expression or pathway enrichment across conditions.

Applications: Pathway analysis, gene set enrichment analysis.
Tools: Plotly (Python), networkD3 (R).

Tip: Use gradients or distinct colors to highlight key transitions.

7. Network Graphs: Mapping Interactions

Network graphs represent relationships between entities, such as protein-protein interactions or gene regulatory networks. Nodes represent entities, and edges represent relationships.

Applications: Systems biology, interactomics.
Tools: Cytoscape (standalone), igraph (R), NetworkX (Python).

Tip: Use edge thickness or node size to represent interaction strength or centrality.

8. Violin Plots: Visualizing Data Distribution

Violin plots combine a boxplot with a density plot, showing the distribution and variability of data.

Applications: Single-cell RNA-seq, quantitative trait analysis.
Tools: Seaborn (Python), ggplot2 (R).

Tip: Split violins by groups for side-by-side comparisons.

9. Time-Series Plots: Monitoring Changes Over Time

Time-series plots display changes in variables across time points, useful for tracking gene expression dynamics or metabolic fluxes.

Applications: Time-course experiments, cell cycle studies.
Tools: Matplotlib (Python), ggplot2 (R).

Tip: Smooth the data to highlight trends while avoiding overfitting.
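
One simple way to smooth a time course is a centred moving average. The Python sketch below is a minimal version under stated assumptions: the function name and the window size are hypothetical, time points are taken to be evenly spaced, and the window shrinks at the edges rather than padding. A small window keeps the smoothing gentle so that real peaks are not flattened away.

```python
def moving_average(values, window=3):
    """Centred moving average; the window must be a positive odd number.

    Near the ends of the series the window is truncated, so the output
    has the same length as the input.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        stop = min(len(values), i + half + 1)
        chunk = values[start:stop]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up expression values at five evenly spaced time points.
expression = [1, 5, 3, 7, 5]
trend = moving_average(expression, window=3)  # [3.0, 3.0, 5.0, 5.0, 6.0]
```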

10. Genome Tracks: Visualizing Genomic Features

Genome tracks display multiple layers of genomic data, such as gene annotations, sequencing coverage, and epigenetic marks.

Applications: ChIP-seq, ATAC-seq, whole-genome sequencing.
Tools: IGV (standalone), pyGenomeTracks (Python).

Tip: Stack related tracks for direct comparisons.

11. UpSet Plots: Visualizing Set Intersections

UpSet plots are a powerful alternative to Venn diagrams for visualizing intersections between multiple datasets.

Applications: Overlap analysis for gene sets, pathways, or variants.
Tools: UpSetR (R), ComplexUpset (R), UpSetPlot (Python).

Tip: Use bar plots to represent the size of each intersection for added clarity.
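
The numbers behind an UpSet plot are the sizes of the exclusive intersections: how many elements belong to exactly a given combination of sets. A minimal Python sketch of that counting step is shown here. The function name and the gene identifiers are made up for illustration; plotting packages compute these counts themselves.

```python
from collections import Counter


def upset_intersections(named_sets):
    """Count elements by the exact combination of sets they belong to.

    named_sets: dict mapping a set name to a collection of elements.
    Returns a list of (frozenset_of_names, count), largest count first.
    """
    membership = {}
    for name, items in named_sets.items():
        for item in items:
            membership.setdefault(item, set()).add(name)
    counts = Counter(frozenset(names) for names in membership.values())
    return sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0])))


# Hypothetical gene sets from three analyses.
gene_sets = {
    "A": {"g1", "g2", "g3", "g6"},
    "B": {"g2", "g3", "g4"},
    "C": {"g3", "g5"},
}
bars = upset_intersections(gene_sets)
```

Each entry of the result corresponds to one bar in the plot, and the names in the frozenset correspond to the filled dots beneath that bar.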

12. Ridge Plots: Comparing Distributions

Ridge plots visualize the distributions of multiple datasets, stacked for easy comparison.

Applications: Transcriptomics, single-cell RNA-seq.
Tools: ggridges (R), Matplotlib (Python).

Tip: Use transparency and consistent scaling for better readability.

13. Chord Diagrams: Visualizing Connections Between Groups

Chord diagrams illustrate relationships between categories, such as shared genes between pathways or overlaps in regulatory elements.

Applications: Pathway overlap, synteny, co-expression networks.
Tools: circlize (R), HoloViews (Python).

Tip: Use distinct colors for each group to emphasize relationships.

14. Treemaps: Hierarchical Data Representation

Treemaps visualize hierarchical data as nested rectangles, with area proportional to data size.

Applications: Ontology enrichment, pathway analysis.
Tools: treemapify (R), Plotly (Python).

Tip: Use colors to represent additional variables, like significance or enrichment scores.

15. t-SNE/UMAP Plots: Dimensionality Reduction for Clustering

t-SNE and UMAP plots are great for visualizing high-dimensional data in two dimensions. t-SNE mainly preserves local structure, while UMAP also retains more of the global structure.

Applications: Single-cell transcriptomics, clustering analyses.
Tools: scikit-learn (Python, t-SNE), umap-learn (Python, UMAP), Seurat (R).

Tip: Combine with metadata annotations for better cluster interpretation.

Bringing It All Together

The choice of visualization can significantly impact the insights gained from bioinformatics data. By selecting plots tailored to your data type and analysis goals, you can effectively communicate your findings and make your research more impactful. Whether you’re a seasoned bioinformatician or a beginner, mastering these visualizations will elevate your analyses and presentations.

Software developed in the Pevsner lab

Robert M Willioms — Mon, 06 Oct 2014 12:41:26 -0500

DRAGON: Database Referencing of Array Genes Online

SNOMAD: Standardization and Normalization of Microarray Data

SNPduo: SNP Analysis Between Two Individuals

SNPtrio: Analyzing and Visualizing Inheritance Patterns in Trios

SNPscan: Data Analysis and Visualization of SNP Data

pediSNP: Analyze SNP Data From a Pedigree of Two Generations

kcoeff: Calculate Cotterman Coefficients of SNP Genotype Data

triPOD: Detects chromosomal abnormalities in parent-child trio-based microarray data

Address of the bookmark: http://pevsnerlab.kennedykrieger.org/php/?q=software

deepTools

Martin Jones — Sat, 08 Nov 2014 15:02:08 -0600

deepTools addresses the challenge of handling the large amounts of data that are now routinely generated from DNA sequencing centers. To do so, deepTools contains useful modules to process the mapped reads data to create coverage files in standard bedGraph and bigWig file formats. By doing so, deepTools allows the creation of normalized coverage files or the comparison between two files (for example, treatment and control). Finally, using such normalized and standardized files, multiple visualizations can be created to identify enrichments with functional annotations of the genome.
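
To give a feel for what "normalized coverage" and "comparison between two files" mean numerically, here is a small Python sketch. It is not deepTools code and does not use the deepTools API: the function names, the counts-per-million scaling, the pseudocount and the bin counts are all assumptions chosen for illustration.

```python
from math import log2


def cpm(bin_counts):
    """Scale per-bin read counts to counts per million mapped reads."""
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("no reads to normalize")
    return [count * 1_000_000 / total for count in bin_counts]


def log2_ratio(treatment, control, pseudocount=1.0):
    """Per-bin log2(treatment / control); the pseudocount avoids division by zero."""
    return [log2((t + pseudocount) / (c + pseudocount))
            for t, c in zip(treatment, control)]


# Made-up read counts in three genomic bins for two samples.
treatment_bins = [10, 30, 60]
control_bins = [40, 40, 20]
treatment_cpm = cpm(treatment_bins)
control_cpm = cpm(control_bins)
enrichment = log2_ratio(treatment_cpm, control_cpm)
```

Normalizing first matters because the two samples rarely have the same sequencing depth; without it the ratio would mostly reflect library size.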

Publication: http://nar.oxfordjournals.org/content/early/2014/05/05/nar.gku365.full

Source Code and Wiki: https://github.com/fidelram/deepTools/wiki

Galaxy Tool Shed repository: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools

and example Galaxy workflows: http://toolshed.g2.bx.psu.edu/view/bgruening/deeptools_workflows

A guide for complete R beginners :- Getting data into R

Archana Malhotra — Tue, 24 Feb 2015 20:15:08 -0600

For a beginner this can be the hardest part, and it is also the most important to get right.

It is possible to create a vector by typing data directly into R using the combine function ‘c’

x <- c(1, 2, 3, 4, 5)

same as

x <- 1:5

creates the vector x with the numbers from 1 to 5.

You can see what is in an object at any time by typing its name;

x

will produce the output ‘[1] 1 2 3 4 5’

Note that names need to be quoted

daysofweek <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')

Usually, however, you want to read input from a file. We have touched on the ‘read.table’ function already.

mydata <- read.table('filename')

Now mydata is a data frame with multiple vectors (one per column).

Each vector can be identified by its default name

#if any of these are typed it will print to screen

mydata$V1
mydata$V2
mydata$V3

By default the function assumes certain things about the file

The file is a plain text file (there are functions to read Excel files: not covered here)
columns are separated by any number of tabs or spaces
there is the same number of data points in each column
there is no header row (labels for the columns)
there is no column with names for the rows (see the note on row names below).

If any of these are false, we need to tell that to the function

If it has a header row

mydata <- read.table('filename', header=TRUE) # header=T also works

Note that there is a comma between the different arguments of the function

If there is one less column in the header row, then R assumes that the 1st column of data after the header contains the row names

Now the vectors (columns) are identified by their name

#if any of these are typed it will print to screen

mydata$A
mydata$B
mydata$C

# Summary about the whole data frame

summary(mydata)

# Summary information of column A

summary(mydata$A)

We can shortcut having to type the data frame each time by attaching it

attach(mydata)

# summary of column B as ‘mydata’ is attached

summary(B)

Two other important options for read.table

If it is separated only by tabs and has a header

mydata <- read.table('filename', header=TRUE, sep='\t')

Really useful if you have spaces in the contents of some columns, so R does not mess up reading the columns. However, if the columns are of an uneven length it will tell you.

If you know that the file has uneven columns

mydata <- read.table('filename', header=TRUE, sep='\t', fill=TRUE)

This causes R to fill empty spaces in a column with ‘NA’.

The last two examples will still work with our file and give the same result as with only header=T

Graphs

To get an idea of what R is capable of, type

demo(graphics)

This steps through the examples, and the code is printed to the screen

We will work with simpler examples that have immediate use to biologists.

Remember, to get more information about the options of a function, type ‘?function’

Histogram of A

hist(mydata$A)

If there were more data, we could increase the number of bars (bins) with the option breaks=50 (or another relevant number).

boxplot(mydata)

We can get rid of the need to type the data frame each time by using the attach function

# if not already done so

attach(mydata)
boxplot(mydata$A, mydata$B, names=c("Value A", "Value B"), ylab="Count of Something")

same as

boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

Scatter plot

# if not already done so

attach(mydata)
plot(A,B) # or plot(mydata$A, mydata$B)

SAVING an image

Windows users (Rgui): RIGHT click on the image and select the format you want.

These instructions work for everyone.

You need to create a new device of the type of file you need, then send the plot to that device

To save as a png file (easy to load into the likes of PowerPoint, also great for web applications)

png('filename')
boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

or to save as a pdf

pdf('filename')
boxplot(A, B, names=c("Value A", "Value B"), ylab="Count of Something")

Note

Nothing will appear on screen, the output is going to the file
Also it may not be saved immediately, but it will be once the device is turned off or R is quit.

To quit R type

q() # If you save your session, next time you start R, you will have your data preloaded.

Or if you want to remain in R

dev.off() # turns off the png (or pdf etc.) device, which forces the data to be saved