BOL: Abhi's blogs

SNP Analysis: Unlocking the Secrets in Our DNA

Abhi — Wed, 16 Jul 2025 01:31:45 -0500

Single Nucleotide Polymorphisms (SNPs) are the most common type of genetic variation in humans—and many other organisms. A single base change in the DNA sequence (for example, an A instead of a G) can influence everything from our eye color to our risk of developing diseases. Analyzing these tiny changes has become central to modern genetics, medicine, agriculture, and evolutionary biology.

What are SNPs?
SNPs (pronounced "snips") are positions in the genome where individuals differ by a single nucleotide. For example:

Reference: ...A T G C A T G A...
Variant:   ...A T G T A T G A...

Here, the C in the reference genome has been replaced by a T in the variant.
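The comparison above can be sketched in a few lines of Python. This is only an illustration for two sequences that are already aligned and the same length; the function name find_snps is made up for this example.

```python
def find_snps(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


reference = "ATGCATGA"
variant = "ATGTATGA"
print(find_snps(reference, variant))  # [(4, 'C', 'T')]
```

Real data needs an alignment step first, which is what the tools described later in this post take care of.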

SNPs occur roughly every 300–1,000 bases in the human genome, meaning there are millions of them scattered throughout our DNA. Most SNPs have no effect on health, but some are linked to disease susceptibility, drug response, and other traits.

Why Do We Analyze SNPs?
1. Medical Genetics

Identify disease-associated variants (e.g., BRCA1/2 in breast cancer).

Predict drug response (pharmacogenomics).

Enable precision medicine by tailoring treatments.

2. Population Genetics & Ancestry

Trace human migration and ancestry.

Study genetic diversity within and between populations.

3. Agriculture & Animal Breeding

Select for desirable traits (drought resistance, yield, disease resistance).

Improve breeding efficiency in livestock.

4. Evolutionary Biology

Track natural selection.

Study adaptation in wild populations.

How is SNP Analysis Performed?
SNP analysis can be broadly divided into three steps:

SNP Detection
Genotyping arrays: Chips that test hundreds of thousands of known SNP positions simultaneously. Fast and affordable, widely used in consumer ancestry testing.

Whole-genome or whole-exome sequencing: Can detect known and novel SNPs across the genome.

Targeted sequencing or PCR: For focused analysis of specific regions.

Variant Calling
Sequencing data is aligned to a reference genome. Bioinformatics tools (e.g., GATK, bcftools) identify positions where the sequenced sample differs from the reference.
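To make the idea concrete, here is a deliberately naive sketch of calling a variant at a single position from the bases that the aligned reads show there. Real callers such as GATK and bcftools use probabilistic models that account for base and mapping quality; the function name, the minimum depth, and the allele-fraction cutoff below are assumptions chosen purely for illustration.

```python
from collections import Counter


def call_site(ref_base, read_bases, min_depth=10, min_alt_fraction=0.2):
    """Return the best-supported non-reference base at one position, or None.

    read_bases holds one observed base per aligned read.
    min_depth and min_alt_fraction are hypothetical thresholds.
    """
    depth = len(read_bases)
    if depth < min_depth:
        return None  # too few reads to make a call
    alt_counts = Counter(base for base in read_bases if base != ref_base)
    if not alt_counts:
        return None  # every read agrees with the reference
    alt_base, alt_count = alt_counts.most_common(1)[0]
    if alt_count / depth >= min_alt_fraction:
        return alt_base
    return None  # alternate base is too rare, more likely a sequencing error


print(call_site("C", "CCCCCTTTTTTT"))  # 7 of 12 reads show T
print(call_site("C", "CCCCCCCCCCCT"))  # only 1 of 12 reads shows T
```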

Annotation and Interpretation
Tools (e.g., SnpEff, VEP) predict the functional impact of SNPs.

Are the SNPs in coding regions? Do they cause amino acid changes? Are they known to be pathogenic?

Databases like dbSNP, ClinVar, and GWAS Catalog provide information on known associations.
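The question "does this SNP change an amino acid?" can be illustrated with the standard genetic code. The sketch below looks at a single codon in isolation, whereas SnpEff and VEP work from full gene models and transcripts; the function name and the labels it returns are assumptions made for this example.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def classify_coding_snp(codon, position, alt_base):
    """Classify a single-base change inside one codon.

    position is 0-based within the codon (0, 1, or 2).
    """
    mutated = codon[:position] + alt_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"  # same amino acid (or still a stop codon)
    if after == "*":
        return "nonsense"    # introduces a premature stop codon
    if before == "*":
        return "stop-lost"   # removes an existing stop codon
    return "missense"        # different amino acid


print(classify_coding_snp("GAA", 2, "G"))  # GAA -> GAG, both encode E
print(classify_coding_snp("GAG", 1, "T"))  # GAG -> GTG, E becomes V
print(classify_coding_snp("TGG", 2, "A"))  # TGG -> TGA, W becomes a stop
```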

Common Tools for SNP Analysis
Alignment: BWA, Bowtie2

Variant Calling: GATK, FreeBayes

Visualization: IGV, UCSC Genome Browser

Annotation: SnpEff, VEP

Statistical Analysis: PLINK, SNPTEST

Challenges in SNP Analysis
False positives/negatives: Sequencing errors, alignment issues.

Population stratification: Confounding in association studies.

Interpretation: Many SNPs have unknown or complex effects.

Researchers address these with rigorous quality control, large datasets, and increasingly sophisticated statistical models.

The Future of SNP Analysis
With advances in sequencing technology and AI-driven analysis, SNP studies are expanding:

Polygenic risk scores predict disease risk based on thousands of SNPs.

Large-scale biobanks (e.g., UK Biobank, All of Us) enable powerful genome-wide association studies (GWAS).

CRISPR and functional assays help validate SNP effects in the lab.

SNP analysis is at the heart of the genomic revolution, promising insights into biology, health, and evolution at unprecedented scale.
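In its simplest form, a polygenic risk score like the one mentioned above is a weighted sum: each SNP's effect size multiplied by how many copies of the risk allele a person carries (0, 1, or 2). The sketch below assumes that simple additive form; the SNP names and effect sizes are invented placeholders, not real associations.

```python
def polygenic_risk_score(effect_sizes, dosages):
    """Sum effect size x risk-allele dosage over SNPs present in both inputs."""
    shared_snps = sorted(effect_sizes.keys() & dosages.keys())
    return sum(effect_sizes[snp] * dosages[snp] for snp in shared_snps)


# Hypothetical per-SNP effect sizes (e.g., log odds ratios from a GWAS).
effect_sizes = {"snp_a": 0.20, "snp_b": -0.10, "snp_c": 0.05}
# Hypothetical genotype of one individual: copies of the risk allele per SNP.
dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

print(round(polygenic_risk_score(effect_sizes, dosages), 3))  # 0.3
```

Production scores add steps this sketch skips, such as choosing which SNPs to include and accounting for correlation between nearby SNPs.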

Conclusion
From diagnosing rare diseases to designing better crops, SNP analysis is a foundational tool in modern science. As our ability to sequence and interpret genomes improves, so will our understanding of these tiny—but mighty—variations in DNA.

P-Value, FDR, q-score: What Do They Mean? A Simple Guide with Example

Abhi — Fri, 27 Jun 2025 03:26:38 -0500

In statistics and bioinformatics, you’ll often see results reported with p-values, FDR, and q-values (q-scores). But what do these terms mean, and how are they different? Let’s break them down with simple definitions and a step-by-step example.

1. What is a P-Value?
Definition: The p-value is the probability of observing a result at least as extreme as the one you got, assuming the null hypothesis is true.

Low p-value (e.g., p < 0.05) → evidence against the null hypothesis.

High p-value → no strong evidence against the null.

Key idea: It tells you how surprising your data is if there’s really no effect.

2. The Multiple Testing Problem
In bioinformatics, genomics, or any large-scale study, you test thousands of hypotheses (e.g., thousands of genes). Even if there’s no real signal, some tests will have p < 0.05 just by chance.

Example:

Testing 10,000 genes

Even if every null hypothesis is true, expect ~500 genes (10,000 × 0.05) with p < 0.05 by chance

This is why we need multiple testing correction.
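A quick simulation makes the 10,000-gene example tangible. It relies on the fact that p-values are uniformly distributed between 0 and 1 when the null hypothesis is true; the function name and the fixed seed are just choices for this sketch.

```python
import random


def count_false_positives(n_tests, alpha, seed=0):
    """Simulate n_tests null p-values and count how many fall below alpha."""
    rng = random.Random(seed)
    # Under the null hypothesis, p-values are uniform on [0, 1].
    p_values = [rng.random() for _ in range(n_tests)]
    return sum(p < alpha for p in p_values)


n_tests, alpha = 10_000, 0.05
print("expected by chance:", round(n_tests * alpha))
print("seen in one simulation:", count_false_positives(n_tests, alpha))
```

The simulated count will not be exactly 500, but it should land close to it on every run, even though none of the simulated genes has a real effect.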

3. What is FDR (False Discovery Rate)?
Definition: FDR is the expected proportion of false positives among the results you declare significant.

Unlike the family-wise error rate (FWER), which limits the probability of making even a single false positive, FDR lets you tolerate some false discoveries to gain power.

The Benjamini–Hochberg (BH) procedure is the most widely used method for controlling FDR.

4. What is a q-value (or q-score)?
Definition: The q-value of a test is the minimum FDR at which that test would be called significant.

A p-value tells you how surprising your result is.

A q-value tells you what proportion of your significant results you should expect to be false positives if you call this result (and every result with a smaller p-value) significant.

You can think of the q-value as the FDR-adjusted p-value.

5. Example: Step-by-Step
Let’s work through an example with 10 tests.

Test Raw p-value
1 0.001
2 0.004
3 0.010
4 0.020
5 0.030
6 0.040
7 0.050
8 0.060
9 0.070
10 0.080

Goal: Control FDR at 5%.

Step 1: Rank p-values
Rank from lowest to highest:

Rank p-value
1 0.001
2 0.004
3 0.010
4 0.020
5 0.030
6 0.040
7 0.050
8 0.060
9 0.070
10 0.080

Step 2: Apply Benjamini–Hochberg threshold
For each rank i, compute:

BH critical value = (i / m) × Q

where i is the rank, m = 10 is the number of tests, and Q = 0.05 is the target FDR.

Rank p-value BH critical value
1 0.001 0.005
2 0.004 0.010
3 0.010 0.015
4 0.020 0.020
5 0.030 0.025
6 0.040 0.030
7 0.050 0.035
8 0.060 0.040
9 0.070 0.045
10 0.080 0.050

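Find the largest rank whose p-value is ≤ its critical value. BH is a step-up procedure, so check every rank rather than stopping at the first failure:

p(4) = 0.020 ≤ 0.020 (T)

p(5) = 0.030 > 0.025 (F)

Ranks 6–10 also exceed their critical values, so rank 4 is the largest rank that passes.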

Result: We can declare the top 4 tests significant at FDR 5%.
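Steps 1 and 2 can be sketched in plain Python as follows. The function name and the small tolerance argument are assumptions for this example; the tolerance is only there so that exact ties such as 0.020 vs 0.020 are not lost to floating-point rounding.

```python
def benjamini_hochberg(p_values, fdr=0.05, tol=1e-12):
    """Return the 0-based indices of tests declared significant by BH."""
    m = len(p_values)
    # Step 1: rank the tests by ascending p-value.
    order = sorted(range(m), key=lambda idx: p_values[idx])
    # Step 2: find the largest rank whose p-value is <= (rank / m) * fdr.
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        critical_value = rank / m * fdr
        if p_values[idx] <= critical_value + tol:
            largest_rank = rank  # keep going: a later rank may also pass
    # Every test ranked at or below largest_rank is significant.
    return sorted(order[:largest_rank])


p_values = [0.001, 0.004, 0.010, 0.020, 0.030, 0.040, 0.050, 0.060, 0.070, 0.080]
significant = benjamini_hochberg(p_values, fdr=0.05)
print([idx + 1 for idx in significant])  # [1, 2, 3, 4]
```

Note that the loop does not stop at the first failing rank. If a later rank passes, every test ranked before it is declared significant too, even one that missed its own critical value.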

Step 3: Computing q-values (conceptually)
The q-value for each p-value is roughly the minimum FDR at which it would be significant. Specialized software (e.g., R’s qvalue package) can estimate them.

In our example:

Tests 1–4 would have q-values ≤ 0.05

Tests 5–10 would have q-values > 0.05

The q-value gives you an adjusted p-value that accounts for multiple testing.
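For the worked example, BH-adjusted p-values can be computed directly, as sketched below. One caveat: these match what R's p.adjust(method = "BH") reports, while the qvalue package additionally estimates the fraction of true nulls, so its q-values can be somewhat smaller. The function name here is an assumption for this sketch.

```python
def bh_adjusted_p_values(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda idx: p_values[idx])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, so that each adjusted
    # value never exceeds the adjusted value of a larger p-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p_values = [0.001, 0.004, 0.010, 0.020, 0.030, 0.040, 0.050, 0.060, 0.070, 0.080]
print([round(q, 4) for q in bh_adjusted_p_values(p_values)])
# [0.01, 0.02, 0.0333, 0.05, 0.06, 0.0667, 0.0714, 0.075, 0.0778, 0.08]
```

Tests 1 to 4 come out at or below 0.05 and tests 5 to 10 above it, which agrees with the threshold approach in Step 2.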

6. In Bioinformatics Workflows
You see these all the time:

RNA-seq differential expression → Report p-values, FDR/q-values

ChIP-seq peak calling

Genome-wide association studies (GWAS)

Proteomics, metabolomics

Always check if results are corrected for multiple testing. Reporting raw p-values alone can be misleading.

Summary
Term | Meaning | Interpretation
p-value | Probability under null | Small p → evidence against null
FDR | False Discovery Rate | Expected proportion of false positives among calls
q-value | FDR-adjusted p-value | Minimum FDR threshold where result is significant

Final Tip
Always correct for multiple testing! Otherwise, your beautiful "significant" results might just be noise.

What is Data Science? — A Bioinformatics Perspective

Abhi — Mon, 16 Jun 2025 01:44:34 -0500

In today’s era of big biology, we’re generating more data than ever before—genomes, transcriptomes, proteomes, metabolomes, microbiomes… you name it. But raw biological data doesn’t speak for itself. Making sense of it requires more than traditional biology. This is where data science steps in.

So, What Is Data Science?
At its core, data science is the interdisciplinary field that extracts knowledge and insights from data using programming, statistics, and domain expertise. In bioinformatics, data science enables us to turn gigabytes of sequence data into biological meaning.

Imagine trying to understand gene regulation in cancer by analyzing thousands of RNA-seq samples, or predicting antibiotic resistance from bacterial genomes—these challenges are not solvable through wet lab experiments alone. They require data-driven thinking.

Data Science Meets Bioinformatics
Bioinformatics is inherently a data science domain. From genomics to systems biology, every field in modern biology relies on data science techniques to:

Clean and process massive datasets

Discover patterns in high-dimensional data

Build predictive models (e.g., for disease classification)

Visualize complex biological networks and trends

Integrate diverse data types (e.g., transcriptomic + epigenomic data)

The Bioinformatics Toolkit
Here’s what data science typically looks like in bioinformatics:

Task | Data Science Role
Sequence alignment | Efficient algorithms, indexing, parallel processing
Gene expression analysis | Statistical modeling (e.g., DESeq2, limma)
Variant calling | Data filtering, probabilistic models
Clustering of cells in single-cell data | Unsupervised learning
Protein structure prediction | Deep learning models (e.g., AlphaFold)
Metagenomics | Data integration, classification, dimensionality reduction

Common tools include Python, R, Bioconductor, scikit-learn, Pandas, Seurat, and TensorFlow—often working together in reproducible workflows.

It's Not Just About Coding
A common misconception is that bioinformatics is just programming or scripting. But being a data scientist in bioinformatics also means:

Understanding experimental design

Asking biologically meaningful questions

Choosing the right statistical or machine learning models

Communicating findings effectively (e.g., plots, dashboards, papers)

In other words, data science in bioinformatics is where biology, statistics, and computer science converge.

Why It Matters
The real power of data science in bioinformatics is its ability to scale discovery.

Instead of studying one gene, we can study thousands.

Instead of analyzing one species, we can explore entire ecosystems.

Instead of waiting months for lab results, we can generate hypotheses in days.

From personalized medicine and cancer diagnostics to agricultural genomics and pandemic surveillance, data science is at the heart of the bioinformatics revolution.

Final Thoughts
If you’re a biologist who’s curious about code, or a data enthusiast fascinated by life sciences, bioinformatics is your playground—and data science is your toolkit.

In bioinformatics, data science isn’t just useful. It’s essential.

Cracking the Code: A Guide to Bioinformatics Job Hunting

Abhi — Mon, 23 Dec 2024 19:36:41 -0600

Entering the world of bioinformatics is an exciting journey, filled with opportunities to combine biology, data science, and technology to address some of the most pressing scientific challenges. However, securing a position in this competitive field can be daunting, especially for newcomers. Here’s a guide to help you navigate the job-hunting process and land your dream role in bioinformatics.

1. Understand the Landscape

Before diving into applications, take the time to understand the bioinformatics job market. Common roles include:

Bioinformatics Analyst/Scientist: Focused on data analysis and interpretation.
Computational Biologist: Combines computational techniques with biological research.
Data Scientist in Genomics: Applies machine learning and statistical models to genomic data.
Software Developer in Bioinformatics: Designs and develops tools and pipelines for biological research.

Familiarize yourself with the key industries hiring bioinformaticians, such as academia, biotech, pharmaceuticals, healthcare, and agriculture.

2. Build a Strong Foundation

Bioinformatics demands a diverse skill set. Ensure you have a solid foundation in the following areas:

Programming Skills: Proficiency in Python, R, or Perl is often required. Familiarity with tools like Bash scripting and version control systems (e.g., Git) is a plus.
Statistics and Data Analysis: Knowledge of statistical methods, machine learning, and data visualization is crucial.
Biological Knowledge: Understanding genomics, transcriptomics, and proteomics will help you communicate effectively with biologists.
Specialized Tools and Databases: Be comfortable using tools like BLAST, Bowtie, and databases like NCBI and Ensembl.

3. Create a Winning Resume and Portfolio

Highlight your technical skills, biological knowledge, and relevant experience. Tips for a standout application:

Tailor your resume to each job, emphasizing skills mentioned in the job description.
Showcase your experience with real-world datasets by linking to your GitHub profile or online portfolio.
Include details of any publications, presentations, or significant projects.

4. Network Actively

Networking is often the key to discovering opportunities. Here’s how to build connections:

Attend Conferences and Workshops: Events like ISMB or specialized bioinformatics workshops are great for meeting professionals.
Engage Online: Join LinkedIn groups, participate in bioinformatics forums, and follow relevant hashtags on Twitter.
Leverage Alumni Networks: Connect with alumni from your university who are working in the field.

5. Gain Relevant Experience

Experience is a major factor for hiring managers. Ways to enhance your profile include:

Internships: Seek out internships in research labs or biotech companies.
Collaborations: Volunteer to work on projects with professors or peers.
Open Source Contributions: Participate in bioinformatics software development on platforms like GitHub.

6. Prepare for Interviews

Bioinformatics interviews often combine technical and behavioral questions. Prepare by:

Reviewing Key Concepts: Refresh your knowledge of algorithms, sequence analysis, and statistical methods.
Practicing Coding: Be ready to solve coding challenges or discuss code snippets.
Understanding the Organization: Research their recent projects, publications, or products.
Preparing Questions: Demonstrate interest by asking about their tools, workflows, or team structure.

7. Stay Resilient and Persistent

Job hunting can be a long process, but persistence pays off. Tips to keep moving forward:

Keep improving your skills by taking online courses or certifications.
Stay updated with advancements in bioinformatics by following journals and blogs.
Apply to multiple positions and don’t get discouraged by rejections. Each application is a learning experience.

Closing Thoughts

Landing a bioinformatics job requires a mix of technical expertise, networking, and resilience. By understanding the market, showcasing your skills effectively, and continuously learning, you’ll be well on your way to a rewarding career in this dynamic field. Remember, the key to cracking the code is perseverance—stay curious, stay determined, and success will follow.

Life as a Bioinformatician – Expectation vs. Reality

Abhi — Mon, 23 Dec 2024 19:32:36 -0600

You enter the world of bioinformatics envisioning a sleek, high-tech career, surrounded by cutting-edge algorithms, advanced computational tools, and groundbreaking discoveries. You imagine a seamless integration of biology and data science, where every day you decode the mysteries of life at a molecular level. Your days will be spent analyzing elegant datasets, publishing in top-tier journals, and making significant contributions to human health and the environment. To top it off, you picture yourself working in a comfortable, quiet environment, with plenty of time to perfect your skills and learn new ones.

While the expectations are not entirely off base, the reality of life as a bioinformatician is a mix of exciting discoveries, troubleshooting, and, let’s admit it, a fair amount of frustration. Here’s what it’s really like:

1. Expectation: Seamlessly Working with Perfect Datasets

Reality: You often receive messy, incomplete, or poorly annotated datasets. Hours are spent cleaning, normalizing, and validating data before you even begin your analysis. "Garbage in, garbage out" is a constant reminder in your workflow. Tools designed to handle these problems exist, but they require significant customization, which adds another layer of complexity.

2. Expectation: Effortless Multidisciplinary Integration

Reality: Bridging biology and computational science is far from straightforward. You need to be proficient in both domains while keeping up with advancements in genomics, machine learning, and statistics. Additionally, collaborating with biologists who might not be fluent in computational jargon requires patience and effective communication skills.

3. Expectation: Rapid, Groundbreaking Results

Reality: Analysis often involves waiting—waiting for scripts to run, pipelines to complete, or software to install. Bioinformatics projects are iterative; you analyze, debug, and refine repeatedly. A single project might take months to complete due to unforeseen challenges, like computational bottlenecks or the need for additional experiments.

4. Expectation: Beautiful Visualizations with a Click

Reality: While tools like R, Python, and specialized software can create stunning plots, generating a publication-ready visualization requires significant effort. You’ll spend hours tweaking axes, labels, and color palettes, ensuring clarity and accuracy.

5. Expectation: All Work, No Bugs

Reality: Debugging is an integral part of the job. Whether it’s a misconfigured server, a script throwing unexpected errors, or a pipeline breaking due to an update, you’ll develop a knack for problem-solving under pressure.

6. Expectation: Ample Time for Skill Development

Reality: Bioinformatics moves fast. Juggling ongoing projects, tight deadlines, and the constant stream of new tools and algorithms leaves little time for leisurely learning. Staying updated requires proactive effort—evenings, weekends, or dedicated study breaks.

7. Expectation: Publishing Papers Regularly

Reality: Publishing in bioinformatics is a marathon, not a sprint. Your analysis needs to be thorough, reproducible, and supported by strong biological insights. Reviewers often demand additional experiments or clarifications, stretching the timeline even further.

8. Expectation: A Clear Career Path

Reality: Bioinformatics offers diverse career paths, from academia and industry to healthcare and government. However, the choice can be daunting, with each path requiring unique skill sets and presenting different challenges. Navigating these options takes time, research, and sometimes trial and error.

Finding Joy in the Chaos

Despite these challenges, being a bioinformatician is immensely rewarding. You are at the forefront of science, enabling discoveries that impact medicine, agriculture, and the environment. The thrill of uncovering insights hidden in complex datasets and the satisfaction of solving biological puzzles make the hard work worthwhile.

Advice for Aspiring Bioinformaticians

Embrace Learning: The field is ever-evolving. Stay curious and adaptable.
Develop Communication Skills: Bridging the gap between biology and computation is as much about explaining your methods as it is about applying them.
Find a Community: Collaborate with peers, join forums, and attend conferences to stay inspired and updated.
Celebrate Small Wins: Every cleaned dataset, successful script, or informative plot is a step forward.

Bioinformatics is a blend of science, technology, and artistry. While the reality might not match the polished expectations, the journey is nothing short of exhilarating. If you’re ready to embrace the chaos and keep learning, the field of bioinformatics will never cease to amaze you.

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from short DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together DNA sequence reads generated by sequencing platforms (e.g., Illumina, PacBio, Oxford Nanopore) into longer, contiguous sequences called contigs. This can be performed as:

De Novo Assembly: Without a reference genome.
Reference-Guided Assembly: Using a reference genome to guide the assembly process.

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, highly accurate reads, well suited to small genomes and to polishing long-read assemblies.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq
multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
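To see what "low-quality bases" means in the raw data, the sketch below decodes a FASTQ quality string into Phred scores. It assumes the Phred+33 encoding used by current Illumina output; the function names and the example quality string are made up for illustration.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]


def mean_quality(quality_string, offset=33):
    """Average Phred score of one read (0.0 for an empty read)."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores) if scores else 0.0


# "I" encodes Q40 (very reliable), "#" encodes Q2 (essentially unusable).
quality = "IIII####"
print(phred_scores(quality))  # [40, 40, 40, 40, 2, 2, 2, 2]
print(mean_quality(quality))  # 21.0
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 means roughly 1 error in 100 bases and Q30 roughly 1 in 1,000.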
Read Trimming and Filtering
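Trim low-quality bases and adapters using Trimmomatic or Cutadapt. In paired-end (PE) mode, Trimmomatic expects four output files: for each input file, the reads that survive as pairs and the reads whose mate was discarded:

trimmomatic PE reads_R1.fastq reads_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36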
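The SLIDINGWINDOW:4:20 setting scans each read with a 4-base window and cuts the read once the window's average quality drops below 20. The sketch below illustrates that idea only; it is a simplification, not Trimmomatic's exact algorithm, and the function name and example scores are assumptions.

```python
def sliding_window_trim(qualities, window=4, min_mean=20):
    """Return how many leading bases to keep.

    qualities is a list of Phred scores for one read. The read is cut at the
    start of the first window whose mean quality falls below min_mean.
    """
    for start in range(len(qualities) - window + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / window < min_mean:
            return start
    return len(qualities)  # no window failed, keep the whole read


# Hypothetical read whose quality decays toward the 3' end.
qualities = [38, 38, 37, 36, 30, 22, 12, 8, 5, 2]
keep = sliding_window_trim(qualities, window=4, min_mean=20)
print(keep)              # 4
print(qualities[:keep])  # [38, 38, 37, 36]
```

After trimming, MINLEN:36 would then discard any read shorter than 36 bases, so a read this short would be dropped entirely.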

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:

quast contigs.fasta -o quast_output
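If N50 is unfamiliar: sort the contigs from longest to shortest and add up their lengths until you reach at least half of the total assembly size; the length of the contig that gets you there is the N50. Below is a minimal sketch of that calculation plus GC content. The function names and contig lengths are made up for illustration, and QUAST reports both values for you.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Hypothetical contig lengths; total = 290, so half = 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 70, because 80 + 70 = 150 >= 145
print(gc_content("ATGCGC"))
```

A higher N50 generally means a more contiguous assembly, but it says nothing about correctness, which is why it is paired with completeness checks.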
BUSCO
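BUSCO checks genome completeness by searching for conserved single-copy genes. Pick the lineage dataset (-l) that matches your organism; fungi_odb10 below is only an example (a bacterial assembly would use bacteria_odb10):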

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa

Step 5: Post-Assembly Steps

Polishing
Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads).

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or Opera-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or MAKER for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.

Best Practices

Always perform quality checks at each stage to ensure data integrity.
Use multiple tools to cross-validate results when working with complex genomes.
Document parameters and software versions for reproducibility.

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.

Meta-Transcriptomics: Dynamic World of RNA in Diverse Environments

Abhi — Wed, 31 Jul 2024 02:40:49 -0500

Meta-transcriptomics combines high-throughput sequencing technologies with computational biology to profile the RNA content of a sample. This technique allows researchers to capture a snapshot of gene expression and metabolic activities across diverse microbial communities, such as those found in soil, water, and the human gut.

Key Components

Sample Collection: Meta-transcriptomics begins with the collection of environmental samples. These samples are often complex, containing a wide range of microorganisms.
RNA Extraction: RNA is extracted from the sample, which includes mRNA, rRNA, tRNA, and other non-coding RNAs. This step is crucial as it determines the quality and representativeness of the data.
Sequencing: High-throughput RNA sequencing (RNA-seq) technologies are used to obtain sequences of the RNA transcripts. This step provides a vast amount of data on the RNA molecules present in the sample.
Data Analysis: Computational tools and bioinformatics methods are employed to process and analyze the sequencing data. This involves mapping RNA sequences to reference genomes or transcriptomes, identifying expressed genes, and quantifying their abundance.
Functional Annotation: The functional roles of identified transcripts are inferred based on known gene functions, allowing researchers to understand the metabolic and ecological functions of the microbial community.
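As an illustration of the quantification step, one widely used abundance unit is transcripts per million (TPM), which normalizes read counts by gene length and then by sequencing depth. TPM is not named in the steps above, so treat the choice as an assumption of this sketch; the gene names, counts, and lengths are invented.

```python
def transcripts_per_million(read_counts, lengths_kb):
    """Convert raw read counts to TPM.

    read_counts: mapped reads per gene.
    lengths_kb: gene length in kilobases, keyed by the same gene names.
    """
    # Reads per kilobase corrects for longer genes attracting more reads.
    rates = {gene: read_counts[gene] / lengths_kb[gene] for gene in read_counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        return {gene: 0.0 for gene in rates}
    # Scale so that the values across all genes sum to one million.
    return {gene: rate / total_rate * 1_000_000 for gene, rate in rates.items()}


read_counts = {"gene_a": 100, "gene_b": 300, "gene_c": 0}
lengths_kb = {"gene_a": 1.0, "gene_b": 3.0, "gene_c": 2.0}
print(transcripts_per_million(read_counts, lengths_kb))
```

In this toy example gene_b has three times the reads of gene_a but is also three times as long, so both end up with the same TPM.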

Applications

Environmental Monitoring: Meta-transcriptomics can be used to monitor the health and functional status of ecosystems. For example, it can help assess the impact of pollution on microbial communities by revealing changes in gene expression related to stress response and degradation processes.
Microbiome Research: In human health, meta-transcriptomics offers insights into the gut microbiome’s functional state. It helps in understanding how microbial communities interact with their host, how they respond to dietary changes, and their role in health and disease.
Biotechnology: The technique can aid in the discovery of novel enzymes and bioactive compounds by profiling microbial communities in extreme environments or industrial processes.
Disease Pathogenesis: By analyzing RNA profiles from disease-associated environments, researchers can uncover pathogen-host interactions and identify potential targets for therapeutic interventions.

Challenges

Complexity of Data: The sheer volume and complexity of data generated by meta-transcriptomics can be overwhelming. Effective data management and advanced computational tools are required to extract meaningful insights.
Sampling Bias: Environmental samples can be heterogeneous, and RNA extraction methods may introduce biases, potentially affecting the accuracy of the results.
Reference Databases: Incomplete or biased reference databases can hinder the accurate functional annotation of transcripts, especially when studying novel or poorly characterized organisms.

Future Directions

Meta-transcriptomics is a rapidly evolving field, with ongoing advancements in sequencing technologies and bioinformatics. Future research may focus on improving data integration, developing more comprehensive reference databases, and enhancing our understanding of microbial community dynamics in various environments. As these challenges are addressed, meta-transcriptomics will continue to provide valuable insights into the functional roles of microorganisms and their interactions within ecosystems.

Conclusion

Meta-transcriptomics represents a powerful tool for exploring the functional aspects of microbial communities in their natural environments. By capturing a snapshot of gene expression and metabolic activities, this approach offers a deeper understanding of ecological interactions, health implications, and biotechnological potentials. As technology and methodologies advance, meta-transcriptomics is poised to make significant contributions to our knowledge of the microbial world.

Bioinformatic tools for pathogens informatics at CVR

Abhi — Sat, 08 Jun 2024 15:59:46 -0500

CVR researchers develop novel sequencing and analytical approaches for studying viruses and virus-host interactions. Below are summaries of a number of bioinformatic tools that have been developed at CVR.

DIGS

The database-integrated genome-screening (DIGS) tool provides a framework for implementing automated in silico screening of sequence databases using BLAST in combination with a relational database (MySQL).

DisCVR

DisCVR is a diagnostic tool for detecting known human viruses in clinical samples from next-generation sequencing (NGS) data. The tool has a simple graphical user interface and is optimized for Windows without compromising speed or accuracy.

DiversiTools

DiversiTools is a computational tool that is specifically tailored towards viral HTS data sets and the analysis of the underlying viral populations that they represent. It was initially developed in collaboration with a number of virologists interested in characterising the intra-host diversity of viral populations and studying their evolution across transmission chains at the micro-evolutionary scale.

GLUE

GLUE is a flexible data-centric bioinformatics environment for virus sequence data, with a focus on virus evolution and genomic variation. GLUE has been applied to a range of viruses. A GLUE-based resource focused on Hepatitis C virus is HCV-GLUE.

Tanoti

Tanoti is a BLAST-guided, reference-based short-read aligner. It was developed to maximise alignment in highly variable next-generation sequencing data sets (Illumina).

ViCTree

ViCTree is a bioinformatic framework that automatically selects new candidate virus sequences from GenBank, generates multiple sequence alignments, calculates a maximum likelihood phylogeny and integrates the sequences into the existing phylogenetic trees.

Viral Host Predictor

Viral Host Predictor provides a fast and simple way to predict the hosts and vectors of RNA viruses from viral sequences.

GRACy

GRACy is a bioinformatic tool designed for the analysis of Illumina data originating from human cytomegalovirus samples. GRACy can be used to perform read quality filtering, genotyping, de novo assembly, variant detection, annotation and data submission to public databases.

LoReTTA

LoReTTA (Long Read Template Targeted Assembler) is a reference-assisted de novo assembler specifically designed to deal with PacBio reads generated from viral genomes.

BingleSeq

BingleSeq is an R package that enables user-friendly analysis of count tables obtained from both bulk RNA-Seq and single-cell RNA-Seq protocols. The development of BingleSeq focused on providing a flexible and intuitive user experience.

Interesting Bioinformatics Resources!

Abhi — Fri, 11 Nov 2022 06:30:46 -0600

1. A reproducible workflow: https://www.youtube.com/watch?v=s3JldKoA0zw This two-minute video will change your mind about reproducible research.

2. Parallel sequencing lives, or what makes large sequencing projects successful https://academic.oup.com/gigascience/article/6/11/gix100/4557140?login=false

3. Common-sense approaches to sharing tabular data alongside publication https://www.sciencedirect.com/science/article/pii/S2666389921002300

4. A Reproducible Data Analysis Workflow with R Markdown, Git, Make, and Docker https://psyarxiv.com/8xzqy/

5. Practical Computational Reproducibility in the Life Sciences https://www.cell.com/cell-systems/fulltext/S2405-4712(18)30140-6

6. The Importance of Reproducible Research in High-Throughput Biology, a video by Dr. Keith A. Baggerly from MD Anderson: https://www.youtube.com/watch?v=7gYIs7uYbMo Highly recommended.

7. Ten Simple Rules for Reproducible Computational Research http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

8. Good Enough Practices in Scientific Computing http://arxiv.org/abs/1609.00037

9. Best Practices for Scientific Computing https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

10. A Quick Guide to Organizing Computational Biology Projects http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.100042 A must read for computational biologists!

11. Reproducibility of computational workflows is automated using continuous analysis https://www.nature.com/articles/nbt.3780

12. Five selfish reasons to work reproducibly https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0850-7

Tools for Differential expression analysis

Abhi — Tue, 08 Nov 2022 03:40:33 -0600

apeglm - https://bioconductor.org/packages/release/bioc/html/apeglm.html

ashr - https://github.com/stephens999/ashr, https://cran.r-project.org/web/packages/ashr/index.html

consensusDE - https://bioconductor.org/packages/release/bioc/html/consensusDE.html

DESeq2 - https://bioconductor.org/packages/release/bioc/html/DESeq2.html

edgeR - https://bioconductor.org/packages/release/bioc/html/edgeR.html

limma - https://kasperdanielhansen.github.io/genbioconductor/html/limma.html, https://bioconductor.org/packages/release/bioc/html/limma.html

MetaCycle - https://cran.r-project.org/web/packages/MetaCycle/index.html, https://github.com/gangwug/MetaCycle

RUVSeq - https://bioconductor.org/packages/release/bioc/html/RUVSeq.html

SARTools - https://github.com/PF2-pasteur-fr/SARTools

tximport - https://github.com/mikelove/tximport