BOL: Related items

Understanding HiFi Reads!

Rahul Nayak — Thu, 24 Mar 2022 19:48:11 -0500

While little public data is available for either of the new synthetic long-read approaches, Illumina showed an example comparison earlier this year at the Festival of Genomics & Biodata conference (FoG 2022). In the IGV screenshot presented there (shown in the bookmarked post), synthetic Infinity reads, labeled “Longas”, are at the top, followed by standard Illumina short reads, with PacBio HiFi reads, labeled “CCS”, at the bottom.

Address of the bookmark: http://pacb.com/blog/the-hifi-difference-true-long-reads-vs-synthetic-long-reads/

Trust But Verify: Sequencing Your Cell Lines Might Reveal an Uninvited Guest

LEGE — Wed, 04 Jun 2025 00:07:57 -0500

High-throughput sequencing has become indispensable in cell biology, enabling detailed insights into chromatin structure, gene expression, and regulatory dynamics. Yet, when faced with unexpectedly low mapping rates to the human genome, researchers often rush to troubleshoot technical parameters—sequencer quality, adapter trimming, or aligner settings.

Before you go down that path, consider this critical biological question:
Are you sequencing human cells—or bacterial contamination?

The Silent Saboteur: Mycoplasma in Cell Cultures

Mycoplasma contamination remains one of the most widespread and underdiagnosed issues in tissue culture work. Studies suggest that 15–35% of cell lines in use may be contaminated, often without visible signs. Unlike other microbial infections, Mycoplasma does not produce cloudiness, odor, or a change in pH. Many researchers won’t detect it unless they specifically test for it.

The consequences, however, are profound. Mycoplasma can significantly alter:

Host gene expression patterns
Cell proliferation rates
Epigenetic profiles and chromatin accessibility
Cytokine signaling and immune responses

In short, it can skew your results, compromise your biological conclusions, and invalidate weeks or months of research.

A Simple Diagnostic Step: Map Against Mycoplasma Genomes

If you encounter poor alignment rates to the human genome, consider mapping your reads to a Mycoplasma reference genome—or better yet, use a combined human + Mycoplasma reference. There have been cases where over half of all reads, initially assumed to be from human cells, were in fact bacterial in origin. This check is fast, easy, and could save your project.
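
As a rough illustration of that check, the sketch below tallies mapped reads per organism from a table in the four-column layout printed by samtools idxstats (reference name, length, mapped reads, unmapped reads). It assumes the combined reference was built so that every Mycoplasma contig name starts with a hypothetical "myco_" prefix; the contig names and read counts are invented for the example.

```python
from collections import Counter


def organism_fractions(idxstats_text, myco_prefix="myco_"):
    """Return the fraction of mapped reads assigned to each organism.

    Assumption: Mycoplasma contigs in the combined reference carry the
    (hypothetical) prefix given by myco_prefix; everything else is human.
    """
    counts = Counter()
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":  # row for reads that have no reference assigned
            continue
        organism = "mycoplasma" if name.startswith(myco_prefix) else "human"
        counts[organism] += int(mapped)
    total = sum(counts.values())
    return {org: n / total for org, n in counts.items()} if total else {}


# Toy table with made-up contig names and counts.
example = (
    "chr1\t248956422\t400\t0\n"
    "chr2\t242193529\t100\t0\n"
    "myco_contig1\t800000\t500\t0\n"
    "*\t0\t0\t250\n"
)
fractions = organism_fractions(example)
print(fractions)
```

A large Mycoplasma fraction in a supposedly human sample is the signal to stop tuning the aligner and test the culture.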

How Contamination Happens—and Persists

Mycoplasma is small (0.1–0.3 μm), lacks a cell wall, and can pass through standard filters undetected. Common sources include:

Contaminated reagents (e.g., FBS)
Infected cell lines obtained from other labs
Poor aseptic technique or shared equipment

Once present, it spreads quickly between cultures and can persist for months, silently affecting results.

Why Treatment Is Difficult

While antibiotics such as Plasmocin or BM-Cyclin are sometimes used, they often offer only partial resolution and may themselves alter cell behavior. In many cases, the best course of action is to discard the contaminated culture and start with a fresh, verified stock.

Practical Recommendations for Researchers

Routinely test for Mycoplasma using PCR, qPCR, or fluorescence-based assays
Incorporate contamination screens into your sequencing QC pipeline
Use combined reference genomes when mapping ambiguous reads
Practice strict aseptic technique and monitor all incoming cell lines
Don’t ignore unexplained data anomalies—they might point to contamination

Closing Thought: Contamination Is a Biological Variable

It’s easy to view poor mapping as a technical issue, but sometimes the problem lies deeper—in the biology itself. Mycoplasma contamination doesn’t just interfere with sequencing; it interferes with science. As a research community, we must treat contamination not as an afterthought, but as a key variable to control.

So next time your reads won’t align, don’t just tune the aligner. Ask if your cells are telling the truth—or if they're hiding something.

Recombination detection tool

Jit — Tue, 02 Feb 2016 10:11:14 -0600

A program to detect recombination hotspots using population genetic data.

More at https://github.com/auton1/LDhot

Address of the bookmark: https://github.com/auton1/LDhot

ORFfinder with smart BLAST

Jit — Tue, 17 May 2016 01:43:15 -0500

ORF Finder

ORFfinder is a graphical analysis tool for finding open reading frames (ORFs). We’ve been working on a few updates, and we’d like to find out what you think about them. Read on to find out what you can do with the new ORFfinder.

Smart BLAST (https://ncbiinsights.ncbi.nlm.nih.gov/2015/07/29/smartblast/)

Select one or a group of ORFs and BLAST several databases at once, and use the newly developed SmartBLAST to verify protein names. Looking for the traditional results from BLAST? They’re there too.

BBMap/BBTools package: Multipurpose tool designed for converting reads or other nucleotide data between different formats.

Jit — Mon, 13 Jun 2016 05:47:21 -0500

Reformat is a member of the BBMap/BBTools package. It is a multipurpose tool designed for converting reads or other nucleotide data between different formats. It supports, and can inter-convert:

fastq
fasta
fasta+qual
sam
scarf (an old Illumina format)
bam (if samtools is installed)
gzip
zip
ascii-33 (sanger)
ascii-64 (old Illumina)
paired files
interleaved files

It is multithreaded and can process data at over 500 megabytes per second. It can accept streams from standard in and write to standard out, so it can easily be dropped into the middle of a pipeline for format conversion. Reformat autodetects formats based on file extensions and content, making it very easy to use; the autodetection can be overridden, allowing flexibility for people who don't like to follow naming conventions, or for out-of-spec fastq files with quality values like -17 or 120.
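
For readers unfamiliar with the ascii-33 and ascii-64 encodings listed above, here is a minimal Python sketch of what the offsets mean. It is not Reformat's code; the function names are made up for illustration. Each quality character stores a Phred score as its ASCII code minus the offset, so re-encoding is just a shift.

```python
def phred_scores(qual, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(c) - offset for c in qual]


def convert_quality(qual, qin=33, qout=64):
    """Re-encode a quality string from one ASCII offset to another."""
    return "".join(chr(ord(c) - qin + qout) for c in qual)


sanger = "!5I"                      # Phred 0, 20, 40 in ascii-33
old_illumina = convert_quality(sanger, qin=33, qout=64)
print(phred_scores(sanger), old_illumina)
```

Decoding a file with the wrong offset is one way to end up with the implausible quality values mentioned above.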

The program has been gradually expanded, and can now perform various other functions. None of these will break pairing, if the input is paired.

Quality trimming (either or both ends)
Quality filtering
Fixed-length trimming
Generation of histograms (base composition, quality, etc)
Subsampling (to a fraction of input reads, or an exact number of reads or bases)
Changing fasta line-wrapping length
Reverse-complementing (all reads or only read 2)
Adding /1 and /2 suffix to read names
GC-content filtering
Length-filtering
Testing for corrupted interleaved files

Reformat is compatible with any platform that supports Java 1.7 or higher. It also has a bash shellscript for simpler invocation. Typical usage examples:

Reformat fastq into fasta:
reformat.sh in=x.fq out=y.fa

Interleave paired reads:
reformat.sh in1=x1.fq in2=x2.fq out=y.fq

Note: if paired read files have the same name apart from a 1 and a 2, you can use a shortcut in which # stands for the 1 or 2. This is equivalent to the above command:
reformat.sh in=x#.fq out=y.fq

De-interleave reads:
reformat.sh in=x.fq out1=y1.fq out2=y2.fq

Verify that interleaving appears correct, assuming Illumina naming conventions:
reformat.sh in=x.fq vint
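
To make the idea behind that check concrete, here is a minimal Python sketch. It is not how Reformat implements it, and the function names are invented. It assumes plain four-line fastq records and read names that differ only by a trailing /1 or /2, and it tests whether each consecutive pair of records shares the same base name.

```python
def strip_pair_suffix(name):
    """Drop anything after the first space, then a trailing /1 or /2."""
    name = name.split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def looks_interleaved(fastq_text):
    """Return True if records 1+2, 3+4, ... share the same base read name."""
    lines = fastq_text.strip().splitlines()
    # Header lines are every 4th line; drop the leading '@'.
    headers = [lines[i][1:] for i in range(0, len(lines), 4)]
    if len(headers) % 2 != 0:
        return False  # an odd number of records cannot be all pairs
    return all(
        strip_pair_suffix(headers[i]) == strip_pair_suffix(headers[i + 1])
        for i in range(0, len(headers), 2)
    )


good = "@r1/1\nACGT\n+\nIIII\n@r1/2\nTTGA\n+\nIIII\n"
bad = "@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n"
print(looks_interleaved(good), looks_interleaved(bad))
```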

Convert ASCII-33 to ASCII-64:
reformat.sh in=x.fq out=y.fq qin=33 qout=64

Quality-trim paired reads to Q10 on the left and right ends and discard reads shorter than 50bp after trimming:
reformat.sh in1=x1.fq in2=x2.fq out1=y1.fq out2=y2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlength=50

Subsample 10% of the first 20000 pairs in an interleaved file:
reformat.sh in=x.fq out=y.fq reads=20000 samplerate=0.1 int=t
(in this case "int=t" overrides interleaving autodetection, to ensure reads are treated as pairs)

Pipe in a gzipped sam file and pipe out fasta:
reformat.sh in=stdin.sam.gz out=stdout.fa

Reverse-complement reads:
reformat.sh in=x.fq out=y.fq rcomp

For reformatting a file with very long sequences, Reformat will need more memory; just add the additional flag "-Xmx2g". For example, to change the line-wrapping length on the human genome (which has individual sequences over 200Mbp long) to 70 characters:
reformat.sh -Xmx2g in=HG19.fa.gz out=HG19_wrapped.fa.gz fastawrap=70
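
What changing the line-wrapping length does to a file can be sketched in a few lines of Python. This is only an illustration with a made-up function name, not Reformat's implementation. Note that the sketch holds each whole sequence in memory before re-wrapping it.

```python
def rewrap_fasta(fasta_text, width=70):
    """Re-wrap every sequence in a FASTA string to a fixed line width."""
    out = []
    seq_parts = []

    def flush():
        # Join the buffered sequence lines and re-split at the new width.
        seq = "".join(seq_parts)
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
        seq_parts.clear()

    for line in fasta_text.splitlines():
        if line.startswith(">"):
            flush()            # finish the previous record, if any
            out.append(line)   # header lines are copied unchanged
        else:
            seq_parts.append(line.strip())
    flush()
    return "\n".join(out) + "\n"


wrapped = rewrap_fasta(">chrTest\nACGTACGTAC\nGT\n", width=5)
print(wrapped)
```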

For additional functions, please run the shellscript with no arguments, or just read it with a text editor. If you have any questions, please post them in this thread.

For people using a non-bash terminal, you may need to type "bash reformat.sh" instead of just "reformat.sh".
For users of Windows or other platforms that do not support bash shellscripts, replace "reformat.sh" with "java -ea -Xmx200m -cp /path/to/bbmap/current/ jgi.ReformatReads"
For example:
java -ea -Xmx200m -cp C:\bbmap\current\ jgi.ReformatReads in=x.fq out=y.fa

Reformat can be downloaded with BBTools here:
https://sourceforge.net/projects/bbmap/

gbtools: Interactive Visualization of Metagenome Bins in R

Jit — Sun, 26 Mar 2017 15:41:31 -0500

We have developed gbtools, a software package that allows users to visualize metagenomic assemblies by plotting coverage (sequencing depth) and GC values of contigs, and also to annotate the plots with taxonomic information. Different sets of annotations, including taxonomic assignments from conserved marker genes or SSU rRNA genes, can be imported simultaneously; users can choose which annotations to plot. Bins can be manually defined from plots, or be imported from third-party binning tools and overlaid onto plots, such that results from different methods can be compared side-by-side. gbtools reports summary statistics of bins, including marker gene completeness, and allows the user to add bins to, or subtract bins from, each other.
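
The two values plotted for each contig can be illustrated with a short Python sketch. gbtools itself is an R package and this is not its API; the contig names, sequences, and coverage values are invented, and coverage is assumed to come from a separate read-mapping step.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


# Toy inputs: sequences per contig, and mean depth per contig
# (the depth values are assumed to be precomputed elsewhere).
contigs = {"contig_1": "ATGCGC", "contig_2": "ATATAT"}
coverage = {"contig_1": 35.0, "contig_2": 8.5}

# One (name, GC, coverage) point per contig, ready for a scatter plot.
points = [(name, gc_fraction(seq), coverage[name]) for name, seq in contigs.items()]
print(points)
```

Contigs from the same genome tend to cluster in such a plot, which is what makes manual bin selection possible.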

Tool at https://github.com/kbseah/genome-bin-tools

Address of the bookmark: http://journal.frontiersin.org/article/10.3389/fmicb.2015.01451/full

HGT-Finder: A New Tool for Horizontal Gene Transfer Finding and Application to Aspergillus genomes

Jit — Wed, 17 Jan 2018 05:03:19 -0600

HGT-Finder:

(i) can be used for HGT detection in both prokaryotes and eukaryotes,

(ii) can report a statistical P value for each gene to indicate how likely it is to be horizontally transferred, and

(iii) is fully automated (requires minimal human intervention), as well as very easy to install and run.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4626719/

MGERT: Mobile Genetic Elements Retrieving Tool

Neel — Sat, 18 May 2019 08:58:01 -0500

MGERT is a computational pipeline for easy retrieval of the coding sequences of mobile genetic elements (MGEs) of a particular family from genome assemblies. MGERT combines several established bioinformatic tools into a single pipeline that hides their technical quirks from an inexperienced user.

Address of the bookmark: https://github.com/andrewgull/MGERT

Panacus: A Counting Tool for Pangenome Graphs

Neel — Fri, 14 Jun 2024 14:42:28 -0500

panacus is a tool for calculating statistics for GFA files. It supports GFA files with P and W lines, but requires that the graph is blunt, i.e., nodes do not overlap and consequently, each link (L) points from the end of one segment (S) to the start of another.
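
The bluntness requirement can be checked before running any tool. The Python sketch below is independent of panacus and uses a made-up function name. It lists link (L) lines whose overlap field, the sixth tab-separated column in GFA 1, is anything other than 0M. Treating an unspecified overlap (*) as acceptable is an assumption of this sketch.

```python
def non_blunt_links(gfa_text):
    """Return the L lines whose overlap field is neither '0M' nor '*'."""
    offenders = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # GFA 1 link: L, from, from_orient, to, to_orient, overlap
        if fields[0] == "L" and len(fields) >= 6 and fields[5] not in ("0M", "*"):
            offenders.append(line)
    return offenders


# Toy graph: the second link overlaps by 2 bases, so it is not blunt.
gfa = (
    "H\tVN:Z:1.0\n"
    "S\ts1\tACGT\n"
    "S\ts2\tGGTA\n"
    "S\ts3\tTTAC\n"
    "L\ts1\t+\ts2\t+\t0M\n"
    "L\ts2\t+\ts3\t+\t2M\n"
)
print(non_blunt_links(gfa))
```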

panacus supports the following calculations:

coverage histogram
pangenome growth statistics
path-/group-resolved coverage table
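
As a simplified picture of the first calculation, the Python sketch below counts, for each node, how many paths (P lines) visit it, then tabulates how many nodes have each coverage value. It is not panacus code: it counts nodes only, ignores W lines and grouping, and counts a path once per node even if the path visits that node repeatedly.

```python
from collections import Counter


def coverage_histogram(gfa_text):
    """Map coverage value -> number of nodes visited by that many paths."""
    node_paths = {}
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            node_paths.setdefault(fields[1], set())
        elif fields[0] == "P":
            path_name = fields[1]
            for step in fields[2].split(","):
                # step looks like 's1+' or 's1-'; drop the orientation sign
                node_paths.setdefault(step[:-1], set()).add(path_name)
    return Counter(len(paths) for paths in node_paths.values())


# Toy graph with four nodes and two paths (names are invented).
toy = (
    "S\ts1\tACGT\n"
    "S\ts2\tGG\n"
    "S\ts3\tTTA\n"
    "S\ts4\tC\n"
    "P\tsampleA\ts1+,s2+,s3+\t*\n"
    "P\tsampleB\ts1+,s3+\t*\n"
)
hist = coverage_histogram(toy)
print(sorted(hist.items()))
```

Here two nodes are shared by both paths, one is private to a single path, and one is unused.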

Address of the bookmark: https://github.com/marschall-lab/panacus

DAVI: Deep learning-based tool for alignment and single nucleotide variant identification

Jit — Tue, 16 Mar 2021 05:41:33 -0500

DAVI consists of models for both global and local alignment and for variant calling. We have evaluated the performance of DAVI against existing state-of-the-art tool sets and found that its accuracy and performance are comparable to existing tools used for benchmarking. We further demonstrate that while existing tools are based on data generated from a specific sequencing technology, the models proposed in DAVI are generic and can be used across different NGS technologies as well as across different species.

https://iopscience.iop.org/article/10.1088/2632-2153/ab7e19/pdf

Address of the bookmark: https://github.com/gguptaiitd/NEAT