BOL: Related items

SPAdes

Abhimanyu Singh — Tue, 19 Apr 2016 08:37:08 -0500

SPAdes – St. Petersburg genome assembler – is intended for both standard isolates and single-cell MDA bacteria assemblies. This manual will help you to install and run SPAdes. SPAdes version 3.7.1 was released under GPLv2 on March 8, 2016 and can be downloaded from http://bioinf.spbau.ru/en/spades.

Manual at http://spades.bioinf.spbau.ru/release3.7.1/manual.html

Address of the bookmark: http://bioinf.spbau.ru/spades

cutadapt

Radha Agarkar — Fri, 13 May 2016 04:54:50 -0500

Cutadapt finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.

Cleaning your data in this way is often required: Reads from small-RNA sequencing contain the 3’ sequencing adapter because the read is longer than the molecule that is sequenced. Amplicon reads start with a primer sequence. Poly-A tails are useful for pulling out RNA from your sample, but often you don’t want them to be in your reads.

Cutadapt helps with these trimming tasks by finding the adapter or primer sequences in an error-tolerant way. It can also modify and filter reads in various ways. Adapter sequences can contain IUPAC wildcard characters. Paired-end reads and even colorspace data are also supported. If you want, you can also just demultiplex your input data, without removing adapter sequences at all.
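
As a rough illustration of what error-tolerant 3' adapter trimming means, the Python sketch below slides the adapter along a read and cuts at the first position where the overlap matches within a mismatch budget. This is not cutadapt's own algorithm (cutadapt also handles insertions and deletions); the function name trim_3prime_adapter, the default error rate, and the minimum overlap are assumptions made for this example.

```python
def trim_3prime_adapter(read, adapter, max_error_rate=0.1, min_overlap=3):
    """Return the read with a 3' adapter removed (hypothetical helper).

    The adapter may be complete or cut off by the end of the read.
    Only substitutions are tolerated; indels are ignored for simplicity.
    """
    for start in range(len(read)):
        # Part of the adapter that fits between `start` and the read end.
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break  # remaining suffix is too short to call an adapter
        mismatches = sum(
            1 for a, b in zip(read[start:start + overlap], adapter) if a != b
        )
        # Allowed mismatches scale with the overlap length.
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]
    return read


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"
    print(trim_3prime_adapter("ACGTACGTTTGACC" + adapter, adapter))
```

With a 13-base adapter and an error rate of 0.1, one mismatch is tolerated in a full-length match, while short partial matches at the read end must be exact.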

Cutadapt comes with an extensive suite of automated tests and is available under the terms of the MIT license.

If you use cutadapt, please cite DOI:10.14806/ej.17.1.200 .

Address of the bookmark: https://cutadapt.readthedocs.io/en/stable/installation.html#quickstart

WgSim

Jit — Thu, 23 Jun 2016 07:26:49 -0500

Reads simulator

Wgsim is a small tool for simulating sequence reads from a reference genome. It is able to simulate diploid genomes with SNPs and insertion/deletion (INDEL) polymorphisms, and simulate reads with uniform substitution sequencing errors. It does not generate INDEL sequencing errors, but this can be partly compensated for by simulating INDEL polymorphisms.

Wgsim outputs the simulated polymorphisms, and writes the true read coordinates as well as the number of polymorphisms and sequencing errors in read names. One can evaluate the accuracy of a mapper or a SNP caller with wgsim_eval.pl that comes with the package.
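
The idea of uniform substitution errors and of recording the true coordinates in the read name can be sketched in a few lines of Python. This is only an illustration, not wgsim's implementation: it handles a single haploid sequence, simulates no polymorphisms, and the function simulate_reads and its read-name layout (ref_start_end_errors_index) are made up for this example and do not match wgsim's real naming scheme.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Sample reads uniformly from `reference` with substitution errors.

    Returns (name, sequence) pairs. The name stores the 1-based true
    start and end plus the number of errors (hypothetical layout).
    """
    if read_len > len(reference):
        raise ValueError("read_len exceeds reference length")
    rng = random.Random(seed)  # seeded so runs are reproducible
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        n_err = 0
        for j, base in enumerate(bases):
            # Every position has the same chance of being miscalled.
            if rng.random() < error_rate:
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
                n_err += 1
        name = f"ref_{start + 1}_{start + read_len}_{n_err}_{i}"
        reads.append((name, "".join(bases)))
    return reads


if __name__ == "__main__":
    for name, seq in simulate_reads("ACGTTGCAAGGCTTAACCGGTTAGC", 3, 10, 0.05):
        print(name, seq)
```

Because the true position travels with each read, a mapper's output can be scored by comparing the reported coordinate with the one parsed from the name.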

Address of the bookmark: https://github.com/lh3/wgsim

SpeedSeq

Jit — Fri, 20 Jan 2017 06:05:43 -0600

A flexible framework for rapid genome analysis and interpretation

C Chiang, R M Layer, G G Faust, M R Lindberg, D B Rose, E P Garrison, G T Marth, A R Quinlan, and I M Hall. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Meth (2015). doi:10.1038/nmeth.3505.

http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.3505.html

Address of the bookmark: https://github.com/hall-lab/speedseq

LoRDEC: a hybrid error correction program for long, PacBio reads

Jit — Mon, 10 Apr 2017 04:16:09 -0500

LoRDEC is a program to correct sequencing errors in long reads from 3rd generation sequencing with high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.

Usually, errors in PacBio reads include many insertions and deletions, and comparatively fewer substitutions. LoRDEC can correct errors of all these types.
After correction, a larger portion of the sequence of PacBio reads is usable for detecting regions of similarity with other sequences, for aligning them to the contigs of an assembly, etc.

Why is LoRDEC different?

It is efficient and can process large read data sets, including those from eukaryotic or vertebrate species, on a standard computing server, and even works on desktop/laptop computers.
It adopts a novel graph-based approach: it builds a succinct de Bruijn graph (DBG) representing the short reads, and seeks a corrective sequence for each erroneous region of a long read by traversing chosen paths in the graph.
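
A toy Python version of this graph-based idea is sketched below. It is a simplification under stated assumptions, not LoRDEC's code: the graph is a plain set of k-mers rather than a succinct structure, only the first weak region of a read is handled, and the first path found by breadth-first search is accepted without scoring it against the erroneous region. All function names are hypothetical.

```python
from collections import deque


def build_kmer_set(short_reads, k):
    """Nodes of the de Bruijn graph: every k-mer seen in the short reads."""
    return {r[i:i + k] for r in short_reads for i in range(len(r) - k + 1)}


def successors(kmer, kmers):
    """Edges: graph k-mers that overlap `kmer` by k-1 bases."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def find_bridge(kmers, source, target, max_steps):
    """Breadth-first search for a path source -> target.

    Returns the sequence spelled by the path, or None if no path of at
    most `max_steps` edges exists.
    """
    queue = deque([(source, source)])
    while queue:
        node, spelled = queue.popleft()
        if len(spelled) - len(source) >= max_steps:
            continue
        for nxt in successors(node, kmers):
            path = spelled + nxt[-1]
            if nxt == target:
                return path
            queue.append((nxt, path))
    return None


def correct_long_read(long_read, kmers, k):
    """Replace the first weak region flanked by solid k-mers with a graph path."""
    solid = [long_read[i:i + k] in kmers for i in range(len(long_read) - k + 1)]
    if False not in solid:
        return long_read  # every k-mer is supported by the short reads
    first_weak = solid.index(False)
    if first_weak == 0 or True not in solid[first_weak:]:
        return long_read  # no solid anchor on one side; leave untouched
    left = first_weak - 1
    right = solid.index(True, first_weak)
    source = long_read[left:left + k]
    target = long_read[right:right + k]
    bridge = find_bridge(kmers, source, target, max_steps=2 * (right - left))
    if bridge is None:
        return long_read
    return long_read[:left] + bridge + long_read[right + k:]


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTAACCGGTTAGC"
    graph = build_kmer_set([genome[:15], genome[10:]], k=5)
    noisy = genome[:12] + "A" + genome[13:]  # one substitution error
    print(correct_long_read(noisy, graph, k=5) == genome)
```

Since the replacement sequence is read off the graph rather than aligned base by base, the same search repairs substitutions, insertions, and deletions alike.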

Address of the bookmark: http://www.atgc-montpellier.fr/lordec/

Jabba: Hybrid Error Correction for Long Sequencing Reads

Jit — Fri, 05 Jan 2018 03:58:14 -0600

Jabba is a hybrid error correction tool to correct third generation (PacBio / ONT) sequencing data, using second generation (Illumina) data.

Input

Jabba takes as input a concatenated de Bruijn graph and a set of sequences:

the de Bruijn graph should appear in fasta format with one entry per node; the meta information should be in the format:
>NODE
the set of sequences should be in fasta or fastq format. These sequences will be corrected (e.g. PacBio reads). The corrections will be written to a file Jabba fasta.
The output is a file in fasta format with corrections of the long reads, and additionally a file in the input format containing uncorrected reads.

https://github.com/biointec/jabba/wiki

https://almob.biomedcentral.com/articles/10.1186/s13015-016-0075-7

Address of the bookmark: https://github.com/biointec/jabba

NxRepair: error correction in de novo assemblies using Nextera Mate Pair Reads

BioStar — Thu, 24 Jan 2019 10:35:12 -0600

NxRepair is a Python module that automatically detects large structural errors in de novo assemblies using Nextera mate pair reads. The detector will break a contig at the site of an identified misassembly and will generate a new fasta file containing both the corrected contigs and the correct, unaffected contigs.

https://nxrepair.readthedocs.io/en/latest/tutorial.html

nxrepair aligned_matepairs.bam assemblyfasta.fasta error_locations.csv new_fasta.fasta

Address of the bookmark: https://github.com/rebeccaroisin/nxrepair

Understanding HiFi Reads!

Rahul Nayak — Thu, 24 Mar 2022 19:48:11 -0500

While little public data is available for either of the new synthetic long read approaches, Illumina showed an example comparison earlier this year at the Festival of Genomics & Biodata conference (FoG 2022). In the IGV screenshot presented there (reproduced in the linked post), synthetic Infinity reads – labeled “Longas” – are at the top, followed by standard Illumina short reads, with PacBio HiFi reads labeled “CCS” depicted at the bottom.

Address of the bookmark: http://pacb.com/blog/the-hifi-difference-true-long-reads-vs-synthetic-long-reads/

Trust But Verify: Sequencing Your Cell Lines Might Reveal an Uninvited Guest

LEGE — Wed, 04 Jun 2025 00:07:57 -0500

High-throughput sequencing has become indispensable in cell biology, enabling detailed insights into chromatin structure, gene expression, and regulatory dynamics. Yet, when faced with unexpectedly low mapping rates to the human genome, researchers often rush to troubleshoot technical parameters—sequencer quality, adapter trimming, or aligner settings.

Before you go down that path, consider this critical biological question:
Are you sequencing human cells—or bacterial contamination?

The Silent Saboteur: Mycoplasma in Cell Cultures

Mycoplasma contamination remains one of the most widespread and underdiagnosed issues in tissue culture work. Studies suggest that 15–35% of cell lines in use may be contaminated, often without visible signs. Unlike other microbial infections, Mycoplasma does not produce cloudiness, odor, or a change in pH. Many researchers won’t detect it unless they specifically test for it.

The consequences, however, are profound. Mycoplasma can significantly alter:

Host gene expression patterns
Cell proliferation rates
Epigenetic profiles and chromatin accessibility
Cytokine signaling and immune responses

In short, it can skew your results, compromise your biological conclusions, and invalidate weeks or months of research.

A Simple Diagnostic Step: Map Against Mycoplasma Genomes

If you encounter poor alignment rates to the human genome, consider mapping your reads to a Mycoplasma reference genome—or better yet, use a combined human + Mycoplasma reference. There have been cases where over half of all reads, initially assumed to be from human cells, were in fact bacterial in origin. This check is fast, easy, and could save your project.
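
One way to run this check, sketched here in Python, is to align against a combined reference and summarize the per-contig counts printed by samtools idxstats (tab-separated: contig name, length, mapped reads, unmapped reads). The function name, the myco_ prefix used to tag Mycoplasma contigs, and the toy numbers in the example are all assumptions for illustration; use whatever naming your combined reference actually has.

```python
def contamination_fraction(idxstats_text, contaminant_prefix="myco_"):
    """Fraction of mapped reads assigned to contaminant contigs.

    `idxstats_text` is the text output of `samtools idxstats`.
    `contaminant_prefix` is a hypothetical naming convention for the
    Mycoplasma contigs in the combined reference.
    """
    host = contaminant = 0
    for line in idxstats_text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name == "*":
            continue  # summary line for reads with no assigned contig
        if name.startswith(contaminant_prefix):
            contaminant += int(mapped)
        else:
            host += int(mapped)
    total = host + contaminant
    return contaminant / total if total else 0.0


if __name__ == "__main__":
    # Toy values, not real data.
    example = "chr1\t1000\t300\t0\nchr2\t900\t100\t0\nmyco_contig1\t80\t600\t0\n*\t0\t0\t50\n"
    print(f"{contamination_fraction(example):.1%} of mapped reads hit the contaminant")
```

A value near zero is expected for a clean culture; a large fraction is a strong hint to test the cells directly before adjusting any aligner settings.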

How Contamination Happens—and Persists

Mycoplasma is small (0.1–0.3 μm), lacks a cell wall, and can pass through standard filters undetected. Common sources include:

Contaminated reagents (e.g., FBS)
Infected cell lines obtained from other labs
Poor aseptic technique or shared equipment

Once present, it spreads quickly between cultures and can persist for months, silently affecting results.

Why Treatment Is Difficult

While antibiotics such as Plasmocin or BM-Cyclin are sometimes used, they often offer only partial resolution and may themselves alter cell behavior. In many cases, the best course of action is to discard the contaminated culture and start with a fresh, verified stock.

Practical Recommendations for Researchers

Routinely test for Mycoplasma using PCR, qPCR, or fluorescence-based assays
Incorporate contamination screens into your sequencing QC pipeline
Use combined reference genomes when mapping ambiguous reads
Practice strict aseptic technique and monitor all incoming cell lines
Don’t ignore unexplained data anomalies—they might point to contamination

Closing Thought: Contamination Is a Biological Variable

It’s easy to view poor mapping as a technical issue, but sometimes the problem lies deeper—in the biology itself. Mycoplasma contamination doesn’t just interfere with sequencing; it interferes with science. As a research community, we must treat contamination not as an afterthought, but as a key variable to control.

So next time your reads won’t align, don’t just tune the aligner. Ask if your cells are telling the truth—or if they're hiding something.

NanoSim: nanopore sequence read simulator based on statistical characterization.

Jit — Mon, 18 Dec 2017 04:16:31 -0600

NanoSim is a fast and scalable read simulator that captures the technology-specific features of ONT data and allows for adjustments upon improvement of nanopore sequencing technology. The first step of NanoSim is read characterization, which provides a comprehensive alignment-based analysis and generates a set of read profiles serving as the input to the next step, the simulation stage. The simulation stage uses the model built in the previous step to produce in silico reads for a given reference genome. NanoSim is written in Python and R. The source files and manual are available at the Genome Sciences Centre website: http://www.bcgsc.ca/platform/bioinfo/software/nanosim

https://github.com/bcgsc/NanoSim

Address of the bookmark: http://www.bcgsc.ca/platform/bioinfo/software/nanosim