BOL: Related items

Peregrine & SHIMMER Genome Assembly Toolkit

Abhi — Thu, 16 Dec 2021 02:50:19 -0600

Peregrine is a fast genome assembler for accurate long reads (length > 10 kb, accuracy > 99%). It can assemble a human genome from 30x reads within 20 CPU hours, from reads to polished consensus. It uses Sparse HIerarchical MiniMizER (SHIMMER) for fast read-to-read overlapping without the quadratic comparisons used in other OLC assemblers.
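To illustrate the minimizer idea behind this kind of overlapping, here is a minimal shell/awk sketch. It is a plain, single-level lexicographic minimizer scan, not Peregrine's hierarchical index and not its actual code; the function name minimizers and the parameters K (k-mer length) and W (window size, in k-mers) are assumptions made for illustration.

```shell
# Hypothetical helper: list the minimizers of a sequence.
# usage: minimizers K W SEQUENCE
# For every window of W consecutive k-mers, keep the lexicographically
# smallest k-mer; print each selected position only once.
minimizers() {
    awk -v k="$1" -v w="$2" -v s="$3" 'BEGIN {
        n = length(s) - k + 1                  # number of k-mers in the sequence
        for (i = 1; i + w - 1 <= n; i++) {     # slide a window over the k-mers
            best = ""; pos = 0
            for (j = i; j < i + w; j++) {
                kmer = substr(s, j, k)
                if (pos == 0 || kmer < best) { best = kmer; pos = j }
            }
            if (pos != last) { print pos, best; last = pos }
        }
    }'
}
```

For example, minimizers 3 4 GATTACAGAT keeps only two of the eight 3-mers (ATT at position 2 and ACA at position 5). Reads that share sampled k-mers become overlap candidates, so only those pairs need to be compared.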

Address of the bookmark: https://github.com/cschin/Peregrine

HIV genome database !

Rahul Nayak — Fri, 21 Jan 2022 05:40:15 -0600

HIV resources

https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html

Address of the bookmark: https://www.hiv.lanl.gov/components/sequence/HIV/search/search.html

Human Complete Genome

Shruti Paniwala — Wed, 06 Jul 2022 06:42:55 -0500

Telomere-to-telomere consortium

We have sequenced the CHM13hTERT human cell line with a number of technologies. Human genomic DNA was extracted from the cultured cell line. As the DNA is native, modified bases will be preserved. The data includes 30x PacBio HiFi, 120x coverage of Oxford Nanopore, 70x PacBio CLR, 50x 10X Genomics, as well as BioNano DLS and Arima Genomics HiC. Most raw data is available from this site, with the exception of the PacBio data which was generated by the University of Washington/PacBio and is available from NCBI SRA.

A UCSC browser is available for v2.0 (as well as legacy v1.0 and v1.1 versions). An interactive dotplot visualization of all genomic repeats is also available from resgen.io. Known issues identified in the assembly are tracked at CHM13 issues.

MORE at https://github.com/marbl/CHM13

Address of the bookmark: https://www.science.org/doi/10.1126/science.abj6987

Tools to assess the quality of your assembled genome !

LEGE — Thu, 08 Aug 2024 23:31:18 -0500

FASTA VALIDATOR + SEQKIT RMDUP: FASTA validation
GENOMETOOLS GT GFF3VALIDATOR: GFF3 validation
ASSEMBLATHON STATS: Assembly statistics
GENOMETOOLS GT STAT: Annotation statistics
NCBI FCS ADAPTOR: Adaptor contamination pass/fail
NCBI FCS GX: Foreign organism contamination pass/fail
BUSCO: Gene-space completeness estimation
TIDK: Telomere repeat identification
LAI: Continuity of repetitive sequences
KRAKEN2: Taxonomy classification
HIC CONTACT MAP: Alignment and visualisation of HiC data
MUMMER → CIRCOS + DOTPLOT & MINIMAP2 → PLOTSR: Synteny analysis
MERQURY: K-mer completeness, consensus quality and phasing assessment

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from DNA sequencing reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together DNA sequence reads generated by sequencing platforms (e.g., Illumina, PacBio, Oxford Nanopore) into longer, contiguous sequences called contigs. This can be performed as:

De Novo Assembly: Without a reference genome.
Reference-Guided Assembly: Using a reference genome to guide the assembly process.

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, accurate reads, well suited to assembling small genomes and to polishing long-read assemblies.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq
multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
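Alongside the FastQC report, a quick sanity check of read count and mean read length can be done with standard Unix tools. The sketch below assumes an uncompressed FASTQ file with exactly four lines per record; the function name fastq_summary is hypothetical.

```shell
# Hypothetical helper: print read count and mean read length for a FASTQ file.
# Assumes uncompressed FASTQ with exactly four lines per record, so the
# sequence is always the second line of each record.
fastq_summary() {
    awk 'NR % 4 == 2 { n++; total += length($0) }
         END {
             if (n > 0) printf "reads=%d mean_length=%.1f\n", n, total / n
             else print "reads=0 mean_length=0.0"
         }' "$1"
}
```

Running fastq_summary reads.fastq before and after trimming gives a quick view of how many reads survived and how much they were shortened.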
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:

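# PE mode needs four outputs: paired and unpaired reads for each mate
trimmomatic PE reads_R1.fastq reads_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36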

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:

quast contigs.fasta -o quast_output
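To make the N50 statistic concrete: it is the contig length at which, going from the longest contig down, the running total first reaches half of the assembly size. The sketch below computes it with standard Unix tools; it is an illustration of the definition, not QUAST's implementation, and the function name assembly_n50 is hypothetical.

```shell
# Hypothetical helper: compute N50 from a FASTA file.
# Step 1: print one length per contig (sequences may span several lines).
# Step 2: sort lengths from longest to shortest.
# Step 3: report the length at which the running sum reaches half the total.
assembly_n50() {
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    sort -rn |
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2; run = 0
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { print lens[i]; exit }
             }
         }'
}
```

For contigs of 8, 6, 4 and 2 bases (20 in total), the running sum reaches 10 at the second contig, so the N50 is 6.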
BUSCO
BUSCO checks genome completeness by identifying conserved genes. Choose the lineage dataset (-l) that matches your organism; fungi_odb10 is shown here:

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa

Step 5: Post-Assembly Steps

Polishing
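Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads). Racon takes the reads, their alignments to the contigs (for example, a SAM file produced by minimap2), and the contigs themselves: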

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or Opera-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or Maker for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.

Best Practices

Always perform quality checks at each stage to ensure data integrity.
Use multiple tools to cross-validate results when working with complex genomes.
Document parameters and software versions for reproducibility.
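One lightweight way to document software versions is to write them to a file next to the results. The sketch below is an assumption-laden illustration: the function name log_versions is hypothetical, and it assumes each tool answers to --version, which is not true of every program (some use a different flag).

```shell
# Hypothetical helper: record the first line of "TOOL --version" for each tool.
# usage: log_versions TOOL...
# Assumes the tools accept --version; tools that are not installed are
# reported as "not found" instead of stopping the run.
log_versions() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\t%s\n' "$tool" "$("$tool" --version 2>&1 | head -n 1)"
        else
            printf '%s\tnot found\n' "$tool"
        fi
    done
}
```

A call such as log_versions fastqc spades.py quast > versions.tsv could then be kept with the assembly output.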

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.

Genomic architecture surrounding the fusion site of human chromosome 2

LEGE — Tue, 04 Mar 2025 12:26:29 -0600

The article "Genomic Structure and Evolution of the Ancestral Chromosome Fusion Site in 2q13–2q14.1 and Paralogous Regions on Other Human Chromosomes" (https://pmc.ncbi.nlm.nih.gov/articles/PMC187548/) explores the genomic architecture surrounding the fusion site of human chromosome 2. This fusion event is a key evolutionary marker distinguishing humans from other great apes, as humans have 46 chromosomes while chimpanzees, gorillas, and orangutans possess 48. The fusion occurred through an end-to-end joining of two ancestral chromosomes, which remain separate in nonhuman primates.

Key Findings:

Chromosomal Fusion and Its Molecular Signature:
- The fusion site is located at 2q13–2q14.1 and is characterized by degenerate telomeric sequences appearing interstitially, indicating the historical head-to-head joining of ancestral chromosomes.
- Despite being a signature of a past fusion event, these telomeric repeats are no longer functional and have undergone sequence degradation over time.
Extensive Duplications in the Surrounding Genomic Region:
- The study identifies large-scale segmental duplications flanking the fusion site, with several of these regions duplicated and scattered across multiple chromosomes.
- These duplications are predominantly located in subtelomeric and pericentromeric regions, suggesting their role in genomic instability and chromosomal evolution.
Paralogous Regions and Their Evolutionary Relationships:
- A 168-kilobase (kb) segment near the fusion site has 98%–99% sequence identity with three regions on chromosome 9 (9pter, 9p11.2, and 9q13).
- Another 67-kb region distal to the fusion site shows a high degree of homology to sequences in chromosome 22qter.
- Additionally, a 100-kb segment exhibits 96% sequence identity with a region in chromosome 2q11.2.
Comparative Genomics and Evolutionary Implications:
- By comparing the duplicated sequences and their arrangement in primates, the researchers traced the order of duplication events leading to their present distribution.
- The presence of specific repetitive elements within these duplicated segments serves as evolutionary markers that help infer their historical rearrangements.
- Some of these duplicated regions are associated with chromosomal inversion breakpoints, potentially contributing to evolutionary changes in primates.
- Recurrent structural rearrangements in these regions have been linked to human chromosomal disorders.

Conclusions and Implications:

The findings provide valuable insights into the structural evolution of human chromosome 2, which played a crucial role in human speciation.
Understanding these segmental duplications and their evolutionary trajectories sheds light on genomic instability, which may contribute to human genetic diseases.
The study highlights how large-scale chromosomal rearrangements, such as fusion and duplication, have influenced the evolutionary divergence of humans from other primates.

This research advances our understanding of human genome evolution and offers a foundation for studying the effects of structural variants in genetic disorders.

Dot, an interactive viewer for genome-genome comparison

Jit — Sun, 14 Jan 2018 11:57:34 -0600

Dot is an interactive dot plot viewer that allows genome scientists to visualize genome-genome alignments in order to evaluate new assemblies and perform exploratory comparative genomics.

Dot supports the output of MUMmer’s nucmer aligner, the most commonly used software for aligning genome assemblies. A quick script called DotPrep.py converts the delta file to a more streamlined coordinates file with an index that enables Dot to read in more alignments in certain regions on demand.

https://dnanexus.github.io/dot/

Address of the bookmark: https://github.com/dnanexus/dot

TEDxCopenhagen - Morten Sommer - What Bacteria Means for the Good Life

Wed, 13 Aug 2014 05:07:19 -0500

Scientist and entrepreneur Morten Sommer will talk about how bacteria and microbes form an integral part of the human body and play a significant role in controlling human health and well-being.

About TEDx, x = independently organized event: In the spirit of ideas worth spreading, TEDx is a program of local, self-organized events that bring people together to share a TED-like experience. At a TEDx event, TEDTalks video and live speakers combine to spark deep discussion and connection in a small group. These local, self-organized events are branded TEDx, where x = independently organized TED event. The TED Conference provides general guidance for the TEDx program, but individual TEDx events are self-organized.* (*Subject to certain rules and regulations)

SiLiX: implements an ultra-efficient algorithm for the clustering of homologous sequences

Jit — Wed, 12 Dec 2018 09:22:41 -0600

The software package SiLiX implements an ultra-efficient algorithm for the clustering of homologous sequences, based on single transitive links (single linkage) with alignment coverage constraints.

SiLiX adopts a graph-theoretical framework to interpret similarity pairs as edges of a network. A very efficient algorithm, based on the Disjoint Sets Data Structure, allows the computation of sequence families with low time and space requirements.
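The disjoint-set idea can be sketched in a few lines of shell/awk. This is an illustration of single-linkage clustering with union-find, not SiLiX's code: it assumes the similarity pairs have already been filtered (for example by alignment coverage), and the function name cluster_pairs and the two-column input format are hypothetical.

```shell
# Hypothetical helper: single-linkage clustering of similarity pairs with a
# disjoint-set (union-find) structure. Reads "idA idB" pairs on stdin and
# prints "family_representative member" lines, sorted.
cluster_pairs() {
    awk '
    # Follow parent links to the set representative, halving the path on the way.
    function find(x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]
            x = parent[x]
        }
        return x
    }
    NF >= 2 {
        if (!($1 in parent)) parent[$1] = $1   # first time we see this sequence
        if (!($2 in parent)) parent[$2] = $2
        ra = find($1); rb = find($2)
        if (ra != rb) parent[rb] = ra          # union: merge the two families
    }
    END { for (x in parent) print find(x), x }' | sort
}
```

Each pair is an edge of the similarity network, and each merge joins two families, so sequences linked through any chain of pairs end up under the same representative without building the full graph in memory.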

A parallel version of SiLiX, based on MPI, is also available in this package and has been shown to be scalable, allowing the study of very large datasets.

SiLiX is already included in the analysis pipeline for HOGENOM.

Address of the bookmark: http://lbbe.univ-lyon1.fr/SiLiX?lang=fr

Trust But Verify: Sequencing Your Cell Lines Might Reveal an Uninvited Guest

LEGE — Wed, 04 Jun 2025 00:07:57 -0500

High-throughput sequencing has become indispensable in cell biology, enabling detailed insights into chromatin structure, gene expression, and regulatory dynamics. Yet, when faced with unexpectedly low mapping rates to the human genome, researchers often rush to troubleshoot technical parameters—sequencer quality, adapter trimming, or aligner settings.

Before you go down that path, consider this critical biological question:
Are you sequencing human cells—or bacterial contamination?

The Silent Saboteur: Mycoplasma in Cell Cultures

Mycoplasma contamination remains one of the most widespread and underdiagnosed issues in tissue culture work. Studies suggest that 15–35% of cell lines in use may be contaminated, often without visible signs. Unlike other microbial infections, Mycoplasma does not produce cloudiness, odor, or a change in pH. Many researchers won’t detect it unless they specifically test for it.

The consequences, however, are profound. Mycoplasma can significantly alter:

Host gene expression patterns
Cell proliferation rates
Epigenetic profiles and chromatin accessibility
Cytokine signaling and immune responses

In short, it can skew your results, compromise your biological conclusions, and invalidate weeks or months of research.

A Simple Diagnostic Step: Map Against Mycoplasma Genomes

If you encounter poor alignment rates to the human genome, consider mapping your reads to a Mycoplasma reference genome—or better yet, use a combined human + Mycoplasma reference. There have been cases where over half of all reads, initially assumed to be from human cells, were in fact bacterial in origin. This check is fast, easy, and could save your project.
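A combined reference and a per-organism read tally can be sketched with standard Unix tools. The helper names prefix_fasta and count_by_prefix, the "|" separator, and all file names below are assumptions for illustration; the alignment step itself is left to whichever aligner you already use.

```shell
# Hypothetical helper: tag every sequence name in a FASTA file with a prefix,
# so the organism can be recovered from the reference name after mapping.
# usage: prefix_fasta PREFIX FILE
prefix_fasta() {
    sed "s/^>/>$1|/" "$2"
}

# Hypothetical helper: count primary mapped alignments per prefix.
# Reads SAM text on stdin and prints "prefix<TAB>count" lines.
count_by_prefix() {
    awk -F '\t' '
        /^@/ { next }                   # skip SAM header lines
        int($2 / 4) % 2 { next }        # flag 0x4: read unmapped
        int($2 / 256) % 2 { next }      # flag 0x100: secondary alignment
        int($2 / 2048) % 2 { next }     # flag 0x800: supplementary alignment
        { split($3, parts, "|"); n[parts[1]]++ }
        END { for (k in n) print k "\t" n[k] }' | sort
}
```

A combined reference could be built with { prefix_fasta human human.fa; prefix_fasta myco mycoplasma.fa; } > combined.fa, and after mapping, samtools view aligned.bam | count_by_prefix shows how the alignments split between the two organisms.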

How Contamination Happens—and Persists

Mycoplasma is small (0.1–0.3 μm), lacks a cell wall, and can pass through standard 0.22 μm filters undetected. Common sources include:

Contaminated reagents (e.g., FBS)
Infected cell lines obtained from other labs
Poor aseptic technique or shared equipment

Once present, it spreads quickly between cultures and can persist for months, silently affecting results.

Why Treatment Is Difficult

While antibiotics such as Plasmocin or BM-Cyclin are sometimes used, they often offer only partial resolution and may themselves alter cell behavior. In many cases, the best course of action is to discard the contaminated culture and start with a fresh, verified stock.

Practical Recommendations for Researchers

Routinely test for Mycoplasma using PCR, qPCR, or fluorescence-based assays
Incorporate contamination screens into your sequencing QC pipeline
Use combined reference genomes when mapping ambiguous reads
Practice strict aseptic technique and monitor all incoming cell lines
Don’t ignore unexplained data anomalies—they might point to contamination

Closing Thought: Contamination Is a Biological Variable

It’s easy to view poor mapping as a technical issue, but sometimes the problem lies deeper—in the biology itself. Mycoplasma contamination doesn’t just interfere with sequencing; it interferes with science. As a research community, we must treat contamination not as an afterthought, but as a key variable to control.

So next time your reads won’t align, don’t just tune the aligner. Ask if your cells are telling the truth—or if they're hiding something.