Genome assembly is a core task in bioinformatics: it reconstructs an organism's genome from DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide walks through quality control, assembly, evaluation, and post-assembly steps using widely used tools.
Genome assembly involves piecing together DNA sequence reads generated by sequencing platforms (e.g., Illumina short reads, or PacBio and Oxford Nanopore long reads) into longer, contiguous sequences called contigs. This can be performed de novo (without a reference) or guided by an existing reference genome.
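To make the idea of joining reads into a contig concrete, here is a toy sketch that merges reads by their longest suffix-prefix overlap. It is only an illustration: the function names and the three example reads are made up, and real assemblers use overlap or de Bruijn graphs with error handling rather than this greedy left-to-right merge.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads if they overlap by at least `min_len` bases."""
    size = longest_overlap(left, right, min_len)
    if size == 0:
        return None
    return left + right[size:]


# Hypothetical error-free reads, already in the right order.
reads = ["ATGCGTAC", "GTACCTGA", "CTGATTAG"]
contig = reads[0]
for read in reads[1:]:
    merged = merge_reads(contig, read)
    if merged is not None:
        contig = merged

print(contig)  # ATGCGTACCTGATTAG
```

Each read overlaps the growing contig by four bases, so the three 8-base reads collapse into one 16-base contig.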
Before starting the assembly, ensure that your raw sequencing data is high quality.
Input Data
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:
fastqc reads.fastq
multiqc .
Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
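As a rough illustration of what "low-quality bases" means, the sketch below computes the mean Phred score per read from FASTQ text. It assumes four-line records and Phred+33 encoding, and the two reads are invented for the example. It is not a replacement for a FastQC report.

```python
import io


def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip()
        yield header[1:], sequence, quality


# Made-up records: 'I' encodes Q40, '#' encodes Q2.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\nII##\n")
summary = {name: mean_phred(qual) for name, _, qual in read_fastq(example)}

for name, score in summary.items():
    print(name, score)
```

Here read1 averages Q40 while read2 drops to Q21 because of its two low-quality trailing bases.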
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:
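trimmomatic PE reads_R1.fastq reads_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
In paired-end mode Trimmomatic expects four output files: the paired and unpaired reads for each mate.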
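The SLIDINGWINDOW:4:20 and MINLEN:36 settings can be pictured with a simplified sketch: scan the read from the 5' end in windows of four bases and cut where the window's mean quality first falls below 20, then discard the read if too little remains. This is an approximation for intuition only, not Trimmomatic's actual implementation, and the function names and quality values are hypothetical.

```python
def sliding_window_trim(qualities, window=4, threshold=20.0):
    """Return how many bases to keep from the 5' end of a read."""
    if len(qualities) < window:
        return len(qualities)  # simplification: short reads are left as-is
    for start in range(len(qualities) - window + 1):
        mean_q = sum(qualities[start:start + window]) / window
        if mean_q < threshold:
            return start  # cut at the start of the first failing window
    return len(qualities)


def passes_min_length(kept_bases, min_len=36):
    """Mimic a MINLEN-style filter on the trimmed read length."""
    return kept_bases >= min_len


# Invented quality scores that decay towards the 3' end.
quals = [38, 38, 37, 36, 35, 30, 22, 12, 8, 5]
kept = sliding_window_trim(quals)
print(kept, passes_min_length(kept))
```

For these scores the window starting at the sixth base averages 18, so only the first five bases are kept and the read would then fail a 36-base minimum.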
Select an assembly strategy based on your data type:
Short-Read Assemblers: e.g., SPAdes
Long-Read Assemblers: e.g., Canu
Hybrid Assemblers: e.g., Unicycler
SPAdes is an excellent choice for small genomes, such as bacteria.
spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output
The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).
Canu is designed for high-error long reads from PacBio or Nanopore.
canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq
The output will be in canu_output/genome.contigs.fasta.
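The genomeSize value also sets the scale for read coverage, which is the total number of sequenced bases divided by the genome size. A minimal sketch of that arithmetic follows. The helper names and the 235 Mb read total are assumptions for illustration, and the suffix parsing only mirrors the k/m/g shorthand used in the command above.

```python
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_genome_size(text):
    """Convert a size such as '4.7m' into a number of bases."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(round(float(text[:-1]) * SUFFIXES[text[-1]]))
    return int(float(text))


def estimate_coverage(total_read_bases, genome_size):
    """Average depth: sequenced bases divided by genome length."""
    return total_read_bases / genome_size


genome_size = parse_genome_size("4.7m")
# Hypothetical run that yielded 235 million bases of long reads.
coverage = estimate_coverage(235_000_000, genome_size)
print(f"{genome_size} bp genome, about {coverage:.0f}x coverage")
```

With these numbers the estimate is 50x. The same calculation gives the coverage figure that repositories usually ask for at submission time.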
Unicycler combines short and long reads for improved assemblies.
unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output
After assembly, evaluate its quality using the following tools:
QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:
quast contigs.fasta -o quast_output
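For readers unfamiliar with N50, the sketch below shows how it and GC content are defined. N50 is the contig length at which the cumulative length of the contigs, sorted longest first, reaches half of the total assembly size. The contig lengths and sequences are invented, and this is not how QUAST itself is implemented.

```python
def n50(lengths):
    """Contig length at which the sorted cumulative sum reaches half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def gc_content(sequences):
    """Percentage of G and C bases across all sequences."""
    total = sum(len(seq) for seq in sequences)
    if total == 0:
        return 0.0
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    return 100.0 * gc / total


# Made-up contig lengths: total 300, so half is 150.
contig_lengths = [100, 80, 60, 40, 20]
print(n50(contig_lengths))           # 80
print(gc_content(["GGCC", "ATAT"]))  # 50.0
```

The two longest contigs (100 + 80) are the first to cover half of the 300-base total, so the N50 is 80.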
BUSCO
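BUSCO checks genome completeness by searching for conserved single-copy orthologs. Choose the lineage dataset (-l) that matches your organism; the example below uses the fungal set: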
busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
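BUSCO reports its result as a one-line summary of complete (C), single-copy (S), duplicated (D), fragmented (F), and missing (M) percentages. The sketch below parses such a line so the scores can be compared across assemblies. The example line and its numbers are made up, and the pattern assumes the common "C:..[S:..,D:..],F:..,M:..,n:.." layout, which may differ between BUSCO versions.

```python
import re

BUSCO_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(line):
    """Extract BUSCO percentages and the number of genes searched."""
    match = BUSCO_PATTERN.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO summary")
    fields = match.groupdict()
    total = int(fields.pop("total"))
    result = {key: float(value) for key, value in fields.items()}
    result["total"] = total
    return result


# Hypothetical summary line, not output from a real run.
example_line = "C:98.5%[S:98.0%,D:0.5%],F:0.5%,M:1.0%,n:758"
scores = parse_busco_summary(example_line)
print(scores["complete"], scores["missing"], scores["total"])
```

A high complete percentage with few duplicated or missing genes generally indicates a more complete assembly.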
Assembly Graph Visualization
Visualize assembly graphs with Bandage:
Bandage load assembly_graph.gfa
Polishing
Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads).
racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or Opera-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or MAKER for eukaryotes.
prokka --outdir annotation_output --prefix genome contigs.fasta
Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.
Genome assembly transforms raw sequencing data into a coherent representation of an organism’s genome. Whether you’re assembling a microbial genome or a complex eukaryotic one, the tools and steps in this guide give you a starting point for producing, evaluating, and sharing an assembly.