Validating a genome assembly is crucial to ensure its accuracy and completeness. Several methods can be employed to assess the quality of a genome assembly. Here are some of the best methods commonly used in the field:
Benchmarking against a reference genome: If a reference genome is available for the organism in question, aligning the assembled genome to the reference can help identify potential errors and gaps in the assembly. Whole-genome aligners such as minimap2 or MUMmer (nucmer) suit this purpose; Bowtie and BWA are primarily short-read mappers and are better suited to mapping reads than to aligning whole assemblies.
Assessment of assembly statistics: Several metrics are used to evaluate the quality of an assembly, such as N50 (the length of the shortest contig or scaffold in the smallest set of longest sequences that together cover at least 50% of the total assembly length), L50 (the number of contigs or scaffolds in that set), and genome coverage. Tools like QUAST or assembly-stats can provide such statistics.
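As a rough illustration of the N50 and L50 definitions, here is a minimal Python sketch. The function name `n50_l50` and the contig lengths are made up for the example; in practice QUAST reports these values directly.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    # Walk from the longest sequence down, accumulating length.
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first sequence that pushes the running sum to half of the
        # assembly defines N50 (its length) and L50 (how many were needed).
        if running >= half_total:
            return length, count


# Hypothetical contig lengths, for illustration only.
contig_lengths = [100, 80, 60, 40, 20]
n50, l50 = n50_l50(contig_lengths)
print(f"N50 = {n50}, L50 = {l50}")
```

A higher N50 with a lower L50 indicates a more contiguous assembly, but neither says anything about correctness, which is why the other checks are still needed.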
Read mapping and visualization: Mapping the raw sequencing reads back to the assembly and visualizing the alignments using tools like IGV (Integrative Genomics Viewer) can help identify regions with low coverage, misassemblies, or potential structural variations.
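K-mer analysis: K-mers are short subsequences of length k used to analyze genome properties. Jellyfish counts k-mers, and KmerGenie uses k-mer abundance histograms from the reads to suggest a suitable k for assembly. Comparing the k-mers found in the reads with those present in the assembly indicates how much of the sequenced genome the assembly captures (completeness) and whether it contains sequence the reads do not support (correctness).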
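The idea of k-mer completeness can be sketched in a few lines of plain Python. This is only a toy version under stated assumptions: the function names, the default `k=21`, and the `min_count=2` threshold for treating a read k-mer as reliable are all illustrative choices, and real data sets need a dedicated counter such as Jellyfish.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield the canonical form (the lexicographically smaller of a k-mer
    and its reverse complement) of every k-mer in seq, skipping k-mers
    that contain characters other than A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            revcomp = kmer.translate(_COMPLEMENT)[::-1]
            yield min(kmer, revcomp)


def kmer_completeness(reads, contigs, k=21, min_count=2):
    """Fraction of reliable read k-mers that also occur in the assembly."""
    read_counts = Counter(km for read in reads for km in canonical_kmers(read, k))
    # K-mers seen fewer than min_count times are treated as likely errors.
    solid = {km for km, count in read_counts.items() if count >= min_count}
    if not solid:
        raise ValueError("no read k-mers pass the min_count threshold")
    assembly = {km for contig in contigs for km in canonical_kmers(contig, k)}
    return len(solid & assembly) / len(solid)


# Toy sequences, for illustration only.
reads = ["ACGTAC", "ACGTAC"]
print(kmer_completeness(reads, ["ACGTAC"], k=3))
```

A value well below 1 suggests that sequence present in the reads is missing from the assembly.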
Conserved gene analysis: Comparing the predicted genes from the assembly with a set of known conserved genes (e.g., from related species) can help evaluate the completeness and accuracy of gene content in the assembly. BUSCO is a widely used tool for this kind of check.
Transcriptome mapping: If RNA-seq data is available, aligning the RNA reads to the genome assembly can help verify the integrity of gene structures and identify potential misassemblies.
Synteny analysis: Synteny refers to the conserved arrangement of genes between related genomes. Comparing the gene order and orientation of the assembled genome with that of a reference or related genomes can help detect large-scale rearrangements or errors.
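One simple way to quantify a synteny comparison is to count breakpoints, that is, gene adjacencies in the assembly that do not occur in the reference. The sketch below assumes a single chromosome, the same gene set in both genomes, and genes encoded as signed integers (the sign gives the orientation); the function name and the example orders are hypothetical.

```python
def count_breakpoints(reference, assembly):
    """Count adjacencies in `assembly` that are absent from `reference`.

    Both arguments are lists of signed integers: the absolute value
    identifies a gene and the sign gives its strand.
    """
    conserved = set()
    for a, b in zip(reference, reference[1:]):
        conserved.add((a, b))
        # The same adjacency read from the opposite strand.
        conserved.add((-b, -a))
    return sum((a, b) not in conserved for a, b in zip(assembly, assembly[1:]))


# Hypothetical gene orders: genes 2 and 3 are inverted in the assembly.
reference_order = [1, 2, 3, 4, 5]
assembly_order = [1, -3, -2, 4, 5]
print(count_breakpoints(reference_order, assembly_order))
```

An inversion like the one in the example contributes two breakpoints, one at each end of the inverted block. Whether a breakpoint reflects a real rearrangement or a misassembly still has to be checked against the reads.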
Orthology analysis: Assessing orthologous gene relationships between the assembled genome and known genomes can provide additional insights into the accuracy of gene content and evolutionary relationships.
Validation using long-read sequencing: Utilizing long-read sequencing technologies, such as PacBio or Oxford Nanopore, can help resolve complex genomic regions and reduce errors in the assembly.
PCR and Sanger sequencing validation: Performing PCR-based experiments targeting specific genomic regions and validating them through Sanger sequencing can provide independent confirmation of assembly accuracy.
It's important to note that no single method can guarantee a perfect genome assembly, and a combination of several approaches is often used to achieve the most accurate and reliable results. The validation process may also depend on the available resources and data for the organism being studied.
The following methods can help to assess the correctness of an assembly:
Remove contaminant reads, then map the remaining reads back to the assembly. If many reads fail to align, the assembly is probably missing some regions.
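To put a number on "many reads fail to align", the mapped fraction can be read off the FLAG column of the aligner's SAM output. The following is a minimal sketch, assuming uncompressed SAM text; the function name and the example records are made up, and `samtools flagstat` gives the same kind of summary on real files.

```python
def mapping_rate(sam_lines):
    """Fraction of primary SAM records that are mapped.

    Header lines (starting with '@') are skipped, as are secondary (0x100)
    and supplementary (0x800) alignments, so each read or mate counts once.
    """
    mapped = total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second SAM column
        if flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if not flag & 0x4:  # 0x4 is set when the read is unmapped
            mapped += 1
    if total == 0:
        raise ValueError("no alignment records found")
    return mapped / total


# Made-up SAM records, for illustration only.
example_sam = [
    "@SQ\tSN:contig1\tLN:1000",
    "r1\t0\tcontig1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r3\t16\tcontig1\t50\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t2064\tcontig1\t900\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(f"{mapping_rate(example_sam):.2%} of reads mapped")
```

With a real file, the same function can be applied to an open file handle, for example `mapping_rate(open("reads_vs_assembly.sam"))`, where the file name is a placeholder.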
Align the assembly to a reference genome; if there are more than 2/MY breakpoints, you likely have rearrangements or high genome fragmentation.
Compare multiple assemblies built with different parameters and algorithms. Pairwise genome alignments of your assemblies (e.g., with nucmer or lastal) let you check for large inconsistencies between them.
Tools such as REAPR, ALE, and Pilon can help as well.