Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.
Find the detail of the reads repeats:
fq2fa ONT_A.fastq ONT_A.fasta
minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf
awk '{if($1==$6){print}}' AONT.paf > AONTself.paf
awk '$5=="-"' AONTself.paf | awk '{print $1}'| sort|uniq > invertedrepeat.list
Generated a few palindrome and repeats plots (highlighting only repeats largest than 10, 20 and 30 kb)
minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps
minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps
minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps
Assemble with miniasm:
miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta
minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf
awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf
minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps
Njoy the assembly !
Comments
urbe@urbo214b[allONT] minimap2 -xava-ont both.fastq both.fastq -t20 -X > AONT.paf []
[M::mm_idx_gen::107.693*1.96] collected minimizers
[M::mm_idx_gen::117.963*3.34] sorted minimizers
[M::main::117.963*3.34] loaded/built the index for 702657 target sequence(s)
[M::mm_mapopt_update::121.629*3.27] mid_occ = 525
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 702657
[M::mm_idx_stat::123.917*3.22] distinct minimizers: 182052563 (37.70% are singletons); average occurrences: 7.475; average spacing: 2.940
[M::worker_pipeline::208.871*9.76] mapped 120736 sequences
[M::worker_pipeline::288.250*12.60] mapped 101108 sequences
[M::worker_pipeline::370.153*14.25] mapped 87860 sequences
[M::worker_pipeline::446.515*15.24] mapped 83415 sequences
[M::worker_pipeline::520.833*15.92] mapped 82980 sequences
[M::worker_pipeline::600.252*16.47] mapped 85392 sequences
[M::worker_pipeline::674.217*16.86] mapped 85983 sequences
[M::worker_pipeline::748.729*17.18] mapped 55098 sequences
[M::worker_pipeline::822.918*17.44] mapped 54844 sequences
[M::worker_pipeline::899.174*17.66] mapped 55252 sequences
[M::worker_pipeline::966.032*17.83] mapped 54720 sequences
[M::worker_pipeline::1043.026*17.99] mapped 62540 sequences
[M::worker_pipeline::1120.649*18.13] mapped 58892 sequences
[M::worker_pipeline::1197.893*18.26] mapped 56897 sequences
[M::worker_pipeline::1276.147*18.37] mapped 55813 sequences
[M::worker_pipeline::1353.298*18.46] mapped 56524 sequences
[M::worker_pipeline::1425.963*18.54] mapped 55836 sequences
[M::worker_pipeline::1488.410*18.61] mapped 55247 sequences
[M::worker_pipeline::1527.475*18.63] mapped 45017 sequences
[M::mm_idx_gen::1641.600*17.47] collected minimizers
[M::mm_idx_gen::1649.738*17.48] sorted minimizers
[M::main::1649.738*17.48] loaded/built the index for 455528 target sequence(s)
[M::mm_mapopt_update::1649.738*17.48] mid_occ = 525
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 455528
[M::mm_idx_stat::1651.772*17.46] distinct minimizers: 186643623 (37.28% are singletons); average occurrences: 7.298; average spacing: 2.937
[M::worker_pipeline::1735.440*17.56] mapped 120736 sequences
[M::worker_pipeline::1812.396*17.66] mapped 101108 sequences
[M::worker_pipeline::1891.734*17.76] mapped 87860 sequences
[M::worker_pipeline::1970.370*17.85] mapped 83415 sequences
[M::worker_pipeline::2046.242*17.93] mapped 82980 sequences
[M::worker_pipeline::2123.177*18.01] mapped 85392 sequences
[M::worker_pipeline::2195.681*18.07] mapped 85983 sequences
[M::worker_pipeline::2271.864*18.14] mapped 55098 sequences
[M::worker_pipeline::2348.983*18.20] mapped 54844 sequences
[M::worker_pipeline::2417.805*18.25] mapped 55252 sequences
[M::worker_pipeline::2480.751*18.30] mapped 54720 sequences
[M::worker_pipeline::2551.193*18.34] mapped 62540 sequences
[M::worker_pipeline::2623.354*18.39] mapped 58892 sequences
[M::worker_pipeline::2701.793*18.44] mapped 56897 sequences
[M::worker_pipeline::2779.713*18.48] mapped 55813 sequences
[M::worker_pipeline::2857.669*18.53] mapped 56524 sequences
[M::worker_pipeline::2935.004*18.57] mapped 55836 sequences
[M::worker_pipeline::3001.654*18.60] mapped 55247 sequences
[M::worker_pipeline::3044.627*18.61] mapped 45017 sequences
[M::mm_idx_gen::3084.077*18.40] collected minimizers
[M::mm_idx_gen::3086.956*18.40] sorted minimizers
[M::main::3086.956*18.40] loaded/built the index for 155969 target sequence(s)
[M::mm_mapopt_update::3086.956*18.40] mid_occ = 525
[M::mm_idx_stat] kmer size: 15; skip: 5; is_HPC: 0; #seq: 155969
[M::mm_idx_stat::3089.212*18.39] distinct minimizers: 129075783 (48.61% are singletons); average occurrences: 3.628; average spacing: 2.938
[M::worker_pipeline::3119.876*18.39] mapped 120736 sequences
[M::worker_pipeline::3148.589*18.40] mapped 101108 sequences
[M::worker_pipeline::3178.007*18.42] mapped 87860 sequences
[M::worker_pipeline::3207.291*18.43] mapped 83415 sequences
[M::worker_pipeline::3235.240*18.45] mapped 82980 sequences
[M::worker_pipeline::3263.770*18.46] mapped 85392 sequences
[M::worker_pipeline::3290.201*18.47] mapped 85983 sequences
[M::worker_pipeline::3318.045*18.48] mapped 55098 sequences
[M::worker_pipeline::3347.963*18.49] mapped 54844 sequences
[M::worker_pipeline::3375.439*18.50] mapped 55252 sequences
[M::worker_pipeline::3399.842*18.51] mapped 54720 sequences
[M::worker_pipeline::3427.431*18.52] mapped 62540 sequences
[M::worker_pipeline::3455.095*18.53] mapped 58892 sequences
[M::worker_pipeline::3483.205*18.54] mapped 56897 sequences
[M::worker_pipeline::3513.617*18.54] mapped 55813 sequences
[M::worker_pipeline::3541.396*18.56] mapped 56524 sequences
[M::worker_pipeline::3569.479*18.57] mapped 55836 sequences
[M::worker_pipeline::3592.692*18.58] mapped 55247 sequences
[M::worker_pipeline::3607.738*18.58] mapped 45017 sequences
[M::main] Version: 2.3-r531
[M::main] CMD: minimap2 -xava-ont -t20 -X both.fastq both.fastq
[M::main] Real time: 3608.342 sec; CPU: 67037.124 sec
urbe@urbo214b[allONT] ~/Tools/miniasm/miniasm -f both.fastq AONT.paf > AONT.gfa []
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::107.578*1.00] read 138602741 hits; stored 132126233 hits and 760720 sequences (8675939030 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::130.468*1.00] 745887 query sequences remain after sub
[M::ma_hit_cut::134.429*1.00] 128171424 hits remain after cut
[M::ma_hit_flt::137.921*1.00] 99545705 hits remain after filtering; crude coverage after filtering: 91.41
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::144.152*1.00] 743866 query sequences remain after sub
[M::ma_hit_cut::146.263*1.00] 99187462 hits remain after cut
[M::ma_hit_contained::149.341*1.00] 18263 sequences and 142921 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 129305 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 83030 arcs
[M::asg_arc_del_multi] removed 1833 multi-arcs
[M::asg_arc_del_asymm] removed 994 asymmetric arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 2053 tips
[M::asg_pop_bubble] popped 254 bubbles and trimmed 20 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 1392 asymmetric arcs
[M::asg_arc_del_short] removed 4908 short overlaps
[M::asg_cut_tip] cut 549 tips
[M::asg_pop_bubble] popped 92 bubbles and trimmed 22 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 273 asymmetric arcs
[M::asg_arc_del_short] removed 325 short overlaps
[M::asg_cut_tip] cut 106 tips
[M::asg_pop_bubble] popped 15 bubbles and trimmed 6 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 191 asymmetric arcs
[M::asg_arc_del_short] removed 239 short overlaps
[M::asg_cut_tip] cut 54 tips
[M::asg_pop_bubble] popped 14 bubbles and trimmed 6 tips
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 28 internal sequences
[M::asg_cut_biloop] cut 45 small bi-loops
[M::asg_cut_tip] cut 1 tips
[M::asg_pop_bubble] popped 3 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 139 asymmetric arcs
[M::asg_arc_del_short] removed 147 short overlaps
[M::asg_cut_tip] cut 24 tips
[M::asg_pop_bubble] popped 1 bubbles and trimmed 0 tips
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: 0.2-r168-dirty
[M::main] CMD: /home/urbe/Tools/miniasm/miniasm -f both.fastq AONT.paf
[M::main] Real time: 168.768 sec; CPU: 168.772 sec
I followed this to assemble the genome using ONT (nanopore reads)
Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors)
We start with an all against all comparison:
Then we can assemble
Convert GFA to FASTA:
And then count how many contigs:
Here’s the quick and dirty of what was done:
1 Run minimap:
This uses a pre-built set of defaults (the
ava-pb
in the code below) for analyzing PacBio data. Minimap only accepts two FASTQ files and you need to map your FASTQ file against itself. So, if you have multiple FASTQ sequencing files, you have to concatenate them into a single file prior to running minimap.2 Run miniasm:
This uses your concatenated FASTQ file and the PAF file output from the miniasm step. The code below is taken from the example provided in the miniasm documentation; there are other options available.
3 Convert miniasm output GFA to FASTA
The FASTA file is needed to re-run minimap in Step 4 below.
4 Run minimap with default settings
Using the default settings maps the FASTQ reads back to the contigs (the PAF file) created in the fist step. These mappings are required for Racon assembly (Step 5).
5 Run racon
The output file is the FASTA file listed below.
from Sam’s Notebook http://ift.tt/2fKBPUN
Must read paper https://genome.cshlp.org/content/27/5/737.full