Choose the appropriate parameters and run Canu. The assembly will take about an hour. You can use two cores (parameter -maxThreads=2). Because we compute on a single Amazon server, disable the cluster option with useGrid=false. For your own project, these specifications should be discussed with a local computing guru. In the usage below, parameters in square brackets [] are optional and the symbol | stands for "or".
usage: canu [-correct | -trim | -assemble | -trim-assemble] \
[-s <assembly-specifications-file>] \
-p <assembly-prefix> \
-d <assembly-directory> \
genomeSize=<number>[g|m|k] \
-maxThreads=2 \
useGrid=false \
[other-options] \
read_file.fastq.gz
A default Canu run usually produces a high-quality assembly; an example of a command that was used for testing can be found below. However, there are still many parameters that can be tweaked. For example, depending on whether we want to assemble haplotypes separately or to collapse them together, we can alter the error correction process.
canu -p test_asmbl \
-d asm_test3 \
genomeSize=2m \
-maxThreads=2 useGrid=false \
-pacbio-raw ~/pacbio/dna/sample_reads.fastq.gz
There is a brilliant section in the Canu documentation about parameter tweaking.
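Since the assembly runs for about an hour, it can be convenient to wrap the test command in a small script that checks the inputs first and keeps a log. The following is only a sketch under assumptions: the function name run_canu and the log file name are hypothetical and not part of Canu, and the Canu options are exactly those of the test command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Canu):
#   run_canu PREFIX OUTDIR GENOME_SIZE READS
# Runs the same command as the test example, after two sanity checks.
run_canu() {
    # Stop early if the canu executable is not on the PATH.
    if ! command -v canu >/dev/null 2>&1; then
        echo "canu not found on PATH" >&2
        return 1
    fi
    # Stop early if the read file is missing or unreadable.
    if [ ! -r "$4" ]; then
        echo "cannot read input file: $4" >&2
        return 1
    fi
    # Send everything Canu prints to a log file (the name is an assumption);
    # progress can be followed from another terminal with: tail -f PREFIX.canu.log
    canu -p "$1" \
         -d "$2" \
         genomeSize="$3" \
         -maxThreads=2 useGrid=false \
         -pacbio-raw "$4" > "$1.canu.log" 2>&1
}

# Same values as in the test command above.
run_canu test_asmbl asm_test3 2m "$HOME/pacbio/dna/sample_reads.fastq.gz" ||
    echo "assembly was not started or did not finish cleanly" >&2
```

The function returns Canu's own exit status, so the script can be chained with later steps that should only run after a successful assembly.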
The output directory will contain many files. The most interesting ones are:
*.correctedReads.fasta.gz: file containing the input sequences after correction, trimmed and split based on consensus evidence
*.trimmedReads.fastq: file containing the sequences after correction and final trimming
*.layout: file containing information about read inclusion in the final assembly
*.gfa: file containing the assembly graph produced by Canu
*.contigs.fasta: file containing everything that could be assembled and is part of the primary assembly

The basic stats of the assembly can be read from the reports generated by the assembler, or calculated using standard UNIX command line tools.
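As an illustration of the command line route, the sketch below computes the number of contigs, the total length, the longest contig and the N50 with awk and sort. The function name assembly_stats and the output labels are made up for this example, and the toy FASTA file only demonstrates the call; for the test assembly above, the input would be asm_test3/test_asmbl.contigs.fasta.

```shell
#!/bin/sh
# Hypothetical helper: assembly_stats FASTA
# Prints contig count, total length, longest contig and N50 (tab separated).
assembly_stats() {
    # Stage 1: print one length per sequence (records may span several lines).
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1" |
    # Stage 2: order the lengths from longest to shortest.
    sort -rn |
    # Stage 3: N50 is the contig length at which the running sum of the
    # sorted lengths first reaches half of the total assembly size.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             if (NR == 0) { print "no sequences found" > "/dev/stderr"; exit 1 }
             running = 0
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) { n50 = lens[i]; break }
             }
             printf "contigs\t%d\ntotal_bp\t%d\nlongest_bp\t%d\nN50_bp\t%d\n",
                    NR, total, lens[1], n50
         }'
}

# Toy input with three contigs of 10, 8 and 4 bases.
demo=$(mktemp)
printf '>tig1\nACGTACGTAC\n>tig2\nACGTAC\nGT\n>tig3\nACGT\n' > "$demo"
assembly_stats "$demo"
# prints:
#   contigs     3
#   total_bp    22
#   longest_bp  10
#   N50_bp      8
rm -f "$demo"

# For the test assembly above:
#   assembly_stats asm_test3/test_asmbl.contigs.fasta
```

In the toy case the longest contig (10 bases) covers less than half of the 22 bases, while the two longest together (18 bases) cover more than half, so the N50 is the length of the second contig, 8.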