SPAdes hybrid genome assembly

When you have both Illumina and Nanopore data, SPAdes remains a good option for hybrid assembly. It was used to produce the B. fragilis assembly by Mick Watson’s group.

Again, running spades.py will show you the options:

spades.py

This produces:

SPAdes genome assembler v3.10.1

Usage: /usr/local/SPAdes-3.10.1-Linux/bin/spades.py [options] -o <output_dir>

Basic options:
-o      <output_dir>    directory to store all the resulting files (required)
--sc                    this flag is required for MDA (single-cell) data
--meta                  this flag is required for metagenomic sample data
--rna                   this flag is required for RNA-Seq data
--plasmid               runs plasmidSPAdes pipeline for plasmid detection
--iontorrent            this flag is required for IonTorrent data
--test                  runs SPAdes on toy dataset
-h/--help               prints this usage message
-v/--version            prints version

Input data:
--12    <filename>      file with interlaced forward and reverse paired-end reads
-1      <filename>      file with forward paired-end reads
-2      <filename>      file with reverse paired-end reads
-s      <filename>      file with unpaired reads
--pe<#>-12      <filename>      file with interlaced reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-1       <filename>      file with forward reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-2       <filename>      file with reverse reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-s       <filename>      file with unpaired reads for paired-end library number <#> (<#> = 1,2,..,9)
--pe<#>-<or>    orientation of reads for paired-end library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--s<#>          <filename>      file with unpaired reads for single reads library number <#> (<#> = 1,2,..,9)
--mp<#>-12      <filename>      file with interlaced reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-1       <filename>      file with forward reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-2       <filename>      file with reverse reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-s       <filename>      file with unpaired reads for mate-pair library number <#> (<#> = 1,2,..,9)
--mp<#>-<or>    orientation of reads for mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--hqmp<#>-12    <filename>      file with interlaced reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-1     <filename>      file with forward reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-2     <filename>      file with reverse reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-s     <filename>      file with unpaired reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9)
--hqmp<#>-<or>  orientation of reads for high-quality mate-pair library number <#> (<#> = 1,2,..,9; <or> = fr, rf, ff)
--nxmate<#>-1   <filename>      file with forward reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--nxmate<#>-2   <filename>      file with reverse reads for Lucigen NxMate library number <#> (<#> = 1,2,..,9)
--sanger        <filename>      file with Sanger reads
--pacbio        <filename>      file with PacBio reads
--nanopore      <filename>      file with Nanopore reads
--tslr  <filename>      file with TSLR-contigs
--trusted-contigs       <filename>      file with trusted contigs
--untrusted-contigs     <filename>      file with untrusted contigs

Pipeline options:
--only-error-correction runs only read error correction (without assembling)
--only-assembler        runs only assembling (without read error correction)
--careful               tries to reduce number of mismatches and short indels
--continue              continue run from the last available check-point
--restart-from  <cp>    restart run with updated options and from the specified check-point ('ec', 'as', 'k<int>', 'mc')
--disable-gzip-output   forces error correction not to compress the corrected reads
--disable-rr            disables repeat resolution stage of assembling

Advanced options:
--dataset       <filename>      file with dataset description in YAML format
-t/--threads    <int>           number of threads
                                [default: 16]
-m/--memory     <int>           RAM limit for SPAdes in Gb (terminates if exceeded)
                                [default: 250]
--tmp-dir       <dirname>       directory for temporary files
                                [default: <output_dir>/tmp]
-k              <int,int,...>   comma-separated list of k-mer sizes (must be odd and
                                less than 128) [default: 'auto']
--cov-cutoff    <float>         coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset  <33 or 64>      PHRED quality offset in the input reads (33 or 64)
                                [default: auto-detect]

As you can see, this is also a “pipeline” of tools that can be switched on or off. A full SPAdes run takes quite a long time, so for the purposes of this practical, something like this may suffice:

spades.py -t 4 \
          -m 32 \
          -k 31,51,71 \
          --only-assembler \
          -1 miseq.1.fastq -2 miseq.2.fastq \
          --nanopore minion.fastq \
          -o hybrid_assembly

In turn, these parameters mean:

  • use 4 threads
  • limit memory to 32 GB (SPAdes terminates if it exceeds this)
  • use 3 k-mer sizes to build the de Bruijn graph(s): 31, 51 and 71
  • only run the assembler, not the read error correction step (for speed)
  • read 1 and read 2 of the MiSeq data
  • the Nanopore data
  • put the output in the folder “hybrid_assembly”
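
Because the run is long, it can be worth checking the inputs before starting. The following is only a sketch under stated assumptions: `check_inputs`, `check_kmers` and `run_hybrid` are hypothetical helper names, not part of SPAdes, and the k-mer check encodes only the rule quoted in the help text above (odd and less than 128).

```shell
# Hypothetical pre-flight helpers; none of these ship with SPAdes.

# Succeed only if every file named on the command line exists and is non-empty.
check_inputs() {
    for _f in "$@"; do
        if [ ! -s "$_f" ]; then
            echo "missing or empty input: $_f" >&2
            return 1
        fi
    done
    return 0
}

# Succeed only if every value in a comma-separated list is an odd integer
# below 128 (the constraint stated in the spades.py help text).
check_kmers() {
    for _k in $(printf '%s' "$1" | tr ',' ' '); do
        case "$_k" in
            *[!0-9]*) echo "not an integer: $_k" >&2; return 1 ;;
        esac
        case "$_k" in
            *[13579]) ;;
            *) echo "k-mer size must be odd: $_k" >&2; return 1 ;;
        esac
        if [ "$_k" -ge 128 ]; then
            echo "k-mer size must be less than 128: $_k" >&2
            return 1
        fi
    done
    return 0
}

# Run the command from the text only when the checks pass and spades.py is on the PATH.
run_hybrid() {
    check_kmers "31,51,71" || return 1
    check_inputs miseq.1.fastq miseq.2.fastq minion.fastq || return 1
    command -v spades.py >/dev/null || { echo "spades.py not found" >&2; return 1; }
    spades.py -t 4 -m 32 -k 31,51,71 --only-assembler \
              -1 miseq.1.fastq -2 miseq.2.fastq \
              --nanopore minion.fastq \
              -o hybrid_assembly
}
```

Calling `run_hybrid` from the directory that holds the reads starts the assembly only if all three files are present; otherwise it reports what is missing and returns without running SPAdes.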

Comments

  • Rahul Nayak 2405 days ago

    Use SPAdes to assemble the data. SPAdes is a Swiss Army knife of genome assembly tools, and by default it includes read correction. This takes up lots of RAM, so we are going to skip it. We will also use only 3 k-mer sizes to save time:

    
    ./SPAdes-3.6.2-Linux/bin/spades.py --only-assembler \
                       -t 4 -k 21,51,71 \
                       -1 SRR2627175_1.fastq.gz \
                       -2 SRR2627175_2.fastq.gz \
                       --nanopore minion.pass.2D.fastq \
                       -o SPAdes_hybrid &
    

    Use samtools to extract the top contig:

    
    head -n 1 SPAdes_hybrid/contigs.fasta
    samtools faidx SPAdes_hybrid/contigs.fasta
    samtools faidx SPAdes_hybrid/contigs.fasta NODE_1_length_4620446_cov_135.169_ID_22238 > single_contig.fa
    

    Finally, a quick comparison to the reference:

    
    sudo apt-get install mummer 
    curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_000913.3&rettype=fasta&retmode=text" > NC_000913.3.fa
    nucmer NC_000913.3.fa single_contig.fa
    mummerplot -png out.delta
    display out.png &
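
The contig name passed to samtools faidx in the comment above is specific to that run, so it will differ for any other assembly. A minimal sketch of reading it from the file instead is given here. It assumes, as the comment does, that the first record in contigs.fasta is the contig you want, and that headers follow the NODE_<n>_length_<len>_cov_<cov>... pattern shown above (other SPAdes versions may format headers differently). `first_contig_name` and `contig_lengths` are hypothetical helper names.

```shell
# Hypothetical helpers for a SPAdes contigs.fasta; not part of SPAdes or samtools.

# Print the name of the first record: the header line without the leading '>',
# cut at the first whitespace character.
first_contig_name() {
    head -n 1 "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Print "name<TAB>length" for every record, taking the length from the fourth
# underscore-separated field of headers such as
# NODE_1_length_4620446_cov_135.169_ID_22238.
contig_lengths() {
    grep '^>' "$1" | sed -e 's/^>//' | awk -F'_' '{ print $0 "\t" $4 }'
}

# Guarded usage: only runs when the assembly output and samtools are available.
contigs=SPAdes_hybrid/contigs.fasta
if [ -s "$contigs" ] && command -v samtools >/dev/null; then
    top=$(first_contig_name "$contigs")
    samtools faidx "$contigs"                      # build the .fai index
    samtools faidx "$contigs" "$top" > single_contig.fa
fi
```

Running `contig_lengths SPAdes_hybrid/contigs.fasta | head` first gives a quick view of how fragmented the assembly is before a single contig is pulled out for comparison.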