BOL: Related items

Monkeypox virus isolate MPXV_USA_2022_MA001, complete genome

Jit — Tue, 26 Jul 2022 06:21:07 -0500

LOCUS       ON563414              197205 bp    DNA     linear   VRL 30-MAY-2022
DEFINITION  Monkeypox virus isolate MPXV_USA_2022_MA001, complete genome.
ACCESSION   ON563414
VERSION     ON563414.3
KEYWORDS    .
SOURCE      Monkeypox virus (monkeypox)
  ORGANISM  Monkeypox virus
            Viruses; Varidnaviria; Bamfordvirae; Nucleocytoviricota;
            Pokkesviricetes; Chitovirales; Poxviridae; Chordopoxvirinae;
            Orthopoxvirus.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/nuccore/ON563414

miniasm: very fast OLC-based de novo assembler for noisy long reads

Jit — Mon, 27 Nov 2017 07:58:49 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

So far miniasm is in early development stage. It has only been tested on a dozen of PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio datasets and 3 out of 4 ONT datasets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X are used, not the whole dataset), miniasm finishes the assembly, including reads overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, the HGAP3produces a 104Mb assembly with N50 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.

Miniasm confirms that at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as the more sophisticated overlapers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.

Minimap and miniasm are ultrafast tools for (i) mapping and (ii) assembly. Designed for long, noisy reads, they do not have a correction or consensus step, and therefore the resulting assemblies are contiguous (i.e. long) but very noisy (i.e. full of errors)

We start with an all against all comparison:

minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Then we can assemble

miniasm -f reads.fq reads.paf.gz > reads.gfa

Convert GFA to FASTA:

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

And then count how many contigs:

grep ">" reads.fa | wc -l

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Address of the bookmark: https://github.com/lh3/miniasm

Run miniasm assembler on nanopore reads !

Jit — Mon, 18 Dec 2017 04:07:50 -0600

Find the detail of the reads repeats:

fq2fa ONT_A.fastq ONT_A.fasta

minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf

awk '{if($1==$6){print}}' AONT.paf > AONTself.paf

awk '$5=="-"' AONTself.paf | awk '{print $1}'| sort|uniq > invertedrepeat.list

Generated a few palindrome and repeats plots (highlighting only repeats largest than 10, 20 and 30 kb)

minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps

minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps

minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps

Assemble with miniasm:

miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta

minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf

awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf

minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps

Njoy the assembly !

Flye: Fast and accurate de novo assembler for single molecule sequencing reads

Jit — Fri, 04 May 2018 19:16:22 -0500

Flye is a de novo assembler for long and noisy reads, such as those produced by PacBio and Oxford Nanopore Technologies. The algorithm uses an A-Bruijn graph to find the overlaps between reads and does not require them to be error-corrected. After the initial assembly, Flye performs an extra repeat classification and analysis step to improve the structural accuracy of the resulting sequence. The package also includes a polisher module, which produces the final assembly of high nucleotide-level quality.

Address of the bookmark: https://github.com/fenderglass/Flye

Circlator: automated circularization of genome assemblies using long sequencing reads

Poonam Mahapatra — Tue, 15 May 2018 09:42:32 -0500

A tool to circularize genome assemblies. The algorithm and benchmarks are described in the Genome Biology manuscript. Citation: "Circlator: automated circularization of genome assemblies using long sequencing reads", Hunt et al, Genome Biology 2015 Dec 29;16(1):294. doi: 10.1186/s13059-015-0849-0. PMID: 26714481.

Address of the bookmark: http://sanger-pathogens.github.io/circlator/

pbalign: maps PacBio reads to reference sequences and saves alignments to a BAM file

Jit — Thu, 24 May 2018 10:06:52 -0500

pbalign aligns PacBio reads to reference sequences, filters aligned reads according to user-specific filtering criteria, and converts the output to either the SAM format or PacBio Compare HDF5 (e.g., .cmp.h5) format. The output Compare HDF5 file will be compatible with Quiver if --forQuiver option is specified.

Address of the bookmark: https://github.com/PacificBiosciences/pbalign

npScarf: real-time scaffolder using SPAdes contigs and Nanopore sequencing reads

Shruti Paniwala — Mon, 11 Jun 2018 05:14:57 -0500

npScarf (jsa.np.npscarf) is a program that connect contigs from a draft genomes to generate sequences that are closer to finish. These pipelines can run on a single laptop for microbial datasets. In real-time mode, it can be integrated with simple structural analyses such as gene ordering, plasmid forming.

Address of the bookmark: http://japsa.readthedocs.io/en/latest/tools/jsa.np.npscarf.html

AlignQC: A tool for assessing an alignment, and generating reports that are easy to share

Jit — Tue, 07 Aug 2018 04:41:07 -0500

Long read alignment analysis. Generate a reports on sequence alignments for mappability vs read sizes, error patterns, annotations and rarefraction curve analysis. The most basic analysis only requires a BAM file, and outputs a web browser compatible xhtml to visualize/share/store/extract analysis results.

https://f1000research.com/articles/6-100/

https://github.com/jason-weirather/AlignQC

Address of the bookmark: https://www.healthcare.uiowa.edu/labs/au/AlignQC/

Rebaler: program for conducting reference-based assemblies using long reads.

Jit — Tue, 18 Sep 2018 07:52:41 -0500

Rebaler is a program for conducting reference-based assemblies using long reads. It relies mainly on minimap2 for alignment and Racon for making consensus sequences.

I made Rebaler for bacterial genomes (specifically for the task of testing basecallers). It should in principle work for non-bacterial genomes as well, but I haven't tested it.

Address of the bookmark: https://github.com/rrwick/Rebaler

Pacasus: Correction of palindromes in long reads from PacBio and Nanopore

BioStar — Mon, 12 Nov 2018 05:26:48 -0600

Tool for detecting and cleaning PacBio / Nanopore long reads after whole genome amplification. Check the poster from the Revolutionizing Next-Generation Sequencing (2nd edition) conference in the source folder: https://github.com/swarris/Pacasus/blob/master/vib2017.pdf.

The prepint version is found on http://www.biorxiv.org/content/early/2017/08/09/173872

It uses the pyPaSWAS framework for sequence alignment (https://github.com/swarris/pyPaSWAS)

Address of the bookmark: https://github.com/swarris/Pacasus