BOL: REAPR: a universal tool for genome assembly evaluation


By Jitendra Prajapati 3621 days ago Comments (1)

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2013-14-5-r47

REAPR is a tool that evaluates the accuracy of a genome assembly using mapped paired-end reads, without the use of a reference genome for comparison. It can be used at any stage of an assembly pipeline to automatically break incorrect scaffolds and flag other errors in an assembly for manual inspection. It reports mis-assemblies and other warnings, and produces a new broken assembly based on the error calls.

The software requires as input an assembly in FASTA format and paired reads mapped to the assembly in a BAM file. Mapping information such as the fragment coverage and insert size distribution is analysed to locate mis-assemblies. REAPR works best using mapped read pairs from a large insert library (at least 1000 bp). Additionally, if a short insert Illumina library is also available, REAPR can combine this with the large insert library in order to score each base of the assembly.
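To make "fragment coverage" concrete, here is a minimal Python sketch. It is an illustration only, not REAPR's actual algorithm: it assumes each properly mapped read pair has already been reduced to a hypothetical 0-based, end-exclusive (start, end) fragment interval, counts how many fragments span each base, and reports stretches that no fragment spans.

```python
def fragment_coverage(length, fragments):
    """Count, for every base of a contig, how many fragments span it.

    `fragments` holds 0-based, end-exclusive (start, end) pairs, one per
    mapped read pair (assumed input format, not a REAPR file format).
    """
    diff = [0] * (length + 1)          # difference array: +1 at start, -1 at end
    for start, end in fragments:
        start = max(start, 0)          # clip fragments to the contig
        end = min(end, length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    coverage = []
    running = 0
    for delta in diff[:length]:        # prefix sum turns deltas into depths
        running += delta
        coverage.append(running)
    return coverage


def uncovered_regions(coverage):
    """Return end-exclusive (start, end) runs where fragment coverage is zero."""
    regions = []
    run_start = None
    for pos, depth in enumerate(coverage):
        if depth == 0 and run_start is None:
            run_start = pos
        elif depth != 0 and run_start is not None:
            regions.append((run_start, pos))
            run_start = None
    if run_start is not None:          # a run that reaches the contig end
        regions.append((run_start, len(coverage)))
    return regions
```

A region that no fragment spans is only a candidate for inspection; a real tool weighs it against the expected coverage and insert size distribution.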

http://www.sanger.ac.uk/science/tools/reapr

Comments

- Neel@neelam
Neel 2007 days ago
Reapr is a tool that tries to find explicit errors in the assembly based on incongruently mapped reads. It relies heavily on detecting span coverage that is too low, or read pairs that map too far from or too close to each other. The program will also break up contigs/scaffolds at spurious sites to form smaller (but hopefully correct) contigs. Sadly, Reapr runs pretty slowly.
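As a rough illustration of "mapping too far or too close", the Python sketch below flags read pairs whose insert size sits far from the library median. The median/MAD rule and the `n_mads` threshold are assumptions chosen for the example, not Reapr's actual test.

```python
from statistics import median


def insert_size_outliers(insert_sizes, n_mads=5.0):
    """Flag pairs whose insert size deviates strongly from the median.

    Returns (index, size, label) tuples, where label is "too_close" or
    "too_far". The threshold (n_mads median absolute deviations) is an
    illustrative choice, not a value taken from Reapr.
    """
    if not insert_sizes:
        return []
    med = median(insert_sizes)
    mad = median([abs(size - med) for size in insert_sizes])
    tolerance = n_mads * mad           # if mad is 0, any deviation is flagged
    outliers = []
    for index, size in enumerate(insert_sizes):
        if abs(size - med) > tolerance:
            label = "too_close" if size < med else "too_far"
            outliers.append((index, size, label))
    return outliers
```

Many such pairs clustered at one position in a contig are the kind of signal that suggests a mis-join there.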
Reapr is a bit fussy about contig names, but luckily it gives us a tool to check that things are OK before we proceed. The command reapr facheck <assembly.fasta> will tell you if everything is OK. In this case, no output is good output, since the command only prints potential problems with the contig names. If you run into any problems, run reapr facheck <assembly.fasta> <renamed_assembly.fasta>, and you will get an assembly file with renamed contigs.
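If you want a feel for the kind of thing a name check looks for, here is a small Python sketch. The rules it applies (a conservative "safe" character set and no duplicate names) are assumptions for illustration; they are not the rule set that reapr facheck actually uses, so treat facheck as the authority.

```python
import re
from collections import Counter

# Assumption: letters, digits, underscore, dot and hyphen are "safe".
# This is a guess at a conservative alphabet, not Reapr's real rule.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_.\-]+$")


def contig_names(lines):
    """Extract the first whitespace-delimited word of every FASTA header."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            names.append(fields[0] if fields else "")
    return names


def name_problems(names):
    """Return (name, problem) pairs; an empty list means nothing was found."""
    problems = []
    for name, count in Counter(names).items():
        if not SAFE_NAME.match(name):
            problems.append((name, "unsafe characters"))
        if count > 1:
            problems.append((name, "duplicate"))
    return problems
```

As with facheck, no output is good output: an empty list means the sketch found nothing to complain about.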
Once the names are ok, we continue:
The first thing Reapr needs is a list of all “perfect” reads. These are reads that map perfectly to the assembly being evaluated. Reapr is finicky though, and can’t use libraries with different read lengths, so you’ll have to use assemblies based on the raw data for this. Run the command reapr perfectmap to get information on how to create a perfect mapping file, and create a perfect mapping called <assembler>_perfect. This should take about a minute.
The next tool we need is reapr smaltmap, which creates a BAM file of read-pair mappings. Do the same thing you did with perfectmap and create an output file called <assembler>_smalt.bam. This should take about twenty minutes.
Finally we can use the smalt mapping and the perfect mapping to run the Reapr pipeline. Run reapr pipeline to get help on how to run it, and then run the pipeline. Store the results in reapr_<assembler>. This should take about ten minutes.
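The steps above can be collected into one dry-run plan. The Python sketch below only builds and prints the command lines; it does not run anything. The output names follow the text (<assembler>_perfect, <assembler>_smalt.bam, reapr_<assembler>), but the input file names are hypothetical and the positional argument order is an assumption based on my recollection of each subcommand's usage message. Run each subcommand with no arguments, as described above, and adjust the lists if the help says otherwise.

```python
import shlex


def reapr_commands(assembly, reads_1, reads_2, insert_size, assembler):
    """Build (but do not run) the four Reapr steps as argument lists.

    Assumption: the positional argument order below matches the usage
    messages printed by `reapr perfectmap`, `reapr smaltmap` and
    `reapr pipeline`; verify against your installed version.
    """
    perfect_prefix = f"{assembler}_perfect"
    bam = f"{assembler}_smalt.bam"
    out_dir = f"reapr_{assembler}"
    return [
        ["reapr", "facheck", assembly],
        ["reapr", "perfectmap", assembly, reads_1, reads_2,
         str(insert_size), perfect_prefix],
        ["reapr", "smaltmap", assembly, reads_1, reads_2, bam],
        ["reapr", "pipeline", assembly, bam, out_dir, perfect_prefix],
    ]


if __name__ == "__main__":
    # Hypothetical file names and insert size, for illustration only.
    plan = reapr_commands("assembly.fasta", "reads_1.fastq",
                          "reads_2.fastq", 3000, "spades")
    for command in plan:
        print(shlex.join(command))     # shell-quoted, ready to copy and check
```

Once the printed lines match the help output, each list can be passed to subprocess.run(command, check=True) to execute the steps in order.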
There are several checks you can do after running Reapr (detailed here), but for now we’ll stick to looking at the split output file, called 04.break.broken_assembly.fa. Use this file together with the original assembly to generate a quast report. How do the results look after Reapr?
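For a quick sanity check before (not instead of) the quast report, the Python sketch below computes a few contiguity numbers from FASTA lines so the original and broken assemblies can be compared side by side. The helper names are made up for this example, and it covers only contig count, total length, N50 and longest contig.

```python
def read_lengths(lines):
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0                            # empty assembly


def summarize(lengths):
    """Collect the basic contiguity numbers for one assembly."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "n50": n50(lengths),
        "longest": max(lengths, default=0),
    }
```

To use it, open each file and pass the handle in, for example summarize(read_lengths(open("reapr_<assembler>/04.break.broken_assembly.fa"))) with your own output directory substituted. Expect more contigs and a lower N50 after breaking; whether that is an improvement depends on how many of the breaks were real mis-assemblies, which is what the quast comparison helps you judge.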
