BOL: Related items

Merqury: reference-free quality and phasing assessment for genome assemblies

Jit — Sat, 06 Jun 2020 05:38:34 -0500

Genome assembly projects often have Illumina whole-genome sequencing reads available for the assembled individual. The k-mer spectrum of this read set can be used to evaluate assembly quality independently, without a high-quality reference. Merqury provides a set of tools for this purpose.
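
As a rough illustration of the idea (not Merqury's implementation, which works on meryl k-mer databases), the sketch below counts how many distinct k-mers of a toy assembly also occur in a toy read set. The function name kmer_support, the one-sequence-per-line input format and the sequences are assumptions made for this example; it only looks at the forward strand.

```shell
#!/usr/bin/env bash
# kmer_support K READS ASSEMBLY   (hypothetical helper, toy input format)
# READS and ASSEMBLY are plain-text files with one sequence per line.
# Prints "<supported> <total>": the number of distinct assembly k-mers that
# also occur in the reads, and the number of distinct assembly k-mers.
# Forward strand only; no canonical k-mers, no multiplicity.
kmer_support() {
    awk -v k="$1" '
        FNR == NR {                        # first file: index the read k-mers
            for (i = 1; i + k - 1 <= length($0); i++) in_reads[substr($0, i, k)] = 1
            next
        }
        {                                  # second file: look up assembly k-mers
            for (i = 1; i + k - 1 <= length($0); i++) {
                kmer = substr($0, i, k)
                if (kmer in asm) continue  # count each distinct k-mer once
                asm[kmer] = 1
                total++
                if (kmer in in_reads) supported++
            }
        }
        END { printf "%d %d\n", supported, total }
    ' "$2" "$3"
}

# Made-up data: two short reads and a one-contig assembly.
reads=$(mktemp); asm=$(mktemp)
printf 'ACGTAC\nGTACGG\n' > "$reads"
printf 'ACGTACGA\n'       > "$asm"
support=$(kmer_support 3 "$reads" "$asm")
echo "$support"      # prints "4 5": the assembly k-mer CGA is absent from the reads
rm -f "$reads" "$asm"
```

Assembly k-mers that never appear in the reads are candidates for assembly errors, which is the intuition behind reference-free evaluation.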

https://github.com/marbl/meryl

Address of the bookmark: https://github.com/marbl/merqury

SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies

Jit — Sat, 01 Feb 2020 13:38:49 -0600

SyRI is a pairwise whole-genome comparison tool for chromosome-level assemblies. It first finds rearranged regions and then searches for sequence differences, which are classified by whether they reside in syntenic or rearranged regions. This distinction matters because rearranged regions are inherited differently from syntenic regions.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1911-0

Address of the bookmark: https://github.com/schneebergerlab/syri

RagTag: a collection of software tools for scaffolding and improving modern genome assemblies

Jit — Sat, 11 Sep 2021 00:28:14 -0500

RagTag is a collection of software tools for scaffolding and improving modern genome assemblies. Tasks include:

Homology-based misassembly correction
Homology-based assembly scaffolding and patching
Scaffold merging

Address of the bookmark: https://github.com/malonge/RagTag

Run the miniasm assembler on nanopore reads!

Jit — Mon, 18 Dec 2017 04:07:50 -0600

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

Find the repeats within the reads:

fq2fa ONT_A.fastq ONT_A.fasta

minimap2 -xava-ont ONT_A.fasta ONT_A.fasta -t10 -X > AONT.paf

awk '{if($1==$6){print}}' AONT.paf > AONTself.paf

awk '$5 == "-" {print $1}' AONTself.paf | sort -u > invertedrepeat.list
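
To see what the two filtering steps above keep, here is a small self-contained sketch on a made-up PAF file. The read names and coordinates are invented; the only PAF columns used are 1 (query name), 5 (strand) and 6 (target name). A self hit on the "-" strand means part of the read matches the reverse complement of another part of the same read, which is why such reads end up in the inverted-repeat list.

```shell
#!/usr/bin/env bash
# Toy PAF records (made-up names and coordinates), tab-separated.
paf=$(mktemp)
printf 'readA\t5000\t100\t900\t-\treadA\t5000\t3000\t3800\t700\t800\t60\n'  > "$paf"
printf 'readA\t5000\t100\t900\t+\treadB\t6000\t200\t1000\t750\t800\t60\n' >> "$paf"
printf 'readC\t4000\t0\t500\t+\treadC\t4000\t2000\t2500\t480\t500\t60\n'   >> "$paf"
printf 'readD\t7000\t10\t600\t-\treadD\t7000\t6000\t6590\t550\t590\t60\n'  >> "$paf"
printf 'readD\t7000\t700\t900\t-\treadD\t7000\t5000\t5200\t190\t200\t60\n' >> "$paf"

# Keep self hits (query name == target name) on the reverse strand and
# list each read once; both filters run in a single awk pass.
inverted=$(awk -F'\t' '$1 == $6 && $5 == "-" { print $1 }' "$paf" | sort -u)
echo "$inverted"     # prints readA and readD, one per line
rm -f "$paf"
```

readB is dropped because it is not a self hit, and readC because its self hit is on the "+" strand (a direct repeat rather than an inverted one).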

Generate a few palindrome and repeat plots (showing only repeats larger than 30, 20 and 10 kb):

minidot -f 5 -m 30000 AONTself.paf > AONTself30000.eps
sed 's/_template_pass_FAH31515//' AONTself30000.eps > AONTself30000final.eps

minidot -f 5 -m 20000 AONTself.paf > AONTself20000.eps
sed 's/_template_pass_FAH31515//' AONTself20000.eps > AONTself20000final.eps

minidot -f 5 -m 10000 AONTself.paf > AONTself10000.eps
sed 's/_template_pass_FAH31515//' AONTself10000.eps > AONTself10000final.eps

Assemble with miniasm:

miniasm -f ONT_A.fasta AONT.paf > AONT.gfa
grep '^S' AONT.gfa |awk '{print ">"$2"\n"$3}' > AONT_miniasm.fasta
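
The GFA-to-FASTA step above relies on the layout of GFA segment records: an "S" line holds the unitig name in column 2 and its sequence in column 3. A minimal sketch on a made-up GFA file (unitig names and sequences are invented for illustration), with the record type tested inside awk instead of a separate grep:

```shell
#!/usr/bin/env bash
# Toy GFA file (made-up content): only the S records carry sequence.
gfa=$(mktemp)
printf 'H\tVN:Z:1.0\n'                              > "$gfa"
printf 'S\tutg000001l\tACGTACGTAC\tLN:i:10\n'      >> "$gfa"
printf 'a\tutg000001l\t0\tread1:1-10\t+\t10\n'     >> "$gfa"
printf 'S\tutg000002l\tTTGGCCAA\tLN:i:8\n'         >> "$gfa"
printf 'L\tutg000001l\t+\tutg000002l\t-\t3M\n'     >> "$gfa"

# Emit a FASTA header from column 2 and the sequence from column 3.
fasta=$(awk -F'\t' '$1 == "S" { print ">" $2; print $3 }' "$gfa")
echo "$fasta"
rm -f "$gfa"
```

Counting the ">" lines of the result gives the number of unitigs, a quick sanity check before the self-alignment that follows.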

minimap2 -xasm10 AONT_miniasm.fasta AONT_miniasm.fasta -t1 -X > AONT_miniasm.paf

awk '{if($1==$6){print}}' AONT_miniasm.paf > AONT_miniasm_self.paf

minidot -f 5 -m 10000 AONT_miniasm_self.paf > AONT_miniasm_self10000.eps

Enjoy the assembly!

Scalpel

Shruti Paniwala — Wed, 20 Aug 2014 02:07:58 -0500

A team from Cold Spring Harbor Laboratory has released an algorithm, called Scalpel, for finding insertions and deletions in next generation sequencing data sets. Scalpel, which is open source and available for download on SourceForge, outperformed the popular tools GATK HaplotypeCaller and SOAPindel in test runs on both simulated and real whole human exomes.

Like other indel callers, Scalpel works by performing de novo assembly of regions of interest, so that misalignment to the reference genome cannot obscure the presence of an insertion or deletion. Scalpel's innovation is to repeatedly check its assembly before comparing to the reference genome, to account for simple sequence repeats that are a regular source of error in indel calling. When Scalpel assembles an exon, it collects reads that map to that exon (including partial matches), splits them into k-mers, and creates a de Bruijn graph to span the exon; however, if it detects repeats in the map, it iteratively increases the size of the k-mers by one base until the repeats are eliminated. This ensures that the final assembly of the exon is highly accurate while minimizing compute time.
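
The idea of growing k until repeats disappear can be sketched in a few lines. This is a deliberately simplified stand-in and not Scalpel's code: it treats a "repeat" as any k-mer that occurs twice in a single sequence, whereas the real tool works on a graph built from many reads. The function name smallest_repeat_free_k and the example sequence are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# smallest_repeat_free_k SEQUENCE START_K   (hypothetical helper)
# Starting at START_K, increase k by one until no k-mer occurs twice in
# SEQUENCE, then print that k. Prints "none" if START_K exceeds the
# sequence length.
smallest_repeat_free_k() {
    awk -v seq="$1" -v k="$2" 'BEGIN {
        n = length(seq)
        for (; k <= n; k++) {
            split("", seen)              # clear the k-mer table for this k
            repeated = 0
            for (i = 1; i + k - 1 <= n; i++) {
                kmer = substr(seq, i, k)
                if (kmer in seen) { repeated = 1; break }
                seen[kmer] = 1
            }
            if (!repeated) { print k; exit }
        }
        print "none"
    }'
}

# The ACACACA stretch forces k up from 3 to 6.
smallest_repeat_free_k GATTACACACAGG 3    # prints 6
```

Stopping at the first repeat-free k keeps k-mers as short as possible, which fits the article's point about accuracy with minimal compute time.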

The Cold Spring Harbor team's validation of Scalpel, published over the weekend in Nature Methods, compares Scalpel's performance on a live whole exome against HaplotypeCaller and SOAPindel. The donor is an individual with serious neurological disorders, which may be linked to a high incidence of indels. One thousand indels from this individual's exome, called by one or more of the informatics pipelines, were selected for focused resequencing. This resequencing revealed a 77% true positive rate for Scalpel calls, dramatically better than the rates for either of the competing tools; Scalpel performed especially well with indels longer than five base pairs, a traditional weak point for indel callers.

Finally, the authors demonstrate Scalpel's use on a large set of genetic data from nearly 600 families who donated samples to the Simons Simplex Collection, a project of the Simons Foundation Autism Research Initiative. Scalpel found a very high enrichment for indels in children affected by autism, compared with their unaffected siblings, a pattern that persisted even after excluding common variants.

Picard

Neel — Fri, 29 Apr 2016 08:21:54 -0500

Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. These file formats are defined in the Hts-specs repository. See especially the SAM specification and the VCF specification.

Note that the information on this page is targeted at end-users. For developers, the source code, building instructions and implementation/development resources are available on GitHub.

The Picard toolkit is open-source under the MIT license and free for all uses.

Enjoy!

Address of the bookmark: http://broadinstitute.github.io/picard/

Andi

Jit — Fri, 13 May 2016 05:16:35 -0500

This is the andi program for estimating the evolutionary distance between closely related genomes. These distances can be used to rapidly infer phylogenies for big sets of genomes. Because andi does not compute full alignments, it is so efficient that it scales even up to thousands of bacterial genomes.

This readme covers all necessary instructions for the impatient to get andi up and running. For extensive instructions please consult the manual.

More at https://github.com/evolbioinf/andi/

Address of the bookmark: http://bioinformatics.oxfordjournals.org/content/early/2015/01/13/bioinformatics.btu815.full

Bpipe - a tool for running and managing bioinformatics pipelines

Radha Agarkar — Sat, 21 May 2016 22:42:16 -0500

Bpipe provides a platform for running big bioinformatics jobs that consist of a series of processing stages - known as 'pipelines'.

January 20th, 2016 - New! Bpipe 0.9.9 released!

Bpipe has been published in Bioinformatics! If you use Bpipe, please cite:

Sadedin S, Pope B & Oshlack A, Bpipe: A Tool for Running and Managing Bioinformatics Pipelines, Bioinformatics

Address of the bookmark: http://docs.bpipe.org/

CovCal: Coverage / Read Count Calculator

Jit — Wed, 15 Jun 2016 18:08:13 -0500

Calculate how much sequencing you need to hit a target depth of coverage (or vice versa).

Instructions: set the read length/configuration and genome size, then select what you want to calculate.

Written by Stephen Turner, based on the Lander-Waterman formula and inspired by a similar calculator written by James Hadfield. Coverage is calculated as C = LN/G and the number of reads as N = CG/L, where C = coverage (X), L = read length (bp), G = haploid genome size (bp), and N = number of reads. Source code on GitHub.
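
The two formulas are easy to check by hand. The sketch below is not the calculator's source code, just the same arithmetic in awk; the function names and the input values (150 bp reads, a 100 Mb genome) are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Lander-Waterman arithmetic from the text: C = L*N/G and N = C*G/L.

# coverage READ_LENGTH_BP NUM_READS GENOME_SIZE_BP -> depth of coverage (X)
coverage() {
    awk -v L="$1" -v N="$2" -v G="$3" 'BEGIN { printf "%.1f\n", L * N / G }'
}

# reads_needed COVERAGE_X GENOME_SIZE_BP READ_LENGTH_BP -> reads, rounded up
reads_needed() {
    awk -v C="$1" -v G="$2" -v L="$3" 'BEGIN {
        n = C * G / L
        r = int(n)
        if (r < n) r++      # round up: a partial read still has to be sequenced
        printf "%d\n", r
    }'
}

coverage 150 20000000 100000000     # prints 30.0
reads_needed 30 100000000 150       # prints 20000000
```

For paired-end data, N here counts individual reads, so the number of read pairs is half of it.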

Address of the bookmark: http://apps.bioconnector.virginia.edu/covcalc/

HivePlot

Jit — Thu, 16 Feb 2017 11:39:34 -0600

The hive plot is a rational visualization method for drawing networks. Nodes are mapped to and positioned on radially distributed linear axes — this mapping is based on network structural properties. Edges are drawn as curved links. Simple and interpretable.

The purpose of the hive plot is to establish a new baseline for visualization of large networks — a method that is both general and tunable and useful as a starting point in visually exploring network structure.

More at http://www.hiveplot.com/

Address of the bookmark: http://www.hiveplot.com/