BOL: Related items

mafTools

Radha Agarkar — Sat, 21 May 2016 22:40:21 -0500

Bioinformatics tools for dealing with Multiple Alignment Format (MAF) files.

Address of the bookmark: https://github.com/dentearl/mafTools

Mercator: Multiple Whole-Genome Orthology Map Construction

Jit — Tue, 19 May 2020 16:46:22 -0500

Whole-genome homology maps attempt to identify the evolutionary relationships between and within multiple genomes. The term "syntenic" is often used to describe regions of multiple genomes that are believed to have evolved from the same region in an ancestral genome. However, it has been pointed out that this use of the term is incorrect (Passarge et al. 1999) and thus we will use the terms "homologous", "orthologous", and "paralogous" instead. Ideally, given K genomes, we would like to identify all orthologous genomic regions as well as paralogous regions within each genome and hypothetical ancestral genome. Maps listing these relationships are extremely valuable to researchers performing comparative analyses of genomic sequence. Here we present our initial work in the form of a program called Mercator that constructs orthology maps between multiple whole genomes.

Address of the bookmark: https://www.biostat.wisc.edu/~cdewey/mercator/

Cactus: a reference-free whole-genome multiple alignment program

Jit — Mon, 12 Aug 2019 07:52:33 -0500

Cactus is a reference-free whole-genome multiple alignment program. The principal algorithms are described here: https://doi.org/10.1101/gr.123356.111

Cactus uses substantial resources. For primate-sized genomes (3 gigabases each), you should expect Cactus to use approximately 120 CPU-days of compute per genome, with about 120 GB of RAM used at peak. The requirements scale roughly quadratically, so aligning two 1-megabase bacterial genomes takes only 1.5 CPU-hours and 14 GB RAM.

Address of the bookmark: https://github.com/ComparativeGenomicsToolkit/cactus

Caretta – A multiple protein structure alignment and feature extraction suite

Rahul Nayak — Fri, 18 Dec 2020 02:09:44 -0600

Caretta is a multiple structure alignment suite meant for homologous but sequentially divergent protein families. It consistently returns accurate alignments with higher coverage than current state-of-the-art tools. Caretta is available as a GUI and as a command-line application, and it additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning.

Address of the bookmark: http://www.bioinformatics.nl/caretta/

AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references

Manisha Mishra — Tue, 17 Apr 2018 16:21:20 -0500

AlignGraph is software that extends and joins contigs or scaffolds by reassembling them with the help of a reference genome from a closely related organism.

Using AlignGraph

AlignGraph --read1 reads_1.fa --read2 reads_2.fa --contig contigs.fa --genome genome.fa --distanceLow distanceLow --distanceHigh distanceHigh --extendedContig extendedContigs.fa --remainingContig remainingContigs.fa [--kMer k --insertVariation insertVariation --coverage coverage --part p --fastMap --ratioCheck --iterativeMap --misassemblyRemoval --resume]
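The usage line above can be assembled programmatically when AlignGraph is run from a pipeline. The following is a minimal sketch, not part of AlignGraph itself: the function name build_aligngraph_command is hypothetical, the flag names are taken from the usage line, and the numeric values (100, 1000, 5) are illustrative placeholders only. It assumes that the bracketed options --kMer, --insertVariation, --coverage and --part take a value and that the remaining bracketed options are plain switches, as the usage line suggests.

```python
import shlex

# Optional flags copied from the usage line above.
# Assumption: these four take a value, the other five are on/off switches.
VALUE_OPTIONS = ("kMer", "insertVariation", "coverage", "part")
SWITCH_OPTIONS = ("fastMap", "ratioCheck", "iterativeMap",
                  "misassemblyRemoval", "resume")


def build_aligngraph_command(read1, read2, contig, genome,
                             distance_low, distance_high,
                             extended_contig, remaining_contig, **options):
    """Return the AlignGraph invocation as an argument list (hypothetical helper)."""
    unknown = set(options) - set(VALUE_OPTIONS) - set(SWITCH_OPTIONS)
    if unknown:
        raise ValueError("unknown option(s): " + ", ".join(sorted(unknown)))

    # Required arguments, in the order shown in the usage line.
    cmd = ["AlignGraph",
           "--read1", read1, "--read2", read2,
           "--contig", contig, "--genome", genome,
           "--distanceLow", str(distance_low),
           "--distanceHigh", str(distance_high),
           "--extendedContig", extended_contig,
           "--remainingContig", remaining_contig]

    # Optional arguments that carry a value.
    for name in VALUE_OPTIONS:
        if options.get(name) is not None:
            cmd += ["--" + name, str(options[name])]

    # Optional switches, added only when set to a true value.
    for name in SWITCH_OPTIONS:
        if options.get(name):
            cmd.append("--" + name)
    return cmd


# Illustrative values only; choose distances that match your own library.
example = build_aligngraph_command(
    "reads_1.fa", "reads_2.fa", "contigs.fa", "genome.fa",
    100, 1000, "extendedContigs.fa", "remainingContigs.fa",
    kMer=5, fastMap=True)
print(shlex.join(example))
```

The returned list can be passed to subprocess.run as-is, which avoids shell quoting problems with file names.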

Address of the bookmark: https://github.com/baoe/AlignGraph

pyScaf

Bulbul — Mon, 19 Dec 2016 14:20:33 -0600

pyScaf orders contigs from genome assemblies utilising several types of information:

paired-end (PE) and/or mate-pair libraries (NGS-based mode)
long reads (NGS-based mode)
synteny to the genome of some related species (reference-based mode)

Scaffolding

In reference-based mode, pyScaf uses synteny to the genome of a closely related species to order contigs and to estimate distances between adjacent contigs.

Contigs are aligned globally (end-to-end) onto reference chromosomes. The following are discarded:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only the best match of each query to the reference is kept)
matches that overlap on the reference
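The filtering described above can be sketched in a few lines of Python. This is an illustration only, not pyScaf's actual code: the Hit record, the function names, the score field and the cut-off values are all hypothetical. It also assumes half-open reference coordinates and resolves overlaps on the reference greedily in favour of the higher-scoring hit, which is one plausible reading of the description.

```python
from collections import namedtuple

# Hypothetical alignment record; identity and overlap are fractions (0..1),
# reference coordinates are half-open [ref_start, ref_end).
Hit = namedtuple("Hit", "query ref ref_start ref_end identity overlap score")


def filter_hits(hits, min_identity, min_overlap):
    """Apply the three filters and return hits ordered along the reference."""
    # 1. Drop matches that fail the identity / overlap cut-offs.
    passing = [h for h in hits
               if h.identity >= min_identity and h.overlap >= min_overlap]

    # 2. Keep only the best-scoring match of each query.
    best = {}
    for h in passing:
        if h.query not in best or h.score > best[h.query].score:
            best[h.query] = h

    # 3. Remove matches overlapping on the reference
    #    (assumption: greedy, the higher-scoring hit wins).
    kept = []
    for h in sorted(best.values(), key=lambda x: x.score, reverse=True):
        clashes = any(h.ref == k.ref
                      and h.ref_start < k.ref_end
                      and k.ref_start < h.ref_end
                      for k in kept)
        if not clashes:
            kept.append(h)
    return sorted(kept, key=lambda x: (x.ref, x.ref_start))


def adjacent_gaps(ordered):
    """Estimate the distance between neighbouring contigs on the same reference."""
    return [(a.query, b.query, b.ref_start - a.ref_end)
            for a, b in zip(ordered, ordered[1:]) if a.ref == b.ref]


hits = [
    Hit("c1", "chr1", 0, 1000, 0.95, 0.99, 950),
    Hit("c1", "chr2", 500, 1500, 0.80, 0.90, 700),    # suboptimal match of c1
    Hit("c2", "chr1", 1200, 2000, 0.92, 0.95, 740),
    Hit("c3", "chr1", 900, 1300, 0.90, 0.85, 350),    # overlaps c1 and c2
    Hit("c4", "chr1", 2500, 3000, 0.50, 0.95, 200),   # fails identity cut-off
]
ordered = filter_hits(hits, min_identity=0.8, min_overlap=0.8)
gaps = adjacent_gaps(ordered)
```

With these toy hits only c1 and c2 survive, and the estimated gap between them is the 200 bases separating their matches on chr1.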

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, correctly reconstructing all chromosomes in every run for CANPA and in nearly every run for ARATH (Figures in dropbox, CANPA table, ARATH table).
Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly beforehand (fasta2homozygous.py), as any redundancy will likely break the synteny.
pyScaf works better with contigs than with scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which break synteny.
pyScaf works very well if divergence between the reference genome and the assembled contigs is below 20% at the nucleotide level.
pyScaf deals with large rearrangements, i.e. deletions, insertions, inversions and translocations. Note, however, that this is an experimental implementation!
Consider closing gaps after scaffolding.

Address of the bookmark: https://github.com/lpryszcz/pyScaf

getopts.pl file

Jit — Fri, 15 Jun 2018 04:43:03 -0500

SSPACE-LongRead complains about a missing getopts.pl file.

To resolve this, download getopts.pl and place it in the SSPACE-LongRead folder.

Cheers :)

Opera: An optimal genome scaffolding program

Jit — Mon, 27 Nov 2017 10:18:20 -0600

Opera (Optimal Paired-End Read Assembler) is a sequence assembly program (http://en.wikipedia.org/wiki/Sequence_assembly). It uses information from paired-end or long reads to optimally order and orient contigs assembled from shotgun-sequencing reads.

An updated version called OPERA-LG has been re-engineered with features for the assembly of large and complex genomes.

Song Gao, Denis Bertrand, Burton K. H. Chia and Niranjan Nagarajan. OPERA-LG: efficient and exact scaffolding of large, repeat-rich eukaryotic genomes with performance guarantees. Genome Biology, May 2016, doi: 10.1186/s13059-016-0951-y.

Song Gao, Wing-Kin Sung, Niranjan Nagarajan. Opera: reconstructing optimal genomic scaffolds with high-throughput paired-end sequences. Journal of Computational Biology, Sept. 2011, doi:10.1089/cmb.2011.0170.

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0951-y

Address of the bookmark: https://sourceforge.net/projects/operasf/

HapSolo: An optimization approach for removing secondary haplotigs during diploid genome assembly and scaffolding.

Jit — Mon, 26 Oct 2020 21:23:36 -0500

Despite marked recent improvements in long-read sequencing technology, the assembly of diploid genomes remains a difficult task. A major obstacle is distinguishing between alternative contigs that represent highly heterozygous regions. If primary and secondary contigs are not properly identified, the primary assembly will overrepresent both the size and complexity of the genome, which complicates downstream analysis such as scaffolding.

More at https://github.com/esolares/HapSolo

Address of the bookmark: https://github.com/esolares/HapSolo

dcGOR

Martin Jones — Sat, 08 Nov 2014 14:54:28 -0600

An R package for analysing ontologies and protein domain annotations has been published in PLoS Computational Biology (http://dx.doi.org/10.1371/journal.pcbi.1003929). The package is distributed as part of CRAN (http://cran.r-project.org/package=dcGOR) and is also hosted on GitHub for version control.

The dedicated website is available at http://supfam.org/dcGOR, which also provides several demos:

1. Analysing SCOP domains: http://supfam.org/dcGOR/demo-Fang.html

2. Analysing Pfam domains: http://supfam.org/dcGOR/demo-Basu.html

3. Analysing InterPro domains: http://supfam.org/dcGOR/demo-Customisation.html