BOL: Related items

Tallymer: method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

Jit — Thu, 15 Feb 2018 10:21:02 -0600

Tallymer is based on enhanced suffix arrays, which gives much greater flexibility in the choice of k-mer size. Tallymer can process large data sets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole genome shotgun sequences from maize (B73) (total size 10⁹ bp).
Tallymer was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available.
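
To make "k-mer frequencies" concrete, here is a minimal Python sketch that counts overlapping k-mers with a plain hash table. It is not Tallymer's method (Tallymer uses enhanced suffix arrays, which scale to billions of bases and do not fix k in advance). The function name count_kmers and the toy sequence are assumptions made for illustration only.

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count every overlapping k-mer on the forward strand of a DNA sequence."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Skip windows that contain N or any other ambiguity code.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

# Toy input: three windows overlap the N and are ignored.
counts = count_kmers("ACGTACGTNACG", 3)
for kmer, n in counts.most_common():
    print(kmer, n)
```

A hash table like this is fine for small inputs, but it holds every distinct k-mer in memory, which is one reason index-based tools are preferred for whole plant genomes.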

A manual can be found here.

Address of the bookmark: https://www.zbh.uni-hamburg.de/forschung/arbeitsgruppe-genominformatik/software/tallymer.html

GenomeMapper: Simultaneous alignment of short reads against multiple genomes

Jit — Fri, 25 May 2018 09:29:44 -0500

GenomeMapper is a short read mapping tool designed for accurate read alignments. It quickly aligns millions of reads with either ungapped or gapped alignments. It can be used to align against multiple genomes simultaneously or against a single reference. If you are unsure which one is the appropriate GenomeMapper, you might want to use the latter. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2768987/

Address of the bookmark: http://1001genomes.org/software/genomemapper.html

CSBFinder: Discovery of colinear syntenic blocks across thousands of prokaryotic genomes

Jit — Wed, 24 Oct 2018 22:12:27 -0500

CSBFinder is a standalone desktop Java application with a graphical user interface that can also be run from the command line.

CSBFinder implements a novel methodology for the discovery, ranking, and taxonomic distribution analysis of colinear syntenic blocks (CSBs) - groups of genes that are consistently located close to each other, in the same order, across a wide range of taxa. CSBFinder incorporates an efficient algorithm that identifies CSBs in large genomic datasets. The discovered CSBs are ranked according to a probabilistic score and clustered to families according to their gene content similarity.
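
The idea of a colinear syntenic block can be illustrated with a small Python sketch: treat each genome as an ordered list of gene identifiers and keep every run of consecutive genes that occurs, in the same order, in enough genomes. This is a simplified stand-in, not CSBFinder's algorithm, and it does no probabilistic ranking or family clustering. The function name conserved_gene_blocks, its parameters, and the gene labels are hypothetical.

```python
from collections import defaultdict

def conserved_gene_blocks(genomes, block_len, min_genomes):
    """Return gene blocks of length block_len found, in the same order,
    in at least min_genomes genomes. Maps block -> sorted genome names."""
    seen_in = defaultdict(set)
    for name, genes in genomes.items():
        for i in range(len(genes) - block_len + 1):
            seen_in[tuple(genes[i:i + block_len])].add(name)
    return {
        block: sorted(names)
        for block, names in seen_in.items()
        if len(names) >= min_genomes
    }

# Hypothetical genomes, each an ordered list of gene family labels.
genomes = {
    "genome_a": ["g1", "g2", "g3", "g7"],
    "genome_b": ["g9", "g1", "g2", "g3"],
    "genome_c": ["g2", "g3", "g1", "g5"],
}
blocks = conserved_gene_blocks(genomes, block_len=2, min_genomes=2)
for block, names in blocks.items():
    print(block, names)
```

The sketch only matches exact, gap-free runs on one strand, so it should be read as a picture of the concept and not as a substitute for the tool.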

Address of the bookmark: https://github.com/dinasv/CSBFinder

Kevlar: Reference-free variant discovery in large eukaryotic genomes

Jit — Tue, 28 Jan 2020 03:21:53 -0600

Welcome to kevlar, software for predicting de novo genetic variants without mapping reads to a reference genome! kevlar's k-mer abundance based method calls single nucleotide variants (SNVs), multinucleotide variants (MNVs), insertion/deletion variants (indels), and structural variants (SVs) simultaneously with a single simple model.
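
As a rough illustration of a k-mer abundance based approach, the Python sketch below flags k-mers that are abundant in one sample's reads and absent from the reads of control samples (for example, a child versus both parents). It is a toy under stated assumptions, not kevlar's implementation: the function names, the min_case_abundance threshold, and the reads are all hypothetical, and reverse complements and sequencing errors are ignored.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count overlapping k-mers across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def novel_kmers(case_reads, control_read_sets, k, min_case_abundance=2):
    """k-mers seen at least min_case_abundance times in the case sample
    and never seen in any control sample."""
    case = kmer_counts(case_reads, k)
    controls = [kmer_counts(reads, k) for reads in control_read_sets]
    return {
        kmer
        for kmer, n in case.items()
        if n >= min_case_abundance and all(c[kmer] == 0 for c in controls)
    }

# Hypothetical reads: the child carries a single A>T change absent in both parents.
parent_1 = ["ACGTACGA"]
parent_2 = ["ACGTACGA"]
child = ["ACGTTCGA", "ACGTTCGA"]
novel = novel_kmers(child, [parent_1, parent_2], k=4)
print(sorted(novel))
```

Every k-mer that spans the changed base shows up as novel, which is why a single variant leaves a cluster of such k-mers that can then be grouped and examined together.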

More at https://kevlar.readthedocs.io/en/latest/

https://www.cell.com/iscience/pdf/S2589-0042(19)30259-7.pdf

Address of the bookmark: https://github.com/kevlar-dev/kevlar

Published a dataset of 363 genomes from approximately 92 percent of bird families

Jit — Thu, 19 Nov 2020 07:04:41 -0600

A research team published a dataset of 363 genomes from approximately 92 percent of bird families and showed the value of densely sampling organisms for biodiversity research. The study was jointly conducted by Chinese and international institutions and museums and was led by researchers from the Kunming Institute of Zoology (KIZ) of the Chinese Academy of Sciences (CAS). Of the 363 sequenced genomes, 267 were newly published. They were mainly taken from samples of avian tissue kept in museums around the world, enabling researchers to sequence the genomes of rare and endangered birds.

Since the first bird evolved more than 150 million years ago, its descendants have adapted to a wide variety of ecological niches, giving rise to small, hovering hummingbirds, plunge-diving pelicans and showy birds-of-paradise. More than 10,000 bird species live on the planet today, and scientists are now well on their way to capturing a full genetic picture of that diversity.

With 363 genomes complete, B10K is expanding its efforts to encompass the next stage of avian classification. In this stage the team will sequence thousands of additional genomes, attempting to represent each of the approximately 2,300 bird genera.

The genomic resource is expected to provide new insights on evolutionary processes in cross-species comparative studies and assist in efforts to protect species, according to the research findings reported as a cover story in the journal Nature.

Reference: Dense sampling of bird diversity increases power of comparative genomics, https://www.nature.com/articles/s41586-020-2873-9

Proksee: in-depth characterization and visualization of bacterial genomes

LEGE — Tue, 09 May 2023 19:38:52 -0500

Proksee is an expert system for genome assembly, annotation and visualization. To begin using Proksee, provide a complete genome sequence, sequencing reads or a CGView/Proksee map JSON file.

Address of the bookmark: https://proksee.ca/

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads

Jit — Wed, 07 Jun 2017 04:18:16 -0500

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html

Features

    Mapping position agnostic to alignment parameters.
    Consistently very high sensitivity and precision across different error profiles, rates and sequencing technologies even with default parameters.
    Circular genome handling to resolve coverage drops near ends of the genome.
    E-value.
    Meaningful mapping quality.
    Various alignment strategies (semiglobal bit-vector and Gotoh, anchored).
    Overlapping of reads for de novo assembly.
    Transcriptome mapping through internal construction of a transcriptome from a given genomic reference and a GTF file.
    ...and much more.

GraphMap is also used as an overlapper in a new de novo genome assembly project called Ra (https://github.com/mariokostelac/ra-integrate).
Ra attempts to create de novo assemblies from raw nanopore and PacBio reads without requiring error correction, for which a highly sensitive overlapper is required.

Currently, development of a new spliced-alignment mode for mapping RNA-seq reads is under way.
Description of the current effort as well as how to reach the experimental implementation can be found here: doc/rnaseq.md.

Address of the bookmark: https://github.com/isovic/graphmap

Hercules: a profile HMM-based hybrid error correction algorithm for long reads

Jit — Mon, 20 Aug 2018 14:14:11 -0500

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several kinds of studies, including de novo assembly and fusion and structural variation detection, require long and accurate reads. In such cases researchers often combine both technologies, and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better and more accurate integration of these two technologies.

Results: We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses this to correct errors in these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2), and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and highest accuracy when most of the basepairs of a long read are covered with short reads.

Availability: Hercules source code is available at https://github.com/BilkentCompGen/Hercules

Address of the bookmark: https://github.com/BilkentCompGen/Hercules

HairSplitter: assembling long reads in an unknown number of haplotypes

BioStar — Wed, 07 Dec 2022 00:13:40 -0600

Pros and cons of HairSplitter

Limitations of HairSplitter:

Not very fast: it re-polishes the whole assembly

Limited in the number of haplotypes

Strengths of HairSplitter:

Very modular, can be used with any assembler

Naive: makes no assumption on ploidy, parameter-free

Safe: won’t artificially duplicate contigs

HairSplitter splits collapsed assemblies from “draft” assemblies obtained by any means

HairSplitter can recover haplotypes and distinguish repeated elements

Only needs sequencing reads, potentially error-prone

Not really available yet (github.com/RolandFaure/HairSplitter)

https://hal.archives-ouvertes.fr/hal-03864075/file/RolandFaure_presentation_SeqBIM_2022.pdf

Address of the bookmark: https://hal.archives-ouvertes.fr/hal-03817928/document

Jabba: Hybrid Error Correction for Long Sequencing Reads

Jit — Fri, 05 Jan 2018 03:58:14 -0600

Jabba is a hybrid error correction tool to correct third generation (PacBio / ONT) sequencing data, using second generation (Illumina) data.

Input

Jabba takes as input a concatenated de Bruijn graph and a set of sequences:

The de Bruijn graph should appear in FASTA format with one entry per node; the meta information should be in the format:
>NODE
The set of sequences should be in FASTA or FASTQ format. These sequences will be corrected (e.g. PacBio reads). The corrections will be written to a file Jabba fasta.
The output is a file in FASTA format with corrections of the long reads, and additionally a file in the input format containing uncorrected reads.
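
For readers unfamiliar with the data structure, the Python sketch below builds a plain (uncompacted) de Bruijn graph from short reads: each k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is only an illustration of the concept. It does not produce Jabba's input file, whose node header fields are not reproduced here, and the function name de_bruijn_edges and the reads are assumptions.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# Hypothetical short reads that agree on "CGT" and then diverge.
graph = de_bruijn_edges(["ACGTC", "CGTA"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The node "GT" has two successors here, which is the kind of branch a hybrid corrector has to resolve when it threads a noisy long read through a graph built from accurate short reads.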

https://github.com/biointec/jabba/wiki

https://almob.biomedcentral.com/articles/10.1186/s13015-016-0075-7

Address of the bookmark: https://github.com/biointec/jabba