BOL: Related items

Tallymer: method to compute K-mer frequencies and its application to annotate large repetitive plant genomes

Jit — Thu, 15 Feb 2018 10:21:02 -0600

Tallymer is based on enhanced suffix arrays, which gives much greater flexibility in the choice of k-mer size. Tallymer can process datasets of several billion bases. We used it in a variety of applications to study the genomes of maize and other plant species. In particular, Tallymer was used to index a set of whole-genome shotgun sequences from maize (B73) (total size 10⁹ bp).
Tallymer was effective in a variety of applications to aid genome annotation in maize, despite limitations imposed by the relatively low coverage of sequence available.
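
As a rough illustration of what a k-mer frequency index is used for, the sketch below counts k-mers with a plain Python dictionary and then reports, for each position of a query sequence, how often its k-mer occurs in the indexed reads. This is not Tallymer's method (Tallymer uses enhanced suffix arrays) and the function names `kmer_counts` and `repeat_profile` are hypothetical; positions with high counts are the kind of signal one would inspect when looking for repeats.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer of length k across the given sequences (naive, in memory)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip windows that contain uncalled bases
                counts[kmer] += 1
    return counts


def repeat_profile(seq, counts, k):
    """For each k-mer start position in seq, look up its frequency in the index."""
    seq = seq.upper()
    return [counts.get(seq[i:i + k], 0) for i in range(len(seq) - k + 1)]


# Toy data standing in for shotgun reads (illustrative only).
reads = ["ACGTACGTAC", "TTACGTAA"]
index = kmer_counts(reads, 4)
profile = repeat_profile("ACGTTT", index, 4)
print(profile)
```

A dictionary like this would not scale to billions of bases, which is the problem a suffix-array index is meant to address.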

A manual can be found here.

Address of the bookmark: https://www.zbh.uni-hamburg.de/forschung/arbeitsgruppe-genominformatik/software/tallymer.html

CSBFinder: Discovery of colinear syntenic blocks across thousands of prokaryotic genomes

Jit — Wed, 24 Oct 2018 22:12:27 -0500

CSBFinder is a standalone desktop Java application with a graphical user interface that can also be run from the command line.

CSBFinder implements a novel methodology for the discovery, ranking, and taxonomic distribution analysis of colinear syntenic blocks (CSBs) - groups of genes that are consistently located close to each other, in the same order, across a wide range of taxa. CSBFinder incorporates an efficient algorithm that identifies CSBs in large genomic datasets. The discovered CSBs are ranked according to a probabilistic score and clustered to families according to their gene content similarity.
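
To make the idea of a colinear syntenic block concrete, here is a minimal sketch that finds runs of genes appearing in the same order in at least a given number of genomes. It is not CSBFinder's algorithm: it only handles exact, contiguous matches and does no scoring or clustering. The function name `conserved_blocks`, the `quorum` parameter, and the toy gene labels are all assumptions made for illustration.

```python
from collections import defaultdict


def conserved_blocks(genomes, block_len, quorum):
    """Return gene tuples of length block_len that occur, in the same order,
    in at least `quorum` genomes, mapped to the genomes that contain them."""
    seen = defaultdict(set)
    for name, genes in genomes.items():
        for i in range(len(genes) - block_len + 1):
            seen[tuple(genes[i:i + block_len])].add(name)
    return {
        block: sorted(names)
        for block, names in seen.items()
        if len(names) >= quorum
    }


# Each genome is an ordered list of gene-family labels (toy data).
genomes = {
    "g1": ["A", "B", "C", "D"],
    "g2": ["X", "A", "B", "C"],
    "g3": ["B", "C", "D", "Y"],
}
blocks = conserved_blocks(genomes, block_len=3, quorum=2)
print(blocks)
```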

Address of the bookmark: https://github.com/dinasv/CSBFinder

HaploTypo: a variant-calling pipeline for phased genomes

Jit — Thu, 19 Dec 2019 07:33:40 -0600

An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome.

Availability and Implementation

HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image.

Address of the bookmark: https://github.com/gabaldonlab/haplotypo

RefKA: A fast and efficient long-read genome assembly approach for large and complex genomes

Rahul Nayak — Fri, 01 May 2020 03:00:40 -0500

RefKA is a reference-based approach for long-read genome assembly. It relies on breaking up a closely related reference genome into bins, aligning k-mers unique to each bin with PacBio reads, and then assembling each bin in parallel, followed by a final bin-stitching step.
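
The binning idea can be sketched in a few lines: split a reference into fixed-size bins, keep only the k-mers that occur in exactly one bin, and assign each read to the bin whose unique k-mers it shares most. This is a simplified illustration rather than RefKA's implementation; it uses exact k-mer lookup instead of alignment, ignores reverse complements and k-mers spanning bin boundaries, and the names `bin_unique_kmers` and `assign_read` are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_unique_kmers(reference, bin_size, k):
    """Map each k-mer found in exactly one bin to that bin's index."""
    owners = defaultdict(set)
    for index, start in enumerate(range(0, len(reference), bin_size)):
        for kmer in kmers(reference[start:start + bin_size], k):
            owners[kmer].add(index)
    return {kmer: next(iter(b)) for kmer, b in owners.items() if len(b) == 1}


def assign_read(read, unique, k):
    """Pick the bin supported by the most bin-unique k-mers, or None."""
    votes = Counter(unique[km] for km in kmers(read, k) if km in unique)
    return votes.most_common(1)[0][0] if votes else None


reference = "ACGTTGCAGGATCCTA"  # toy reference, two bins of 8 bases
unique = bin_unique_kmers(reference, bin_size=8, k=4)
print(assign_read("GATCCT", unique, 4))
```

Once reads are grouped this way, each group could be handed to an assembler independently, which is what makes the per-bin step parallel.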

Address of the bookmark: https://github.com/AppliedBioinformatics/RefKA

Published a dataset of 363 genomes from approximately 92 percent of bird families

Jit — Thu, 19 Nov 2020 07:04:41 -0600

A research team published a dataset of 363 genomes from approximately 92 percent of bird families and showed the significance of dense species sampling for biodiversity research. The study was jointly conducted by Chinese and international institutions and museums and was led by researchers from the Kunming Institute of Zoology (KIZ) of the Chinese Academy of Sciences (CAS). Of the 363 sequenced genomes, 267 were newly published. They were mainly taken from samples of avian tissue kept in museums around the world, enabling researchers to sequence the genomes of rare and endangered birds.

Since the first bird evolved more than 150 million years ago, its descendants have adapted to a wide variety of ecological niches, giving rise to small, hovering hummingbirds, plunge-diving pelicans and showy birds-of-paradise. More than 10,000 bird species live on the planet today, and scientists are now well on their way to capturing a full genetic picture of that diversity.

With 363 genomes complete, the B10K project is expanding its efforts to encompass the next stage of avian classification. In this phase the team will sequence thousands of additional genomes, attempting to represent each of the approximately 2,300 bird genera.

The genomic resource is expected to provide new insights on evolutionary processes in cross-species comparative studies and assist in efforts to protect species, according to the research findings reported as a cover story in the journal Nature.

Reference: Dense sampling of bird diversity increases power of comparative genomics, https://www.nature.com/articles/s41586-020-2873-9

OrthoVenn3: an integrated platform for exploring and visualizing orthologous data across genomes

Abhi — Tue, 02 May 2023 00:48:28 -0500

OrthoVenn3 is a powerful tool for comparative genomics analysis, used as a web server for full genome comparisons, annotation, and evolutionary analysis of orthologous clusters across multiple species. It has already been used by thousands of users from over 60 countries.

Address of the bookmark: https://orthovenn3.bioinfotoolkits.net/

Illumina reveals first dataset of long reads

Rahul Agarwal — Fri, 23 Aug 2013 06:29:14 -0500

Using the Moleculo technology it acquired, Illumina has released a new long-read sequencing service, FastTrack Long Reads.

The average read length in the released dataset is around 8,500 base pairs. Best of all, there is little effect on the cost or quality of the data.

You can also check the following pages for publications on long reads and more:

http://www.illumina.com/services/long-read-sequencing-service.ilmn

http://blog.basespace.illumina.com/2013/07/22/first-data-set-from-fasttrack-long-reads-early-access-service/

PANDASEQ

Shruti Paniwala — Mon, 23 Jan 2017 04:54:32 -0600

PANDASEQ assembles paired-end Illumina reads into sequences, trying to correct for errors and uncalled bases. The assembler reads two files in FASTQ format with quality information. If amplification primers were used (e.g., to isolate a variable region of the 16S gene, or the constant regions around zinc finger binding residues), they can be removed from the sequence during assembly. The final sequence will correct any uncalled bases in the overlapping region using the complementary strand. When mismatches occur in the overlapping region, the base with the better quality score is chosen.
The algorithm is as follows:

1. Find the positions where the forward and reverse primers match best above the threshold and discard the ends of the sequence, including the primer.
2. Pick an overlap to maximise the probability of the forward and reverse reads having come from a single piece of DNA.
3. Identify the masking of the end of the read with the quality score B or # as done by CASAVA and adjust the probabilities in this region.
4. Construct an assembled sequence between the primers and calculate the quality.
5. Check for various constraints, including quality, length, uncalled bases, and user-supplied modules.
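
The overlap and merge steps (2 and 4) can be sketched as follows. This is a simplified illustration, not PANDASEQ's code: it scores candidate overlaps by the fraction of agreeing bases rather than by a probability model, it assumes the reverse read has already been reverse-complemented, and it leaves out primer removal, CASAVA masking, and the final constraint checks. The names `best_overlap` and `merge` are hypothetical.

```python
def best_overlap(fwd, rev, min_overlap):
    """Return the overlap length with the highest fraction of agreeing bases.
    An uncalled base (N) is treated as agreeing; ties go to the longer overlap."""
    best_len, best_score = 0, -1.0
    for n in range(min_overlap, min(len(fwd), len(rev)) + 1):
        pairs = zip(fwd[-n:], rev[:n])
        matches = sum(1 for a, b in pairs if a == b or "N" in (a, b))
        score = matches / n
        if score >= best_score:
            best_len, best_score = n, score
    return best_len


def merge(fwd, fwd_qual, rev, rev_qual, n):
    """Merge two reads that overlap by n bases. Uncalled bases are filled in
    from the other read; on a mismatch the higher-quality base is kept."""
    out = list(fwd[:-n])
    for a, qa, b, qb in zip(fwd[-n:], fwd_qual[-n:], rev[:n], rev_qual[:n]):
        if a == "N":
            out.append(b)
        elif b == "N" or a == b or qa >= qb:
            out.append(a)
        else:
            out.append(b)
    out.extend(rev[n:])
    return "".join(out)


# Toy reads with per-base quality scores (illustrative values).
fwd, fwd_qual = "AAACCGTN", [30, 30, 30, 30, 30, 10, 30, 2]
rev, rev_qual = "CCATTGG", [35] * 7
n = best_overlap(fwd, rev, min_overlap=3)
merged = merge(fwd, fwd_qual, rev, rev_qual, n)
print(n, merged)
```

In the toy data the low-quality G in the forward read is replaced by the reverse read's A, and the trailing N is filled in from the reverse read.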

Address of the bookmark: http://neufeldserver.uwaterloo.ca/~apmasell/pandaseq_man1.html

BFC: a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data

Jit — Thu, 31 May 2018 09:35:23 -0500

BFC is a standalone high-performance tool for correcting sequencing errors from Illumina sequencing data. It is specifically designed for high-coverage whole-genome human data, though it also performs well for small genomes. The BFC algorithm is a variant of the classical spectrum alignment algorithm introduced by Pevzner et al. (2001). It uses an exhaustive search to find a k-mer path through a read that minimizes a heuristic objective function jointly considering penalties on correction, quality and k-mer support. This algorithm was first implemented in the author's fermi assembler and then refined a few times in fermi, fermi2 and now in BFC. In the k-mer counting phase, BFC uses a blocked Bloom filter to filter out most singleton k-mers and keeps the rest in a hash table (Melsted and Pritchard, 2011). BFC is named for its use of a Bloom filter, though other correctors such as Lighter and Bless actually rely on Bloom filters more than BFC does.
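
The counting scheme described above, a Bloom filter that absorbs the first sighting of each k-mer and a hash table for everything seen again, can be sketched as below. This is an illustration under stated assumptions, not BFC's code: it uses a plain (not blocked) Bloom filter, and the class and function names, filter size, and number of hash functions are all hypothetical choices.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings (BFC itself uses a blocked variant)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting one hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        present = True
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                present = False
                self.bits[byte] |= 1 << bit
        return present


def count_non_singletons(reads, k, bloom):
    """Count k-mers seen at least twice; first sightings stay in the filter only."""
    counts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in counts:
                counts[kmer] += 1
            elif bloom.add(kmer):
                counts[kmer] = 2  # second sighting (or a rare false positive)
    return counts


reads = ["ACGTAC", "CGTACG", "TTTTGA"]  # toy reads
counts = count_non_singletons(reads, 4, BloomFilter(n_bits=1 << 16, n_hashes=3))
print(counts)
```

Because a Bloom filter can report false positives but never false negatives, a few true singletons may slip into the table, but no repeated k-mer is ever lost.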

Address of the bookmark: https://github.com/lh3/bfc

De novo Genome Assembly for Illumina Data

Rahul Nayak — Mon, 20 Jan 2020 05:13:29 -0600

Written and maintained by Simon Gladman - Melbourne Bioinformatics (formerly VLSCI)

Protocol Overview / Introduction

In this protocol we discuss and outline the process of de novo assembly for small to medium sized genomes.

Address of the bookmark: https://www.melbournebioinformatics.org.au/tutorials/tutorials/assembly/assembly-protocol/