BOL: Related items

CLARK: Fast, accurate and versatile sequence classification system

Jit — Sat, 15 Feb 2020 01:49:01 -0600

CLARK is a method for supervised sequence classification based on discriminative k-mers. On two distinct classification problems (see the article for details), namely (1) the taxonomic classification of metagenomic reads to known bacterial genomes, and (2) the assignment of BAC clones and transcripts to chromosome arms/centromeres (in the absence of a finished assembly for the reference genome), CLARK outperforms the best state-of-the-art methods in classification speed and precision.

Address of the bookmark: http://clark.cs.ucr.edu/Spaced/

Understanding DUMP files from the NCBI Taxonomy database!

Shruti Paniwala — Fri, 15 Jul 2022 04:29:05 -0500

*.dmp files are bcp-like dumps from the GenBank taxonomy database.

General information.

Field terminator is "\t|\t"

Row terminator is "\t|\n"

The nodes.dmp file consists of taxonomy nodes. The description for each node includes the following fields:

tax_id -- node id in GenBank taxonomy database

parent tax_id -- parent node id in GenBank taxonomy database

rank -- rank of this node (superkingdom, kingdom, ...)

embl code -- locus-name prefix; not unique

division id -- see division.dmp file

inherited div flag (1 or 0) -- 1 if node inherits division from parent

genetic code id -- see gencode.dmp file

inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent

mitochondrial genetic code id -- see gencode.dmp file

inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent

GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage

hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet

comments -- free-text comments and citations
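The field and row terminators described above are enough to split a nodes.dmp row by hand. Below is a minimal sketch in Python, not an official NCBI parser: the function names (split_dmp_line, read_nodes, lineage) are made up for illustration, only the first three fields are used, and the walk to the root assumes the root node lists itself as its own parent.

```python
from typing import Dict, Iterable, List, Tuple

FIELD_SEP = "\t|\t"  # field terminator, as documented above
ROW_END = "\t|"      # row terminator "\t|\n" once the newline is stripped


def split_dmp_line(line: str) -> List[str]:
    """Split one *.dmp row into its fields (hypothetical helper)."""
    line = line.rstrip("\n")
    if line.endswith(ROW_END):
        line = line[: -len(ROW_END)]
    return line.split(FIELD_SEP)


def read_nodes(lines: Iterable[str]) -> Dict[int, Tuple[int, str]]:
    """Map tax_id -> (parent tax_id, rank) from nodes.dmp rows."""
    nodes: Dict[int, Tuple[int, str]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = split_dmp_line(line)
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def lineage(tax_id: int, nodes: Dict[int, Tuple[int, str]]) -> List[int]:
    """Walk parent links up to the root.

    Assumption: the root node has itself as parent, so the walk stops there.
    """
    path = [tax_id]
    while nodes[tax_id][0] != tax_id:
        tax_id = nodes[tax_id][0]
        path.append(tax_id)
    return path
```

With a real dump, read_nodes can be fed an open file handle, for example `with open("nodes.dmp") as fh: nodes = read_nodes(fh)`.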

Taxonomy names file (names.dmp):

tax_id -- the id of node associated with this name

name_txt -- name itself

unique name -- the unique variant of this name if name not unique

name class -- (synonym, common name, ...)

Divisions file (division.dmp):

division id -- taxonomy database division id

division cde -- GenBank division code (three characters), e.g. BCT, PLN, VRT, MAM, PRI...

division name -- full name of the division

comments

Genetic codes file (gencode.dmp):

genetic code id -- GenBank genetic code id

abbreviation -- genetic code name abbreviation

name -- genetic code name

cde -- translation table for this genetic code

starts -- start codons for this genetic code

Deleted nodes file (delnodes.dmp):

tax_id -- deleted node id

Merged nodes file (merged.dmp):

old_tax_id -- id of the node that has been merged

new_tax_id -- id of the node that is the result of the merge
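A common use of delnodes.dmp and merged.dmp is to bring an old tax_id up to date before looking it up in nodes.dmp. The sketch below is one way to do that under stated assumptions: the helper names are hypothetical, and a single lookup is performed on the assumption that new_tax_id is already a current id (chains of merges are not followed).

```python
from typing import Dict, Iterable, List, Optional, Set


def _fields(line: str) -> List[str]:
    """Split a *.dmp row on the documented terminators."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")


def read_merged(lines: Iterable[str]) -> Dict[int, int]:
    """Map old_tax_id -> new_tax_id from merged.dmp rows."""
    merged: Dict[int, int] = {}
    for line in lines:
        if line.strip():
            old_id, new_id = _fields(line)[:2]
            merged[int(old_id)] = int(new_id)
    return merged


def read_deleted(lines: Iterable[str]) -> Set[int]:
    """Collect deleted tax_ids from delnodes.dmp rows."""
    return {int(_fields(line)[0]) for line in lines if line.strip()}


def resolve_tax_id(tax_id: int, merged: Dict[int, int],
                   deleted: Set[int]) -> Optional[int]:
    """Return the current id for tax_id, or None if the node was deleted."""
    if tax_id in deleted:
        return None
    return merged.get(tax_id, tax_id)
```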

Citations file (citations.dmp):

cit_id -- the unique id of citation

cit_key -- citation key

pubmed_id -- unique id in PubMed database (0 if not in PubMed)

medline_id -- unique id in MedLine database (0 if not in MedLine)

url -- URL associated with citation

text -- any text (usually article name and authors).

-- The following characters are escaped in this text by a backslash:

-- newline (appears as "\n"),

-- tab character ("\t"),

-- double quotes ('\"'),

-- backslash character ("\\").

taxid_list -- list of node ids separated by a single space
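The escaping rules for the text field can be undone in a single left-to-right pass, which avoids the mistake of turning an escaped backslash followed by "n" into a newline. This is an illustrative sketch only; the function names are invented here, and any escape sequence not listed above is left untouched.

```python
import re
from typing import List

# Escape sequences documented for the citations.dmp text field.
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}
_ESCAPE_RE = re.compile(r"\\(.)", re.DOTALL)


def unescape_citation_text(text: str) -> str:
    """Undo the backslash escapes in one pass over the string."""
    def replace(match: "re.Match") -> str:
        # Unknown sequences are returned unchanged.
        return _ESCAPES.get(match.group(1), match.group(0))
    return _ESCAPE_RE.sub(replace, text)


def parse_taxid_list(field: str) -> List[int]:
    """Split the space-separated taxid_list field into integers."""
    return [int(token) for token in field.split()]
```

Chained calls to str.replace would process the same characters more than once, so a single regular-expression pass is the safer choice here.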

Dengue Lineages!

LEGE — Fri, 16 Aug 2024 04:40:14 -0500

Our dengue virus lineage system splits up the current genotypes into major and minor lineages to provide additional spatiotemporal resolution and a common language to discuss important genomic diversity. A full description of the lineage system can be found at the address below.

Address of the bookmark: https://dengue-lineages.org/

PERGA: A Paired-End Read Guided De Novo Assembler for Extending Contigs Using SVM and Look Ahead Approach

Rahul Nayak — Tue, 05 Jun 2018 09:57:11 -0500

PERGA (Paired-End Reads Guided Assembler) is a novel sequence-reads-guided de novo assembly approach which adopts a greedy-like prediction strategy for assembling reads into contigs and scaffolds. Instead of using single-end reads to construct contigs, PERGA uses paired-end reads and different read overlap sizes from O ≥ Omax to Omin to resolve gaps and branches. Moreover, by constructing a decision model using a machine learning approach based on branch features, PERGA can determine the correct extension in 99.7% of cases. PERGA tries to extend the contigs by all feasible nucleotides and uses a look-ahead approach to determine whether multiple extensions are due to sequencing errors or to repeats. It also tries to separate the different repeats of nearby genomic regions to make the assembly longer and more accurate.

The simulated E. coli paired-end read data were generated using GemSim (KE McElroy, F Luciani, T Thomas. GemSim: General, Error-Model Based Simulator of Next-Generation Sequencing Data. BMC Genomics 2012, 13:74), with coverage of 50x, 60x and 100x and a read length of 100 bp, and can be downloaded from https://github.com/zhuxiao/data_PERGA.

Address of the bookmark: https://github.com/hitbio/PERGA

UniqueKmer: Generate unique KMERs for every contig in a FASTA file

Abhi — Fri, 17 Dec 2021 00:08:15 -0600

Generate unique k-mers for every contig in a FASTA file.

The unique k-mers of a contig are the k-mer keys (e.g. ATCGATCCTTAAGG) that are present in only that contig and absent from all other contigs (on both forward and reverse strands).

This tool accepts a FASTA file consisting of many contigs as input, and extracts the unique k-mers for each contig.
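The idea of a strand-aware unique k-mer can be sketched in a few lines of Python. This is an illustration of the definition above and not UniqueKMER's actual implementation: the function names are hypothetical, contigs are passed in as a plain dict rather than read from FASTA, and k-mers containing anything other than A, C, G or T are skipped as a simplifying assumption.

```python
from collections import defaultdict
from typing import Dict, Iterator, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Strand-independent key: the smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def iter_kmers(seq: str, k: int) -> Iterator[str]:
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: skip ambiguous bases
            yield kmer


def unique_kmers(contigs: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Return, per contig, the k-mers found in no other contig on either strand."""
    owners = defaultdict(set)  # canonical k-mer -> names of contigs containing it
    for name, seq in contigs.items():
        for kmer in iter_kmers(seq, k):
            owners[canonical(kmer)].add(name)
    result: Dict[str, Set[str]] = {name: set() for name in contigs}
    for name, seq in contigs.items():
        for kmer in iter_kmers(seq, k):
            if len(owners[canonical(kmer)]) == 1:
                result[name].add(kmer)
    return result
```

In the toy input `{"a": "AAAC", "b": "GTTT", "c": "ACGG"}` with k=3, contigs a and b are reverse complements of each other, so neither has a unique 3-mer, while both 3-mers of c are unique.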

The output unique k-mer file and Genome file can be used for fastv: https://github.com/OpenGene/fastv, which is an ultra-fast tool to identify and visualize microbial sequences from sequencing data.

Address of the bookmark: https://github.com/OpenGene/UniqueKMER

DAGchainer: Computing Chains of Syntenic Genes in Complete Genomes

Abhimanyu Singh — Fri, 17 Feb 2017 16:13:35 -0600

The DAGchainer software computes chains of syntenic genes found within complete genome sequences. As input, DAGchainer accepts a list of gene pairs with sequence homology along with their genome coordinates. Using a scoring function which accounts for the distance between neighboring genes on each DNA molecule and the BLAST E-value score between homologs, maximally scoring chains of ordered gene pairs are computed and reported. This algorithm can be used to mine large evolutionary conserved regions of genomes between two organisms. Alternatively, by examining colinear sets of homologous genes found within a single genome, segmental genome duplications can be revealed.
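The chaining idea can be illustrated with a small dynamic program over gene pairs. This is a deliberately simplified sketch and not DAGchainer's actual scoring function: the per-pair score (for instance something derived from the BLAST E-value), the linear gap_penalty, the max_gap cutoff and the function name best_chain are all assumptions made for the example, and only chains that increase on both molecules are considered.

```python
from typing import List, Tuple

# (gene position on molecule A, gene position on molecule B, match score)
Pair = Tuple[int, int, float]


def best_chain(pairs: List[Pair], gap_penalty: float = 1.0,
               max_gap: int = 10) -> Tuple[float, List[Pair]]:
    """Return the highest-scoring colinear chain of gene pairs (O(n^2) DP)."""
    if not pairs:
        return 0.0, []
    pairs = sorted(pairs)
    n = len(pairs)
    best = [p[2] for p in pairs]  # best chain score ending at each pair
    prev = [-1] * n               # back-pointer to the previous pair in the chain
    for j in range(n):
        xj, yj, score_j = pairs[j]
        for i in range(j):
            xi, yi, _ = pairs[i]
            dx, dy = xj - xi, yj - yi
            # Both coordinates must advance, and the pairs must be close enough.
            if dx <= 0 or dy <= 0 or max(dx, dy) > max_gap:
                continue
            candidate = best[i] + score_j - gap_penalty * (max(dx, dy) - 1)
            if candidate > best[j]:
                best[j] = candidate
                prev[j] = i
    end = max(range(n), key=best.__getitem__)
    score = best[end]
    chain: List[Pair] = []
    while end != -1:
        chain.append(pairs[end])
        end = prev[end]
    chain.reverse()
    return score, chain
```

Adjacent genes on both molecules incur no penalty in this sketch, so a run of neighboring homologous pairs scores higher than the same pairs spread far apart.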

This software distribution includes both the DAGchainer utility and a Java-based graphical interface that allows the inputs and outputs to be navigated and interrogated dynamically.

Address of the bookmark: http://dagchainer.sourceforge.net/

dipSPAdes: Assembler for Highly Polymorphic Diploid Genomes.

Jit — Wed, 20 Dec 2017 18:35:16 -0600

While the number of sequenced diploid genomes has been steadily increasing in the last few years, assembly of highly polymorphic (HP) diploid genomes remains challenging. As a result, there is a shortage of tools for assembling HP genomes from next-generation sequencing (NGS) data. The initial approaches to assembling HP genomes were proposed in the pre-NGS era and are not well suited for NGS projects. To address this limitation, we developed the first de Bruijn graph assembler, dipSPAdes, for HP genomes that significantly improves on the state-of-the-art assemblers for HP diploid genomes.

Address of the bookmark: https://www.ncbi.nlm.nih.gov/pubmed/25734602

Bactopia: a Flexible Pipeline for Complete Analysis of Bacterial Genomes

Abhi — Wed, 15 May 2024 14:36:12 -0500

Bactopia is a flexible pipeline for complete analysis of bacterial genomes. The goal of Bactopia is to process your data with a broad set of tools, so that you can get to the fun part of analyses quicker!

Bactopia can be split into two main parts: Bactopia Analysis Pipeline, and Bactopia Tools.

Bactopia Analysis Pipeline is the main per-isolate workflow in Bactopia. Built with Nextflow, input FASTQs (local or available from SRA/ENA) are put through numerous analyses including: quality control, assembly, annotation, minmer sketch queries, sequence typing, and more.

Bactopia Tools are a set of independent workflows fo

Address of the bookmark: https://github.com/bactopia/bactopia

karyoploteR: plot whole genomes with arbitrary data

Abhimanyu Singh — Fri, 02 Feb 2018 03:24:28 -0600

karyoploteR is an R package to create karyoplots, that is, representations of whole genomes with arbitrary data plotted on them. It is inspired by the R base graphics system and does not depend on other graphics packages. The aim of karyoploteR is to offer the user an easy way to plot data along the genome, giving a broad genome-wide view that facilitates the identification of genome-wide relations and distributions.

Address of the bookmark: https://bernatgel.github.io/karyoploter_tutorial/

Harvest: a suite of core-genome alignment and visualization tools

Jit — Fri, 08 Dec 2017 07:16:03 -0600

Harvest is a suite of core-genome alignment and visualization tools for quickly analyzing thousands of intraspecific microbial genomes, including variant calls, recombination detection, and phylogenetic trees.

Tools

Parsnp - Core-genome alignment and analysis
Gingr - Interactive visualization of alignments, trees and variants
HarvestTools - Archiving and postprocessing

Address of the bookmark: https://harvest.readthedocs.io/en/latest/