BOL: Related items

Understanding DUMP files from NCBI Taxonomy database !

Shruti Paniwala — Fri, 15 Jul 2022 04:29:05 -0500

*.dmp files are a bcp-like dump of the GenBank taxonomy database.

General information.

Field terminator is "\t|\t"

Row terminator is "\t|\n"

The nodes.dmp file consists of taxonomy nodes. The description for each node includes the following fields:

tax_id -- node id in GenBank taxonomy database

parent tax_id -- parent node id in GenBank taxonomy database

rank -- rank of this node (superkingdom, kingdom, ...)

embl code -- locus-name prefix; not unique

division id -- see division.dmp file

inherited div flag (1 or 0) -- 1 if node inherits division from parent

genetic code id -- see gencode.dmp file

inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent

mitochondrial genetic code id -- see gencode.dmp file

inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent

GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage

hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet

comments -- free-text comments and citations
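As a minimal sketch of how the two terminators translate into parsing code, the Python below splits one nodes.dmp row into the thirteen fields listed above. The function names and dictionary keys are illustrative choices, not part of the NCBI distribution, and the sample row is built by hand.

```python
FIELD_SEP = "\t|\t"   # field terminator
ROW_END = "\t|\n"     # row terminator

# Illustrative key names for the nodes.dmp columns, in file order.
NODE_FIELDS = [
    "tax_id", "parent_tax_id", "rank", "embl_code", "division_id",
    "inherited_div_flag", "genetic_code_id", "inherited_gc_flag",
    "mito_genetic_code_id", "inherited_mgc_flag", "genbank_hidden_flag",
    "hidden_subtree_root_flag", "comments",
]

def split_dmp_row(line):
    """Return the fields of one raw .dmp row as a list of strings."""
    if line.endswith(ROW_END):
        line = line[:-len(ROW_END)]
    return line.split(FIELD_SEP)

def parse_node_row(line):
    """Map the leading nodes.dmp columns onto NODE_FIELDS."""
    # zip() stops at the shorter input, so extra trailing columns are ignored.
    return dict(zip(NODE_FIELDS, split_dmp_row(line)))

# Hand-built sample row; the values are for illustration only.
sample = FIELD_SEP.join(
    ["1", "1", "no rank", "", "8", "0", "1", "0", "0", "0", "0", "0", ""]
) + ROW_END
node = parse_node_row(sample)
print(node["tax_id"], node["parent_tax_id"], node["rank"])
```

All values come back as strings; convert the ids and flags to integers where needed.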

Taxonomy names file (names.dmp):

tax_id -- the id of node associated with this name

name_txt -- name itself

unique name -- the unique variant of this name if name not unique

name class -- (synonym, common name, ...)
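A node usually has several rows in names.dmp, one per name class. The sketch below keeps only the rows whose name class is "scientific name" and builds a tax_id to name lookup. The helper names are hypothetical and the two sample rows are typed in by hand rather than read from a real dump.

```python
import io

def read_dmp(handle):
    """Yield each row of a .dmp file as a list of string fields."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.endswith("\t|"):      # drop the row terminator
            line = line[:-2]
        yield line.split("\t|\t")

def scientific_names(handle):
    """Return {tax_id: name_txt} for rows of class 'scientific name'."""
    names = {}
    for row in read_dmp(handle):
        tax_id, name_txt, _unique_name, name_class = row[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name_txt
    return names

# Hand-typed sample rows in names.dmp layout.
sample = io.StringIO(
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n"
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n"
)
names = scientific_names(sample)
print(names)
```

For a real file, pass an open file handle instead of the StringIO object.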

Divisions file (division.dmp):

division id -- taxonomy database division id

division cde -- GenBank division code (three characters, e.g. BCT, PLN, VRT, MAM, PRI...)

division name -- full name of the division (e.g. Bacteria, Mammals, Primates)

comments

Genetic codes file (gencode.dmp):

genetic code id -- GenBank genetic code id

abbreviation -- genetic code name abbreviation

name -- genetic code name

cde -- translation table for this genetic code

starts -- start codons for this genetic code

Deleted nodes file (delnodes.dmp):

tax_id -- deleted node id

Merged nodes file (merged.dmp):

old_tax_id -- id of the node that has been merged

new_tax_id -- id of the node that is the result of the merge
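Identifiers stored in older analyses may since have been merged or deleted. The following sketch shows one way to bring such an id up to date, assuming merged.dmp has been loaded into a dict and delnodes.dmp into a set. The function name and the ids in the example are hypothetical.

```python
def resolve_tax_id(tax_id, merged, deleted):
    """Return the current id for tax_id, or None if the node was deleted.

    merged  -- dict mapping old_tax_id to new_tax_id (from merged.dmp)
    deleted -- set of deleted tax_ids (from delnodes.dmp)
    """
    seen = set()
    # Follow merge records until an id with no further merge is reached.
    while tax_id in merged and tax_id not in seen:
        seen.add(tax_id)
        tax_id = merged[tax_id]
    if tax_id in deleted:
        return None
    return tax_id

# Toy data, not real NCBI ids.
merged = {101: 202, 202: 303}
deleted = {404}

print(resolve_tax_id(101, merged, deleted))  # merged twice
print(resolve_tax_id(404, merged, deleted))  # deleted
print(resolve_tax_id(555, merged, deleted))  # unchanged
```

The loop tolerates chained merge records, although a single lookup may be enough for dumps that always point directly at the current id.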

Citations file (citations.dmp):

cit_id -- the unique id of citation

cit_key -- citation key

pubmed_id -- unique id in PubMed database (0 if not in PubMed)

medline_id -- unique id in MedLine database (0 if not in MedLine)

url -- URL associated with citation

text -- any text (usually article name and authors).

-- The following characters are escaped in this text by a backslash:

-- newline (appears as "\n"),

-- tab character ("\t"),

-- double quotes ('\"'),

-- backslash character ("\\").

taxid_list -- list of node ids separated by a single space
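The escaping rules for the text field can be undone in a single pass. The sketch below is one possible decoder, with a hypothetical function name and a made-up citation string. It scans left to right so that an escaped backslash followed by the letter n is not mistaken for an escaped newline.

```python
import re

# Escape character (after the backslash) -> decoded character.
_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}

def unescape_citation_text(text):
    """Undo the backslash escapes used in the citations.dmp text field."""
    # Unknown escapes are left untouched.
    return re.sub(r"\\(.)", lambda m: _ESCAPES.get(m.group(1), m.group(0)), text)

def parse_taxid_list(field):
    """Split the space-separated taxid_list field into integers."""
    return [int(tok) for tok in field.split()]

# Made-up example in the escaped form stored in the file.
raw = r'Smith J.\tA \"quoted\" title\nsecond line, path C:\\new'
print(unescape_citation_text(raw))
print(parse_taxid_list("9606 10090 7227"))
```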

Lifemap

Jit — Mon, 10 Apr 2017 05:42:37 -0500

Lifemap is an interactive tool to explore the WHOLE NCBI TAXONOMY. The concept used in Lifemap is similar to the one used in cartography with tools like Google Maps© or Open Street Maps: exploring is done by zooming and panning.

The current tree contains ALL species present in NCBI taxonomy as of October 18th, 2016: 1,135,169 species including 10,545 Archaea, 418,777 Bacteria and 705,847 Eukaryotes. The Lifemap tree is updated every two weeks.

All the nodes in the tree are clickable. This displays various information and options:

The species name (and the associated common name if there is one)
The rank (kingdom, family, class, species...)
Ability to go to the corresponding node/species on NCBI web site (displayed in a new window)
Possibility to download the corresponding subtree in Newick extended format
Possibility to get the whole lineage from the current node/tip to the root of the tree.
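The lineage of a node can also be reconstructed offline from the parent pointers in nodes.dmp. The sketch below is not Lifemap's code. It assumes a dict mapping each tax_id to its parent tax_id, with the root listed as its own parent as in the NCBI dump, and it uses toy ids.

```python
def lineage(tax_id, parent_of):
    """Return the list of ids from tax_id up to the root, inclusive."""
    path = [tax_id]
    # The root is recognised as the node that is its own parent.
    while parent_of[tax_id] != tax_id:
        tax_id = parent_of[tax_id]
        path.append(tax_id)
    return path

# Toy tree: 1 is the root, 100 -> 10 -> 2 -> 1.
parent_of = {1: 1, 2: 1, 10: 2, 100: 10}
print(lineage(100, parent_of))
```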

Address of the bookmark: http://lifemap-ncbi.univ-lyon1.fr/

A Beginner's Guide to Using Kraken for Taxonomic Classification

Neel — Fri, 13 Dec 2024 11:29:03 -0600

Kraken is a popular bioinformatics tool designed for fast and accurate taxonomic classification of metagenomic sequences. Its efficiency and precision make it a go-to resource for analyzing microbial communities, including bacteria, viruses, archaea, and fungi. Whether you're new to bioinformatics or experienced in the field, Kraken is an indispensable tool for taxonomic analysis.

In this blog, we’ll walk through the basics of Kraken, from installation to running an analysis, and highlight its key features and applications.

What is Kraken?

Kraken is a sequence classification tool that assigns taxonomic labels to DNA sequences using exact k-mer matching. Each read is split into overlapping k-mers, and each k-mer is looked up in a database that maps it to the lowest common ancestor (LCA) of the reference genomes containing it. The read's label is then derived from these per-k-mer hits.
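To make the k-mer idea concrete, here is a toy illustration in Python. It is not Kraken's implementation: the lookup table is a tiny hand-made dict with hypothetical taxon labels, and the hits are simply counted rather than combined along a taxonomy tree.

```python
from collections import Counter

def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_hits(read, kmer_to_taxon, k):
    """Count how many k-mers of the read match each taxon exactly."""
    hits = Counter()
    for kmer in kmers(read, k):
        taxon = kmer_to_taxon.get(kmer)   # exact match or nothing
        if taxon is not None:
            hits[taxon] += 1
    return hits

# Hypothetical database: k-mer -> taxon label.
toy_db = {"ACG": "taxA", "CGT": "taxA", "GTA": "taxB"}

print(kmers("ACGTA", 3))
print(count_hits("ACGTA", toy_db, 3))
```

Real databases use much longer k-mers, which makes a chance exact match far less likely than in this three-letter example.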

Key Features of Kraken

Speed: Kraken processes data much faster than alignment-based methods.
Accuracy: It uses a precise k-mer matching algorithm for high-resolution taxonomic assignments.
Scalability: It can handle large metagenomic datasets.
Custom Databases: You can build and use custom databases tailored to your research needs.

Installing Kraken

System Requirements
- A Unix-based operating system (Linux/macOS).
- Sufficient computational resources for database building (RAM and disk space).
Installation Steps
- Clone the Kraken repository from GitHub:
  
  git clone https://github.com/DerrickWood/kraken.git
  cd kraken
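- Compile and install Kraken with the bundled install script, giving it the directory where the programs should be placed:
  
  ./install_kraken.sh /path/to/kraken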
- Add Kraken to your PATH for easy access:
  
  export PATH=$PATH:/path/to/kraken

Preparing a Database

Kraken requires a database of reference genomes. You can use a pre-built database or create a custom one.

Downloading a Pre-built Database
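Pre-built databases such as MiniKraken, which is lightweight and suitable for smaller datasets, are distributed as archives on the Kraken website (http://ccb.jhu.edu/software/kraken/) rather than through kraken-build. Download the archive, extract it, and pass the extracted directory to --db:

tar -xzf <minikraken_archive>.tgz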
Building a Custom Database
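A custom database needs the NCBI taxonomy plus one or more genome libraries. Download both, then build the database (your own FASTA files can be added with --add-to-library before the build step):

kraken-build --download-taxonomy --db my_database
kraken-build --download-library bacteria --db my_database
kraken-build --build --threads 4 --db my_database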

This process may take considerable time and resources, depending on the size of the database.

Running Kraken

Once the database is ready, you can classify sequences.

Basic Usage
Use the following command to classify sequences:

kraken --db my_database --threads 4 --fastq-input input_sequences.fastq --output kraken_output.txt

Key options:
- --db: Specifies the database.
- --threads: Number of threads for parallel processing.
- --fastq-input: Tells Kraken that the input is FASTQ; without it, FASTA input is assumed.
- --output: File to which the classification results are written.
Interpreting Results
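Kraken writes one tab-separated line per sequence with five columns: a C/U flag (classified or unclassified), the sequence ID, the assigned taxonomy ID (0 if unclassified), the sequence length in bp, and a space-separated list of k-mer-to-taxon (LCA) mappings. The raw output does not contain a confidence score.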
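For downstream scripting, each output line can be split into a small record. The sketch below assumes the five tab-separated columns described in the Kraken 1 manual (C/U flag, sequence ID, taxonomy ID, length, k-mer mappings such as "562:13 561:4 A:31 0:1 562:3"). The record type and function name are hypothetical, and the sample line is constructed by hand.

```python
from typing import List, NamedTuple, Tuple

class KrakenRecord(NamedTuple):
    classified: bool
    seq_id: str
    tax_id: int
    length: int
    kmer_hits: List[Tuple[str, int]]   # (taxon id or "A", k-mer count)

def parse_kraken_line(line):
    """Parse one line of raw Kraken output into a KrakenRecord."""
    flag, seq_id, tax_id, length, lca = line.rstrip("\n").split("\t")
    hits = []
    for token in lca.split():
        taxon, _, count = token.rpartition(":")
        hits.append((taxon, int(count)))
    # Assumes the length column holds a single integer.
    return KrakenRecord(flag == "C", seq_id, int(tax_id), int(length), hits)

# Hand-built sample line; read1 is a made-up sequence ID.
sample = "C\tread1\t562\t82\t562:13 561:4 A:31 0:1 562:3\n"
record = parse_kraken_line(sample)
print(record.seq_id, record.tax_id, record.kmer_hits)
```

In the fifth column, "A" marks k-mers containing an ambiguous nucleotide and taxon 0 marks k-mers not found in the database.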

Visualizing Kraken Results

Kraken results can be visualized using tools like Krona or converted to human-readable reports using kraken-report.

Generate a Report

kraken-report --db my_database kraken_output.txt > kraken_report.txt
Krona Visualization
Install Krona, extract the sequence ID and taxonomy ID columns from the Kraken output, and pass them to ktImportTaxonomy:

cut -f2,3 kraken_output.txt > krona_input.txt
ktImportTaxonomy krona_input.txt -o krona_output.html

Open the HTML file in your browser to interactively explore the taxonomic classifications.

Advanced Usage

Confidence Thresholds
In Kraken 1, confidence thresholds are applied after classification with the kraken-filter script (the --confidence option belongs to Kraken 2). Higher thresholds reduce false positives but may discard some true positives:

kraken-filter --db my_database --threshold 0.1 kraken_output.txt > kraken_filtered.txt
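The score behind such a threshold can be sketched from the k-mer mappings in the fifth output column. Following the scoring described in the Kraken manual, it is the number of k-mers mapped to the clade rooted at a label, divided by the number of k-mers without ambiguous nucleotides. The function below is an illustration with a hypothetical name, and the clade is passed in as a plain set instead of being derived from the taxonomy.

```python
def confidence(kmer_hits, clade):
    """Fraction of unambiguous k-mers that map inside the given clade.

    kmer_hits -- list of (taxon, count) pairs; taxon "A" marks k-mers
                 with an ambiguous nucleotide, "0" marks unmatched k-mers
    clade     -- set of taxon ids at or below the label being scored
    """
    in_clade = sum(n for taxon, n in kmer_hits if taxon in clade)
    unambiguous = sum(n for taxon, n in kmer_hits if taxon != "A")
    return in_clade / unambiguous if unambiguous else 0.0

# K-mer mappings "562:13 561:4 A:31 0:1 562:3"
# (562 = Escherichia coli, 561 = genus Escherichia).
hits = [("562", 13), ("561", 4), ("A", 31), ("0", 1), ("562", 3)]

print(confidence(hits, {"562"}))          # scored at the species label
print(confidence(hits, {"561", "562"}))   # scored at the genus label
```

A read whose score at its label falls below the threshold is moved up the tree to the first ancestor whose score passes, so a stricter threshold yields less specific labels.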
Paired-End Reads
For paired-end sequencing data, use:

kraken --db my_database --fastq-input --paired reads_1.fastq reads_2.fastq
Customizing K-mers
Kraken allows you to set custom k-mer and minimizer lengths during database building, using the --kmer-len and --minimizer-len options of kraken-build.

Applications of Kraken

Microbial Ecology: Characterizing microbial communities in soil, water, and the human microbiome.
Pathogen Detection: Identifying pathogens in clinical samples.
Fungal Research: Analyzing fungal diversity in metagenomic datasets.
Environmental Monitoring: Tracking microbial populations in diverse habitats.

Conclusion

Kraken is a versatile and efficient tool for taxonomic classification in metagenomics. Its speed, accuracy, and flexibility make it a favorite among bioinformaticians. By following this guide, you can set up and use Kraken to unlock insights into microbial and fungal communities, paving the way for discoveries in ecology, medicine, and biotechnology.

kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity

Rahul Nayak — Tue, 29 May 2018 08:37:53 -0500

The k-mer Weighted Inner Product.

This software implements a de novo, alignment free measure of sample genetic dissimilarity which operates upon raw sequencing reads. It is able to calculate the genetic dissimilarity between samples without any reference genome, and without assembling one.
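The general idea of a weighted inner product between k-mer count vectors can be sketched as follows. This is a simplified illustration, not kWIP's implementation: it assumes each k-mer is weighted by the Shannon entropy of the fraction of samples that contain it, counts k-mers exactly in memory, and uses hypothetical function names and toy reads.

```python
import math
from collections import Counter

def kmer_counts(reads, k):
    """Count all overlapping k-mers across a sample's reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def entropy_weights(samples):
    """Weight each k-mer by the entropy of its presence frequency."""
    presence = Counter()
    for counts in samples:
        presence.update(counts.keys())
    weights = {}
    for kmer, seen in presence.items():
        f = seen / len(samples)
        # K-mers present in every sample carry no information here.
        weights[kmer] = 0.0 if f >= 1 else -(f * math.log2(f) + (1 - f) * math.log2(1 - f))
    return weights

def weighted_inner_product(a, b, weights):
    return sum(weights[kmer] * a[kmer] * b[kmer] for kmer in a.keys() & b.keys())

def similarity(a, b, weights):
    """Normalised weighted inner product, between 0 and 1."""
    denom = math.sqrt(weighted_inner_product(a, a, weights) *
                      weighted_inner_product(b, b, weights))
    return weighted_inner_product(a, b, weights) / denom if denom else 0.0

# Toy samples: s1 and s2 share all k-mers, s3 shares none.
s1 = kmer_counts(["ACGTAC"], 3)
s2 = kmer_counts(["ACGTAC"], 3)
s3 = kmer_counts(["TTTTTT"], 3)
w = entropy_weights([s1, s2, s3])
print(similarity(s1, s2, w), similarity(s1, s3, w))
```

A dissimilarity can then be derived from the normalised similarity, for example as one minus its value.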

De novo estimates of genetic relatedness from next-gen sequencing data https://kwip.readthedocs.org

Address of the bookmark: https://github.com/kdmurray91/kwip

Free Genomics data !

BioStar — Fri, 07 Feb 2020 14:08:31 -0600

The specimens were collected by the Oxford Wytham Woods and Edinburgh Lohse lab teams. DNA extraction and sequencing were carried out by the Sanger Institute Scientific Operations teams. Assemblies were carried out by the Tree of Life team (Shane McCarthy) and colleagues in Pacific Biosciences (Jonas Korlach).

https://www.darwintreeoflife.org/an-initial-set-of-raw-genome-assemblies-from-the-darwin-tree-of-life-project/

Address of the bookmark: https://www.darwintreeoflife.org/an-initial-set-of-raw-genome-assemblies-from-the-darwin-tree-of-life-project/

Kraken: ultrafast metagenomic sequence classification using exact alignments

Jit — Mon, 27 Jun 2016 11:01:44 -0500

Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.

Krona

https://sourceforge.net/p/krona/home/krona/

Address of the bookmark: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/

Classification of SARS-CoV2 Variant !

Jit — Fri, 26 Nov 2021 12:53:12 -0600

The scientists established some guidelines for determining whether a variant is a legitimate branch of an existing lineage:

The variant should be transmitted from its original location to another "geographically distinct population"—say, another country or a province of a large and populous country.
It should differ from its ancestor by at least one nucleotide.
At least 95% of its genetic code should have been sequenced at least five times from different samples.

TACOA: Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach

Poonam Mahapatra — Tue, 15 May 2018 09:52:28 -0500

TACOA is software that can accurately predict the taxonomic origin of genomic fragments from metagenomic data sets by combining the advantages of the k-NN approach with a smoothing kernel function. TACOA can be easily installed and run on a desktop computer, allowing researchers to analyze their metagenomic sequence data locally or integrate it into their pipelines.

Address of the bookmark: http://www.cebitec.uni-bielefeld.de/index.php/2-uncategorised/99-tacoa

Understanding pango networks !

Abhi — Sat, 16 Oct 2021 14:02:36 -0500

In the vast majority of instances it is expected that Pango lineage names and designations will conform to the following rules. These rules also act as guidelines for the decisions made by the Lineage Designation Committee.

https://www.pango.network/the-pango-nomenclature-system/statement-of-nomenclature-rules/

https://www.pango.network/how-does-the-system-work/what-are-pango-lineages/

Reference paper

https://www.nature.com/articles/s41564-020-0770-5

Address of the bookmark: https://www.pango.network/the-pango-nomenclature-system/statement-of-nomenclature-rules/

Type of SSR

BioStar — Thu, 09 Mar 2023 04:35:41 -0600

SSRs (simple sequence repeats) are short DNA sequences consisting of tandem repeats of a short motif, typically 2-6 nucleotides in length. They are classified by the length and pattern of the repeated sequence and by whether non-repeat nucleotides interrupt the repeat array. The four types of SSRs are:

Perfect SSR: This is the simplest type of SSR, where the same repeat motif is present adjacent to each other without any interruption of any other nucleotide. For example, a perfect SSR with the repeat motif "CAT" would be "CATCATCATCAT", where the "CAT" sequence is repeated four times.
Imperfect SSR: This type of SSR contains repeat motifs that are interrupted by one or a few non-repeat nucleotides. For example, an imperfect SSR with the repeat motif "CAT" would be "CATCATGGCATCATCAT", where two "CAT" repeats are followed by the interruption "GG" and then three more "CAT" repeats.
Compound perfect SSR: This type of SSR contains two or more repeat motifs lying adjacent to each other, separated by no or very few intervening nucleotides. For example, a compound perfect SSR with the repeat motifs "CAT" and "GTC" would be "CATCATCATGTCGTC", where the "CAT" sequence is repeated three times, followed by the "GTC" sequence repeated twice.
Compound imperfect SSR: This type of SSR contains two or more repeat motifs interrupted by several non-repeat nucleotides. For example, a compound imperfect SSR with the repeat motifs "CAT" and "GTC" would be "CATCATCATNNNNNNNGTCGTCGTC", where the "CAT" sequence is repeated three times, interrupted by several non-repeat nucleotides, followed by the "GTC" sequence repeated three times.
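Perfect SSRs can be located with a regular expression that uses a backreference. The sketch below is a minimal example. The function name and the default thresholds (motifs of 2-6 nucleotides, at least three copies) are assumptions chosen for illustration. Imperfect and compound SSRs would need an extra step that merges neighbouring hits.

```python
import re

def find_perfect_ssrs(seq, min_motif=2, max_motif=6, min_repeats=3):
    """Return (start, motif, repeat_count) for each perfect SSR in seq."""
    # Lazy motif match so the shortest repeating unit is reported;
    # \1 then requires at least min_repeats - 1 further copies.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_motif, max_motif, min_repeats - 1)
    )
    results = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        results.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return results

print(find_perfect_ssrs("CATCATCATCAT"))
print(find_perfect_ssrs("CATCATCATGTCGTCGTC"))
```

On the second sequence the two reported hits are directly adjacent, which is the signature of a compound perfect SSR.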