BOL: Related items

k-mers tutorial - classification and taxonomy

Neel — Thu, 26 Aug 2021 10:28:43 -0500

DNA k-mers underlie much of our assembly work, and we (along with many others!) have spent a lot of time thinking about how to store k-mer graphs efficiently, discard redundant data, and count k-mers efficiently.

More recently, we've been enthused about using k-mer based similarity measures and computing and searching k-mer-based sketch search databases for all the things.

But I haven't spent too much time talking about using k-mers for taxonomy, although that has become an ahem area of interest recently, if you read into our papers a bit.

In this blog post I'm going to fix this by doing a little bit of a literature review and waxing enthusiastic about other people's work. Then in a future blog post I'll talk about how we're building off of this work in fun! and interesting? ways!

Address of the bookmark: http://ivory.idyll.org/blog/2017-something-about-kmers.html

A Beginner's Guide to Using Kraken for Taxonomic Classification

Neel — Fri, 13 Dec 2024 11:29:03 -0600

Kraken is a popular bioinformatics tool designed for fast and accurate taxonomic classification of metagenomic sequences. Its efficiency and precision make it a go-to resource for analyzing microbial communities, including bacteria, viruses, archaea, and fungi. Whether you're new to bioinformatics or experienced in the field, Kraken is an indispensable tool for taxonomic analysis.

In this blog, we’ll walk through the basics of Kraken, from installation to running an analysis, and highlight its key features and applications.

What is Kraken?

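Kraken is a sequence classification tool that assigns taxonomic labels to DNA sequences using exact k-mer matching. It builds a reference database from genomes, mapping each k-mer to the lowest common ancestor (LCA) of all genomes that contain it. To classify a read, Kraken splits it into k-mers, looks each one up in the database, and assigns the taxon best supported by the hits, which is far cheaper than aligning the read against every genome.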
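
To make the k-mer lookup idea concrete, here is a minimal Python sketch. It is not Kraken's implementation: the toy database, the taxon labels, and the simple majority vote are assumptions for illustration only. Real Kraken uses k=31 by default, stores NCBI taxonomy IDs, and scores paths through the taxonomy tree rather than taking a plain vote.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every overlapping substring of length k from seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

# Hypothetical toy database: k-mer -> taxon label.
TOY_DB = {
    "ACGT": "taxon_A",
    "CGTA": "taxon_A",
    "GTAC": "taxon_B",
}

def classify(read, db, k=4):
    """Count exact k-mer hits per taxon; return (best taxon or None, hit counts)."""
    hits = Counter(db[km] for km in kmers(read, k) if km in db)
    if not hits:
        return None, hits
    return hits.most_common(1)[0][0], hits

label, hits = classify("ACGTAC", TOY_DB)
print(label, dict(hits))
```

A read with no k-mer in the database comes back as None, which corresponds to an unclassified read.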

Key Features of Kraken

Speed: Kraken processes data much faster than alignment-based methods.
Accuracy: It uses a precise k-mer matching algorithm for high-resolution taxonomic assignments.
Scalability: It can handle large metagenomic datasets.
Custom Databases: You can build and use custom databases tailored to your research needs.

Installing Kraken

System Requirements
- A Unix-based operating system (Linux/macOS).
- Sufficient computational resources for database building (RAM and disk space).

Installation Steps
- Clone the Kraken repository from GitHub:
  
  git clone https://github.com/DerrickWood/kraken.git
  cd kraken
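- Compile and install Kraken with the install script shipped in the repository, giving it the directory where Kraken should live:
  
  ./install_kraken.sh /path/to/kraken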
- Add Kraken to your PATH for easy access:
  
  export PATH=$PATH:/path/to/kraken

Preparing a Database

Kraken requires a database of reference genomes. You can use a pre-built database or create a custom one.

Downloading a Pre-built Database
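Kraken offers pre-built databases, such as the MiniKraken database, which is lightweight and suitable for smaller datasets or machines with limited RAM. MiniKraken is not fetched with kraken-build (its --download-library option only takes genome collections such as bacteria or viruses). Instead, download the archive from the Kraken website (http://ccb.jhu.edu/software/kraken/) and unpack it. The archive name below is a placeholder:

tar -xzf minikraken.tgz

Then pass the unpacked directory to --db.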
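
Building a Custom Database
To build your own database, download the NCBI taxonomy and one or more genome libraries, then run the build step:

kraken-build --download-taxonomy --db my_database
kraken-build --download-library bacteria --db my_database
kraken-build --build --threads 4 --db my_database

To include specific genomes, add their FASTA files with kraken-build --add-to-library before the build step.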

This process may take considerable time and resources, depending on the size of the database.

Running Kraken

Once the database is ready, you can classify sequences.

Basic Usage
Use the following command to classify sequences:

kraken --db my_database --threads 4 --fastq-input input_sequences.fastq --output kraken_output.txt

Key options:
- --db: Specifies the database.
- --threads: Number of threads for parallel processing.
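- --fastq-input: Tells Kraken the input is FASTQ; without it, FASTA is assumed.

Interpreting Results
Kraken writes one tab-delimited line per sequence with five fields: a C or U flag (classified or unclassified), the sequence ID, the assigned NCBI taxonomy ID (0 if unclassified), the sequence length in bp, and a space-separated list of taxid:count pairs showing which LCA each run of k-mers mapped to. There is no confidence column; a confidence score can be derived from the last field.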
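
As a rough illustration of that layout, the following Python sketch parses one line of Kraken 1 output into named fields. The five-column layout (C/U flag, sequence ID, taxonomy ID, length, taxid:count pairs) follows the Kraken manual; the record type, the function name, and the sample line are made up for this example, and single-end reads are assumed.

```python
from typing import NamedTuple

class KrakenRecord(NamedTuple):
    classified: bool   # True for "C", False for "U"
    seq_id: str
    taxid: int         # 0 when the read is unclassified
    length: int        # read length in bp (single-end reads assumed)
    kmer_hits: list    # [(taxid or "A" as a string, k-mer count), ...]

def parse_kraken_line(line):
    """Split one tab-delimited Kraken output line into a KrakenRecord."""
    status, seq_id, taxid, length, lca = line.rstrip("\n").split("\t")
    hits = []
    for pair in lca.split():
        key, count = pair.split(":")
        hits.append((key, int(count)))
    return KrakenRecord(status == "C", seq_id, int(taxid), int(length), hits)

# Illustrative line, not real data. "A" marks k-mers with ambiguous bases,
# "0" marks k-mers that are absent from the database.
sample = "C\tread1\t562\t82\t562:13 561:4 A:31 0:1 562:3"
rec = parse_kraken_line(sample)
print(rec.seq_id, rec.taxid, rec.kmer_hits)
```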

Visualizing Kraken Results

Kraken results can be visualized using tools like Krona or converted to human-readable reports using kraken-report.

Generate a Report

kraken-report --db my_database kraken_output.txt > kraken_report.txt
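
The report is also easy to post-process. kraken-report writes six tab-separated columns: percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, a rank code (S for species), the NCBI taxonomy ID, and the indented scientific name. The sketch below pulls out the species rows; the function name and the sample report are invented for illustration.

```python
def species_rows(report_text):
    """Return (name, clade_reads, percent) for species rows, most abundant first."""
    rows = []
    for line in report_text.splitlines():
        if not line.strip():
            continue
        pct, clade_reads, _direct, rank, _taxid, name = line.split("\t", 5)
        if rank == "S":
            # Names are indented to show tree depth, so strip the padding.
            rows.append((name.strip(), int(clade_reads), float(pct)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Made-up report with 100 reads, for illustration only.
SAMPLE_REPORT = (
    " 10.00\t10\t10\tU\t0\tunclassified\n"
    " 90.00\t90\t0\t-\t1\troot\n"
    " 30.00\t30\t30\tS\t1280\t        Staphylococcus aureus\n"
    " 60.00\t60\t60\tS\t562\t        Escherichia coli\n"
)

for name, reads, pct in species_rows(SAMPLE_REPORT):
    print(f"{name}\t{reads}\t{pct}")
```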

Krona Visualization
Install Krona, extract the sequence ID and taxonomy ID columns from the Kraken output, and import them:

cut -f2,3 kraken_output.txt > krona_input.txt
ktImportTaxonomy krona_input.txt -o krona_output.html

Open the HTML file in your browser to interactively explore the taxonomic classifications.

Advanced Usage

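Confidence Thresholds
Kraken's confidence score for a read is the fraction of its k-mers that map into the clade of the assigned taxon. In Kraken 1 (the version cloned above), a threshold is applied after classification with kraken-filter; the --confidence option on the classifier itself belongs to Kraken 2. Higher thresholds reduce false positives but may miss some true positives:

kraken-filter --db my_database --threshold 0.1 kraken_output.txt > kraken_filtered.txt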
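
To show how such a threshold acts on a read, here is a small Python sketch of the scoring rule described in the Kraken manual: the score of a label is the number of k-mers mapped inside the clade rooted at that label, divided by the number of k-mers without ambiguous bases, and the label is moved up the tree until the score meets the threshold. The toy taxonomy, the function names, and the hit counts are assumptions for illustration, not Kraken's code.

```python
# Hypothetical toy taxonomy: child taxid -> parent taxid (1 is the root).
PARENT = {562: 561, 561: 543, 543: 1, 1280: 1279, 1279: 1}

def in_clade(taxid, ancestor, parent):
    """True if taxid equals ancestor or descends from it."""
    while True:
        if taxid == ancestor:
            return True
        if taxid not in parent:
            return False
        taxid = parent[taxid]

def confidence(label, kmer_hits, parent):
    """Fraction of unambiguous k-mers whose LCA lies in the clade rooted at label."""
    total = sum(n for key, n in kmer_hits if key != "A")
    inside = sum(n for key, n in kmer_hits
                 if key not in ("A", "0") and in_clade(int(key), label, parent))
    return inside / total if total else 0.0

def adjust_label(label, kmer_hits, parent, threshold):
    """Walk up the tree until the score meets the threshold; 0 means unclassified."""
    while confidence(label, kmer_hits, parent) < threshold:
        if label not in parent:
            return 0
        label = parent[label]
    return label

# Illustrative k-mer hits for one read: (taxid or "A"/"0" as a string, count).
hits = [("562", 10), ("561", 5), ("1280", 5), ("0", 20), ("A", 10)]
print(confidence(562, hits, PARENT), adjust_label(562, hits, PARENT, 0.3))
```

With these numbers the species-level label is supported by 10 of 40 unambiguous k-mers, so a threshold of 0.3 pushes the call up to the parent taxon.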

Paired-End Reads
For paired-end sequencing data in FASTQ format, use:

kraken --db my_database --fastq-input --paired reads_1.fastq reads_2.fastq
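
Customizing K-mers
Kraken allows you to set a custom k-mer length during database building with the --kmer-len option of kraken-build (the default is 31). A shorter k tolerates more sequence divergence at the cost of specificity.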

Applications of Kraken

Microbial Ecology: Characterizing microbial communities in soil, water, and the human microbiome.
Pathogen Detection: Identifying pathogens in clinical samples.
Fungal Research: Analyzing fungal diversity in metagenomic datasets.
Environmental Monitoring: Tracking microbial populations in diverse habitats.

Conclusion

Kraken is a versatile and efficient tool for taxonomic classification in metagenomics. Its speed, accuracy, and flexibility make it a favorite among bioinformaticians. By following this guide, you can set up and use Kraken to unlock insights into microbial and fungal communities, paving the way for discoveries in ecology, medicine, and biotechnology.

Lifemap

Jit — Mon, 10 Apr 2017 05:42:37 -0500

Lifemap is an interactive tool to explore the WHOLE NCBI TAXONOMY. The concept used in Lifemap is similar to the one used in cartography by tools like Google Maps or OpenStreetMap: exploring is done by zooming and panning.

The current tree contains ALL species present in NCBI taxonomy as of October 18th, 2016: 1,135,169 species including 10,545 Archaea, 418,777 Bacteria and 705,847 Eukaryotes. The Lifemap tree is updated every two weeks.

All the nodes in the tree are clickable. This displays various information and options:

The species name (and the associated common name if there is one)
The rank (kingdom, family, class, species...)
Ability to go to the corresponding node/species on NCBI web site (displayed in a new window)
Possibility to download the corresponding subtree in newick extended format
Possibility to get the whole lineage from the current node/tip to the root of the tree.

Address of the bookmark: http://lifemap-ncbi.univ-lyon1.fr/

Kraken: ultrafast metagenomic sequence classification using exact alignments

Jit — Mon, 27 Jun 2016 11:01:44 -0500

Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.

Krona

https://sourceforge.net/p/krona/home/krona/

Address of the bookmark: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/

Classification of SARS-CoV-2 Variants

Jit — Fri, 26 Nov 2021 12:53:12 -0600

The scientists established some guidelines for determining whether a variant is a legitimate branch of an existing lineage:

The variant should be transmitted from its original location to another "geographically distinct population"—say, another country or a province of a large and populous country.
It should differ from its ancestor by at least one nucleotide.
At least 95% of its genetic code should have been sequenced at least five times from different samples.

TACOA: Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach

Poonam Mahapatra — Tue, 15 May 2018 09:52:28 -0500

TACOA is software that can accurately predict the taxonomic origin of genomic fragments from metagenomic data sets by combining the advantages of the k-NN approach with a smoothing kernel function. TACOA can be easily installed and run on a desktop computer, allowing researchers to analyze their metagenomic sequence data locally or integrate it into their pipelines.

Address of the bookmark: http://www.cebitec.uni-bielefeld.de/index.php/2-uncategorised/99-tacoa

Type of SSR

BioStar — Thu, 09 Mar 2023 04:35:41 -0600

Types of SSRs (simple sequence repeats): SSRs are short DNA sequences consisting of tandem repeats of a short motif, typically 2-6 nucleotides in length. There are different types of SSRs based on the length and pattern of the repeated sequence, as well as the presence or absence of non-repeat nucleotides interrupting the repeat array. The four types of SSRs are:

Perfect SSR: This is the simplest type of SSR, where the same repeat motif is present adjacent to each other without any interruption of any other nucleotide. For example, a perfect SSR with the repeat motif "CAT" would be "CATCATCATCAT", where the "CAT" sequence is repeated four times.
Imperfect SSR: This type of SSR contains repeat motifs that are interrupted by one or a few non-repeat nucleotides. For example, an imperfect SSR with the repeat motif "CAT" would be "CATCATGGCATCATCAT", where two "CAT" repeats are followed by the interrupting "GG" and then three more "CAT" repeats.
Compound perfect SSR: This type of SSR contains two or more repeat motifs lying adjacent to each other, separated by no or very few intervening nucleotides. For example, a compound perfect SSR with the repeat motifs "CAT" and "GTC" would be "CATCATCATGTCGTC", where the "CAT" sequence is repeated three times, followed by the "GTC" sequence repeated twice.
Compound imperfect SSR: This type of SSR contains two or more repeat motifs interrupted by several non-repeat nucleotides. For example, a compound imperfect SSR with the repeat motifs "CAT" and "GTC" would be "CATCATCATNNNNNNNGTCGTCGTC", where the "CAT" sequence is repeated three times, interrupted by several non-repeat nucleotides, followed by the "GTC" sequence repeated three times.
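
For readers who want to locate such repeats in a sequence, here is a minimal Python sketch that finds perfect SSR stretches with a regular expression. The motif length range (2-6 bases) follows the description above; the minimum of three repeats, the pattern, and the function name are assumptions chosen for this example, not a standard definition.

```python
import re

# A motif of 2-6 bases (lazy, so the shortest motif is tried first) followed by
# at least two more back-to-back copies, i.e. three or more repeats in total.
PERFECT_SSR = re.compile(r"([ACGT]{2,6}?)\1{2,}")

def find_perfect_ssrs(seq):
    """Return (start, motif, repeat_count) for each perfect SSR stretch in seq."""
    results = []
    for m in PERFECT_SSR.finditer(seq.upper()):
        motif = m.group(1)
        results.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return results

# The compound imperfect example from above: two perfect stretches split by Ns.
print(find_perfect_ssrs("CATCATCATNNNNNNNGTCGTCGTC"))
```

Imperfect and compound SSRs show up as several perfect stretches close together, so classifying them would mean comparing the motifs of neighbouring hits and the gap between them.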

Look up biological numbers

Jitendra Narayan — Fri, 23 Aug 2013 03:27:45 -0500

Did you ever need to look up a number like the volume of a cell or the cellular concentration of ATP, only to find yourself spending much more time than you wanted on the Internet or flipping through textbooks - all without much success?

Well, it didn’t happen only to you. It is often surprising how difficult it can be to find concrete biological numbers, even for properties that have been measured numerous times. To help solve this for one and all, BioNumbers (the database of key numbers in molecular biology) was created. Along with the numbers, you'll find the relevant references to the original literature, useful comments, and related numbers.

To cite BioNumbers please refer to: Milo et al. Nucl. Acids Res. (2010) 38: D750-D753. When using a specific entry from the database it is highly recommended that you also specify the BioNumbers 6 digit ID, e.g. "BNID 100986, Milo et al 2010".

Address of the bookmark: http://bionumbers.hms.harvard.edu/

HistoneDB 2.0 – with variants

Anjana — Fri, 03 Jun 2016 05:06:20 -0500

This histone database can be used to explore the diversity of histone proteins and their sequence variants in many organisms. The resource was established to better understand how sequence variation may affect functional and structural features of nucleosomes. To get started, select a histone type to explore its variants.

More at http://www.ncbi.nlm.nih.gov/projects/HistoneDB2.0/index.fcgi/browse/

Address of the bookmark: http://www.ncbi.nlm.nih.gov/projects/HistoneDB2.0/index.fcgi/browse/

GOLD: Genomes Online Database

Jit — Wed, 26 Jul 2017 07:49:29 -0500

GOLD (Genomes Online Database) is a World Wide Web resource for comprehensive access to information regarding genome and metagenome sequencing projects around the world, and their associated metadata.

https://gold.jgi.doe.gov/

Address of the bookmark: https://gold.jgi.doe.gov/