BOL: Related items

Curated set of ribosomal RNA (rRNA) reference sequences (targeted loci) with verifiable organism

Rahul Nayak — Sun, 23 Feb 2020 02:17:30 -0600

NCBI has a curated set of ribosomal RNA (rRNA) reference sequences (targeted loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacterial and archaeal) and fungal samples. To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section to the nucleotide BLAST page, which makes it convenient to quickly identify source organisms. The new databases are:

*16S ribosomal RNA (Bacteria and Archaea)

*18S ribosomal RNA sequences (SSU) from Fungi type and reference material

*28S ribosomal RNA sequences (LSU) from Fungi type and reference material

*Internal transcribed spacer region (ITS) from Fungi type and reference material

You can also download these from the BLAST db FTP area. See the NCBI Insights post for more detail.

Useful links

-----------------

BLAST form with rRNA/ITS databases

BLAST db download

Targeted loci

If you have any questions or concerns, please contact blast-help@ncbi.nlm.nih.gov

IgBLAST 1.17 is now available with improved identification of productive V gene sequences

Jit — Sun, 01 Nov 2020 16:52:58 -0600

A new release of IgBLAST (1.17), the popular package for classifying and analyzing immunoglobulin and T cell receptor sequences, is now available on the web and from the FTP site. The updated package is better at identifying productive V gene sequences. We added a new field, “V frame shift”, to the IgBLAST output to indicate whether the V gene translation frame contains a frame shift. We have also updated the definition of a productive V(D)J sequence so that it now excludes sequences with internal frame shifts.

See the new IgBLAST manual on the NCBI GitHub site for more information on setting up and running IgBLAST.

If you have any questions or concerns, please email us at blast-help@ncbi.nlm.nih.gov

SVbyEye: R Package to visualize alignments between two or multiple DNA sequences

LEGE — Tue, 17 Sep 2024 02:34:57 -0500

R Package to visualize alignments between two or multiple DNA sequences including
a number of functionalities to facilitate processing of alignments in PAF format.

SVbyEye is an open-source R package to visualize and annotate sequence-to-sequence alignments, with various functionalities to process alignments in PAF format. The tool facilitates the characterization of complex structural variants (SVs) in the context of sequence homology, helping resolve the mechanisms underlying their formation. Availability and implementation: SVbyEye is available at https://github.com/daewoooo/SVbyEye.
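To make the PAF input more concrete, here is a minimal Python sketch (not part of SVbyEye, which is an R package) that reads the 12 mandatory tab-separated PAF columns into a record. The field names, the `identity` helper, and the sample line are illustrative assumptions, not taken from the package.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns, in file order (field names are our own)."""
    query_name: str
    query_length: int
    query_start: int      # 0-based, inclusive
    query_end: int        # 0-based, exclusive
    strand: str           # "+" or "-"
    target_name: str
    target_length: int
    target_start: int
    target_end: int
    matches: int          # number of matching bases
    block_length: int     # alignment block length
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse one PAF line; optional tags after column 12 are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )


def identity(rec: PafRecord) -> float:
    """Fraction of matching bases over the alignment block."""
    return rec.matches / rec.block_length


# Made-up example line: an 800 bp inverted alignment with one optional tag.
example = "queryA\t1000\t100\t900\t-\ttargetB\t5000\t2000\t2800\t760\t800\t60\ttp:A:P"
rec = parse_paf_line(example)
print(rec.query_name, rec.strand, rec.target_name, identity(rec))
```

A minus strand, as in this sample line, marks an alignment in inverted orientation between query and target.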

Author: David Porubsky

Address of the bookmark: https://github.com/daewoooo/SVbyEye

Steps to find all the repeats in the genome

Neel — Thu, 31 Aug 2023 02:43:28 -0500

To find repeats of length 2 to 9 in a genome, you can run the RepeatMasker tool with the "-length" option[0] and then parse its output with a Perl script. Here's a step-by-step guide:

Install RepeatMasker: First, you need to install RepeatMasker on your system. You can download it from the RepeatMasker website[0].

Prepare the genome sequence: Make sure you have the genome sequence in a FASTA file format. Let's assume the file is named "genome.fasta".

Run RepeatMasker: Use a command of the following form:

./RepeatMasker -pa <threads> -nolow -norna -no_is -div <divergence> -lib RepeatMaskerLib.embl -gff -xsmall -small -poly -species <species> -dir <output_dir> -length <min>-<max> genome.fasta

Replace the following placeholders with appropriate values:

<threads>: The number of processors/threads you want to use for parallel processing.
<divergence>: The divergence value for the species you are analyzing. You can find divergence values for different species in the RepeatMasker documentation[0].
<species>: The name of the species you are analyzing.
<output_dir>: The directory where you want the output files to be saved.
<min> and <max>: The minimum and maximum lengths of the repeats you want to find (in this case, 2 and 9).

Check RepeatMasker -help on your installation to confirm that every option above, in particular -length, is supported by your version.

Analyze the output: RepeatMasker will generate several output files, including a .out file. You can parse this file to extract the information you need. There is a Perl tool called "one_code_to_find_them_all.pl" that can help you parse RepeatMasker output files[0]. You can download it from the source provided.

Use the provided Perl script: Once you have the "one_code_to_find_them_all.pl" script, you can run it to conveniently parse the RepeatMasker output files. Here's an example of how to use it:

perl one_code_to_find_them_all.pl --rm <rm_out_file> --length <length_file>

Replace <rm_out_file> with the path to your RepeatMasker .out file, and <length_file> with the path to a file containing the lengths of the reference elements.

This script will generate several output files, including .log.txt and .copynumber.csv, which contain quantitative information about the identified repeat elements.
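If you only need a quick look at the RepeatMasker .out file itself, the following Python sketch reads its whitespace-separated columns and keeps Simple_repeat hits whose motif is 2 to 9 bases long. It is not a replacement for one_code_to_find_them_all.pl. The function names and the sample text are made up for illustration, and the sketch assumes simple repeats are named like "(TAACCC)n".

```python
def parse_rm_out(lines):
    """Yield one dict per annotation line of a RepeatMasker .out file.

    Header and blank lines are skipped because their first token
    is not an integer Smith-Waterman score.
    """
    for line in lines:
        parts = line.split()
        if len(parts) < 11 or not parts[0].isdigit():
            continue
        yield {
            "score": int(parts[0]),
            "perc_div": float(parts[1]),
            "query": parts[4],
            "begin": int(parts[5]),
            "end": int(parts[6]),
            "strand": parts[8],          # "+" or "C" (complement)
            "repeat": parts[9],
            "class_family": parts[10],
        }


def short_simple_repeats(records, min_len=2, max_len=9):
    """Yield (record, motif) for simple repeats with motif length in range."""
    for rec in records:
        name = rec["repeat"]
        if rec["class_family"] != "Simple_repeat":
            continue
        if not (name.startswith("(") and name.endswith(")n")):
            continue
        motif = name[1:-2]
        if min_len <= len(motif) <= max_len:
            yield rec, motif


# Made-up sample in the .out layout (two header lines, a blank line, then hits).
SAMPLE = """\
   SW   perc perc perc  query     position in query    matching   repeat         position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat     class/family   begin end (left)  ID

  463   1.3  0.6  1.7  chr1      10001 10468 (5000) +  (TAACCC)n  Simple_repeat  1 463 (0)  1
  239  29.4  1.9  1.0  chr1      11678 11780 (3688) C  MER5B      DNA/hAT-Charlie (74) 104 1  2
   21   0.0  0.0  0.0  chr1      12000 12030 (3438) +  (A)n       Simple_repeat  1 31 (0)  3
"""

records = list(parse_rm_out(SAMPLE.splitlines()))
hits = list(short_simple_repeats(records))
for rec, motif in hits:
    print(rec["query"], rec["begin"], rec["end"], motif)
```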

Remember to adjust the parameters and options according to your specific needs and the characteristics of your genome.
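As a lightweight cross-check on small sequences, perfect tandem repeats with a 2 to 9 base unit can also be found with a plain regular-expression scan. This is a different and much simpler technique than RepeatMasker: it finds only exact, uninterrupted copies. The function name, the `min_copies` threshold, and the test sequence are assumptions made for this sketch.

```python
import re


def find_short_tandem_repeats(seq, min_unit=2, max_unit=9, min_copies=3):
    """Return (start, end, unit, copies) for perfect tandem repeats.

    Coordinates are 0-based, end-exclusive. The lazy quantifier makes the
    regex try the shortest repeat unit first at each position.
    """
    seq = seq.upper()
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq):
        unit = m.group(1)
        # Skip units that are themselves repeats of a shorter unit
        # (e.g. "AA" inside a homopolymer run, or "ACAC").
        if (unit + unit).find(unit, 1) != len(unit):
            continue
        copies = (m.end() - m.start()) // len(unit)
        hits.append((m.start(), m.end(), unit, copies))
    return hits


# Made-up sequence: four copies of "AC" and four copies of "AGC".
sequence = "GGG" + "AC" * 4 + "TTT" + "AGC" * 4 + "CC"
for start, end, unit, copies in find_short_tandem_repeats(sequence):
    print(start, end, unit, copies)
```

For whole genomes, read the FASTA file one sequence at a time and run the scan per sequence. Imperfect repeats still need a dedicated tool.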

VG: variation graph data structures, interchange formats, alignment, genotyping, and variant calling methods

Jit — Tue, 28 Jan 2020 03:53:24 -0600

Variation graphs provide a succinct encoding of the sequences of many genomes. A variation graph (in particular as implemented in vg) is composed of:

nodes, which are labeled by sequences and ids
edges, which connect two nodes via either of their respective ends
paths, which describe genomes, sequence alignments, and annotations (such as gene models and transcripts) as walks through nodes connected by edges
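The three components above can be sketched as a small Python data structure. This is a toy model for illustration, not vg's implementation or file format. The class and method names are assumptions, and a path step is represented here as a (node id, is_reverse) pair so that either end of a node can be entered.

```python
from dataclasses import dataclass, field

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class VariationGraph:
    nodes: dict = field(default_factory=dict)   # node id -> sequence
    edges: set = field(default_factory=set)     # (from_step, to_step) pairs
    paths: dict = field(default_factory=dict)   # path name -> list of steps

    def add_node(self, node_id, sequence):
        self.nodes[node_id] = sequence

    def add_edge(self, from_step, to_step):
        # A step is (node_id, is_reverse).
        self.edges.add((from_step, to_step))

    def has_edge(self, from_step, to_step):
        # Walking an edge in the opposite direction flips both orientations.
        flipped = ((to_step[0], not to_step[1]), (from_step[0], not from_step[1]))
        return (from_step, to_step) in self.edges or flipped in self.edges

    def add_path(self, name, steps):
        # A path must be a walk: consecutive steps need a connecting edge.
        for a, b in zip(steps, steps[1:]):
            if not self.has_edge(a, b):
                raise ValueError(f"no edge between {a} and {b}")
        self.paths[name] = list(steps)

    def path_sequence(self, name):
        return "".join(
            reverse_complement(self.nodes[n]) if rev else self.nodes[n]
            for n, rev in self.paths[name]
        )


def fwd(node_id):
    return (node_id, False)


# A single-base variant (T or C) between two shared flanking nodes.
graph = VariationGraph()
for node_id, seq in [(1, "GAT"), (2, "T"), (3, "C"), (4, "ACA")]:
    graph.add_node(node_id, seq)
for a, b in [(1, 2), (1, 3), (2, 4), (3, 4)]:
    graph.add_edge(fwd(a), fwd(b))

graph.add_path("ref", [fwd(1), fwd(2), fwd(4)])
graph.add_path("alt", [fwd(1), fwd(3), fwd(4)])
graph.add_path("ref_reverse", [(4, True), (2, True), (1, True)])
print(graph.path_sequence("ref"), graph.path_sequence("alt"))
```

Both genomes share nodes 1 and 4 and differ only in the middle node, which is how the graph avoids storing the common sequence twice.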

Address of the bookmark: https://github.com/vgteam/vg

INC-Seq: accurate single molecule reads using nanopore sequencing

Jit — Mon, 27 Nov 2017 10:38:56 -0600

INC-Seq reads enabled accurate species-level classification, identification of species at 0.1% abundance and robust quantification of relative abundances, providing a cheap and effective approach for pathogen detection and microbiome profiling on the MinION system.

Address of the bookmark: https://github.com/CSB5/INC-Seq

ARCS: scaffolding genome drafts with linked reads

Rahul Nayak — Tue, 06 Mar 2018 16:35:26 -0600

ARCS is an application that uses the barcoding information contained in linked reads to further organize draft genomes into highly contiguous assemblies. We show how the contiguity of an ABySS H. sapiens genome assembly can be increased over six-fold using moderate-coverage (25-fold) Chromium data. We expect ARCS to have broad utility in harnessing the barcoding information contained in linked read data for connecting high-quality sequences in genome assembly drafts.

Address of the bookmark: https://github.com/bcgsc/ARCS/

HISAT2: a fast and sensitive alignment program for mapping next-generation sequencing reads

Rahul Nayak — Tue, 08 May 2018 04:27:22 -0500

HISAT2 is a fast and sensitive alignment program for mapping next-generation sequencing reads (both DNA and RNA) to a population of human genomes (as well as to a single reference genome). Based on an extension of BWT for graphs [Sirén et al. 2014], we designed and implemented a graph FM index (GFM), an original approach and its first implementation to the best of our knowledge. In addition to using one global GFM index that represents a population of human genomes, HISAT2 uses a large set of small GFM indexes that collectively cover the whole genome (each index representing a genomic region of 56 Kbp, with 55,000 indexes needed to cover the human population). These small indexes (called local indexes), combined with several alignment strategies, enable rapid and accurate alignment of sequencing reads. This new indexing scheme is called a Hierarchical Graph FM index (HGFM).
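For readers unfamiliar with FM indexes, the sketch below shows the classic linear-text version: a Burrows-Wheeler transform plus backward search to count exact occurrences of a pattern. It is not the graph FM index (GFM) or the hierarchical scheme used by HISAT2, only the textbook idea those build on. The class name and the naive construction by sorting rotations are simplifications suitable only for short strings.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for short text)."""
    text += "$"  # sentinel, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


class FMIndex:
    def __init__(self, text: str):
        self.bwt = bwt(text)
        alphabet = sorted(set(self.bwt))
        # first[c] = number of characters in the text that sort before c
        self.first = {}
        total = 0
        for c in alphabet:
            self.first[c] = total
            total += self.bwt.count(c)
        # occ[c][i] = occurrences of c in bwt[:i]
        self.occ = {c: [0] for c in alphabet}
        for ch in self.bwt:
            for c in alphabet:
                self.occ[c].append(self.occ[c][-1] + (ch == c))

    def count(self, pattern: str) -> int:
        """Count exact occurrences of pattern using backward search."""
        lo, hi = 0, len(self.bwt)  # half-open range of matching suffixes
        for ch in reversed(pattern):
            if ch not in self.first:
                return 0
            lo = self.first[ch] + self.occ[ch][lo]
            hi = self.first[ch] + self.occ[ch][hi]
            if lo >= hi:
                return 0
        return hi - lo


index = FMIndex("GATTACAGATTACA")
print(index.count("GATTACA"), index.count("ACAG"), index.count("TAG"))
```

Backward search touches the index once per pattern character, so query time depends on the read length rather than the genome length. Production indexes also sample the occurrence table instead of storing it in full.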

More at https://ccb.jhu.edu/software/hisat2/index.shtml

Address of the bookmark: https://github.com/infphilo/hisat2

Circlator: automated circularization of genome assemblies using long sequencing reads

Poonam Mahapatra — Tue, 15 May 2018 09:42:32 -0500

A tool to circularize genome assemblies. The algorithm and benchmarks are described in the Genome Biology manuscript. Citation: "Circlator: automated circularization of genome assemblies using long sequencing reads", Hunt et al, Genome Biology 2015 Dec 29;16(1):294. doi: 10.1186/s13059-015-0849-0. PMID: 26714481.

Address of the bookmark: http://sanger-pathogens.github.io/circlator/

Cerulean: A hybrid assembly using high throughput short and long reads

Rahul Nayak — Tue, 05 Jun 2018 10:10:15 -0500

Cerulean extends contigs assembled using short read datasets like Illumina paired-end reads using long reads like PacBio RS long reads. Cerulean v0.1 has been implemented with bacterial genomes in mind. The method is fully described in Deshpande, V., Fung, E. D., Pham, S., & Bafna, V. (2013). Cerulean: A hybrid assembly using high throughput short and long reads. arXiv preprint arXiv:1307.7933. http://arxiv.org/abs/1307.7933

Address of the bookmark: https://sourceforge.net/projects/ceruleanassembler/