BOL: Related items

gVolante: Completeness Assessment of Genome/Transcriptome Sequences

Neel — Sun, 13 Jan 2019 07:03:25 -0600

gVolante is a web server that provides (i) on-demand completeness assessment of sequence sets using the previously developed pipelines CEGMA and BUSCO, and (ii) browsing of pre-computed completeness scores for publicly available data in its database section.

Address of the bookmark: https://gvolante.riken.jp/analysis.html

GenBank release 257.0 is now available!

Neel — Wed, 23 Aug 2023 00:23:23 -0500

GenBank release 257.0 (8/15/2023) is now available on the NCBI FTP site. This release has 25.10 trillion bases and 3.69 billion records. Learn more: https://ncbiinsights.ncbi.nlm.nih.gov/2023/08/21/genbank-release-257/

The current release has:

246,119,175 traditional records containing 2,112,058,517,945 base pairs of sequence data
2,631,493,489 WGS records containing 22,294,446,104,543 base pairs of sequence data
686,271,945 bulk-oriented TSA records containing 646,176,166,908 base pairs of sequence data
124,421,006 bulk-oriented TLS records containing 48,289,699,026 base pairs of sequence data
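
As a quick sanity check, the four division counts listed above can be summed to confirm the headline totals. This is only an arithmetic sketch; the dictionary keys are shorthand labels chosen here, and the numbers are copied from the list above.

```python
# Record and base-pair counts copied from the release summary above.
# The division labels are shorthand chosen for this example.
divisions = {
    "traditional": (246_119_175, 2_112_058_517_945),
    "WGS": (2_631_493_489, 22_294_446_104_543),
    "TSA": (686_271_945, 646_176_166_908),
    "TLS": (124_421_006, 48_289_699_026),
}

total_records = sum(records for records, _ in divisions.values())
total_bases = sum(bases for _, bases in divisions.values())

print(f"{total_records:,} records (~{total_records / 1e9:.2f} billion)")
print(f"{total_bases:,} bases (~{total_bases / 1e12:.2f} trillion)")
```

The sums come to 3,688,305,615 records and 25,100,970,488,422 bases, which round to the 3.69 billion and 25.10 trillion quoted in the announcement.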

pbalign: maps PacBio reads to reference sequences and saves alignments to a BAM file

Jit — Thu, 24 May 2018 10:06:52 -0500

pbalign aligns PacBio reads to reference sequences, filters aligned reads according to user-specified filtering criteria, and converts the output to either the SAM format or the PacBio Compare HDF5 (e.g., .cmp.h5) format. The output Compare HDF5 file will be compatible with Quiver if the --forQuiver option is specified.

Address of the bookmark: https://github.com/PacificBiosciences/pbalign

COSINE: non-seeding method for mapping long noisy sequences

Jit — Fri, 26 Oct 2018 00:41:59 -0500

Third-generation sequencing (TGS) technologies are highly promising, but the long and noisy reads they produce are difficult to align using existing algorithms. Here, we present COSINE, a conceptually new method designed specifically for aligning long reads contaminated by a high level of errors.

Address of the bookmark: https://github.com/SUwonglab/COSINE

Cogent: a tool for reconstructing the coding genome using high-quality full-length transcriptome sequences.

Jit — Tue, 18 Jun 2019 05:33:04 -0500

Cogent is a tool that identifies gene families and reconstructs the coding genome using high-quality transcriptome data without a reference genome, and can be used to check assemblies for the presence of these known coding sequences.

It is designed to be used on Iso-Seq data and in cases where there is no reference genome or the reference genome is highly incomplete.

See a recent presentation on Cogent being applied to the Cuttlefish Iso-Seq data.

Cogent preliminary draft paper (updated 2016Dec version), Supplementary

Please see wiki for details on usage.

Address of the bookmark: https://github.com/Magdoll/Cogent

Curated set of ribosomal RNA (rRNA) reference sequences (targeted loci) with verifiable organism

Rahul Nayak — Sun, 23 Feb 2020 02:17:30 -0600

NCBI has a curated set of ribosomal RNA (rRNA) reference sequences (targeted loci) with verifiable organism sources and current names. This set is critical for correctly identifying and classifying prokaryotic (bacteria and archaea) and fungal samples. To provide easy access to these sequences, we recently added a separate rRNA/ITS databases section on the nucleotide BLAST page for these targeted sequences that makes it convenient to quickly identify source organisms. The new databases are:

* 16S ribosomal RNA (Bacteria and Archaea)

* 18S ribosomal RNA sequences (SSU) from Fungi type and reference material

* 28S ribosomal RNA sequences (LSU) from Fungi type and reference material

* Internal transcribed spacer region (ITS) from Fungi type and reference material

You can also download these from the BLAST db FTP area. See the NCBI Insights post for more detail.

Useful links

-----------------

BLAST form with rRNA/ITS databases

BLAST db download

Targeted loci

If you have any questions or concerns, please contact blast-help@ncbi.nlm.nih.gov

kebabs: package provides functionality for kernel based analysis of biological sequences via Support Vector Machine (SVM) based methods

Rahul Nayak — Fri, 04 Mar 2022 00:14:11 -0600

The kebabs package provides functionality for kernel-based analysis of biological sequences via Support Vector Machine (SVM) based methods. Biological sequences include DNA, RNA, and amino acid (AA) sequences. Sequence kernels define similarity measures between sequences. The package implements some of the most important kernels for sequence analysis in a flexible and efficient way, and extends the standard position-independent functionality of these kernels so that the position of patterns in the sequences can be taken into account in the similarity measure.
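
To make the idea of a sequence kernel concrete, here is a minimal Python sketch of a k-mer spectrum kernel (a well-known position-independent sequence kernel) next to a simple position-aware variant. It does not use or reproduce the kebabs API (kebabs is an R package); the function names and the exact position-aware definition are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k=3, normalize=True):
    """Position-independent similarity: dot product of k-mer count vectors."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    dot = sum(n * counts_b[kmer] for kmer, n in counts_a.items())
    if not normalize:
        return dot
    norm = sqrt(sum(n * n for n in counts_a.values())
                * sum(n * n for n in counts_b.values()))
    return dot / norm if norm else 0.0


def position_aware_kernel(a, b, k=3):
    """Illustrative position-aware variant (an assumption, not kebabs' definition):
    a k-mer only counts if it occurs at the same position in both sequences."""
    shared_length = min(len(a), len(b))
    return sum(1 for i in range(shared_length - k + 1) if a[i:i + k] == b[i:i + k])


# The two sequences share the 3-mers ACG and TAC, but at different positions.
print(spectrum_kernel("ACGTAC", "TACACG", normalize=False))
print(position_aware_kernel("ACGTAC", "TACACG"))
```

In this toy case the position-independent kernel sees two shared 3-mers while the position-aware variant sees none, which is the kind of distinction the paragraph above describes.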

http://www.bioinf.jku.at/software/kebabs/

http://bioconductor.org/packages/release/bioc/vignettes/kebabs/inst/doc/kebabs.pdf

Address of the bookmark: http://www.bioinf.jku.at/software/kebabs/

doubletrouble: identify duplicated genes from whole-genome protein sequences and classify

LEGE — Tue, 05 Mar 2024 00:23:49 -0600

doubletrouble aims to identify duplicated genes from whole-genome protein sequences and classify them based on their modes of duplication. The duplication modes are (i) segmental duplication (SD); (ii) tandem duplication (TD); (iii) proximal duplication (PD); (iv) transposed duplication (TRD); and (v) dispersed duplication (DD). Transposon-derived duplicates (TRD) can be further subdivided into rTRD (retrotransposon-derived duplication) and dTRD (DNA transposon-derived duplication). If users want a simpler classification scheme, duplicates can also be classified into SD- and SSD-derived (small-scale duplication) gene pairs. Besides classifying gene pairs, users can also classify genes, so that each gene is assigned a unique mode of duplication. Users can also calculate substitution rates per substitution site (i.e., Ka and Ks) from duplicate pairs, find peaks in Ks distributions with Gaussian Mixture Models (GMMs), and classify gene pairs into age groups based on Ks peaks.
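
The position-based part of such a classification can be sketched in a few lines of Python. This is not doubletrouble's implementation (the package is written in R): the function name, the input layout, the rule that tandem means "no intervening genes", and the `proximal_max=10` threshold are all assumptions for illustration. Transposed duplicates (TRD) are left out because detecting them needs information not modelled here.

```python
def classify_pair(gene_a, gene_b, gene_order, syntenic_pairs=frozenset(), proximal_max=10):
    """Assign a duplication mode to one duplicate gene pair.

    gene_order maps gene ID -> (chromosome, rank of the gene along it).
    syntenic_pairs holds frozenset({a, b}) for pairs already found in
    syntenic blocks (assumed to be computed elsewhere).
    """
    if frozenset((gene_a, gene_b)) in syntenic_pairs:
        return "SD"  # segmental duplication
    chrom_a, rank_a = gene_order[gene_a]
    chrom_b, rank_b = gene_order[gene_b]
    if chrom_a == chrom_b:
        intervening = abs(rank_a - rank_b) - 1
        if intervening == 0:
            return "TD"  # tandem: directly adjacent genes
        if intervening <= proximal_max:
            return "PD"  # proximal: a few genes in between
    return "DD"  # dispersed: everything else in this simplified scheme


# Hypothetical gene positions, for demonstration only.
gene_order = {
    "g1": ("chr1", 1),
    "g2": ("chr1", 2),
    "g3": ("chr1", 8),
    "g4": ("chr2", 3),
}
for pair in [("g1", "g2"), ("g1", "g3"), ("g1", "g4")]:
    print(pair, classify_pair(*pair, gene_order))
```

Checking synteny first mirrors the order in which the modes are listed above, so that a pair in a syntenic block is never relabelled as tandem or proximal.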

Address of the bookmark: https://bioconductor.org/packages/release/bioc/html/doubletrouble.html

MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons

Jit — Mon, 18 Feb 2019 04:21:50 -0600

MACSE aligns coding NT sequences with respect to their AA translation while allowing NT sequences to contain multiple frameshifts and/or stop codons. MACSE is hence the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.

For further details about the underlying algorithm see the original publication:
MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons.
Vincent Ranwez, Sébastien Harispe, Frédéric Delsuc, Emmanuel JP Douzery
PLoS One 2011, 6(9): e22594.
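
To illustrate the kind of problem MACSE is built to tolerate, the Python sketch below translates a nucleotide sequence with the standard genetic code and flags the two symptoms mentioned above: a length that breaks the codon structure and internal stop codons. It is only a diagnostic sketch, not MACSE's alignment algorithm, and the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(nt):
    """Translate complete codons in frame 1; unknown codons become 'X'."""
    nt = nt.upper()
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE.get(nt[i:i + 3], "X") for i in range(0, usable, 3))


def coding_problems(nt):
    """List signs that a sequence may be a pseudogene or contain a frameshift."""
    problems = []
    if len(nt) % 3:
        problems.append(f"length {len(nt)} is not a multiple of 3 (possible frameshift)")
    protein = translate(nt)
    # A stop codon in the last position is expected, so only check earlier codons.
    for index, aa in enumerate(protein[:-1]):
        if aa == "*":
            problems.append(f"internal stop codon at codon {index + 1}")
    return problems


print(coding_problems("ATGGCCTAA"))
print(coding_problems("ATGTAAGCCTAA"))
print(coding_problems("ATGGCCTA"))
```

A check like this only detects that something is wrong; placing the frameshift within an alignment is the harder task that MACSE addresses.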

Address of the bookmark: https://bioweb.supagro.inra.fr/macse/index.php?menu=releases

MUMmer4: A fast and versatile genome alignment system

Jit — Sat, 03 Feb 2018 04:59:17 -0600

MUMmer4 is a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. We show that as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes.
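
For readers unfamiliar with suffix arrays, the toy Python sketch below builds one naively and uses binary search to find exact matches of a query. It is far simpler than anything in MUMmer4 and is not taken from its code; the function names are illustrative. The last line checks the size arithmetic: 2^47 positions is about 141 trillion, which matches the quoted limit, although the reading that one of the 48 bits is unavailable for positions is an assumption made here.

```python
def build_suffix_array(text):
    """Return suffix start positions sorted by suffix (naive, toy-sized inputs only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, query):
    """Return all start positions of query in text, via binary search."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Suffixes sharing the query as a prefix are contiguous in the array.
    while lo < len(suffix_array) and text.startswith(query, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


reference = "ACGTACGTTACG"
suffix_array = build_suffix_array(reference)
print(find_occurrences(reference, suffix_array, "ACG"))

# Number of positions addressable with 47 bits, in trillions of bases.
print(2**47 / 1e12)
```

Each entry of the array stores a position in the input, so the width of that integer bounds the input size, which is why widening the index type lifts the genome size constraint.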

Address of the bookmark: https://mummer4.github.io/