BOL: Related items

Tiara: deep learning-based classification system for eukaryotic sequences

Rahul Nayak — Mon, 14 Mar 2022 23:02:11 -0500

With a large number of metagenomic datasets becoming available, eukaryotic metagenomics has emerged as a new challenge. The proper classification of eukaryotic nuclear and organellar genomes is an essential step toward a better understanding of eukaryotic diversity.

Address of the bookmark: https://academic.oup.com/bioinformatics/article/38/2/344/6375939

Basics of BLAST Programs!

BioStar — Fri, 26 Jul 2024 06:04:26 -0500

The Basic Local Alignment Search Tool (BLAST) is a powerful bioinformatics program used to compare an input sequence (such as DNA, RNA, or protein sequences) against a database of sequences to find regions of similarity. Developed by the National Center for Biotechnology Information (NCBI), BLAST is widely used for identifying species, finding functional and evolutionary relationships between sequences, and predicting the function of novel sequences.

Key Features of BLAST:
1. Sequence Comparison: BLAST searches for local alignments between the query sequence and sequences in a database. It identifies regions of similarity, which can help infer functional and evolutionary relationships.

2. Speed and Efficiency: BLAST uses heuristic algorithms, making it faster than exhaustive search methods, suitable for large-scale database searches.

3. Versatility: There are several versions of BLAST for different types of sequence comparisons:
- blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
- blastp: Compares a protein query sequence against a protein sequence database.
- blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: Compares a protein query sequence against a nucleotide sequence database translated in all reading frames.
- tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

4. Scoring and E-value: BLAST results are scored based on the quality and length of the alignments. The E-value (expect value) indicates the number of alignments one can expect to find by chance, with lower E-values representing more significant matches.

5. Output Formats: BLAST provides results in various formats, including plain text, HTML, XML, and JSON, making it adaptable for different types of analyses and integrations with other tools.
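
The relationship between score and E-value in point 4 can be made concrete with the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where S is the raw alignment score, m and n are the query and database lengths, and K and lambda depend on the scoring system. The sketch below is only an illustration: the function names are hypothetical, the parameter values in the usage lines are placeholders, and real BLAST runs use effective (edge-corrected) lengths rather than raw ones.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a normalized bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, lam: float, k: float,
            query_len: int, db_len: int) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def e_value_from_bits(bits: float, query_len: int, db_len: int) -> float:
    """Same quantity expressed through the bit score: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


if __name__ == "__main__":
    # Placeholder statistical parameters and lengths (assumptions, not
    # values read from any particular BLAST run).
    lam, k = 0.267, 0.041
    m, n = 300, 50_000_000
    for s in (40, 80, 160):
        print(s, round(bit_score(s, lam, k), 1), e_value(s, lam, k, m, n))
```

Because E grows linearly with the search space m * n, the same alignment score is less significant against a larger database, which is why E-values from searches against different databases are not directly comparable.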

Applications of BLAST:
- Genomic Research: Identifying genes, understanding genetic diversity, and mapping genome sequences.
- Protein Function Prediction: Inferring the function of unknown proteins by comparing them to known protein sequences.
- Evolutionary Studies: Exploring evolutionary relationships between organisms by comparing their genetic material.
- Medical Research: Identifying pathogens, understanding disease mechanisms, and developing treatments by comparing sequences of interest.

Overall, BLAST is an essential tool in bioinformatics, offering a reliable and efficient way to analyze and interpret biological sequence data.
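
The translated searches listed above (blastx, tblastn, tblastx) rely on translating a nucleotide sequence in all six reading frames: three offsets on the forward strand and three on the reverse complement. The following is a minimal sketch of that idea using the standard genetic code. The function names and the frame labels are illustrative assumptions, not BLAST's own code or API.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict:
    """Map a frame label ('+1'..'+3', '-1'..'-3') to its protein translation."""
    frames = {}
    for strand, strand_seq in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCATTGTAATGGGCCGCTGA").items():
        print(label, protein)
```

Stop codons appear as `*` in the output, so frames interrupted by many stops are easy to spot.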

MACSE: Multiple Alignment of Coding SEquences Accounting for Frameshifts and Stop Codons

Jit — Mon, 18 Feb 2019 04:21:50 -0600

MACSE aligns coding nucleotide (NT) sequences with respect to their amino acid (AA) translation while allowing NT sequences to contain multiple frameshifts and/or stop codons. MACSE is hence the first automatic solution to align protein-coding gene datasets containing non-functional sequences (pseudogenes) without disrupting the underlying codon structure. It has also proved useful in detecting undocumented frameshifts in public database sequences and in aligning next-generation sequencing reads/contigs against a reference coding sequence.

For further details about the underlying algorithm see the original publication:
MACSE: Multiple Alignment of Coding SEquences accounting for frameshifts and stop codons.
Vincent Ranwez, Sébastien Harispe, Frédéric Delsuc, Emmanuel JP Douzery
PLoS One 2011, 6(9): e22594.

Address of the bookmark: https://bioweb.supagro.inra.fr/macse/index.php?menu=releases

HipSTR: Haplotype inference and phasing for Short Tandem Repeats

BioJoker — Thu, 07 Mar 2019 21:13:06 -0600

HipSTR was specifically developed to deal with the alignment and PCR stutter errors that make short tandem repeats (STRs) difficult to genotype, in the hope of obtaining more robust STR genotypes. In particular, it accomplishes this by:

- Learning locus-specific PCR stutter models using an EM algorithm
- Mining candidate STR alleles from population-scale sequencing data
- Employing a specialized hidden Markov model to align reads to candidate alleles while accounting for STR artifacts
- Utilizing phased SNP haplotypes to genotype and phase STRs

Address of the bookmark: https://github.com/tfwillems/HipSTR

CoLoRMap: Correcting Long Reads by Mapping short reads

Jit — Mon, 20 Aug 2018 14:17:05 -0500

Second-generation sequencing technologies paved the way to an exceptional increase in the number of sequenced genomes, both prokaryotic and eukaryotic. However, short reads are difficult to assemble and often lead to highly fragmented assemblies. Recent developments in long-read sequencing methods offer a promising way to address this issue. However, long reads have so far been characterized by a high error rate, and assembling from long reads requires a high depth of coverage. This motivates the development of hybrid approaches that leverage the high quality of short reads to correct errors in long reads.

We introduce CoLoRMap, a hybrid method for correcting noisy long reads, such as the ones produced by PacBio sequencing technology, using high-quality Illumina paired-end reads mapped onto the long reads. Our algorithm is based on two novel ideas: using a classical shortest path algorithm to find a sequence of overlapping short reads that minimizes the edit score to a long read, and extending corrected regions by local assembly of unmapped mates of mapped short reads.

Our results on bacterial, fungal and insect data sets show that CoLoRMap compares well with existing hybrid correction methods.

The source code of CoLoRMap is freely available for non-commercial use at https://github.com/sfu-compbio/colormap
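
The shortest-path idea mentioned in the abstract can be illustrated with a toy model. This is not CoLoRMap's actual implementation: the graph construction, the edge weights, and every name below are simplifying assumptions. Each short-read alignment is a node, an edge links two alignments that overlap on the long read, and Dijkstra's algorithm picks the chain of overlapping alignments with the smallest total edit distance.

```python
import heapq
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    start: int  # first long-read position covered (inclusive)
    end: int    # last long-read position covered (inclusive)
    edits: int  # edit distance between the short read and that segment


def overlaps_and_extends(a: Alignment, b: Alignment) -> bool:
    """True if b starts inside a and reaches further right than a."""
    return a.start < b.start <= a.end < b.end


def cheapest_chain(alignments: list, source: int, target: int) -> Optional[tuple]:
    """Dijkstra over alignment indices; returns (total_edits, path) or None."""
    dist = {source: alignments[source].edits}
    prev = {}
    heap = [(dist[source], source)]
    while heap:
        cost, u = heapq.heappop(heap)
        if cost > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:
            break
        for v, cand in enumerate(alignments):
            if overlaps_and_extends(alignments[u], cand):
                new_cost = cost + cand.edits  # assumed weight: edits of next read
                if new_cost < dist.get(v, float("inf")):
                    dist[v] = new_cost
                    prev[v] = u
                    heapq.heappush(heap, (new_cost, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return dist[target], path[::-1]


if __name__ == "__main__":
    # Made-up alignments of four short reads onto one long read.
    reads = [
        Alignment(0, 99, 2),
        Alignment(50, 149, 9),
        Alignment(60, 159, 1),
        Alignment(120, 219, 3),
    ]
    print(cheapest_chain(reads, source=0, target=3))
```

In this toy input the chain through the third alignment is preferred over the chain through the second, because it covers the same region with fewer edits.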

Contact: ehaghshe@sfu.ca or cedric.chauve@sfu.ca

Address of the bookmark: https://github.com/sfu-compbio/colormap

QuasR: Quantification and annotation of short reads in R

Neel — Fri, 13 Aug 2021 07:44:05 -0500

The QuasR package (short for Quantify and annotate short reads in R) integrates the functionality of several R packages (such as IRanges (Lawrence et al. 2013) and Rsamtools) and external software (e.g. bowtie, through the Rbowtie package, and HISAT2, through the Rhisat2 package). The package aims to cover the whole analysis workflow of typical high throughput sequencing experiments, starting from the raw sequence reads, over pre-processing and alignment, up to quantification. A single R script can contain all steps of a complete analysis, making it simple to document, reproduce or share the workflow containing all relevant details.

Address of the bookmark: https://www.bioconductor.org/packages/devel/bioc/vignettes/QuasR/inst/doc/QuasR.html

BLAST+ 2.11.0 release is now available on FTP site!

Jit — Sat, 14 Nov 2020 21:37:53 -0600

BLAST+ 2.11.0 release is now available from our FTP site. The main advance is the ability to provide usage reports to NCBI to help us improve BLAST. This information is limited to the name of the BLAST program, some basic database metadata, a few BLAST parameters, as well as the number and total size of your queries. See the Privacy document for more details on the information we collect, how we will use it, and how you can opt out of reporting.

Another new feature allows threading by query batch in rpsblast/rpstblastn. Enabling this option using -mt_mode 1 provides more efficient searching with large numbers of queries. See release notes for details on more improvements and bug fixes.

Useful Links
------------
NCBI Insights: https://ncbiinsights.ncbi.nlm.nih.gov/2020/11/12/blast-2-11-0/

BLAST FTP: https://go.usa.gov/x7QQ3
Privacy document: https://go.usa.gov/x7QQe
Release notes: https://go.usa.gov/x7Qnv

RITA: Rapid identification of high-confidence taxonomic assignments for metagenomic data

Jit — Mon, 27 Nov 2017 08:25:33 -0600

RITA is a standalone software package and Web server for taxonomic assignment of metagenomic sequence reads. By combining homology predictions from BLAST or UBLAST with compositional classifications from a Naive Bayes classifier, RITA is able to achieve very high accuracy on short reads. Unlike other hybrid approaches which combine these predictions for all sequences to be classified, RITA uses a pipeline to first identify cases where both types of classifier are in agreement, which constitute the highest-confidence set. Sequences not classified in this manner are subjected to a series of downstream classification steps.
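
The first stage described above, keeping only the reads on which both classifier types agree, can be sketched as follows. This is a hypothetical illustration rather than RITA's code: the function name, the input layout, and the labels are assumptions, and the downstream classification steps are not modeled because they are not described here.

```python
from typing import Dict, Optional, Tuple


def split_by_agreement(
    predictions: Dict[str, Tuple[Optional[str], Optional[str]]]
) -> Tuple[Dict[str, str], list]:
    """Separate reads into an agreed (highest-confidence) set and a remainder.

    predictions maps a read ID to (homology_label, composition_label);
    None means that classifier produced no call for the read.
    """
    high_confidence = {}
    needs_downstream = []
    for read_id, (homology, composition) in predictions.items():
        if homology is not None and homology == composition:
            high_confidence[read_id] = homology
        else:
            needs_downstream.append(read_id)
    return high_confidence, needs_downstream


if __name__ == "__main__":
    # Made-up predictions for three reads.
    calls = {
        "read1": ("Escherichia", "Escherichia"),
        "read2": ("Bacillus", "Clostridium"),
        "read3": (None, "Pseudomonas"),
    }
    agreed, remaining = split_by_agreement(calls)
    print(agreed)
    print(remaining)
```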

This work has been accepted for publication:

MacDonald NJ, Parks DH, and Beiko RG. Rapid identification of taxonomic assignments. Accepted to Nucleic Acids Research April 4, 2012.

If you have any questions or bug reports, please let us know at .

Address of the bookmark: http://kiwi.cs.dal.ca/Software/RITA

quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification.

Abhi — Sat, 08 Jun 2024 15:54:36 -0500

quarTeT is a collection of tools for telomere-to-telomere (T2T) genome assembly and basic analysis in an automated workflow.

Tasks include:

- AssemblyMapper: reference-guided genome assembly
- GapFiller: long-read-based gap filling
- TeloExplorer: telomere identification
- CentroMiner: centromere candidate prediction

https://academic.oup.com/hr/article/10/8/uhad127/7197191?login=false

Address of the bookmark: http://www.atcgn.com:8080/quarTeT/home.html

MafTools

Jit — Thu, 16 Feb 2017 11:16:01 -0600

maftools: an R package to summarize, analyze and visualize MAF files.

With advances in cancer genomics, Mutation Annotation Format (MAF) is being widely accepted and used to store detected variants. The Cancer Genome Atlas Project has sequenced over 30 different cancers, with a sample size of over 200 for each cancer type. The resulting data, consisting of genetic variants, is stored in Mutation Annotation Format. This package attempts to summarize, analyze, annotate and visualize MAF files efficiently, whether they come from TCGA sources or from in-house studies, as long as the data is in MAF format. Maftools can also handle ICGC Simple Somatic Mutation format.

maftools is on bioRxiv

Please cite the following if you find this tool useful.

Mayakonda, A. and H.P. Koeffler, Maftools: Efficient analysis, visualization and summarization of MAF files from large-scale cohort based cancer studies. bioRxiv, 2016. doi: http://dx.doi.org/10.1101/052662

Address of the bookmark: https://github.com/PoisonAlien/maftools