BOL: Related items

Converting BLAST output into CSV

Poonam Mahapatra — Mon, 11 Dec 2017 04:17:58 -0600

Suppose we wanted to do something with all this BLAST output. Generally, that’s the case - you want to retrieve all matches, or do a reciprocal BLAST, or something.

As with most programs that run on UNIX, the text output is in some specific format. If the program is popular enough, there will be one or more parsers written for that format – these are just utilities written to help you retrieve whatever information you are interested in from the output.

Let’s conclude this tutorial by converting the BLAST output in out.txt into a spreadsheet format, using a Python script.

First, we need to get the script. We’ll do that using the ‘git’ program:

git clone https://github.com/ngs-docs/ngs-scripts.git /root/ngs-scripts

We’ll discuss ‘git’ more later; for now, just think of it as a way to get ahold of a particular set of files. In this case, we’ve placed the files in /root/ngs-scripts/, and you’re looking to run the script blast/blast-to-csv.py using Python:

python /root/ngs-scripts/blast/blast-to-csv.py out.txt

This outputs a spreadsheet-like list of names and e-values. To save this to a file, do:

python /root/ngs-scripts/blast/blast-to-csv.py out.txt > out.csv

If you have Excel installed, try double-clicking the saved file to open it.
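
Once the results are saved as CSV, they can also be post-processed in Python instead of a spreadsheet. The following is a minimal sketch, not part of ngs-scripts. It assumes the e-value is the last column of each row (check what blast-to-csv.py actually writes), and the function name, threshold, and sample rows are made up for illustration.

```python
import csv
import io


def filter_by_evalue(handle, max_evalue=1e-5, evalue_col=-1):
    """Yield CSV rows whose e-value is at or below max_evalue.

    Assumption: the e-value sits in column `evalue_col` (last by default).
    """
    for row in csv.reader(handle):
        if not row:
            continue
        try:
            evalue = float(row[evalue_col])
        except ValueError:
            continue  # skip a header line or a malformed row
        if evalue <= max_evalue:
            yield row


# Made-up rows standing in for the contents of a saved CSV file.
sample = io.StringIO("q1,s1,1e-30\nq1,s2,0.5\nq2,s3,2e-8\n")
kept = list(filter_by_evalue(sample))
for row in kept:
    print(row)
```

To run it on a real file, open the file with open("out.csv", newline="") and pass the handle in place of the sample.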

BLAST+ 2.11.0 release is now available on FTP site !

Jit — Sat, 14 Nov 2020 21:37:53 -0600

The BLAST+ 2.11.0 release is now available from our FTP site. The main advance is the ability to provide usage reports to NCBI to help us improve BLAST. This information is limited to the name of the BLAST program, some basic database metadata, a few BLAST parameters, as well as the number and total size of your queries. See the Privacy document for more details on the information we collect, how we will use it, and how you can opt out of reporting.

Another new feature allows threading by query batch in rpsblast/rpstblastn. Enabling this option using -mt_mode 1 provides more efficient searching with large numbers of queries. See the release notes for details on more improvements and bug fixes.

Useful Links
------------
NCBI Insights: https://ncbiinsights.ncbi.nlm.nih.gov/2020/11/12/blast-2-11-0/

BLAST FTP: https://go.usa.gov/x7QQ3
Privacy document: https://go.usa.gov/x7QQe
Release notes: https://go.usa.gov/x7Qnv

Basics of BLAST Programs !

BioStar — Fri, 26 Jul 2024 06:04:26 -0500

The Basic Local Alignment Search Tool (BLAST) is a powerful bioinformatics program used to compare an input sequence (such as DNA, RNA, or protein sequences) against a database of sequences to find regions of similarity. Developed by the National Center for Biotechnology Information (NCBI), BLAST is widely used for identifying species, finding functional and evolutionary relationships between sequences, and predicting the function of novel sequences.

Key Features of BLAST:
1. Sequence Comparison: BLAST searches for local alignments between the query sequence and sequences in a database. It identifies regions of similarity, which can help infer functional and evolutionary relationships.

2. Speed and Efficiency: BLAST uses heuristic algorithms, which make it faster than exhaustive search methods and suitable for large-scale database searches.

3. Versatility: There are several versions of BLAST for different types of sequence comparisons:
- blastn: Compares a nucleotide query sequence against a nucleotide sequence database.
- blastp: Compares a protein query sequence against a protein sequence database.
- blastx: Compares a nucleotide query sequence translated in all reading frames against a protein sequence database.
- tblastn: Compares a protein query sequence against a nucleotide sequence database translated in all reading frames.
- tblastx: Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

4. Scoring and E-value: BLAST scores each alignment based on its quality and length. The E-value (expect value) is the number of alignments with an equal or better score that one would expect to find by chance in a database of the same size; lower E-values indicate more significant matches.

5. Output Formats: BLAST provides results in various formats, including plain text, HTML, XML, and JSON, making it adaptable for different types of analyses and integrations with other tools.
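
To make the E-value in item 4 concrete: for a bit score S', the expected number of chance alignments is approximately E = m * n * 2^(-S'), where m is the query length and n is the total database length. The sketch below applies that formula directly. BLAST itself uses length-corrected ("effective") values for m and n, so reported E-values will differ somewhat. The function name and example numbers are illustrative assumptions.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Assumption: raw lengths are used here; BLAST applies effective
    (edge-corrected) lengths, so its reported values are not identical.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Illustrative numbers: a 100-residue query against a 1,000,000-residue database.
weak = expect_value(40, 100, 1_000_000)
strong = expect_value(50, 100, 1_000_000)
print(f"bit score 40: E = {weak:.2e}")
print(f"bit score 50: E = {strong:.2e}")
```

Each additional bit of score halves the E-value, and doubling the database size doubles it, which is why the same alignment looks less significant in a larger database.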

Applications of BLAST:
- Genomic Research: Identifying genes, understanding genetic diversity, and mapping genome sequences.
- Protein Function Prediction: Inferring the function of unknown proteins by comparing them to known protein sequences.
- Evolutionary Studies: Exploring evolutionary relationships between organisms by comparing their genetic material.
- Medical Research: Identifying pathogens, understanding disease mechanisms, and developing treatments by comparing sequences of interest.

Overall, BLAST is an essential tool in bioinformatics, offering a reliable and efficient way to analyze and interpret biological sequence data.

Understanding BLASTn output format 6 !

Rahul Nayak — Wed, 27 Jun 2018 18:38:21 -0500

BLASTn output format 6

BLASTn maps DNA against DNA, for example gene sequences against a reference genome:

blastn -query genes.ffn -subject genome.fna -outfmt 6

BLASTn tabular output format 6

Column headers:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

1.	qseqid	query (e.g., gene) sequence id
2.	sseqid	subject (e.g., reference genome) sequence id
3.	pident	percentage of identical matches
4.	length	alignment length
5.	mismatch	number of mismatches
6.	gapopen	number of gap openings
7.	qstart	start of alignment in query
8.	qend	end of alignment in query
9.	sstart	start of alignment in subject
10.	send	end of alignment in subject
11.	evalue	expect value
12.	bitscore	bit score
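
The twelve default columns can be read into named fields with a few lines of Python. This is a minimal sketch rather than an official parser: the function name and the sample line are made up, and it assumes tab-separated output produced with the default -outfmt 6 columns listed above.

```python
DEFAULT_COLUMNS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
)
INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_outfmt6(lines):
    """Yield one dict per hit from default BLAST tabular (format 6) lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != len(DEFAULT_COLUMNS):
            raise ValueError(f"expected 12 columns, got {len(fields)}: {line!r}")
        record = {}
        for name, value in zip(DEFAULT_COLUMNS, fields):
            if name in INT_COLUMNS:
                record[name] = int(value)
            elif name in FLOAT_COLUMNS:
                record[name] = float(value)
            else:
                record[name] = value
        yield record


# A made-up hit line, for illustration only.
sample = ["gene1\tchr1\t98.50\t200\t3\t0\t1\t200\t1500\t1699\t1e-95\t350\n"]
hits = list(parse_outfmt6(sample))
print(hits[0]["sseqid"], hits[0]["pident"], hits[0]["evalue"])
```

For a real run, pass an open file handle instead of the sample list. Note that for hits on the minus strand, sstart is larger than send.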

Define your own output format

Add the -outfmt option followed by a quoted list of format specifiers, for example:

-outfmt "6 qseqid sseqid pident qlen length mismatch gapopen evalue bitscore"

Supported format specifiers are:
qseqid    Query Seq-id
qgi   Query GI
qacc    Query accession
qaccver   Query accession.version
qlen    Query sequence length
sseqid    Subject Seq-id
sallseqid All subject Seq-id(s), separated by a ';'
sgi       Subject GI
sallgi    All subject GIs
sacc      Subject accession
saccver   Subject accession.version
sallacc   All subject accessions
slen      Subject sequence length
qstart    Start of alignment in query
qend    End of alignment in query
sstart    Start of alignment in subject
send      End of alignment in subject
qseq      Aligned part of query sequence
sseq      Aligned part of subject sequence
evalue    Expect value
bitscore  Bit score
score   Raw score
length    Alignment length
pident    Percentage of identical matches
nident    Number of identical matches
mismatch  Number of mismatches
positive  Number of positive-scoring matches
gapopen   Number of gap openings
gaps      Total number of gaps
ppos      Percentage of positive-scoring matches
frames    Query and subject frames separated by a '/'
qframe    Query frame
sframe    Subject frame
btop      Blast traceback operations (BTOP)
staxids   Subject Taxonomy ID(s), separated by a ';'
sscinames Subject Scientific Name(s), separated by a ';'
scomnames Subject Common Name(s), separated by a ';'
sblastnames Subject Blast Name(s), separated by a ';'   (in alphabetical order)
sskingdoms  Subject Super Kingdom(s), separated by a ';'     (in alphabetical order)
stitle    Subject Title
salltitles  All Subject Title(s), separated by a '<>'
sstrand   Subject Strand
qcovs   Query Coverage Per Subject
qcovhsp   Query Coverage Per HSP

The default is equivalent to:
-outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore"
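
When a custom column list is used, it helps to keep that list in one place and use it both to build the -outfmt argument and to label the output fields. The sketch below does this under a few assumptions: the helper names are hypothetical, the file names genes.ffn and genome.fna are reused from the example above, and the sample output line is made up. It only builds the command; it does not run blastn.

```python
COLUMNS = ["qseqid", "sseqid", "pident", "qlen", "length",
           "mismatch", "gapopen", "evalue", "bitscore"]


def outfmt_argument(columns):
    """Build the value passed to -outfmt, e.g. '6 qseqid sseqid ...'."""
    return "6 " + " ".join(columns)


def blastn_command(query, subject, columns):
    """Return the blastn argument list (not executed here)."""
    return ["blastn", "-query", query, "-subject", subject,
            "-outfmt", outfmt_argument(columns)]


def parse_rows(lines, columns):
    """Yield one dict per tab-separated line, keyed by the same column list."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(fields)}")
        yield dict(zip(columns, fields))  # values are kept as strings


command = blastn_command("genes.ffn", "genome.fna", COLUMNS)
print(command)

# A made-up output line matching COLUMNS, for illustration only.
sample = ["gene1\tchr1\t98.50\t210\t200\t3\t0\t1e-95\t350\n"]
rows = list(parse_rows(sample, COLUMNS))
print(rows[0]["qlen"], rows[0]["evalue"])
```

If blastn is installed and on the PATH, the list can be passed to subprocess.run(command, capture_output=True, text=True); because it is an argument list rather than a shell string, the format specifier needs no extra quoting.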

REST API

Neel — Mon, 04 Oct 2021 12:46:40 -0500

Sample Representational State Transfer (REST) clients are provided for a number of programming languages. For details of how to use these clients, download the client and run the program without any arguments.

Language	Download	Requirements
Perl	psiblast.pl	LWP and XML::Simple
Python	psiblast.py	xmltramp2

For details, see the Environment setup for REST Web Services and Examples for Perl REST Web Services Clients pages.

KOALA: KEGG's internal annotation tool for K number assignment of KEGG GENES using SSEARCH computation

Abhimanyu Singh — Wed, 12 Dec 2018 09:16:55 -0600

KOALA (KEGG Orthology And Links Annotation) is KEGG's internal annotation tool for K number assignment of KEGG GENES using SSEARCH computation. BlastKOALA and GhostKOALA assign K numbers to the user's sequence data by BLAST and GHOSTX searches, respectively, against a nonredundant set of KEGG GENES. Annotate Sequence in KEGG Mapper and Pathogen Checker in KEGG Pathogen are special interfaces to the BlastKOALA server and can be executed in an interactive mode. See Step-by-step Instructions.

Reference: Kanehisa, M., Sato, Y., and Morishima, K. (2016) BlastKOALA and GhostKOALA: KEGG tools for functional characterization of genome and metagenome sequences. J. Mol. Biol. 428, 726-731.

Address of the bookmark: https://www.kegg.jp/blastkoala/

npScarf: real-time scaffolder using SPAdes contigs and Nanopore sequencing reads

Shruti Paniwala — Mon, 11 Jun 2018 05:14:57 -0500

npScarf (jsa.np.npscarf) is a program that connects contigs from a draft genome to generate sequences that are closer to finished. It can run on a single laptop for microbial datasets. In real-time mode, it can be integrated with simple structural analyses such as gene ordering and plasmid formation.

Address of the bookmark: http://japsa.readthedocs.io/en/latest/tools/jsa.np.npscarf.html

SuRankCo: supervised ranking of contigs in de novo assemblies

Neel — Wed, 24 May 2017 04:46:52 -0500

SuRankCo is a machine-learning-based tool to score and rank contigs from de novo assemblies of next-generation sequencing data. It trains on alignments of contigs to known reference genomes and predicts scores and rankings for contigs that have no related reference genome yet.

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-015-0644-7

Address of the bookmark: https://sourceforge.net/projects/surankco/

PGAP-X: Extension on pan-genome analysis pipeline

Rahul Nayak — Tue, 23 Jan 2018 11:41:43 -0600

PGAP-X is a microbial comparative genomic analysis platform with a graphical interface. A series of algorithms and methodologies has been developed and integrated to analyze and visualize genomic structural variation, gene distribution at different conservation levels, and genetic variation from a pan-genome perspective. Result data from many other programs, including genome alignments and ortholog clusters, can also be further analyzed or visualized in PGAP-X. The workflow and feature snapshot of PGAP-X are shown in Fig. 1 and Fig. 2.

Address of the bookmark: https://pgapx.ybzhao.com/

LoRDEC: a hybrid error correction program for long, PacBio reads

Jit — Mon, 10 Apr 2017 04:16:09 -0500

LoRDEC is a program to correct sequencing errors in long reads from third-generation sequencing with a high error rate, and is especially intended for PacBio reads. It uses a hybrid strategy, meaning that it uses two sets of reads: the reference read set, whose error rate is assumed to be small, and the PacBio read set, which is then corrected using the reference set. Typically, the reference set contains Illumina reads.

Usually, errors in PacBio reads include many insertions and deletions, and comparatively fewer substitutions. LoRDEC can correct errors of all these types.
After correction, a larger portion of the sequence of PacBio reads is usable for detecting regions of similarity with other sequences, for aligning them to the contigs of an assembly, etc.

Why is LoRDEC different?

It is efficient and can process large read data sets, including those from eukaryotic or vertebrate species, on a standard computing server, and even works on desktop/laptop computers.
It adopts a novel graph-based approach: it builds a succinct De Bruijn Graph (DBG) representing the short reads, and seeks a corrective sequence for each erroneous region of a long read by traversing chosen paths in the graph.
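
The graph-based idea can be illustrated with a toy example. The sketch below is not LoRDEC's implementation: it stores k-mers in a plain Python set rather than a succinct structure, and it uses a simple breadth-first search rather than whatever path selection LoRDEC applies. It builds a De Bruijn graph from short reads and then looks for a path between two k-mers that would flank an erroneous region of a long read. All names, reads, and parameters are made up for illustration.

```python
from collections import Counter, deque


def solid_kmers(reads, k, min_count=1):
    """Collect k-mers seen at least min_count times in the short reads."""
    counts = Counter(read[i:i + k] for read in reads
                     for i in range(len(read) - k + 1))
    return {kmer for kmer, count in counts.items() if count >= min_count}


def successors(kmer, kmers):
    """K-mers in the graph that overlap this one by k-1 bases."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in kmers]


def find_path(kmers, source, target, max_steps=50):
    """Breadth-first search from source to target; return the spelled sequence.

    Returns None if no path of at most max_steps extensions exists.
    """
    if source not in kmers or target not in kmers:
        return None
    queue = deque([(source, source)])
    seen = {source}
    while queue:
        node, spelled = queue.popleft()
        if node == target:
            return spelled
        if len(spelled) - len(source) >= max_steps:
            continue  # do not extend paths beyond the step limit
        for nxt in successors(node, kmers):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, spelled + nxt[-1]))
    return None


# Two made-up, overlapping short reads (assumed to be accurate).
short_reads = ["ATGGCGTG", "GCGTGCAAT"]
graph = solid_kmers(short_reads, k=4)

# K-mers flanking a hypothetical erroneous region of a long read.
corrected = find_path(graph, "ATGG", "CAAT")
print(corrected)
```

In this toy case the path spells out the sequence shared by the two short reads, which would replace the erroneous stretch of the long read between the two flanking k-mers.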

Address of the bookmark: http://www.atgc-montpellier.fr/lordec/