BOL: Related items

BLAST

Wed, 25 Sep 2013 10:56:23 -0500

Dr. Rob Edwards describes how BLAST works

BRIG

Jit — Thu, 16 Feb 2017 13:14:25 -0600

BRIG is a free cross-platform (Windows/Mac/Unix) application that can display circular comparisons between a large number of genomes, with a focus on handling genome assembly data. The application is available at: http://sourceforge.net/projects/brig

If you have any questions or comments, post them on one of the trackers on BRIG’s SourceForge page: http://sourceforge.net/tracker/?group_id=328245.

Features:

Images show similarity between a central reference sequence and other sequences as concentric rings.
BRIG will perform all BLAST comparisons and file parsing automatically via a simple GUI.
Contig boundaries and read coverage can be displayed for draft genomes; customized graphs and annotations can be displayed.
Using a user-defined set of genes as input, BRIG can display gene presence, absence, truncation or sequence variation in a set of complete genomes, draft genomes or even raw, unassembled sequence data.
BRIG also accepts SAM-formatted read-mapping files, enabling genomic regions present in unassembled sequence data from multiple samples to be compared simultaneously.

Address of the bookmark: http://brig.sourceforge.net/

Basic command-line to run BLAST

Shruti Paniwala — Wed, 14 Mar 2018 05:10:34 -0500

The goal of this tutorial is to run you through a demonstration of the command line, which you may not have seen or used much before.

All of the commands below can be copied and pasted.

Install software

Copy and paste the following command:

sudo apt-get update && sudo apt-get -y install python ncbi-blast+

This updates the software list and installs the Python programming language and NCBI BLAST+.

Get Data

Grab some data to play with. Grab some cow and human RefSeq proteins:

wget ftp://ftp.ncbi.nih.gov/refseq/B_taurus/mRNA_Prot/cow.1.protein.faa.gz
wget ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.1.protein.faa.gz

This is only the first part of the human and cow protein files - there are 24 files total for human.

The database files are both gzipped, so let's unzip them:

gunzip *gz
ls

Take a look at the head of each file:

head cow.1.protein.faa
head human.1.protein.faa

These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s just a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.

Note that the files are in FASTA format, even though they end in ”.faa” instead of the usual ”.fasta”. This is NCBI’s way of denoting that this is a FASTA file with amino acids instead of nucleotides.
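To make the record structure concrete, here is a minimal Python sketch that groups FASTA lines into (header, sequence) pairs. The function name read_fasta and the two sample records are made up for illustration; they are not taken from the RefSeq files.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence text may span several lines
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example (not real RefSeq entries).
example = """>seq1 example protein A
MKTAYIAKQR
QISFVKSHFS
>seq2 example protein B
MSLLTEVETY
""".splitlines()

records = list(read_fasta(example))
for name, seq in records:
    print(name, len(seq))
```

To try it on the downloaded data, pass an open file instead of the sample lines, for example open("cow.1.protein.faa").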

How many sequences are in each one?

grep -c '^>' cow.1.protein.faa
grep -c '^>' human.1.protein.faa

This grep command uses the -c flag, which reports a count of the lines that match the pattern. In this case, the pattern is a regular expression that matches only lines that begin with a >.
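The effect of the ^ anchor can be seen in a small Python sketch that mimics grep -c '^>'. The helper name count_headers and the sample lines are hypothetical; note that the third sample line contains a > but does not start with one, so it is not counted.

```python
import re


def count_headers(lines):
    """Count lines that begin with '>', in the spirit of grep -c '^>'."""
    pattern = re.compile(r"^>")  # ^ anchors the match to the start of the line
    return sum(1 for line in lines if pattern.search(line))


# Hypothetical lines for illustration only.
sample = [">seq1", "MKTAYIAKQR", "note: a > b", ">seq2", "MSLLTEVETY"]
n_headers = count_headers(sample)
print(n_headers)
```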

This is a bit too big, so let's take a smaller set for practice. Let's take the first two sequences of the cow proteins, which we can see are on the first 6 lines:

head -6 cow.1.protein.faa > cow.small.faa
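Cutting by line count works here because the first two records happen to occupy exactly six lines. As a sketch of an alternative, the following Python function stops after a given number of records rather than a given number of lines. The name first_records and the sample data are assumptions made for this example.

```python
def first_records(lines, n):
    """Return the lines that make up the first n FASTA records."""
    kept, seen = [], 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break  # the (n+1)-th header marks the end of the subset
        if seen >= 1:
            kept.append(line)
    return kept


# Hypothetical records of different lengths.
sample = [">a", "MKT", "AYI", ">b", "MSL", ">c", "MGG"]
subset = first_records(sample, 2)
print(subset)
```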

BLAST

Now we can blast these two cow sequences against the set of human sequences. First, we need to tell BLAST about our database. BLAST needs to do some pre-work on the database file prior to searching, which makes the searches a lot faster. If you installed your own copy of the software somewhere outside your PATH, you need to tell the shell where it is located by giving the full path to the program; with the apt-get installation above, the makeblastdb command can be run by name:

makeblastdb -in human.1.protein.faa -dbtype prot
ls

Note that this makes a lot of extra files, with the same name as the database plus new extensions (.pin, .psq, etc). To make blast work, these files, called index files, must be in the same directory as the fasta file.

blastp [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-negative_seqidlist filename]
[-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
[-db_hard_mask filtering_algorithm] [-subject subject_input_file]
[-subject_loc range] [-query input_file] [-out output_file]
[-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
[-gapextend extend_penalty] [-qcov_hsp_perc float_value]
[-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value]
[-sum_stats bool_value] [-seg SEG_options] [-soft_masking soft_masking]
[-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]
[-best_hit_overhang float_value] [-best_hit_score_edge float_value]
[-window_size int_value] [-lcase_masking] [-query_loc range]
[-parse_deflines] [-outfmt format] [-show_gis]
[-num_descriptions int_value] [-num_alignments int_value]
[-line_length line_length] [-html] [-max_target_seqs num_sequences]
[-num_threads int_value] [-ungapped] [-remote] [-comp_based_stats compo]
[-use_sw_tback] [-version]

Now we can run the blast job. We will use blastp, which is appropriate for protein to protein comparisons.

blastp -query cow.small.faa -db human.1.protein.faa

This gives us a lot of information on the terminal screen, but that is difficult to save and use later. BLAST also gives the option of saving the text to a file.

blastp -query cow.small.faa -db human.1.protein.faa -out cow_vs_human_blast_results.txt
ls

Take a look at the results using less. Note that there can be more than one match between the query and the same subject. These are referred to as high-scoring segment pairs (HSPs).

less cow_vs_human_blast_results.txt

So how do you know about all the options, such as the flag to create an output file? Let's take a look at the help pages. Unfortunately there are no man pages (those are usually reserved for shell commands, but some software authors will provide them as well), but there is a text help output:

blastp -help

To scroll through slowly:

blastp -help | less

To quit the less screen, press the q key.

Parameters of interest include -evalue (the default is 10, which is very permissive) and -outfmt.

Let's filter for more statistically significant matches with a different output format:

blastp \
-query cow.small.faa \
-db human.1.protein.faa \
-out cow_vs_human_blast_results.tab \
-evalue 1e-5 \
-outfmt 7

I broke the long single command into many lines by “escaping” the newline. The backslash at the end of each line tells the command line “Wait, I’m not done yet!”, so it waits for the next line of the command before executing.
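When the same command is launched from a script instead of being typed, no escaping is needed, because each flag and value is a separate list item. The sketch below assembles the command in Python and prints the equivalent one-line form. The function name run_blast is hypothetical, and actually running it assumes blastp is on your PATH and the files from this tutorial are in the current directory.

```python
import shlex
import subprocess

# One list item per flag or value; no backslashes are required.
cmd = [
    "blastp",
    "-query", "cow.small.faa",
    "-db", "human.1.protein.faa",
    "-out", "cow_vs_human_blast_results.tab",
    "-evalue", "1e-5",
    "-outfmt", "7",
]

# The equivalent single-line shell command.
one_line = shlex.join(cmd)
print(one_line)


def run_blast():
    """Run the command; raises CalledProcessError if blastp reports an error."""
    subprocess.run(cmd, check=True)
```

Call run_blast() once the database has been built with makeblastdb.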

Check out the results with less.
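The tabular file is also easy to read from a script. The sketch below assumes the default twelve columns that BLAST+ writes for -outfmt 6 and 7 (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and skips the # comment lines that -outfmt 7 adds. The function name parse_tabular and the sample rows are invented for illustration; they are not real results.

```python
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_tabular(lines):
    """Turn tab-separated BLAST rows into dictionaries, skipping comments."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # -outfmt 7 comment lines and blank lines
        hit = dict(zip(FIELDS, line.split("\t")))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


# Made-up rows in the default column order.
sample = [
    "# comment lines like this one are skipped",
    "queryA\tsubjectX\t98.50\t200\t3\t0\t1\t200\t5\t204\t1e-120\t410",
    "queryA\tsubjectY\t45.00\t180\t90\t4\t10\t189\t20\t199\t2e-30\t120",
]
hits = parse_tabular(sample)
for hit in hits:
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

To use it on the real output, pass open("cow_vs_human_blast_results.tab") instead of the sample list.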

Let's try a medium-sized data set next:

head -199 cow.1.protein.faa > cow.medium.faa

How many sequences are in this set?

grep -c '^>' cow.medium.faa

Let's run the blast again, but this time let's return only the best hit for each query.

blastp \
-query cow.medium.faa \
-db human.1.protein.faa \
-out cow_vs_human_blast_results.tab \
-evalue 1e-5 \
-outfmt 6 \
-max_target_seqs 1
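Another way to get one hit per query is to keep all hits and choose among them afterwards. The sketch below keeps the highest-bitscore row for each query, assuming the default -outfmt 6 column order, in which the query ID is column 1, the subject ID is column 2 and the bit score is column 12. The function name best_hits and the sample rows are hypothetical.

```python
def best_hits(rows):
    """Map each query ID to (subject ID, bitscore) of its top-scoring row."""
    best = {}
    for row in rows:
        row = row.rstrip("\n")
        if not row or row.startswith("#"):
            continue  # tolerate blank lines and -outfmt 7 comments
        cols = row.split("\t")
        query, subject, bitscore = cols[0], cols[1], float(cols[11])
        # Keep the row only if it beats what we have seen for this query.
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


# Made-up rows in the default column order.
sample = [
    "cow1\thumA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "cow1\thumB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t230",
    "cow2\thumC\t80.0\t50\t10\t0\t1\t50\t1\t50\t1e-20\t90",
]
top = best_hits(sample)
for query, (subject, score) in top.items():
    print(query, subject, score)
```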

Summary

Review:

command line programs such as blast use flags to specify what to do and how to do it
blast options can be found by typing blastp -help
break a command up over many lines by using \ to “escape” the newline

Blastn

blastn [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-negative_seqidlist filename]
[-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
[-db_hard_mask filtering_algorithm] [-subject subject_input_file]
[-subject_loc range] [-query input_file] [-out output_file]
[-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
[-gapextend extend_penalty] [-perc_identity float_value]
[-qcov_hsp_perc float_value] [-max_hsps int_value]
[-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value]
[-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
[-min_raw_gapped_score int_value] [-template_type type]
[-template_length int_value] [-dust DUST_options]
[-filtering_db filtering_database]
[-window_masker_taxid window_masker_taxid]
[-window_masker_db window_masker_db] [-soft_masking soft_masking]
[-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
[-best_hit_score_edge float_value] [-window_size int_value]
[-off_diagonal_range int_value] [-use_index boolean] [-index_name string]
[-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines]
[-outfmt format] [-show_gis] [-num_descriptions int_value]
[-num_alignments int_value] [-line_length line_length] [-html]
[-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
[-version]

DESCRIPTION
Nucleotide-Nucleotide BLAST 2.7.0+

Magic-BLAST

Shruti Paniwala — Fri, 20 Mar 2020 15:18:36 -0500

Magic-BLAST is a tool for mapping large next-generation RNA or DNA sequencing runs against a whole genome or transcriptome. Each alignment optimizes a composite score, taking into account simultaneously the two reads of a pair, and in case of RNA-seq, locating the candidate introns and adding up the score of all exons. This is very different from other versions of BLAST, where each exon is scored as a separate hit and read-pairing is ignored.

Address of the bookmark: https://ncbi.github.io/magicblast/

Cleaner BLAST Databases for More Accurate Results

LEGE — Tue, 23 Apr 2024 01:23:08 -0500

Do you use BLAST to identify a sequence or the evolutionary scope of a gene? That can be challenging if contaminated and misclassified sequences are in the BLAST databases and show up in your search results. To address this problem, we now use the NCBI quality assurance tools listed below to systematically remove these misleading sequences from the default nucleotide (nt) and protein (nr) BLAST databases.

Foreign Contamination Screen tool for genome cross-species screening (FCS-GX) detects contamination from foreign organisms in genomes and other sequences using the genome cross-species aligner (GX).
Average Nucleotide Identity (ANI) evaluates the taxonomic classification of prokaryotic genome assemblies. Sequences from genomes marked up as ‘unverified source organism’ are considered suspect and removed.

Ref https://ncbiinsights.ncbi.nlm.nih.gov/2024/04/22/cleaner-blast-databases-more-accurate-results/

Circoletto: visualizing sequence similarity with Circos

Jit — Fri, 09 Feb 2018 10:23:40 -0600

Circoletto is an online visualization tool based on Circos that provides a fast, aesthetically pleasing and informative overview of sequence similarity search results.

Online version and downloadable software package for offline use (source code in PERL) freely available at http://bat.ina.certh.gr/tools/circoletto/

Contact: ndarz@certh.gr

Address of the bookmark: http://tools.bat.infspire.org/circoletto/

TwinBLAST: When Two Is Better than One

Jit — Sat, 07 Sep 2019 08:50:08 -0500

TwinBLAST is a web-based tool for viewing 2 BLAST reports simultaneously side-by-side. It uses ExtJS (www.sencha.com/products/extjs/) to provide 2 independently scrollable panels. BioPerl (www.bioperl.org) is used to index raw BLAST reports and Bio::Graphics is used to draw pictograms of the BLAST hits.

https://github.com/IGS/twinblast

https://mra.asm.org/content/8/35/e00842-19

Address of the bookmark: https://github.com/IGS/twinblast

BEAP: Blast Extension and Assembly Program

Shruti Paniwala — Mon, 11 Jun 2018 04:52:56 -0500

The Blast Extension and Assembly Program (BEAP) is a computer program that uses a short starting DNA fragment, often an EST or partial gene segment, as a "primer" to recursively blast nucleotide databases in an attempt to obtain all sequences that overlap, directly or indirectly, with the "primer". This helps to "extend" the original sequence toward a "full length" sequence for functional analysis, or at least to obtain neighboring regions of the segment for SNP discovery and linkage disequilibrium analysis. Confidence in the resulting assembly is achieved by using a known genome, such as the human genome, as a reference. https://www.animalgenome.org/tools/beap/

Address of the bookmark: https://www.animalgenome.org/tools/beap/

Biological databases !

BioStar — Wed, 12 Feb 2020 01:16:29 -0600

Nowadays there are a lot of genomics databases available around the world. This bookmark was created to provide all the links in one place ...

ftp://ftp.ncbi.nih.gov/genomes/

https://hgdownload.soe.ucsc.edu/downloads.html

Address of the bookmark: ftp://ftp.ncbi.nih.gov/genomes/

Breeding Insight

BioStar — Wed, 06 Jan 2021 19:49:21 -0600

Breeding Insight at Cornell University will leverage recent improvements in genomics and open source informatics components and, in partnership with small breeding programs, will enable these programs to harness powerful digital tools to accelerate their genetic gains.

Breeding Insight is funded by the U.S. Department of Agriculture (USDA) Agricultural Research Service (ARS) through Cornell University. The USDA ARS delivers scientific solutions to national and global agricultural challenges. As a global leader in agricultural discovery through scientific excellence, ARS is committed to delivering cutting-edge, scientific tools and innovative solutions for American farmers, producers, industry, and communities to support the nourishment and well-being of all people; sustaining our nation’s agroecosystems and natural resources; and ensuring the economic competitiveness and excellence of our agriculture.

Address of the bookmark: https://www.breedinginsight.org/