Alternative content
The goal of this tutorial is to run you through a demonstration of the command line, which you may not have seen or used much before.
All of the commands below can copy/pasted.
Copy and paste the following commands
sudo apt-get update && sudo apt-get -y install python ncbi-blast+
This updates the software list and installs the Python programming language and NCBI BLAST+.
Grab some data to play with. Grab some cow and human RefSeq proteins:
This is only the first part of the human and cow protein files - there are 24 files total for human.
The database files are both gzipped, so lets unzip them
gunzip *gz ls
Take a look at the head of each file:
head cow.1.protein.faa head human.1.protein.faa
These are protein sequences in FASTA format. FASTA format is something many of you have probably seen in one form or another – it’s pretty ubiquitous. It’s just a text file, containing records; each record starts with a line beginning with a ‘>’, and then contains one or more lines of sequence text.
Note that the files are in fasta format, even though they end if ”.faa” instead of the usual ”.fasta”. This NCBI’s way of denoting that this is a fasta file with amino acids instead of nucleotides.
How many sequences are in each one?
grep -c '^>' cow.1.protein.faa grep -c '^>' human.1.protein.faa
This grep command uses the c flag, which reports a count of lines with match to the pattern. In this case, the pattern is a regular expression, meaning match only lines that begin with a >.
This is a bit too big, lets take a smaller set for practice. Lets take the first two sequences of the cow proteins, which we can see are on the first 6 lines
head -6 cow.1.protein.faa > cow.small.faa
Now we can blast these two cow sequences against the set of human sequences. First, we need to tell blast about our database. BLAST needs to do some pre-work on the database file prior to searching. This helps to make the software work a lot faster. Because you installed your own version of the sotware, you need to tell the shell where the software is located. Use the full path and the makeblastdb command:
makeblastdb -in human.1.protein.faa -dbtype prot ls
Note that this makes a lot of extra files, with the same name as the database plus new extensions (.pin, .psq, etc). To make blast work, these files, called index files, must be in the same directory as the fasta file.
blastp [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-negative_seqidlist filename]
[-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
[-db_hard_mask filtering_algorithm] [-subject subject_input_file]
[-subject_loc range] [-query input_file] [-out output_file]
[-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
[-gapextend extend_penalty] [-qcov_hsp_perc float_value]
[-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value]
[-sum_stats bool_value] [-seg SEG_options] [-soft_masking soft_masking]
[-matrix matrix_name] [-threshold float_value] [-culling_limit int_value]
[-best_hit_overhang float_value] [-best_hit_score_edge float_value]
[-window_size int_value] [-lcase_masking] [-query_loc range]
[-parse_deflines] [-outfmt format] [-show_gis]
[-num_descriptions int_value] [-num_alignments int_value]
[-line_length line_length] [-html] [-max_target_seqs num_sequences]
[-num_threads int_value] [-ungapped] [-remote] [-comp_based_stats compo]
[-use_sw_tback] [-version]
Now we can run the blast job. We will use blastp, which is appropriate for protein to protein comparisons.
blastp -query cow.small.faa -db human.1.protein.faa
This gives us a lot of information on the terminal screen. But this is difficult to save and use later - Blast also gives the option of saving the text to a file.
blastp -query cow.small.faa -db human.1.protein.faa -out cow_vs_human_blast_results.txt ls
Take a look at the results using less. Note that there can be more than one match between the query and the same subject. These are referred to as high-scoring segment pairs (HSPs).
less cow_vs_human_blast_results.txt
So how do you know about all the options, such as the flag to create an output file? Lets also take a look at the help pages. Unfortunately there are no man pages (those are usually reserved for shell commands, but some software authors will provide them as well), but there is a text help output
blastp -help
To scroll through slowly
blastp -help | less
To quit the less screen, press the q key.
Parameters of interest include the -evalue (Default is 10?!?) and the -outfmt
Lets filter for more statistically significant matches with a different output format:
blastp \ -query cow.small.faa \ -db human.1.protein.faa \ -out cow_vs_human_blast_results.tab \ -evalue 1e-5 \ -outfmt 7
I broke the long single command into many lines with by “escaping” the newline. That forward slash tells the command line “Wait, I’m not done yet!”. So it waits for the next line of the command before executing.
Check out the results with less.
Lets try a medium sized data set next
head -199 cow.1.protein.faa > cow.medium.faa
What size is this db?
grep -c '^>' cow.medium.faa
Lets run the blast again, but this time lets return only the best hit for each query.
blastp \ -query cow.medium.faa \ -db human.1.protein.faa \ -out cow_vs_human_blast_results.tab \ -evalue 1e-5 \ -outfmt 6 \ -max_target_seqs 1
Review:
Blastn
blastn [-h] [-help] [-import_search_strategy filename]
[-export_search_strategy filename] [-task task_name] [-db database_name]
[-dbsize num_letters] [-gilist filename] [-seqidlist filename]
[-negative_gilist filename] [-negative_seqidlist filename]
[-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
[-db_hard_mask filtering_algorithm] [-subject subject_input_file]
[-subject_loc range] [-query input_file] [-out output_file]
[-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
[-gapextend extend_penalty] [-perc_identity float_value]
[-qcov_hsp_perc float_value] [-max_hsps int_value]
[-xdrop_ungap float_value] [-xdrop_gap float_value]
[-xdrop_gap_final float_value] [-searchsp int_value]
[-sum_stats bool_value] [-penalty penalty] [-reward reward] [-no_greedy]
[-min_raw_gapped_score int_value] [-template_type type]
[-template_length int_value] [-dust DUST_options]
[-filtering_db filtering_database]
[-window_masker_taxid window_masker_taxid]
[-window_masker_db window_masker_db] [-soft_masking soft_masking]
[-ungapped] [-culling_limit int_value] [-best_hit_overhang float_value]
[-best_hit_score_edge float_value] [-window_size int_value]
[-off_diagonal_range int_value] [-use_index boolean] [-index_name string]
[-lcase_masking] [-query_loc range] [-strand strand] [-parse_deflines]
[-outfmt format] [-show_gis] [-num_descriptions int_value]
[-num_alignments int_value] [-line_length line_length] [-html]
[-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
[-version]
DESCRIPTION
Nucleotide-Nucleotide BLAST 2.7.0+
Comments
Thanks for this blog ... here is some reference for BOL users
urbe@urbo214b[Human_vs_Avaga_protein] wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/protein.fa.gz
--2018-03-14 10:21:39-- ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/protein/protein.fa.gz
=> ‘protein.fa.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::7
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.12|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/Homo_sapiens/protein ... done.
==> SIZE protein.fa.gz ... 12021576
==> PASV ... done. ==> RETR protein.fa.gz ... done.
Length: 12021576 (11M) (unauthoritative)
protein.fa.gz 100%[===================>] 11,46M 11,9MB/s in 1,0s
2018-03-14 10:21:42 (11,9 MB/s) - ‘protein.fa.gz’ saved [12021576]
urbe@urbo214b[Human_vs_Avaga_protein] gunzip protein.fa.gz []
urbe@urbo214b[Human_vs_Avaga_protein] wget http://www.genoscope.cns.fr/adineta/data/Adineta_vaga.v2.pep.fa.gz
--2018-03-14 10:23:14-- http://www.genoscope.cns.fr/adineta/data/Adineta_vaga.v2.pep.fa.gz
Resolving www.genoscope.cns.fr (www.genoscope.cns.fr)... 195.83.221.23
Connecting to www.genoscope.cns.fr (www.genoscope.cns.fr)|195.83.221.23|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11437049 (11M) [application/x-gzip]
Saving to: ‘Adineta_vaga.v2.pep.fa.gz’
Adineta_vaga.v2.pep 100%[===================>] 10,91M 8,77MB/s in 1,2s
2018-03-14 10:23:15 (8,77 MB/s) - ‘Adineta_vaga.v2.pep.fa.gz’ saved [11437049/11437049]
urbe@urbo214b[Human_vs_Avaga_protein] gunzip Adineta_vaga.v2.pep.fa.gz []
urbe@urbo214b[Human_vs_Avaga_protein]
urbe@urbo214b[Human_vs_Avaga_protein] wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus/protein/protein.fa.gz
--2018-03-14 11:49:30-- ftp://ftp.ncbi.nlm.nih.gov/genomes/Mus_musculus/protein/protein.fa.gz
=> ‘protein.fa.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.7, 2607:f220:41e:250::11
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /genomes/Mus_musculus/protein ... done.
==> SIZE protein.fa.gz ... 10463933
==> PASV ... done. ==> RETR protein.fa.gz ... done.
Length: 10463933 (10,0M) (unauthoritative)
protein.fa.gz 100%[===================>] 9,98M 9,75MB/s in 1,0s
2018-03-14 11:49:32 (9,75 MB/s) - ‘protein.fa.gz’ saved [10463933]
urbe@urbo214b[Human_vs_Avaga_protein] []
urbe@urbo214b[Human_vs_Avaga_protein] gunzip protein.fa.gz []
urbe@urbo214b[Human_vs_Avaga_protein] mv protein.fa mouse.protein.fa []
urbe@urbo214b[Human_vs_Avaga_protein] makeblastdb -in mouse.protein.fa -dbtype prot
Building a new DB, current time: 03/14/2018 11:52:20
New DB name: /home/urbe/DATA/Human_vs_Avaga_protein/mouse.protein.fa
New DB title: mouse.protein.fa
Sequence type: Protein
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 76216 sequences in 2.87095 seconds.
urbe@urbo214b[Human_vs_Avaga_protein] blastp -query Adineta_vaga.v2.pep.fa -db mouse.protein.fa -out seeHumAvagaAln2_mouse.tab -evalue 1e-20 -outfmt '6 qseqid qlen qstart qend sseqid slen sstart send pident gaps evalue length frames qcovs bitscore' -max_target_seqs 1 -num_threads 40 -qcov_hsp_perc 90 -threshold 90 -ungapped -comp_based_stats F -max_hsps 1
urbe@urbo214b[Human_vs_Avaga_protein] makeblastdb -in human_protein.fa -dbtype prot
urbe@urbo214b[Human_vs_Avaga_protein] blastp -query Adineta_vaga.v2.pep.fa -db human_protein.fa -out seeHumAvagaAln.tab -evalue 1e-20 -outfmt '6 qseqid staxid qstart qend sallseqi sstart send evalue length frames qcovs bitscore' -max_target_seqs 1 -num_threads 40 -qcov_hsp_perc 90 -threshold 90 -ungapped -comp_based_stats F -max_hsps 1
Original paper https://www.sciencedirect.com/science/article/pii/S0022283605803602
Detail tutorial here https://www.ncbi.nlm.nih.gov/books/NBK279684/
Following steps need to followed while doing any homology search.
BLAST (Basic Local Alignment Search Tool) is a widely used program for searching a database of sequences to find those that match a query sequence. Here are the general steps for performing a BLAST search:
Choose the BLAST program: BLAST has several different programs that are tailored to different types of queries and databases. The most common programs are blastp, blastn, blastx, tblastn, and tblastx. Choose the program that is appropriate for your query and database.
Prepare the query sequence: The query sequence should be in a FASTA format and can be entered directly into the BLAST website or saved as a file for local searches.
Choose the database: BLAST searches can be performed against many different databases, including the NCBI GenBank, RefSeq, and non-redundant databases. Choose the database that is most appropriate for your search.
Set the parameters: BLAST has many different parameters that can be adjusted to customize the search. The most important parameters include the e-value threshold, the word size, and the gap penalties.
Perform the search: Once the query sequence and parameters have been set, the BLAST search can be run. This can be done online using the NCBI BLAST website or locally using the command-line interface.
Interpret the results: After the search is complete, the results can be viewed and downloaded. The results include information about the sequences that matched the query sequence, including the alignment score, identity, and e-value.
Overall, BLAST is a powerful tool for searching sequence databases and can be used for many different types of analyses, including gene annotation, functional annotation, and comparative genomics.
BLAST compares a query sequence to a database of known sequences and identifies the most similar matches. This is useful for identifying the function of a newly discovered gene or for understanding the evolutionary relationships between different organisms.
BLAST works by breaking down the query sequence into smaller segments, called "words," and searching for matches to those words in the database. It then expands the matches to identify longer regions of similarity, called "alignments," and calculates a score for each alignment that reflects its degree of similarity.
BLAST is widely used in biological research because it is fast, efficient, and easy to use. It has helped researchers make many important discoveries, including identifying new genes and understanding the genetic basis of diseases.
Understanding the parameters
https://www.arabidopsis.org/Blast/BLASToptions.jsp