BOL: Related items

A Step-by-Step Guide to Running BLAST Offline

LEGE — Sat, 07 Dec 2024 22:32:37 -0600

BLAST (Basic Local Alignment Search Tool) is a powerful algorithm used to compare nucleotide or protein sequences to sequence databases, identifying regions of similarity. Running BLAST offline provides more control, ensures data security, and allows customization for specific research needs. Here’s a detailed guide to set up and run BLAST locally on your system.

Step 1: Install BLAST

Download BLAST:
- Visit the NCBI BLAST+ download page to download the appropriate version for your operating system (Windows, macOS, or Linux).
Install BLAST:
- Extract the downloaded archive. For Linux/Mac, use:
```
tar -xvzf ncbi-blast-*.tar.gz
# The trailing slash makes the wildcard match only the extracted directory, not the archive
cd ncbi-blast-*/
```
- Add the BLAST binary folder to your system PATH for easier access:
```
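# Use the real extracted directory name here; a wildcard is not reliably expanded in this assignment
export PATH="$PATH:/path/to/ncbi-blast-<version>/bin"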
```
Verify Installation:
Run the following command to ensure BLAST is installed correctly:
```
blastn -version
```

Step 2: Prepare a Local Database

To run BLAST offline, you’ll need a sequence database.

Download a Pre-Built Database (Optional):
- NCBI provides ready-to-use databases such as nt, nr, and Swiss-Prot. Use the update_blastdb.pl script (bundled with BLAST) to download these:
```
update_blastdb.pl --decompress nt
```
Create a Custom Database:
If you have specific sequences to use as a database:
- Prepare a FASTA file containing the sequences.
- Use makeblastdb to create a database:
```
makeblastdb -in your_sequences.fasta -dbtype nucl -out custom_db
```
  Use -dbtype nucl for nucleotide sequences or -dbtype prot for protein sequences.

Step 3: Prepare the Query Sequence

Save your query sequence(s) in FASTA format.
Ensure the file is properly formatted, with a header line starting with > followed by the sequence name, and the sequence on subsequent lines.

Example:

```
>query_sequence
ATGCGTAGCTAGCGTAGCTAGCTAGCTA
```
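
Before running a search, it can help to confirm that the query file really follows this layout. The sketch below is one minimal way to do that in Python; the function name `read_fasta` is hypothetical and not part of BLAST, and the check covers only the header/sequence structure described above.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}; raise ValueError if malformed."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("header line has no sequence name")
            records[header] = []
        elif header is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            records[header].append(line)
    # Join multi-line sequences into a single string per record
    return {name: "".join(parts) for name, parts in records.items()}


example = ">query_sequence\nATGCGTAGCTAGCGTAGCTAGCTAGCTA\n"
print(read_fasta(example))
```

A file that starts with sequence data instead of a `>` line raises `ValueError`, which is usually the first formatting problem worth catching.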

Step 4: Run BLAST

Choose the Appropriate BLAST Tool:
Depending on your data type:
- blastn: For nucleotide-nucleotide searches.
- blastp: For protein-protein searches.
- blastx: Translates nucleotide sequences into proteins and compares them to a protein database.
- tblastn: Compares protein sequences to a nucleotide database.
- tblastx: Translates both nucleotide query and database sequences.
Run the Command:
Example command for blastn:
```
blastn -query query.fasta -db custom_db -out results.txt -outfmt 6 -evalue 1e-5
```
Explanation of Parameters:
- -query: Specifies the query file.
- -db: Points to the local database.
- -out: Output file name.
- -outfmt: Output format (e.g., 6 for tabular format).
- -evalue: E-value cutoff for significance.
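
If you launch searches from a script, the same parameters can be assembled programmatically. This is a minimal sketch, assuming `blastn` is on your PATH; `build_blastn_command` is a hypothetical helper name, and it only builds the argument list without running anything.

```python
import shlex


def build_blastn_command(query, db, out, outfmt=6, evalue=1e-5):
    """Return the blastn argument list for the flags explained above."""
    return [
        "blastn",
        "-query", str(query),    # query FASTA file
        "-db", str(db),          # local database name
        "-out", str(out),        # output file name
        "-outfmt", str(outfmt),  # 6 = tabular
        "-evalue", str(evalue),  # E-value cutoff
    ]


cmd = build_blastn_command("query.fasta", "custom_db", "results.txt")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

To actually run the search, the list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids quoting problems with file names that contain spaces.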

Step 5: Interpret Results

Output Formats:
- Default (outfmt 0): Human-readable pairwise alignments.
- Tabular (outfmt 6): One line per hit with 12 default columns: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
Analyze Results:
Use tools like grep, Python, or R to parse and filter results for downstream analysis.
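
As a sketch of the Python route, the snippet below reads tabular (outfmt 6) lines and keeps only strong hits. It assumes the default 12-column layout of `-outfmt 6`; the function names, the thresholds, and the two sample rows are made up for illustration.

```python
import csv

# Default column order for -outfmt 6
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_blast_tabular(lines):
    """Turn tab-separated BLAST lines into a list of dicts with typed values."""
    hits = []
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST6_COLUMNS, row))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


def filter_hits(hits, min_identity=90.0, max_evalue=1e-5):
    """Keep hits at or above min_identity and at or below max_evalue."""
    return [h for h in hits
            if h["pident"] >= min_identity and h["evalue"] <= max_evalue]


# Made-up sample rows; with a real run, use open("results.txt") instead
sample = [
    "query_sequence\tsubj1\t98.5\t28\t0\t0\t1\t28\t101\t128\t2e-10\t52.8",
    "query_sequence\tsubj2\t80.0\t20\t4\t0\t1\t20\t5\t24\t0.5\t20.1",
]
hits = parse_blast_tabular(sample)
strong = filter_hits(hits)
print([h["sseqid"] for h in strong])
```

Because an open file is an iterable of lines, `parse_blast_tabular(open("results.txt"))` works the same way on real output.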

Step 6: Optimize Performance

For large datasets, BLAST can be resource-intensive. To improve performance:

Multithreading:
Use the -num_threads option to leverage multiple CPU cores:

```
blastn -query query.fasta -db custom_db -out results.txt -num_threads 4
```

Database Subsetting:
Split large databases into smaller chunks for faster searches.
Adjust Parameters:
- Lower the -evalue threshold for stricter matches.
- Use -max_target_seqs to limit the number of results per query.

Step 7: Update Databases (Optional)

If using NCBI databases, regularly update them to ensure the inclusion of the latest sequences:

```
update_blastdb.pl --decompress nt
```

Conclusion

Running BLAST offline is a straightforward process that offers flexibility and security for bioinformaticians working with sensitive data. By following this guide, you can harness the power of BLAST to analyze sequences efficiently and gain valuable biological insights.

For advanced use cases, explore BLAST’s customization options, such as custom scoring matrices, filtering, and iterative searches with tools like PSI-BLAST. Happy BLASTing!

Explore taxdump files !

Jit — Sat, 08 Feb 2020 04:44:55 -0600

This is an extract of taxdump-readme.txt to be found at 
ftp://ftp.ncbi.nih.gov/pub/taxonomy/

The content of the archive
--------------------------

It may look like this:

delnodes.dmp
division.dmp
gencode.dmp
merged.dmp
names.dmp
nodes.dmp
readme.txt

The readme.txt file gives a brief description of the *.dmp files. These files
contain taxonomic information and are briefly described below. Each file
stores one record per line, and each record is terminated by "\t|\n"
(tab, vertical bar, and newline) characters. Each record consists of one
or more fields delimited by "\t|\t" (tab, vertical bar, and tab) characters.
A brief description of field positions and meanings for each file follows.
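
The delimiters described above can be handled with a few lines of Python. This is a minimal sketch; `parse_dmp_line` is a hypothetical helper name, and the sample lines are illustrative.

```python
def parse_dmp_line(line):
    """Split one *.dmp record into its fields."""
    record = line.rstrip("\n")
    # Drop the record terminator (tab + vertical bar) left after the newline is removed
    if record.endswith("\t|"):
        record = record[:-2]
    # Fields are separated by tab + vertical bar + tab
    return record.split("\t|\t")


print(parse_dmp_line("1\t|\tall\t|\t\t|\tsynonym\t|\n"))
print(parse_dmp_line("9606\t|\n"))
```

Empty fields are preserved as empty strings, so field positions stay aligned with the per-file descriptions.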

nodes.dmp
---------

This file represents taxonomy nodes. The description for each node includes 
the following fields:

	tax_id					-- node id in GenBank taxonomy database
 	parent tax_id				-- parent node id in GenBank taxonomy database
 	rank					-- rank of this node (superkingdom, kingdom, ...) 
 	embl code				-- locus-name prefix; not unique
 	division id				-- see division.dmp file
 	inherited div flag  (1 or 0)		-- 1 if node inherits division from parent
 	genetic code id				-- see gencode.dmp file
 	inherited GC  flag  (1 or 0)		-- 1 if node inherits genetic code from parent
 	mitochondrial genetic code id		-- see gencode.dmp file
 	inherited MGC flag  (1 or 0)		-- 1 if node inherits mitochondrial gencode from parent
 	GenBank hidden flag (1 or 0)            -- 1 if name is suppressed in GenBank entry lineage
 	hidden subtree root flag (1 or 0)       -- 1 if this subtree has no sequence data yet
 	comments				-- free-text comments and citations
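
Since every node stores its parent's tax_id, a lineage can be recovered by following parent links up to the root. The sketch below assumes the root node lists itself as its own parent; the function names are hypothetical, and the toy rows use made-up ids and are truncated to the first three fields.

```python
def load_nodes(lines):
    """Map tax_id -> (parent tax_id, rank) from nodes.dmp lines."""
    nodes = {}
    for line in lines:
        record = line.rstrip("\n")
        if record.endswith("\t|"):
            record = record[:-2]
        fields = record.split("\t|\t")
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def lineage(tax_id, nodes):
    """Return [(tax_id, rank), ...] from the given node up to the root."""
    path = []
    while True:
        parent_id, rank = nodes[tax_id]
        path.append((tax_id, rank))
        if parent_id == tax_id:  # assumed root marker: node is its own parent
            break
        tax_id = parent_id
    return path


# Toy rows with made-up ids, only the first three fields present
toy_nodes = load_nodes([
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tkingdom\t|\n",
    "100\t|\t10\t|\tspecies\t|\n",
])
print(lineage(100, toy_nodes))
```

With the real nodes.dmp, `load_nodes(open("nodes.dmp"))` builds the same mapping; only the first three fields are read, so the remaining columns are ignored.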

names.dmp
---------
Taxonomy names file has these fields:

	tax_id					-- the id of node associated with this name
	name_txt				-- name itself
	unique name				-- the unique variant of this name if name not unique
	name class				-- (synonym, common name, ...)
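
A tax_id usually has several rows in names.dmp, one per name class. A minimal sketch for picking one display name per node follows; it assumes the name class value `scientific name` (not listed in the extract above) marks the primary name, and `scientific_names` is a hypothetical helper.

```python
def scientific_names(lines):
    """Map tax_id -> name_txt for rows whose name class is 'scientific name'."""
    names = {}
    for line in lines:
        record = line.rstrip("\n")
        if record.endswith("\t|"):
            record = record[:-2]
        tax_id, name_txt, unique_name, name_class = record.split("\t|\t")
        if name_class == "scientific name":  # assumed class value
            names[int(tax_id)] = name_txt
    return names


# Illustrative sample rows in names.dmp layout
sample_names = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
]
print(scientific_names(sample_names))
```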

division.dmp
------------
Divisions file has these fields:
	division id				-- taxonomy database division id
	division cde				-- GenBank division code (three characters), e.g. BCT, PLN, VRT, MAM, PRI...
	division name				-- full division name
	comments

gencode.dmp
-----------
Genetic codes file:

	genetic code id				-- GenBank genetic code id
	abbreviation				-- genetic code name abbreviation
	name					-- genetic code name
	cde					-- translation table for this genetic code
	starts					-- start codons for this genetic code

delnodes.dmp
------------
Deleted nodes (nodes that existed but were deleted) file field:

	tax_id					-- deleted node id

merged.dmp
----------
Merged nodes file fields:

	old_tax_id                              -- id of the node that has been merged
	new_tax_id                              -- id of the node that is the result of the merge
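
When working with tax_ids taken from older data, merged.dmp and delnodes.dmp can be used together to bring them up to date. The sketch below is one possible approach; the helper names and the sample ids are illustrative only.

```python
def first_fields(line, count):
    """Return the first `count` fields of a *.dmp record."""
    record = line.rstrip("\n")
    if record.endswith("\t|"):
        record = record[:-2]
    return record.split("\t|\t")[:count]


def load_merged(lines):
    """Map old_tax_id -> new_tax_id from merged.dmp lines."""
    return {int(old): int(new)
            for old, new in (first_fields(line, 2) for line in lines)}


def load_deleted(lines):
    """Collect deleted tax_ids from delnodes.dmp lines."""
    return {int(first_fields(line, 1)[0]) for line in lines}


def resolve_tax_id(tax_id, merged, deleted):
    """Return the current tax_id, or None if the node was deleted."""
    if tax_id in deleted:
        return None
    return merged.get(tax_id, tax_id)


# Illustrative sample rows
merged = load_merged(["12\t|\t74109\t|\n"])
deleted = load_deleted(["7\t|\n"])
print(resolve_tax_id(12, merged, deleted))
print(resolve_tax_id(7, merged, deleted))
```

Ids that appear in neither file are returned unchanged, so the function can be applied to a whole column of tax_ids.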

Conserved Domain Database (CDD) version 3.11 released

Shikha Logwani — Wed, 19 Feb 2014 15:02:40 -0600

National Center for Biotechnology Information (NCBI) Conserved Domain Database (CDD) version 3.11 is now available with 596 new or updated NCBI-curated and 49,641 total domain models. The new version now contains the most recent Pfam release 27.

Updates to the Conserved Domain Database include:

Position-specific score matrices (PSSMs) have been recomputed for many models in CDD, and frequency tables have been added to the PSSMs;

The search databases distributed as part of this release can now be used with the more recent versions of RPS-BLAST (BLAST release 2.2.28 and up) using composition-based scoring. This abolishes the need to mask out compositionally biased regions in query sequences;

Domain annotation displays in CD-Search, BATCH CD-Search, and other services now all use a uniform display style. A new display option in CD-Search and BATCH CD-Search provides “standard” results, in addition to “concise” and “full” results. “Standard” results will provide, for each region on the query sequence, the best-scoring domain model (if any) from each of CDD’s database providers (Pfam, SMART, COG, TIGRFAMs, Protein Clusters, and the NCBI in-house curation project), but will suppress redundancy from within a single provider's results list.

You can access CDD at the Conserved Domains homepage and find updated content on the CDD FTP site.

Reference:

NCBI Website

HistoneDB 2.0 – with variants

Anjana — Fri, 03 Jun 2016 05:06:20 -0500

This histone database can be used to explore the diversity of histone proteins and their sequence variants in many organisms. The resource was established to better understand how sequence variation may affect functional and structural features of nucleosomes. To get started, select a histone type to explore its variants.

More at http://www.ncbi.nlm.nih.gov/projects/HistoneDB2.0/index.fcgi/browse/

Download assemblies from NCBI

Bulbul — Mon, 15 May 2017 06:02:32 -0500

A new “Download assemblies” button is now available in the Assembly database. This makes it easy to download data for multiple genomes without having to write scripts.

For example, you can run a search in Assembly and use check boxes (see left side of screenshot below) to refine the set of genome assemblies of interest. Then, just open the “Download assemblies” menu, choose the source database (GenBank or RefSeq), choose the file type, and start the download. An archive file will be saved to your computer that can be expanded into a folder containing your selected genome data files.

More at https://ncbiinsights.ncbi.nlm.nih.gov/2017/05/08/genome-data-download-made-easy/

IgBLAST: a popular NCBI package for classifying and analyzing immunoglobulin (IG) and T cell receptor (TCR) variable domain sequences

BioJoker — Thu, 23 Jan 2020 11:34:37 -0600

The NCBI team has released a new version of IgBLAST with four improvements. IgBLAST is a popular NCBI package for classifying and analyzing immunoglobulin (IG) and T cell receptor (TCR) variable domain sequences. The improvements are:

1. Support for the new FWR4 annotation feature in the AIRR format, both in standard format and in the AIRR alignment format.

2. The previous “-penalty” parameter was renamed as -V_penalty to be consistent with other IgBLAST penalty options.

3. Restored constant internal BLAST search parameters for domain annotation (i.e., FWR/CDR) such that this process is not influenced by user parameters.

4. Corrected FWR/CDR annotations for certain mouse VK and rat VH germline genes.

IgBLAST 1.15.0 is available for download from the BLAST FTP area. See the new manual on GitHub for information about setting up and running IgBLAST.

If you have any questions or concerns, please contact blast-help@ncbi.nlm.nih.gov

Bioinformatics Codes Search

Jitendra Narayan — Thu, 15 Aug 2013 11:08:52 -0500

I bet this website will become your best friend in the near future. It helps us explore existing open-source code and learn from it.

You can find useful open-source bioinformatics code for your analysis work. Use the options in the left sidebar to filter or narrow down your search results. This page can be a useful resource for beginner bioinformaticians, as it contains several basic bioinformatics scripts that are commonly used by biological programmers and biologists.

Stand on the slumped, dandruff-covered shoulders of millions of computer nerds. _/\_

Enjoy the code and research work.

http://code.ohloh.net/search?s=bioinformatics

Tools for Searching Repeats And Palindromic Sequences

Radha Agarkar — Sat, 21 May 2016 22:32:25 -0500

What are genomic interspersed repeats?

In the mid-1960s, scientists discovered that many genomes contain stretches of highly repetitive DNA sequences (see Reassociation Kinetics Experiments and the C-Value Paradox). These sequences were later characterized and placed into the following categories:

- Simple Repeats: Duplications of simple sets of DNA bases (typically 1-5 bp) such as A, CA, CGG, etc.
- Tandem Repeats: Duplications of more complex 100-200 base sequences, typically found at the centromeres and telomeres of chromosomes.
- Segmental Duplications: Large blocks of 10-300 kilobases that have been copied to another region of the genome.
- Interspersed Repeats:
  - Processed Pseudogenes, Retrotranscripts, SINEs: Non-functional copies of RNA genes that have been reintegrated into the genome with the assistance of a reverse transcriptase.
  - DNA Transposons
  - Retrovirus Retrotransposons
  - Non-Retrovirus Retrotransposons (LINEs)

Currently up to 50% of the human genome is repetitive in nature and as improvements are made in detection methods this number is expected to increase.

In genetics, by contrast, the term palindrome refers to a sequence of nucleotides along a DNA (deoxyribonucleic acid) or RNA (ribonucleic acid) strand that contains the same series of nitrogenous bases regardless of the direction from which the strand is analyzed. This is akin to a language palindrome, in which a word or phrase is spelled the same left-to-right as right-to-left (e.g., the word RADAR or the phrase "able was I ere I saw elba"). With genetic palindromes, it does not matter whether the nucleic acid strand is read starting from the 3' (three prime) end or the 5' (five prime) end of the strand.

Recent research on palindromes centers on understanding palindrome formation during gene amplification. Other studies have attempted to relate palindrome formation to molecular mechanisms involved in double stranded breaks and in the formation of inverted repeats. Assisted by high speed computers, other groups of scientists link palindrome formation to the conservation of genetic information.

Related to the direction of transcription by RNA polymerase, DNA strands have upstream and downstream termini defined by differing chemical groups at each end. The ends of each strand of DNA or RNA are termed the 5' end (a phosphate group bound to the 5' carbon of the sugar) and the 3' end (a hydroxyl group bound to the 3' carbon) to indicate a polarity within the molecule. Geneticists use the letters A, T, C, and G to represent the nitrogenous bases adenine, thymine, cytosine, and guanine found in DNA, and the letters A, U, C, and G to represent adenine, uracil, cytosine, and guanine found in RNA (note that uracil in RNA replaces the thymine found in DNA). DNA is therefore usually represented by a series of base codes (e.g., 5' AATCGGATTGCA 3'), arranged from the 5' end to the 3' end.

Because of specific base pairing in DNA (i.e., adenine (A) always bonds with thymine (T), and cytosine (C) always bonds with guanine (G)), the complementary strand to the sequence 5' AATCGGATTGCA 3' would be 3' TTAGCCTAACGT 5'.

With palindromes, the sequences on the complementary strands read the same in the 5' to 3' direction. For example, a sequence of 5' GAATTC 3' on one strand would be complemented by a 3' CTTAAG 5' strand. When either strand is read from its 5' end, the sequence is GAATTC. Note that a sequence such as 5' CGAAGC 3', which reads the same when its letters are simply reversed on one strand, is a mirror repeat rather than a palindrome in this sense: its complementary strand reads GCTTCG from the 5' end.

Palindromes are important sequences within nucleic acids. Often they are the site of binding for specific enzymes (e.g., restriction endonucleases) designed to cut the DNA strands at specific locations (i.e., at palindromes).

Palindromes may arise from breakage and chromosomal inversions that form inverted repeats that complement each other. When a palindrome results from an inversion, it is often referred to as an inverted repeat. For example, the sequence 5' GAATTC 3', if inverted (rotated 180° so that the complementary strand is read from its 5' end), still reads GAATTC.
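
The base-pairing rules make it easy to test a sequence in code: a DNA sequence is palindromic in the double-stranded sense when it equals its own reverse complement. The following is a minimal sketch with hypothetical function names, assuming plain A/C/G/T input.

```python
# Translation table implementing the A-T and C-G pairing rules
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindromic(seq):
    """True if the sequence reads the same 5' to 3' on both strands."""
    seq = seq.upper()
    return seq == reverse_complement(seq)


for candidate in ("GAATTC", "CGAAGC", "AATCGGATTGCA"):
    print(candidate, reverse_complement(candidate), is_palindromic(candidate))
```

Under this definition GAATTC passes the check. CGAAGC reads the same when its letters are reversed, but it is not equal to its reverse complement (GCTTCG), so the check reports it as non-palindromic.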

The European Molecular Biology Open Software Suite (EMBOSS) includes some basic tools for finding tandem repeats and inverted repeats (see B.6.22. Applications in group Nucleic:repeats). There are many on-line services providing the EMBOSS tools, for example:

Wageningen Bioinformatics Webportal EMBOSS explorer
Mobyle@Pasteur
Soaplab2 Web Services at Vital-IT

For more sophisticated repeat finding you will want to look at tools using Repbase for example:

Other nucleotide repeat finding methods found by a couple of web searches:

MMseqs2.0: ultra fast and sensitive protein search and clustering suite

Jit — Thu, 22 Mar 2018 10:40:51 -0500

MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein sequence sets. MMseqs2 is open source GPL-licensed software implemented in C++ for Linux, MacOS, and (as beta version, via cygwin) Windows. The software is designed to run on multiple cores and servers and exhibits very good scalability. MMseqs2 can run 10000 times faster than BLAST. At 100 times its speed it achieves almost the same sensitivity. It can perform profile searches with the same sensitivity as PSI-BLAST at over 400 times its speed.

The MMseqs2 user guide is available as Github Wiki or as PDF file (Thanks to pandoc!)

Please cite: Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017).

Address of the bookmark: https://github.com/soedinglab/MMseqs2

Common Bioinformatics Interview Questions !

Jit — Sat, 23 Jan 2021 06:07:50 -0600

The prospect of an interview for a bioinformatics position in the life sciences can be unsettling, but in my experience the same questions come up time and again. It is therefore well worth preparing for likely bioinformatics interview questions. Doing so will give you a real advantage in obtaining the position.

The following 5 questions are ones that I have heard many times during the job-search process. There is no reason not to prepare answers to them in advance.

1. Tell Us About Yourself
This is a very typical opener in interviews. It is a perfectly reasonable question, and having an answer prepared will help you focus and ease into the conversation. However, make sure that your response is relevant to the job you are interviewing for.
It's probably best to keep your answer professional. Try to cover the following as well: Where did your love of science and bioinformatics come from? How did you end up in this field? Why programming and scripting?

2. What is your plan for your bioinformatics career? / Where do you see yourself in five years? / What are your personal objectives for accomplishing these goals? / What is your plan for research funding?

Your CV/resume has already impressed the selection panel if you have been invited for an interview. The questions from the bioinformatics interview team give you an opportunity to market yourself and to show that you have the most relevant knowledge for the job in question.

3. What do you understand about the job description/What would your suggested research path be if you were a successful candidate?
Summarize the specifics of the advertised bioinformatics position in your own words. Follow on with some suggestions of how you want to extend your research and create your own projects within the community.

4. Do you prefer to work in a group or on your own?
The expectation can vary from job to job, so be careful when answering. A company or research PI may need a bioinformatician who can work on a single project autonomously, or they may need a person who can help direct and organize a team. In your response, refer to the job description.

5. What particular methods have you used to date in your experiments?
You might have experience with all the laboratory techniques described in the job description, but stress the ones you are most experienced with. Highlight your professional abilities and stress that you are fully capable of mastering new techniques with others ...

At the end of the day, remember that you are interviewing the panel just as much as they are interviewing you. You ought to think of questions you would like to ask the interview panel. This shows that you have done your homework and are serious about the position.

All the best for your future job interview.