BOL: Related items

BLAST+ updated !!!

Jit — Tue, 16 Jun 2015 16:55:24 -0500

A new version (2.2.31) of the stand-alone BLAST executables (Linux, Windows and MacOSX on FTP) is now available. New features include support for the BLAST-XML2 specification (information here) and the JSON BLAST output format, as well as several bug fixes and improvements. The BLAST AMI at AWS will also be updated to 2.2.31 (see this BLAST Help page for more information). For a full list of improvements, see the release notes.

More at http://www.ncbi.nlm.nih.gov/news/06-16-2015-blast-plus-update/?

sequenceserver

Jit — Fri, 10 Mar 2017 08:51:55 -0600

SequenceServer lets you rapidly set up a BLAST+ server with an intuitive user interface for use locally or over the web.

More at http://sequenceserver.com.

Address of the bookmark: https://github.com/wurmlab/sequenceserver

BLAST nr version 5 database, (nr_v5)

Jit — Fri, 23 Aug 2019 11:35:35 -0500

NCBI has changed the nr version 5 database (nr_v5) to give better search results and improved performance by reducing the number of redundant titles. The nr_v5 database is used by webBLAST and is also available to BLAST+ users.

The changes in nr preserve the taxonomic diversity of the entries in the database while reducing the number of titles for identical sequences. GenPept accessions are still accessible via www.ncbi.nlm.nih.gov/protein/$GENBANK_ACCESSION or the IPG website https://www.ncbi.nlm.nih.gov/ipg/.

The "Identical Proteins" link in the alignments section of the webBLAST results takes you to a full list of all accessions associated with a sequence.

For BLAST+ users downloading nr_v5: the database is now approximately 50% smaller, resulting in faster downloads and BLAST searches, and smaller disk space requirements. The database is downloadable at: ftp://ftp.ncbi.nlm.nih.gov/blast/db/v5/

For BLAST+ there is a cleanup script to help you manage the transition to this smaller database. The script removes unused database volumes: ftp://ftp.ncbi.nlm.nih.gov/blast/temp/cleanup-blastdb-volumes.py

Here are the new rules on how we keep titles in nr_v5:

1. We keep all refseq, swissprot, pir and PDB titles.

2. We keep any GenPept titles with a TAXID that has not already been seen in the record.

3. We keep at least five GenPept titles regardless of whether the TAXIDS have been seen before or not in this record.
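
The three rules above can be sketched in Python. This is only one reading of the rules, not NCBI's implementation: the `Title` record, the source labels, and the assumption that titles are processed in their listed order (with every title's TAXID counting as "seen") are all hypothetical.

```python
from typing import NamedTuple

# Rule 1: titles from these sources are always kept (labels are assumptions).
ALWAYS_KEEP = {"refseq", "swissprot", "pir", "pdb"}
# Rule 3: keep at least this many GenPept titles per record.
MIN_GENPEPT_TITLES = 5


class Title(NamedTuple):
    source: str  # e.g. "refseq" or "genpept" (hypothetical labels)
    taxid: int
    text: str


def select_titles(titles):
    """Return the titles kept for one record of identical sequences."""
    keep = [False] * len(titles)
    seen_taxids = set()
    genpept_kept = 0

    for i, title in enumerate(titles):
        if title.source.lower() in ALWAYS_KEEP:
            keep[i] = True  # rule 1
        elif title.taxid not in seen_taxids:
            keep[i] = True  # rule 2: first GenPept title for this TAXID
            genpept_kept += 1
        seen_taxids.add(title.taxid)

    # Rule 3: top up with skipped GenPept titles until five are kept.
    for i in range(len(titles)):
        if genpept_kept >= MIN_GENPEPT_TITLES:
            break
        if not keep[i]:
            keep[i] = True
            genpept_kept += 1

    return [title for title, kept in zip(titles, keep) if kept]
```

With one RefSeq title and many GenPept titles sharing its TAXID, the sketch keeps the RefSeq title plus five GenPept titles and drops the rest.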

Primer BLAST !

BioStar — Tue, 28 Apr 2020 00:28:49 -0500

The BLAST team added a new feature (Max 3' match), shown in Figure 1, to Primer-BLAST that limits the length of 3' exon matches when designing exon-exon spanning primers. This makes it less likely that primers specifically designed to amplify transcripts will also amplify genomic DNA contamination in expression assays. See the NCBI Insights post (https://go.usa.gov/xvUT4) for more details.
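
To make the idea concrete, here is a minimal Python sketch of how a junction-spanning forward primer splits across two exons, and how a cap on the 3' side could be checked. The coordinate convention, the function names, and the threshold values are assumptions for illustration, not Primer-BLAST's code or defaults.

```python
def junction_split(primer_start, primer_len, junction):
    """Return (bases on the 5' exon, bases on the 3' exon) for a forward primer.

    Coordinates are 0-based positions on the transcript; `junction` is the
    index of the first base of the downstream (3') exon.
    """
    primer_end = primer_start + primer_len  # exclusive end
    if not (primer_start < junction < primer_end):
        raise ValueError("primer does not span the exon-exon junction")
    return junction - primer_start, primer_end - junction


def within_max_3prime_match(primer_start, primer_len, junction, max_3prime_match):
    """True if the part of the primer on the 3' exon is short enough."""
    _, three_prime_bases = junction_split(primer_start, primer_len, junction)
    return three_prime_bases <= max_3prime_match
```

A primer whose 3' portion lies almost entirely within one exon could still anneal to genomic DNA through that exon alone, which is why limiting that portion helps.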

If you have any questions or concerns, please contact blast-help@ncbi.nlm.nih.gov

Cleaner BLAST Databases for More Accurate Results

LEGE — Tue, 23 Apr 2024 01:23:08 -0500

Do you use BLAST to identify a sequence or the evolutionary scope of a gene? That can be challenging if contaminated and misclassified sequences are in the BLAST databases and show up in your search results. To address this problem, we now use the NCBI quality assurance tools listed below to systematically remove these misleading sequences from the default nucleotide (nt) and protein (nr) BLAST databases.

- Foreign Contamination Screen tool for genome cross-species screening (FCS-GX) detects contamination from foreign organisms in genomes and other sequences using the genome cross-species aligner (GX).
- Average Nucleotide Identity (ANI) evaluates the taxonomic classification of prokaryotic genome assemblies. Sequences from genomes marked up as ‘unverified source organism’ are considered suspect and removed.

Ref https://ncbiinsights.ncbi.nlm.nih.gov/2024/04/22/cleaner-blast-databases-more-accurate-results/

REST API

Neel — Mon, 04 Oct 2021 12:46:40 -0500

The Representational State Transfer (REST) sample clients are provided for a number of programming languages. For details of how to use these clients, download the client and run the program without any arguments.

Language	Download	Requirements
Perl	psiblast.pl	LWP and XML::Simple
Python	psiblast.py	xmltramp2

For details, see the Environment setup for REST Web Services and Examples for Perl REST Web Services Clients pages.

BEAP: Blast Extension and Assembly Program

Shruti Paniwala — Mon, 11 Jun 2018 04:52:56 -0500

The Blast Extension and Assembly Program (BEAP) uses a short starting DNA fragment, often an EST or partial gene segment, as a "primer" to recursively BLAST nucleotide databases and collect all sequences that overlap with it, directly or indirectly. These sequences help "extend" the original fragment into a "full length" sequence for functional analysis, or at least recover neighboring regions of the segment for SNP discovery and linkage disequilibrium analysis. Confidence in the resulting assembly comes from using a known genome, such as the human genome, as a reference. https://www.animalgenome.org/tools/beap/

Address of the bookmark: https://www.animalgenome.org/tools/beap/

Arvados

Martin Jones — Sat, 20 Sep 2014 16:54:21 -0500

Arvados is a free and open source bioinformatics platform for genomic and biomedical data. Users can store, organize, compute on, and share data for free.

Address of the bookmark: https://arvados.org/

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from short DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together short DNA sequence reads generated by sequencing platforms (e.g., Illumina, PacBio, Oxford Nanopore) into longer, contiguous sequences called contigs. This can be performed as:

De Novo Assembly: Without a reference genome.
Reference-Guided Assembly: Using a reference genome to guide the assembly process.

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, accurate reads ideal for scaffolding.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq
multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
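
To see what "low-quality bases" means in practice, the following Python sketch computes the mean Phred score per read from a FASTQ file. It assumes well-formed four-line records and Phred+33 encoding; the function names and the threshold of 20 are illustrative choices, and this is not a replacement for FastQC.

```python
from statistics import mean


def read_fastq(lines):
    """Yield (name, sequence, quality string) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line
        qual = next(it).rstrip("\n")
        yield header[1:], seq, qual


def mean_phred(qual, offset=33):
    """Mean Phred score of one quality string (Phred+33 assumed)."""
    return mean(ord(char) - offset for char in qual)


def low_quality_reads(lines, threshold=20.0):
    """Names of reads whose mean Phred score is below the threshold."""
    return [name for name, _, qual in read_fastq(lines)
            if mean_phred(qual) < threshold]
```

For a real file, pass an open file handle, for example `low_quality_reads(open("reads.fastq"))`.
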
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:

trimmomatic PE reads_R1.fastq reads_R2.fastq trimmed_R1.fastq unpaired_R1.fastq trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
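
The SLIDINGWINDOW:4:20 and MINLEN:36 settings can be illustrated with a simplified Python sketch: scan the read from the 5' end, cut at the first 4-base window whose average quality drops below 20, then discard the read if fewer than 36 bases remain. This is a simplified model with hypothetical function names; Trimmomatic's own implementation differs in detail.

```python
def keep_length(quals, window=4, min_avg=20.0):
    """Number of leading bases kept by a simplified sliding-window trim."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start  # cut where the first failing window begins
    return len(quals)  # no window failed (or read shorter than the window)


def trim_read(seq, quals, window=4, min_avg=20.0, min_len=36):
    """Return the trimmed sequence, or None if it is shorter than min_len."""
    kept = keep_length(quals, window, min_avg)
    return seq[:kept] if kept >= min_len else None
```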

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.
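
The choice above boils down to which read types you have. A small Python sketch of that decision, using only the tools listed in this guide (the function name and the lookup table layout are illustrative):

```python
# Assemblers grouped as in the list above.
ASSEMBLERS = {
    "short": ["SPAdes", "Velvet"],
    "long": ["Canu", "Flye"],
    "hybrid": ["MaSuRCA", "Unicycler"],
}


def candidate_assemblers(has_short_reads, has_long_reads):
    """Suggest assemblers from this guide for the available read types."""
    if has_short_reads and has_long_reads:
        return ASSEMBLERS["hybrid"]
    if has_long_reads:
        return ASSEMBLERS["long"]
    if has_short_reads:
        return ASSEMBLERS["short"]
    raise ValueError("at least one read type is required")
```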

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:

quast contigs.fasta -o quast_output
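
To make the statistics less of a black box, here is a short Python sketch of how N50 and GC content can be computed from a contigs FASTA file. N50 is the contig length at which contigs of that length or longer cover at least half of the total assembly. The function names are illustrative, and QUAST's own report remains the reference.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n50(lengths):
    """Length L such that contigs of length >= L cover >= 50% of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def gc_content(sequences):
    """Fraction of G and C bases across all sequences."""
    total = sum(len(seq) for seq in sequences)
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in sequences)
    return gc / total if total else 0.0
```

For a real assembly: `contigs = [seq for _, seq in read_fasta(open("contigs.fasta"))]`, then `n50([len(c) for c in contigs])`.
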
BUSCO
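BUSCO checks genome completeness by searching for conserved single-copy orthologs. Pick the lineage dataset (-l) that matches your organism; the example uses fungi_odb10: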

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa

Step 5: Post-Assembly Steps

Polishing
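Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads). In the Racon command below, mapped_reads.sam holds the long reads aligned back to contigs.fasta (for example, with minimap2).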

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or OPERA-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or MAKER for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.
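
Coverage is usually reported as total sequenced bases divided by genome size. A minimal Python sketch of that arithmetic follows; the helper names are hypothetical, and the size-string parser only handles the k/m/g suffix style used in the Canu example (genomeSize=4.7m).

```python
SUFFIXES = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}


def parse_genome_size(text):
    """Convert a size such as '4.7m' or '4700000' into a number of bases."""
    text = text.strip().lower()
    if text and text[-1] in SUFFIXES:
        return int(round(float(text[:-1]) * SUFFIXES[text[-1]]))
    return int(text)


def estimated_coverage(read_lengths, genome_size):
    """Average depth of coverage: total read bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return sum(read_lengths) / genome_size
```

For example, one million 150 bp reads over a 4.7 Mb genome give roughly 32x coverage (150,000,000 / 4,700,000).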

Best Practices

Always perform quality checks at each stage to ensure data integrity.
Use multiple tools to cross-validate results when working with complex genomes.
Document parameters and software versions for reproducibility.
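
One way to document parameters and versions is to write them to a small JSON record alongside the results. The Python sketch below is an assumption-laden example: the function names are hypothetical, and the version command you pass for each tool (for example `["spades.py", "--version"]`) should be checked against that tool's own help output.

```python
import json
import shutil
import subprocess


def tool_version(command):
    """First line of a tool's version output, or a note if it is unavailable."""
    if shutil.which(command[0]) is None:
        return "not installed"
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    output = (result.stdout or result.stderr).strip()  # some tools print to stderr
    return output.splitlines()[0] if output else "unknown"


def build_run_record(parameters, version_commands):
    """Return a JSON string recording parameters and tool versions."""
    record = {
        "parameters": parameters,
        "versions": {name: tool_version(cmd)
                     for name, cmd in version_commands.items()},
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

Saving the returned string next to the assembly output keeps the settings and software versions with the data they produced.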

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.

Understanding your reads and mapping !

Neel — Wed, 29 Jan 2020 06:29:55 -0600

One of the best tutorials for beginners:

https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2017/Day1/Session4-seqIntro.html

Address of the bookmark: https://bioinformatics-core-shared-training.github.io/cruk-summer-school-2017/Day1/Session4-seqIntro.html
