BOL: Related items

Senior Bioinformatician (Assembly) Moore Aquatic Symbiosis Project Tree of Life

Sat, 02 Oct 2021 00:28:30 -0500

You will have some previous experience with genome bioinformatics or other large-scale scientific data analysis, or be a newly qualified graduate student with data science skills and an interest in DNA sequence data. While desirable, previous experience with DNA sequencing data is not strictly necessary for the position. We have a strong publication record and a culture of producing open data resources and developing open source software. This role requires an investigative and solution-oriented mindset and excellent communication skills to work effectively within large national and international consortia.

More at https://jobs.sanger.ac.uk/vacancy/senior-bioinformatician-assembly-moore-aquatic-symbiosis-project-tree-of-life-458923.html

NCBI Datasets pages

BioStar — Wed, 12 Jul 2023 06:29:31 -0500

Update! Assembly and Genome record pages now redirect to new NCBI Datasets pages. NCBI Datasets is a new resource that makes it easier to find and download genome data. Learn more: https://ncbiinsights.ncbi.nlm.nih.gov/2023/07/11/ncbi-datasets-genome-assembly-pages/ #NCBICGR

Effective July 10, 2023, NCBI’s Assembly and Genome record pages now redirect to new NCBI Datasets pages. As previously announced, these updates are part of our ongoing effort to modernize and improve your user experience. NCBI Datasets is a new resource that makes it easier to find and download genome data.  

The following pages have been updated:

The NCBI Assembly record pages now redirect to the new NCBI Datasets Genome record pages that describe assembled genomes and provide links to related NCBI tools such as Genome Data Viewer and BLAST. 
The NCBI Genome record pages now redirect to the NCBI Datasets Taxonomy record pages that provide a taxonomy-focused portal to genes, genomes, and additional NCBI resources.

During this transition, you will have the option to return to the legacy Genome and Assembly record pages. We will remove the legacy pages in early 2024. 

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together DNA sequence reads generated by sequencing platforms (e.g., Illumina short reads, PacBio or Oxford Nanopore long reads) into longer, contiguous sequences called contigs. This can be performed as:

De Novo Assembly: Without a reference genome.
Reference-Guided Assembly: Using a reference genome to guide the assembly process.

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, accurate reads ideal for base-level accuracy and polishing.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq
multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
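
To make "low-quality bases" concrete, here is a minimal Python sketch of what a per-read quality check computes. It is not how FastQC works internally; it assumes standard four-line FASTQ records with Phred+33 encoding, and the function names and the threshold of 20 are illustrative choices.

```python
def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores)


def low_quality_reads(fastq_text, threshold=20):
    """Return the names of reads whose mean Phred score is below threshold."""
    lines = fastq_text.strip().splitlines()
    flagged = []
    # Each FASTQ record is four lines: @name, sequence, '+', qualities.
    for i in range(0, len(lines), 4):
        name, quals = lines[i][1:], lines[i + 3]
        if mean_phred(quals) < threshold:
            flagged.append(name)
    return flagged


# Tiny made-up example: 'I' encodes Phred 40, '!' encodes Phred 0.
example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
print(low_quality_reads(example))
```

For real data, FastQC and MultiQC remain the better choice; the sketch only shows what the quality characters in a FASTQ file encode.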
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:

trimmomatic PE reads_R1.fastq reads_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq \
  trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
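
The SLIDINGWINDOW:4:20 and MINLEN:36 steps in the command above can be pictured with the following simplified Python sketch. It is an illustration of the idea only, not Trimmomatic's actual implementation, which handles window edges differently; the function names are hypothetical.

```python
def sliding_window_trim(seq, quals, window=4, min_mean=20):
    """Cut the read at the first window whose mean quality is below min_mean."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean:
            return seq[:start], quals[:start]
    return seq, quals


def keep_read(seq, min_len=36):
    """Mimic MINLEN: discard reads that end up shorter than min_len."""
    return len(seq) >= min_len


# Made-up read: good quality for six bases, then a drop to Phred 10.
seq = "ACGTACGTAC"
quals = [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
print(trimmed_seq, keep_read(trimmed_seq))
```

The read is cut where the four-base window average first drops below 20, and the shortened read is then dropped because it is far below the 36-base minimum.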

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, total assembly length, number of contigs, and GC content:

quast contigs.fasta -o quast_output
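
For readers unfamiliar with these statistics, the following Python sketch shows how N50 and GC content are defined. It is a minimal illustration with made-up contig lengths, not QUAST's code: N50 is the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly.

```python
def n50(lengths):
    """Length of the contig at which the cumulative sum reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def gc_content(sequences):
    """Fraction of G and C bases across all sequences."""
    bases = "".join(sequences).upper()
    gc = bases.count("G") + bases.count("C")
    return gc / len(bases) if bases else 0.0


# Made-up contig lengths: the total is 300, so half is 150.
print(n50([100, 80, 60, 40, 20]))
print(gc_content(["GGCC", "AATT"]))
```

A higher N50 generally means a more contiguous assembly, but it says nothing about correctness, which is why it is paired with completeness checks such as BUSCO.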
BUSCO
BUSCO checks genome completeness by identifying conserved genes. Choose the lineage dataset (-l) that matches your organism; the example below uses the fungal set:

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa

Step 5: Post-Assembly Steps

Polishing
Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads).

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or OPERA-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or MAKER for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.
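
Coverage is usually reported as total sequenced bases divided by genome size. The Python sketch below shows that calculation and one possible way to collect the metadata fields named above. The read count and read length are made-up numbers, the genome size reuses the 4.7 Mb value from the Canu example, and the dictionary keys are illustrative rather than any repository's required schema.

```python
def estimated_coverage(total_bases, genome_size):
    """Average sequencing depth: total sequenced bases / genome size."""
    return total_bases / genome_size


# Hypothetical run: 1,000,000 reads of 150 bp against a 4.7 Mb genome.
total_bases = 1_000_000 * 150
coverage = estimated_coverage(total_bases, 4_700_000)

metadata = {
    "organism": "<organism name>",          # placeholder, fill in for your sample
    "sequencing_platform": "<platform>",    # placeholder
    "coverage": f"{coverage:.1f}x",
}
print(metadata)
```

Check the submission guidelines of the target repository for the exact field names and required values.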

Best Practices

Always perform quality checks at each stage to ensure data integrity.
Use multiple tools to cross-validate results when working with complex genomes.
Document parameters and software versions for reproducibility.
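
One possible way to document versions and parameters is to write them to a small JSON record next to the results. The sketch below is a suggestion, not part of any of the tools above: it assumes each tool prints its version when called with --version, returns None for tools that are not installed, and uses hypothetical function names.

```python
import json
import shutil
import subprocess


def tool_version(executable, flag="--version"):
    """First line of the tool's version output, or None if it is not installed."""
    if shutil.which(executable) is None:
        return None
    result = subprocess.run([executable, flag], capture_output=True, text=True)
    # Some tools print the version to stderr instead of stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


def run_record(tools, parameters):
    """Collect tool versions and the parameters used into one dictionary."""
    return {
        "versions": {tool: tool_version(tool) for tool in tools},
        "parameters": parameters,
    }


record = run_record(
    ["fastqc", "spades.py", "prokka"],
    {"trimmomatic": "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36"},
)
print(json.dumps(record, indent=2))
```

Saving this output alongside each assembly makes it easier to repeat or audit the run later.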

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.

RITA: Rapid identification of high-confidence taxonomic assignments for metagenomic data

Jit — Mon, 27 Nov 2017 08:25:33 -0600

RITA is a standalone software package and Web server for taxonomic assignment of metagenomic sequence reads. By combining homology predictions from BLAST or UBLAST with compositional classifications from a Naive Bayes classifier, RITA is able to achieve very high accuracy on short reads. Unlike other hybrid approaches which combine these predictions for all sequences to be classified, RITA uses a pipeline to first identify cases where both types of classifier are in agreement, which constitute the highest-confidence set. Sequences not classified in this manner are subjected to a series of downstream classification steps.
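
The agreement-first idea described above can be sketched in a few lines of Python. This is an illustration of the concept only, not RITA's code: the function name, the read identifiers, and the taxon labels are all hypothetical, and the two input dictionaries stand in for the output of a homology search and a compositional classifier.

```python
def split_by_agreement(homology, composition):
    """Separate reads where both classifiers agree from those needing more steps."""
    high_confidence = {}
    unresolved = []
    for read_id, homology_label in homology.items():
        composition_label = composition.get(read_id)
        if homology_label is not None and homology_label == composition_label:
            high_confidence[read_id] = homology_label
        else:
            unresolved.append(read_id)
    return high_confidence, unresolved


# Made-up predictions for three reads.
homology = {"r1": "Escherichia", "r2": "Bacillus", "r3": None}
composition = {"r1": "Escherichia", "r2": "Vibrio", "r3": "Vibrio"}
confident, remaining = split_by_agreement(homology, composition)
print(confident, remaining)
```

Reads in the second list would then be passed to the downstream classification steps.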

This work has been accepted for publication:

MacDonald NJ, Parks DH, and Beiko RG. Rapid identification of taxonomic assignments. Accepted to Nucleic Acids Research April 4, 2012.

If you have any questions or bug reports, please let us know at .

Address of the bookmark: http://kiwi.cs.dal.ca/Software/RITA

BioCircos.js is an open source interactive Javascript library to interactively display biological data on the web

Jit — Fri, 19 Jan 2018 15:03:51 -0600

BioCircos.js is an open source interactive Javascript library which provides an easy way to interactively display biological data on the web. It implements a raster-based SVG visualization using the open source Javascript framework jquery.js. BioCircos.js is multiplatform and works in all major internet browsers (Internet Explorer, Mozilla Firefox, Google Chrome, Safari, Opera). Its speed is determined by the client’s hardware and internet browser. For the smoothest user experience, we recommend Google Chrome.

BioCircos.js provides SNP, CNV, HEATMAP, LINK, LINE, SCATTER, ARC, TEXT, and HISTOGRAM modules to display genome-wide genetic variations (SNPs, CNVs and chromosome rearrangements), gene expression and biomolecule interactions. BioCircos.js also provides a BACKGROUND module to display background and axis circles. Tooltips showing detailed information about SVG elements are also provided.

Address of the bookmark: http://bioinfo.ibp.ac.cn/biocircos/document/index.html

AfterQC: Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data

Jit — Fri, 29 Jun 2018 03:26:03 -0500

AfterQC can go through all fastq files in a folder and then output three folders: good, bad and QC, which contain the good reads, the bad reads and the QC results of each fastq file/pair. Currently it supports processing data from HiSeq 2000/2500/3000/4000, Nextseq 500/550, MiniSeq... and other Illumina 1.8 or newer formats.

The author has reimplemented this tool in C++ with multithreading support to make it much faster. The new tool is called fastp and can be found at https://github.com/OpenGene/fastp. If you prefer a C++ based tool, please use fastp instead.

Address of the bookmark: https://github.com/OpenGene/AfterQC

SeqMonk: A tool to visualise and analyse high throughput mapped sequence data

Jit — Tue, 11 Sep 2018 04:39:38 -0500

SeqMonk is a program to enable the visualisation and analysis of mapped sequence data. It was written for use with mapped next generation sequence data but can in theory be used for any dataset which can be expressed as a series of genomic positions. Its main features are:

Import of mapped data (BAM/SAM/bowtie etc.)
Creation of data groups for visualisation and analysis
Visualisation of mapped regions against an annotated genome
Flexible quantitation of the mapped data to allow comparisons between data sets
Statistical analysis of data to find regions of interest
Creation of reports containing data and genome annotation

Address of the bookmark: http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/

EXCAVATOR: detecting copy number variants from whole-exome sequencing data

Radha Agarkar — Fri, 04 Jan 2019 10:10:48 -0600

EXCAVATOR is a tool for the detection of copy number variants (CNVs) from whole-exome sequencing data. EXCAVATOR combines a three-step normalization procedure with a novel heterogeneous hidden Markov model algorithm and a calling method that classifies genomic regions into five copy number states. The authors validate EXCAVATOR on three datasets and compare the results with three other methods. These analyses show that EXCAVATOR outperforms the other methods and is therefore a valuable tool for the investigation of CNVs in large-scale projects, as well as in clinical research and diagnostics. EXCAVATOR is freely available at http://sourceforge.net/projects/excavatortool/.

EXCAVATOR is a novel software package for the detection of copy number variants (CNVs) from whole-exome sequencing data.
EXCAVATOR has been published in Genome Biology (http://genomebiology.com/2013/14/10/R120/abstract).

Address of the bookmark: https://sourceforge.net/projects/excavatortool/

nQuire: A statistical framework for ploidy estimation using NGS short-read data

Jit — Thu, 31 Jan 2019 05:12:19 -0600

nQuire implements a set of commands to estimate the ploidy level of individuals from species where recent polyploidization has occurred and intraspecific ploidy variation is observed. Specifically, nQuire uses next-generation sequencing data to distinguish between diploids, triploids and tetraploids on the basis of frequency distributions at variant sites where only two bases are segregating.
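
The intuition is that base frequencies at biallelic sites cluster around 0.5 in diploids, around 1/3 and 2/3 in triploids, and around 0.25, 0.5 and 0.75 in tetraploids. The Python sketch below compares made-up frequencies against these three expectations using equal-weight Gaussian mixtures. It is a simplified illustration of the idea, not nQuire's implementation; the fixed standard deviation of 0.05 and all names are assumptions.

```python
import math

# Expected base-frequency peaks at biallelic sites for each ploidy level.
EXPECTED_PEAKS = {
    "diploid": [0.5],
    "triploid": [1 / 3, 2 / 3],
    "tetraploid": [0.25, 0.5, 0.75],
}


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def log_likelihood(freqs, peaks, sd=0.05):
    """Log-likelihood of the frequencies under an equal-weight Gaussian mixture."""
    weight = 1.0 / len(peaks)
    total = 0.0
    for f in freqs:
        density = sum(weight * gaussian_pdf(f, p, sd) for p in peaks)
        total += math.log(density + 1e-300)  # guard against log(0)
    return total


def guess_ploidy(freqs, sd=0.05):
    """Return the ploidy model with the highest log-likelihood."""
    scores = {name: log_likelihood(freqs, peaks, sd)
              for name, peaks in EXPECTED_PEAKS.items()}
    return max(scores, key=scores.get)


# Made-up base frequencies from three hypothetical samples.
print(guess_ploidy([0.48, 0.50, 0.52, 0.49, 0.51]))
print(guess_ploidy([0.33, 0.34, 0.66, 0.67, 0.32, 0.68]))
print(guess_ploidy([0.25, 0.50, 0.75, 0.26, 0.49, 0.74]))
```

Real data are noisier and depend on coverage, so use nQuire itself for actual analyses.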

For more background see also the publication at BMC Bioinformatics.

https://github.com/clwgg/nQuire

Address of the bookmark: https://github.com/clwgg/nQuire

Trelliscope: flexibly visualize large, complex data in great detail from within the R statistical programming environment.

Jit — Tue, 21 Jan 2020 04:22:49 -0600

Trelliscope provides a way to flexibly visualize large, complex data in great detail from within the R statistical programming environment. Trelliscope is a component in the DeltaRho environment.

For those familiar with Trellis Display, faceting in ggplot, or the notion of small multiples, Trelliscope provides a scalable way to break a set of data into pieces, apply a plot method to each piece, and then arrange those plots in a grid and interactively sort, filter, and query panels of the display based on metrics of interest. With Trelliscope, we are able to create multipanel displays on data with a very large number of subsets and view them in an interactive and meaningful way.

Address of the bookmark: http://deltarho.org/docs-trelliscope/#introduction