BOL: Related items

MyCC: Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes

Jit — Fri, 03 Mar 2017 08:34:23 -0600

MyCC is an automated binning tool that combines genomic signatures, marker genes and optional contig coverages within one or multiple samples to visualize metagenomes and identify the reconstructed genomic fragments.
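
Genomic signatures in binning tools are commonly oligonucleotide (k-mer) frequency profiles. As a minimal sketch, assuming a plain tetranucleotide count with no reverse-complement merging or further transformation (MyCC's actual feature computation may differ, and `kmer_signature` is a hypothetical helper name), a contig's signature could be computed like this:

```python
from collections import Counter
from itertools import product

def kmer_signature(seq, k=4):
    """Return normalized k-mer frequencies for one contig (illustrative only)."""
    seq = seq.upper()
    # All 4**k possible k-mers over the unambiguous DNA alphabet.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only unambiguous k-mers contribute; windows containing N are ignored.
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}

sig = kmer_signature("ACGTACGTACGT")
print(len(sig), round(sig["ACGT"], 3))
```

Contigs from the same genome tend to have similar profiles, which is what lets a clustering step group them.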

More at http://www.nature.com/articles/srep24175

Address of the bookmark: https://sourceforge.net/projects/sb2nhri/files/MyCC/

SeqMule: Automated human exome/genome variants detection

Abhimanyu Singh — Tue, 07 Mar 2017 10:12:36 -0600

SeqMule takes single-end or paired-end FASTQ or BAM files, generates a script that chains more than 10 popular alignment and analysis tools, and runs the script line by line. Users can change the pipeline or fine-tune the parameters by modifying its configuration file. SeqMule also has some built-in functions, such as pooling consensus calls from various callers, plotting a Venn diagram showing the intersection among different callers, and downloading databases. SeqMule can be used for both Mendelian disease studies and cancer genome studies.
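
The idea of pooling consensus calls can be illustrated with a small sketch. This is not SeqMule's implementation; it assumes each caller's output has already been reduced to a set of (chromosome, position, ref, alt) tuples, and the caller names and variants below are made up for illustration:

```python
from collections import Counter

def consensus_calls(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` callers."""
    support = Counter()
    for variants in calls_by_caller.values():
        support.update(set(variants))  # each caller counts once per variant
    return {variant for variant, n in support.items() if n >= min_callers}

calls = {
    "caller_a": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
    "caller_b": {("chr1", 100, "A", "G"), ("chr2", 50, "G", "A")},
    "caller_c": {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T")},
}
consensus = consensus_calls(calls, min_callers=2)
print(sorted(consensus))
```

Raising `min_callers` to the number of callers gives the strict intersection that a Venn diagram's central region represents.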

Address of the bookmark: http://seqmule.openbioinformatics.org/en/latest/

Bacterial genome assembly !!

Jit — Fri, 05 May 2017 06:11:22 -0500

This tutorial will serve as an example of how to use free and open-source genome assembly and secondary scaffolding tools to generate high quality assemblies of bacterial sequence data. The bacterial sample used in this tutorial will be referred to simply as “Species” since it is live data. This data is paired-end data, meaning that there are forward and reverse reads, which we will designate as Sample_R1.fastq and Sample_R2.fastq, respectively.

https://github.com/jennomics/WorkflowPaper/blob/master/Genome%20Assembly%20and%20Annotation.md

Address of the bookmark: http://bioinformatics.uconn.edu/bacterial-genome-assembly-tutorial/

NCBI Prokaryotic Genome Annotation Pipeline

Jit — Tue, 16 May 2017 08:56:03 -0500

NCBI Prokaryotic Genome Annotation Pipeline is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

Genome annotation is a multi-level process that includes prediction of protein-coding genes, as well as other functional genome units such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons and other mobile elements.

NCBI has developed an automatic prokaryotic genome annotation pipeline that combines ab initio gene prediction algorithms with homology-based methods. The first version of the NCBI Prokaryotic Genome Automatic Annotation Pipeline (PGAAP; see Pubmed Article), developed in 2005, has been replaced with an upgraded version that is capable of processing a larger data volume. You can find a more detailed description of the new version of the pipeline in the NCBI Handbook chapter. NCBI's annotation pipeline depends on several internal databases and is not currently available for download or use outside of the NCBI environment.

https://www.ncbi.nlm.nih.gov/genome/annotation_prok/

Address of the bookmark: https://www.ncbi.nlm.nih.gov/genome/annotation_prok/

HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution

Jit — Wed, 07 Feb 2018 09:40:22 -0600

Software accompanying "HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution"

Preprint: http://biorxiv.org/content/early/2016/08/01/062117
Paper: http://genome.cshlp.org/content/27/5/747.full
An ipython notebook to reproduce results in the paper can be found in this repository.

HINGE is an OLC (Overlap-Layout-Consensus) assembler. The idea of the pipeline is illustrated in the repository.

Address of the bookmark: https://github.com/HingeAssembler/HINGE

GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments

Shruti Paniwala — Mon, 11 Jun 2018 05:43:44 -0500

GMcloser uses likelihood-based classifiers calculated from the alignment statistics between scaffolds, contigs and paired-end reads to correctly assign contigs or long reads to gap regions of scaffolds, thereby achieving accurate and efficient gap closure. We demonstrate with sequencing data from various organisms that the gap-closing accuracy of GMcloser is 3–100-fold higher than those of other available tools, with similar efficiency. https://academic.oup.com/bioinformatics/article/31/23/3733/209212

Address of the bookmark: https://academic.oup.com/bioinformatics/article/31/23/3733/209212

SIMBA: a Genome Assembly Project Management System

Neel — Thu, 29 Nov 2018 08:52:25 -0600

SIMBA, SImple Manager for Bacterial Assemblies, is a Web interface for managing assembly projects of bacterial genomes. SIMBA was created to help bioinformaticians assemble bacterial genomes sequenced with Next-Generation Sequencing (NGS) platforms quickly, easily and effectively. SIMBA is also an open source tool, i.e., it can be freely downloaded, shared and modified.

Address of the bookmark: http://ufmg-simba.sourceforge.net/

Senior Bioinformatician (Assembly) Moore Aquatic Symbiosis Project Tree of Life

Sat, 02 Oct 2021 00:28:30 -0500

You will have some previous experience with genome bioinformatics or other large-scale scientific data analysis, or be a newly qualified graduate student with data science skills and an interest in DNA sequence data. While desirable, previous experience with DNA sequencing data is not strictly necessary for the position. We have a strong publication record and a culture of producing open data resources and open source software. This role requires an investigative and solution-oriented mindset and excellent communication skills to work effectively within large national and international consortia.

More at https://jobs.sanger.ac.uk/vacancy/senior-bioinformatician-assembly-moore-aquatic-symbiosis-project-tree-of-life-458923.html

NCBI Datasets pages

BioStar — Wed, 12 Jul 2023 06:29:31 -0500

Update! Assembly and Genome record pages now redirect to new NCBI Datasets pages. NCBI Datasets is a new resource that makes it easier to find and download genome data. Learn more: https://ncbiinsights.ncbi.nlm.nih.gov/2023/07/11/ncbi-datasets-genome-assembly-pages/ #NCBICGR

Effective July 10, 2023, NCBI’s Assembly and Genome record pages now redirect to new NCBI Datasets pages. As previously announced, these updates are part of our ongoing effort to modernize and improve your user experience. NCBI Datasets is a new resource that makes it easier to find and download genome data.  

The following pages have been updated:

- The NCBI Assembly record pages now redirect to the new NCBI Datasets Genome record pages that describe assembled genomes and provide links to related NCBI tools such as Genome Data Viewer and BLAST.
- The NCBI Genome record pages now redirect to the NCBI Datasets Taxonomy record pages that provide a taxonomy-focused portal to genes, genomes, and additional NCBI resources.

During this transition, you will have the option to return to the legacy Genome and Assembly record pages. We will remove the legacy pages in early 2024. 

Step-by-Step Guide to Running Genome Assembly

Abhi — Fri, 13 Dec 2024 11:35:55 -0600

Genome assembly is a critical process in bioinformatics, enabling the reconstruction of an organism's genome from short DNA sequence reads. Whether you’re working on a new microbial genome or a complex eukaryotic organism, this guide will walk you through the steps of genome assembly using state-of-the-art tools and best practices.

What is Genome Assembly?

Genome assembly involves piecing together DNA sequence reads generated by sequencing platforms (e.g., Illumina, PacBio, Oxford Nanopore) into longer, contiguous sequences called contigs. This can be performed as:

- De Novo Assembly: Without a reference genome.
- Reference-Guided Assembly: Using a reference genome to guide the assembly process.

Step 1: Preparing Your Data

Before starting the assembly, ensure that your raw sequencing data is high quality.

Input Data
- Short Reads: Illumina sequencing generates short, accurate reads, well suited to base-accurate contigs and to polishing long-read assemblies.
- Long Reads: PacBio and Nanopore sequencing provide long reads for resolving repetitive regions.
Quality Control (QC)
Use tools like FastQC or MultiQC to assess the quality of your reads:

fastqc reads.fastq
multiqc .

Look for issues like low-quality bases, adapter contamination, or overrepresented sequences.
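
To make "low-quality bases" concrete, the sketch below computes the mean Phred score per read directly from FASTQ text. It is not a substitute for FastQC; it assumes unwrapped four-line records and Phred+33 encoding, and the two reads are invented for illustration:

```python
import io

def mean_read_qualities(handle, offset=33):
    """Yield (read_id, mean Phred score) for each 4-line FASTQ record.

    Assumes unwrapped FASTQ and Phred+33 quality encoding.
    """
    while True:
        header = handle.readline().strip()
        if not header:
            break
        handle.readline()                 # sequence line (unused here)
        handle.readline()                 # '+' separator line
        qual = handle.readline().strip()
        scores = [ord(ch) - offset for ch in qual]
        mean = sum(scores) / len(scores) if scores else 0.0
        yield header[1:].split()[0], mean

# 'I' encodes Phred 40 and '!' encodes Phred 0 under Phred+33.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!II\n")
results = dict(mean_read_qualities(example))
print(results)
```
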
Read Trimming and Filtering
Trim low-quality bases and adapters using Trimmomatic or Cutadapt:

trimmomatic PE reads_R1.fastq reads_R2.fastq \
  trimmed_R1.fastq unpaired_R1.fastq trimmed_R2.fastq unpaired_R2.fastq \
  ILLUMINACLIP:adapters.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36
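
The SLIDINGWINDOW:4:20 and MINLEN:36 settings can be read as "cut the read once the average quality in a 4-base window drops below 20, then discard reads shorter than 36 bases". The following is a simplified sketch of that idea, not Trimmomatic's exact algorithm (its precise cut position may differ); the function name and the example read are hypothetical:

```python
def sliding_window_trim(seq, quals, window=4, min_mean=20):
    """Cut the read at the start of the first window whose mean quality is too low."""
    for start in range(0, max(len(seq) - window + 1, 0)):
        win = quals[start:start + window]
        if sum(win) / window < min_mean:
            return seq[:start], quals[:start]
    return seq, quals  # no failing window (or read shorter than the window)

seq = "ACGTACGTAC"
quals = [38, 38, 38, 38, 38, 30, 10, 8, 5, 2]
trimmed_seq, trimmed_quals = sliding_window_trim(seq, quals)
keep = len(trimmed_seq) >= 36  # MINLEN-style length filter
print(trimmed_seq, keep)
```

In this toy read the window starting at index 5 is the first with a mean below 20, so only the first five bases survive, and the length filter would then drop the read.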

Step 2: Choosing an Assembly Strategy

Select an assembly strategy based on your data type:

Short-Read Assemblers:
- SPAdes: Popular for microbial genomes.
- Velvet: Fast for smaller genomes.
Long-Read Assemblers:
- Canu: Ideal for long-read datasets.
- Flye: Versatile for small and large genomes.
Hybrid Assemblers:
- MaSuRCA: Combines short and long reads.
- Unicycler: Optimized for bacterial genomes.

Step 3: Running the Assembly

3.1. SPAdes (Short-Read Assembly)

SPAdes is an excellent choice for small genomes, such as bacteria.

spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output

The output includes assembled contigs (contigs.fasta) and scaffolds (scaffolds.fasta).

3.2. Canu (Long-Read Assembly)

Canu is designed for high-error long reads from PacBio or Nanopore.

canu -p genome -d canu_output genomeSize=4.7m -nanopore-raw reads.fastq

The output will be in canu_output/genome.contigs.fasta.
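
The genomeSize value is an estimate supplied by the user, and it is useful to check how much read coverage it implies before launching a long run. A minimal sketch, assuming size strings with k/m/g suffixes as in the command above; the helper names and the read lengths are made up:

```python
def parse_genome_size(text):
    """Convert a size string such as '4.7m' into a number of bases."""
    units = {"k": 1e3, "m": 1e6, "g": 1e9}
    text = text.strip().lower()
    if text and text[-1] in units:
        return round(float(text[:-1]) * units[text[-1]])
    return int(text)

def estimate_coverage(read_lengths, genome_size):
    """Expected depth = total sequenced bases / genome size."""
    return sum(read_lengths) / genome_size

genome_size = parse_genome_size("4.7m")
read_lengths = [10_000] * 23_500          # hypothetical long-read dataset
coverage = estimate_coverage(read_lengths, genome_size)
print(genome_size, coverage)
```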

3.3. Hybrid Assembly with Unicycler

Unicycler combines short and long reads for improved assemblies.

unicycler -1 trimmed_R1.fastq -2 trimmed_R2.fastq -l long_reads.fastq -o unicycler_output

Step 4: Assessing Assembly Quality

After assembly, evaluate its quality using the following tools:

QUAST
QUAST generates assembly statistics, such as N50, genome size, and GC content:

quast contigs.fasta -o quast_output
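
For intuition about two of these statistics, here is a small sketch that computes N50 and GC content from FASTA text. It mirrors the standard definitions rather than QUAST's code, and the three contigs are invented:

```python
import io

def read_fasta(handle):
    """Return {name: sequence} from FASTA text (sequences may be wrapped)."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif line and name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}

def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

def gc_content(seqs):
    """Fraction of G and C among all A/C/G/T bases."""
    gc = sum(s.count("G") + s.count("C") for s in seqs)
    acgt = gc + sum(s.count("A") + s.count("T") for s in seqs)
    return gc / acgt if acgt else 0.0

fasta = io.StringIO(">c1\nGGCCGGCC\n>c2\nATATAT\n>c3\nGC\nAT\n")
contigs = read_fasta(fasta)
lengths = [len(s) for s in contigs.values()]
print(n50(lengths), gc_content(contigs.values()))
```
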
BUSCO
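BUSCO checks genome completeness by identifying conserved single-copy genes. Choose the lineage dataset (-l) that matches your organism; fungi_odb10 below is only an example: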

busco -i contigs.fasta -o busco_output -l fungi_odb10 -m genome
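
BUSCO reports its result as a one-line summary of complete (C), single-copy (S), duplicated (D), fragmented (F) and missing (M) percentages. The sketch below parses such a line so the values can be tracked across runs. It assumes the `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout; other BUSCO versions may add fields, and the numbers in the example are made up:

```python
import re

BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)

def parse_busco_line(line):
    """Extract the percentages and BUSCO count from a one-line summary."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a recognised BUSCO summary line")
    result = {key: float(val) for key, val in match.groupdict().items()}
    result["n"] = int(match.group("n"))
    return result

# Illustrative values only, not output from a real run.
summary = parse_busco_line("\tC:98.6%[S:98.2%,D:0.4%],F:0.8%,M:0.6%,n:758")
print(summary)
```
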
Assembly Graph Visualization
Visualize assembly graphs with Bandage:

Bandage load assembly_graph.gfa
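
Before opening a graph in a viewer, it can help to get a quick count of its segments and links. A minimal sketch, assuming GFA version 1 with tab-separated S (segment) and L (link) records; the tiny graph below is invented:

```python
def summarize_gfa(lines):
    """Count segments and links and sum segment lengths in GFA1 text."""
    segments, links = {}, 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name, seq = fields[1], fields[2]
            length = len(seq)
            if seq == "*":
                # Sequence omitted: fall back to the optional LN:i: length tag.
                length = next(
                    (int(f[5:]) for f in fields[3:] if f.startswith("LN:i:")), 0
                )
            segments[name] = length
        elif fields[0] == "L":
            links += 1
    return {
        "segments": len(segments),
        "links": links,
        "total_length": sum(segments.values()),
    }

gfa_text = [
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\t*\tLN:i:10",
    "L\t1\t+\t2\t-\t0M",
]
stats = summarize_gfa(gfa_text)
print(stats)
```

Many more links than segments usually points to unresolved repeats, which is what the visual inspection is meant to reveal.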

Step 5: Post-Assembly Steps

Polishing
Improve assembly accuracy using tools like Pilon (for short reads) or Racon (for long reads).

racon long_reads.fasta mapped_reads.sam contigs.fasta > polished_contigs.fasta
Scaffolding
Link contigs into scaffolds using tools like SSPACE or Opera-LG if required.
Annotation
Annotate the assembled genome using Prokka for prokaryotes or MAKER for eukaryotes.

prokka --outdir annotation_output --prefix genome contigs.fasta

Step 6: Sharing and Archiving

Submit to Public Repositories
Share your assembly in databases like NCBI GenBank, ENA, or DDBJ.
Metadata Preparation
Include detailed metadata for your submission, such as organism name, sequencing platform, and coverage.

Best Practices

- Always perform quality checks at each stage to ensure data integrity.
- Use multiple tools to cross-validate results when working with complex genomes.
- Document parameters and software versions for reproducibility.
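
One lightweight way to document parameters and versions is to write a small run record next to the results. The sketch below is one possible approach, not a standard; it assumes each tool answers `--version` (not every tool does) and records None when a tool is not on the PATH. The function names are hypothetical:

```python
import json
import shutil
import subprocess
from datetime import datetime, timezone

def tool_version(tool):
    """Return the first line printed by `<tool> --version`, or None if unavailable."""
    if shutil.which(tool) is None:
        return None
    try:
        proc = subprocess.run(
            [tool, "--version"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    lines = (proc.stdout + proc.stderr).strip().splitlines()
    return lines[0] if lines else None

def build_run_record(commands):
    """Collect the exact command lines plus the version of each tool they call."""
    tools = sorted({cmd.split()[0] for cmd in commands})
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "commands": list(commands),
        "versions": {tool: tool_version(tool) for tool in tools},
    }

record = build_run_record([
    "spades.py -1 trimmed_R1.fastq -2 trimmed_R2.fastq -o spades_output",
    "quast contigs.fasta -o quast_output",
])
print(json.dumps(record, indent=2))
```

Saving the printed JSON alongside the assembly output keeps the parameters and versions with the data they produced.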

Conclusion

Genome assembly is a powerful process that transforms raw sequencing data into a coherent representation of an organism’s genome. By following this step-by-step guide, you can successfully assemble genomes and uncover valuable biological insights. Whether you’re assembling a microbial genome or tackling the complexities of a eukaryotic genome, these tools and strategies will set you on the path to success.