BOL: Related items

Phased Human Genome Assembly!

Rahul Nayak — Mon, 08 Oct 2018 09:10:54 -0500

The new publicly available assembly (PacBio HG00733) has the fewest gaps of any human genome assembly, with more than half of the genome contained in gapless sequence at least 27 Mb long. The primary contig assembly is 2.89 Gb long and consists of 865 contigs assembled from PacBio data generated with the company’s Sequel® System. Using the FALCON-Unzip assembler, maternal and paternal haplotypes were resolved over more than 80% of the genome. Maternal and paternal haplotype blocks were then further phased using Hi-C technology and the FALCON-Phase method developed in collaboration with Phase Genomics. The genome was then de novo scaffolded using Phase Genomics’ Proximo Hi-C platform, resulting in the first chromosome-scale diploid assembly of a single individual accomplished with only two technologies. More details about the assembly are available on the PacBio blog.
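The statement that more than half of the genome sits in gapless sequence at least 27 Mb long corresponds to the contig N50 statistic. As a minimal sketch of how that number is usually computed (the contig lengths below are hypothetical toy values, not taken from the HG00733 assembly):

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths in Mb (hypothetical, for illustration only).
toy_contigs = [40, 30, 27, 10, 5, 3]
print(n50(toy_contigs))
```

For the toy list, the total is 115 Mb, and the two longest contigs (40 + 30) already cover more than half of it, so the function returns 30.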

The data are available under the following NCBI accession IDs: BioProject PRJNA483067, assembly RBJD00000000, and sequence data SRP155659.

Additional Resources

Interactive map showcasing global initiatives underway to generate reference-quality human genome assemblies for diverse populations
BioReport Podcast on the value of ethnic-specific reference genomes
Nature Reviews Genetics paper from NHGRI: Prioritizing diversity in human genomics research
Article in The Journal of Precision Medicine: “Minority Report – Ethnic Diversity and the Real Promise for Precision Medicine”
Article in Bio-IT World: “Genomic Data Standards Are a Necessity”
NHGRI Project Award: High Quality Human and Non-Human Primate Genome Assemblies

More details are available on the PacBio website:

Blog post: Data Release: Highest-Quality, Most Contiguous Individual Human Genome Assembly to Date
Blog post: For Reference-Grade Human Genome Assemblies, SMRT Sequencing Yields Optimal Results
Webinar: Assembling High-Quality Human Reference Genomes for Global Populations
FALCON-Phase press release and article preprint
PacBio research focus webpage about Human Population Genetics

Ref: https://stockguru.com/2018/10/08/pacific-biosciences-releases-highest-quality-most-contiguous-individual-human-genome-assembly-to-date/

Genome Workbench 2.10.7

Gudiya Pal — Fri, 01 Jul 2016 12:09:59 -0500

Genome Workbench 2.10.7 is here! New features include added support for local custom BLAST databases and improvements to Tree View.

For the full list of features, improvements and fixes, see the release notes: https://ncbi.nlm.nih.gov/tools/gbench/releasenotes

New Features

BLAST Tool: added support for local custom BLAST databases
Graphical Sequence View: added log scaling option for graph tracks
Generic Table View: new tutorial added

Bug Fixes and Improvements

Project Tree View: Genomic Collections/Assemblies now show accessions, not just names
Tree View: layout updated to better accommodate nodes of different sizes
Table Import Dialog (MacOS): fixed issue with table visibility
Fixed bug where different molecule IDs in GenBank could resolve to the same sequence
Graphical Sequence View: fixed issue where sequence track was not shown for some sequences
Graphical Sequence View: fixed protein coloration methods
Graphical Sequence View: improved rendering of Markers to better indicate boundaries and produce higher quality PDF images
Create Gene Model tool: fixed scenario when gene model tool failed with local sequences
Search View: ORF Finder – fixed incorrect protein lengths
Fixed bug where a project file (.gbp) did not open on a click
Fixed issues in GVF import
Fixed BLAST Search tool against NCBI databases not working
Fixed tblastn (protein BLAST) not working in standalone mode
Fixed GTF export failure

RepeatModeler

Jit — Thu, 18 Aug 2016 09:57:15 -0500

RepeatModeler is a de novo repeat family identification and modeling package. At the heart of RepeatModeler are two de novo repeat-finding programs (RECON and RepeatScout), which employ complementary computational methods for identifying repeat element boundaries and family relationships from sequence data. RepeatModeler automates the runs of RECON and RepeatScout on a given genomic database and uses the output to build, refine and classify consensus models of putative interspersed repeats.

Address of the bookmark: http://www.repeatmasker.org/RepeatModeler.html

TGNet

Shruti Paniwala — Wed, 24 Aug 2016 05:36:36 -0500

Recent technological progress has greatly facilitated de novo genome sequencing. However, de novo assemblies consist of many pieces of contiguous sequence (contigs) arranged in thousands of scaffolds instead of small numbers of chromosomes. Confirming and improving the quality of such assemblies is critical for subsequent analysis.

Visualization and quality assessment of de novo genome assemblies

Citation

This software is fully described in the paper:
Riba-Grognuz, Keller, Falquet, Xenarios & Wurm (2011) Visualization and quality assessment of de novo genome assemblies.

In brief, our scripts create Cytoscape files to visualize transcript evidence that suggests adjacency between scaffolds and contigs.
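To illustrate the idea of transcript evidence suggesting adjacency: if one transcript aligns partly to one scaffold and partly to another, the two scaffolds are likely neighbours. The sketch below is not TGNet's own code. It assumes a hypothetical input of (transcript, scaffold) pairs, for example parsed from BLAT output, and writes edges in Cytoscape's simple interaction format (SIF), where each line holds a source node, an interaction type and a target node.

```python
from collections import defaultdict
from itertools import combinations


def adjacency_edges(hits):
    """Count, for each pair of scaffolds, how many transcripts align to both.

    `hits` is an iterable of (transcript_id, scaffold_id) pairs; this input
    shape is an assumption for illustration, not TGNet's own format.
    """
    scaffolds_by_transcript = defaultdict(set)
    for transcript, scaffold in hits:
        scaffolds_by_transcript[transcript].add(scaffold)

    support = defaultdict(int)
    for scaffolds in scaffolds_by_transcript.values():
        for a, b in combinations(sorted(scaffolds), 2):
            support[(a, b)] += 1
    return dict(support)


def to_sif(edges, interaction="transcript_link"):
    """Render edges as SIF lines: source, interaction type, target."""
    return [f"{a}\t{interaction}\t{b}" for (a, b) in sorted(edges)]


# Hypothetical alignments: transcripts t1 and t2 each span scaffolds s1 and s2.
toy_hits = [("t1", "s1"), ("t1", "s2"), ("t2", "s2"), ("t2", "s1"), ("t3", "s3")]
edges = adjacency_edges(toy_hits)
print(edges)
print("\n".join(to_sif(edges)))
```

In this toy input, two transcripts support the s1/s2 link, while s3 has no partner and produces no edge. The support count could be used as an edge weight when judging how strong the evidence is.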

Software requirements

BLAT (tested with Standalone BLAT v. 32x1). Source, Binaries.
Cytoscape (tested with versions 2.7.0, 2.8.2)
a UNIX machine (tested on Mac OS X 10.6 and CentOS 4.6)

Address of the bookmark: https://github.com/ksanao/TGNet

Useful Bioinformatics Tools

Poonam Mahapatra — Mon, 29 Aug 2016 04:08:12 -0500

A collection of a few handy tools for bioinformaticians.

Address of the bookmark: http://molbiol-tools.ca/Convert.htm

BRAKER: pipeline for fully automated prediction of protein coding genes with GeneMark-ES/ET and AUGUSTUS in novel eukaryotic genomes

Jit — Thu, 01 Sep 2016 08:02:59 -0500

Gene finding in eukaryotic genomes is notoriously difficult to automate. The task is to design a work flow with a minimal set of tools that would reach state-of-the-art performance across a wide range of species. GeneMark-ET is a gene prediction tool that incorporates RNA-Seq data into unsupervised training and subsequently generates ab initio gene predictions. AUGUSTUS is a gene finder that usually requires supervised training and uses information from RNA-Seq reads in the prediction step. Complementary strengths of GeneMark-ET and AUGUSTUS provided motivation for designing a new combined tool for automatic gene prediction.

http://www.ncbi.nlm.nih.gov/pubmed/26559507

Address of the bookmark: http://bioinf.uni-greifswald.de/bioinf/braker/

NGS Tutorial

Jit — Mon, 05 Sep 2016 09:50:46 -0500

These tutorials are written for the hundreds of bioinformaticians trying to cope with the large volume of next-generation sequencing (NGS) data. NGS technologies brought a dramatic shift in the world of sequencing. Merely five years ago, genome sequencing of higher eukaryotes was a very expensive endeavor. To get a genome of interest sequenced, hundreds of scientists had to raise funds together by writing a joint white paper and petitioning various government agencies. The tasks of sequencing and assembly were handled by dedicated sequencing facilities, of which only a few existed around the globe. Naturally, the capacity of those sequencing facilities was significantly constrained by the high volume of requests.

Address of the bookmark: http://www.homolog.us/Tutorials/index.php

e-RGA: enhanced Reference Guided Assembly of Complex Genomes

Jit — Mon, 19 Dec 2016 05:56:14 -0600

Next Generation Sequencing has totally changed genomics: we are able to produce huge amounts of data at an incredibly low cost compared to Sanger sequencing. Despite this, some old problems have become even more difficult, de novo assembly being on top of this list. Despite efforts to design tools able to assemble, de novo, an organism sequenced with short reads, the results are still far from those achievable with long reads. In this paper, we propose a novel method that aims to improve de novo assembly in the presence of a closely related reference. The idea is to combine de novo and reference-guided assembly in order to obtain enhanced results.

Address of the bookmark: http://journal.embnet.org/index.php/embnetjournal/article/view/208

Mercator

Jit — Mon, 06 Feb 2017 04:20:36 -0600

Our basic strategy in building homology maps is to use exons that are orthologous in multiple genomes as map "anchors." Given K genomes, the steps in the map construction are as follows:

For each genome, obtain a set of exon annotations. These annotations can be a combination of both exon predictions (e.g. Genscan) and annotations that have been experimentally verified (e.g. RefSeq). Ideally, we would like to have these annotations be as sensitive as possible. Specificity is not a concern, as incorrect annotations are not likely to have significant alignments with other gene annotations.
Compare all exons against all exons in other genomes and record significant alignments between exons. Currently, we use BLAT to do this all-vs-all comparison with alignments being performed in protein space.
Construct a graph with each vertex corresponding to an exon and edges between vertices whose corresponding exons have significant alignments.
Identify cliques in this graph. These cliques are potential anchors to be used in the map.
Starting with the largest cliques (those that have exons in all or most of the genomes), join neighboring (adjacent in genomic coordinates, in each genome) cliques to form runs. Smaller cliques that are inconsistent with runs formed by larger cliques are filtered out. After the smallest cliques have been considered, cliques that are not part of a run are discarded.
The extents of each run in each genome are output as orthologous segments. The cliques from each run are used to output the exact genomic coordinates of anchors within each orthologous segment. These anchors can be used by genomic alignment programs (such as MAVID) to do a detailed alignment of each orthologous segment.
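The graph and clique steps above can be sketched in a few lines. This is only an illustration: it uses the textbook Bron-Kerbosch recursion to enumerate maximal cliques, which is not necessarily the method Mercator itself uses, and the exon names and alignments are hypothetical.

```python
from collections import defaultdict


def build_graph(alignments):
    """Build an undirected graph: one vertex per exon, one edge per
    significant alignment between two exons."""
    graph = defaultdict(set)
    for a, b in alignments:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for vertex in list(candidates):
            expand(current | {vertex},
                   candidates & graph[vertex],
                   excluded & graph[vertex])
            candidates = candidates - {vertex}
            excluded = excluded | {vertex}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical significant alignments between exons of three genomes.
toy_alignments = [
    ("human:e1", "mouse:e1"),
    ("human:e1", "rat:e1"),
    ("mouse:e1", "rat:e1"),
    ("human:e2", "mouse:e2"),
]
graph = build_graph(toy_alignments)
# Largest cliques first, matching the order in which runs are built.
anchors = sorted(maximal_cliques(graph), key=len, reverse=True)
for clique in anchors:
    print(sorted(clique))
```

Here the three-genome clique for exon e1 is considered before the two-genome clique for e2. Joining cliques into runs and filtering inconsistent ones would follow from this ordering and is not shown.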

Address of the bookmark: https://www.biostat.wisc.edu/~cdewey/mercator/

MashMap: a fast and approximate software for mapping long reads (PacBio/ONT) or assembly to reference genome(s)

Jit — Tue, 12 Dec 2017 17:23:31 -0600

MashMap is fast, approximate software for mapping long reads (PacBio/ONT) or assemblies to reference genome(s). It maps a query sequence against a reference region if and only if its estimated alignment identity is above a specified threshold. It does not compute the alignments explicitly, but rather estimates a k-mer based Jaccard similarity using a combination of Winnowing and MinHash. This is then converted to an estimate of sequence identity using the Mash distance. An appropriate k-mer sampling rate is automatically determined given minimum local alignment length and identity thresholds. The efficiency of the algorithm improves as both of these thresholds are increased.
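The conversion from a Jaccard similarity to an identity estimate can be sketched as follows. This is a simplified illustration, not MashMap's implementation: it computes the exact Jaccard similarity over full k-mer sets instead of sampling with Winnowing and MinHash, and the sequences, k and threshold are hypothetical. The Mash distance used is D = -(1/k) ln(2j / (1 + j)), with identity estimated as 1 - D.

```python
import math


def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mash_identity(j, k):
    """Convert a Jaccard estimate j into an identity estimate, 1 - D,
    where D = -(1/k) * ln(2j / (1 + j)) is the Mash distance."""
    if j <= 0:
        return 0.0
    distance = -math.log(2 * j / (1 + j)) / k
    return max(0.0, 1.0 - distance)


# Hypothetical toy sequences differing by one base; k and threshold are
# illustrative values, far smaller than those used on real genomes.
query = "ACGTACGTTGCA"
reference = "ACGTACGATGCA"
k = 4
threshold = 0.85

j = jaccard(kmers(query, k), kmers(reference, k))
identity = mash_identity(j, k)
print(f"Jaccard={j:.3f} identity={identity:.3f} mapped={identity >= threshold}")
```

A query would be reported as mapped only when the estimated identity clears the threshold, which mirrors the "if and only if" condition described above.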

Address of the bookmark: https://github.com/marbl/MashMap