BOL: Rahul Nayak's blogs

chromoMap: An R package for interactive visualization and mapping of human chromosomes

Rahul Nayak — Mon, 25 Jun 2018 17:22:24 -0500

chromoMap is an R package that provides interactive, configurable and elegant visualization of the human chromosomes, allowing users to map chromosome elements (such as genes and SNPs) onto the chromosome plot. It introduces a special plot, the "chromosome heatmap", which, in addition to mapping elements, can visualize the data associated with chromosome elements (such as gene expression) as heat colors, which can be highly useful in scientific interpretation and research work. Because of the enormous size of the chromosomes, it is impractical to visualize every element on the same plot. chromoMap plots therefore provide a magnified view of each chromosome location to render additional information and visualization specific to that location. You can map thousands of genes and view all the mappings easily. Users can inspect detailed information about the mappings (such as gene names or the total number of genes mapped at a location), or view the magnified single- or double-stranded view of the chromosome at a location, showing each mapped element in sequential order. In addition, the plots can be saved as HTML documents that can be customized and shared easily, and you can include them in R Markdown documents or R Shiny applications.

https://cran.r-project.org/web/packages/chromoMap/index.html

Understanding liftOver !

Rahul Nayak — Wed, 06 Jun 2018 10:00:20 -0500

LiftOver is a necessary step to bring all genetic analyses to the same reference build. It has three use cases:

(1) Convert genome position from one genome assembly to another genome assembly

In most scenarios, we have known genome positions in NCBI build 36 (UCSC hg18) and want to lift them over to NCBI build 37 (UCSC hg19).

(2) Convert dbSNP rs number from one build to another

(3) Convert both genome position and dbSNP rs number over different versions
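
For use case (1), the liftOver command-line tool reads its input as BED records, which use 0-based, half-open coordinates, so a 1-based position such as chr4:6497-6497 becomes "chr4 6496 6497". A minimal Python sketch of that conversion follows; the function name `region_to_bed` and the `chrom:start-end` input format are assumptions made for illustration and are not part of liftOver itself.

```python
def region_to_bed(region: str) -> str:
    """Convert a 1-based, inclusive region such as 'chr4:6497-6497' to a BED line.

    Hypothetical helper: the 'chrom:start-end' input format is an assumption.
    """
    chrom, span = region.split(":")
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {region}")
    # BED start is 0-based; BED end is exclusive, so it equals the 1-based end.
    return f"{chrom}\t{start - 1}\t{end}"


# Print tab-separated BED lines, ready to be saved as an input file.
for region in ["chr4:6497-6497", "chr1:1,000-2,000"]:
    print(region_to_bed(region))
```

The output lines can be written to a file and passed as the first argument of the command.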

Run:

liftOver input.bed hg18ToHg19.over.chain.gz output.bed unlifted.bed

Records that cannot be converted are written to unlifted.bed, each labelled with one of the following reasons:

Deleted in new:
    Sequence intersects no chains
Partially deleted in new:
    Sequence insufficiently intersects one chain
Split in new:
    Sequence insufficiently intersects multiple chains
Duplicated in new:
    Sequence sufficiently intersects multiple chains
Boundary problem:
    Missing start or end base in an exon

For example:

Suppose you lift over chr4:6497-6497 from hg19 to GRCh38 and it returns "Deleted in new".

This means that chr4:6497-6497 lies on a genomic contig in hg19 that is no longer mapped in GRCh38, because the improved assembly no longer includes that contig.
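
When many positions are lifted at once, it helps to tally why records failed. The sketch below assumes the unlifted file precedes each failed record with a "#" comment line stating the reason (for example "#Deleted in new"), which is how UCSC liftOver output is commonly laid out; the function name `count_unlifted_reasons` and the sample records are made up for illustration.

```python
from collections import Counter
from typing import Iterable


def count_unlifted_reasons(lines: Iterable[str]) -> Counter:
    """Tally the '#reason' comment lines of an unlifted file (assumed layout)."""
    reasons: Counter = Counter()
    for line in lines:
        line = line.strip()
        if line.startswith("#"):
            # Drop the leading '#' and surrounding whitespace to get the reason.
            reasons[line.lstrip("#").strip()] += 1
    return reasons


# Made-up sample standing in for the contents of unlifted.bed.
sample = [
    "#Deleted in new",
    "chr4\t6496\t6497",
    "#Partially deleted in new",
    "chr1\t100\t200",
    "#Deleted in new",
    "chr2\t10\t20",
]
for reason, n in count_unlifted_reasons(sample).most_common():
    print(f"{reason}: {n}")
```

For a real run, pass an open file handle instead of the sample list, e.g. `count_unlifted_reasons(open("unlifted.bed"))`.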

Popular bioinformatics educational resources !

Rahul Nayak — Fri, 04 May 2018 19:43:21 -0500

The following is a list of popular bioinformatics educational resources:

Bii.a-star.edu.sg

Bio research and development. Has course information and research information.

Isb-sib.ch

SIB operates the ExPASy proteomics server and the Swiss node of EMBnet. Teaching activities include a series of post-graduate courses given at the Universities of Geneva and Lausanne, as well as at the EPFL, and a Masters Degree in bioinformatics. Major research areas include the development of integrated databases and software resources in the field of proteomics.

Bioinformatics.ca

Provides information about bioinformatics in Canada. Workshops, certification and resources.

Chickscope.beckman.uiuc.edu

Students raise chicken embryos in the classroom and obtain magnetic resonance images through the Internet.

Bcb.iastate.edu

Program at Iowa State University offering an undergraduate major (BCBio) and a PhD program (BCB).

Bu.edu/bioinformatics/

Interdisciplinary PhD and Masters programs that include an internship at local companies. Run in conjunction with the NE masters program.

Bioinformatics.ubc.ca

A computational biology research centre covering many areas of genomics, proteomics, computer science and statistics. Research, training, news and events, resources and support, director's message, faculty and personnel.

Openhelix.com

Provides onsite training on specific bioinformatics databases and tools. Also offers bioinformatic software testing and research consulting services.

Igb.uci.edu

Specializing in making publicly available software and database services for computational biology.

Bioinformatics.pe.kr

Maintained by Dr. Seyeon Weon in Korea. Provides information on courses, a database archive, a software archive and online resources.

Groups.yahoo.com/group/bimatics/

Bioinformatics group for students interested in and/or working in the bioinformatics and computational biology fields. Offers opportunities to exchange information and share ideas.

Ncbi.nlm.nih.gov/books/NBK22183/

Information about several medically important genes and related diseases. Illustrates the use of bioinformatics in their study.

Bioinfo.mbb.yale.edu/mbb452a/2003/

Bioinformatics course at Yale University. All course slides are available online.

Cs.iastate.edu/~honavar/comp-bio-courses.html

Listing of computational molecular biology course pages that have extensive online course materials.

Bioinf.manchester.ac.uk/dbbrowser/bioactivity/prefacefrm.html

A web-based tutorial associated with "Introduction to bioinformatics" published by Addison Wesley Longman.

Northeastern.edu/bioinformatics/

From the Biology department and in cooperation with Boston University. Emphasis on the ability to integrate knowledge from biological, computational, and mathematical disciplines.

Biocomp.unibo.it/lsbioinfo/

A two year, international master's programme in bioinformatics at the Universita di Bologna, Italy.

Cs.helsinki.fi/bioinformatiikka/mbi/programme.html

A two year Masters Degree Programme in Bioinformatics (MBI) offered by the University of Helsinki and Helsinki University of Technology, Finland.

Ornl.gov/sci/techresources/Human_Genome/education/education.shtml

A resource for introductory information on the Human Genome Project.

His.se/bioinformatics

A one-year, international master's programme in bioinformatics at the University of Skovde, Sweden.

Members.tripod.com/C.elegans/

Resources in biochemical, molecular, cellular, system, and organism biology, including over 25,000 indexed links accumulated since 2000, accessible from topic menus or from a search interface.

Bioinformatics.org/faq/#contents

Summary of basics of bioinformatics for the intelligent newcomer.

Jiscmail.ac.uk/archives/bioinformatics.html

Forum featuring various aspects, events and developments in the bioinformatics field.

Biinoida.blogspot.com

Blog focusing on bioinformatics, biotechnology, pharma regulatory affairs, IPR and clinical trials.

Colorbasepair.com/bioinformatics_courses_tutorials.html

A list of on-line course materials and tutorials for bioinformatics and computational biology.

Geospiza.com/education/

Instructional materials for teaching bioinformatics. These include animated tutorials on topics such as BLAST, finding mutations in a protein, and graphing with MS-Excel.

Bioinformatics.fi

An international, two-year Master's programme jointly managed by the University of Tampere and the University of Turku, Finland.

Perlsource.net

Provides online courses in Perl programming for bioinformatic tools.

String graph based genome assembly software and tools !

Rahul Nayak — Tue, 19 Dec 2017 17:17:38 -0600

In graph theory, a string graph is an intersection graph of curves in the plane; each curve is called a "string". The string graph used in genome assembly is a different concept: it represents overlaps between sequencing reads and was first proposed by E. W. Myers in a 2005 publication. A recent Genome Research paper describing an innovative approach for assembling large genomes from NGS data caught our attention for several reasons: i) it gives a different, "string graph" perspective on the long-standing genome assembly problem; ii) the paper is coauthored by Jared Simpson, the developer of the ABySS assembler, and Richard Durbin; iii) the Simpson-Durbin algorithm does not rely on de Bruijn graphs, and instead employs a different graph construction approach called a "string graph".
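
To make the idea concrete, here is a toy Python sketch of an assembly-style string graph: reads become nodes, suffix-prefix overlaps become edges, and edges that are implied by a two-step path are removed (transitive reduction). It is a naive illustration under simplifying assumptions (error-free reads, exact overlaps, no reverse complements, no contained reads) and is not the algorithm used by Myers or by Simpson and Durbin; all function names are hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Edge = Tuple[int, int]


def longest_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest proper suffix of a that is a prefix of b, or 0."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads: List[str], min_len: int) -> Dict[Edge, int]:
    """Map each directed edge (i, j) to the overlap length between reads i and j."""
    edges: Dict[Edge, int] = {}
    for i, j in permutations(range(len(reads)), 2):
        k = longest_overlap(reads[i], reads[j], min_len)
        if k:
            edges[(i, j)] = k
    return edges


def transitive_reduction(reads: List[str], edges: Dict[Edge, int]) -> Dict[Edge, int]:
    """Drop edge i->k when i->j and j->k exist and spell the same extension."""

    def overhang(i: int, j: int) -> int:
        # Number of bases read j adds beyond the end of read i.
        return len(reads[j]) - edges[(i, j)]

    redundant = set()
    for (i, j) in edges:
        for (j2, k) in edges:
            if j2 == j and (i, k) in edges:
                if overhang(i, j) + overhang(j, k) == overhang(i, k):
                    redundant.add((i, k))
    return {e: w for e, w in edges.items() if e not in redundant}


# Four reads of length 6 tiled over the made-up sequence ATGCGTACCTGA.
reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA"]
graph = overlap_graph(reads, min_len=2)
print(sorted(graph))
print(sorted(transitive_reduction(reads, graph)))
```

After the reduction only the chain of consecutive reads remains, which is what makes the graph easy to walk into contigs.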

The following genome assembly tools are based on string graphs:

1. SGA (String Graph Assembler) https://github.com/jts/sga

SGA assembles large genomes from high-coverage short read data. It is designed as a modular set of programs that are combined to form an assembly pipeline. SGA implements a set of assembly algorithms based on the FM-index; because the FM-index is a compressed data structure, the algorithms are very memory efficient. An SGA assembly has three distinct phases: the first corrects base-calling errors in the reads, the second assembles contigs from the corrected reads, and the third uses paired-end and/or mate-pair data to build scaffolds from the contigs. SGA also provides a pre-assembly quality-check step whose output is a PDF report that allows the properties of the genome and data quality to be visually explored. By providing more information to the user at the start of an assembly project, this report helps increase awareness of the factors that make a given assembly easy or difficult, assists in the selection of software and parameters, and helps to troubleshoot an assembly if it runs into problems.
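
As a rough illustration of the FM-index idea, the Python sketch below builds a Burrows-Wheeler transform of a single short string and counts pattern occurrences by backward search. It is a teaching toy, not SGA's implementation: the transform is built by sorting rotations, the rank queries use plain string counting instead of sampled tables, and SGA indexes a whole read collection rather than one string. The function names are hypothetical.

```python
from typing import Dict


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (tiny inputs only)."""
    text += "$"  # sentinel, assumed to sort before every other character
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_text: str, pattern: str) -> int:
    """Count occurrences of pattern in the original text by backward search."""
    counts: Dict[str, int] = {}
    for ch in bwt_text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[c] = number of characters in the text that sort before c.
    c_table: Dict[str, int] = {}
    total = 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    lo, hi = 0, len(bwt_text)  # half-open interval of matching sorted suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # Naive rank query: occurrences of ch in the BWT prefix.
        lo = c_table[ch] + bwt_text[:lo].count(ch)
        hi = c_table[ch] + bwt_text[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("ACGTACGTAC")
for pattern in ["ACG", "AC", "TT"]:
    print(pattern, count_occurrences(index, pattern))
```

The search only ever touches the transformed string and a small count table, which hints at why FM-index-based algorithms can be memory efficient.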

2. SAGE: String-overlap Assembly of GEnomes https://github.com/lucian-ilie/SAGE2

SAGE is a tool for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon existing work on the string-overlap graph and maximum likelihood assembly, and contributes a number of new ideas, such as the efficient computation of the transitive reduction of the string-overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers.

3. FSG: Fast String Graph

FSG is an integrated assembler that has been assessed on a standard benchmark, showing that it is significantly faster than SGA while maintaining a moderate use of main memory, and that it gains practical advantages from running on multiple threads. Its authors also studied the effect of coverage rates on the running times.

4. BASE https://github.com/dhlbh/BASE

BASE enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of the reads sharing the seeds. These consensus sequences are then extended into contigs. BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.

5. Fermi https://github.com/lh3/fermi/

Fermi is a de novo assembler with a particular focus on assembling Illumina short sequence reads from a mammal-sized genome. In addition to the role of a typical assembler, fermi also aims to preserve heterozygotes which are often collapsed by other assemblers. Its ultimate goal is to find a minimal set of unitigs to represent all the information in raw reads.

If you want to learn more about string graph assemblers, please read the following papers:

i) The Fragment Assembly String Graph - E. W. Myers

This paper describes the String Graph concept.

ii) Efficient construction of an assembly string graph using the FM-index - Jared T. Simpson and Richard Durbin

This is the earlier paper from Simpson and Durbin.

iii) Efficient de novo assembly of large genomes using compressed data structures - Jared T. Simpson and Richard Durbin