BOL: Related items

String graph based genome assembly software and tools

Rahul Nayak — Tue, 19 Dec 2017 17:17:38 -0600

In graph theory, a string graph is an intersection graph of curves in the plane; each curve is called a "string". In genome assembly the term has a different meaning: the assembly string graph, proposed by E. W. Myers in a 2005 publication, is built from the overlaps between sequence reads. A recent Genome Research paper describing an innovative approach for assembling large genomes from NGS data caught our attention for several reasons: i) it gives a different, "string graph" perspective on the long-standing genome assembly problem; ii) the paper is coauthored by Jared Simpson, the developer of the ABySS assembler, and Richard Durbin; iii) the Simpson-Durbin algorithm does not rely on de Bruijn graphs, and instead employs a different graph construction approach called the ‘string graph’.
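To make the overlap idea concrete, here is a minimal Python sketch that builds a small overlap graph from reads. It is an illustration only, not code from any of the tools listed here: the function names, the toy reads and the `min_len` threshold are assumptions, and the all-against-all comparison would be far too slow for real data.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest proper suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len):
    """Map each read to {other_read: overlap_length} for every suffix-prefix overlap found."""
    graph = {read: {} for read in reads}
    for a in reads:
        for b in reads:
            if a != b:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    graph[a][b] = k
    return graph


# Toy reads (hypothetical) sampled from the sequence ACGTACGGTT
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads, min_len=3)
for read, edges in graph.items():
    print(read, "->", edges)
```

A string graph is obtained from such an overlap graph by removing contained reads and transitive edges.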

Following are the genome assembly tools based on string graph:

1. SGA (String Graph Assembler) https://github.com/jts/sga

SGA assembles large genomes from high-coverage short-read data. It is designed as a modular set of programs, which are used to form an assembly pipeline. SGA implements a set of assembly algorithms based on the FM-index. As the FM-index is a compressed data structure, the algorithms are very memory efficient. The SGA assembly has three distinct phases. The first phase corrects base-calling errors in the reads. The second phase assembles contigs from the corrected reads. The third phase uses paired-end and/or mate-pair data to build scaffolds from the contigs. SGA also includes a module for exploring the data before assembly; its output is a PDF report that allows the properties of the genome and the data quality to be explored visually. By providing more information to the user at the start of an assembly project, this report helps increase awareness of the factors that make a given assembly easy or difficult, assists in the selection of software and parameters, and helps to troubleshoot an assembly if it runs into problems.
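As a rough illustration of what an FM-index does, the following Python sketch counts the occurrences of a pattern using backward search over the Burrows-Wheeler transform. It is a toy version written for this post, not SGA's implementation: the tables are uncompressed, the BWT is built by sorting all rotations, and all names are assumptions.

```python
def bwt_from_text(text):
    """Burrows-Wheeler transform via sorted rotations ('$' marks the end of the text)."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_tables(bwt):
    """Return (first, occ): first[c] = number of characters smaller than c,
    occ[c][i] = occurrences of c in bwt[:i]."""
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ


def count_occurrences(pattern, first, occ, n):
    """Backward search: narrow the half-open suffix range [lo, hi) one character at a time."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


bwt = bwt_from_text("ACGTACGA")
first, occ = build_fm_tables(bwt)
print(count_occurrences("ACG", first, occ, len(bwt)))
```

A production FM-index stores sampled, compressed versions of these tables, which is where the memory efficiency comes from.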

2. SAGE: String-overlap Assembly of GEnomes https://github.com/lucian-ilie/SAGE2

SAGE is a tool for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon existing work on the string-overlap graph and maximum likelihood assembly, and brings a number of new ideas, such as the efficient computation of the transitive reduction of the string-overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers.
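Transitive reduction removes an edge a -> c whenever a path a -> b -> c already implies it. The Python sketch below shows the naive form of this idea on a tiny hypothetical graph; it is not SAGE's efficient algorithm, it ignores overlap lengths, and it assumes the graph has no cycles.

```python
def transitive_reduction(graph):
    """Drop every edge a -> c for which some b has both a -> b and b -> c.

    graph maps each node to the set of nodes it points to (assumed acyclic).
    """
    reduced = {node: set(targets) for node, targets in graph.items()}
    for a, targets in graph.items():
        for b in targets:
            for c in graph.get(b, ()):
                if c in targets:
                    # a -> c is implied by a -> b -> c, so it carries no new information
                    reduced[a].discard(c)
    return reduced


# Hypothetical reads r1, r2, r3 laid out left to right along the genome
overlaps = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
print(transitive_reduction(overlaps))
```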

3. FSG: Fast String Graph

FSG has been assessed as an integrated assembler on a standard benchmark, showing that it is significantly faster than SGA while maintaining a moderate use of main memory, and showing practical advantages when run on multiple threads. The authors have also studied the effect of coverage rates on the running times.

4. BASE https://github.com/dhlbh/BASE

BASE enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. It is relatively easy to extend BASE to include scaffolding.

5. Fermi https://github.com/lh3/fermi/

Fermi is a de novo assembler with a particular focus on assembling Illumina short sequence reads from a mammal-sized genome. In addition to the role of a typical assembler, fermi also aims to preserve heterozygotes which are often collapsed by other assemblers. Its ultimate goal is to find a minimal set of unitigs to represent all the information in raw reads.
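A unitig corresponds to a maximal non-branching path in the assembly graph. The Python sketch below extracts such paths from a small directed graph; it only illustrates the concept and is not how fermi works internally. The graph, node names and function name are hypothetical, and isolated cycles are not reported.

```python
def maximal_nonbranching_paths(graph):
    """Return maximal paths whose internal nodes have exactly one incoming and one outgoing edge.

    graph maps every node to a list of successor nodes (every node must appear as a key).
    """
    indegree = {node: 0 for node in graph}
    for targets in graph.values():
        for target in targets:
            indegree[target] += 1

    def one_in_one_out(node):
        return indegree[node] == 1 and len(graph[node]) == 1

    paths = []
    for node in graph:
        if one_in_one_out(node):
            continue  # internal node, reached while extending a path
        if not graph[node]:
            if indegree[node] == 0:
                paths.append([node])  # isolated node forms its own path
            continue
        for successor in graph[node]:
            path = [node, successor]
            while one_in_one_out(path[-1]):
                path.append(graph[path[-1]][0])
            paths.append(path)
    return paths


toy_graph = {"a": ["b"], "b": ["c"], "c": ["d", "e"], "d": [], "e": []}
print(maximal_nonbranching_paths(toy_graph))
```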

If you want to learn more about string graph assemblers, please read the following papers:

i) The Fragment Assembly String Graph - E. W. Myers

This paper describes the String Graph concept.

ii) Efficient construction of an assembly string graph using the FM-index - Jared T. Simpson and Richard Durbin

This is the earlier paper from Simpson and Durbin, on building the assembly string graph with the FM-index.

iii) Efficient de novo assembly of large genomes using compressed data structures - Jared T. Simpson and Richard Durbin

Research Fellow in Bioinformatics @ Queen's University Belfast -Institute for Global Food Security, School of Biological Sciences

Thu, 17 Oct 2013 04:33:02 -0500

Ref: 13/102900

Available immediately until 30th November 2015, to work on the development of bioinformatics approaches to aid analysis of data derived from the metabolomic profiling of biological matrices. The successful applicant will lead research activities on an FP7 funded EU-wide collaborative project aimed at establishing biomarker-based strategies for high throughput diagnostic screening. Key tasks will involve multivariate analysis of large datasets, bioinformatic-based selection and validation of identified markers, construction of metabolomic spectral profile databases and development of machine learning/database searching approaches amenable to analytical screening techniques. This position will offer the opportunity to travel and undertake work with project collaborators based in the Republic of Ireland and Europe.

Informal enquiries may be directed to Dr Terry McGrath, email: terry.mcgrath@qub.ac.uk.

Anticipated interview date: Thursday 31st October 2013
Salary scale: £30,424 – £39,649 per annum (including contribution points)
Closing date: Monday 21st October 2013

Telephone (028) 90973044 FAX: (028) 90971040 or e-mail on personnel@qub.ac.uk

The University is committed to equality of opportunity and to selection on merit. It therefore welcomes applications from all sections of society and particularly welcomes applications from people with a disability.

Fixed term contract posts are available for the stated period in the first instance but in particular circumstances may be renewed or made permanent subject to availability of funding.

More @ https://hrwebapp.qub.ac.uk/tlive_webrecruitment/wrd/run/ETREC107GF.open?VACANCY_ID=5616943npO&WVID=6273090Lgx&LANG=USA

MUMmer4: A fast and versatile genome alignment system

Jit — Sat, 03 Feb 2018 04:59:17 -0600

MUMmer4 is a substantially improved version of MUMmer that addresses genome size constraints by changing the 32-bit suffix tree data structure at the core of MUMmer to a 48-bit suffix array, and that offers improved speed through parallel processing of input query sequences. With a theoretical limit on the input size of 141 Tbp, MUMmer4 can now work with input sequences of any biologically realistic length. The authors show that, as a result of these enhancements, the nucmer program in MUMmer4 is easily able to handle alignments of large genomes.
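For readers unfamiliar with suffix arrays, here is a small Python sketch of the data structure and of exact-match lookup by binary search. It is a naive illustration, not MUMmer4's implementation: the construction sorts whole suffixes, which is only practical for short strings, and the function names are assumptions.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions where pattern occurs, via two binary searches."""
    m = len(pattern)

    # First suffix whose m-character prefix is >= pattern
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-character prefix is > pattern
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


reference = "GATTACAGATTACA"
sa = build_suffix_array(reference)
print(find_occurrences(reference, sa, "TTA"))
```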

Address of the bookmark: https://mummer4.github.io/

JRF @ NATIONAL JALMA INSTITUTE OF LEPROSY AND OTHER MYCOBACTERIAL DISEASES

Mon, 28 Oct 2013 10:42:48 -0500

NATIONAL JALMA INSTITUTE OF LEPROSY AND OTHER MYCOBACTERIAL DISEASES

(INDIAN COUNCIL OF MEDICAL RESEARCH)

P.O BOX 101,
Dr. M. Miyazaki Marg,
Tajganj, Agra - 282001

Applications are invited for a walk-in interview to be held in the Seminar Hall of the Institute on 15th November, 2013, 9:30 am for temporary positions of JRF, Lab Technician and Field Attendant in an ICMR-funded project entitled "Elucidating the strain differentiation and transmission dynamics of M. leprae through simple sequence repeats ISSR-PCR marker"

1. JRF (one Post)

Essential qualification: Candidates with M.Sc/M.Tech or equivalent degree in any life science related subject, with UGC-CSIR/ICMR/DBT-NET qualification

Desirable qualification: Experience in Molecular Biology/Computational Biology will be preferred.

Age: Maximum 28 years as on 11.11.2013. Age relaxation as per GOI rules.

Emoluments: Rs. 6,000 + 20% HRA per Month

2. Lab Technician (One Post)

Essential Qualification: 12th with DMLT/B.Sc/M.Sc in Life Sciences

Desirable qualification: Experience in Molecular Biology/Computational Biology will be preferred.

Age: Maximum 30 years as on 11.11.2013. Age relaxation as per GOI rules.

Emoluments: Rs. 13,760/- per month

3. Field Attendant (One Post)

Essential Qualification: 10th Pass

Desirable Qualification: Experience in field work

Age: Maximum 28 years as on 11.11.2013. Age relaxation as per GOI rules.

Emoluments: Rs. 12,040/- per month

Terms: posts are purely temporary. Appointment will be initially made for a period of one (01) year and may be extended further based on the performance of the candidate up to completion of the project.

Application & Selection procedure: Candidates have to appear in the walk-in interview in person along with an application/CV on plain paper giving details of educational qualifications and experience, and submit photocopies of relevant documents at the time of interview. Selection will be based on the performance of the candidate in the interview. Candidates will not be sent any interview call letter separately. No TA/DA will be paid to the candidate for appearing in the interview. Selection is not possible without appearing in the interview. All candidates must report by 9:00 am on the date of interview. An advance copy of the CV may be sent to m.sarathipartha@gmail.com

Advertisement: http://www.jalma-icmr.org.in/P_S_M_advertisment.pdf

AlignGraph: algorithm for secondary de novo genome assembly guided by closely related references

Manisha Mishra — Tue, 17 Apr 2018 16:21:20 -0500

AlignGraph is software that extends and joins contigs or scaffolds by reassembling them with the help of a reference genome of a closely related organism.

Using AlignGraph

AlignGraph --read1 reads_1.fa --read2 reads_2.fa --contig contigs.fa --genome genome.fa --distanceLow distanceLow --distanceHigh distanceHigh --extendedContig extendedContigs.fa --remainingContig remainingContigs.fa [--kMer k --insertVariation insertVariation --coverage coverage --part p --fastMap --ratioCheck --iterativeMap --misassemblyRemoval --resume]
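When the command is run from a script, it can help to assemble the argument list programmatically. The Python sketch below only builds the required arguments shown in the usage line above; the helper name is made up, and the two distance values are placeholders that must be replaced with numbers suited to your library.

```python
def aligngraph_command(read1, read2, contig, genome,
                       distance_low, distance_high,
                       extended_out="extendedContigs.fa",
                       remaining_out="remainingContigs.fa"):
    """Build the AlignGraph argument list using only the required flags from the usage line."""
    return [
        "AlignGraph",
        "--read1", read1,
        "--read2", read2,
        "--contig", contig,
        "--genome", genome,
        "--distanceLow", str(distance_low),
        "--distanceHigh", str(distance_high),
        "--extendedContig", extended_out,
        "--remainingContig", remaining_out,
    ]


# Placeholder distances (hypothetical values, not recommendations)
command = aligngraph_command("reads_1.fa", "reads_2.fa", "contigs.fa", "genome.fa", 100, 1000)
print(" ".join(command))
# To actually run it (AlignGraph must be installed and on PATH):
#   import subprocess; subprocess.run(command, check=True)
```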

Address of the bookmark: https://github.com/baoe/AlignGraph

Post-doc in Systems Genetics

Wed, 08 Jan 2014 19:23:37 -0600

Gagneur lab at Gene Center, Ludwig-Maximilians-Universitaet, Munich, Germany

Deadline for applications : January 15, 2014.

Description :

We seek a talented and motivated post-doc to develop computational methods for inferring the molecular basis of genetic diseases by integration of personal omics data. Research topics include: identifying causal mutations of rare disease patients by meta-analysis; inferring disease-causing molecular pathways from genotype, human phenotypes, and omics profile of patient-derived cell lines; and causal inference from longitudinal omics studies of patients. The developed methods will be applied to analyze data from our medical collaborators.

Candidates must either hold a PhD in computational biology or bioinformatics, or hold a PhD in physics, statistics, or applied mathematics with practical experience with high-dimensional data analysis. Experience in quantitative genetics is a plus. Applicants must have a proven publication record and an interest for translational research.

The Gagneur lab is a young, lively and multidisciplinary group with a research focus on systems genetics and gene regulation. It is located at the Gene Center of the LMU (University of Munich), an interdisciplinary institution whose 16 independent research groups investigate the regulation of gene expression at all levels - from the underlying molecular mechanisms to the biological system. The institute is located on the biomedical research campus Munich-Grosshadern, offering a dynamic, interactive and internationally oriented research environment. The dynamism of Munich and the proximity of the Alps provide an excellent quality of life.

The salary is according to the TV-L (German academic salary scale).
Applications including a cover letter, CV, and references must be sent by January 15th 2014 to Julien Gagneur (gagneur@genzentrum.lmu.de)

About the lab: http://www.gagneur.genzentrum.lmu.de

CrossMap: a program for convenient conversion of genome coordinates

Jit — Thu, 31 May 2018 06:00:47 -0500

CrossMap is a program for convenient conversion of genome coordinates (or annotation files) between different assemblies (such as Human hg18 (NCBI36) <> hg19 (GRCh37), Mouse mm9 (MGSCv37) <> mm10 (GRCm38)). It supports the most commonly used file formats, including SAM/BAM, Wiggle/BigWig, BED, GFF/GTF and VCF. CrossMap is designed to lift over genome coordinates between assemblies. It’s not a program for aligning sequences to a reference genome. The authors do not recommend using CrossMap to convert genome coordinates between species.
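The core idea of coordinate liftover can be sketched in a few lines of Python. This is a simplified illustration, not CrossMap's code: it assumes a hypothetical list of aligned blocks on a single chromosome and strand, whereas real chain files also handle strand changes, different chromosomes and gaps on both sides.

```python
from bisect import bisect_right

# Hypothetical aligned blocks: (old_start, old_end, new_start), half-open, sorted by old_start
BLOCKS = [(0, 1000, 500), (1200, 2000, 1500)]


def lift_position(pos, blocks):
    """Map a 0-based position from the old assembly to the new one, or None if unmapped."""
    starts = [block[0] for block in blocks]
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None
    old_start, old_end, new_start = blocks[i]
    if pos >= old_end:
        return None  # position falls in a gap between aligned blocks
    return new_start + (pos - old_start)


for position in (100, 1100, 1250):
    print(position, "->", lift_position(position, BLOCKS))
```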

Address of the bookmark: http://crossmap.sourceforge.net

Computer experts in biotechnology laboratory

Jit — Wed, 04 Dec 2013 02:11:43 -0600

Only a bioinformatician can understand that multiplication and division are different but the same thing :)

Disclaimer: This cartoon is solely designed to create humour and fun, not to offend any computer experts.

assemblytics: delta file to analyze alignments of an assembly to another assembly or a reference genome

Jit — Thu, 14 Jun 2018 07:31:00 -0500

1. Download and install MUMmer.

2. Align your assembly to a reference genome using nucmer (from the MUMmer package):

   $ nucmer -maxmatch -l 100 -c 500 REFERENCE.fa ASSEMBLY.fa -prefix OUT

   Consult the MUMmer manual if you encounter problems.

3. Optional: gzip the delta file to speed up upload (usually 2-4X faster):

   $ gzip OUT.delta

   Then use the OUT.delta.gz file for upload.

4. Upload the .delta or .delta.gz file to Assemblytics.

Important: Use only contigs rather than scaffolds from the assembly. This will prevent false positives when the number of Ns in the scaffolded sequence does not match perfectly to the distance in the reference.

The unique sequence length required represents an anchor for determining if a sequence is unique enough to safely call variants from, which is an alternative to the mapping quality filter for read alignment.
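Before uploading, it can be useful to peek inside the delta file that nucmer writes. The Python sketch below reads the alignment coordinates from the plain-text delta format: two header lines, then a ">" line per sequence pair, then for each alignment a line of seven integers followed by indel offsets ending with 0. The function name and the sample records are made up for illustration, and the indel offsets are skipped.

```python
def parse_delta(lines):
    """Collect alignment coordinates from the lines of a nucmer .delta file."""
    alignments = []
    ref_id = qry_id = None
    for line in list(lines)[2:]:  # skip the two header lines (input paths, program name)
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            ref_id, qry_id = fields[0][1:], fields[1]
        elif len(fields) == 7:
            ref_start, ref_end, qry_start, qry_end, errors = map(int, fields[:5])
            alignments.append({
                "ref": ref_id, "qry": qry_id,
                "ref_start": ref_start, "ref_end": ref_end,
                "qry_start": qry_start, "qry_end": qry_end,
                "errors": errors,
            })
        # single-integer lines are indel offsets; they are ignored in this sketch
    return alignments


# Made-up example content, not output from a real run
sample = [
    "/data/REFERENCE.fa /data/ASSEMBLY.fa",
    "NUCMER",
    ">chr1 contig1 5000 3000",
    "100 1100 1 1001 5 5 0",
    "0",
    "2000 2500 1500 2000 2 2 0",
    "0",
]
for alignment in parse_delta(sample):
    print(alignment)
```

For a real file, pass the lines of the opened file instead, for example `parse_delta(open("OUT.delta"))`.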

Address of the bookmark: http://assemblytics.com/

Hidden Markov Models, Viterbi Algorithm, Markov Chain Exploration with script

Manisha Mishra — Thu, 14 Nov 2013 13:36:56 -0600

Hidden Markov Models, the Viterbi Algorithm, and CpG Islands (in VB6)

Problem :

The CG island is a stretch of DNA (usually longer than 200 bases) in which the frequency of the CG sequence is higher than other regions. It is also called the CpG island, where "p" simply indicates that "C" and "G" are connected by a phosphodiester bond.
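As a quick illustration of "frequency of the CG sequence", the Python snippet below computes the fraction of adjacent base pairs in a sequence that are CG. The function name and the two toy sequences are made up; published CpG island criteria also consider GC content and an observed/expected ratio, which this sketch leaves out.

```python
def cg_dinucleotide_frequency(seq):
    """Fraction of adjacent base pairs in seq that are the dinucleotide CG."""
    seq = seq.upper()
    if len(seq) < 2:
        return 0.0
    hits = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    return hits / (len(seq) - 1)


island_like = "CGCGCGCG"   # toy CG-rich sequence
background = "ATATATAT"    # toy CG-free sequence
print(cg_dinucleotide_frequency(island_like), cg_dinucleotide_frequency(background))
```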

CpG islands are often located around the promoters of housekeeping genes (which are essential for general cell functions) or other genes frequently expressed in a cell. At these locations, the CG sequence is not methylated. By contrast, the CG sequences in inactive genes are usually methylated to suppress their expression. The methylated cytosine may be converted to thymine by accidental deamination. Unlike the cytosine to uracil mutation which is efficiently repaired, the cytosine to thymine mutation can be corrected only by the mismatch repair which is very inefficient. Hence, over evolutionary time scales, the methylated CG sequence will be converted to the TG sequence.
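The tutorials linked in this post use VB6; as a language-neutral illustration, here is a minimal Viterbi sketch in Python for a two-state model (island versus background). The probabilities are invented for the example, and the model is deliberately simplified: it emits single bases, whereas the classic CpG island HMM uses eight states so that dinucleotide context is captured.

```python
import math

STATES = ("island", "background")
# All probabilities below are illustrative assumptions, not fitted values
START = {"island": 0.5, "background": 0.5}
TRANS = {
    "island": {"island": 0.9, "background": 0.1},
    "background": {"island": 0.1, "background": 0.9},
}
EMIT = {
    "island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(observations):
    """Most likely state path for a non-empty string over A, C, G, T (log-space)."""
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}]
    backpointers = []
    for symbol in observations[1:]:
        previous = scores[-1]
        column, pointers = {}, {}
        for state in STATES:
            best = max(STATES, key=lambda p: previous[p] + math.log(TRANS[p][state]))
            column[state] = (previous[best] + math.log(TRANS[best][state])
                             + math.log(EMIT[state][symbol]))
            pointers[state] = best
        scores.append(column)
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("CGCGCGCG"))
print(viterbi("ATATATAT"))
```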

Find a step-wise explanation and implementation steps at http://dna.cs.byu.edu/bio465/Labs/hmm.shtml

Source code with explanation: http://www.tannerhelland.com/1187/hidden-markov-models-viterbi-algorithm-cpg-islands-in-vb6/

For a detailed understanding of HMMs, read this excellent tutorial: http://www.cs.ubc.ca/~murphyk/Software/HMM/labman2.pdf

Viterbi algorithm: http://en.wikipedia.org/wiki/Viterbi_path

For further reading, see the Wiki page: http://en.wikipedia.org/wiki/Hidden_Markov_model

For a paper on CpG islands and a more in-depth understanding, see http://www.biomedcentral.com/1471-2164/12/S2/S10

If you are interested in exploring Markov chains with a graphical implementation, please visit http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=75049&lngWId=1

Reference:

1. http://www.planet-source-code.com

2. http://www.tannerhelland.com