BOL: Related items

Cgaln

Jit — Wed, 22 Feb 2017 05:14:15 -0600

Cgaln (Coarse grained alignment) is a program designed to align a pair of whole genomic sequences of not only bacteria but also entire chromosomes of vertebrates on a nominal desktop computer. Cgaln performs an alignment job in two steps, at the block level and then at the nucleotide level. The former "coarse-grained" alignment can explore genomic rearrangements and reduce the regions to be analyzed in the next step. The latter is devoted to detailed alignment within the limited regions found in the first stage. The output of Cgaln is 'glocal' in the sense that rearrangements are taken into consideration while each alignable region is extended as long as possible. Thus, Cgaln is not only fast and memory-efficient, but also can filter noisy outputs without missing the most important homologous segment pairs.

http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/cgaln/

Address of the bookmark: http://www.iam.u-tokyo.ac.jp/chromosomeinformatics/rnakato/cgaln/

Biostats Online course

Abhimanyu Singh — Thu, 10 Nov 2016 04:22:51 -0600

One of our primary focuses will be to develop an understanding of the various ways in which we can assign a probability to some chance event. We'll also learn the fundamental properties of probability, investigate how probability behaves, and learn how to calculate the probability of a new chance event.

This book is handy for understanding basic concepts.

Address of the bookmark: https://onlinecourses.science.psu.edu/stat414/node/287

MafTools

Jit — Thu, 16 Feb 2017 11:16:01 -0600

maftools - An R package to summarize, analyze and visualize MAF files.

With advances in cancer genomics, Mutation Annotation Format (MAF) is being widely accepted and used to store detected variants. The Cancer Genome Atlas Project has sequenced over 30 different cancers, with the sample size of each cancer type being over 200. The resulting data, consisting of genetic variants, is stored in Mutation Annotation Format. This package attempts to summarize, analyze, annotate and visualize MAF files in an efficient manner, either from TCGA sources or from any in-house studies, as long as the data is in MAF format. Maftools can also handle ICGC Simple Somatic Mutation format.
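To make the idea of "summarizing a MAF file" concrete, here is a minimal Python sketch. It is not maftools (which is an R package) and does not reproduce its API; it only assumes that a MAF file is tab-separated and carries the standard columns `Hugo_Symbol`, `Variant_Classification` and `Tumor_Sample_Barcode`. The function name `summarize_maf` and the sample records are hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up MAF excerpt for illustration; real files carry many more columns.
MAF_TEXT = (
    "#version 2.4\n"
    "Hugo_Symbol\tVariant_Classification\tTumor_Sample_Barcode\n"
    "TP53\tMissense_Mutation\tSAMPLE-01\n"
    "TP53\tNonsense_Mutation\tSAMPLE-02\n"
    "KRAS\tMissense_Mutation\tSAMPLE-01\n"
)


def summarize_maf(handle):
    """Count variants per classification and mutated samples per gene."""
    # Skip comment lines such as the optional "#version" header.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")

    by_class = Counter()
    samples_by_gene = {}
    for row in reader:
        by_class[row["Variant_Classification"]] += 1
        samples_by_gene.setdefault(row["Hugo_Symbol"], set()).add(
            row["Tumor_Sample_Barcode"]
        )
    # Number of distinct samples in which each gene is mutated.
    mutated_samples = {gene: len(s) for gene, s in samples_by_gene.items()}
    return by_class, mutated_samples


if __name__ == "__main__":
    classes, genes = summarize_maf(io.StringIO(MAF_TEXT))
    print(dict(classes))
    print(genes)
```

For real analyses the R package itself is the appropriate tool; the sketch only illustrates the kind of per-gene and per-class tallies such a summary starts from.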

maftools is on bioRxiv

Please cite the following if you find this tool useful.

Mayakonda, A. and H.P. Koeffler, Maftools: Efficient analysis, visualization and summarization of MAF files from large-scale cohort based cancer studies. bioRxiv, 2016. doi: http://dx.doi.org/10.1101/052662

Address of the bookmark: https://github.com/PoisonAlien/maftools

DIAL

Abhimanyu Singh — Wed, 01 Mar 2017 08:42:28 -0600

A computational pipeline for identifying single-base substitutions between two closely related genomes without the help of a reference genome. DIAL works even when the depth of coverage is insufficient for de novo assembly, and it can be extended to determine small insertions/deletions. Our main motivation is to use this tool to survey the genetic diversity of endangered species as the identified sequence differences can be used to design genotyping arrays to assist in the species' management.

http://www.bx.psu.edu/~ratan/

Address of the bookmark: http://www.bx.psu.edu/miller_lab/

Krona

Jit — Wed, 22 Mar 2017 04:47:35 -0500

Krona allows hierarchical data to be explored with zooming, multi-layered pie charts. Krona charts can be created using an Excel template or KronaTools, which includes support for several bioinformatics tools and raw data formats. The interactive charts are self-contained and can be viewed with any modern web browser.
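As an illustration of the "raw data formats" route, the sketch below writes hierarchical counts as tab-delimited text, one line per leaf: a quantity followed by the lineage from the highest level down. This layout is an assumption based on my understanding of the input expected by the KronaTools `ktImportText` script and should be checked against the KronaTools documentation; the helper name `krona_text_lines` and the example counts are hypothetical.

```python
def krona_text_lines(counts):
    """Turn {lineage tuple: quantity} into tab-delimited lines.

    Each line is "<quantity>\t<level 1>\t<level 2>...", which is the
    assumed input layout for KronaTools' text importer.
    """
    return [
        "\t".join([str(quantity), *lineage])
        for lineage, quantity in sorted(counts.items())
    ]


if __name__ == "__main__":
    # Made-up read counts per taxonomic lineage.
    example = {
        ("Bacteria", "Proteobacteria", "Escherichia"): 120,
        ("Bacteria", "Firmicutes", "Bacillus"): 45,
        ("Archaea",): 5,
    }
    with open("krona_input.txt", "w") as out:
        out.write("\n".join(krona_text_lines(example)) + "\n")
```

The resulting file would then be passed to KronaTools to render the interactive chart.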

Address of the bookmark: https://github.com/marbl/Krona/wiki

TMAP - torrent mapping alignment program General Notes

Poonam Mahapatra — Sun, 02 Apr 2017 15:53:47 -0500


TMAP is a fast and accurate alignment software for short and long nucleotide sequences produced by next-generation sequencing technologies.

The latest TMAP is unsupported. To use a supported version, please see the TMAP version associated with a Torrent Suite release below.

Get the latest source code:

git clone https://github.com/iontorrent/TMAP.git
cd TMAP
git submodule init
git submodule update

https://github.com/iontorrent/TS/tree/master/Analysis/TMAP

Address of the bookmark: https://github.com/iontorrent/TS/tree/master/Analysis/TMAP

CAR: Reconstructing Contiguous Regions of an Ancestral Genome

Abhimanyu Singh — Thu, 18 May 2017 05:24:01 -0500

We describe a new method for predicting the ancestral order and orientation of those intervals from their observed adjacencies in modern species. We combine the results from this method with data from chromosome painting experiments to produce a map of an early mammalian genome that accounts for 96.8% of the available human genome sequence data. The precision is further increased by mapping inversions as small as 31 bp. Analysis of the predicted evolutionary breakpoints in the human lineage confirms certain published observations but disagrees with others. Although only a few mammalian genomes are currently sequenced to high precision, our theoretical analyses and computer simulations indicate that our results are reasonably accurate and that they will become highly accurate in the foreseeable future. Our methods were developed as part of a project to reconstruct the genome sequence of the last ancestor of humans, dogs, and most other placental mammals.

Address of the bookmark: http://www.bx.psu.edu/miller_lab/car/

Tetra-Nucleotide Analysis

Jit — Thu, 04 May 2017 05:07:41 -0500

A tetra-nucleotide is a fragment of DNA sequence with 4 bases (e.g. AGTC or TTGG). Pride et al. (2003) showed that the frequency of tetra-nucleotides in bacterial genomes contains useful, albeit weak, phylogenetic signals. Even though tetra-nucleotide analysis (TNA) uses information from the whole genome, it cannot replace alignment-based phylogenetic methods such as OrthoANI or 16S rRNA phylogeny. However, TNA can be useful for phylogenetic characterization when whole-genome or 16S rRNA gene information is not available. For example, a partial genomic fragment obtained from a metagenome can be identified by TNA (Teeling et al., 2004). TNA is also fast enough that it can be used as a search engine against a large genome database.
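The core of the idea can be sketched in a few lines of Python: count every overlapping 4-base window, turn the counts into a 256-element frequency profile, and compare two profiles. This is a simplified illustration, not the procedure of the cited papers; comparing raw frequencies with a Pearson correlation is an assumption made here for brevity, reverse-complement strands are ignored, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 4^4 = 256 possible tetra-nucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(seq):
    """Relative frequency of each tetra-nucleotide in seq."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing ambiguous bases (e.g. N) are simply ignored.
    total = sum(counts[t] for t in TETRAMERS)
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    fragment = tetra_profile("AGTCAGTCTTGGAGTC")
    genome = tetra_profile("AGTCAGTCAGTCTTGG")
    print(round(pearson(fragment, genome), 3))
```

Because a profile needs only one pass over the sequence and has a fixed length, comparing a fragment against many precomputed genome profiles is cheap, which is what makes the database-search use mentioned above feasible.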

Address of the bookmark: https://chunlab.wordpress.com/tetra-nucleotide-analysis/

Tryst with a Bioinformatician # Dr Altan Kara

Jitendra Narayan — Thu, 16 Nov 2017 08:47:52 -0600

Dr Altan Kara is a bioinformatics specialist at the Gene Engineering and Biotechnology Institute of the TUBITAK MAM Research Center. His research interests revolve around cancer informatics and computer-aided drug design. I applaud Dr Altan for clearly setting out his expectations of people who join his lab/university, as well as listing his responsibilities to his research members at the TUBITAK MAM Research Institute. Hopefully, this interview will prove useful to others in the field, especially to those who are just starting their bioinformatics careers.

You can find out more about Dr Altan by visiting his (well documented) lab page (http://gmbe.mam.tubitak.gov.tr/en) and BOL page http://bioinformaticsonline.com/profile/altan . And now, on to the BOL:“Tryst with a Bioinformatician” interview series ...

What pushed you to join Computational Biology/Bioinformatics?

In my view, bioinformatics is the center of modern biological research. If a researcher wants to discover new biological insights by evaluating globally produced biological data and to derive unified solutions for specific biological problems, learning bioinformatics is the only way to achieve this goal.

What fascinates you about Computational Biology/Bioinformatics?

Its flexibility. As is well known, highly diverse and complex biological questions are waiting to be answered, and it is impossible to address this diversity with one and the same approach. Thus, the method employed has to be tailored to the targeted biological problem, and bioinformatics tools make this easy to achieve.

What is the one word you would use to describe yourself?

Bioinformatician. :)

Can you please describe your research work in a nutshell for BOL users?

At my current institute, I am working in the field of cancer bioinformatics. Briefly, the overall aim of the project I am working on, AKMARK (project code: 5153403), is to apply bioinformatics-supported genome, transcriptome, proteome, and metabolome analysis to reveal the molecular profile of the disease through an integrated approach, and to develop an early diagnosis and scanning kit based on this profile. Within the scope of the project, we will reveal alterations in the gene, transcript, protein, and metabolite profiles between normal tissue, normal tissue adjoining the tumor (reactive stroma), tumor tissue, lymph node metastasis, and blood samples taken from the same patient, as well as the reflection of these changes in some other selected body fluids. The molecular structures involved in the development and progression of non-small cell lung cancer (NSCLC) will be determined and related to clinical data, tumor-node-metastasis (TNM) staging, and histology. We aim to develop a diagnostic kit for immediate clinical purposes and an electrochemical biosensor for quick on-site applications, using a number of antibodies and aptamers raised against the most specific biomarker selected from the panel.

Is there anything else we should know about you and your research?

Besides AKMARK, I am also preparing a side project that aims to develop a computational method for designing inhibitors of prokaryotic two-component systems. In this project, I will collaborate with Prof. Maria Kontoyianni of the School of Pharmacy, Southern Illinois University Edwardsville (SIUE).

What was your greatest scientific disappointment in life till now?

So far I have not experienced any memorable scientific disappointment in my life. :)

What major research challenges and problems have you faced so far? How did you handle them?

The major challenge I have faced so far in my scientific career was predicting the interactions between prokaryotic two-component proteins. To predict these interactions accurately, I created a meta-predictor using a support vector machine. With this technique I integrated six different protein-protein interaction prediction methods so that the disadvantages of one method are covered by the advantages of another. The meta-predictor I developed during this work is accessible via http://metapred2cs.ibers.aber.ac.uk/ and more detailed information about the system can be found in the articles PMID: 27378293 and PMID: 26384938.

What's your all-time favourite bioinformatics package, and why?

For me, the best bioinformatics package is R/Bioconductor. I like it because it provides many useful tools for comprehensive analysis and comparison of high-throughput experimental data in an integrated manner and, besides the many packages it offers, it is open source and open for development. As a result, it provides strong and flexible ways to do science.

In bioinformatics, which of the following roles do you see yourself in: scientist, analyst, developer, engineer, or pure academician?

Scientist / Developer.

What would you like to accomplish in the next five to ten years?

For my current research, I would like to design a pipeline that automatically integrates and analyses omics data for cancer research, aimed specifically at biomarker and novel drug target discovery. In addition, I would also like to develop another pipeline for prokaryotic TCS protein structure prediction and inhibitor design.

When you retire, what would you tell the next generation of bioinformaticians?

Bioinformatics is not all about scripting and researchers who study in this field should never expect a tool to do their analyses for them. Besides computational skills, a bioinformatician must have a strong biological background in his/her research area which will allow them to understand if anything went wrong during their run by only looking at the results instead of just blindly trusting the output of the bioinformatics tools.

What will you miss most about bioinformatics when you are no longer working in this field?

Bioinformatics is open to multidisciplinary research with scientists all around the world. As a result, while working in this field I can interactively learn a lot from a wide research community. I think this is the one thing I will miss the most.

If you own a bioinformatics company in the future, what will its focus and aim be?

With the increasing amount of data in databases, there is already a massive need for effective methods to eliminate manipulated data and reach clean, useful information. As time passes, data mining will become the first step of any research project. For this reason, the major goal of my bioinformatics company will be to develop effective tools that eliminate manipulated datasets and information from the literature and provide trustworthy, clean information and datasets for researchers.

How much will bioinformatics change by 2050, according to your wildest imagination?

Bioinformatics is a field that changes constantly and dynamically. As bioinformatics progresses, new tools and methods become available, providing better applications of existing methods or entirely new methods that offer alternative solutions to various biological problems. Along with these updates, developers also provide easy-to-use GUIs for most of the tools. If the field carries on developing like this, every researcher with a strong biological background will be able to perform bioinformatics analyses by him/herself without needing professional help. As a result, almost all bioinformaticians will be responsible only for developing new methods and tools.

What one piece of advice would you give someone who is trying to reinvent themselves and enter the bioinformatics sector?

Bioinformatics is a wide field with a lot of career options. Thus, a researcher who would like to step into this field should first be clear about the branch of bioinformatics they want to work in. Following this decision, they should learn at least one programming language and investigate how other researchers have employed that language in their research, and WHY. A researcher in this field should never create and use copy-paste scripts but must always understand WHY the other researcher worked in that way. Knowing the answer to this question is the only way to learn bioinformatics. Besides, a researcher in any branch of bioinformatics must always keep good control of the working environment. In other words, one should always be able to manage input and output directories, modify files or directories, and annotate and modify the scripts employed during the research, and should not allow any confusion between the different stages of the research. Finally, they should not blindly trust the output of a tool or software package but should run a benchmarking test for each tool they decide to use in their research. In addition, even if the tools pass the benchmarking, researchers should have a good biological background in their field to tell whether anything went wrong during the process just by looking at the output(s) of the employed pipelines, packages, or tools.

Alignment-free sequence comparison tools available for next-generation sequencing data analysis

Abhimanyu Singh — Tue, 07 Nov 2017 05:33:33 -0600

kallisto

Transcript abundance quantification from RNA-seq data (uses pseudoalignment for rapid determination of read compatibility with targets)

Software (C++)

https://pachterlab.github.io/kallisto/

Sailfish

Estimation of isoform abundances from reference sequences and RNA-seq data (k-mer based)

Software (C++)

http://www.cs.cmu.edu/~ckingsf/software/sailfish/

Salmon

Quantification of the expression of transcripts using RNA-seq data (uses k-mers)

Software (C++)

https://combine-lab.github.io/salmon/

RNA-Skim

RNA-seq quantification at transcript-level (partitions the transcriptome into disjoint transcript clusters; uses sig-mers, a special type of k-mers)

Software (C++)

http://www.csbio.unc.edu/rs/

Variant calling

ChimeRScope

Fusion transcript prediction using gene k-mers profiles of the RNA-seq paired-end reads

Software (Java)

https://github.com/ChimeRScope/ChimeRScope/wiki

FastGT

Genotyping of known SNV/SNP variants directly from raw NGS sequence reads by counting unique k-mers

Software (C)

https://github.com/bioinfo-ut/GenomeTester4/

Phy-Mer

Reference-independent mitochondrial haplogroup classifier from NGS data (k-mer based)

Software (Python)

https://github.com/danielnavarrogomez/phy-mer

LAVA

Genotyping of known SNPs (dbSNP and Affymetrix's Genome-Wide Human SNP Array) from raw NGS reads (k-mer based)

Software (C)

http://lava.csail.mit.edu/

MICADo

Detection of mutations in targeted third-generation NGS data (can distinguish patients’ specific mutations; algorithm uses k-mers and is based on colored de Bruijn graphs)

Software (Python)

http://github.com/cbib/MICADo

General mapper

Minimap

Lightweight and fast read mapper and read overlap detector (uses the concept of “minimizers”, a special type of k-mers)

Software (C)

https://github.com/lh3/minimap

Assembly

De novo genome assembly

MHAP

Produces highly continuous assembly (fully resolved chromosome arms) from third-generation long and noisy reads (10 kbp) using a dimensionality reduction technique MinHash

Software (Java)

https://github.com/marbl/MHAP

Miniasm

Assembler of long noisy reads (SMRT, ONT) using the Overlap-Layout Consensus (OLC) approach without the necessity of an error correction stage (uses minimap)

Software (C)

https://github.com/lh3/miniasm

LINKS

Scaffolding genome assembly with error-containing long sequence (e.g., ONT or PacBio reads, draft genomes)

Software (Perl)

https://github.com/warrenlr/LINKS/

Read clustering

afcluster

Clustering of reads from different genes and different species based on k-mer counts

Software (C++)

https://github.com/luscinius/afcluster

QCluster

Clustering of reads with alignment-free measures (k-mer based) and quality values

Software (C++)

http://www.dei.unipd.it/~ciompin/main/qcluster.html

Reads error correction

Lighter

Correction of sequencing errors in raw, whole genome sequencing reads (k-mer based)

Software (C++)

https://github.com/mourisl/Lighter

QuorUM

Error corrector for Illumina reads using k-mers

Software (C++)

https://github.com/gmarcais/Quorum

Trowel

Software (C++)

https://sourceforge.net/projects/trowel-ec/

Metagenomics

Assembly-free phylogenomics

AAF

Phylogeny reconstruction directly from unassembled raw sequence data from whole genome sequencing projects; provides bootstrap support to assess uncertainty in the tree topology (k-mer based)

Software (Python)

https://github.com/fanhuan/AAF

kSNP v3

Reference-free SNP identification and estimation of phylogenetic trees using SNPs (based on k-mer analysis)

Software (C)

https://sourceforge.net/projects/ksnp/files/

NGS-MC

Phylogeny of species based on NGS reads using alignment-free sequence dissimilarity measures d2* and d2S under different Markov chain models (using k-words)

R package

http://www-rcf.usc.edu/~fsun/Programs/NGS-MC/NGS-MC.html

Species identification/taxonomic profiling

CLARK

Taxonomic classification of metagenomic reads to known bacterial genomes using k-mer search and LCA assignment

Software (C++)

http://clark.cs.ucr.edu/

FOCUS

Reports organisms present in metagenomic samples and profiles their abundances (uses composition-based approach and non-negative least squares for prediction)

Web service / Software (Python)

http://edwards.sdsu.edu/FOCUS/

GSM

Estimation of abundances of microbial genomes in metagenomic samples (k-mer based)

Software (Go)

https://github.com/pdtrang/GSM

Mash

Species identification using assembled or unassembled Illumina, PacBio, and ONT data (based on MinHash dimensionality-reduction technique)

Software (C++)

https://github.com/marbl/mash

Kraken

Taxonomic assignment in metagenome analysis by exact k-mer search; LCA assignment of short reads based on a comprehensive sequence database

Software (C++)

https://ccb.jhu.edu/software/kraken/

LMAT

Assignment of taxonomic labels to reads by k-mer searches in a precomputed database

Software (C++/Python)

https://sourceforge.net/projects/lmat/

stringMLST

k-mer-based tool for MLST directly from the genome sequencing reads

Software (Python)

http://jordan.biology.gatech.edu/page/software/stringMLST

Taxonomer

k-mer-based ultrafast metagenomics tool for assigning taxonomy to sequencing reads from clinical and environmental samples

Web service

http://taxonomer.iobio.io/

Other

d2-tools

Word-based (k-tuple) comparison (pairwise dissimilarity matrix using d2S measure) of metatranscriptomic samples from NGS reads

Software (Python/R)

https://code.google.com/p/d2-tools/

VirHostMatcher

Prediction of hosts from metagenomic viral sequences based on ONF using various distance measures (e.g., d2)

Software (C++)

https://github.com/jessieren/VirHostMatcher

MetaFast

Statistics calculation of metagenome sequences and the distances between them based on assembly using de Bruijn graphs and Bray–Curtis dissimilarity measure

Software (Java)

https://github.com/ctlab/metafast
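
Several entries above (MHAP, Mash) are described as relying on MinHash. The following Python sketch shows the general bottom-s MinHash idea for estimating the Jaccard similarity of two k-mer sets. It is an illustrative toy under stated assumptions, not the implementation used by either tool: the hash function (`blake2b` truncated to 64 bits), the default `k` and sketch `size`, and all function names are choices made here for the example.

```python
import hashlib


def kmers(seq, k):
    """Set of all overlapping k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is an assumption)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k=5, size=64):
    """Keep only the `size` smallest k-mer hashes as the sketch."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:size]


def jaccard_estimate(sketch_a, sketch_b, size=64):
    """Estimate Jaccard similarity from two bottom-s sketches.

    Take the `size` smallest hashes of the merged sketches and report
    the fraction of them that occur in both inputs.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


if __name__ == "__main__":
    a = bottom_sketch("ACGTACGTAC", k=4)
    b = bottom_sketch("ACGTACGTTT", k=4)
    print(jaccard_estimate(a, b))
```

The sketch size bounds memory and comparison time regardless of genome length, which is why this kind of dimensionality reduction suits long noisy reads and large genome collections; when a sequence has fewer distinct k-mers than `size`, as in the toy input, the estimate equals the exact Jaccard index.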