Data Mining in Bioinformatics

Data mining is the extraction of hidden predictive information from large databases, and it is becoming an increasingly important tool for transforming data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery. Data mining tools scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining for bioinformatics enables researchers to meet the challenge of mining vast amounts of biomolecular data to discover real knowledge. This page covers theory, algorithms and methodologies, as well as data mining technologies.

In other words, you're a bioinformatician, and data has been dumped in your lap: find the patterns, trends, answers, or whatever meaningful knowledge the data is hiding. Unfortunately, life is never that simple. In molecular biology, it is becoming more common to generate reams of data and then ask someone in bioinformatics to produce an answer. This is exploratory data analysis, one of the most difficult things to do well, especially if you're thrown in at the deep end.

Data mining commonly involves four classes of tasks:

  • Classification - Arranges the data into predefined groups. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification and neural networks.
  • Clustering - Like classification, but the groups are not predefined, so the algorithm tries to group similar items together.
  • Regression - Attempts to find a function which models the data with the least error.
  • Association rule learning - Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.

From experience, I can say that being thrown in at the deep end is one of the most frustrating positions to be in. Data mining is a huge field and can easily be bewildering for a beginner. However, high-throughput techniques in molecular biology increasingly require bioinformatics to interpret the data. Furthermore, people working in bioinformatics generally come from computer science or biology backgrounds. Data mining, however, involves statistics to one degree or another, which means entering a field that may not be your strong point.

Excel is fine for creating graphs. If you're serious about data mining, though, you'll need something more heavyweight. I use R, which is free and has good data mining packages such as vegan and labdsv. For beginners R can be impenetrable; I recommend this book as an introduction to R as well as the underlying statistics.

Any of us can rush headlong into a land of support vector machines, hidden Markov models and neural networks. But coming back to the first point, what are you trying to prove? Always question what you are doing and how it fits into the wider picture. Try to regularly review and keep track of where you are going. This will prevent you from falling into data mining despair.
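
To make the four task classes listed above a little more concrete, here is a minimal Python sketch that uses only the standard library. Everything in it is an assumption for illustration: the data is made up, and the function names (classify_nearest, cluster_kmeans, fit_line, frequent_pairs) are hypothetical rather than taken from any package. For real work you would reach for an established package instead.

```python
from collections import Counter
from itertools import combinations
from math import dist
from statistics import mean

# Toy, made-up data: (expression of gene A, expression of gene B) per sample.
labelled = [((1.0, 1.2), "healthy"), ((0.8, 1.0), "healthy"),
            ((3.1, 2.9), "tumour"), ((2.9, 3.3), "tumour")]


def classify_nearest(point, training):
    """Classification: return the label of the closest training sample (1-NN)."""
    return min(training, key=lambda item: dist(point, item[0]))[1]


def cluster_kmeans(points, centres, rounds=10):
    """Clustering: plain k-means; the groups are found, not predefined."""
    groups = []
    for _ in range(rounds):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: dist(p, centres[i]))
            groups[nearest].append(p)
        # Move each centre to the mean of its group (keep it if the group is empty).
        centres = [tuple(mean(axis) for axis in zip(*g)) if g else c
                   for g, c in zip(groups, centres)]
    return groups


def fit_line(xs, ys):
    """Regression: least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def frequent_pairs(baskets, min_support):
    """Association rules, first step: item pairs that co-occur often enough."""
    counts = Counter(pair for b in baskets
                     for pair in combinations(sorted(set(b)), 2))
    return {pair: n / len(baskets) for pair, n in counts.items()
            if n / len(baskets) >= min_support}


points = [p for p, _ in labelled]
print(classify_nearest((2.8, 3.0), labelled))
print(cluster_kmeans(points, centres=[points[0], points[2]]))
print(fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))
print(frequent_pairs([["bread", "milk"], ["bread", "milk", "eggs"],
                      ["milk", "eggs"], ["bread"]], min_support=0.5))
```

Note that classify_nearest needs the labels up front while cluster_kmeans never sees them, which is exactly the difference between classification and clustering. The frequent_pairs function covers only the counting step of market basket analysis; turning frequent pairs into rules with a confidence measure is left out to keep the sketch short.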

Data Mining Resources on the net:

A laboratory of data mining and bioinformatics headed by Prof. Ambuj Singh. There are currently seven graduate students in the research group, whose research focuses on image informatics and scalable querying and mining of graphs. For more detail visit: http://www.cs.ucsb.edu/~dbl/

Lecture notes from several past Stanford courses on data mining and/or Web mining. For detail visit: http://infolab.stanford.edu/~ullman/mining/mining.html

Statistical Data Mining Tutorial Slides by Andrew Moore: a set of tutorials on many aspects of statistical data mining, including the foundations of probability, the foundations of statistical data analysis, and most of the classic machine learning and data mining algorithms. For detail visit: http://www.autonlab.org/tutorials/

A tutorial introduction to data mining, "Discovering hidden value in your data warehouse": http://www.thearling.com/text/dmwhite/dmwhite.htm

Wikipedia: http://en.wikipedia.org/wiki/Data_mining

Bioinformatics with Clementine: http://www.spss.ch/upload/1051192224_inseratClemBio.pdf

Causal Data Mining in Bioinformatics by Ioannis Tsamardinos: http://www.forth.gr/ics/bmi/In_the_News/2007/EN69-4.pdf

Report on ACM Text Mining in Bioinformatics (TMBIO 2006): http://www.sigir.org/forum/2007J/2007j_sigirforum_song.pdf

BIOKDD 2002: Recent Advances in Data Mining for Bioinformatics: http://www.acm.org/sigs/sigkdd/explorations/issue4-2/zaki.pdf

Bioinformatics and Medical Informatics: 

Tools for Mining and Applying Genetic Information in Patient Care: http://www.biomedtechalliance.org/pdfs/03_03_05/03_03_05.pdf

Data Mining of Microarray Databases for Human Lung Cancer: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.106.385&rep=rep1&type=pdf

Towards knowledge-based gene expression data mining: http://www.ailab.si/blaz/papers/2007-JBI-BellazziZupan.pdf

Draft accepted for publication in 'Data Mining in Bioinformatics', Jason Wang, Mohammed Zaki, Hannu Toivonen, and Dennis Shasha (Eds.), Springer: http://www.cs.helsinki.fi/u/htoivone/pubs/gene_mapping_by_pattern_discovery.pdf

Data Mining and Text Mining for Bioinformatics: Proceedings of the European Workshop: http://www.rok.informatik.hu-berlin.de/wbi/research/publications/2003/proceedings_ws_mining.pdf

Biological Network Analysis:

Graph Mining in Bioinformatics: http://agbs.kyb.tuebingen.mpg.de/wikis/bg/BNA-5.pdf

Text mining in bioinformatics: http://agbs.kyb.tuebingen.mpg.de/wikis/bg/4.pdf

Some data mining books that are available on Google Books:

Data mining and bioinformatics: first international workshop, VDMB 2006, by Mehmet M. Dalkilic

Data mining: concepts and techniques, by Jiawei Han and Micheline Kamber