
 

Phyla Classification Using Data Mining Techniques

 

Genomic sequencing projects are generating vast stores of data that present both opportunities and challenges for data analysis. Investigating trends in codon usage has proven to be a rich area of study in this field. There are a number of methods for isolating codon usage bias, each designed to capture a specific aspect of the bias. We posit that each species has evolved under the influence of a unique set of environmental constraints that has shaped the organism's codon usage. Analysis of codon usage data should, therefore, provide insight into the selection processes that influence genomic composition. In our research, we examine the use of the codon usage bias data found in CUB-DB for classifying microbial organisms into phyla. Successful prediction indicates that the forces molding the codon usage of a given phylum/class are indeed distinctive, and that codon usage bias would be of use in understanding the evolutionary forces involved. Additionally, it supports using this method to aid in, and validate, existing taxonomic classification techniques. Initial research results, published at IEEE BIBE 2009, support our hypothesis. Research is ongoing.
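As a rough illustration of the kind of feature such a classifier might consume, the sketch below turns a coding sequence into a 64-element codon frequency profile. This is a minimal sketch under stated assumptions: it uses plain relative codon frequency, not the specific bias measures stored in CUB-DB, and the function name codon_frequencies is hypothetical rather than part of any tool described here.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
# All 64 possible codons, in a fixed order so profiles are comparable.
CODONS = ["".join(p) for p in product(BASES, repeat=3)]


def codon_frequencies(sequence):
    """Return the relative frequency of each of the 64 codons.

    Assumes `sequence` is a coding sequence already in reading frame.
    Trailing bases that do not fill a codon, and codons containing
    symbols other than A, C, G, T, are ignored.
    """
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)
    if total == 0:
        return {c: 0.0 for c in CODONS}
    return {c: counts[c] / total for c in CODONS}


if __name__ == "__main__":
    profile = codon_frequencies("ATGGCCGCCTAA")
    # Print only the codons that actually occur in this toy sequence.
    for codon, freq in profile.items():
        if freq > 0:
            print(codon, freq)
```

One profile per organism, labeled with its phylum, could then be handed to any standard classifier; which classifier and which bias measure work best is exactly the question the research addresses.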

 

A second data mining technique we are applying is the use of EMMs to assist in classification. By building one EMM for each subpattern length, a metaclassification (MCM) can be performed. Initial results are promising, and a tool, EMMSA, is being developed.