Phyla Classification Using Data Mining Techniques
Genomic sequencing projects are generating vast stores of data that provide
opportunities and challenges in data analysis. Investigations of trends in
codon usage have proven to be a rich area of study in this field. There are
a number of methods for isolating codon usage bias, each designed to
capture a specific aspect of the bias. We posit that each species has
evolved under the influence of a unique set of environmental constraints
that has governed the shaping of the organism’s codon usage. Analysis of
codon usage data should, therefore, provide insights into the selection
process at work influencing genomic composition. In our research, we
examine the use of codon usage bias data from CUB-DB for classifying
microbial organisms into phyla.
Successful prediction is an indication that the forces molding the
codon usage of a given phylum/class are indeed distinctive, and that codon
usage analysis would be of use in understanding the evolutionary forces involved.
Additionally, it supports using this method to aid in, and validate,
existing taxonomic classification techniques. Initial research results, published at IEEE BIBE 2009, support our
hypothesis. Research continues….
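
As a rough illustration of classifying organisms from codon usage, the sketch below computes relative codon frequencies for a coding sequence and assigns a query to the nearest class centroid. It is only a minimal sketch under stated assumptions: raw codon frequencies stand in for the bias measures stored in CUB-DB, nearest-centroid stands in for whatever classifier the published work used, and the labels `phylum_A` and `phylum_B` and all sequences are made-up toy data.

```python
from collections import Counter
from itertools import product
import math

# All 64 codons in a fixed order, so every sequence maps to a comparable vector.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)

def codon_frequencies(seq):
    """Relative frequency of each codon in an in-frame coding sequence."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_SET)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def classify(vec, centroids):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))

# Hypothetical toy training data (assumption: not taken from CUB-DB).
training = {
    "phylum_A": ["GCGGCGGCCGCG", "GCCGCGGCGGCC"],
    "phylum_B": ["GCAGCTGCAGCT", "GCTGCAGCTGCA"],
}
centroids = {
    label: centroid([codon_frequencies(s) for s in seqs])
    for label, seqs in training.items()
}

predicted = classify(codon_frequencies("GCGGCCGCGGCG"), centroids)
print(predicted)
```

In practice each organism would be represented by statistics over many genes rather than a single short sequence, and the feature vector would be one of the established codon usage bias measures.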
Another data mining technique being used is the application of EMMs to
assist in classification. By building one EMM for each subpattern length,
a metaclassification (MCM) can be performed. Initial results are quite
promising, and a tool, EMMSA, is being created.
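
The idea of one model per subpattern length feeding a metaclassifier can be sketched as follows. This is not the EMM or EMMSA implementation: a simple k-mer count profile compared by cosine similarity is swapped in for each per-length model, and majority voting is assumed as the combining rule, since the text does not say how MCM combines the individual results. The class labels and sequences are hypothetical toy data.

```python
from collections import Counter
import math

def kmer_counts(seq, k):
    """Counts of all overlapping subpatterns of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[x] * b[x] for x in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(labelled_seqs, lengths):
    """Build one stand-in model per subpattern length: {k: {label: profile}}."""
    models = {}
    for k in lengths:
        models[k] = {}
        for label, seqs in labelled_seqs.items():
            profile = Counter()
            for s in seqs:
                profile.update(kmer_counts(s, k))
            models[k][label] = profile
    return models

def meta_classify(seq, models):
    """Let each per-length model vote, then return the majority label."""
    votes = Counter()
    for k, per_label in models.items():
        query = kmer_counts(seq, k)
        best = max(per_label, key=lambda label: cosine(query, per_label[label]))
        votes[best] += 1
    return votes.most_common(1)[0][0]

# Hypothetical toy data (assumption: for illustration only).
training = {
    "class_A": ["ACACACACAC", "CACACACACA"],
    "class_B": ["GTTGTTGTTG", "TTGTTGTTGT"],
}
models = train(training, lengths=(2, 3, 4))
result = meta_classify("ACACACAC", models)
print(result)
```

Keeping the per-length models separate lets each capture patterns at its own scale, so the final decision does not hinge on a single choice of subpattern length.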