|
COVE
(COunt Vectors and EMMs)
In the COVE project we use the
count vectors underlying a TCGR to construct a Markov
chain based classifier. These count
vectors are viewed as a signature of the associated DNA/RNA sequence. While the TCGR visualization is
informative, as the frequency of sub-patterns as well as their location of
occurrence is easy to spot, it can not be used directly in prediction of
miRNAs. Thus in this research we use
the underlying count vectors used to create the TCGRs. The data that is used in our
classification technique can be viewed as a time series of vectors where
each vector shows the counts of sub-patterns in that window.
We generate one of these time
series for each sub-pattern size being used. These time series are used to then create
a spatio-temporal Markov chain model which is rooted in the structure of
the miRNA sequences. A training
phase creates the model based on known miRNAs (or target classification
problem). The rationale is that
non-exonic genome sequence (introns and intergenic regions) can then be
subject to TCGR analysis to assess whether pre-miRNA-sized chunks of the
genome generate TCGR patterns that resemble those in the training data
set.
The Extensible Markov Model (EMM) is a very
powerful modeling tool. Our initial
research has examined several different types of systems (and thus data)
including water level/flow rate in rivers, traffic volume in roadways, and
network traffic data. The time
series view of TCGR shown above easily lends itself to be modeled by an
EMM. An EMM is a time varying Markov
Chain (MC). At any point in time,
when viewed as a static graph it is a MC.
However, over time the structure of the graph changes. Both the number of nodes in the graph as
well as the transition probabilities vary over time. Thus the EMM is both a graph and a
learning algorithm.
The TCGR count input can be
viewed as both spatial and temporal.
Spatiality is defined by the sub-patterns being examined. Temporality is based on the sliding
window. With the EMM model, at any
time t a probability of a target event E occurring at some time in the
future can be calculated. At any
time, t, we can view the input as represented by a vector of n numeric
values: Et = <S1t, S2t, …, Snt>. For the miRNA prediction problem, t is
actually the sliding window number.
Each element, Sit, contains the count vector for
sub-pattern i at time t.
The miRNA prediction problem can be viewed as predicting whether the
EMM (which was constructed using known miRNAs as training data) accurately
models a given input sequence. Given
an input sequence, the likelihood that it is modeled by an EMM can be
determined by multiplying the transition probabilities found along the path
constructed in the EMM as the input sequence is mapped (clustered) to EMM
nodes. Each node in the EMM
represents a cluster of vector values that are sufficiently
“close” together to be clustered into one EMM node. The label of each node, Ni,
may be a centroid, medoid, or other representative of this cluster.
The following figure shows the
relationships between the various components of our classification
approach. Here each EMM is used to
create a separate classifier. The
different classifiers are combines using a mete-classifcation approach, MCM.

|