COVE (COunt Vectors and EMMs)

 

In the COVE project we use the count vectors underlying a TCGR to construct a Markov chain based classifier.  These count vectors are viewed as a signature of the associated DNA/RNA sequence.  While the TCGR visualization is informative, as the frequency of sub-patterns as well as their location of occurrence is easy to spot, it can not be used directly in prediction of miRNAs.  Thus in this research we use the underlying count vectors used to create the TCGRs.  The data that is used in our classification technique can be viewed as a time series of vectors where each vector shows the counts of sub-patterns in that window. 

We generate one of these time series for each sub-pattern size being used.  These time series are used to then create a spatio-temporal Markov chain model which is rooted in the structure of the miRNA sequences.  A training phase creates the model based on known miRNAs (or target classification problem).  The rationale is that non-exonic genome sequence (introns and intergenic regions) can then be subject to TCGR analysis to assess whether pre-miRNA-sized chunks of the genome generate TCGR patterns that resemble those in the training data set. 

The Extensible Markov Model (EMM) is a very powerful modeling tool.   Our initial research has examined several different types of systems (and thus data) including water level/flow rate in rivers, traffic volume in roadways, and network traffic data.   The time series view of TCGR shown above easily lends itself to be modeled by an EMM.  An EMM is a time varying Markov Chain (MC).  At any point in time, when viewed as a static graph it is a MC.  However, over time the structure of the graph changes.  Both the number of nodes in the graph as well as the transition probabilities vary over time.  Thus the EMM is both a graph and a learning algorithm.

The TCGR count input can be viewed as both spatial and temporal.  Spatiality is defined by the sub-patterns being examined.  Temporality is based on the sliding window.  With the EMM model, at any time t a probability of a target event E occurring at some time in the future can be calculated.  At any time, t, we can view the input as represented by a vector of n numeric values: Et = <S1t, S2t, …, Snt>.  For the miRNA prediction problem, t is actually the sliding window number.  Each element, Sit, contains the count vector for sub-pattern i at time t.  The miRNA prediction problem can be viewed as predicting whether the EMM (which was constructed using known miRNAs as training data) accurately models a given input sequence.  Given an input sequence, the likelihood that it is modeled by an EMM can be determined by multiplying the transition probabilities found along the path constructed in the EMM as the input sequence is mapped (clustered) to EMM nodes.   Each node in the EMM represents a cluster of vector values that are sufficiently “close” together to be clustered into one EMM node.  The label of each node, Ni, may be a centroid, medoid, or other representative of this cluster.

The following figure shows the relationships between the various components of our classification approach.  Here each EMM is used to create a separate classifier.  The different classifiers are combines using a mete-classifcation approach, MCM.