Windowing, Estimation, Thresholding, and Microarray Analysis (WETAMA)

 

There are four major statistical issues to be addressed in WETAMA, led by Professor McGee: determining the window size that gives the “best” possible representation of the TCGR; estimating the probability that a given sequence is a miRNA; determining an adequate threshold for separating non-miRNA sequences from miRNA sequences; and applying these and other statistical methods to gene expression and miRNA microarray data.

Determining the window size.  We have used a window width of nine for our preliminary TCGR experiments.  This number was chosen to balance computational efficiency with obtaining reasonable results.  Denote the length of a window by N (9 in our case) and the length of the nucleotide sequence of interest by k (e.g. if we are looking for the sequences ATT or ACT, then k = 3).  A window of length N contains only N – k + 1 sequences of letters of length k (called k-words in the sequel), while there are 4^k possible k-words.  Hence, for any k > log4(N – k + 1), where log4 denotes the base-4 logarithm, some k-words must be missing from the window.  In other words, as k increases, we are less likely to see all combinations of k-words.  It is important to find a window width that is short enough for computational efficiency, yet large enough to capture all of the important combinations of k-words.  A similar problem exists in the field of density estimation, where it is necessary to capture salient features of the data without showing too much spurious detail.  This is often termed the “bias-variance trade-off” in statistics.  Experts in density estimation have devised cross-validation procedures to obtain an optimal window width.  We will investigate adapting these procedures to the problem of finding an optimal window width for the TCGR.  Further, we will investigate the effect of increasing the width of the window as k increases.
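
The counting bound on k-words can be illustrated with a short Python sketch. The function names and the example window below are hypothetical and are not part of the TCGR software; the sketch only checks, for a given window length N, the largest k for which a window could in principle contain every k-word, and lists the k-words actually missing from one window.

```python
from itertools import product


def max_complete_k(n):
    """Largest k with 4**k <= n - k + 1, i.e. k <= log4(n - k + 1).

    For any larger k, a window of length n cannot contain all 4**k k-words.
    Integer arithmetic is used to avoid floating-point logarithms.
    """
    k = 0
    while 4 ** (k + 1) <= n - (k + 1) + 1:
        k += 1
    return k


def missing_k_words(window, k, alphabet="ACGT"):
    """Return the k-words over `alphabet` that do not occur in `window`."""
    seen = {window[i:i + k] for i in range(len(window) - k + 1)}
    all_words = ("".join(w) for w in product(alphabet, repeat=k))
    return [w for w in all_words if w not in seen]


if __name__ == "__main__":
    window = "ACGTACGTA"  # hypothetical window of length N = 9
    print(max_complete_k(len(window)))      # 1
    print(len(missing_k_words(window, 2)))  # 12 of the 16 possible 2-words
```

With N = 9, only single nucleotides (k = 1) can all be present; already at k = 2 there are 16 possible words but only 8 positions, so some words are necessarily absent.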

Estimation of the probability that a given sequence is a miRNA.  In Markov chain theory, the most widely accepted approach for determining the probability of a transition from state i to state j, where i and j may or may not be contiguous, is to multiply the transition probabilities along all intermediate states.  Another way to estimate transition probabilities is through a product integral estimate, such as the Aalen-Johansen estimator.  This estimator is designed for nonhomogeneous (i.e. time-varying, as for the EMM) Markov chains in the presence of censoring.  We can view counts given by the TCGR as frequencies for DNA words of various lengths.  As the word size increases, some words will tend to “drop out” more than others.  Such sequence patterns are censored in the statistical sense.  The Aalen-Johansen estimator adjusts the transition probability for the drop-out rate of words that no longer appear.  We will apply this estimator, as well as develop others, as part of this project.
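
For orientation, the following is a minimal, generic discrete-time sketch of the Aalen-Johansen product-integral estimate in Python. It is not the project's implementation: the function names, the two-state example, and all counts are hypothetical, and how TCGR word counts would be mapped to at-risk sets is left open. At each time point the estimate multiplies by a matrix I + dA, where the off-diagonal increment for i to j is the number of observed i-to-j transitions divided by the number at risk in state i.

```python
def identity(n):
    """n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two square nested-list matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def aalen_johansen(n_states, steps):
    """Product-integral estimate of the transition matrix over all steps.

    steps: list of (at_risk, transitions) pairs, one per time point.
      at_risk[i]          = number observed in state i just before the time point
      transitions[(i, j)] = number observed moving from i to j at the time point
    Censored observations simply leave the at-risk counts between time points.
    """
    p = identity(n_states)
    for at_risk, transitions in steps:
        step = identity(n_states)  # becomes I + dA for this time point
        for (i, j), count in transitions.items():
            if i == j or at_risk[i] == 0:
                continue
            inc = count / at_risk[i]
            step[i][j] += inc
            step[i][i] -= inc
        p = matmul(p, step)
    return p


if __name__ == "__main__":
    # Hypothetical data: state 0 = word present, state 1 = word dropped out.
    # Two observations are censored between the first and second time point.
    steps = [([10, 0], {(0, 1): 2}),
             ([6, 2], {(0, 1): 3})]
    print(aalen_johansen(2, steps))
```

In this two-state example the estimate of remaining in state 0 is (1 - 2/10)(1 - 3/6), which is the familiar Kaplan-Meier form; the censored observations affect the result only through the smaller at-risk count at the second time point.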

With any estimator of transition probabilities, the fact that the resulting EMM models have varying numbers of states can have a severe impact on the resulting values.  Since each transition probability is in [0,1], longer sequences will have artificially lower probabilities than shorter sequences.  Our initial approach is to take the rth root of the value found by multiplying the r transition probabilities p_1, ..., p_r:

$$P_{\mathrm{norm}} = \left( \prod_{i=1}^{r} p_i \right)^{1/r}$$

We have successfully used this normalized probability when looking at Web usage patterns.  Another approach is to fix an integer M and, for EMMs whose number of states s satisfies s > M, calculate the transition probabilities only for chains of length M.  This method has been used successfully in modeling protein sequences.  Other approaches will be examined to determine the best way to compensate for different lengths.
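
The two length-compensation ideas above can be sketched in Python as follows. The function names are hypothetical. The rth-root normalization is computed in log space for numerical stability, and the truncated variant assumes that the first M transitions are the ones retained, which the text does not specify.

```python
import math


def path_probability(probs):
    """Unnormalized probability: product of the transition probabilities."""
    return math.prod(probs)


def normalized_probability(probs):
    """r-th root of the product of r probabilities (their geometric mean)."""
    r = len(probs)
    if r == 0:
        raise ValueError("need at least one transition probability")
    if any(p == 0 for p in probs):
        return 0.0
    # exp(mean(log p)) equals (prod p) ** (1 / r) but avoids underflow
    return math.exp(sum(math.log(p) for p in probs) / r)


def truncated_probability(probs, m):
    """Product over a chain of length m only.

    Assumption: the first m transitions are used; other choices are possible.
    """
    return math.prod(probs[:m])


if __name__ == "__main__":
    short_chain = [0.5] * 3
    long_chain = [0.5] * 10
    # Raw products penalize the longer chain ...
    print(path_probability(short_chain), path_probability(long_chain))
    # ... while the normalized values are comparable across lengths.
    print(normalized_probability(short_chain), normalized_probability(long_chain))
```

The geometric mean puts chains of different lengths on the same per-transition scale, whereas truncation to M compares only equally long prefixes.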

Thresholds for Classification Purposes.  The third statistical problem is to work with Dr. Wang to determine cutoff values to be used for classification purposes.  With any set threshold value, a given number of true positives and false positives will be generated.  The statistical problem is to maximize the number of true positives at any threshold, while minimizing the number of false positives.  Statistically speaking, we need to balance Type I and Type II errors. 
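
As an illustration of the trade-off, the Python sketch below scans candidate thresholds over a set of scored, labeled sequences and picks the one that maximizes the true positive rate minus the false positive rate (Youden's J statistic). This criterion is only one possible choice and is an assumption here, not the method fixed by the project; the function names, scores, and labels are hypothetical.

```python
def rates_at(scores, labels, threshold):
    """True and false positive rates when score >= threshold predicts miRNA."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / pos, fp / neg


def best_threshold(scores, labels):
    """Return (threshold, J) maximizing J = TPR - FPR over observed scores."""
    best = None
    for t in sorted(set(scores)):
        tpr, fpr = rates_at(scores, labels, t)
        j = tpr - fpr
        if best is None or j > best[1]:
            best = (t, j)
    return best


if __name__ == "__main__":
    # Hypothetical classifier scores; label 1 = miRNA, 0 = non-miRNA.
    scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
    labels = [0, 0, 0, 1, 1, 1]
    print(best_threshold(scores, labels))  # (0.7, 1.0)
```

Weighting the two error types unequally, for example when false positives are more costly than missed miRNAs, would change the criterion but not the scanning procedure.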

Preprocessing of Microarray Data.  A fourth statistical problem involves research into preprocessing methods for gene expression and miRNA arrays.  To obtain reliable results from high-level analyses such as classification and prediction, microarray data must be background corrected and normalized before testing for significant differential expression and before clustering or classifying the genes that are significantly differentially expressed.  We will extend earlier results on gene expression microarray and miRNA array data by examining the effect of preprocessing methods on the accuracy of the fusion classifier we have developed, and of other classifiers that are developed as a result of our collaboration.