|
Windowing, Estimation, Thresholding, and Microarray Analysis (WETAMA)
There are four major statistical issues to be addressed in WETAMA, which is
led by Professor McGee: determining the window size that gives the “best”
possible representation of the TCGR; estimating the probability that a
given sequence is a miRNA; determining an adequate threshold that separates
non-miRNA sequences from miRNA sequences; and applying these and other
statistical methods to gene expression and miRNA microarray data.
Determining the window size. We have used a window width of nine for our
preliminary TCGR experiments. This number was chosen to balance
computational efficiency against obtaining reasonable results. Denote the
length of a window by N (9 in our case) and the length of the nucleotide
sequence of interest by k (e.g., if we are looking for the sequences ATT or
ACT, then k = 3). A window of length N contains only N - k + 1 overlapping
sequences of letters of length k (called k-words in the sequel), while
there are 4^k possible k-words. It has been shown that, for any
k > log_4(N - k + 1), some k-words must therefore be missing from the
window. In other words, as k increases, we are less likely to see all
combinations of k-words. It is important to find a window width that is
short enough for computational efficiency, yet large enough to capture all
of the important combinations of k-words. A similar problem exists in the
field of density estimation, where it is necessary to capture salient
features of the data without showing too much spurious detail. This is
often termed the “bias-variance trade-off” in statistics. Experts in
density estimation have devised cross-validation procedures to obtain an
optimal window width. We will investigate adapting these procedures to the
problem of finding an optimal window width for the TCGR. Further, we will
investigate the effect of increasing the width of the window as k
increases.
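The counting argument behind the window-size bound can be illustrated with a minimal sketch. The function names `max_complete_k` and `missing_kwords` are hypothetical and are not part of the TCGR software; the sketch assumes the four-letter alphabet A, C, G, T and overlapping k-words.

```python
from itertools import product


def max_complete_k(window_len):
    """Largest k for which a window of this length could hold all 4**k k-words.

    A window of length N contains N - k + 1 overlapping k-words, so full
    coverage is only possible while 4**k <= N - k + 1.
    """
    k = 0
    while 4 ** (k + 1) <= window_len - (k + 1) + 1:
        k += 1
    return k


def missing_kwords(window, k):
    """Return the set of k-words over ACGT that do not occur in the window."""
    seen = {window[i:i + k] for i in range(len(window) - k + 1)}
    all_words = {"".join(w) for w in product("ACGT", repeat=k)}
    return all_words - seen


if __name__ == "__main__":
    print(max_complete_k(9))                      # window width used in the text
    print(len(missing_kwords("ACGTACGTA", 2)))    # 2-words absent from one window
```

For a window width of nine, the bound already rules out complete coverage at k = 2, since a window holds only eight overlapping 2-words while sixteen are possible.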
Estimation of the probability that a given sequence is a miRNA. In Markov
chain theory, the most widely accepted approach for determining the
probability of a transition from state i to state j, where i and j may or
may not be contiguous, is to multiply the transition probabilities along
the path through all intermediary states. Another way to estimate
transition probabilities is through a product integral estimate, such as
the Aalen-Johansen estimator. This estimator is designed for
nonhomogeneous (i.e., time-varying, as for the EMM) Markov chains in the
presence of censoring. We can view the counts given by the TCGR as
frequencies for DNA words of various lengths. As the word size increases,
some words will tend to “drop out” more than others. Such sequence
patterns are censored in the statistical sense. The Aalen-Johansen
estimator adjusts the transition probability for the drop-out rate of
words that no longer appear. We will apply this estimator, as well as
develop others, as part of this project.
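As a rough illustration of the product-integral idea, the sketch below computes the standard Aalen-Johansen estimate as a product of matrices (I + dA) over ordered event times. It is a generic sketch, not the project's implementation: the function name `aalen_johansen` and the input layout are assumptions, and how TCGR word counts would be mapped to transition counts and at-risk counts is not specified in the text.

```python
import numpy as np


def aalen_johansen(increments):
    """Aalen-Johansen estimate of the transition probability matrix.

    increments: sequence of (dN, Y) pairs ordered by time (assumed layout).
      dN[h][j] = number of observed h -> j transitions at that time (h != j)
      Y[h]     = number at risk in state h just before that time
    Censored units simply leave the at-risk counts Y without a transition.
    """
    increments = list(increments)
    n_states = len(increments[0][1])
    P = np.eye(n_states)
    for dN, Y in increments:
        dN = np.asarray(dN, dtype=float)
        Y = np.asarray(Y, dtype=float)
        dA = np.zeros_like(dN)
        at_risk = Y > 0
        # Off-diagonal hazard increments: transitions divided by number at risk.
        dA[at_risk] = dN[at_risk] / Y[at_risk][:, None]
        np.fill_diagonal(dA, 0.0)
        # Diagonal makes each row of dA sum to zero, so rows of P sum to one.
        np.fill_diagonal(dA, -dA.sum(axis=1))
        P = P @ (np.eye(n_states) + dA)
    return P


if __name__ == "__main__":
    # Two states; state 1 is absorbing. One unit is censored between the times.
    steps = [
        ([[0, 1], [0, 0]], [4, 0]),
        ([[0, 1], [0, 0]], [2, 1]),
    ]
    print(aalen_johansen(steps))
```

In this two-state example the estimate for remaining in state 0 reduces to the Kaplan-Meier product (1 - 1/4)(1 - 1/2), which shows how censoring enters only through the shrinking at-risk counts.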
With any estimator of transition probabilities, the fact that the
resulting EMM models have varying numbers of states can have a severe
impact on the resulting values. Since each transition probability lies in
[0,1], longer sequences will have artificially lower probabilities than
shorter sequences. Our initial approach is to take the rth root of the
value found by multiplying r probabilities p_1, ..., p_r:

$$\tilde{P} = \left( \prod_{i=1}^{r} p_i \right)^{1/r}$$

We have successfully used this normalized probability when looking at Web
usage patterns. Another approach is to fix an integer M and, for EMMs
whose number of states s satisfies s > M, calculate the transition
probabilities only for chains of length M. This method has been used
successfully in modeling protein sequences. Other approaches will be
examined to determine the best way to compensate for different lengths.
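The rth-root normalization described above is the geometric mean of the transition probabilities along a chain. A minimal sketch follows; the function name `normalized_path_probability` is hypothetical, and summing logarithms instead of multiplying directly is an implementation choice made here to avoid underflow on long chains.

```python
import math


def normalized_path_probability(probs):
    """Geometric mean of r transition probabilities: (p_1 * ... * p_r) ** (1/r)."""
    probs = list(probs)
    if not probs:
        raise ValueError("need at least one transition probability")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("transition probabilities must lie in [0, 1]")
    if any(p == 0.0 for p in probs):
        return 0.0  # a single impossible transition makes the whole chain impossible
    # Average the logs, then exponentiate; equivalent to taking the rth root.
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


if __name__ == "__main__":
    short_chain = [0.5] * 2
    long_chain = [0.5] * 10
    # The raw products differ by orders of magnitude; the normalized values agree.
    print(math.prod(short_chain), math.prod(long_chain))
    print(normalized_path_probability(short_chain),
          normalized_path_probability(long_chain))
```

Two chains with the same per-step probability then receive the same score regardless of length, which is the compensation the normalization is meant to provide.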
Thresholds for Classification Purposes. The third statistical problem is
to work with Dr. Wang to determine cutoff values to be used for
classification. Any fixed threshold value generates a given number of true
positives and false positives. The statistical problem is to maximize the
number of true positives at any threshold while minimizing the number of
false positives. Statistically speaking, we need to balance Type I and
Type II errors.
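The trade-off between true and false positives can be made concrete by sweeping candidate thresholds over scored sequences. The sketch below is illustrative only: the function names are hypothetical, the scores and labels are made-up toy values, and Youden's J statistic (true positive rate minus false positive rate) is one common selection criterion assumed here, not one named in the text.

```python
def confusion_at_threshold(scores, labels, threshold):
    """Count (tp, fp, fn, tn) when scores >= threshold are called miRNA (label 1)."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1  # Type I error: non-miRNA called miRNA
        elif label == 1:
            fn += 1  # Type II error: miRNA missed
        else:
            tn += 1
    return tp, fp, fn, tn


def best_threshold_youden(scores, labels):
    """Return (threshold, J) maximizing J = TPR - FPR over the observed scores."""
    best = None
    for t in sorted(set(scores)):
        tp, fp, fn, tn = confusion_at_threshold(scores, labels, t)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        j = tpr - fpr
        if best is None or j > best[1]:
            best = (t, j)
    return best


if __name__ == "__main__":
    toy_scores = [0.2, 0.3, 0.7, 0.9]   # hypothetical estimated probabilities
    toy_labels = [0, 0, 1, 1]           # 1 = miRNA, 0 = non-miRNA
    print(best_threshold_youden(toy_scores, toy_labels))
```

Other criteria, such as fixing the Type I error rate and maximizing the true positive rate subject to it, fit the same sweep by changing only the quantity being compared.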
Preprocessing of Microarray Data: A fourth statistical problem involves
research into preprocessing methods for gene expression and miRNA arrays.
In order to obtain reliable results from high-level analyses such as
classification and prediction, microarray data must be background
corrected and normalized before testing for significant differential
expression and before clustering or classifying the genes that are
significantly differentially expressed. gene expression microarray and miRNA array
data, and we will extend those results by examining the effect of
preprocessing methods on the accuracy of the fusion classifier we have
developed, and other classifiers that are developed as a result of our
collaboration.
|