TRACDS: Temporal Relationships Among Clusters for Massive Data Streams
State-of-the-art data stream clustering algorithms developed by the data mining community do not use the temporal order of events, so all temporal information is lost in the resulting clustering. This is surprising, since the temporal ordering of events is one of the salient features of data streams. In this project we develop a technique to efficiently incorporate temporal ordering into the clustering process and demonstrate its usefulness on large, high-throughput data streams. Temporal ordering is introduced into the data stream clustering process by dynamically constructing an evolving Markov chain in which the states represent clusters. Our approach is based on the previously developed Extensible Markov Model (EMM). The results of this project will provide a framework on which important stream mining applications, such as anomaly detection and prediction of future events, can be easily implemented.
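The idea of an evolving Markov chain over clusters can be illustrated with a minimal Python sketch. This is not the rEMM implementation or its API. It assumes a simple threshold-based nearest-center clustering with Euclidean distance, and the class name `StreamMarkovClusterer`, its `threshold` parameter, and its methods are hypothetical names chosen only for illustration.

```python
import math
from collections import defaultdict


class StreamMarkovClusterer:
    """Toy sketch (hypothetical, not rEMM): clusters double as Markov states."""

    def __init__(self, threshold):
        self.threshold = threshold            # assumed: max distance to join a cluster
        self.centers = []                     # one center per cluster / state
        self.sizes = []                       # points absorbed by each cluster
        self.transitions = defaultdict(int)   # (from_state, to_state) -> count
        self.current = None                   # state of the most recent point

    def _nearest(self, point):
        """Return (index, distance) of the closest center, or (None, inf)."""
        best, best_dist = None, math.inf
        for i, center in enumerate(self.centers):
            d = math.dist(point, center)
            if d < best_dist:
                best, best_dist = i, d
        return best, best_dist

    def update(self, point):
        """Assign one point to a cluster and record the state transition."""
        state, dist = self._nearest(point)
        if state is None or dist > self.threshold:
            # No cluster is close enough: create a new cluster (a new state).
            self.centers.append(list(point))
            self.sizes.append(1)
            state = len(self.centers) - 1
        else:
            # Move the center toward the point (running mean).
            self.sizes[state] += 1
            n = self.sizes[state]
            self.centers[state] = [
                c + (x - c) / n for c, x in zip(self.centers[state], point)
            ]
        if self.current is not None:
            # The temporal order is kept as a transition count.
            self.transitions[(self.current, state)] += 1
        self.current = state
        return state

    def transition_probability(self, src, dst):
        """Estimate P(dst | src) from the observed transition counts."""
        total = sum(c for (s, _), c in self.transitions.items() if s == src)
        return self.transitions.get((src, dst), 0) / total if total else 0.0

    def predict_next(self, src):
        """Most frequently observed successor of src, or None if unseen."""
        successors = {d: c for (s, d), c in self.transitions.items() if s == src}
        return max(successors, key=successors.get) if successors else None


# Usage: a stream that alternates between two regions of the plane.
stream = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 0.1), (5, 5.1)]
model = StreamMarkovClusterer(threshold=1.0)
states = [model.update(p) for p in stream]
```

In this sketch a transition with low estimated probability could be flagged as a possible anomaly, and `predict_next` gives a naive forecast of the next cluster. Both only hint at the applications mentioned above.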
Broader Impact. By showing that state-of-the-art data stream clustering algorithms can incorporate temporal order information efficiently, this project will have a broad impact on many areas where temporal order is essential. As examples, NOAA Hurricane Data and NASA satellite data will be used throughout this project.
Team
Matt Bolanos, Sudheer Chelluboina, Margaret H. Dunham (Co-PI), John Forrest, Michael Hahsler (Co-PI), Vladimir Jovanovic, Hadil Shaiba, Yu Su
Developed Software
- rEMM: An implementation of TRACDS as an R package (stable).
- stream: Infrastructure for data stream mining with R (beta).
- PIIH: Forecasts hurricane intensities for the Atlantic Basin in real time for the 2011 hurricane season, using the TRACDS-based PIIH model.
- QuasiAlign: Application of TRACDS for genetic sequence alignment.
Activities
- Matt Bolanos received the best undergraduate award for CSE at the Annual SMU Research Day for his work on data stream clustering (Spring 2013).
- Both REU students (Vladimir Jovanovic and John Forrest) represented SMU at the Texas Undergraduate Research Day with their work on genetic sequence classification and data stream clustering.
- Research Experience for Undergraduates (REU) project reports: J. Forrest (Fall 2010), V. Jovanovic (Fall 2010)
- We organized the StreamKDD'10 - First International Workshop on Novel Data Stream Pattern Mining Techniques to promote research in the area of data stream mining.
Media
- “Discovery: New Forecasting Algorithm Helps Predict Hurricane Intensity and Wind Speed” (Dec. 5, 2011) by the National Science Foundation.
- “Weatherwatch: Can the intensity of a hurricane be predicted?” (Oct. 12, 2011) by The Guardian.
Publications
- Anurag Nagar and Michael Hahsler. Using text and data mining techniques to extract stock market sentiment from live news streams. In 2012 International Conference on Computer Technology and Science (ICCTS 2012), August 2012.
- Charlie Isaksson, Margaret H. Dunham, and Michael Hahsler. SOStream: Self organizing density-based clustering over data stream. In International Conference on Machine Learning and Data Mining (MLDM'2012). Springer, July 2012.
- Vladimir Jovanovic, Margaret H. Dunham, Michael Hahsler, and Yu Su. Evaluating hurricane intensity prediction techniques in real time. In Third IEEE ICDM Workshop on Knowledge Discovery from Climate Data, Proceedings of the 2011 IEEE International Conference on Data Mining Workshops (ICDMW 2011). IEEE, 2011.
- John Forrest. Stream: A Framework for Data Stream Modeling in R. Bachelor Thesis, Department of Computer Science and Engineering, SMU, 2011.
- Michael Hahsler and Margaret H. Dunham. Temporal structure learning for clustering massive data streams in real-time. In SIAM Conference on Data Mining (SDM11). SIAM, 2011.
- Yu Su, Sudheer Chelluboina, Michael Hahsler, and Margaret H. Dunham. A new data mining model for hurricane intensity prediction. In 2nd IEEE ICDM Workshop on Knowledge Discovery from Climate Data, Proceedings of the 2010 IEEE International Conference on Data Mining Workshops (ICDMW 2010). IEEE, 2010.
- Margaret H. Dunham, Michael Hahsler, and Myra Spiliopoulou. Novel data stream pattern mining, Report on the StreamKDD’10 Workshop. SIGKDD Explorations, 12(2):54-55, 2010.
- Michael Hahsler and Margaret H. Dunham. rEMM: Extensible Markov Model for data stream clustering in R. Journal of Statistical Software, 35(5):1-31, 2010.
- Margaret H. Dunham, Michael Hahsler, and Myra Spiliopoulou, editors. Proceedings of the First International Workshop on Novel Data Stream Pattern Mining Techniques (StreamKDD'10). ACM Press, New York, NY, USA, 2010.
Acknowledgement of Support
This research is supported by the National Science Foundation under Grant No. IIS-0948893.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
