Rare Event Detection

 

Data mining can be used to detect anomalies, or rare events. An anomaly may indicate a potentially dangerous situation in a computer network or other system. Rare event mining is dominated by two types of techniques: unsupervised and supervised. Supervised techniques are classification-based: a machine-learning algorithm is trained on pre-classified data. With the data labeled as belonging to normal or rare classes, supervised algorithms aim to achieve high recall, high precision, or both. A characteristic difficulty of rare class problems is the imbalance between the amount of data in the normal classes and in the rare classes. A common solution is to apply a sampling scheme that alters the class distribution of the training data so that the imbalance is alleviated.
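As an illustration of one such sampling scheme, the sketch below performs random oversampling: rare-class records are duplicated at random until the two classes are the same size. This is only one of several possible schemes (undersampling the normal class is another), and the function name oversample_rare and its parameters are hypothetical, chosen for this example.

```python
import random


def oversample_rare(records, labels, rare_label, seed=0):
    """Return a class-balanced copy of (records, labels).

    Assumption for this sketch: there is one rare label, and every
    other label is treated as "normal".
    """
    rng = random.Random(seed)
    rare = [r for r, y in zip(records, labels) if y == rare_label]
    normal = [(r, y) for r, y in zip(records, labels) if y != rare_label]
    if not rare:
        raise ValueError("no rare-class records to sample from")

    # Keep every original record.
    balanced = normal + [(r, rare_label) for r in rare]

    # Draw rare records with replacement to close the gap, if any.
    deficit = max(len(normal) - len(rare), 0)
    balanced += [(rng.choice(rare), rare_label) for _ in range(deficit)]

    rng.shuffle(balanced)
    return [r for r, _ in balanced], [y for _, y in balanced]
```

Only the training data should be resampled this way; evaluation data should keep its original class distribution so that recall and precision reflect real conditions.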

 

Extensible Markov Models (EMMs) are well suited to modeling spatiotemporal environments and detecting rare events. They support both supervised and unsupervised detection. Although the basic EMM algorithms assume that rare events are learned in an unsupervised manner, pre-designated nodes that represent known target events can be added to the EMM. Scalability is achieved because similar real-world events are clustered into a single EMM node. In addition, nodes can be removed from the EMM or merged if desired. EMMs can identify events that are rare because of the events themselves (spatial), the time at which they occur (temporal), or unusual transitions between states. Finally, the EMM rare event detection algorithm works dynamically, in quasi-real time, as the data arrives. The time required for each execution of the rare event detection algorithm is dominated by the clustering algorithm, which in turn depends on the number of EMM states (not the number of real-world events).

 

An EMM flags a captured real-world event as rare in either of two cases: the event is not close enough to any existing node in the graph, or the transition probability from the previous node to the closest node is low. In the first case, the captured event has not occurred frequently in the past; in the second, it has not often followed the previous state. We have examined the use of EMMs for rare event detection using network VoIP traffic data as well as automobile traffic data. In these large environments, EMM achieves scalability through a distributed hierarchical approach. Anomalies in Web traffic can be examined in a hierarchically distributed fashion.
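The two detection conditions can be sketched in a few lines of code. The following is a minimal illustration, not the published EMM algorithm: it assumes events are numeric vectors, uses Euclidean distance with nearest-centroid clustering, and introduces two hypothetical parameters, dist_threshold and prob_threshold. The class name SimpleEMM and the reason strings are likewise invented for this example.

```python
import math
from collections import defaultdict


class SimpleEMM:
    """Toy online model: nodes are cluster centroids, edges are transition counts."""

    def __init__(self, dist_threshold, prob_threshold):
        self.dist_threshold = dist_threshold   # assumed: max distance to join a node
        self.prob_threshold = prob_threshold   # assumed: min "normal" transition prob.
        self.centroids = []                    # one centroid per node
        self.counts = []                       # events absorbed by each node
        self.transitions = defaultdict(int)    # (from_node, to_node) -> count
        self.out_counts = defaultdict(int)     # from_node -> total outgoing count
        self.current = None                    # node of the previous event

    def _nearest(self, event):
        # Cost is linear in the number of nodes, not in the number of events seen.
        best, best_d = None, math.inf
        for i, c in enumerate(self.centroids):
            d = math.dist(event, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def observe(self, event):
        """Process one event; return a reason string if it looks rare, else None."""
        node, d = self._nearest(event)
        reason = None
        if node is None or d > self.dist_threshold:
            # Condition 1: no existing node is close enough, so add a new one.
            self.centroids.append(list(event))
            self.counts.append(1)
            node = len(self.centroids) - 1
            reason = "new_state"
        else:
            # Absorb the event into the nearest node (running-mean centroid).
            n = self.counts[node] + 1
            self.centroids[node] = [
                c + (x - c) / n for c, x in zip(self.centroids[node], event)
            ]
            self.counts[node] = n
            if self.current is not None:
                # Condition 2: the transition previous -> nearest is improbable.
                total = self.out_counts[self.current]
                seen = self.transitions[(self.current, node)]
                p = seen / total if total else 0.0
                if p < self.prob_threshold:
                    reason = "rare_transition"
        if self.current is not None:
            self.transitions[(self.current, node)] += 1
            self.out_counts[self.current] += 1
        self.current = node
        return reason
```

Because the model starts empty, the first events in a stream are all reported as new states; in practice a warm-up period would be needed before the flags are meaningful. Transition probabilities here are simple relative frequencies of the outgoing transition counts.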

 

Risk Level Assessment

 

A traffic anomaly is an important indication of risk in computer networks. However, anomaly detection techniques based on positive security methods suffer from a high false alarm rate when a high detection rate is pursued. By using a heuristic risk assessment model, we are able to reduce the false alarm rate. The proposed operations are based solely on the synopsis of the data stream profile characterized by the EMM. Experiments conducted with VoIP CDR (Call Detail Records) data provided by Cisco Systems show that, compared with a positive security-based anomaly detection model, the proposed model significantly reduces the false alarm rate while maintaining a high detection rate.
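For readers comparing detectors, the two rates discussed here can be computed from labeled outcomes as sketched below. The text does not define the rates, so this sketch assumes the conventional definitions: detection rate is the fraction of true anomalies that are flagged, and false alarm rate is the fraction of normal events that are flagged. The function name detection_metrics is hypothetical.

```python
def detection_metrics(predicted, actual):
    """Compute rates from parallel lists of booleans (True = anomaly).

    Assumed definitions:
      detection rate   = TP / (TP + FN)
      false alarm rate = FP / (FP + TN)
    """
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)

    def ratio(num, den):
        # Report 0.0 rather than failing when a class is absent.
        return num / den if den else 0.0

    return {
        "detection_rate": ratio(tp, tp + fn),
        "false_alarm_rate": ratio(fp, fp + tn),
    }
```

Because anomalies are rare, even a small false alarm rate can produce many more false alarms than true detections in absolute terms, which is why lowering it without sacrificing the detection rate matters.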