Clustering

One of the hard problems in clustering is determining the appropriate number of clusters. The problem is harder still when the data contain outliers (noise). Because existing algorithms operate in batch mode, they must be run multiple times to determine the number of clusters, which wastes a lot of time.

Online Clustering:

We propose online clustering, in which the user is shown the current clusters (the cluster profile) at each step and can adjust the parameters based on that profile. Online clustering is especially suitable when the data arrive over a network, for example from an internet search engine.
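As a rough illustration of what "a cluster profile at each step" could look like, the following minimal sketch emits a snapshot of the clusters after every incoming point. It uses a plain sequential k-means update as a stand-in, which is an assumption and not necessarily the method proposed here. The names Cluster, online_clusters and max_clusters are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Sequence


@dataclass
class Cluster:
    centroid: List[float]  # running mean of the points assigned so far
    size: int              # number of points assigned so far


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def online_clusters(points: Iterable[Sequence[float]],
                    max_clusters: int) -> Iterator[List[Cluster]]:
    """Yield a snapshot of the current clusters after each incoming point.

    Assumption: sequential k-means. The first max_clusters points seed
    the clusters; every later point joins its nearest cluster.
    """
    clusters: List[Cluster] = []
    for p in points:
        if len(clusters) < max_clusters:
            clusters.append(Cluster(list(p), 1))
        else:
            nearest = min(clusters, key=lambda c: sq_dist(c.centroid, p))
            nearest.size += 1
            # Incremental mean update: m <- m + (x - m) / n
            nearest.centroid = [m + (x - m) / nearest.size
                                for m, x in zip(nearest.centroid, p)]
        # Copy the state so that earlier snapshots are not mutated later.
        yield [Cluster(list(c.centroid), c.size) for c in clusters]


if __name__ == "__main__":
    stream = [(0.0,), (10.0,), (2.0,), (12.0,)]
    for step, profile in enumerate(online_clusters(stream, max_clusters=2), 1):
        print(step, [(c.centroid, c.size) for c in profile])
```

Because the function is a generator, a caller can inspect each profile as it arrives and stop, or restart with a different setting, without waiting for a full batch run.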

Dynamic Linear Condensation Technique:

Scaling clustering up to large databases has recently been getting attention from the database community. Existing scalable algorithms are not suitable for online clustering. We propose a dynamic linear condensation technique that lets the clustering adapt to the size of the database and of main memory while remaining online. With the condensation technique, we obtain a linear speedup while still producing high-quality clusters.
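The text above does not spell out the condensation technique, so the following is only a hypothetical sketch of the general idea of condensing data to fit a memory budget. Points are absorbed into weighted summaries within a radius, and the radius is doubled whenever the number of summaries exceeds the budget. The names Summary, condense, max_summaries and radius, as well as the radius-doubling rule itself, are assumptions for illustration.

```python
from typing import Iterable, List, Sequence


class Summary:
    """A weighted representative standing in for several nearby points."""

    def __init__(self, point: Sequence[float], weight: int = 1) -> None:
        self.mean = list(point)
        self.weight = weight

    def absorb(self, point: Sequence[float], weight: int = 1) -> None:
        """Merge a (possibly weighted) point into this summary."""
        total = self.weight + weight
        self.mean = [m + (x - m) * weight / total
                     for m, x in zip(self.mean, point)]
        self.weight = total


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _insert(summaries: List[Summary], point: Sequence[float],
            weight: int, radius: float) -> None:
    """Absorb the point into the nearest summary within radius, else add one."""
    if summaries:
        nearest = min(summaries, key=lambda s: _sq_dist(s.mean, point))
        if _sq_dist(nearest.mean, point) <= radius ** 2:
            nearest.absorb(point, weight)
            return
    summaries.append(Summary(point, weight))


def condense(points: Iterable[Sequence[float]], max_summaries: int,
             radius: float = 1.0) -> List[Summary]:
    """Condense a stream into at most max_summaries weighted summaries."""
    if max_summaries < 1 or radius <= 0:
        raise ValueError("max_summaries must be >= 1 and radius must be > 0")
    summaries: List[Summary] = []
    for p in points:
        _insert(summaries, p, 1, radius)
        # Over budget: coarsen by doubling the radius and re-merging.
        while len(summaries) > max_summaries:
            radius *= 2
            old, summaries = summaries, []
            for s in old:
                _insert(summaries, s.mean, s.weight, radius)
    return summaries
```

In this sketch each incoming point is compared against at most max_summaries representatives, so a downstream clustering step would work on a bounded number of weighted points rather than on the full database.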

How to Handle Outliers:

To handle outliers effectively, we introduce an additional parameter, the viewing number of clusters k', which is not less than the output number of clusters k. That is, k' clusters are shown to the user, who chooses k of them as the final output. Because the clustering is online, the user can adjust both parameters, k and k', based on the cluster profile. We also plan to improve this with a more effective outlier removal technique.
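The relationship between k' and k can be sketched as follows. In the approach described above the user makes the choice; the size-based ranking below is only one plausible default suggestion and is an assumption, as are the names split_profile and sizes.

```python
from typing import Dict, List, Tuple


def split_profile(sizes: Dict[str, int], k: int) -> Tuple[List[str], List[str]]:
    """Split the k' viewed clusters into k outputs and k' - k outlier candidates.

    sizes maps a cluster label to its point count (the viewed profile).
    Assumption: larger clusters are more likely to be real clusters, so
    the k largest are suggested as output and the rest as likely noise.
    """
    if not 0 <= k <= len(sizes):
        raise ValueError("k must be between 0 and k' (the number of viewed clusters)")
    ranked = sorted(sizes, key=lambda label: sizes[label], reverse=True)
    return ranked[:k], ranked[k:]


if __name__ == "__main__":
    # k' = 4 clusters are viewed; k = 2 of them are kept as output.
    profile = {"A": 120, "B": 3, "C": 95, "D": 1}
    kept, outlier_candidates = split_profile(profile, k=2)
    print(kept, outlier_candidates)
```

Showing more clusters than are finally kept gives small groups of noisy points somewhere to land, so they do not distort the k clusters that are actually reported.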