Annotated Bibliography on Association Rule Mining
>>
Home > Research > Association Rules > Annotated
Bibliography
Contains 98 references
Latest update: Wed Sep 14 14:50:36 2011
This page contains my annotated bibliography on Association
Rules. The annotation is kept very concise and only serves as a
guide to what paper one should read to find information on a
certain aspect of association rule mining. The bibliography is not
complete or comprehensive, but it should cover the milestones in
mainstream research efforts. The bibliography is organized as
follows:
- Mining association rules (using support
and confidence): Algorithms which directly find frequent
itemsets.
- Alternative interest measures (besides
confidence): Other measures of interest to remedy
shortcomings of the support-confidence framework.
- Mining rules without or
with a variable support threshold: Techniques which deal
with the shortcomings of support (e.g.,rare item problem).
- Constraint-based mining:
Algorithms which use additional constraints (e.g., presence of an
item or max number of rules) to reduce the search space.
- Mining
sequential, generalized, quantitative or causal rules:
Deals with special types of rules.
- Concise representations of
frequent itemsets (closed, maximal, ect.): Focuses on
closed and maximal frequent itemsets which are typically by orders
of magnitude fewer itemsets than all frequent itemsets. However,
all frequent itemsets can be induced from these itemsets and thus
algorithms mining closed and maximal frequent itemsets are often
more efficient.
- Using association rules for
classification: Association rules are used to discover
classification rules or to build classifiers.
- Evolution of association rules over
time: Methods to find changes in data by analyzing changes
in association rules found in the data (typically by splitting the
data set into several parts and comparing the association rules
found in the individual parts).
- Theoretic
considerations and sampling: Discusses theoretic issues of
rule mining (e.g., properties of lattices and measures, sampling,
clustering).
- Evaluation and
efficient implementation of rule mining algorithms: Deals
with the comparison of algorithms, generation of synthetic data and
the implementation issues of mining algorithms.
Please, contact me for suggestions, corrections or questions.
Mining association rules (using support and confidence)
- [AIS93]
-
R. Agrawal, T. Imielinski, and A. Swami. Mining
association rules between sets of items in large databases. In
Proceedings of the ACM SIGMOD International Conference on
Management of Data, pages 207-216, Washington D.C., May 1993.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces association rules and the
SUPPORT-CONFIDENCE framework and an algorithm to mine large
itemsets. The algorithm is sometimes called AIS after the authors
initials.
- [MTV94]
-
Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo.
Efficient algorithms for discovering association rules. In
Usama M. Fayyad and Ramasamy Uthurusamy, editors, AAAI
Workshop on Knowledge Discovery in Databases (KDD-94), pages
181-192, Seattle, Washington, 1994. AAAI Press. [ bib |
find paper in Google Scholar |
find paper in Google ]
Develop similar improvements to the candidate
generation as APRIORI. Itemsets with support are called covering
sets. The paper also introduces sampling from the database and
gives bounds for the resulting estimate of support.
- [AS94]
-
Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining
association rules in large databases. In Jorge B. Bocca,
Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the
20th International Conference on Very Large Data Bases, VLDB,
pages 487-499, Santiago, Chile, September 1994. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduction of the APRIORI algorithm (the
best-known algorithm; it uses a breadth-first search strategy to
counting the support of itemsets). The algorithm uses an improved
candidate generation function which exploits the downward closure
property of support and makes it more efficient than AIS. Also an
algorithm to generate synthetic transaction data is presented. Such
synthetic transaction data are widely used for the evaluation and
comparison of new algorithms.
- [SON95]
-
Ashok Savasere, Edward Omiecinski, and Shamkant Navathe. An
efficient algorithm for mining association rules in large
databases. In Proceedings of the 21st VLDB Conference,
pages 432-443, Zurich, Switzerland, 1995. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduction of the PARTITION algorithm. The
database is scanned only twice. For the first scan the DB is
partitioned and in each partition support is counted. Then the
counts are merged to generate potential large itemsets. In the
second scan the potential large itemsets are counted to find the
actual large itemsets.
- [Toi96]
-
Hannu Toivonen. Sampling large databases for association rules. In
VLDB '96: Proceedings of the 22th International Conference on
Very Large Data Bases, pages 134-145, San Francisco, CA, USA,
1996. Morgan Kaufmann Publishers Inc. [ bib |
find paper in Google Scholar |
find paper in Google ]
Find frequent itemsets in a random sample of a
database (that fits into main memory) and then verify the found
frequent itemsets in the database.
- [HGN00]
-
Jochen Hipp, Ulrich Güntzer, and Gholamreza Nakhaeizadeh.
Algorithms for association rule mining - A general survey and
comparison. SIGKDD Explorations, 2(2):1-58, 2000.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Describes the fundamentals of association rule
mining and presents an systematization of existing
algorithms.
- [Zak00]
-
Mohammed J. Zaki. Scalable algorithms for association mining.
IEEE Transactions on Knowledge and Data Engineering,
12(3):372-390, May/June 2000. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces six new algorithms combining several
features (database format, the decomposition technique, and the
search procedure). Includes Eclat (Equivalence CLAss
Transformation), MaxEclat, Clique, MaxClique, TopDown, and
AprClique. ECLAT is a well known depth-first search algorithm using
set intersection.
- [OLP+03]
-
Salvatore Orlando, Claudio Lucchese, Paolo Palmerini, Raffaele
Perego, and Fabrizio Silvestri. kdci: a multi-strategy algorithm
for mining frequent sets. In Bart Goethals and Mohammed J.
Zaki, editors, FIMI'03: Proceedings of the IEEE ICDM Workshop
on Frequent Itemset Mining Implementations, November 2003.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces the kDCI algorithm.
- [HPYM04]
-
Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. Mining frequent
patterns without candidate generation. Data Mining and
Knowledge Discovery, 8:53-87, 2004. [ bib |
find paper in Google Scholar |
find paper in Google ]
Describes the data mining method FP-growth
(frequent pattern growth) which uses an extended prefix-tree
(FP-tree) structure to store the database in a compressed form.
FP-growth adopts a divide-and-conquer approach to decompose both
the mining tasks and the databases. It uses a pattern fragment
growth method to avoid the costly process of candidate generation
and testing.
- [CGL04]
-
Frans Coenen, Graham Goulbourne, and Paul Leng. Tree structures for
mining association rules. Data Mining and Knowledge
Discovery, 8:25-51, 2004. [ bib |
find paper in Google Scholar |
find paper in Google ]
Describes how to compute PARTIAL SUPPORT COUNTS in
one DB-pass and how to store them in an enumeration tree
(P-Tree).
- [HCXY07]
-
J. Han, H. Cheng, D. Xin, and X. Yan. Frequent
pattern mining: Current status and future directions. Data
Mining and Knowledge Discovery, 14(1), 2007. [ bib |
find paper in Google Scholar |
find paper in Google ]
Complete overview of the state-of-the art in
frequent patten mining and identifies future research
directions.
Alternative interest measures (besides confidence)
- [PS91]
-
G. Piatetsky-Shapiro. Discovery, analysis, and presentation of
strong rules. In G. Piatetsky-Shapiro and W.J. Frawley,
editors, Knowledge Discovery in Databases. AAAI/MIT Press,
Cambridge, MA, 1991. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces the measure LEVERAGE which is the
simplest function which satisfies his principles for rule-interest
functions (0 if the variables are statistically independent;
monotonically increasing if the variables occur more often
together; monotonically decreasing if one of the variables alone
occurs more often).
- [BMUT97]
-
Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom
Tsur. Dynamic itemset counting and implication rules for market
basket data. In SIGMOD 1997, Proceedings ACM SIGMOD
International Conference on Management of Data, pages 255-264,
Tucson, Arizona, USA, May 1997. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces CONVICTION (as an improvement to
confidence based on implication rules) and INTEREST (later called
LIFT).
- [AY98]
-
C. C. Aggarwal and P. S. Yu. A new framework for itemset
generation. In PODS 98, Symposium on Principles of Database
Systems, pages 18-24, Seattle, WA, USA, 1998. [ bib |
find paper in Google Scholar |
find paper in Google ]
Points out weaknesses of the large frequent itemset
method using support (spuriousness, dense datasets) and that lift
gives only values close to one for items which are very frequent,
even if they are perfectly positive correlated. COLLECTIVE STRENGTH
is introduced. Collective strength uses the violation rate for an
itemset which is the fraction of transactions which contains some,
but not all items of the itemset. The violation rate is compared to
the expected violation rate under independence. Collective strength
is downward closed.
- [LHM99]
-
Bing Liu, Wynne Hsu, and Yiming Ma. Pruning and summarizing the
discovered associations. In Proceedings of the fifth ACM SIGKDD
international conference on Knowledge discovery and data mining
(KDD-99), pages 125-134. ACM Press, 1999. [ bib |
find paper in Google Scholar |
find paper in Google ]
Remove insignificant rules using the chi-square
test to test for correlation between the antecedent and the
confident of a rule. Also DIRECTION SETTING (DS) RULES are
introduced. A DS rule has a pos. correlated antecedent and
consequent and is not built from a rule with a shorter antecedent
which is a DS rule. Normally, only a small and concise fraction of
rules are DS rules.
- [BD01]
-
Dario Bruzzese and Cristina Davino. Pruning of discovered
association rules. Computational Statistics, 16:387-398,
2001. [ bib |
find paper in Google Scholar |
find paper in Google ]
The authors construct several statistical tests to
evaluate the significance of discovered associations.
- [BH03]
-
Brock Barber and Howard J. Hamilton. Extracting share frequent
itemsets with infrequent subsets. Data Mining and Knowledge
Discovery, 7:153-185, 2003. [ bib |
find paper in Google Scholar |
find paper in Google ]
ITEMSET SHARE is the fraction of some measure
(e.g., sales, profit) contributed by the items in the set. A
itemset is share frequent if it exceeds a threshold. Share
frequency is not downward closed! The article presents several
algorithms and heuristics to mine share frequent itemsets.
- [TKS04]
-
Pang-Ning Tan, Vipin Kumar, and Jaideep Srivastava. Selecting the
right objective measure for association analysis. Information
Systems, 29(4):293-313, 2004. [ bib |
find paper in Google Scholar |
find paper in Google ]
Compare the properties of 21 objective measures (of
interest). The measures in general lack to agree with each other.
However, the authors show that if support-based pruning or table
standardization (of the contingency tables) is used, the measures
become highly correlated.
- [Sch05]
-
Tobias Scheffer. Finding association rules that trade support
optimally against confidence. Intelligent Data Analysis,
9(4):381-395, 2005. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces predictive accuracy which is the
expected value of the confidence of a rules with respect to the
process underlying the database. The author shows how predictive
accuracy can be calculated from confidence and support measured on
a data set using a Bayesian frequency correction (very simplified:
confidence is discounted for rules with low supports). Also an
algorithm is presented which finds the top n most predictive
association rules (redundant rules with a 0 predictive accuracy
improvement are removed) and shows how to estimate the prior
distribution needed for the correction.
- [BGBG05]
-
Julien Blanchard, Fabrice Guillet, Henri Briand, and Regis Gras.
Assessing rule interestingness with a probabilistic measure of
deviation from equilibrium. In Proceedings of the 11th
international symposium on Applied Stochastic Models and Data
Analysis ASMDA-2005, pages 191-200. ENST, 2005.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Presents a statistical test for the deviation from
the equilibrium of a rule. The equilibrium for rule a -> b is
defined as: the number of transactions which contain a and b
together is equal to the number of transactions which contain a and
not b.
- [GH06]
- Liqiang Geng and Howard J. Hamilton. Interestingness
measures for data mining: A survey. ACM Computing Surveys,
38(3):9, 2006. [ bib |
find paper in Google Scholar |
find paper in Google ]
- [Li06]
-
Jiuyong Li. On optimal rule discovery. IEEE Transactions on
Knowledge and Data Engineering, 18(4):460-471, 2006.
[ bib |
find paper in Google Scholar | find
paper in Google ]
An optimal rule set (with respect to a metric of
interestingness) contains all rules except those with no greater
interestingness than one of its more general rules. An optimal rule
set is a subset of a nonredundant rule set. The autors present an
algorithm called ORD to find an optimal rule set. Classifiers build
on optimal class association rules are at least as accurate as
those built from CBA and C4.5 rule.
- [HH07]
-
Michael Hahsler and Kurt Hornik. New probabilistic interest
measures for association rules. Intelligent Data Analysis,
11(5):437-455, 2007. [ bib |
find paper in Google Scholar |
find paper in Google ]
Presents a simple probabilistic framework for
transaction data which can be used to simulate transaction data
when no associations are present. Uses such data and a real-world
grocery database to explore the behavior of confidence and lift,
two popular interest measures used for rule mining. Also introduces
the new probabilistic measures hyper-lift and
hyper-confidence.
Mining rules without or with a variable minimum support
threshold
- [BMS97]
-
Sergey Brin, Rajeev Motwani, and Craig Silverstein. Beyond market
baskets: Generalizing association rules to correlations. In
SIGMOD 1997, Proceedings ACM SIGMOD International Conference on
Management of Data, pages 265-276, Tucson, Arizona, USA, May
1997. [ bib |
find paper in Google Scholar |
find paper in Google ]
Proposes to use the chi-square test for
correlation. For an itemset of length l, the test is carried out on
a l-dimensional contingency tables. A problem is cells with low
counts and multiple tests.
- [SBM98]
-
Craig Silverstein, Sergey Brin, and Rajeev Motwani. Beyond market
baskets: Generalizing association rules to dependence rules.
Data Mining and Knowledge Discovery, 2:39-68, 1998.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Journal version of Brin et al. (1997).
- [LHM99]
-
Bing Liu, Wynne Hsu, and Yiming Ma. Mining association rules with
multiple minimum supports. In Proceedings of the fifth ACM
SIGKDD international conference on Knowledge discovery and data
mining (KDD-99), pages 337-341. ACM Press, 1999.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Adapts APRIORI to work with different minimum
support thresholds assigned to different items (minimum item
supports, MIS). To preserve the downward closure property of
support item sorting using the MIS values is used.
- [LZD+99]
-
Jinyan Li, Xiuzhen Zhang, Guozho Dong, Kotagiri Ramamohanarao, and
Qun Sun. Efficient mining of high confidence association rules
without support thresholds. In J. Zytkow and J. Rauch,
editors, Principles of Data Mining and Knowledge Discovery
PKDD'99, LNAI 1704, Prague, Czech Republic, pages 406-411.
Springer-Verlag, 1999. [ bib |
find paper in Google Scholar |
find paper in Google ]
This paper used JUMPING EMERGING PATTERNS to mine a
border for top rules (rules with 100% confidence) for a given
consequent. The drawbacks are that only one consequent is mined at
a time and that finding rules with other than 100% confidence is
difficult.
- [AEMT00]
-
Khalil M. Ahmed, Nagwa M. El-Makky, and Yousry Taha. A
note on ”Beyond market baskets: Generalizing association rules to
correlations”. SIGKDD Explorations, 1(2):46-48, 2000.
[ bib |
find paper in Google Scholar |
find paper in Google ]
A reply to Brin et al. (1997). The authors state
that the chi-square test tests the whole contingency table, but for
larger than 2x2 tables we want to test dependence for single
cells.
- [WHC01]
-
Ke Wang, Yu He, and David W. Cheung. Mining
confident rules without support requirement. In Proceedings of
the tenth international conference on Information and knowledge
management, pages 89 - 96, New York, NY, 2001. ACM Press.
[ bib |
find paper in Google Scholar |
find paper in Google ]
The paper shows that for data with categorical
attributes a UNIVERSAL-EXISTENTIAL UPWARD CLOSURE exists for
confidence. With this property algorithms with confidence-based
pruning are possible that use a level-wise (from k to k-1)
candidate generation are. The paper also discusses a disk-based
implementation.
- [SK01]
-
Masakazu Seno and George Karypis. Lpminer: An algorithm for finding
frequent itemsets using length decreasing support constraint. In
Nick Cercone, Tsau Young Lin, and Xindong Wu, editors,
Proceedings of the 2001 IEEE International Conference on Data
Mining, 29 November - 2 December 2001, San Jose, California,
USA, pages 505-512. IEEE Computer Society, 2001.
[ bib |
find paper in Google Scholar |
find paper in Google ]
To find longer frequent itemsets, the minimal
support requirement decreases as a function of the itemset length.
A algorithm based on the FP-tree is presented and a property called
small valid extension (SVE) is introduced which makes mining
efficient in absence of downward closure.
- [DP01]
-
William DuMouchel and Daryl Pregibon. Empirical Bayes screening for
multi-item associations. In F. Provost and R. Srikant,
editors, Proceedings of the ACM SIGKDD Intentional Conference
on Knowledge Discovery in Databases and Data Mining (KDD-01),
pages 67-76. ACM Press, 2001. [ bib |
find paper in Google Scholar |
find paper in Google ]
Search for unusually frequent itemsets using
statistical methods. First, the authors propose stratification of
the data to avoid finding spurious associations within strata. Then
the deviation of the observed frequency over a baseline frequency
(based on independence) is used. Since the deviation is unreliable
for low counts, an empirical Bayes model (its 95% confidence limit)
is used to produce a posterior distribution of the true ratio of
actual to baseline frequencies. The Bayes model gives ratios close
to the observed ratios for large samples and reduces (shrinks) the
ratio if the sample size gets small (to smooth away noise). For
multi-item associations log-linear models are proposed to find
higher order associations which cannot be explained by pairwise
associations.
- [CDF+01]
-
Edith Cohen, Mayur Datar, Shinji Fujiwara, Aristides Gionis, Piotr
Indyk, Rajeev Motwani, Jeffrey D. Ullman, and Cheng Yang.
Finding interesting associations without support pruning. IEEE
Transactions on Knowledge and Data Engineering, 13(1):64-78,
2001. [ bib |
find paper in Google Scholar |
find paper in Google ]
Uses similarity measures between hashed values of
rows in a transaction database. The approach in the paper was only
shown for associations between two items.
- [TMF03]
-
Feng Tao, Fionn Murtagh, and Mohsen Farid. Weighted association
rule mining using weighted support and significance framework. In
Proceedings of The Ninth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-2003), Washington,
DC, 2003. ACM Press. [ bib |
find paper in Google Scholar |
find paper in Google ]
Uses attributes of the items (e.g., price, page
dwelling time) to WEIGHT SUPPORT. A support and significance
framework is presented which possesses a weighted downward closure
property important for pruning the search space.
- [Omi03]
-
Edward R. Omiecinski. Alternative interest measures for mining
associations in databases. IEEE Transactions on Knowledge and
Data Engineering, 15(1):57-69, Jan/Feb 2003. [ bib |
find paper in Google Scholar |
find paper in Google ]
Omiecinski introduced several alternatives to
support. The first measure, ANY-CONFIDENCE, is defined as the
confidence of the rule with the largest confidence which can be
generated from an itemset. The author states that although finding
all itemsets with a set any-confidence would enable us to find all
rules with a given minimum confidence, any-confidence cannot be
used efficiently as a measure of interestingness since confidence
is not downward closed. The second introduced measure is
ALL-CONFIDENCE. This measure is defined as the smallest confidence
of all rules which can be produced from an itemset, i.e., all rules
produced form an itemset will have a confidence greater or equal to
its all-confidence value. BOND, the last measure, is defined as the
ratio of the number of transactions which contain all items of an
itemset to the number of transactions which contain at least one of
these items. Omiecinski showed that bond and all-confidence are
downward closed and, therefore, can be used for efficient mining
algorithms.
- [XTK03]
-
Hui Xiong, Pang-Ning Tan, and Vipin Kumar. Mining strong affinity
association patterns in data sets with skewed support distribution.
In Bart Goethals and Mohammed J. Zaki, editors,
Proceedings of the IEEE International Conference on Data
Mining, November 19-22, 2003, Melbourne, Florida, pages
387-394, November 2003. [ bib |
find paper in Google Scholar |
find paper in Google ]
Support-based pruning strategies are not effective
for data sets with skewed support distributions. The authors
propose the concept of hyperclique pattern, which uses an objective
measure called h-confidence (equal to all-confidence by Omiecinski,
2003) to identify strong affinity patterns. The generation of
so-called cross-support patterns (patterns with items with
substantially different support) is avoided by h-confidence's
cross-support property.
- [SK05]
-
Masakazu Seno and George Karypis. Finding frequent itemsets using
length-decreasing support constraint. Data Mining and Knowledge
Discovery, 10:197-228, 2005. [ bib |
find paper in Google Scholar |
find paper in Google ]
See Seno and Karypis 2001.
- [Hah06]
-
Michael Hahsler. A model-based frequency constraint for mining
associations from transaction data. Data Mining and Knowledge
Discovery, 13(2):137-166, September 2006. [ bib |
DOI |
find paper in Google Scholar |
find paper in Google ]
Develops a novel model-based frequency constraint
as an alternative to a single, user-specified minimum support. The
constraint utilizes knowledge of the process generating transaction
data by applying a simple stochastic mixture model (the NB model)
and uses a user-specified precision threshold to find local
frequency thresholds for groups of itemsets (NB-frequent itemsets).
The new constraint provides improvements over a single minimum
support threshold and that the precision threshold is more robust
and easier to set and interpret by the user.
Constraint-based mining
- [KMR+94]
-
Mika Klemettinen, Heikki Mannila, Pirjo Ronkainen, Hannu Toivonen,
and A. Inkeri Verkamo. Finding interesting rules from large
sets of discovered association rules. In Nabil R. Adam,
Bharat K. Bhargava, and Yelena Yesha, editors, Third
International Conference on Information and Knowledge Management
(CIKM'94), pages 401-407. ACM Press, 1994. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduce the usage of rule templates.
- [SVA97]
-
Ramakrishnan Srikant, Quoc Vu, and Rakesh Agrawal. Mining
association rules with item constraints. In David Heckerman, Heikki
Mannila, Daryl Pregibon, and Ramasamy Uthurusamy, editors,
Proceedings of the 3rd International Conference Knowledge
Discovery and Data Mining (KDD-97), pages 67-73. AAAI Press,
1997. [ bib |
find paper in Google Scholar |
find paper in Google ]
Integrates BOOLEAN CONSTRAINTS on items (absence,
presence) into the mining algorithm to reduce the search space.
Algorithms are discussed.
- [NLHP98]
-
Raymond T. Ng, Laks V.S. Lakshmanan, Jiawei Han, and Alex
Pang. Exploratory mining and pruning optimizations of constrained
associations rules. In Proceedings of the ACM SIGMOD
Conference, Seattle, WA, pages 13-24, 1998. [ bib |
find paper in Google Scholar |
find paper in Google ]
Characterizes various constraints (contains,
minimum, maximum, count, sum, avg) according to anti-monotonicity
and succinctness. Anti-monotonicity is the property which allows
iterative pruning (generate and test candidates) used e.g., on
support by Apriori. Succinctness is a property that enables us to
generate only those itemsets which satisfy the constraint without
the need to test them.
- [BAG00]
-
R. Bayardo, R. Agrawal, and D. Gunopulos.
Constraint-based rule mining in large, dense databases. Data
Mining and Knowledge Discovery, 4(2/3):217-240, 2000.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces the MINIMUM IMPROVEMENT constraint for
confidence (mine only rules with a confidence which is minimp
greater than the confidence of any of its proper subset-rules).
DenseMiner, an algorithm that enforces minimum support, minimum
confidence and minimum improvement already during a breadth-first
search for all rules for a given consequent C is presented.
- [PHL01]
-
Jian Pei, Jiawei Han, and Laks V.S. Lakshmanan. Mining
frequent itemsets with convertible constraints. In Proceedings
of the 17th International Conference on Data Engineering, April
2-6, 2001, Heidelberg, Germany, pages 433-442, 2001.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Develops a technique of how constraints on avg,
median and sum can be converted so that they can be used already
during the search phase of the FP-growth algorithm. The constraints
are classified into constraints that are: convertible
anti-monotone, convertible monotone and strongly
convertible.
- [WZ05]
-
Geoffrey I. Webb and Songmao S. Zhang.
k-optimal-rule-discovery. Data Mining and Knowledge
Discovery, 10(1):39-79, 2005. [ bib |
find paper in Google Scholar | find
paper in Google ]
Develops GRD (based on the OPUS search strategy)
which discovers all rules satisfying a set of constraints (max.
number of rules, min support, min confidence, max coverage, max
leverage) in a depth-first search. (An early draft of the paper was
called: Beyond association rules: Generalized rule
discovery)
- [BGMP05]
-
Francesco Bonchi, Fosca Giannotti, Alessio Mazzanti, and Dino
Pedreschi. ExAnte: A preprocessing method for frequent-pattern
mining. IEEE Intelligent Systems, 20(3):25-31, 2005.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Reduces the database size before mining by
iteratively applying mu-reduction and alpha-reduction. Mu-reduction
removes transactions which do not meet monotone constraints.
Alpha-reduction remove infrequent items from the
transactions.
- [BL06]
-
Francesco Bonchi and Claudio Lucchese. On condensed representations
of constrained frequent patterns. Knowledge and Information
Systems, 9(2):180-201, 2006. [ bib |
find paper in Google Scholar |
find paper in Google ]
Presents an algorithm to efficiently mine closed
and constrained frequent itemsets.
Mining sequential, generalized, quantitative or causal
rules
- [SA95]
-
Ramakrishnan Srikant and Rakesh Agrawal. Mining generalized
association rules. In Proceedings of the 21st VLDB Conference,
Zurich, Switzerland, 1995. [ bib |
find paper in Google Scholar |
find paper in Google ]
Generalized association rules use a taxonomy (is-a
hierarchy) on items. The paper introduces R-interesting rules as
rules with a support which is R-times higher than the support of
its closest ancestor (a rule with at leased on item generalized).
Algorithms that use R-interesting in addition to support and
confidence are presented and evaluated.
- [AS95]
-
Rakesh Agrawal and Ramakrishnan Srikant. Mining sequential
patterns. In Philip S. Yu and Arbee S. P. Chen, editors,
Eleventh International Conference on Data Engineering,
pages 3-14, Taipei, Taiwan, 1995. IEEE Computer Society Press.
[ bib |
find paper in Google Scholar | find
paper in Google ]
Introduces mining sequential patterns. A sequential
pattern is a maximal sequence that exceeds minimum support (a
minimum number of customers). The algorithms AprioriSome and
AprioryAll (based on Apriori) are presented.
- [MTV97]
- Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo.
Discovery of frequent episodes in event sequences. Data Mining
and Knowledge Discovery, 1(3):259-289, 1997. [ bib |
find paper in Google Scholar |
find paper in Google ]
- [LSW97]
-
Brian Lent, Arun N. Swami, and Jennifer Widom. Clustering
association rules. In Proceedings of the Thirteenth
International Conference on Data Engineering, April 7-11, 1997
Birmingham U.K., pages 220-231. IEEE Computer Society, 1997.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Join adjacent intervals for quantitative
association rules to produce more general rules.
- [SBMU00]
-
Craig Silverstein, Sergey Brin, Rajeev Motwani, and Jeffrey D.
Ullman. Scalable techniques for mining causal structures. Data
Mining and Knowledge Discovery, 4(2/3):163-192, 2000.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Explores the applicability of constraint-based
causal discovery (known from Bayesian learning) to discover causal
relationships in market basket data.
- [Ada01]
-
Jean-Marc Adamo. Data Mining for Association Rules and
Sequential Patterns. Springer, New York, 2001. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduction to association rules and mining
sequential patterns.
- [LNWJ01]
-
Yingjiu Li, Peng Ning, X. Sean Wang, and Sushil Jajodia.
Generating market basket data with temporal information. In ACM
KDD Workshop on Temporal Data Mining, August 2001.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Develop a generator for synthetic data with
temporal patterns based on the generator by Agrawal and Srikan
(1994).
- [AL03]
-
Y. Aumann and Y. Lindell. Statistical theory for
quantitative association rules. Journal of Intelligent
Information Systems, 20(3):255-283, 2003. [ bib |
find paper in Google Scholar |
find paper in Google ]
Defines QUANTITATIVE ASSOCIATION RULES using
statistical measures (e.g., mean and variance) of continuous data.
Also algorithms are discussed.
- [HCXY07]
-
J. Han, H. Cheng, D. Xin, and X. Yan. Frequent
pattern mining: Current status and future directions. Data
Mining and Knowledge Discovery, 14(1), 2007. [ bib |
find paper in Google Scholar |
find paper in Google ]
Complete overview of the state-of-the art in
frequent patten mining and identifies future research
directions.
Concise representations of frequent itemsets (closed, maximal,
etc.)
- [MT96]
-
Heikki Mannila and Hannu Toivonen. Multiple uses of frequent sets
and condensed representations. In Proceedings of the Second
International Conference on Knowledge Discovery and Data Mining
(KDD-96), pages 189-194. AAAI Press, 1996. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces general rules with disjunctions and
negations in the antecedent and the consequent. The confidence of
any such rules can be approximated by using the support of frequent
itemsets only (applying the inclusion-exclusion principle). Using
the negative border, an error bound for the estimates can be
calculated. The authors also show that frequent itemsets with a
support of epsilon are a concise representation (epsilon-adequate
representation) which can approximate the confidence of any itemset
with an error of at most epsilon.
- [ZPOL97]
-
Mohammed J. Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara,
and Wei Li. New algorithms for fast discovery of association rules.
Technical Report 651, Computer Science Department, University of
Rochester, Rochester, NY 14627, July 1997. [ bib |
find paper in Google Scholar |
find paper in Google ]
Quickly identify MAXIMAL FREQUENT ITEMSETS (a
frequent itemset is maximal if it is no proper subset of any other
frequent itemset) using different database layout schemes (regular,
inverted) and clustering techniques (equivalence class ECLAT, max.
clique). See also Zaki 2000, Scalable Algorithms for Association
Mining.
- [PBTL99b]
-
Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal.
Efficient mining of association rules using closed itemset
lattices. Information Systems, 24(1):25-46, 1999.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Present the CLOSE algorithm to mine frequent closed
itemsets.
- [PBTL99a]
-
Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal.
Discovering frequent closed itemsets for association rules. In
Proceeding of the 7th International Conference on Database
Theory, Lecture Notes In Computer Science (LNCS 1540), pages
398-416. Springer, 1999. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces CLOSED ITEMSETS. An itemset X is closed
if no proper super set of X is contained in every transaction in
which X is contained. Which means there exists no super set of X
with the same support count as X.
- [PHM00]
-
Jian Pei, Jiawei Han, and Runying Mao. CLOSET: An efficient
algorithm for mining frequent closed itemsets. In ACM SIGMOD
Workshop on Research Issues in Data Mining and Knowledge
Discovery, 2000. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces the algorithm CLOSET which mines
frequent closed itemsets using FP-growth (a depth-first search
using support counting).
- [BTP+00]
-
Y. Bastide, R. Taouil, N. Pasquier, G. Stumme,
and L. Lakhai. Mining frequent patterns with counting
inference. SIGKDD Explorations, 2(2):66-75, 2000.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Proposes the algorithm PASCAL (a APRIORI
optimization) to mine closed and frequent items. This approach uses
frequent key-patterns to infer counts of frequent non-key
patterns.
- [AAP00]
-
Rakesh C. Agrawal, Charu C. Aggarwal, and
V. V. V. Prasad. Depth first generation of long patterns.
In Proceedings of the ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-2000), pages 108-118,
2000. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces the algorithm DepthProject which builds
a lexicographic tree in a depth first order.
- [BCG01]
-
Douglas Burdick, Manuel Calimlim, and Johannes Gehrke. MAFIA: A
maximal frequent itemset algorithm for transactional databases. In
Proceedings of the 17th International Conference on Data
Engineering, pages 443-452, Washington, DC, 2001. IEEE
Computer Society. [ bib |
find paper in Google Scholar |
find paper in Google ]
MAFIA (MAximal Frequent Itemset Algorithm) finds
maximal itemsets using a depth-first traversal of the itemset
lattice, a compressed vertical bitmap representation of the
database, additional pruning techniques (Parent Equivalence
Pruning, Frequent Head Union Tail pruning) and dynamic reordering.
The authors claim that MAFIA outperforms DepthProject (Agrawal et
al., 2001) by a factor of 3 to 5 on average.
- [ZH02]
-
Mohammed J. Zaki and Ching-Jiu Hsiao. CHARM: An efficient
algorithm for closed itemset mining. In Proceedings of the
Second SIAM International Conference on Data Mining,
Arlington, VA, 2002. SIAM. [ bib |
find paper in Google Scholar |
find paper in Google ]
The algorithm CHARM enumerates all frequent closed
itemsets and uses a number of improvements: (a) It uses a IT-tree
(itemset-tidset tree based on equivalence classes) to search
simultaneously the itemset space and the transaction space. (b) It
uses a fast hash-based elimination of non-closed itemsets. (c) It
uses diffsets which represents the database in a compact way which
should fit into main memory. (d) It uses efficient intersection
operations. The performance testing shows that CHARM can provide
significant improvement over algorithms as Apriori, Close, Pascal,
Mafia, and Closet.
- [CG02]
-
Toon Calders and Bart Goethals. Mining all non-derivable frequent
itemsets. In Tapio Elomaa, Heikki Mannila, and Hannu Toivonen,
editors, Proceedings of the 6th European Conference on
Principles of Data Mining and Knowledge Discovery, volume 2431
of Lecture Notes in Computer Science, pages 74-85.
Springer-Verlag, 2002. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduce NON-DERIVABLE ITEMSETS (NDIs). The
support of all frequent NDIs allows for computing the support of
all frequent itemsets using deduction rules based on the
inclusion-exclusion principle.
- [BBR03]
-
Jean-Francois Boulicaut, Artur Bykowski, and Christophe Rigotti.
Free-sets: A condensed representation of boolean data for the
approximation of frequency queries. Data Mining and Knowledge
Discovery, 7(1):5-22, 2003. [ bib |
find paper in Google Scholar |
find paper in Google ]
Presents a new epsilon-adequate representation for
frequent itemsets called frequent FREE-SETS. An itemset is a
free-set if it has no subset with (almost) the same support thus
the items in the itemset cannot be used to form a (nearly) exact
rule.
- [Zak04]
-
Mohammed Zaki. Mining non-redundant association rules. Data
Mining and Knowledge Discovery, 9:223-248, 2004.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Compares frequent itemsets and frequent closed
itemsets and shows that frequent closed itemsets can be used to
generate NON-REDUNDANT association rules. Non-Redundant rules are a
set of rules with the most general rules (smallest antecedent and
consequent) without loss of information.
- [ZH05]
-
Mohammed Zaki and Ching-Jui Hsiao. Efficient algorithms for mining
closed itemsets and their lattice structure. IEEE Transactions
on Knowledge and Data Engineering, 17(4):462-478, 2005.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Describes the algorithm CHARM.
- [GZ05]
-
Karam Gouda and Mohammed J. Zaki. GenMax: An efficient
algorithm for mining maximal frequent itemsets. Data Mining and
Knowledge Discovery, 11:1-20, 2005. [ bib |
find paper in Google Scholar |
find paper in Google ]
Presents a backtrack search based algorithm for
mining maximal frequent itemsets. Uses: progressive focusing for
maximality checking, and diffset propagation for frequency
computation.
- [BL06]
-
Francesco Bonchi and Claudio Lucchese. On condensed representations
of constrained frequent patterns. Knowledge and Information
Systems, 9(2):180-201, 2006. [ bib |
find paper in Google Scholar |
find paper in Google ]
Presents an algorithm to efficiently mine closed
and constrained frequent itemsets.
- [CRB06]
- Toon Calders, Christophe Rigotti, and Jean-Francois Boulicaut.
A survey on condensed representations for frequent sets. In
Jean-Francois Boulicaut, Luc Raedt, and Heikki Mannila, editors,
Constraint-Based Mining and Inductive Databases: European
Workshop on Inductive Databases and Constraint Based Mining,
Hinterzarten, Germany, March 11-13, 2004, Revised Selected
Papers, volume 3848 of Lecture Notes in Computer
Science, pages 64-80, February 2006. [ bib |
find paper in Google Scholar |
find paper in Google ]
- [HCXY07]
-
J. Han, H. Cheng, D. Xin, and X. Yan. Frequent
pattern mining: Current status and future directions. Data
Mining and Knowledge Discovery, 14(1), 2007. [ bib |
find paper in Google Scholar |
find paper in Google ]
Complete overview of the state-of-the art in
frequent patten mining and identifies future research
directions.
Using association rules for classification
- [LHM98]
-
Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and
association rule mining. In Proceedings of the 4rd
International Conference Knowledge Discovery and Data Mining
(KDD-98), pages 80-86. AAAI Press, 1998. [ bib |
find paper in Google Scholar |
find paper in Google ]
Mines only the subset of association rules with the
classification class attribute in the right-hand-site (CARs). From
these CARs a classifier is built by using the rules with the
highest confidence to cover the whole database. The presented
alorithm is called Classification Based on Associations (CBA). In
ecperiment the resulting classifiers are more accurate than
C4.5.
- [Fre00]
- Alex A. Freitas. Understanding the crucial differences
between classification and discovery of association rules - a
position paper. SIGKDD Explorations, 2(1):65-69, 2000.
[ bib |
find paper in Google Scholar |
find paper in Google ]
- [Li06]
-
Jiuyong Li. On optimal rule discovery. IEEE Transactions on
Knowledge and Data Engineering, 18(4):460-471, 2006.
[ bib |
find paper in Google Scholar | find
paper in Google ]
An optimal rule set (with respect to a metric of
interestingness) contains all rules except those with no greater
interestingness than one of its more general rules. An optimal rule
set is a subset of a nonredundant rule set. The autors present an
algorithm called ORD to find an optimal rule set. Classifiers build
on optimal class association rules are at least as accurate as
those built from CBA and C4.5 rule.
- [JHZ10]
-
Mojdeh Jalali-Heravi and Osmar R. Zaïane. A study on
interestingness measures for associative classifiers. In
Proceedings of the 2010 ACM Symposium on Applied
Computing, SAC '10, pages 1039-1046. ACM, 2010.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Compares associative classifiers using 53 different
objective measures for association rules.
Evolution of association rules over time
- [SKK01]
-
Hee Seok Song, Soung Hie Kim, and Jae Kyeong Kim. A
methodology for detecting the change of customer behavior based on
association rule mining. In Proceedings of the Pacific Asia
Conference on Information System, pages 871-885. PACIS, 2001.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Develops a methodology to detect changes of
customer behavior automatically by comparing association rules
between different time snapshots of data. Defines emerging pattern,
unexpected change and the added/perished rule based on similarity
and difference measures for rule matching.
Theoretic considerations, sampling and clustering
- [MTV94]
-
Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo.
Efficient algorithms for discovering association rules. In
Usama M. Fayyad and Ramasamy Uthurusamy, editors, AAAI
Workshop on Knowledge Discovery in Databases (KDD-94), pages
181-192, Seattle, Washington, 1994. AAAI Press. [ bib |
find paper in Google Scholar |
find paper in Google ]
Develop similar improvements to the candidate
generation as APRIORI. Itemsets with support are called covering
sets. The paper also introduces sampling from the database and
gives bounds for the resulting estimate of support.
- [Toi96]
-
Hannu Toivonen. Sampling large databases for association rules. In
VLDB '96: Proceedings of the 22th International Conference on
Very Large Data Bases, pages 134-145, San Francisco, CA, USA,
1996. Morgan Kaufmann Publishers Inc. [ bib |
find paper in Google Scholar |
find paper in Google ]
Find frequent itemsets in a random sample of a
database (that fits into main memory) and then verify the found
frequent itemsets in the database.
- [ZPLO97]
-
Mohammed Javeed Zaki, Srinivasan Parthasarathy, Wei Li, and
Mitsunori Ogihara. Evaluation of sampling for data mining of
association rules. In Proceedings of the 7th International
Workshop on Research Issues in Data Engineering (RIDE '97) High
Performance Database Management for Large-Scale Applications,
pages 42-50. IEEE Computer Society, 1997. [ bib |
find paper in Google Scholar |
find paper in Google ]
Evaluates random sampling with replacement as
presented in Manila et al. 1994 using several datasets. The
experiments show that Chernoff bounds overestimate the needed
sample size and that sampling seems an effective tool for practical
purposes.
- [LSW97]
-
Brian Lent, Arun N. Swami, and Jennifer Widom. Clustering
association rules. In Proceedings of the Thirteenth
International Conference on Data Engineering, April 7-11, 1997
Birmingham U.K., pages 220-231. IEEE Computer Society, 1997.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Join adjacent intervals for quantitative
association rules to produce more general rules.
- [ZO98]
-
M. J. Zaki and M. Ogihara. Theoretical foundation of
association rules. In SIGMOD'98 Workshop on Research Issues in
Data Mining and Knowledge Discovery (SIGMOD-DMKD'98), Seattle,
Friday, June 5, 1998, 1998. [ bib |
find paper in Google Scholar |
find paper in Google ]
Presents the lattice-theoretic foundations of
mining associations based on FORMAL CONCEPT ANALYSIS and shows that
frequent itemsets are determined by the set of frequent concepts.
The paper studies the generation of a minimal set of rules (called
base) can be generated from which all other association rules can
be inferred. The paper also presents some complexity considerations
using the connection between frequent itemsets and maximal
bipartite cliques. It is shown that for very sparse databases
association rule algorithms should scale linearly in the number of
items.
- [MS98]
-
Nimrod Megiddo and Ramakrishnan Srikant. Discovering predictive
association rules. In Rakesh Agrawal, Paul E. Stolorz, and
Gregory Piatetsky-Shapiro, editors, Proceedings of the Fourth
International Conference on Knowledge Discovery and Data Mining
(KDD-98), pages 274-278. AAAI Press, 1998. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces several STATISTICAL TESTS: Test if the
observed support count is sig. greater than a support threshold,
Chi-square test of independence (see also Brin et al. 1997). Also
deals with the Bonferroni effect (multiple-comparison problem) by
finding an upper bound of the number of tested hypotheses and
proposing a resampling procedure using an independence model. The
paper introduced confidence intervals for support and confidence.
Finally, the authors find that the support-confidence framework
does a good job to eliminate statistically insignificant rules (on
market basket data).
- [LHM99]
-
Bing Liu, Wynne Hsu, and Yiming Ma. Pruning and summarizing the
discovered associations. In Proceedings of the fifth ACM SIGKDD
international conference on Knowledge discovery and data mining
(KDD-99), pages 125-134. ACM Press, 1999. [ bib |
find paper in Google Scholar |
find paper in Google ]
Remove insignificant rules using the chi-square
test to test for correlation between the antecedent and the
confident of a rule. Also DIRECTION SETTING (DS) RULES are
introduced. A DS rule has a pos. correlated antecedent and
consequent and is not built from a rule with a shorter antecedent
which is a DS rule. Normally, only a small and concise fraction of
rules are DS rules.
- [BA99]
-
Robert J. Bayardo Jr. and Rakesh Agrawal. Mining the most
interesting rules. In Proceedings of the fifth ACM SIGKDD
international conference on Knowledge discovery and data mining
(KDD-99), pages 145-154. ACM Press, 1999. [ bib |
find paper in Google Scholar |
find paper in Google ]
Shows that for all rules with the same antecedent,
the best (optimal, most interesting) rules according to measures as
confidence, support, gain, chi-square value, gini, entropy gain,
laplace, lift, conviction all must reside along a
support/confidence border. The paper also shows that many measures
are monotone functions of support and confidence.
- [DP01]
-
William DuMouchel and Daryl Pregibon. Empirical Bayes screening for
multi-item associations. In F. Provost and R. Srikant,
editors, Proceedings of the ACM SIGKDD Intentional Conference
on Knowledge Discovery in Databases and Data Mining (KDD-01),
pages 67-76. ACM Press, 2001. [ bib |
find paper in Google Scholar |
find paper in Google ]
Search for unusually frequent itemsets using
statistical methods. First, the authors propose stratification of
the data to avoid finding spurious associations within strata. Then
the deviation of the observed frequency over a baseline frequency
(based on independence) is used. Since the deviation is unreliable
for low counts, an empirical Bayes model (its 95% confidence limit)
is used to produce a posterior distribution of the true ratio of
actual to baseline frequencies. The Bayes model gives ratios close
to the observed ratios for large samples and reduces (shrinks) the
ratio if the sample size gets small (to smooth away noise). For
multi-item associations log-linear models are proposed to find
higher order associations which cannot be explained by pairwise
associations.
- [BP01]
-
Stephen D. Bay and Michael J. Pazzani. Detecting group
differences: Mining contrast sets. Data Mining and Knowledge
Discovery, 5(3):213-246, 2001. [ bib |
find paper in Google Scholar |
find paper in Google ]
Finds sets with substantially different support in
different groups. Uses interest based pruning and statistical
surprise for filtering (summarizing) contrast sets. The search
error is controlled using different (Bonferroni) correction for
sets of different size.
- [STB+02]
-
Gerd Stumme, Rafik Taouil, Yves Bastide, Nicolas Pasquier, and
Lotfi Lakhal. Computing iceberg concept lattices with titanic.
Data & Knowledge Engineering, 42(2):189-222, 2002.
[ bib |
find paper in Google Scholar |
find paper in Google ]
The paper shows how iceberg concept lattices can be
used as a condensed method to represent and visualize frequent
(closed) itemsets. Iceberg concept lattices only show the top-most
part of a concept lattices (known from Formal Concept Analysis). To
compute iceberg concept lattices the algorithm TITANIC is presented
which computes closed sets (a closure system) in a level-wise
approach using weights (e.g., support), equivalence classes and key
sets (minimal sets in an equivalence class). TITANIC is compared
experimentally to Next-Closure and performs better. PASCAL (Bastide
et al. 2000) is a modified version of TITANIC to mine all frequent
itemsets.
- [APY02]
-
Charu C. Aggarwal, Cecilia Magdalena Procopiuc, and
Philip S. Yu. Finding localized associations in market basket
data. Knowledge and Data Engineering, 14(1):51-62, 2002.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Proposes to cluster transactions using a similarity
measure based on the new affinity measure (measures similarity
between pairs of items). Then mine association rules in the
identified clusters.
- [SLTN03]
-
Sam Y. Sung, Zhao Li, Chew L. Tan, and Peter A. Ng.
Forecasting association rules using existing data sets. IEEE
Transactions on Knowledge and Data Engineering,
15(6):1448-1459, Nov/Dec 2003. [ bib |
find paper in Google Scholar |
find paper in Google ]
Resample datasets proportional to background
attributes (e.g., distribution of customers' sex) to forecast rules
in a new situation (e.g., a new store at a new location).
- [RMZ03]
-
Ganesh Ramesh, William A. Maniatty, and Mohammed J. Zaki.
Feasible itemset distributions in data mining: theory and
application. In Symposium on Principles of Database Systems,
PODS 2003, San Diego, CA, USA, 2003. ACM Press.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Studies the length distributions of frequent and
frequent maximal itemsets (the number of frequent itemsets with the
same length). The length distribution determines the algorithms
performance and is important to generate realistic synthetic
datasets.
- [PMS03]
-
Dmitry Pavlov, Heikki Mannila, and Padhraic Smyth. Beyond
independence: Probabilistic models for query approximation on
binary transaction data. IEEE Transactions on Knowledge and
Data Engineering, 15(6):1409-1421, 2003. [ bib |
find paper in Google Scholar |
find paper in Google ]
Investigates the use of probabilistic models
(independence model, pair-wise interactions stored in a Chow-Liu
Tree, mixtures of independence models, itemset inclusion-exclusion
model, and the maximum entropy method) for the problem of
generating fast approx. answers to queries for large sparse binary
data sets.
- [HSM03]
-
Jaakko Hollmén, Jouni K. Seppänen, and Heikki Mannila. Mixture
models and frequent sets: Combining global and local methods for
0-1 data. In SIAM International Conference on Data Mining
(SDM'03), San Fransisco, May 2003. [ bib |
find paper in Google Scholar |
find paper in Google ]
Clusters binary data first using the EM-algorithms
(looks like LCA; Cadez et al. (2001) seem to do the same to find
profiles). Then the authors mine frequent itemsets in each cluster.
Finally, they use the maximum entropy technique to obtain local
models from the frequent itemsets and combine these models to
approximate the joint distribution.
- [Yan04]
-
Guizhen Yang. The complexity of mining maximal frequent itemsets
and maximal frequent patterns. In Proceedings of the 2004 ACM
SIGKDD international conference on Knowledge discovery and data
mining, Seattle, WA, USA, 2004. ACM Press. [ bib |
find paper in Google Scholar |
find paper in Google ]
Shows that enumerating all maximal frequent
itemsets is NP-hard and the associated counting problem is
#P-complete.
- [Sch05]
-
Tobias Scheffer. Finding association rules that trade support
optimally against confidence. Intelligent Data Analysis,
9(4):381-395, 2005. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces predictive accuracy which is the
expected value of the confidence of a rules with respect to the
process underlying the database. The author shows how predictive
accuracy can be calculated from confidence and support measured on
a data set using a Bayesian frequency correction (very simplified:
confidence is discounted for rules with low supports). Also an
algorithm is presented which finds the top n most predictive
association rules (redundant rules with a 0 predictive accuracy
improvement are removed) and shows how to estimate the prior
distribution needed for the correction.
- [Web06]
-
Geoffrey I. Webb. Discovering significant rules. In KDD
'06: Proceedings of the 12th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 434-443, New York,
NY, USA, 2006. ACM Press. [ bib |
DOI |
find paper in Google Scholar |
find paper in Google ]
Comapares two approaches (the well-known Bonferroni
adjustment and a new evaluation using holdout data) to control the
experimentwise risk of false discoveries for statistical hypothesis
tests. Experimental results indicate that neither of the two
approaches dominates the other.
Evaluation and efficient implementation of rule mining
algorithms
- [KP88]
-
Ron Kohavi and Foster Provost. Glossary of terms. Machine
Learning, 30(2-3):271-274, 1988. [ bib |
find
paper in Google Scholar | find paper
in Google ]
Definition of measures from Machine Learning.
Theses are especially interesting for comparison and evaluation of
association rule algorithms.
- [AIS93]
-
Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Database mining:
A performance perspective. IEEE Transactions on Knowledge and
Data Engineering, 5(6):914-925, 1993. [ bib |
find paper in Google Scholar |
find paper in Google ]
Places association rule mining together with
classification and sequence mining into the context of rule
discovery in database mining. The authors basic operations and an
algorithm to discover classification rules. For the evaluation they
generate artificial survey data using different classification
functions.
- [AS94]
-
Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining
association rules in large databases. In Jorge B. Bocca,
Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the
20th International Conference on Very Large Data Bases, VLDB,
pages 487-499, Santiago, Chile, September 1994. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduction of the APRIORI algorithm (the
best-known algorithm; it uses a breadth-first search strategy to
counting the support of itemsets). The algorithm uses an improved
candidate generation function which exploits the downward closure
property of support and makes it more efficient than AIS. Also an
algorithm to generate synthetic transaction data is presented. Such
synthetic transaction data are widely used for the evaluation and
comparison of new algorithms.
- [KBF+00]
-
Ron Kohavi, Carla Brodley, Brian Frasca, Llew Mason, and Zijian
Zheng. KDD-Cup 2000 organizers' report: Peeling the onion.
SIGKDD Explorations, 2(2):86-98, 2000. [ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces also some freely available data sets for
algorithm performance evaluation.
- [ZKM01]
-
Zijian Zheng, Ron Kohavi, and Llew Mason. Real world performance of
association rule algorithms. In F. Provost and
R. Srikant, editors, Proceedings of the ACM SIGKDD
Intentional Conference on Knowledge Discovery in Databases and Data
Mining (KDD-01), pages 401-406. ACM Press, 2001.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Compares the performance of association rule
algorithms (APRIORI, CHARM, FP-growth, CLOSET, MagnumOpus) using
one IBM-Artificial dataset and three real-world e-commerce
datasets. It shows that some improvements demonstrated on
artificial datasets do not carry over to real-world
datasets.
- [CSM01]
-
Igor V. Cadez, Padhraic Smyth, and Heikki Mannila.
Probabilistic modeling of transaction data with applications to
profiling, visualization, and prediction. In F. Provost and
R. Srikant, editors, Proceedings of the ACM SIGKDD
Intentional Conference on Knowledge Discovery in Databases and Data
Mining (KDD-01), pages 37-45. ACM Press, 2001. [ bib |
find paper in Google Scholar |
find paper in Google ]
The authors construct a model (profile with
weights) for each individual's behavior as a mixture of several
components. This mixture provides the probabilities for a
multinomial probability model (each item has a constant probability
to be chosen for a transaction). Finally, the authors compare
several estimation methods and model variants empirically using
store choice data.
- [LNWJ01]
-
Yingjiu Li, Peng Ning, X. Sean Wang, and Sushil Jajodia.
Generating market basket data with temporal information. In ACM
KDD Workshop on Temporal Data Mining, August 2001.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Develop a generator for synthetic data with
temporal patterns based on the generator by Agrawal and Srikan
(1994).
- [BK02]
-
Christian Borgelt and Rudolf Kruse. Induction of association rules:
Apriori implementation. In 15th Conference on Computational
Statistics (Compstat 2002), Heidelberg, Germany, 2002. Physica
Verlag. [ bib |
find paper in Google Scholar |
find paper in Google ]
An efficient implementation of APRIORI.
- [OLP+03]
-
Salvatore Orlando, Claudio Lucchese, Paolo Palmerini, Raffaele
Perego, and Fabrizio Silvestri. kdci: a multi-strategy algorithm
for mining frequent sets. In Bart Goethals and Mohammed J.
Zaki, editors, FIMI'03: Proceedings of the IEEE ICDM Workshop
on Frequent Itemset Mining Implementations, November 2003.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Introduces the kDCI algorithm.
- [Bor03]
-
Christian Borgelt. Efficient implementations of apriori and eclat.
In Bart Goethals and Mohammed J. Zaki, editors,
Proceedings of the IEEE ICDM Workshop on Frequent Itemset
Mining Implementations, Melbourne, FL, USA, November 2003.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Discusses the efficient implementation of APRIORI
(with prefix tree) and ECLAT.
- [GZ04]
-
Bart Goethals and Mohammed J. Zaki. Advances in frequent
itemset mining implementations: Report on FIMI'03. SIGKDD
Explorations, 6(1):109-117, 2004. [ bib |
find paper in Google Scholar |
find paper in Google ]
This paper reports on the performance of different
frequent itemset mining implementations on several real-world and
artificial databases. The authors conclude that the latest
algorithms (patricia, kdci, fpclose, fpmax*) outperform older ones
but that currently no tested algorithm gracefully scales up to very
large databases with millions of transactions.
- [CLA04]
-
Frans Coenen, Paul Leng, and Shakil Ahmed. Data structures for
association rule mining: T-trees and P-trees. IEEE Transactions
on Knowledge and Data Engineering, 16(6):774-778, 2004.
[ bib |
find paper in Google Scholar |
find paper in Google ]
Describes two new structures for association rule
mining: T-trees (total support trees) and P-trees (partial support
trees). The T-tree is a data structure (a compressed set
enumeration tree) to store itemsets. The P-tree is a compressed way
to represent a database in memory for mining.
- [JSL+05]
-
Daniel R. Jeske, Behrokh Samadi, Pengyue J. Lin, Lan Ye,
Sean Cox, Rui Xiao, Ted Younglove, Minh Ly, Douglas Holt, and Ryan
Rich. Generation of synthetic data sets for evaluating the accuracy
of knowledge discovery systems. In Proceeding of the eleventh
ACM SIGKDD international conference on Knowledge discovery in data
mining, pages 756-762, New York, NY, USA, 2005. ACM Press.
[ bib |
DOI |
find paper in Google Scholar |
find paper in Google ]
Generate synthetic data (e.g., credit card
transaction data) for accuracy evaluation using semantic
graphs.
- [HH07]
-
Michael Hahsler and Kurt Hornik. New probabilistic interest
measures for association rules. Intelligent Data Analysis,
11(5):437-455, 2007. [ bib |
find paper in Google Scholar |
find paper in Google ]
Presents a simple probabilistic framework for
transaction data which can be used to simulate transaction data
when no associations are present. Uses such data and a real-world
grocery database to explore the behavior of confidence and lift,
two popular interest measures used for rule mining. Also introduces
the new probabilistic measures hyper-lift and
hyper-confidence.
© Michael Hahsler, Last
modified
Parts of this page have been generated by bibtex2html