Christian Borgelt's Web Pages

Carpenter - Closed and Maximal Frequent Item Set Mining

Download

carpenter	(350 kb)	GNU/Linux executable
carpenter.exe	(218 kb)	Windows console executable
carpenter.zip	(191 kb)	C sources, version 3.22 (2020.06.15)
carpenter.tar.gz	(171 kb)	C sources, version 3.22 (2020.06.15)
census.zip	(382 kb)	census data set (UCI ML repository)
census	(2 kb)	shell script used for the conversion

Description

Carpenter is a program for finding closed frequent item sets with the Carpenter algorithm [Pan et al. 2003]. Unlike most other frequent item set mining algorithms, which enumerate item sets, Carpenter enumerates transaction sets. This approach can be highly competitive in special cases, namely when there are few transactions and (very) many items, a situation that is common in biological data sets such as gene expression data. For other data sets (fewer items, many transactions), however, it is not a recommended approach.
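The idea of enumerating transaction sets can be illustrated with a small Python sketch. It is not the code of this program and omits the pruning techniques of the actual Carpenter algorithm; it only relies on the fact that the intersection of any set of at least smin transactions is a closed item set with support at least smin. The function name closed_item_sets and its parameters are chosen here for illustration only.

```python
def closed_item_sets(transactions, smin):
    """Return {closed item set: support} for all non-empty closed item sets
    with support >= smin, found by intersecting sets of transactions.

    Naive illustration: exponential in the number of transactions and
    without the pruning used by the real Carpenter algorithm.
    """
    if smin < 1:
        raise ValueError("smin must be at least 1")
    tracts = [frozenset(t) for t in transactions]
    n = len(tracts)
    closed = {}

    def extend(items, count, start):
        # 'items' is the intersection of the 'count' transactions chosen so far
        # (None as long as no transaction has been chosen).
        if count >= smin and items not in closed:
            # The support is the number of transactions containing the item set.
            closed[items] = sum(1 for t in tracts if items <= t)
        for k in range(start, n):
            # Stop if even taking all remaining transactions cannot reach smin.
            if count + (n - k) < smin:
                break
            new = tracts[k] if items is None else items & tracts[k]
            if new:  # empty intersections are not extended or reported
                extend(new, count + 1, k + 1)

    extend(None, 0, 0)
    return closed


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    for items, supp in closed_item_sets(data, 2).items():
        print(sorted(items), supp)
```

Because the search runs over subsets of transactions rather than subsets of items, its cost grows with the number of transactions, which is why the approach suits data with few transactions and many items.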

By default the program finds closed item sets. It can also find maximal item sets, but these are obtained by filtering the closed item sets, which may not be very efficient.
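Why maximal item sets can be obtained by filtering the closed ones can be sketched as follows: a closed frequent item set is maximal exactly if no other closed frequent item set is a proper superset of it. The Python sketch below uses a simple quadratic comparison; it is an assumption made for illustration and not necessarily how this program performs the filtering. The function name maximal_from_closed is hypothetical.

```python
def maximal_from_closed(closed):
    """Keep only the maximal item sets of a collection of closed frequent
    item sets, given as a dict {frozenset of items: support}.

    Simple quadratic filter for illustration only.
    """
    sets = list(closed)
    return {
        s: closed[s]
        for s in sets
        # '<' on frozensets tests for a proper subset
        if not any(s < other for other in sets)
    }


if __name__ == "__main__":
    closed = {
        frozenset("a"): 3, frozenset("b"): 3, frozenset("c"): 3,
        frozenset("ab"): 2, frozenset("ac"): 2, frozenset("bc"): 2,
    }
    for items, supp in maximal_from_closed(closed).items():
        print(sorted(items), supp)
```

Each closed set is compared against all others here, which hints at why such a filtering step can become costly when many closed item sets are found.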

This implementation offers a variant based on transaction identifier lists that follows the description in [Pan et al. 2003], but adds several optimizations. Due to these optimizations it significantly outperforms the implementation in the Gemini package, which is provided by the authors of [Pan et al. 2003].

The alternative is a variant based on an item occurrence counter table, which bears some vague resemblance to the horizontal approach in the RERII algorithm [Cong et al. 2004]. Which of the two variants is faster depends on the data set. By default, the variant to be used is chosen automatically based on the table size (essentially: if the data table fits into the processor cache, use the table, otherwise use the transaction identifier list version).
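The two data representations and the size-based choice between them can be pictured with the following Python sketch. Everything in it is an assumption made for illustration: the table layout (for each transaction and item, the number of occurrences of the item in this and all later transactions, or 0 if the item is absent), the function names, and in particular the cache size and cell size used in choose_variant are hypothetical and are not taken from the program.

```python
def tid_lists(transactions):
    """Map each item to the ascending list of ids of transactions containing it."""
    lists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            lists.setdefault(item, []).append(tid)
    return lists


def occurrence_table(transactions, items):
    """Build a table with one row per transaction and one column per item.

    Assumed layout: table[k][i] is 0 if items[i] is not in transaction k,
    otherwise the number of transactions j >= k that contain items[i].
    """
    n = len(transactions)
    table = [[0] * len(items) for _ in range(n)]
    remaining = [0] * len(items)
    for k in range(n - 1, -1, -1):  # fill from the last transaction upwards
        for i, item in enumerate(items):
            if item in transactions[k]:
                remaining[i] += 1
                table[k][i] = remaining[i]
    return table


def choose_variant(n_transactions, n_items, cache_bytes=1 << 20, cell_bytes=4):
    """Pick a variant by table size; both default sizes are hypothetical."""
    table_bytes = n_transactions * n_items * cell_bytes
    return "table" if table_bytes <= cache_bytes else "tid lists"


if __name__ == "__main__":
    data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
    print(tid_lists(data))
    print(occurrence_table(data, ["a", "b", "c"]))
    print(choose_variant(len(data), 3))
```

A table of this kind needs one cell per transaction and item regardless of how sparse the data is, whereas transaction identifier lists grow only with the number of item occurrences, which is one reason a size-based choice between the two is plausible.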

A full description of the Carpenter program is included in the source package.

If you have trouble executing the program on Microsoft Windows, check whether you have the Microsoft Visual C++ Redistributable for Visual Studio 2022 (see under "Other Tools and Frameworks") installed, as the program was compiled with Microsoft Visual Studio 2022.

The improved Carpenter algorithm used in this program is described in the following paper:

Finding Closed Frequent Item Sets by Intersecting Transactions
Christian Borgelt, Xiaoyuan Yang, Ruben Nogales-Cadenas, Pedro Carmona-Saez, and Alberto Pascual-Montano.
Proc. 14th Int. Conf. on Extending Database Technology (EDBT 2011, Uppsala, Sweden), 367-376.
ACM Press, New York, NY, USA 2011
edbt_11.pdf (448 kb) edbt_11.ps.gz (373 kb) (10 pages)

Some other references:

Carpenter: Finding Closed Patterns in Long Biological Datasets
F. Pan, G. Cong, A.K.H. Tung, J. Yang, and M. Zaki.
Proc. 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2003, Washington, DC), 637-642.
ACM Press, New York, NY, USA 2003

Mining Frequent Closed Patterns in Microarray Data
G. Cong, K.-L. Tan, A.K.H. Tung, and F. Pan.
Proc. 4th IEEE Int. Conf. on Data Mining (ICDM 2004, Brighton, UK), 363-366.
IEEE Press, Piscataway, NJ, USA 2004

More information about frequent item set mining, implementations of other algorithms as well as test data sets can be found at the Frequent Itemset Mining Implementations Repository.