Apriori

Find Frequent Item Sets and Association Rules with the Apriori Algorithm

Note: This documentation refers to Apriori version 6.31 (2022.11.22) and may not be compatible with other versions.
Call apriori without any options or arguments to list the options that your version actually supports.

Introduction
Basic Notions
Target Types
Program Invocation
Program Options
Input Format
- Format of the Transactions File
- Format of the Item Appearances File
Output Format
Extended Rule Selection
Extended Item Set Selection
Transactions as a Prefix Tree
Compilation Options
License
Download
Contact

Introduction

Frequent item set mining and association rule induction [Agrawal and Srikant 1994] are powerful methods for so-called market basket analysis, which aims at finding regularities in the shopping behavior of customers of supermarkets, mail-order companies, online shops, etc. By inducing frequent item sets and association rules one tries to find sets of products that are frequently bought together, so that from the presence of certain products in a shopping cart one can infer (with a high probability) that certain other products are present. Such information, especially if expressed in the form of rules, can often be used to increase the number of items sold, for instance, by appropriately arranging the products on the shelves of a supermarket or on the pages of a mail-order catalog (they may, for example, be placed adjacent to each other in order to invite even more customers to buy them together) or by directly suggesting items to a customer that may be of interest to him/her.

An association rule is a rule like "If a customer buys wine and bread, he/she often buys cheese, too." It expresses an association between (sets of) items, which may be products of a supermarket or a mail-order company, special equipment options of a car, optional services offered by telecommunication companies etc. An association rule states that if we pick a customer at random and find out that he/she selected certain items (bought certain products, chose certain options etc.), we can be confident, quantified by a percentage, that he/she also selected certain other items (bought certain other products, chose certain other options etc.).

Of course, we do not want just any association rules; we want "good" rules, rules that are "expressive" and "reliable". The standard measures to assess association rules are the support and the confidence of a rule, both of which are computed from the support of certain item sets. These notions are discussed in more detail in the section Basic Notions. However, these standard criteria are often not sufficient to restrict the set of rules to the interesting ones. Therefore some additional rule evaluation measures are considered in the section Extended Rule Selection.
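As a rough illustration of how support and confidence are derived from item set supports, the following Python sketch computes both for the wine/bread/cheese rule mentioned above. The transaction data and the function names are hypothetical and chosen only for this example; the support of a rule is not computed here, only the support of item sets and the confidence derived from it.

```python
# Toy transaction database (hypothetical data, for illustration only).
transactions = [
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "cheese"},
    {"wine", "bread", "cheese"},
    {"wine", "cheese"},
]

def support(items, transactions):
    """Relative support: fraction of transactions containing all items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(body, head, transactions):
    """Confidence of the rule body -> head: supp(body and head) / supp(body)."""
    body_supp = support(body, transactions)
    if body_supp == 0:
        return 0.0  # rule body never occurs; confidence is undefined, use 0
    return support(set(body) | set(head), transactions) / body_supp

body, head = {"wine", "bread"}, {"cheese"}
print("supp(body)       =", support(body, transactions))
print("conf(body->head) =", confidence(body, head, transactions))
```

In this toy data three of the five transactions contain wine and bread, and two of those three also contain cheese, so the rule has a confidence of two thirds.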

The main problem of association rule induction is that there are so many possible rules. For example, for the product range of a supermarket, which may consist of several thousand different products, there are billions of possible association rules. It is obvious that such a vast number of rules cannot be processed by inspecting each one in turn. Therefore efficient algorithms are needed that restrict the search space and check only a subset of all rules, but, if possible, without missing important rules. One such algorithm is the Apriori algorithm, which was developed by Agrawal and Srikant [Agrawal and Srikant 1994] and which is implemented in a specific way in my Apriori program.
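To make the idea of restricting the search space more concrete, here is a minimal textbook-style sketch of the level-wise Apriori search for frequent item sets in Python. It is not the implementation used in the Apriori program described on this page (which is far more refined); the function name `apriori_sets` and the toy data are assumptions made for illustration.

```python
from itertools import combinations

def apriori_sets(transactions, min_count):
    """Return {frozenset: count} for all item sets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-sets and keep only those
        # k-sets whose (k-1)-subsets are all frequent (a priori pruning).
        prev = set(level)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Support counting: one pass over the transactions per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a", "c"}]
for itemset, count in sorted(apriori_sets(db, 3).items(), key=lambda p: sorted(p[0])):
    print(sorted(itemset), count)
```

The pruning step is what keeps the search tractable: an item set is counted only if all of its subsets are already known to be frequent, so large parts of the search space are never visited.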

For an overview of frequent item set mining in general and several specific algorithms (including Apriori), see the survey [Borgelt 2012]. This page describes the Apriori implementation that I have been developing and improving since 1996. It uses a prefix tree to organize the support counters and a doubly recursive procedure to process the transactions to count the support of candidate item sets. Some implementation details can be found in [Borgelt and Kruse 2002], [Borgelt 2003], and [Borgelt 2004].
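The following Python sketch illustrates, in a strongly simplified form, how support counters can be organized in a prefix tree and updated with a doubly recursive procedure. It is only a sketch under assumptions: the class `Node` and the functions `insert`, `count` and `lookup` are hypothetical names, and the actual program uses a much more compact and optimized data structure.

```python
class Node:
    """Prefix tree node: a support counter plus children indexed by item."""
    def __init__(self):
        self.count = 0
        self.children = {}

def insert(root, itemset):
    """Add a candidate item set (stored as a path of sorted items)."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, Node())
    return node

def count(node, items, depth):
    """Count one transaction (sorted tuple of items) for candidates of
    size `depth` below `node`."""
    if depth == 0:
        node.count += 1          # reached a candidate: increment its counter
        return
    if len(items) < depth:       # too few items left to reach the depth
        return
    first, rest = items[0], items[1:]
    child = node.children.get(first)
    if child is not None:
        count(child, rest, depth - 1)  # recursion 1: descend with first item
    count(node, rest, depth)           # recursion 2: skip the first item

def lookup(root, itemset):
    """Return the counter of a stored candidate item set."""
    node = root
    for item in sorted(itemset):
        node = node.children[item]
    return node.count

root = Node()
for cand in [{"a", "b"}, {"a", "c"}, {"b", "c"}]:
    insert(root, cand)
for t in [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a", "c"}]:
    count(root, tuple(sorted(t)), 2)
print(lookup(root, {"a", "b"}), lookup(root, {"a", "c"}), lookup(root, {"b", "c"}))
```

Because both the candidates and the transactions are processed in the same item order, each candidate contained in a transaction is reached exactly once.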

By the way: Earlier versions of my Apriori program are incorporated in the well-known data mining tool Clementine (Apriori version 1.8 in Clementine version 5.0, Apriori version 2.7 in Clementine version 7.0), available from SPSS. Newer versions of Clementine still use my program, but I am not completely sure about the version number of the underlying Apriori program.

This program (possibly in an earlier version) is also accessible through the arules package of the statistical software package R. Furthermore it can be used through the Python interface provided by the PyFIM library.

A graphical user interface for this program (ARuleGUI), written in Java, is available here.

supp(x → y) ≥ supp(x → y,z),		conf(x → y) ≥ conf(x → y,z).
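The inequalities above can be checked numerically on a small example. The Python sketch below assumes the original definition of rule support (body & head) and uses hypothetical toy data; the function names are chosen for illustration only.

```python
from itertools import permutations

# Toy transaction database (hypothetical data, for illustration only).
transactions = [
    {"wine", "bread", "cheese"},
    {"wine", "bread"},
    {"bread", "cheese"},
    {"wine", "bread", "cheese"},
    {"wine", "cheese"},
]

def supp(items):
    """Absolute support of an item set."""
    return sum(set(items) <= t for t in transactions)

def rule_supp(body, head):
    """Rule support in the original definition (body & head)."""
    return supp(set(body) | set(head))

def conf(body, head):
    """Rule confidence: supp(body & head) / supp(body)."""
    return rule_supp(body, head) / supp(body)

def check_all():
    """Check both inequalities for all assignments of the three items."""
    for x, y, z in permutations(["wine", "bread", "cheese"]):
        assert rule_supp({x}, {y}) >= rule_supp({x}, {y, z})
        assert conf({x}, {y}) >= conf({x}, {y, z})
    return True

print(check_all())
print(conf({"wine"}, {"bread"}), conf({"wine"}, {"bread", "cheese"}))
```

Adding an item to the head of a rule can only remove transactions from those that satisfy body and head together, which is why neither measure can increase.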

option	meaning	default
`-t#`	target type; `s`: frequent item sets, `c`: closed (frequent) item sets, `m`: maximal (frequent) item sets, `g`: (frequent) generators, `r`: association rules	`s`
`-m#`	minimum number of items per item set/association rule	1
`-n#`	maximum number of items per item set/association rule	no limit
`-s#`	minimum support of an item set (positive: percentage of transactions, negative: absolute number of transactions)	10
`-S#`	maximum support of an item set (positive: percentage of transactions, negative: absolute number of transactions)	100
`-o`	use original definition of the support of a rule (body & head)
`-c#`	minimum confidence of a rule as a percentage	80
`-e#`	additional evaluation measure. Frequent item sets: `x`: no measure; `b`: binary logarithm of support quotient (+). Association rules: `x`: no measure; `o`: rule support (original def.: body & head) (+); `c`: rule confidence (+); `d`: absolute confidence difference to prior (+); `l`: lift value (confidence divided by prior) (+); `a`: absolute difference of lift value to 1 (+); `q`: difference of lift quotient to 1 (+); `v`: conviction (inverse lift for negated head) (+); `e`: absolute difference of conviction to 1 (+); `r`: difference of conviction quotient to 1 (+); `b`: imbalance ratio (+); `k`: Kulczynski measure (+); `w`: difference of Kulczynski measure to ½ (+); `u`: conditional probability ratio (+); `j`: importance (binary log. of cond. prob. ratio) (+); `z`: certainty factor (relative confidence change) (+); `n`: normalized χ² measure (+); `p`: p-value from (unnormalized) χ² measure (-); `y`: normalized χ² measure with Yates' correction (+); `t`: p-value from Yates-corrected χ² measure (-); `i`: information difference to prior (+); `g`: p-value from G statistic/information difference (-); `f`: Fisher's exact test (table probability) (-); `h`: Fisher's exact test (χ² measure) (-); `m`: Fisher's exact test (information gain) (-); `s`: Fisher's exact test (support) (-). All measures for association rules are also applicable to item sets and are then aggregated over all possible association rules with a single item in the consequent. The aggregation mode can be set with the option `-a#`. Measures marked with (+) must meet or exceed the threshold, measures marked with (-) must not exceed the threshold in order for the rule or item set to be reported.	x
`-a#`	aggregation mode for evaluation measure; `x`: no aggregation (use first value), `m`: minimum of individual measure values, `n`: maximum of individual measure values, `a`: average of individual measure values	x
`-d#`	threshold for additional evaluation measure (as a percentage)	10
`-i`	invalidate evaluation below expected support	evaluate all
`-p#`	(minimum size for) pruning with evaluation; < 0: weak forward, > 0: strong forward, = 0: backward pruning	no pruning
`-q#`	sort items w.r.t. their frequency; 0: do not sort; 1: ascending, -1: descending w.r.t. item frequency; 2: ascending, -2: descending w.r.t. transaction size sum	2
`-u#`	filter unused items from transactions; = 0: do not filter items w.r.t. usage in sets; < 0: fraction of removed items for filtering; > 0: take execution time ratio into account	0.01
`-x`	do not prune with perfect extensions	prune
`-y`	a-posteriori pruning of infrequent item sets
`-T`	do not organize transactions as a prefix tree
`-F#:#..`	support border for filtering item sets (list of minimum support values, one per item set size, starting at the minimum size, as given with option `-m#`)	none
`-R#`	read item selection/appearances from a file (parameter: file name)
`-P#`	write a pattern spectrum to a file (parameter: file name; only for frequent item sets, not for association rules)
`-Z`	print item set statistics (number of item sets per size)
`-N`	do not pre-format some integer numbers
`-g`	write output in scannable form (quote certain characters)
`-h#`	record header for output	""
`-k#`	item separator for output	" "
`-I#`	implication sign for association rules	" <- "
`-v#`	output format for item set information (changed to " `(%a)`" if parameter of `-s` is negative)	" `(%S)`"
`-j#`	sort item sets in output by their size (default: no sorting)
`-w`	transaction weight in last field	only items
`-r#`	record/transaction separators	"`\n`"
`-f#`	field/item separators	" `\t,`"
`-b#`	blank characters	" `\t\r`"
`-C#`	comment characters	"`#`"
`-!`	print additional option information
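A few of the rule evaluation measures listed for the option `-e#` follow directly from their short descriptions in the table above. The Python sketch below computes them from absolute counts; the function names and the example counts are hypothetical, and the program itself may scale or report the values differently (for instance as percentages).

```python
import math

def rule_measures(n, n_body, n_head, n_both):
    """Compute some measures for a rule body -> head from absolute counts.

    n       total number of transactions
    n_body  transactions containing the rule body
    n_head  transactions containing the rule head
    n_both  transactions containing body and head
    """
    conf = n_both / n_body          # rule confidence
    prior = n_head / n              # prior confidence (head support)
    lift = conf / prior             # confidence divided by prior
    # Conviction: inverse lift for the negated head.
    conviction = (1 - prior) / (1 - conf) if conf < 1 else math.inf
    return {
        "c": conf,                  # rule confidence
        "d": abs(conf - prior),     # absolute confidence difference to prior
        "l": lift,                  # lift value
        "a": abs(lift - 1),         # absolute difference of lift value to 1
        "v": conviction,            # conviction
    }

def meets(value, threshold, higher_is_better=True):
    """(+) measures must meet or exceed, (-) measures must not exceed."""
    return value >= threshold if higher_is_better else value <= threshold

m = rule_measures(n=100, n_body=40, n_head=50, n_both=30)
print(m)
print(meets(m["l"], 1.2), meets(0.03, 0.05, higher_is_better=False))
```

With these counts the rule has a confidence of 0.75 against a prior of 0.5, so the head is 1.5 times as likely when the body is present as it is in general.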

%%		a percent sign
%i		number of items (item set size)
%a		absolute item set support
%s		relative item set support as a fraction
%S		relative item set support as a percentage
%e		additional evaluation measure
%E		additional evaluation measure as a percentage
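To show how such a format string might be expanded, here is a small Python sketch that replaces the characters listed above (presumably those accepted by the option `-v#`) with values for one item set. The function `format_info` is hypothetical, and the number formatting (six significant digits) is an assumption; the actual program may print a different number of digits.

```python
def format_info(fmt, size, abs_supp, n_trans, evaluation=None):
    """Expand %-sequences in fmt for one item set (illustrative only)."""
    rel = abs_supp / n_trans
    values = {
        "%": "%",                                   # a percent sign
        "i": str(size),                             # number of items
        "a": str(abs_supp),                         # absolute support
        "s": f"{rel:g}",                            # support as a fraction
        "S": f"{100 * rel:g}",                      # support as a percentage
        "e": "" if evaluation is None else f"{evaluation:g}",
        "E": "" if evaluation is None else f"{100 * evaluation:g}",
    }
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt) and fmt[i + 1] in values:
            out.append(values[fmt[i + 1]])          # known format character
            i += 2
        else:
            out.append(ch)                          # copy everything else
            i += 1
    return "".join(out)

print("a b" + format_info(" (%S)", size=2, abs_supp=3, n_trans=5))
print("a b" + format_info(" (%a)", size=2, abs_supp=3, n_trans=5))
```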

`-ax`		no aggregation (use first value)
`-am`		minimum of individual measure values
`-an`		maximum of individual measure values
`-aa`		average of individual measure values
`-as`		split into equal size subsets
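The first four aggregation modes can be sketched in a few lines of Python. The function `aggregate` is a hypothetical helper, not part of the program; it takes the measure values of the individual rules derived from an item set. The mode `s` is left out, because its exact semantics cannot be derived from the short description above.

```python
def aggregate(values, mode="x"):
    """Aggregate individual measure values according to the given mode."""
    values = list(values)
    if not values:
        raise ValueError("no measure values to aggregate")
    if mode == "x":
        return values[0]                  # no aggregation (use first value)
    if mode == "m":
        return min(values)                # minimum of individual values
    if mode == "n":
        return max(values)                # maximum of individual values
    if mode == "a":
        return sum(values) / len(values)  # average of individual values
    raise ValueError(f"unsupported aggregation mode: {mode!r}")

vals = [0.5, 0.25, 0.75]
for mode in "xmna":
    print(mode, aggregate(vals, mode))
```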

Contents

Program Invocation

Program Options

Format of the Item Appearances File

Output Format

Format of Association Rules

Conviction (Inverse Lift for Negated Head, option -ev)

Absolute Difference of Conviction to 1 (option -ee)

Difference of Conviction Quotient to 1 (option -er)

Imbalance Ratio (option -eb)

Kulczynski Measure (option -ek)

Difference of Kulczynski Measure to ½ (option -ew)

Conditional Probability Ratio (option -eu)

Importance (option -ej)

Certainty Factor (option -ez)

Normalized χ²-Measure (option -en)

p-Value Computed from χ² Measure (option -ep)

Normalized χ²-Measure with Yates' Correction (option -ey)

p-Value Computed from Yates-corrected χ²-Measure (option -et)

p-Value Computed from G-Statistic (option -eg)

Fisher's Exact Test; Table Probability (option -ef)

Fisher's Exact Test; χ²-Measure (option -eh)

Fisher's Exact Test; Information Gain (option -em)

Fisher's Exact Test; Support (option -es)

Original Rule Support (body & head; option -eo)

Rule Confidence (option -ec)

Selection Behavior of Some Measures

Extended Item Set Selection

Binary Logarithm of Support Quotient

Additional Rule Evaluation Measures

Pruning with Additional Measures

Difference of Support Quotient to 1

Transaction Prefix Tree

Compilation Options

License

Download

Contact
