Apriori Algorithm

Introduction

The Apriori algorithm is a classic method in data mining that identifies frequent itemsets within a dataset and extracts relevant association rules.

It is commonly used to analyze shopping data, especially in market basket analysis and recommendation systems. For example, if many customers buy bread and butter together, Apriori helps discover that pattern.

Apriori algorithm takes advantage of the downward closure property which states that any subset of a frequent itemset must be frequent. Think about it this way. If we set any itemset that appears over 5 times is frequent and we find out that the pair [Bread, Butter] is frequent, we know that any itemset including this pair will also be considered frequent automatically.

Apriori algorithm starts from 1-frequent itemset and grows this itemset iteratively until it reaches the max length. It greatly reduces the time complexity since non-frequent itemsets are pruned in the earlier stages.

https://businessanalytics.substack.com/p/apriori-algorithm

In this assignment, we were given a collection of tweets from 2014 related to flu shots. We had to implement the apriori algorithm and extract some insights from the data.

Code Explanation

def F1_list(transactions, min_support):
    """
    Finds frequent itemsets in a given transaction dataset based on the minimum support threshold.

    :param transactions: (list of sets) A list where each element is a set representing a transaction
                         (e.g., [{item1, item2}, {item1, item3, item4}, ...]).
    :param min_support: (int) The minimum support threshold for an itemset to be considered frequent
                        (e.g., 500).
    :return: (list of tuples) A list of tuples, where each tuple consists of a frozenset representing
                              a frequent itemset and an integer indicating its support count
                              (e.g., [(frozenset({item1, item2}), 600), (frozenset({item3}), 550), ...]).
    """

    l1_candidate_itemsets = {}
    l1_frequent_itemsets = []

    # create a dictionary with key:item & value:support.
    for transaction in tqdm(transactions, desc="Processing itemsets of size 1"):
        for item in transaction:
            if item in l1_candidate_itemsets:
                l1_candidate_itemsets[item] += 1
            else:
                l1_candidate_itemsets[item] = 1

    # prune items that have support less than the minimum support. Added tqdm for progress checking
    for item in l1_candidate_itemsets:
        itemset_support = l1_candidate_itemsets[item]
        if itemset_support >= min_support:
            l1_frequent_itemsets.append((frozenset([item]), itemset_support))

    # print message for additional information
    print('  Frequent itemsets of size 1:', len(l1_frequent_itemsets))

    return l1_frequent_itemsets

def Fk(transactions, min_support, Fk_1, F1):
    """
    Generates candidate k-itemsets and their support counts using the Fk-1 × F1 method.

    :param transactions: (list of sets) A list where each element is a set representing a transaction
                         (e.g., [{item1, item2}, {item1, item3, item4}, ...]).
    :param min_support: (int) The minimum support threshold for an itemset to be considered frequent
                        (e.g., 500).
    :param Fk_1: (list of tuples) A list of frequent (k-1)-itemsets, where each tuple consists of a
                 frozenset representing the itemset and an integer indicating its support count
                 (e.g., [(frozenset({item1, item2}), 600), (frozenset({item3}), 550), ...]).
    :param F1: (list of tuples) A list of frequent 1-itemsets, where each tuple consists of a frozenset
               representing the itemset and an integer indicating its support count
               (e.g., [(frozenset({item1}), 800), (frozenset({item2}), 750), ...]).
    :return: (list of tuples) A list of candidate k-itemsets, where each tuple consists of a frozenset
                              representing the itemset and an integer indicating its support count
                              (e.g., [(frozenset({item1, item2, item3}), 520), (frozenset({item2, item4}), 530), ...]).
    """
    candidates = set()

    # create candidate sets to use as keys
    for itemset, _ in Fk_1:
        for item, _ in F1:
            new_candidate = frozenset(itemset | item)  # Union of (k-1)-itemset with 1-itemset
            if len(new_candidate) == len(itemset) + 1:  # Ensure new candidate it length of k
                candidates.add(new_candidate)

    itemset_count_dict = {}

    # Initialize tqdm progress bar for progress checking
    pbar = tqdm(candidates, desc="Processing itemsets")

    # create a dictionary with key:itemset & value:support
    for itemset in pbar:
        # Update the description with itemset size
        pbar.set_description(f"Processing itemsets of size {len(itemset)}")
        for transaction in transactions:
            if itemset.issubset(transaction):
                if itemset not in itemset_count_dict:
                    itemset_count_dict[itemset] = 1
                else:
                    itemset_count_dict[itemset] += 1

    lk_frequent_itemsets = []

    # prune itemsets that have support less than the minimum support
    for itemset in itemset_count_dict:
        itemset_support = itemset_count_dict[itemset]
        if itemset_support >= min_support:
            lk_frequent_itemsets.append((itemset, itemset_support))

    # print message for additional information
    if lk_frequent_itemsets:
        first_itemset = next(iter(lk_frequent_itemsets))[0]
        print(f'  Frequent itemsets of size {len(first_itemset)}: {len(lk_frequent_itemsets)}')

    return lk_frequent_itemsets

def apriori(transactions, min_support):
    """
    Apriori Algorithm for finding frequent itemsets of all sizes based on a minimum support threshold.

    :param transactions: (list of sets) A list where each element is a set representing a transaction
                         (e.g., [{item1, item2}, {item1, item3, item4}, ...]).
    :param min_support: (int) The minimum support threshold for an itemset to be considered frequent
                        (e.g., 500).
    :return: (list of tuples) A list of tuples, where each tuple consists of a frozenset representing
                              a frequent itemset and an integer indicating its support count
                              (e.g., [(frozenset({item1, item2}), 600), (frozenset({item3, item4}), 550), ...]).
    """

    all_frequent_itemsets = []

    # Step 1: Get frequent 1-itemsets
    f1 = F1_list(transactions, min_support)
    all_frequent_itemsets += f1

    # Step 2: Iteratively generate frequent itemsets of size k
    fk_1 = f1
    while True:
        fk = Fk(transactions, min_support, fk_1, f1)  # Here f1 is used for candidate generation
        if not fk:
            break  # If no frequent itemsets are found, stop the loop

        all_frequent_itemsets += fk
        fk_1 = fk

    return sorted(all_frequent_itemsets, key=lambda x: x[1], reverse=True)

The Apriori algorithm operates through a systematic approach.

Initially, the F1_list function generates frequent single-item sets by counting each item's occurrence across all transactions and selecting those meeting the minimum support threshold.

Subsequently, the Fk function iteratively expands these frequent itemsets, combining smaller frequent itemsets (Fk-1) with single-item sets (F1) to form larger candidate itemsets. Each candidate's support is calculated, and itemsets not meeting the support criterion are discarded. This iterative process continues until no larger frequent itemsets can be generated.