We at Wisdom have been great fans of HyperLogLog for quite some time. Recently, we finally got around to using HyperLogLog inside PostgreSQL, via the postgresql-hll extension maintained by Citus. It's been a great addition at Wisdom, so I thought I'd share a few quirks that could have been better documented, to help your team get going with HLL in minutes.

For those of you who are unfamiliar with HyperLogLog, it's a probabilistic data structure that estimates the cardinality of a set (the number of distinct elements) in effectively O(1) time and space. Accuracy can be improved by using more space. As an example, about 1.3KB can estimate the cardinality of tens of billions of unique values with an error of a few percent.
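The usual rule of thumb (from the original HyperLogLog analysis) is that the relative error is roughly 1.04 / √m, where m is the number of registers. As a quick sanity check in plain SQL, taking an assumed configuration of 2^11 registers at around 5 bits each (which lands in the ~1.3KB ballpark):

-- Relative error ≈ 1.04 / sqrt(m), where m = number of registers.
-- 2^11 = 2048 registers at ~5 bits each ≈ 1.3KB of storage.
SELECT 1.04 / sqrt(2 ^ 11) AS approx_relative_error;  -- ≈ 0.023, i.e. about 2.3%

More registers buy accuracy at the cost of space: quadrupling the register count halves the error.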

Even better, you can union HLL values. This is worth thinking about for a moment. You can take daily snapshots of daily_active_user_count and quickly derive monthly_active_user_count, with users who appear on multiple days counted only once. You can also go the other way, estimating a set intersection, to calculate how many monthly_new_users you received between two monthly HLL values.
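The extension has no intersection aggregate; an intersection estimate falls out of inclusion-exclusion over unions. A sketch, assuming the daily_uniques table and dates from the examples below:

-- |A ∩ B| = |A| + |B| - |A ∪ B|, via inclusion-exclusion.
SELECT hll_cardinality(a.users)
     + hll_cardinality(b.users)
     - hll_cardinality(a.users || b.users) AS est_overlap
FROM daily_uniques a, daily_uniques b
WHERE a.date = '2012-01-01' AND
      b.date = '2012-01-02';

Each term carries its own error, which is exactly why intersections get dicey, as the next section explains.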

Here lies the first tricky gotcha. HLL set intersections are less exciting in practice, for two reasons: it's usually cheap to count new user signups exactly, and HLL's accuracy becomes unbearable when the result is small relative to the inputs (intersections with high overlap, or differences with low overlap). Let's take the monthly new users example again. Two monthly_active_user_count HLLs would have high overlap between active users; you'd expect roughly 1-5% of active users to be new. On the high end of that estimate, you'd be measuring a 5% slice with an error of +/- 2% of the entire set, which is a really wide band: the upper bound (7%) is about 2.3x the lower bound (3%). So you should typically stay away from intersections and stick with unions.

Use Case Code Explained

Let's jump into some basic SQL code.

-- Some hosted providers preinstall HLL,
-- like AWS's RDS and GCP's Cloud SQL
SHOW rds.extensions;
 

-- enable the extension
CREATE EXTENSION IF NOT EXISTS hll;


-- Create the destination table
CREATE TABLE daily_uniques (
    date            date UNIQUE,
    users           hll
);


-- Fill it with the aggregated unique statistics
INSERT INTO daily_uniques(date, users)
    SELECT
    	date,
        hll_add_agg(hll_hash_integer(user_id))
FROM facts
GROUP BY 1;   


-- Uniques for the day (boring)
SELECT
	date,
    hll_cardinality(users)
FROM daily_uniques;


-- Okay, how about some real examples
-- like the hard to otherwise compute variety.


-- What if you wanted to see this week's uniques?
SELECT
	hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE
	date >= '2012-01-02'::date AND
    date <= '2012-01-08'::date;


-- Or the monthly uniques for this year?
SELECT
	EXTRACT(MONTH FROM date) AS month,
    hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE date >= '2012-01-01' AND
      date <  '2013-01-01'
GROUP BY 1;



-- Or how about a sliding 7-day window of uniques
-- (the current day plus the 6 preceding days)?
SELECT
	date,
    #hll_union_agg(users) OVER seven_days
FROM daily_uniques
WINDOW seven_days
AS (ORDER BY date ASC ROWS 6 PRECEDING);



-- Or the number of uniques you saw yesterday that you didn't see today?
SELECT
	date,
    (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
FROM daily_uniques
WINDOW two_days
AS (ORDER BY date ASC ROWS 1 PRECEDING);


Gotcha #0: Defining Schema

When defining an HLL column, you can set the register count (log2m) and register width (regwidth). The register count controls accuracy, while the register width caps the maximum cardinality you can estimate.
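If I recall the extension's type-modifier syntax correctly, these can be set per column; 13 and 6 here are just example values matching the blocks below, not recommendations:

-- A sketch: the hll type modifiers are (log2m, regwidth).
-- log2m = 13 gives 2^13 = 8192 registers (~1.2% error);
-- regwidth = 6 is the number of bits per register.
CREATE TABLE daily_uniques (
    date    date UNIQUE,
    users   hll(13, 6)
);

Omitting the modifiers falls back to the extension's defaults, which are fine for most workloads.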

Gotcha #1: Insert/Update

Remember to fall back to hll_empty(x, y) when the set is empty. Be sure to pass the same register count and register width to hll_empty() as you used in hll_add_agg().

-- Remember to coalesce if the set is empty.
COALESCE(
    hll_add_agg(
        hll_hash_text(user_id),
        13,  -- log2m: 2^13 registers
        6    -- regwidth: bits per register
    ),
    hll_empty(13, 6)
) AS metric_count

Gotcha #2: Reading

Standard examples show basic cardinality checks.

SELECT date, hll_cardinality(users) FROM daily_uniques;

But the more common, and much more important, use case is counting the union of items across sets (hashed here with hll_hash_text()).

Nobody wants nulls, so coalesce the empty case to zero. You'll likely want the result as a BIGINT; if your counts were small enough for a regular integer, you probably wouldn't be using HLL.

-- Union a date range and read its cardinality,
-- coalescing the empty case to zero.
SELECT
    COALESCE(#hll_union_agg(users), 0)::BIGINT AS users
FROM daily_uniques
WHERE date >= '2012-01-01' AND
      date <  '2012-02-01';