We at Wisdom have been fans of HyperLogLog for quite some time. Recently we finally got around to working with HyperLogLog inside PostgreSQL, via the hll extension from Citus. It's been a great enhancement at Wisdom, so I thought I'd share a few quirks that could have been better documented, to help your team get going with HLL in minutes.
For those of you who are unfamiliar with HyperLogLog, it's a probabilistic data structure that approximates the cardinality (distinct count) of a set in basically O(1) time and space. Accuracy can be improved by taking up more space. As an example, 1.3KB can estimate the cardinality of tens of billions of unique values with an error of a few percent.
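To get a quick feel for that trade-off, here's a minimal sketch (assuming the hll extension from later in this post is already installed) that hashes one million distinct integers into a single small hll value and reads the estimate back:

```sql
-- One million distinct values, one compact hll value; the estimate
-- should land within a few percent of 1,000,000.
SELECT hll_cardinality(hll_add_agg(hll_hash_integer(i)))
FROM generate_series(1, 1000000) AS i;
```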
Even better, you can union or difference various HLL values. This is worth thinking about for a moment. You can take daily snapshots of `daily_active_user_count` and quickly figure out `monthly_active_user_count`, safely handling distinct and non-distinct daily users. You can also go the other way, with set intersection, to calculate how many `monthly_new_users` you received between two monthly HLL values.
Here lies the first tricky gotcha. HLL set intersections are less exciting than they sound: in practice it's usually cheap to count new user signups directly, and HLL's accuracy becomes unbearable when the result is small relative to the inputs (intersections with high overlap, or differences with low overlap). Let's take the monthly new users example again. The intersection of two `monthly_active_user_count` HLL values would likely have high overlap between active users; you'd expect roughly 1-5% of active users to be new. On the high end of that estimate, a true value of 5% comes back with a +/- 2% error measured against the entire set, which is a really wide band: new users reported as anywhere between 3% and 7%. The predicted precision is terrible, with the upper bound being about 2.3x the lower bound. So you should typically steer clear of intersections and stick with unions.
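This compounding is baked into how intersections are computed: postgresql-hll exposes no intersection function, so you estimate one via inclusion-exclusion, stacking the errors of three separate estimates. A sketch, assuming the `daily_uniques` table defined in the next section (the dates are illustrative):

```sql
-- est(|A ∩ B|) = est(|A|) + est(|B|) - est(|A ∪ B|)
-- Three noisy estimates combine, so a small true intersection drowns
-- in the error of the large sets. `#` is cardinality, `||` is union.
SELECT (#a.users) + (#b.users) - #(a.users || b.users) AS est_intersection
FROM daily_uniques a, daily_uniques b
WHERE a.date = '2012-01-01' AND b.date = '2012-02-01';
```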
Use Case Code Explained
Let's jump into some basic SQL code.
```sql
-- Some hosted providers preinstall HLL,
-- like AWS's RDS and GCP's Cloud SQL
SHOW rds.extensions;

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS hll;

-- Create the destination table
CREATE TABLE daily_uniques (
    date  date UNIQUE,
    users hll
);

-- Fill it with the aggregated unique statistics
INSERT INTO daily_uniques(date, users)
    SELECT date, hll_add_agg(hll_hash_integer(user_id))
    FROM facts
    GROUP BY 1;

-- Uniques for the day (boring)
SELECT date, hll_cardinality(users) FROM daily_uniques;

-- Okay, how about some real examples,
-- like the hard-to-otherwise-compute variety.
-- What if you wanted this week's uniques?
SELECT hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE date >= '2012-01-02'::date
  AND date <= '2012-01-08'::date;

-- Or the monthly uniques for this year?
SELECT EXTRACT(MONTH FROM date) AS month,
       hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE date >= '2012-01-01'
  AND date <  '2013-01-01'
GROUP BY 1;

-- Or how about a sliding window of uniques over the last 7 days?
SELECT date, #hll_union_agg(users) OVER seven_days
FROM daily_uniques
WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING);

-- Or the number of uniques you saw yesterday that you didn't see today?
SELECT date,
       (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
FROM daily_uniques
WINDOW two_days AS (ORDER BY date ASC ROWS 1 PRECEDING);
```
Gotcha #0: Defining Schema
When defining an HLL column, you need to set the `log2m` and `regwidth` parameters, configuring the accuracy and the max cardinality respectively.
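As a sketch, the `hll` type accepts these as type modifiers; the 13/6 pair below matches the values used in the snippets later in this post, and should be tuned to your own accuracy and cardinality needs:

```sql
-- log2m = 13: 2^13 registers, roughly 1.04/sqrt(2^13) ≈ 1.1% relative error
-- regwidth = 6: 6 bits per register, capping the countable cardinality
CREATE TABLE daily_uniques (
    date  date UNIQUE,
    users hll(13, 6)
);
```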
Gotcha #1: Insert/Update
Remember to fall back to `hll_empty(log2m, regwidth)` when the set is empty. Be sure to provide parameters matching your column definition:
```sql
-- Remember to coalesce if the set is empty.
COALESCE(
    hll_add_agg(
        hll_hash_text(user_id),
        13, 6
    ),
    hll_empty(13, 6)
) AS metric_count
```
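In context, that fragment might look like the following sketch (table and column names are illustrative, following the `daily_uniques` example above):

```sql
-- Hypothetical end-to-end insert. The COALESCE guards against
-- hll_add_agg returning NULL when a group has no hashable input
-- rows (e.g., after an outer join), storing an empty HLL instead.
INSERT INTO daily_uniques(date, users)
SELECT date,
       COALESCE(
           hll_add_agg(hll_hash_text(user_id), 13, 6),
           hll_empty(13, 6)
       )
FROM facts
GROUP BY 1;
```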
Gotcha #2: Reading
Standard examples show basic cardinality checks.
```sql
SELECT date, hll_cardinality(users) FROM daily_uniques;
```
But the more common, and much more important, use case is counting the union of items across two (text) sets, using `#hll_union_agg`.
Nobody wants to mix nulls and integers, so coalesce the empty case to zero. You'll likely want a BIGINT too: if you were dealing in small integers, you probably wouldn't be using HLL.
```sql
-- Coalesce an empty union to zero, and cast for consumers expecting an integer
COALESCE(#hll_union_agg("uniqueHllIdentityIds"), 0)::BIGINT AS "users"
```