We at Wisdom have been great fans of HyperLogLog for quite some time. More recently however we came to finally get around to working with HyperLogLog from within PostgreSQL as an extension from Citus. It's been a great enhancement at Wisdom, so I thought I'd share a few quirks that could have been better documented to help your team get going with HLL in minutes.
For those of you who are unfamiliar with HyperLogLog, it's a statistical data structure that derives approximations- in O(1) time complexity and O(log(log(n)) space complexity. Pretty impressive. The catch is that you get about 1.5% accuracy, configurable of course by taking up more space. As an example, 1.3KB can estimate the cardinality of tens of billions of unique values with an accuracy of a few percent.
Beyond just simple cardinality estimation, can union or difference various HLL values. This is worth thinking about for a moment. You can take daily snapshots of
daily_active_user_count, and quickly figure out
monthly_active_user_count, safely handling distinct daily users. You can also go the other way, with set intersection, to calculate how many
monthly_new_users you received between two monthly hll values.
Use Case Code Explained
Let's jump into some basic SQL code.
-- Some hosted providers preinstall HLL, -- like AWS's RDS and GCP's Cloud SQL SHOW rds.extensions; -- enable the extension CREATE EXTENSION IF NOT EXISTS hll; -- Create the destination table CREATE TABLE daily_uniques ( date date UNIQUE, users hll ); -- Fill it with the aggregated unique statistics INSERT INTO daily_uniques(date, users) SELECT date, hll_add_agg(hll_hash_integer(user_id)) FROM facts GROUP BY 1; -- Uniques for the day (boring) SELECT date, hll_cardinality(users) FROM daily_uniques; -- Okay, how about some real examples -- like the hard to otherwise compute variety. -- What if you wanted to this week's uniques? SELECT hll_cardinality(hll_union_agg(users)) FROM daily_uniques WHERE date >= '2012-01-02'::date AND date <= '2012-01-08'::date; -- Or the monthly uniques for this year? SELECT EXTRACT(MONTH FROM date) AS month, hll_cardinality(hll_union_agg(users)) FROM daily_uniques WHERE date >= '2012-01-01' AND date < '2013-01-01' GROUP BY 1; -- Or how about a sliding window of uniques over the past 6 days? SELECT date, #hll_union_agg(users) OVER seven_days FROM daily_uniques WINDOW seven_days AS (ORDER BY date ASC ROWS 6 PRECEDING); -- Or the number of uniques you saw yesterday that you didn't see today? SELECT date, (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques FROM daily_uniques WINDOW two_days AS (ORDER BY date ASC ROWS 1 PRECEDING);
Gotcha #1: Set Intersection Use Case
Here lies the first tricky gotcha. HLL set intersections are less exciting when you consider in practice that it's usually by convention fairly performant to count new user signups, and the accuracy of HLL becomes unbearable when working with very small results, for example intersections with high overlap or differences with low overlap. Let's take the monthly new users example again. The set intersection of two
monthly_active_user_count would likely have high overlap, (let's say 6% of users are new). A 2% error rate on each of the two sets isn't reasonable when the difference representing new users, is 6%. Reporting new users to be between 4% and 8% isn't rather terrible. So you should typically stay from intersections— stick with with unions.
Here lies the first tricky gotcha. HLL set intersections are less exciting when you consider in practice that it's usually fairly performant to count new user signups, and the accuracy becomes unbearable when working with small cardinalities, such as from very high overlap during set intersections. Let's take the monthly new users example again. The set intersection of two HLL instances of
monthly_active_user_count would likely have high overlap between active users. You'd expect roughly 5% of active users to be new users, so let's pretend for a moment that that's the true value. Again, HLLs are tuneable usually to about 2% error on the total set cardinality so we'd get a value between the range of 5, +/-2%. In our example, the resultant cardinality would be reporting new users to be between 3% and 7%, which is rather terrible. So you should typically stay from intersections— stick with with unions.
Gotcha #1: Defining Schema
When defining an HLL column, you need to set the
registerWidth, configuring the max cardinality and accuracy respectively. You cannot mix different HLL instances containing different register counts and widths.
Gotcha #2: Insert/Update
Remember to fallback to
hll_empty(x, y) when the set is empty. Again, be sure to provide matching
When dealing with analytics visualization, few systems work with both nulls and integers so remember to coalesce to zero. You'll also likely want to cast to BigInt as if you were using small integers, you'd probably not be using HLL.
-- Remember to coalesce if empty set. COALESCE( hll_add_agg( hll_hash_text(user_id), 13, 6 ), hll_empty(13,6) )::BIGINT AS metric_count
Plug: Wisdom is a front-end monitoring tool that helps front-end developers catch and fix bugs faster by combining session replay, error tracking, and developer tools— all in one amazing package.