Postgres HyperLogLog Extension

sql Nov 28, 2019

We at Wisdom have been great fans of HyperLogLog for quite some time. More recently, we finally got around to using HyperLogLog from within PostgreSQL, via the extension maintained by Citus. It's been a great enhancement at Wisdom, so I thought I'd share a few quirks that could have been better documented, to help your team get going with HLL in minutes.

For those of you who are unfamiliar with HyperLogLog, it's a probabilistic data structure that estimates the number of distinct elements in a set in O(1) time and O(log(log(n))) space. Pretty impressive. The catch is that the estimate carries roughly 1.5% error, configurable of course by taking up more space. As an example, about 1.3KB can estimate the cardinality of tens of billions of unique values to within a few percent.
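
For intuition on that space/accuracy tradeoff, the HyperLogLog paper gives a rule of thumb of roughly 1.04/sqrt(m) relative standard error, where m is the number of registers. The snippet below is just a quick illustration of that rule; the choice of 2^13 registers is an assumption on my part, picked to match the 13/6 parameters used later in this post.

-- Rule of thumb from the HyperLogLog paper: relative error ~ 1.04 / sqrt(m),
-- where m is the number of registers (here 2^13 = 8192, an illustrative choice).
SELECT 1.04 / sqrt(2 ^ 13) AS approx_relative_error;  -- ~0.011, i.e. about 1%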

Beyond simple cardinality estimation, you can union HLL values, and derive intersections and differences from them. This is worth thinking about for a moment. You can take daily snapshots of daily_active_user_count and quickly roll them up into monthly_active_user_count, correctly deduplicating users who were active on more than one day. You can also go the other way, using set differences and intersections, to estimate how many monthly_new_users you gained between two monthly HLL values.
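
Here's a hedged sketch of that second idea. The monthly_uniques table is hypothetical (same shape as the daily_uniques table defined below, but rolled up per month); the derivation itself is just inclusion-exclusion using the extension's # (cardinality) and || (union) operators.

-- Hypothetical monthly_uniques(month date UNIQUE, users hll) rollup table.
-- New users in m2 relative to m1 = |m1 ∪ m2| - |m1|.
SELECT
    #(m1.users || m2.users) - #m1.users AS monthly_new_users
FROM monthly_uniques AS m1, monthly_uniques AS m2
WHERE
    m1.month = '2019-10-01' AND
    m2.month = '2019-11-01';

The caveat, covered in Gotcha #1 below, is that the error scales with the big sets being combined, not with the small difference you actually care about.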

Use Case Code Explained

Let's jump into some basic SQL code.

-- Some hosted providers support HLL out of the box,
-- e.g. AWS RDS and GCP Cloud SQL.
-- On RDS you can list the available extensions:
SHOW rds.extensions;
 

-- enable the extension
CREATE EXTENSION IF NOT EXISTS hll;


-- Create the destination table
CREATE TABLE daily_uniques (
    date            date UNIQUE,
    users           hll
);


-- Fill it with the aggregated unique statistics
INSERT INTO
	daily_uniques(date, users)
SELECT
	date,
	hll_add_agg(hll_hash_integer(user_id))
FROM facts
GROUP BY 1;   


-- Uniques for the day (boring)
SELECT
    date,
    hll_cardinality(users)
FROM daily_uniques;


-- Okay, how about some real examples,
-- the kind that are hard to compute otherwise.


-- What if you wanted this week's uniques?
SELECT
    hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE
    date >= '2012-01-02'::date AND
    date <= '2012-01-08'::date;


-- Or the monthly uniques for this year?
SELECT
    EXTRACT(MONTH FROM date) AS month,
    hll_cardinality(hll_union_agg(users))
FROM daily_uniques
WHERE
    date >= '2012-01-01' AND
    date <  '2013-01-01'
GROUP BY 1;



-- Or how about a sliding seven-day window of uniques
-- (the current day plus the six preceding)?
SELECT
    date,
    #hll_union_agg(users) OVER seven_days
FROM daily_uniques
WINDOW seven_days
AS (ORDER BY date ASC ROWS 6 PRECEDING);



-- Or the number of uniques you saw yesterday that you didn't see today?
SELECT
    date,
    (#hll_union_agg(users) OVER two_days) - #users AS lost_uniques
FROM daily_uniques
WINDOW two_days
AS (ORDER BY date ASC ROWS 1 PRECEDING);

Gotcha #1: Set Intersection Use Case

Here lies the first tricky gotcha. HLL set intersections are less exciting than they sound: in practice it's usually cheap to count new user signups exactly, and HLL's accuracy becomes unbearable when the quantity you're deriving is small, for example the difference between two sets with very high overlap (or the intersection of two sets with very little overlap). Let's take the monthly new users example again. Two HLL instances of monthly_active_user_count would have high overlap between active users. You'd expect roughly 5% of active users to be new, so let's pretend for a moment that that's the true value. HLLs are usually tuned to about 2% error on the total set cardinality, so the derived new-user count lands somewhere around 5%, plus or minus 2 points of the total. In other words, you'd report new users as anywhere between 3% and 7%, which is rather terrible. So you should typically stay away from intersections and stick with unions.
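
To put rough numbers on it, here's a back-of-the-envelope sketch; the figures are purely illustrative, not real data.

-- Illustrative numbers only: 1,000,000 monthly actives, a true new-user
-- rate of 5%, and ~2% error (relative to the full set) on the derived count.
SELECT
    1000000 * 0.05          AS true_new_users,     -- 50,000
    1000000 * (0.05 - 0.02) AS low_end_estimate,   -- 30,000
    1000000 * (0.05 + 0.02) AS high_end_estimate;  -- 70,000

A swing of plus or minus 40% on the number you actually care about is why an exact count from a signups table usually wins here.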

Gotcha #2: Defining Schema

When defining an HLL column, you can set log2m (the base-2 log of the number of registers) and regwidth (the bits per register), which together control the accuracy and the maximum cardinality the structure can represent. You cannot mix HLL instances created with different register counts and widths, so pick the parameters once and use them consistently everywhere.
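
As a minimal sketch (the table name is hypothetical), here is a schema that pins those parameters explicitly rather than relying on the defaults, matching the 13/6 values used in the snippet below.

-- log2m = 13 gives 2^13 = 8192 registers; regwidth = 6 bits per register.
CREATE TABLE daily_uniques_tuned (
    date            date UNIQUE,
    users           hll(13, 6)
);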

Gotcha #3: Insert/Update

Remember to fall back to hll_empty(log2m, regwidth) when the set is empty, since an aggregate over zero rows returns NULL. Again, be sure to pass hll_empty() the same log2m and regwidth as the rest of your HLL values.

When feeding analytics visualizations, few systems handle a mix of NULLs and numbers gracefully, so remember to coalesce to an empty HLL, which reports a cardinality of zero. You'll also likely want to cast the cardinality to BIGINT; if your counts fit comfortably in a small integer, you probably wouldn't be reaching for HLL in the first place.

-- Remember to coalesce if the set is empty,
-- then take the cardinality and cast it.
hll_cardinality(
    COALESCE(
        hll_add_agg(
            hll_hash_text(user_id),
            13,
            6
        ),
        hll_empty(13, 6)
    )
)::BIGINT AS metric_count
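
One related pattern worth sketching is incremental maintenance. The upsert below is my own sketch, not from the original post; it reuses the daily_uniques and facts tables from earlier and relies on the extension's || (union) operator.

-- Fold today's facts into whatever HLL is already stored for that date.
INSERT INTO daily_uniques (date, users)
SELECT
    date,
    hll_add_agg(hll_hash_integer(user_id))
FROM facts
WHERE date = CURRENT_DATE
GROUP BY 1
ON CONFLICT (date)
DO UPDATE SET users = daily_uniques.users || EXCLUDED.users;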

We at Wisdom have been using HyperLogLog to query all sorts of trends instantly, and we've never been happier. Again, you can find the PostgreSQL extension from Citus on GitHub.


Plug: Wisdom is a front-end monitoring tool that helps front-end developers catch and fix bugs faster by combining session replay, error tracking, and developer tools, all in one amazing package.

Wisdom

We develop the systems behind Wisdom. If you're into this sort of stuff, you should apply to join our team.