HyperLogLog


Suppose you want to figure out how many unique visitors your website registered in its log file – we will just assume that each unique IP address is one person.

You can use a grep command to produce a list of entries. Then the entries can be processed like this (using Ruby):

collection = Hash.new   # every line we have already seen
distinct = 0

# Reads the whole file into memory, then walks it line by line.
File.readlines('server_log').each do |line|
  distinct += 1 unless collection[line]   # first time we see this line
  collection[line] = 1
end

puts distinct

This isn’t even especially slow, assuming hash table access is fast – and it generally is. The algorithm performs O(n) hash table accesses, and even if we charge each access O(log(n)), the whole thing runs in O(n*log(n)). Not too bad, especially for something that clean.

So you put your little utility on gitlab.com, a couple of weeks pass, and then somebody from Microsoft logs an issue – when they run your utility, it crashes because it runs out of memory!

So you ask them how big the log files were: about 50GB. Yikes, that’s a little more than can fit into memory on most machines (especially since we will end up storing the data a couple of times). In theory we could read only part of the input and process that, which might help a bit, but we pretty much have to keep the whole hash table in memory, because we never know which element we will need to look up next.

We could solve that problem, a bit unintuitively, by sorting the input stream. Then the next element is either equal to the current element, in which case we don’t increment the distinct counter, or it is a different element and we do. In either case we only need to store the current element, the next element and the counter – O(1) memory, O(n) to walk through the elements, O(n*log(n)) to sort them.
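To make that concrete, here is a minimal sketch of the sort-based approach in Ruby. It assumes the log has already been sorted – for example with the Unix sort utility into a hypothetical server_log.sorted file – so that equal lines sit next to each other and we only ever keep one previous line in memory:

previous = nil
distinct = 0

# Equal lines are adjacent in sorted input, so a line that differs
# from the previous one starts a new distinct value.
File.foreach('server_log.sorted') do |line|
  distinct += 1 if line != previous
  previous = line
end

puts distinct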

We can sort sets that are too large to fit in memory – merge sort adapts easily to external sorting – but that means repeatedly reading from and writing to the disk, which is always going to be slow, even with a fast SSD; that O(n*log(n)) hides O(log(n)) passes over the data on disk.

So let’s dig down into the big bag of algorithms.

Flajolet-Martin Algorithm

I will let Wikipedia introduce this one:

The Flajolet–Martin algorithm is an algorithm for approximating the number of distinct elements in a stream with a single pass and space-consumption which is logarithmic in the maximum number of possible distinct elements in the stream.
Wikipedia

There are two important things to notice:

  1. The algorithm’s space consumption (read: memory) is logarithmic in the maximum number of possible distinct elements in the stream.
  2. The algorithm only approximates the count.

The result we are going to get is going to be a little off – that is okay though, as it is far better than not getting a result at all (and, at least according to Wikipedia, the error is not going to be more than 2-3%, which is likely less than the number of people who share IP addresses, or who accessed our service from more than one address or device).

As a trade-off, the algorithm is extremely cheap – the natural logarithm of a million is less than 13.82, so there is a huge difference between a cost that grows linearly and one that grows only logarithmically.

I won’t step through the algorithm – see the Wikipedia link above for that – but the basic idea is this: instead of recording whether we have seen each element, take a uniform hash of each element and keep track of the largest number of least-significant (trailing) zero bits seen in any hash.

The reason this works is that, statistically, n/2 elements are going to have a 0 in the least significant bit, n/4 are going to have two least-significant bits that are zero, and so on. So if we see an element whose hash has 7 trailing zeroes, we have probably computed on the order of 2^7 hashes – and since only unique elements produce new hashes, that is the same as saying we have seen on the order of 2^7 unique elements.
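Here is a rough sketch of that idea in Ruby – not the full Flajolet–Martin algorithm, just the trailing-zero trick it is built on. It uses the low 64 bits of an MD5 digest as a stand-in for a uniform hash, and a single estimate like this is very noisy; the real algorithm (and descendants like HyperLogLog) combines many such estimates to tame the variance:

require 'digest'

# Count how many of the least-significant bits of n are zero.
def trailing_zeros(n)
  return 64 if n.zero?   # guard: an all-zero hash would loop forever
  count = 0
  count += 1 while ((n >> count) & 1).zero?
  count
end

max_zeros = 0

File.foreach('server_log') do |line|
  # Low 64 bits of MD5, used here as a cheap, roughly uniform hash.
  hash = Digest::MD5.hexdigest(line)[0, 16].to_i(16)
  zeros = trailing_zeros(hash)
  max_zeros = zeros if zeros > max_zeros
end

# If the largest run of trailing zeros we saw is k, we have probably
# hashed on the order of 2^k distinct values.
puts 2 ** max_zeros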