I'm not really convinced by this article. His complaint is that bloom filters are too slow for his purpose, and the cause is the large memory requirements and random memory accesses causing constant cache misses. So, he has 40 million lines and 600 MB of data and it takes 12 seconds for his bloom filter to work. I'd have to dig into it a lot more but my instincts are telling me this is user error because these numbers don't really make sense even on older hardware.
> "64 GiB of memory" & "1-error-per-10k" entries.
That's the size of the BF that gives that error rate (per his calcs - didn't check). So that definitely busts the cache.
I didn't say it would fit in the cache.
Correct. You mentioned your instincts.
Random memory access is really slow (compared to CPU speed). Therefore it's strictly faster to do exactly one random memory fetch (hashtable) than many (bloom filter).
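To make that access-count argument concrete, here is a minimal C++ sketch (my own illustration, not the article's code; the sizes, key, and hash mixer are arbitrary) contrasting one probe into an open-addressing table with the k=8 probes of a classic Bloom filter, which typically land in eight different cache lines:

// Sketch only: contrasts memory touches per lookup for a flat hash table
// (one random access, ignoring collisions) vs a classic Bloom filter with
// k=8 (up to eight random accesses). Arrays are sized to exceed typical caches.
#include <cstdint>
#include <cstdio>
#include <vector>

static uint64_t mix(uint64_t x) {  // splitmix64-style finalizer
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

int main() {
    const size_t words = size_t(1) << 22;    // 4 Mi x 8 B = 32 MiB, larger than most LLCs
    std::vector<uint64_t> table(words, 0);   // toy open-addressing table (keys stored directly)
    std::vector<uint64_t> bloom(words, 0);   // classic Bloom filter bit array

    const uint64_t key = 123456789;

    // Hash-table style: one random fetch (plus possible probing on collisions).
    size_t slot = mix(key) & (words - 1);
    bool in_table = (table[slot] == key);

    // Classic Bloom filter style: k=8 independent positions, usually 8 cache lines.
    bool maybe_in_bloom = true;
    uint64_t h = key;
    for (int i = 0; i < 8; i++) {
        h = mix(h);
        size_t word = (h >> 6) & (words - 1);
        if (((bloom[word] >> (h & 63)) & 1) == 0) maybe_in_bloom = false;
    }

    std::printf("table probe: 1 access (hit=%d), bloom probe: 8 accesses (hit=%d)\n",
                int(in_table), int(maybe_in_bloom));
    return 0;
}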
Did you try to repeat the experiment?
I wonder how the final version with the single hash table stacks up against:
> For example, source IPs belonging to a legitimate Italian ISP should not arrive in a Brazilian datacenter.
You sure about that?
If the entire network is healthy, yes. And crossing two(?) transatlantic cables is not an efficient way to serve a page; NYC is probably closer to Italy by network topology than Brazil is.
But if CF had a data center in Italy and one in Germany, there are very valid reasons for Italian traffic to end up in Germany, like a grid failure, flooding, or fire. So how far you can take this and still meet SLAs is a little tricky.
Of course it's not that simple. This "IP address pop catchment" thing is an area of active research. Even we do weird things, like: https://blog.cloudflare.com/cloudflare-servers-dont-own-ips-...
Yes, I'm sure.
The important word is "should": Italian users should not arrive in a Brazilian datacenter.
Indeed, sometimes they still get there. But in itself, seeing Italian traffic in Brazil is enough to conclude that something is going on. One should not conclude, however, that said traffic should be dropped.
Aside: reading that made me wonder just which of Google and Cloudflare knows more about everyone's web accesses.
The Bloom filter he uses has k=8, that is, 8 cache-line misses per entry (k=8 is in the command-line output). A "blocked Bloom filter" requires only one cache-line miss per entry, with a slightly worse false positive rate.
With a good implementation it would be roughly 10 times faster, e.g. a SIMD implementation. For example, here: https://github.com/FastFilter/fastfilter_cpp/blob/master/src... (there are others).
Thanks for the pointer. The file comment says:
// This is a block Bloom filter (from Putze et al.'s "Cache-, Hash- and Space-Efficient Bloom Filters") with some twists:
// 1. Each block is a split Bloom filter - see Section 2.1 of Broder and Mitzenmacher's "Network Applications of Bloom Filters: A Survey".
// 2. The number of bits set per Add() is constant in order to take advantage of SIMD instructions.
Mentioned papers:
https://www.cs.amherst.edu/~ccmcgeoch/cs34/papers/cacheeffic...
https://www.eecs.harvard.edu/~michaelm/CS222/bloomsurvey.pdf
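For illustration, here is a rough C++ sketch of the cache-line-blocked idea that file comment describes (my own simplification, not the fastfilter_cpp code, and without the SIMD part): one 64-byte block is chosen per key and all eight bits land inside it, so a lookup costs at most one cache miss.

// Sketch of a cache-line-blocked ("split block") Bloom filter: one 64-byte
// block per key, one bit per 64-bit word of that block. Not production code.
#include <cstdint>
#include <vector>

struct BlockedBloom {
    static constexpr size_t kWordsPerBlock = 8;   // 8 x 64 bits = one 64-byte cache line
    std::vector<uint64_t> words;                  // num_blocks * kWordsPerBlock words
    size_t num_blocks;

    explicit BlockedBloom(size_t blocks)
        : words(blocks * kWordsPerBlock, 0), num_blocks(blocks) {}

    static uint64_t mix(uint64_t x) {             // splitmix64-style finalizer
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    void add(uint64_t key) {
        uint64_t h = mix(key);
        uint64_t* block = &words[(h % num_blocks) * kWordsPerBlock];
        for (size_t i = 0; i < kWordsPerBlock; i++) {   // 8 bits, all in one cache line
            h = mix(h);
            block[i] |= 1ULL << (h & 63);               // one bit per word of the block
        }
    }

    bool maybe_contains(uint64_t key) const {
        uint64_t h = mix(key);
        const uint64_t* block = &words[(h % num_blocks) * kWordsPerBlock];
        for (size_t i = 0; i < kWordsPerBlock; i++) {
            h = mix(h);
            if ((block[i] & (1ULL << (h & 63))) == 0) return false;
        }
        return true;
    }
};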
That's the L1. The blocked approach (segmented would have been a better name, imo) still depends on the blocks being resident in L2/LLC. OP's problem is that most of the blocks' pages (not cache lines) are not resident in the LLC.
I wonder about an op-buffer approach, roughly: buffer the BF ops so as to aggregate ops on specific address ranges. Say, first buffer n ops for the filter, sort them by the (relevant) hash bits, then apply the m ops that target a specific page, and so on (rough sketch after this comment).
And optimizing that, it may be possible to process the op-buffer only partially, i.e. add ops until the buffer is full, gather and apply the ops that map to one page segment (4k), and if the rest of the buffer is still all over the place, continue buffering, and rinse and repeat.
[p.s.]
OP's task at hand is an off-line matter whereas general BF approaches assume an on-line use case. By adopting an off-line approach, I should think there remain many opportunities for optimizing the runtime.
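A very rough C++ sketch of that op-buffer idea (the buffer size, names, and flush policy are all invented for illustration; a real version would likely flush per page segment rather than sorting the whole buffer at once):

// Sketch: defer Bloom filter inserts, sort the buffered bit indices, then
// apply them in address order so touches to the same page/cache line group.
#include <algorithm>
#include <cstdint>
#include <vector>

struct BufferedBloom {
    std::vector<uint64_t> bits;      // the big, cache-unfriendly bit array (as 64-bit words)
    std::vector<uint64_t> pending;   // buffered bit indices, not yet applied

    explicit BufferedBloom(size_t num_words) : bits(num_words, 0) {}

    static uint64_t mix(uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    void add(uint64_t key) {
        uint64_t h = key;
        for (int i = 0; i < 8; i++) {                    // k = 8, as in the article
            h = mix(h);
            pending.push_back(h % (bits.size() * 64));   // remember only the bit index
        }
        if (pending.size() >= (size_t(1) << 20)) flush();  // arbitrary buffer size
    }

    void flush() {
        // Sorting by bit index groups ops that fall into the same page /
        // cache line, so the sweep over `bits` is close to sequential.
        std::sort(pending.begin(), pending.end());
        for (uint64_t idx : pending) bits[idx >> 6] |= 1ULL << (idx & 63);
        pending.clear();
    }
};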
No, it's not the L1. It's the L2/L3/LLC. With a regular Bloom filter (not a blocked one), each of the k memory accesses results in an L2/L3/LLC cache miss if the Bloom filter doesn't fit in the cache. And here, it doesn't. With a blocked Bloom filter there is still a cache miss, but only one, not k. So that's 8 times fewer cache misses.
But yes, buffering the entries and then sorting them does make sense. That way, each cache line is only accessed once, and prefetching works well. We used radix sorting for the binary fuse filter (https://arxiv.org/abs/2201.01174). For the blocked Bloom filter, however, I'm not sure it will help, due to the added complexity. In my tests, it didn't. There is the "prefetch" operation, but we had mixed results with that (buffering and sorting was faster).
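For reference, a hedged C++ sketch of the batched-prefetch variant mentioned above (my own illustration, not the binary fuse filter code; __builtin_prefetch is GCC/Clang-specific, and the batch size is arbitrary):

// Sketch: probe a large bit array in small batches. Pass 1 hashes the keys
// and issues prefetches; pass 2 does the dependent reads, which by then are
// hopefully already on their way into the cache.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

static uint64_t mix(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Count how many keys hit a set bit in a plain bit array, probing in batches.
size_t probe_batched(const std::vector<uint64_t>& bits,
                     const std::vector<uint64_t>& keys) {
    constexpr size_t kBatch = 16;                     // arbitrary batch size
    const uint64_t num_bits = bits.size() * 64;
    size_t hits = 0;
    for (size_t i = 0; i < keys.size(); i += kBatch) {
        size_t end = std::min(i + kBatch, keys.size());
        uint64_t idx[kBatch];
        for (size_t j = i; j < end; j++) {            // pass 1: hash + prefetch
            idx[j - i] = mix(keys[j]) % num_bits;
            __builtin_prefetch(&bits[idx[j - i] >> 6]);
        }
        for (size_t j = i; j < end; j++) {            // pass 2: the real reads
            uint64_t b = idx[j - i];
            hits += (bits[b >> 6] >> (b & 63)) & 1;
        }
    }
    return hits;
}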