Presenter of that talk here, very cool to see it being shared here.
Running 1BRC was immensely fun, I learned a ton from it. Had you told me before how far the community would be able to push this, I'd have not believed you.
One main take-away for me was that you could improve performance by one order of magnitude over the baseline basically by just doing a good job and avoiding basic mistakes. The resulting code is still well readable and maintainable. In most scenarios, this is where you should stop.
If you want to improve by another order of magnitude (like leaders in the challenge did), code becomes completely non-idiomatic, super-dense, and hard to maintain. You should go there only where it really, really matters, like when building a database kernel for instance. Or well, when trying to win a coding challenge ;)
* 3h deep-dive into the implementation techniques by Thomas Würthinger and Roy van Rijn, two of the top participants of the challenge: https://www.youtube.com/watch?v=_w4-BqeeC0k
> One main take-away for me was that you could improve performance by one order of magnitude over the baseline basically by just doing a good job and avoiding basic mistakes. The resulting code is still well readable and maintainable. In most scenarios, this is where you should stop.
I’ve done a lot of performance work but never heard this expressed so clearly before. Thanks - I’m stealing that.
I’ve found exactly the same thing in my own work optimising text CRDTs. Just writing crdt code in a straightforward, correct way, without unnecessary allocations (and ideally using some good data structures) will get you very, very far. But there’s another order of magnitude available to anyone interested in using exotic, task specific data structures.
I suspect the same is true in just about every field of computer science. Write good code in go / c# / Java / JavaScript / etc and 99% of the time, performance will be just fine. Well, so long as you don’t do anything silly like pull in immutablejs. But there’s usually 10-100x more performance available if you really try.
If you want some examples, I highly recommend Algorithms for Modern Hardware, available for free online:
This was such a great challenge to follow. I learned a ton about different programming languages, their advantages and disadvantages, their idioms and paradigms.
Overall this was just such a great event! Thanks for organizing it!
Roy van Rijn here, the guy that got an honerable mention for being “first”. Ended up 7th or something.
If you’re interested in even more in-depth tips and tricks that were used in this contest, I highly recommend this three hour (!) long deep dive I did with Thomas Wuerthinger (the winner of 1BRC):
As part of learning zig I did an implementation in zig. It got under 1 second within the contest rules on a 6 year old desktop machine. I need to write up a blog when I get the chance.
Besides the standard tricks to speed up the running time like threads and mmap, the biggest speed up is to use a custom hash table that is tuned for the data. I have implemented 5 different versions of hash tables including the Swiss hash table and more recent researched ones. None of them beats a simple linear probing approach. It’s because there’re so few keys, only couple hundreds of them. Tuning a custom hashing function to minimize key collision helps more than anything.
There are also lots of SIMD operations, which Zig has great support in the form of vector programming.
The only thing novel I did was to line up the temperature data from the back instead of from the front for the SIMD operations. That is for temperatures 12.3, 4.5, 67.8, 0.1, I line them up by the last digit after the decimal before packing them into a SIMD register. That way I know their exact positions in the register.
I had similar results exploring hash tables for a nearly identical task here[0].
It seems like swiss tables are tuned for workloads of mostly gets of keys that are not present. Otherwise the two-level lookup is an extra cost with not much benefit.
IIRC Joad Nacer says he is using a hash table for this (again, essentially identical) task on highload.fun[1]. This was sort if surprising for the other competitors because the top couple solutions below that one use some variety of bloom filter instead.
Yes. Swiss is good for non-duplicate lookups. For highly duplicate data the extra memory fetch for the metadata byte really kills the performance.
For this contest, there’re a billion lookups with only couple hundreds distinct keys. That means for most lookups, the full path of locating the key is executed - hashing, metadata comparison, hashed value comparison, and full key comparison. It’s actually quite expensive. Removing any part from the execution path really helps.
When it comes to custom hash tables, I always remember this video [0] by Strager about implementing the perfect hash table and hash function. Fascinating insights and perf hints.
Gosh, ant thnx for sharing. Hope this would never become a mainstream during coding interview. Imaging - please list top 10 hash table implementations and its time and space complexity
Ha. I didn’t know the detail of any of these tables beforehand. It’s just a matter of doing a bit of research. The main takeaway is there’s no perfect hash table. It all depends on the data and the operations. Contests like this are the perfect excuse to deep dive into these kinds of topic.
Great write-up, thanks a lot! If I may come with a suggestion, please round all the running times to significant digits. Reading 100s of numbers of the form 8.327305 becomes hard in the end, and I'd wager it's actually not more accurate than saying 8.3, or even 8. The only place where you seem to need 2 digits is when comparing the hashes, 0.07 vs 0.01.
Please always round to a meaningful number of significant digits.
I think most of the optimizations were pretty low level and had nothing to do with Java itself. You could implement the same techniques in any other (compiled) language and get similar performance results. Things like cache locality, ILP, work stealing, multithreading, SWAR etc. In fact, the top solutions mostly used unsafe to use off-heap memory to avoid GC and copying of data.
In this particular instance Unsafe was used to skip boundary checks. The main difference between the fastest safe version and the fastest unsafe version was how the file was read. The safe version used MappedByteBuffer to read the file, which under the hood memory maps the file and enforces boundaries. The unsafe version memory mapped the file and then read the memory directly using Unsafe.
The fastest version also used Graal which would've helped with startup time.
Maybe this is covered in the actual video, but one potential target for optimisation that I didn't see in the transcript was how the temperature values are converted to integers.
One suggestion is that, after deleting the period, if the numbers have at most 4 digits and 64-bit multiplication is available, the conversion to an integer can be done with a single such multiplication and a few shifts. Ensure each digit is in the low byte of a separate 16-bit word (that is, "space them out" with a zero byte in between each digit; this may require extra work, though there's often a SIMD "scatter"-type instruction that will accomplish it), subtract the ASCII value of the "0" digit from each 16-bit word's low byte (one instruction), then multiply by a constant which has 1000 in its lowest 16-bit word, 100 in the next-lowest, ..., 1 in the highest. The final answer will now be in the highest word, available with a 48-bit right shift.
If multiplication is slow, another trick to gain some speed is to not actually convert from decimal until you need to, possibly right at the end. That is, just subtract the ASCII value of the "0" digit from each 16-bit word's low byte, and then treat the resulting 64-bit value as a very wasteful BCD representation in which each decimal digit occupies 16 bits. These "integers" can be safely added to each other up to 65535/9=7281 times before there is a risk of overflow; you only need to convert to regular binary then. This representation also honours min and max operations.
I don't think people did exactly that, but most indeed did leverage the fact that values only ranged from -99.9 to 99.9 with exactly one fractional digit and handled them as integers (avoiding FP maths) up until the very end when printing out the results.
Most people seriously underestimate how much power there is in a modern Intel/ARM server.
Many businesses think "we have a lot of products". Nope, even when you have more than a million products we can load all of them into RAM, even with 4 KB of memory allocate to each product.
So a million products is small business from a "load into RAM" perspective. Easily handled on a single server.
i realized this when rewriting the website for a music archive. we've got hundreds of thousands of records - but we could still keep it all in-memory easily on a mid-range smartphone.
Modern hardware has really impressive speed. A single core of my not-so-new M2 laptop can decompress data at more than 4 GB/s using lz4. Sometimes I wonder where all that performance goes, eg when waiting for Slack to open ;)
I'm so tired of every single app requiring a hundred minuscule file system ops and another hundred full round trip web requests to do anything. How do you even write code that badly? Meanwhile I'm over here trying to make sure we won't get screwed when my linear loop has to iterate over 100 million items somehow. Oh, that only takes 50ms?
It's not that much, but it's also not nothing. The input file of the challenge was very narrow, just a relatively short text column and a numeric value. It still is >13 GB uncompressed; an actual table with a few more columns with 1B rows will easily be beyond 100 GB. And chances are it's not the only table in your database, so it won't fit into RAM of most common database machines. I.e. still a non-trivial dataset in most contexts.
Twice as slow as top submissions in C# and C++ which is still pretty good for Go but a likely indicator that memory here has become the major bottleneck.
In theory it could be improved, as Go also has most tools for that, with exception of having to manually dive into SIMD, as discussed in previous threads.
Even though I am not a big fan of Go, I feel that in the spirit of the guild of systems stuff with GC languages it always deserves some cheering nonetheless.
Presenter of that talk here, very cool to see it being shared here.
Running 1BRC was immensely fun, I learned a ton from it. Had you told me before how far the community would be able to push this, I'd have not believed you.
One main take-away for me was that you could improve performance by one order of magnitude over the baseline basically by just doing a good job and avoiding basic mistakes. The resulting code is still well readable and maintainable. In most scenarios, this is where you should stop.
If you want to improve by another order of magnitude (like leaders in the challenge did), code becomes completely non-idiomatic, super-dense, and hard to maintain. You should go there only where it really, really matters, like when building a database kernel for instance. Or well, when trying to win a coding challenge ;)
Some more resources for those interested:
* Blog post with the results: https://www.morling.dev/blog/1brc-results-are-in/
* Show & Tell, featuring implementations in languages other than Java: https://github.com/gunnarmorling/1brc/discussions/categories...
* List of many more blog posts discussing 1BRC in different languages: https://github.com/gunnarmorling/1brc?tab=readme-ov-file#1br...
* 3h deep-dive into the implementation techniques by Thomas Würthinger and Roy van Rijn, two of the top participants of the challenge: https://www.youtube.com/watch?v=_w4-BqeeC0k
> One main take-away for me was that you could improve performance by one order of magnitude over the baseline basically by just doing a good job and avoiding basic mistakes. The resulting code is still well readable and maintainable. In most scenarios, this is where you should stop.
I’ve done a lot of performance work but never heard this expressed so clearly before. Thanks - I’m stealing that.
I’ve found exactly the same thing in my own work optimising text CRDTs. Just writing crdt code in a straightforward, correct way, without unnecessary allocations (and ideally using some good data structures) will get you very, very far. But there’s another order of magnitude available to anyone interested in using exotic, task specific data structures.
I suspect the same is true in just about every field of computer science. Write good code in go / c# / Java / JavaScript / etc and 99% of the time, performance will be just fine. Well, so long as you don’t do anything silly like pull in immutablejs. But there’s usually 10-100x more performance available if you really try.
If you want some examples, I highly recommend Algorithms for Modern Hardware, available for free online:
https://en.algorithmica.org/hpc/
This was such a great challenge to follow. I learned a ton about different programming languages, their advantages and disadvantages, their idioms and paradigms.
Overall this was just such a great event! Thanks for organizing it!
(A silent observer)
Amazing! I've just asked few secs ago to share more posts discussing 1BRC in different languages, and here we go. Thank you!
Roy van Rijn here, the guy that got an honerable mention for being “first”. Ended up 7th or something.
If you’re interested in even more in-depth tips and tricks that were used in this contest, I highly recommend this three hour (!) long deep dive I did with Thomas Wuerthinger (the winner of 1BRC):
https://youtu.be/_w4-BqeeC0k?si=p4NsRFPe6Jtq7HvA
There is so much valuable content in that talk, some things that I didn’t even realize until long after the contest was over.
It’s probably the best and most detailed conference talk I’ve ever done.
1BRC = 1 Billion Row Challenge: "Calculate the min, max, and average of 1 billion measurements"
As part of learning zig I did an implementation in zig. It got under 1 second within the contest rules on a 6 year old desktop machine. I need to write up a blog when I get the chance.
please share! it's such great challenge to map each language internals. here's an idea how to write up https://r2p.dev/b/2024-03-18-1brc-go/
Besides the standard tricks to speed up the running time like threads and mmap, the biggest speed up is to use a custom hash table that is tuned for the data. I have implemented 5 different versions of hash tables including the Swiss hash table and more recent researched ones. None of them beats a simple linear probing approach. It’s because there’re so few keys, only couple hundreds of them. Tuning a custom hashing function to minimize key collision helps more than anything.
There are also lots of SIMD operations, which Zig has great support in the form of vector programming.
The only thing novel I did was to line up the temperature data from the back instead of from the front for the SIMD operations. That is for temperatures 12.3, 4.5, 67.8, 0.1, I line them up by the last digit after the decimal before packing them into a SIMD register. That way I know their exact positions in the register.
I had similar results exploring hash tables for a nearly identical task here[0].
It seems like swiss tables are tuned for workloads of mostly gets of keys that are not present. Otherwise the two-level lookup is an extra cost with not much benefit.
IIRC Joad Nacer says he is using a hash table for this (again, essentially identical) task on highload.fun[1]. This was sort if surprising for the other competitors because the top couple solutions below that one use some variety of bloom filter instead.
0: https://easyperf.net/blog/2022/05/28/Performance-analysis-an...
1: https://highload.fun/tasks/2/leaderboard
Yes. Swiss is good for non-duplicate lookups. For highly duplicate data the extra memory fetch for the metadata byte really kills the performance.
For this contest, there’re a billion lookups with only couple hundreds distinct keys. That means for most lookups, the full path of locating the key is executed - hashing, metadata comparison, hashed value comparison, and full key comparison. It’s actually quite expensive. Removing any part from the execution path really helps.
When it comes to custom hash tables, I always remember this video [0] by Strager about implementing the perfect hash table and hash function. Fascinating insights and perf hints.
[0]: https://youtu.be/DMQ_HcNSOAI
Many videos on hash table are great. I found this one particularly good.
[https://www.youtube.com/watch?v=M2fKMP47slQ] C++Now 2018: You Can Do Better than std::unordered_map: New Improvements to Hash Table Performance
Gosh, ant thnx for sharing. Hope this would never become a mainstream during coding interview. Imaging - please list top 10 hash table implementations and its time and space complexity
Ha. I didn’t know the detail of any of these tables beforehand. It’s just a matter of doing a bit of research. The main takeaway is there’s no perfect hash table. It all depends on the data and the operations. Contests like this are the perfect excuse to deep dive into these kinds of topic.
Great write-up, thanks a lot! If I may come with a suggestion, please round all the running times to significant digits. Reading 100s of numbers of the form 8.327305 becomes hard in the end, and I'd wager it's actually not more accurate than saying 8.3, or even 8. The only place where you seem to need 2 digits is when comparing the hashes, 0.07 vs 0.01.
Please always round to a meaningful number of significant digits.
There should be a book about Java perf / hacks derived from this challenge. I'd buy it fore sure.
I think most of the optimizations were pretty low level and had nothing to do with Java itself. You could implement the same techniques in any other (compiled) language and get similar performance results. Things like cache locality, ILP, work stealing, multithreading, SWAR etc. In fact, the top solutions mostly used unsafe to use off-heap memory to avoid GC and copying of data.
In this particular instance Unsafe was used to skip boundary checks. The main difference between the fastest safe version and the fastest unsafe version was how the file was read. The safe version used MappedByteBuffer to read the file, which under the hood memory maps the file and enforces boundaries. The unsafe version memory mapped the file and then read the memory directly using Unsafe.
The fastest version also used Graal which would've helped with startup time.
Gunnar is brilliant.
If there must be a cookie popup at all, this site is a great example of how to do it.
(Edit: although it kinda then spoils it with the newsletter popup)
Nice write-up! (I read the transcript.)
Maybe this is covered in the actual video, but one potential target for optimisation that I didn't see in the transcript was how the temperature values are converted to integers.
One suggestion is that, after deleting the period, if the numbers have at most 4 digits and 64-bit multiplication is available, the conversion to an integer can be done with a single such multiplication and a few shifts. Ensure each digit is in the low byte of a separate 16-bit word (that is, "space them out" with a zero byte in between each digit; this may require extra work, though there's often a SIMD "scatter"-type instruction that will accomplish it), subtract the ASCII value of the "0" digit from each 16-bit word's low byte (one instruction), then multiply by a constant which has 1000 in its lowest 16-bit word, 100 in the next-lowest, ..., 1 in the highest. The final answer will now be in the highest word, available with a 48-bit right shift.
If multiplication is slow, another trick to gain some speed is to not actually convert from decimal until you need to, possibly right at the end. That is, just subtract the ASCII value of the "0" digit from each 16-bit word's low byte, and then treat the resulting 64-bit value as a very wasteful BCD representation in which each decimal digit occupies 16 bits. These "integers" can be safely added to each other up to 65535/9=7281 times before there is a risk of overflow; you only need to convert to regular binary then. This representation also honours min and max operations.
Thanks for sharing!
I don't think people did exactly that, but most indeed did leverage the fact that values only ranged from -99.9 to 99.9 with exactly one fractional digit and handled them as integers (avoiding FP maths) up until the very end when printing out the results.
Obligatory xkcd: https://xkcd.com/356/
The main takeaway is that 1 billion rows is just not that much nowadays.
A corollary is 'postgres is really all you need' unless you have irrefutable proof it isn't.
Most people seriously underestimate how much power there is in a modern Intel/ARM server.
Many businesses think "we have a lot of products". Nope, even when you have more than a million products we can load all of them into RAM, even with 4 KB of memory allocate to each product.
So a million products is small business from a "load into RAM" perspective. Easily handled on a single server.
i realized this when rewriting the website for a music archive. we've got hundreds of thousands of records - but we could still keep it all in-memory easily on a mid-range smartphone.
Modern hardware has really impressive speed. A single core of my not-so-new M2 laptop can decompress data at more than 4 GB/s using lz4. Sometimes I wonder where all that performance goes, eg when waiting for Slack to open ;)
> eg when waiting for Slack to open ;)
Run a Wireshark capture while opening slack, those ~30ms round trips add up
I'm so tired of every single app requiring a hundred minuscule file system ops and another hundred full round trip web requests to do anything. How do you even write code that badly? Meanwhile I'm over here trying to make sure we won't get screwed when my linear loop has to iterate over 100 million items somehow. Oh, that only takes 50ms?
How do you write code this badly.
It's not that much, but it's also not nothing. The input file of the challenge was very narrow, just a relatively short text column and a numeric value. It still is >13 GB uncompressed; an actual table with a few more columns with 1B rows will easily be beyond 100 GB. And chances are it's not the only table in your database, so it won't fit into RAM of most common database machines. I.e. still a non-trivial dataset in most contexts.
> an actual table with a few more columns with 1B rows will easily be beyond 100 GB
Especially if you need to have supplementary indexes to support common queries on that data.
non-trival yes, but still small enough for a hobbyist. my desktop box has 96GB of ram and I don't feel special about it, nor was it expensive.
@ThePrimeTimeagen trigger ... BRazil mentioned ...
New Go Billion Row Challenge w/ Great Optimizations | Prime Reacts
- https://www.youtube.com/watch?v=SZ1PDS7iRU8
- https://r2p.dev/b/2024-03-18-1brc-go/
Spoiler:
SwissMap: A smaller, faster Golang Hash Table
- https://www.dolthub.com/blog/2023-03-28-swiss-map/
- https://github.com/dolthub/swiss
Want to see Paul Allen's implementation in Rust, Zig, Mojo ... and even Bun.sh, Deno.com
... previously https://github.com/shraddhaag/1brc
Twice as slow as top submissions in C# and C++ which is still pretty good for Go but a likely indicator that memory here has become the major bottleneck.
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...
In theory it could be improved, as Go also has most tools for that, with exception of having to manually dive into SIMD, as discussed in previous threads.
Even though I am not a big fan of Go, I feel that in the spirit of the guild of systems stuff with GC languages it always deserves some cheering nonetheless.
thanx! 2.9 seconds (4.5 GB/s using .NET 8 on my Xeon Mac)