> Currently, ARM is the only major vendor with support for TBI
is not true. Intel and AMD both have variants of TBI on their chips, called Linear Address Masking and Upper Address Ignore respectively. It's a bit of a mess, unfortunately, with both masking off different bits from the top of the address (and different bits than ARM TBI does), but it does exist.
Then it might as well not exist, if it's so messy.
Honestly, TBI seems like a very bad idea, because it breaks forward-compatibility.
Java has been using (or at least had the ability to use) the upper bits for concurrent mark and sweep for a decade - to implement write barriers on objects that are still in the process of being manipulated.
An idea Cliff Click first employed while at Azul, and which has now made its way back into HotSpot.
72057594037927936 addresses ought to be enough for anybody... ;)
The problem is that those addresses are completely interchangeable; nothing stops e.g. malloc() from allocating addresses somewhere around the very top of the legal range instead of starting near the end of .data. In fact, it seems that mmap(3) does pretty much that by default, so reusing the top bits of an address is inherently unreliable: you don't know how many of them are actually unused, which is precisely why x86-64 made addresses effectively sign-extended integers.
You opt in to any of the top-byte masking schemes via prctl on Linux. It's fully forward compatible, in that programs that don't enable it will continue to work as normal. Additionally, Linux won't map memory at addresses higher than 2^48 by default, because non-hardware-accelerated top-bit pointer tagging would have the same problem. I don't think either of your complaints is valid here.
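For reference, this is roughly what the opt-in looks like on an arm64 Linux kernel with the tagged-address ABI; a minimal sketch (the 0x5A tag is arbitrary, error handling kept to the bare minimum):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/prctl.h>

    /* Fallbacks in case the libc headers predate the tagged-address ABI. */
    #ifndef PR_SET_TAGGED_ADDR_CTRL
    #define PR_SET_TAGGED_ADDR_CTRL 55
    #endif
    #ifndef PR_TAGGED_ADDR_ENABLE
    #define PR_TAGGED_ADDR_ENABLE (1UL << 0)
    #endif

    int main(void) {
        /* Ask the kernel to accept tagged pointers at the syscall boundary;
           the dereference itself already works because of ARM TBI. */
        if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE, 0, 0, 0) != 0) {
            perror("tagged address ABI not available");
            return 1;
        }

        uint64_t *p = malloc(sizeof *p);
        *p = 42;

        /* Stash an arbitrary tag (0x5A) in bits 56-63 and dereference anyway. */
        uint64_t *tagged = (uint64_t *)((uintptr_t)p | (0x5AULL << 56));
        printf("%llu\n", (unsigned long long)*tagged);

        free(p);
        return 0;
    }

Programs that never call the prctl keep the old strict behaviour, which is what makes the scheme forward compatible.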
It is worth mentioning that on Intel x86, going all the way back to Haswell, you can use the PDEP/PEXT instructions to efficiently combine the high bits and the low bits into a single seamless tag. This can provide a lot of bit width. The caveat is that AMD implemented these instructions as uselessly slow microcode until quite recently, which created some portability issues.
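A rough illustration of that trick with the BMI2 intrinsics (compile with -mbmi2; the 19-bit tag layout below is made up for the example):

    #include <immintrin.h>  /* _pdep_u64 / _pext_u64 (BMI2, Haswell and later) */
    #include <stdint.h>

    /* Suppose the 16 unused high bits plus 3 alignment low bits together hold
       a 19-bit tag.  The mask marks exactly those bit positions. */
    #define TAG_MASK 0xFFFF000000000007ULL

    static inline uint64_t get_tag(uint64_t ptr) {
        return _pext_u64(ptr, TAG_MASK);            /* gather tag bits contiguously */
    }

    static inline uint64_t set_tag(uint64_t ptr, uint64_t tag) {
        return (ptr & ~TAG_MASK) | _pdep_u64(tag, TAG_MASK);  /* scatter them back */
    }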
> However, that is not something that is easy to create a microbenchmark for. The benefit of nan-boxing is that we don’t have to dereference a pointer to get the float.
That's not the only benefit. The main benefit is arguably that you don't have to allocate floats on the heap and garbage collect them. Numerical code allocates lots of numbers, so having these all be inline rather than heap-allocated saves lots of space and time.
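As a toy illustration of that point (names invented for the example): with NaN-boxing a float becomes a value via a plain bit copy, whereas a boxed representation pays an allocation per number and later GC work to trace it.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef uint64_t Value;                    /* hypothetical NaN-boxed value */

    /* Boxed: every float is a heap object the collector has to visit. */
    typedef struct { double d; } BoxedDouble;
    static Value make_boxed(double x) {
        BoxedDouble *p = malloc(sizeof *p);    /* stand-in for a GC allocation */
        p->d = x;
        return (Value)(uintptr_t)p;
    }

    /* NaN-boxed: the bits of the double *are* the value; no allocation at all. */
    static Value make_inline(double x) {
        Value v;
        memcpy(&v, &x, sizeof v);
        return v;
    }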
The wasted space has administrative overhead, and the administrative overhead can partially poison the cache.
Do you mean wasted space inside the pointers, or on the heap?
If you malloc or roll your own allocator, every allocation has to be big enough to be put back on the free list. And there's the overhead of combining adjacent free segments back together, which will touch additional cache lines at least 12.5% of the time (pointer size over cache line size), and anything larger than a pointer has a higher probability.
If you GC, then it's more pointer chasing during the mark phase, which will thrash the cache of at least one CPU, even if it's not the one where most of the code is running.
On NaN-boxing: it's possible to put tags in the top bits instead of the low bits - 64-bit floats have 52 bits of mantissa, 4 of which are in the top 16. You only get 7 tags, though, since you have to leave room for qNaN and infinity (8 may be possible if you can guarantee that the zero tag never has a zero payload), or 4 for potentially simpler tag checks. Or, going the other direction, you can double the tag count to 14 or 16 by also using the sign bit, at the cost of a "<<1" in the is-float check.
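A sketch of that layout, with the seven tag prefixes sitting between +infinity and the quiet NaNs the hardware actually produces (macro names made up, sign bit left unused):

    #include <stdint.h>
    #include <string.h>

    typedef uint64_t Value;

    /* Bits 63..48 of a double hold the sign, the 11 exponent bits and the top
       4 mantissa bits.  Prefix 0x7FF0 is +inf (and low signalling NaNs),
       0x7FF8..0x7FFF are quiet NaNs, so 0x7FF1..0x7FF7 are free: seven tags.
       Negative doubles start at 0x8000 and are untouched. */
    #define MAKE(tag, payload)  ((((uint64_t)0x7FF0 + (tag)) << 48) | (payload))
    #define TAG_OF(v)           (((v) >> 48) - 0x7FF0)        /* 1..7 */
    #define PAYLOAD_OF(v)       ((v) & 0x0000FFFFFFFFFFFFULL) /* low 48 bits */

    static inline int is_double(Value v) {
        uint64_t top = v >> 48;
        return top < 0x7FF1 || top > 0x7FF7;   /* anything outside the tag range */
    }

    static inline double as_double(Value v) {
        double d;
        memcpy(&d, &v, sizeof d);
        return d;
    }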
I had never heard of "top byte ignore," but it reminds me of the macOS migration to "32-bit clean" software as the hardware began to support more than the low 24 address lines.
https://en.wikipedia.org/wiki/Classic_Mac_OS_memory_manageme...
The other approach is CompressedOops, where instead of wasting pointer bits (and maybe using them for tags), Java's HotSpot VM chooses to store only a 32-bit offset for an eight-byte-aligned heap object, provided the entire heap is known to fit within 2^(32+3) bytes - 32 GB - of its base address.
https://news.ycombinator.com/item?id=22398251
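The decode is just a shift and an add; roughly (a sketch - HotSpot also has a zero-base mode where the base is simply 0):

    #include <stdint.h>

    /* With 8-byte object alignment, a 32-bit "compressed oop" can address
       2^32 * 8 bytes = 32 GB of heap relative to heap_base. */
    static uint8_t *heap_base;   /* set once when the heap is reserved */

    static inline void *decode_oop(uint32_t narrow) {
        return heap_base + ((uint64_t)narrow << 3);
    }

    static inline uint32_t encode_oop(void *p) {
        return (uint32_t)(((uint8_t *)p - heap_base) >> 3);
    }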
And didn't somebody write about creating a large aligned arena for each type and essentially grabbing the base address of the arena as a (non-unique) type tag for its objects? Then the moving GC would use these arenas as semispaces.
I like how SBCL does it. Heap box addresses have their bit 0 set, which makes them odd and thus unfit for direct access. But real accesses are register indirect with offset, with odd offsets to get an even address. So each time you see an odd address offset in SBCL-generated assembly, you know you're dealing with a heap box. I can only surmise this was a deliberate choice to aid orientation when reading generated assembly. If so, somebody among the designers of SBCL has a heart for crazy people like me.
On the other hand, this may make it worse for aarch64 & RISC-V, which can have shorter encodings for loads/stores with an immediate offset that's a multiple of the operated-on data: https://godbolt.org/z/zT5czdvWc
One huge benefit of keeping boxed objects at odd values in the low bits is that you can implement addition/subtraction of the integers without removing the flag bit, doing the operation, and then re-adding it; instead you can just add two values with the regular add instruction and use the result (since the lowest bit won't change). On the other hand, offsetting the pointer doesn't make much of a difference, since heap values will often be accessed with an offset anyway (and offset load/store instructions are more or less free in most cases, so subtracting one from an offset doesn't change the cost).
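Concretely, with small integers tagged 0 and heap pointers tagged 1 (a sketch of the usual scheme; overflow checks omitted, field offset invented):

    #include <stdint.h>

    typedef intptr_t Value;   /* low bit 0: small integer, low bit 1: heap pointer */

    static inline Value    box_int(intptr_t i) { return i << 1; }   /* tag stays 0 */
    static inline intptr_t unbox_int(Value v)  { return v >> 1; }

    /* Addition needs no untag/retag: (2a) + (2b) == 2(a+b). */
    static inline Value add(Value a, Value b)  { return a + b; }

    /* Field access on a tagged pointer just folds the -1 into the offset. */
    static inline Value first_field(Value obj) {
        return *(Value *)((char *)obj - 1);
    }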
I suspect, but haven't properly measured, that pointer tagging upsets speculative loads / branch prediction (when you're loading an address) to a varying extent across different tagging schemes and different CPU implementations. I'd hope setting low bits is cheaper than high bits, but I really should write the microbenchmark to find out someday. Anyone know of existing attempts to characterise that?
Why would that impact speculative loads/branch prediction? The pointers are untagged before they are accessed so it should not impact the loads.
You want the address to be visible to the CPU somewhat early so that the target might already be in the cache before you use it. I'd expect pointer tagging to obstruct that mechanism - in the worst case, codegen might mask out the bits immediately before the memory operation. I don't know how transparent this sort of thing is to the core in practice, and I haven't found anyone else measuring it.
That's not really how out-of-order execution in CPUs works. The address doesn't have to be fully computed X cycles before a load in order for it to issue. Loads issue as their dependencies are computed: requiring an additional operation to compute the address means your address is essentially delayed by one cycle - but that's latency, not throughput, and it only actually makes your code slower if your pipeline stalls.
Data memory-dependent prefetchers are a thing (..with expected side-channel potential), and tagging would conceivably make it non-functional. Though, realistically, I wouldn't expect for it to make much difference.
I'm fairly certain that the lower bits are masked away on memory reads by pretty much everything that has an on-board cache anyhow, so they're fair game. Some ISAs even mandate this masking for larger-than-byte loads.
My guesswork for x64 would be that all is good if dereferencing the tagged value would hit in the same cache line as dereferencing the untagged value. Though I could also be persuaded that x64 completely ignores the top 16 bits until the last moment (to check consistency with the 17th bit) in which case high tagging would be free. It seems relatively likely to be something that is different across different x64 implementations. But so far I'm just running with "it's probably fine, should benchmark later"
This doesn't mention split tagging (or of course, the best answer of "no tagging", which is often best implemented as "tag stored in debug mode, but not normally since static types are tracked").
If you can reduce your tag to a single bit (object vs primitive), a single byte of tag data can cover 8 variables, and a 64-bit integer can cover a whole 64 variables, plenty for most functions.
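For example, a per-frame bitmap with one bit per local slot (layout invented here; __builtin_ctzll is the GCC/Clang count-trailing-zeros builtin) is all the GC needs to find the pointers:

    #include <stdint.h>

    /* One bit per local: 1 = heap pointer, 0 = raw primitive. */
    struct frame {
        uint64_t  ptr_map;      /* covers up to 64 locals */
        uintptr_t slots[64];
    };

    static void scan_frame(struct frame *f, void (*mark)(void *)) {
        for (uint64_t m = f->ptr_map; m != 0; m &= m - 1) {
            int i = __builtin_ctzll(m);          /* index of next pointer slot */
            mark((void *)f->slots[i]);
        }
    }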
These tag bits are often used by the GC, and there you really don't want to collect all of the tag data together, since it would cause false-sharing issues.
It's not for unrelated values though. Actually, the real observation is that tags aren't useful once you have a value, they're useful to get to the value in the first place.
In a stack frame, all the local variables have their tags together.
For the fields of an object, all the tags are stored together in the object.
Sorry, if the tests were run on a MacBook why are there Intel assembly snippets?
Older MacBooks are Intel, and newer MacBooks claim to have an emulation layer faster than native x86. If nothing else, it's the machine the author had, and it's some data point.
They said it was an ARM MacBook
I wrote the snippets in x86 asm because I assumed that more people are familiar with it than ARM. Also I ran them on an x86 processor as well.