> The behavior of "-Ofast" where it affects globally the content of MXCSR is unacceptable.
> The right behavior would have been for any function compiled with "-Ofast" to save MXCSR, put in it any desired value for the duration of the function execution and restore it at function exit. Moreover, when invoking any other function it should restore the original MXCSR before the call and put again in it the desired local value after the function returns back.
I completely agree here. The problem is that introducing floating-point environment into the programming model requires disabling some of the transformations you want to do with fast-math in the first place, so doing it this way kind of doesn't work to enable what you wanted to do. Furthermore, one of the SPEC benchmarks gets like a 30% speedup if you turn on denormal flushing, and that's the kind of change that gives compiler engineer management heartburn if you want to tell them to forgo that speedup because the optimization is dumb.
I am grateful for blog posts like this one because this does help build a case for getting the compiler to stop doing stupid stuff like globally enabling DAZ/FTZ bits.
Not all CPUs behave badly when handling underflows, denormals or other exceptional cases.
On some CPUs the penalties for exceptional cases are negligible (e.g. on AMD Zen), so it is impossible for any SPEC benchmark to get a 30% speedup when ignoring underflows.
There are also bad CPU models, which have a microprogrammed handling of the exceptional cases, which can add e.g. a penalty of somewhat more than one hundred CPU cycles for each underflow, which is very similar to the penalty for a load from the main memory (which misses the caches).
If one of the SPEC benchmarks has so many underflows that on a bad CPU model it can be sped up by 30% when ignoring the underflows, then it is pretty much guaranteed that when the underflows are ignored the results computed by that benchmark are erroneous.
Cheating at the SPEC benchmark by ignoring the underflows could be easily avoided by changing the benchmarking rules, by providing a file with the valid results and by not accepting any benchmark claims where the results do not match those known to be good.
You can cheat at any SPEC benchmark with bogus compiler options, if there is no constraint on the compiled program to be correct. For instance you could make a compilation option that deletes 90% of the generated machine code, resulting in a 10 times faster benchmark program.
For some reason, the compiler writers that handle the generation of code for floating-point operations feel that they are licensed to generate incorrect code, even when similarly incorrect code would be rejected in other contexts. The fact that the floating-point operations are inexact, so they are accompanied by some inherent errors, does not give the right to a compiler writer or library writer to introduce arbitrary errors in the computations. The reason why a standard exists is precisely for guaranteeing that the errors are bounded and you can predict their effects.
In most cases it is unpredictable when such an improvement happens, and an improvement whose correctness is unpredictable is useless.
It is much more frequent for -ffast-math to generate bigger errors than to benefit from an accidental cancellation of the errors.
The only case when the result of underflows is predictable is when the values whose computation may result in an underflow are immediately added to other values that are known to be big enough, and they are not used for anything else.
It is not frequent to have so much information about which expressions will underflow and which will not, and about the range of possible values of the values that would be added. When there really exists so much information about what must be computed, then there are chances that the computation can be rearranged in a way that would avoid the underflows, in which case there would be no need to change MXCSR to ignore the underflows, because they would not happen.
> It is much more frequent for -ffast-math to generate bigger errors than to benefit from an accidental cancellation of the errors.
In my experience underflows are very uncommon and indicate that you're doing something wrong, whereas fp contractions are extremely common. So I disagree with you.
I agree that using fused multiply-add in almost all cases improves the accuracy of the results.
Nevertheless, I consider that a compiler option that gives you good FMA, but only together with bad behavior at underflows, is something exceedingly stupid.
FMA is a standard operation, while ignoring underflows is rightly prohibited by the floating-point standard. There should never have been a compiler option that mixes standard operations with non-standard operations.
In C or C++, you can use the "fma" function to get FMA where desired.
This is more awkward, but it is safe.
Any better solution requires a change in the programming language. For instance, it could be required that adjacent * and + must always be contracted into an FMA, unless separated by parentheses. This would allow the programmer to select between FMA and FMUL+FADD, as desired.
FMA appeared in 1990 (in IBM POWER) and it has been the greatest advance in floating-point computation since the introduction of the Intel 8087 in 1980 (which resulted in the IEEE standard a few years later). It is weird that 34 years later, most, if not all, programming languages do not handle the generation of FMA operations well.
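To make the accuracy point about FMA concrete, here is a minimal Python sketch. It does not call a hardware FMA; `fused_multiply_add` is a hypothetical helper that emulates one by computing `a*b + c` exactly with rational arithmetic and rounding once at the end. The input values are chosen for illustration only.

```python
from fractions import Fraction

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate fma(a, b, c): compute a*b + c exactly, then round once."""
    # Fraction arithmetic on floats is exact; float() rounds to nearest.
    return float(Fraction(a) * Fraction(b) + Fraction(c))

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = a * b + c                  # a*b rounds to 1.0 first, so the sum is 0.0
fused = fused_multiply_add(a, b, c)  # keeps the -2**-60 term the rounding discarded

print(unfused, fused)
```

The separate multiply and add lose the entire result, while the single-rounding version keeps it. That is the benefit of contraction, and it is independent of how underflows are treated.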
This sounds quite a lot like my allergy to using floating point for doing anything critical at all. It's more difficult to force deterministic results on all platforms for float than it is for known-bitwidth-integers.
I once accidentally inherited the fast math flag from a parent project into builds of LuaJIT. The result was that LuaJIT's tagged unions, which are used pretty much everywhere and rely on normal fp functionality, broke drastically. Type checking for all values in the VM broke.
Some arguably badly written programs depend on different execution contexts (across machines, processes, whatever) having consistent FP behavior for the same inputs. Having the FP behavior change out of your control can result in a desync across those contexts. Depending on the exact error behavior, these can be real hard to track down.
Well, the program does different things. Some people might like that. Others might not. It's just one more factor in the generally observed quality of software.
The original sin here is that original 1980s designs carry over: the processor retains FP unit state, rather than each instruction indicating what subnormal flush mode (or rounding mode or whatever) one wishes to use with no retained FP unit state. See also: the IEEE FP exception design (e.g., signaling NaNs) causing havoc with SIMD, deep pipelining, out-of-order execution etc.
It might or might not, depending on whether the compiler leaves corresponding state in the intermediate files. This might toggle depending on -flto. -Ofast is not a terribly good idea really.
Is there a simple way to check if my Python script is affected? Because I guess numpy only complains if the FPU has been screwed up before it loads (not if some other package loaded later does it)?
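One possible check, as a rough sketch: halve the smallest normal double at run time and see whether the result is zero. This assumes CPython doing ordinary hardware double arithmetic; `subnormals_flushed` is a made-up name, not part of numpy or any other package. Call it after all imports have finished, since a library loaded later could still flip the mode.

```python
import sys

def subnormals_flushed() -> bool:
    """Return True if halving the smallest normal double yields zero."""
    smallest_normal = sys.float_info.min   # 2**-1022
    # Computed at run time from a variable, so the result reflects the
    # current FPU mode rather than a constant folded at compile time.
    half = smallest_normal * 0.5           # subnormal 2**-1023 under IEEE 754
    return half == 0.0

if __name__ == "__main__":
    print("FTZ/DAZ appears to be", "ON" if subnormals_flushed() else "off")
```

The floating-point control state is per thread on x86, so this only reports on the thread that runs it.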
After thinking about this a bit, my conclusion is that it's a deficiency in the calling convention. It should specify the floating point flags as either caller saved or callee saved.
Kind of related topic: Using libraries is a pain - Is there any evidence that people are using LLM coding tools to write library functions instead of importing libraries? How would we tell? Can you think of second order effects from this?
The LLM coding tools will likely reproduce the errors from the code they were trained on,
including compiler options like -Ofast.
Because these tools are generating text based on tokens, the -Ofast option can and likely will appear in contexts where it is completely inappropriate.
Programmers who use this code will propagate errors found and fixed 2 years ago.
2001 - GCC 3.0: - Introduction of `-ffast-math`: The `-ffast-math` flag was introduced, enabling aggressive floating-point optimizations at the cost of strict IEEE compliance.
2004 - GCC 3.4: - Refinements and Extensions: Additional optimizations added, including better handling of subnormal numbers and more aggressive operation reordering.
2007 - GCC 4.2: - Introduction of Related Flags: `-funsafe-math-optimizations` flag introduced, offering more granular control over specific optimizations.
2010 - GCC 4.5: - Improvements in Vectorization: Enhanced vectorization capabilities, particularly for SIMD hardware using SSE and AVX instruction sets.
2013 - GCC 4.8: - More Granular Control: Introduction of flags like `-fno-math-errno`, improving efficiency by assuming mathematical functions do not set `errno`.
2017 - GCC 7.0: - Enhanced Complex Number Optimizations: Improved performance for complex number arithmetic, benefiting scientific and engineering applications.
2021 - GCC 11.0: - Better Support for Modern Hardware: Optimizations leveraging modern CPU architectures and instruction sets like AVX-512.
2024 - GCC 13.0 (Experimental): - Experimental Features: Additional optimizations focused on new CPU features and better handling of edge cases.
Sources: - GCC documentation archives - Release notes from various GCC versions - [GCC Wiki](https://gcc.gnu.org/wiki/) - [Krister Walfridsson's blog](https://kristerw.github.io)
Your timeline is missing:
2012: I file the obvious bug:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55522
Early 2023: bug is fixed.
Feb 2024: clang follows suit:
https://github.com/llvm/llvm-project/pull/80475
And maybe later this year this problem will finally be gone in common Linux distros.
I don't want to read through 11 years of comments, but what observable effects does corrupting MXCSR have? Surely it's just enabling FTZ/DAZ mode for the FPU, which could harm bit-exactness in floating point math... but in code that cares about it, it's not uncommon to see RAII guards to configure the behavior you want (and usually, it's desirable to not comply with IEEE754 w.r.t. subnormal numbers).
Like it's a quirk that floating point math can be non-deterministic but how bad is it, and is it actually a bug?
I think the biggest thing is that programmers can safely assume that a floating point `x-y` is nonzero if `x != y`. You can actually go farther and know that it's an exact computation (with no error) if the two are close [1]. But both results only hold if subnormals don't flush to or behave like zero.
It's not too hard to imagine how an algorithm might depend upon that — there could be a branch for the case where `x == y` and then a branch that relies upon dividing by `(x-y)` and assumes that it's not a division by zero.
1. https://en.wikipedia.org/wiki/Sterbenz_lemma
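A small Python sketch of both claims, assuming the interpreter is running with ordinary gradual underflow (no FTZ/DAZ). The two values are picked for illustration so that their difference lands in the subnormal range.

```python
import sys
from fractions import Fraction

x = sys.float_info.min * 1.5   # 1.5 * 2**-1022, a normal double
y = sys.float_info.min         # 2**-1022, the smallest normal double

diff = x - y                   # 2**-1023, a subnormal

# y/2 <= x <= 2*y, so the lemma says the subtraction is exact.
exact = Fraction(x) - Fraction(y)

print(x != y, diff != 0.0, Fraction(diff) == exact)
```

With subnormals available all three checks should hold. If results are flushed to zero, `diff` becomes `0.0` even though `x != y`.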
I don't think an algorithm that relies on this is particularly well designed. Anything that trusts a float is non-zero is probably some kind of division where you must avoid division by zero. In that case you should explicitly check for the zero condition instead of relying on the semantics of real numbers, since floats are not real numbers.
That’s precisely the misapprehension that makes folks think that -ffast-math is fine. Each and every floating point number is an exact quantity. They are real mathematical objects that identify exact values. They just have limited precision and might round (think snap) to the nearest representable number after each computation. You might not use them that way, but it doesn’t mean others shouldn’t.
For example, floating point without can represent two numbers whose difference is smaller than the number can represent and rounds to zero. Subnormal floating point doesn't fix this, it just moves the impacts of underflow to smaller magnitudes that are less likely to be seen.
Look, -ffast-math isn't always fine, but specifically I'm looking at DAZ/FTZ enabled, possibly non-deterministically process-wide (which is bad, don't get me wrong!). But part of why this doesn't faze people is that in practice, programs don't care and the observable effects don't lead to bugs.
> floating point without can represent two numbers whose difference is smaller than the number can represent and rounds to zero
Did you intend to say "without FTZ/DAZ enabled"? If so, that's completely and provably false.
https://en.wikipedia.org/wiki/Sterbenz_lemma
Perhaps a better example than `z/(x-y)` is `log(x-y)`. Unlike division by 0.0, `log(0.0)` often throws an immediate error whereas `log(5e-324)` is a finite — and meaningful! — result.
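In Python terms (a sketch using the standard `math` module; other languages may return `-inf` instead of raising):

```python
import math

tiny = 5e-324                 # smallest positive subnormal double, 2**-1074
print(math.log(tiny))         # finite, about -744.44

try:
    math.log(0.0)             # what the same call sees if tiny is flushed to zero
except ValueError as err:
    print("log(0.0) failed:", err)
```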
Wouldn't the compiler optimize (x-y)!=0 to x!=y? Seems like a good optimisation to me and one probably in accordance with the C standard.
It would also make it impossible to have a decent non-zero check for
if ((x-y)!=0) progress((x-t)/(x-y));
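A rough Python emulation of why the two guards stop being interchangeable once subnormal results are flushed. `flush_to_zero` is a hypothetical helper that mimics FTZ in software, and this `progress` is an illustrative stand-in for the function named above. Python raises `ZeroDivisionError` where C would produce `Inf` or `NaN`.

```python
import sys

def flush_to_zero(v: float) -> float:
    """Hypothetical helper: emulate FTZ by zeroing subnormal results."""
    return 0.0 if abs(v) < sys.float_info.min else v

def progress(x: float, y: float, t: float, ftz: bool) -> str:
    d = flush_to_zero(x - y) if ftz else x - y
    if x != y:               # the "optimized" form of the guard
        try:
            return f"(x-t)/(x-y) = {(x - t) / d}"
        except ZeroDivisionError:
            return "guard passed, but x-y was flushed to zero"
    return "skipped"

x = sys.float_info.min * 1.5
y = sys.float_info.min
print(progress(x, y, 0.0, ftz=False))
print(progress(x, y, 0.0, ftz=True))
```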
Plenty of OSS authors are contributing from work, so there's a good chance they're running LTS distros. Give it another few years for those to get updated.
As always, -funsafe-math-optimizations are neither fun nor safe
I find it interesting that -ffast-math includes -funsafe-math-optimizations.
Having an option that says "unsafe" creates the expectation that unsafe options are labeled like that, and then enabling said unsafe option from another option that doesn't have "unsafe" in the name is a bit surprising.
It's +funsafe for those.
-movflags +funsafe
(2022)
I remember this one because of this part:
> Unbeknownst to me, even with --dry-run pip will execute arbitrary code found in the package's setup.py. In fact, merely asking pip to download a package can execute arbitrary code (see pip issues 7325 and 1884 for more details)! So when I tried to dry-run install almost 400K Python packages, hilarity ensued. I spent a long time cleaning up the mess, and discovered some pretty poor setup.py practices along the way. But hey, at least I got two free pictures of anime catgirls, deposited directly into my home directory. Convenient!
I've had to clean up after npm packages as well. Git hooks I didn't ask for. Shell modifications I didn't ask for. I hope that, one day, these package managers will utilize sandboxing/containerization to avoid messes like this.
Python and node software is something I always try to run in containers for convenience and security reasons
Here's the line that copies those pictures in: https://github.com/akazukin5151/koneko/blob/master/setup.py#...
Why is this ecosystem considered acceptable?
It’s because academia is a sort of anchor that preserves poor practices.
> Why is this ecosystem considered acceptable?
It largely isn't, but python is a big ship and it takes time to turn. There's been a lot of movement in python packaging semi-recently and as far as I can tell using setup.py has been considered legacy for a while now.
Biggest problem is that the new way to do packaging is not documented very well. It's split over three or four different projects. It mostly amounts to "you need to make a pyproject.toml file" but it's somewhat tricky to find what such a file is even supposed to contain.
Speaking of poor programming practices, how about copy-pasting code without understanding it? Notice the author's mention of how many packages with -Ofast trace their lineage back through this comment:
# Initially copied from
# https://github.com/actions/starter-workflows/blob/main/ci/python-package.yml
# And later based on the version jamadden updated at
# gevent/gevent, and then at zodb/relstorage and zodb/perfmetrics
Although the linked config was modified 2 years ago to remove the unsafe math options[1], the copy-paste propagated before then.
Naturally anyone asking Copilot, ChatGPT, or another modern LLM-based interface for config code will likely get something based on this, with the problematic -Ofast option included.
[1] https://github.com/zopefoundation/meta/commit/9c07520e90d9a1...
Not to mention that the Python packaging tutorial was missing a few bits, at least as of earlier this year: https://daveon.design/introducing-fontimize-subset-fonts-to-...
Again, why is this acceptable?
Okay, it's unacceptable. Now what? Sadly, we (you, I, the Python leadership) can't just visit every single Python package publisher/maintainer and politely but firmly ask them to fix their packaging issues. And even if we could, at least half of them would either ignore us or would mess up in some new, exciting ways while trying to adapt to the flavour du jour of Python packaging.
I think the answer is to not have a flavour du jour. Pick one, crown it the winner, document it well and give it some time. If the solution is always changing, nobody is going to want to put the effort in.
This is exactly our problem. We develop a couple of packages (which we don’t put on PyPI so any security issue affects nobody except us). But it’s not our job and we cannot justify someone spending too much time on this. I have a look every now and then to try to figure out what we’re supposed to do, but there are 154 solutions, some of them are outdated and unmaintained, some of them are broken in more or less subtle ways, and some of them have a bare minimum of documentation. So yeah. We’ll change when we have a clear, documented path forward.
Just put your package folder into a tarball as-is; to install, execute
I'm only half-joking.
> I think the answer is to not have a flavour du jour. Pick one, crown it the winner, document it well and give it some time.
It has been done with distutils: it has been around for more than a decade. The result is the ecosystem that you are complaining about.
This is really a great idea. Once we finalize the choice, when can we expect the upload into the hivemind?
Right after the maintainers of libtelepathy.so release the version with Discord support.
Well it's an ecosystem that is meant to be easy to use by people who don't know what they're doing.
The same problems happen in any language that links C and C++ libraries largely.
It's less common for other languages because they just don't do it as much.
You'll have to ask every developer why it's acceptable, because they keep using this ecosystem. And then ask their managers. But this is what happens when you don't use package ecosystems with maintainers being run by adults.
If I wanted to compile a Go binary using an existing dynamically loaded C library, compiled with -ffast-math, what happens?
The "--only-binary=:all:" option should force Pip to only install packages that provide wheels, which doesn't run arbitrary setup.py code (right? I've been assuming it doesn't and really hope that is a valid assumption).
That will cause some installations to fail if wheels are not available. However, given that wheels are increasingly common (even for pure-python "source" packages), this can be used as a sort of bisect to enumerate and isolate/audit/file issues on/sandbox/etc. the remaining setup.py-based distributions.
Depends on the platform you're on, some platforms have very few wheels.
Naive question, does this have the same effect if I use virtual environments?
Yes, because the line that does it is literally:
that's a feature. You can conveniently do
rather than the harder to remember
Yes. Virtual environments aren’t a full sandbox.
I don't use pip but tbh this makes me much less likely to even try it. So many things about it sound unsafe.
Like dry-run not even working as you'd expect? I wonder if that got fixed after this article?
Probably the fix will be to rename the option --dank-run
What's the real-world consequences of having the floating point behavior changed? The article mentions some types of iterative algorithms but it's not clear how often those would be used. Would be interested to know what actual issues arose in any downstream projects.
In the case of flushing subnormals to zero, it's easy to end up with divides by zero where you otherwise wouldn't. `0/0.0` is `NaN` but `0/subnormal` is `0`.
In other cases, `-ffast-math` just introduces arbitrary and strange behaviors. Sometimes you end up with higher precision than you expected. Other times you end up with less. Other times it'll helpfully just re-arrange things such that it's a zero. For example, the classical Kahan summation does the following:
https://en.wikipedia.org/wiki/Kahan_summation_algorithm
A -ffast-math compiler will see that, algebraically, you can just substitute `sum + y` into the equation for `c` and get 0. It's `sum + y - sum - y`. And that's true for real maths. But it's not true for floating point numbers.
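For reference, a minimal Python sketch of the textbook algorithm next to a plain loop. This is not the code from the comment above, and Python itself is not subject to -ffast-math; `total` plays the role of `sum`, and the input list is made up to make the lost bits visible.

```python
def naive_sum(values):
    total = 0.0
    for x in values:
        total += x
    return total

def kahan_sum(values):
    total = 0.0   # plays the role of `sum` in the text
    c = 0.0       # running compensation for lost low-order bits
    for x in values:
        y = x - c
        t = total + y
        c = (t - total) - y   # algebraically 0, numerically the rounding error
        total = t
    return total

values = [1.0] + [1e-16] * 10_000
print(naive_sum(values))   # each 1e-16 is below half an ulp of 1.0 and is lost
print(kahan_sum(values))   # close to 1.000000000001
```

If a compiler replaces `c` with the constant 0, `kahan_sum` collapses into `naive_sum` and the compensation is gone.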
It explicitly destroys any attempt at _working with_ floating point numbers.
Computation errors that will be discovered only with great difficulty, when someone is careful enough to use some additional verification methods for the numeric results. The magnitude of the error caused by an underflow is completely unpredictable, which is why in any serious program it is unacceptable to ignore the underflows.
Even when the errors happen to be discovered soon enough, people will be puzzled about their origin and they may waste a lot of time analyzing their algorithms, because there are a lot of things that can cause numeric errors.
Because at their origin the underflows will neither signal any exception nor generate any unusual value, the errors will be normally caught much later, perhaps after thousands of other operations, when their cause will not be obvious.
This fight between the people who want only correct results from computers and the people who do not care whether the results are erroneous, as long as they arrive a few percent sooner than correct results would, has continued for decades, almost since electronic computers were invented.
Errors are acceptable in games or for some other graphics or audio applications that generate ephemeral images or sounds, but they are not acceptable for any engineering purposes.
The option "-Ofast" is like smoking. There is no doubt that it is a bad habit, but whoever wants to smoke should be free to do it. On the other hand, exactly like a smoker should not be permitted to smoke in a closed room with non-smokers, the option "-Ofast" and the related options should not be permitted to alter the behavior of any other programs that are linked with an object file compiled with it.
The behavior of "-Ofast" where it affects globally the content of MXCSR is unacceptable.
The right behavior would have been for any function compiled with "-Ofast" to save MXCSR, put in it any desired value for the duration of the function execution and restore it at function exit. Moreover, when invoking any other function it should restore the original MXCSR before the call and put again in it the desired local value after the function returns back.
> The behavior of "-Ofast" where it affects globally the content of MXCSR is unacceptable.
> The right behavior would have been for any function compiled with "-Ofast" to save MXCSR, put in it any desired value for the duration of the function execution and restore it at function exit. Moreover, when invoking any other function it should restore the original MXCSR before the call and put again in it the desired local value after the function returns back.
I completely agree here. The problem is that introducing floating-point environment into the programming model requires disabling some of the transformations you want to do with fast-math in the first place, so doing it this way kind of doesn't work to enable what you wanted to do. Furthermore, one of the SPEC benchmarks gets like a 30% speedup if you turn on denormal flushing, and that's the kind of change that gives compiler engineer management heartburn if you want to tell them to forgo that speedup because the optimization is dumb.
I am grateful for blog posts like this one because this does help build a case for getting the compiler to stop doing stupid stuff like globally enabling DAZ/FTZ bits.
Not all CPUs behave badly when handling underflows, denormals or other exceptional cases.
On some CPUs the penalties for exceptional cases are negligible (e.g. on AMD Zen), so it is impossible for any SPEC benchmark to get a 30% speedup when ignoring underflows.
There are also bad CPU models, which have a microprogrammed handling of the exceptional cases, which can add e.g. a penalty of somewhat more than one hundred CPU cycles for each underflow, which is very similar to the penalty for a load from the main memory (which misses the caches).
If one of the SPEC benchmarks has so many underflows that on a bad CPU model it can be sped up by 30% when ignoring the underflows, then it is pretty much guaranteed that when the underflows are ignored the results computed by that benchmark are erroneous.
Cheating at the SPEC benchmark by ignoring the underflows could be easily avoided by changing the benchmarking rules, by providing a file with the valid results and by not accepting any benchmark claims where the results do not match those known to be good.
You can cheat at any SPEC benchmark with bogus compiler options, if there is no constraint on the compiled program to be correct. For instance you could make a compilation option that deletes 90% of the generated machine code, resulting in a 10 times faster benchmark program.
For some reason, the compiler writers that handle the generation of code for floating-point operations feel that they are licensed to generate incorrect code, even when similarly incorrect code would be rejected in other contexts. The fact that the floating-point operations are inexact, so they are accompanied by some inherent errors, does not give the right to a compiler writer or library writer to introduce arbitrary errors in the computations. The reason why a standard exists is precisely for guaranteeing that the errors are bounded and you can predict their effects.
It's not so easy because -ffast-math can actually be more correct than running without.
In most cases it is unpredictable when such an improvement happens, and correctness that is unpredictable is useless.
It is much more frequent for -ffast-math to generate bigger errors than to benefit from an accidental cancellation of the errors.
The only case when the result of underflows is predictable is when the values whose computation may result in an underflow are immediately added to other values that are known to be big enough, and they are not used for anything else.
It is not frequent to have so much information about which expressions will underflow and which will not, and about the range of the values that would be added. When there really is so much information about what must be computed, then there are chances that the computation can be rearranged in a way that would avoid the underflows, in which case there would be no need to change MXCSR to ignore the underflows, because they would not happen.
> It is much more frequent for -ffast-math to generate bigger errors than to benefit from an accidental cancellation of the errors.
In my experience underflows are very uncommon and indicate that you're doing something wrong, whereas fp contractions are extremely common. So I disagree with you.
I agree that using fused multiply-add in almost all cases improves the accuracy of the results.
Nevertheless, I consider that a compiler option that gives you good FMA, but only together with bad behavior at underflows, is something exceedingly stupid.
FMA is a standard operation, while ignoring underflows is rightly prohibited by the floating-point standard. There should never have been a compiler option that mixes standard operations with non-standard operations.
In C or C++, you can use the "fma" function to get FMA where desired.
This is more awkward, but it is safe.
Any better solution requires a change in the programming language. For instance, it could be required that adjacent * and + must always be contracted into an FMA, unless separated by parentheses. This would allow the programmer to select between FMA and FMUL+FADD, as desired.
FMA appeared in 1990 (in IBM POWER) and it has been the greatest advance in floating-point computation since the introduction of the Intel 8087 in 1980 (which resulted in the IEEE standard a few years later). It is weird that 34 years later, most, if not all, programming languages do not handle well the generation of FMA operations.
fp contract isn't a uniform "decrease in error" either, though. As a simple example, it introduces error in the straightforward:
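One well-known case, which may or may not be the one the commenter had in mind, is `x*x - y*y` with `x == y`. Evaluated with two rounded products it is exactly zero, but if the compiler contracts it to `fma(x, x, -(y*y))` the rounding error of `y*y` survives. The sketch below emulates a fused multiply-add with exact rationals; `fma_emulated` is a hypothetical helper for finite inputs only, not a standard library function.

```python
from fractions import Fraction

def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulate fused multiply-add: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

x = 0.1
y = 0.1

# Two separately rounded products: the same rounding error on both sides cancels.
separate = x * x - y * y

# Contracted form: x*x is kept exact inside the FMA, y*y has already been rounded.
contracted = fma_emulated(x, x, -(y * y))

print(separate)
print(contracted)
```

The contracted result is the (tiny, nonzero) rounding error of `0.1 * 0.1`, so code that relies on the expression being exactly zero, or on its sign, can change behavior.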
This sounds quite a lot like my allergy to using floating point for doing anything critical at all. It's more difficult to force deterministic results on all platforms for float than it is for known-bitwidth-integers.
I once accidentally inherited the fast math flag from a parent project into builds of LuaJIT. The result was that LuaJIT's tagged unions, which are used pretty much everywhere and rely on normal fp functionality, broke drastically. Type checking for all values in the VM broke.
Some arguably badly written programs depend on different execution contexts (across machines, processes, whatever) having consistent FP behavior for the same inputs. Having the FP behavior change out of your control can result in a desync across those contexts. Depending on the exact error behavior, these can be real hard to track down.
Well, the program does different things. Some people might like that. Others might not. It's just one more factor in the generally observed quality of software.
No AI expert here, but my first thought was that it could be an attack vector to screw lots of AI projects.
The original sin here is that original 1980s designs carry over: the processor retains FP unit state, rather than each instruction indicating what subnormal flush mode (or rounding mode or whatever) one wishes to use with no retained FP unit state. See also: the IEEE FP exception design (e.g., signaling NaNs) causing havoc with SIMD, deep pipelining, out-of-order execution etc.
That's very important info for some pythonistas, thanks for sharing!
The bit about the behavior propagating through shared libraries is yet another reason to prefer static linking.
It propagates through static linking too, if anything in the program is compiled with -Ofast
It might or might not, depending on whether the compiler leaves the corresponding state in the intermediate files. This might toggle depending on `-flto`. `-Ofast` is not a terribly good idea really.
<dang>, this is (2022).
Is there a simple way to check if my Python script is affected? Because I guess numpy only complains if the FPU has been screwed up before it loads (not if some other package loaded later does it)?
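A rough check, sketched under the assumption of CPython on hardware with IEEE 754 doubles. `subnormals_are_flushed` is a hypothetical helper name. The values are computed at runtime so the interpreter cannot fold them into constants at compile time.

```python
import sys

def subnormals_are_flushed() -> bool:
    """Return True if subnormal doubles appear to be flushed in this thread."""
    smallest_normal = sys.float_info.min

    # FTZ: a subnormal *result* is replaced by zero.
    ftz = (smallest_normal / 2.0) == 0.0

    # DAZ: a subnormal *input* is treated as zero. The product below is a
    # normal number (about 5e-24) unless the subnormal operand is read as 0.
    smallest_subnormal = float.fromhex("0x1p-1074")
    daz = (smallest_subnormal * 1e300) == 0.0

    return ftz or daz

print(subnormals_are_flushed())
```

Call it after all imports have run, and in the thread you care about, since the floating-point control flags are per-thread state.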
Contrary to what is said in the article, gevent has the fix, it has been merged as https://github.com/gevent/gevent/commit/e29bd2ee11ca5f78cc9c... 2 years ago.
That fix was a month after this article was written.
After thinking about this a bit, my conclusion is that it's a deficiency in the calling convention. It should specify the floating point flags as either caller saved or callee saved.
Well that explains the LLMs getting their answers wrong ;)
I just wanted to say—this is an absolutely brilliant write-up. Thanks so much for all your hard work
> I have never met a scientist who can resist the lure of fast-but-dangerous math
This made me chuckle
Kind of related topic: Using libraries is a pain - Is there any evidence that people are using LLM coding tools to write library functions instead of importing libraries? How would we tell? Can you think of second order effects from this?
The LLM coding tools will likely reproduce the errors from the code they were trained on, including compiler options like -Ofast. Because the tools are generating text based on tokens, the -Ofast option can and likely will appear in contexts where it is completely inappropriate. Programmers who use this code will propagate errors found and fixed 2 years ago.
I'd take a bet against that actually, as warning people about fast-math is exactly the kind of hand-holding OpenAI in particular trains for.
> propagate errors found and fixed 2 years ago
Ahh! Yeah, we need better value functions.
https://news.ycombinator.com/item?id=41188647
This is worth reading all the way through. Ouch.