It's amazing to step back and look at how much of NVIDIA's success has come from unforeseen directions. For their original purpose of making graphics chips, the consumer vs pro divide was all about CAD support and optional OpenGL features that games didn't use. Programmable shaders were added for the sake of graphics rendering needs, but ended up spawning the whole GPGPU concept, which NVIDIA reacted to very well with the creation and promotion of CUDA. GPUs have FP64 capabilities in the first place because back when GPGPU first started happening, it was all about traditional HPC workloads like numerical solutions to PDEs.
Fast forward several years, and the cryptocurrency craze drove up GPU prices for many years without even touching the floating-point capabilities. Now, FP64 is out because of ML, a field that's almost unrecognizable compared to where it was during the first few years of CUDA's existence.
NVIDIA has been very lucky over the course of their history, but have also done a great job of reacting to new workloads and use cases. But those shifts have definitely created some awkward moments where their existing strategies and roadmaps have been upturned.
Maybe some luck. But there’s also a principle that if you optimize the hell out of something and follow customer demand, there’s money to be made.
Nvidia did a great job of avoiding the “oh we’re not in that market” trap that sunk Intel (phones, GPUs, efficient CPUs). Where Intel was too big and profitable to cultivate adjacent markets, Nvidia did everything they could to serve them and increase demand.
I don't think it was luck. I think it was inevitable.
They positioned the company on high performance computing, even if maybe they didn't think they were a HPC company, and something was bound to happen in that market because everybody was doing more and more computing. Then they executed well with the usual amount of greed that every company has.
The only risk for well-positioned companies is being too far ahead of the times: being in the right market but not surviving long enough to see a killer app happen.
They were also bailed out by Sega.
When they couldn't deliver the console GPU they promised for the Dreamcast (the NV2), Shoichiro Irimajiri, the Sega CEO at the time, let them keep the cash in exchange for stock [0].
Without it Nvidia would have gone bankrupt months before Riva 128 changed things.
Sega's console arm went bust, not that it mattered. But they sold the stock for about $15mn (3x).
Had they held it, Jensen Huang estimated it'd be worth a trillion [1]. Obviously Sega, and especially its console arm, wasn't really into VC, but...
My wet dream has always been: what if Sega and Nvidia had stuck together and we had a Sega Tegra Shield instead of a Nintendo Switch? Or even: what if Sega had licensed itself to the Steam Deck? You can tell I'm a Sega fanboy, but I can't help it; the Mega Drive was the first console I owned and loved!
[0] https://www.gamespot.com/articles/a-5-million-gift-from-sega...
[1] https://youtu.be/3hptKYix4X8?t=5483&si=h0sBmIiaduuJiem_
Most people don't appreciate how many dead end applications NVIDIA explored before finding deep learning. It took a very long time, and it wasn't luck.
It was luck that a viable non-graphics application like deep learning existed which was well-suited to the architecture NVIDIA already had on hand. I certainly don't mean to diminish the work NVIDIA did to build their CUDA ecosystem, but without the benefit of hindsight I think it would have been very plausible that GPU architectures would not have been amenable to any use cases that would end up dwarfing graphics itself. There are plenty of architectures in the history of computing which never found a killer application, let alone three or four.
Even that is arguably not lucky; it just followed a non-obvious trajectory. Graphics uses a fair amount of linear algebra, so people with large-scale physical modeling needs (among many others) became interested. To an extent, the deep learning craze kicked off because developments in GPU computation made training economical.
Nvidia started their GPGPU adventure by acquiring a physics engine and porting it over to run on their GPUs. Supporting linear algebra operations was pretty much the goal from the start.
They were also full of lies when they started their GPGPU adventure (as they still are today).
For a few years they repeated continuously that GPGPU can provide about 100 times more speed than CPUs.
This has always been false. GPUs really are much faster, but their performance per watt has mostly hovered around 3 times, and sometimes up to 4 times, that of CPUs. This is impressive, but very far from the "100" factor originally claimed by NVIDIA.
Far more annoying than the exaggerated performance claims is how the NVIDIA CEO talked during the first GPGPU years about how their GPUs would democratize computing, giving everyone access to high-throughput computing.
After a few years, these optimistic prophecies stopped and NVIDIA promptly removed FP64 support from their affordably priced GPUs.
A few years later, AMD followed NVIDIA's example.
Now only Intel has made an attempt to revive GPUs as "GPGPUs", but there seems to be little conviction behind this attempt, as they do not even advertise the capabilities of their GPUs. If Intel also abandons this market, then the "general-purpose" in GPGPU will really be dead.
GPGPU is doing better than ever.
Sure, FP64 is a problem and not always available in the capacity people would like, but there are a lot of things you can do just fine with FP32, and all of that research and engineering absolutely is done on GPUs.
The AI craze also made all of it much more accessible. You don't need advanced C++ knowledge anymore to write and run a CUDA project. You can just take PyTorch, JAX, CuPy or whatnot and accelerate your NumPy code by an order of magnitude or two. Basically everyone in STEM is using Python these days, and the scientific stack works beautifully with NVIDIA GPUs. Guess which chip maker will benefit if any of that research turns out to be a breakout success in need of more compute?
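To make that concrete, here is a minimal sketch (my own, assuming CuPy is installed and a CUDA-capable GPU is present; the matrix size is arbitrary) of how little the call sites change when moving a NumPy workload to the GPU:

```python
import numpy as np
import cupy as cp  # drop-in GPU counterpart to most of the NumPy API

# A typical NumPy-style workload: solve a dense FP32 linear system.
a_cpu = np.random.rand(4096, 4096).astype(np.float32)
b_cpu = np.random.rand(4096).astype(np.float32)

# Same code shape, but the arrays live in GPU memory.
a_gpu = cp.asarray(a_cpu)
b_gpu = cp.asarray(b_cpu)
x_gpu = cp.linalg.solve(a_gpu, b_gpu)  # runs on the GPU (cuSOLVER-backed)

x_cpu = cp.asnumpy(x_gpu)  # copy the result back to host memory
```

Only the array creation and the final copy back differ from the pure NumPy version.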
> GPGPU can provide about 100 times more speed than CPUs
Ok. You're talking about performance.
> their performance per watt has mostly hovered around 3 times, and sometimes up to 4 times, that of CPUs
Now you're talking about perf/W.
> This is impressive, but very far from the "100" factor originally claimed by NVIDIA.
That's because you're comparing apples to apples per apple cart.
For determining the maximum performance achievable, performance per watt is what matters, as power consumption will always be limited by cooling and by the available power supply.
Even if we interpret the NVIDIA claim as referring to the performance available in a desktop, the GPU cards drew at most about twice the power of CPUs. Even with this extra factor, there remained more than an order of magnitude between reality and the NVIDIA claims.
Moreover, I am not sure whether around 2010 and earlier, when these NVIDIA claims were frequent, the power permitted for PCIe cards had already reached 300 W or was still lower.
In any case, the "100" factor claimed by NVIDIA was supported by flawed benchmarks, which compared an optimized parallel CUDA implementation of some algorithm with a naive sequential implementation on the CPU, instead of comparing it with an optimized multithreaded SIMD implementation on that CPU.
At the time, desktop power consumption was never a true limiter. Even for the notorious GTX 480, TDP was only 250 W.
That aside, it still didn't make sense to compare apples to apples per apple cart...
There's something of a feedback loop here, in that the reason that transformers and attention won over all the other forms of AI/ML is that they worked very well on the architecture that NVIDIA had already built, so you could scale your model size very dramatically just by throwing more commodity hardware at it.
It was definitely luck, Greg. And Nvidia didn't invent deep learning; deep learning found Nvidia's investment in CUDA.
I remember it differently. CUDA was built with the intention of finding/enabling something like deep learning. I thought it was unrealistic too and took it on faith from people more experienced than me, until I saw deep learning work.
Some of the near misses I remember included bitcoin. Many of the other attempts didn't ever see the light of day.
Luck in English often means success by chance rather than one's own efforts or abilities. I don't think that characterizes CUDA. I think it was eventual success in the face of extreme difficulty, many failures, and sacrifices. In hindsight, I'm still surprised that Jensen kept funding it as long as he did. I've never met a leader since who I think would have done that.
CUDA was profitable very early because of oil and gas code, like reverse time migration and the like. There was no act of incredible foresight from Jensen. In fact, I recall him threatening to kill the program if the large projects that kept it from being profitable, like the Titan supercomputer at Oak Ridge, failed.
Nobody cared about deep learning back in 2007, when CUDA was released. It wasn't until the 2012 AlexNet milestone that deep neural nets started to become in vogue again.
I clearly remember CUDA being made for HPC and scientific applications. They added actual operations for neural nets years after it was already a boom. Both instances were reactions: people already used graphics shaders for scientific purposes and CUDA for neural nets, and in both cases Nvidia was like, oh cool, money to be made.
Parallel computing goes back to the 1960s (at least). I've been involved in it since the 1980s. Generally you don't create an architecture and associated tooling for some specific application. The people creating the architecture only have a sketchy understanding of application areas and their needs. What you do is have a bright idea/pet peeve. Then you get someone to fund building that thing you imagined. Then marketing people scratch their heads as to who they might sell it to. It's at that point you observed "this thing was made for HPC, etc." because the marketing folks put out stories and material that said so. But really it wasn't. And as you note, it wasn't made for ML or AI either. That said, in the 1980s we had "neural networks" as a potential target market for parallel processing chips, so it's always there as a possibility.
So it could just as easily have been Intel or AMD, despite them not having CUDA or any interest in that market? Pure luck that the one large company that invested to support a market reaped most of the benefits?
It was luck, but that doesn't mean they didn't work very hard too.
Luck is when preparation meets opportunity.
The counter question is: why has AMD been so bad by comparison?
Because GPUs require a lot on the software side, and AMD sucks at software. They are a CPU company that bought a GPU company. ATI should have been left alone.
The whole GPU history is off, and driven by finance bros as well. Everyone believes Nvidia kicked off the GPU AI craze when Ilya Sutskever cleaned up on AlexNet with an Nvidia GPU back in 2012, or when Andrew Ng and team at Stanford published their "Large Scale Deep Unsupervised Learning using Graphics Processors" in 2009, but in 2004 a couple of Korean researchers were the first to implement neural networks on a GPU, using ATI Radeons (now AMD): https://www.sciencedirect.com/science/article/abs/pii/S00313...
I remember ATI and Nvidia were neck-and-neck to launch the first GPUs around 2000. Just so much happening so fast.
I'd also say Nvidia had the benefit of AMD going after and focusing on Intel, both at the server level and in integrated laptop processors, which was the reason they bought ATI.
While implementing double-precision by double-single may be a solution in some cases, the article fails to mention the overflow/underflow problem, which is critical in scientific/technical computing (a.k.a. HPC).
With the method from the article, the exponent range remains the same as in single precision, instead of being increased to that of double precision.
There are a lot of applications for which such an exponent range would cause far too frequent overflows and underflows. This could be avoided by introducing a lot of carefully-chosen scaling factors in all formulae, but this tedious work would remove the main advantage of floating-point arithmetic, i.e. the reason why computations are not done in fixed-point.
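A quick NumPy illustration of the exponent-range point (my own example, not from the article): values that are routine in FP64 overflow or flush to zero in FP32, no matter how many FP32 words you chain together for the significand.

```python
import numpy as np

x32 = np.float32(1.0e30)
x64 = np.float64(1.0e30)

print(x32 * x32)  # inf:  FP32 overflows near 3.4e38 (8-bit exponent)
print(x64 * x64)  # 1e60: fine in FP64, which reaches ~1.8e308 (11-bit exponent)

y32 = np.float32(1.0e-30)
print(y32 * y32)  # 0.0:  underflow; the smallest FP32 subnormal is ~1.4e-45
```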
The general solution of this problem is to emulate double-precision with 3 numbers, 2 FP32 for the significand and a third number for the exponent, either a FP number or an integer number, depending on which format is more convenient for a given GPU.
This is possible, but it considerably lowers the achievable ratio between emulated FP64 throughput and hardware FP32 throughput; even so, the ratio is still better than the vendor-enforced 1:64.
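To make that three-number representation concrete, here is a hypothetical layout sketched in NumPy, purely for illustration (the names are mine, and the add/multiply routines, which are where the real cost lies, are omitted):

```python
import numpy as np

# Hypothetical layout: value ~= (hi + lo) * 2**exp, with hi and lo stored as
# FP32 and exp as a plain integer. Keeping exp separate restores a wide
# exponent range even though the significand is built from two FP32 words.

def normalize(hi, lo, exp):
    """Fold hi's binary exponent into exp so that hi stays in [0.5, 1)."""
    m, e = np.frexp(np.float32(hi))  # hi == m * 2**e, with 0.5 <= |m| < 1
    e = int(e)
    return np.float32(m), np.float32(np.ldexp(lo, -e)), exp + e

# A magnitude far beyond the FP32 range is representable without overflow:
hi, lo, exp = normalize(np.float32(0.87), np.float32(0.0), 400)
# value ~= 0.87 * 2**400 ~= 2.2e120, impossible to hold in FP32 words alone
```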
Nevertheless, for now any small business or individual user can achieve a much better performance per dollar for FP64 throughput by buying Intel Battlemage GPUs, which have a 1:8 FP64/FP32 throughput ratio. This is much better than you can achieve by emulating FP64 on NVIDIA or AMD GPUs.
The Intel B580 is a small GPU, so its FP64 throughput is only about equal to a Ryzen 9 9900X and below a Ryzen 9 9950X. However, it provides that throughput at a much lower price. Thus, if you start with a PC with a 9900X/9950X, you can double or almost double its FP64 throughput for a low additional price with an Intel GPU. Multiple GPUs will proportionally multiply the throughput.
The sad part is that with the current Intel CEO and with NVIDIA being a shareholder of Intel, it is unclear whether Intel will continue to compete in the GPU market, or they will abandon it, leaving us at the mercy of NVIDIA and AMD, which both refuse to provide products with good FP64 support to small businesses and individual users.
Yeah fair enough. The exponent of an FP32 has only 8 bits instead of 11 bits. I'll make an edit to make this explicit.
It's also fairly interesting how Nvidia handles this for the Ozaki scheme: https://docs.nvidia.com/cuda/cublas/#floating-point-emulatio.... They generally need to align all numbers in a matrix row to the maximum exponent (of a number in the row), but depending on the scale difference of two numbers this might not be feasible without extending the number of mantissa bits significantly. So they dynamically decide (Dynamic Mantissa Control) whether to use Ozaki's scheme or execute on native FP64 hardware. Or they let the user fix the number of mantissa bits (Fixed Mantissa Control), which is faster but no longer has the guarantees of FP64 precision.
Yeah, double-word floating-point loses many of the desirable properties of the usual floating-point.
No mention of the Radeon VII from 2019, where for some unfathomable reason AMD forgot about the segmentation scam and put real FP64 into a gaming GPU. From this 2023 list, it's still faster at FP64 than any other consumer GPU by a wide margin (enterprise GPUs aren't in the list). Scroll all the way to the end.
https://www.eatyourbytes.com/list-of-gpus-by-processing-powe...
They did a mild segmentation with that one, by reducing the throughput from 1:2 to 1:4 in the consumer variant, with the hope of forcing people to buy the "professional" version.
Even with the throughput reduction, the Radeon VII performed somewhat better than the previous best FP64 product, AMD Hawaii, due to its large and fast memory. Later consumer GPUs from NVIDIA and AMD have never again approached such a high memory interface throughput.
Radeon VII has remained for many years the champion of FP64 performance per dollar. I am still using one bought in 2019, 7 years ago.
Last year was the first time a GPU with good FP64 performance per dollar appeared again after the Radeon VII: the Intel Battlemage B580. Unfortunately it is a small GPU, but the performance per dollar is nonetheless excellent.
That's because Radeon VIIs were just AMD Instinct MI50 server GPUs which didn't make the cut or were left over.
I'm not sure why the article dismisses cost.
Let's say X=10% of the GPU area (~75 mm^2) is dedicated to FP32 SIMD units. Assume FP64 units are ~2-4x bigger. That would be 150-300 mm^2, a huge amount of area that would increase the price per GPU. You may not agree with these assumptions; feel free to change them. It is an overhead that is replicated per core. Why would gamers want to pay for any features they don't use?
Not to say there isn't market segmentation going on, but FP64 cost is higher for massively parallel processors than it was in the days of high frequency single core CPUs.
> Assume FP64 units are ~2-4x bigger.
I'm pretty sure that's not a remotely fair assumption to make. We've seen architectures that can, e.g., do two FP32 operations or one FP64 operation with the same unit, with relatively low overhead compared to a pure FP32 architecture. That's pretty much how all integer math units work, and it's not hard to pull off for floating point. FP64 units don't have to be, and seldom have been, implemented as massive single-purpose blocks of otherwise-dark silicon.
When the real hardware design choice is between having a reasonable 2:1 or 4:1 FP32:FP64 ratio vs having no FP64 whatsoever and designing a completely different core layout for consumer vs pro, the small overhead of having some FP64 capability has clearly been deemed worthwhile by the GPU makers for many generations. It's only now that NVIDIA is so massive that we're seeing them do five different physical implementations of "Blackwell" architecture variants.
An FP64 unit can share most of two FP32 units.
Only the multiplier is significantly bigger, up to 4 times. Some shifters may also be up to twice as big. The adders are slightly bigger, due to bigger carry-lookahead networks.
So you must count mainly the area occupied by multipliers and shifters, which is likely to be much less than 10%.
There is an area increase, but certainly not 50% (300 mm^2). Even an area increase of 10% (e.g. 60-70 mm^2 for the biggest GPUs) seems incredibly large.
Reducing the FP64/FP32 throughput ratio from 1:2 to 1:4 or at most to 1:8 is guaranteed to make the excess area negligible. I am sure that the cheap Intel Battlemage with 1:8 does not suffer because of this.
Any further reduction, from 1:16 in older GPUs down to 1:64 in recent ones, cannot have any explanation other than the desire for market segmentation, which excludes small businesses and individual users, who cannot afford the huge prices of the GPUs with proper FP64 support.
> Assume FP64 units are ~2-4x bigger.
I'm not a hardware guy, but an explanation I've seen from someone who is, is that it doesn't take much extra hardware to give a 2×f32 FMA unit the capability to do 1×f64. You already have all of the per-bit logic; you mostly just need to add an extra control line to make a few carries propagate. So the size overhead of adding FP64 to the SIMD units is more like 10-50%, not 100-300%.
Most of the logic can be reused, but the FP64 multiplier is up to 4 times larger. Also some shifters are up to 2 times larger (because they need more stages, even if they shift the same number of bits). Small size increases occur in other blocks.
Even so, the multipliers and shifters occupy only a small fraction of the total area, a fraction that is smaller than implied by their number of gates, because they have very regular layouts.
A reduction from the ideal 1:2 FP64/FP32 throughput to 1:4 or in the worst case to 1:8 should be enough to make negligible the additional cost of supporting FP64, while still keeping the throughput of a GPU competitive with a CPU.
The current NVIDIA and AMD GPUs cannot compete in FP64 performance per dollar or per watt with Zen 5 Ryzen 9 CPUs. Only Intel B580 is better in FP64 performance per dollar than any CPU, though its total performance is exceeded by CPUs like 9950X.
> Why would gamers want to pay for any features they don't use?
Obviously they don't want to. Now flip it around and ask why HPC people would want to force gamers to pay for something that benefits the HPC people... Suddenly the blog post makes perfect sense.
Similar to when Nvidia released LHR GPUs that nerfed performance for Ethereum mining.
The NVIDIA GeForce RTX 3060 LHR tried to hinder mining at the BIOS level.
The point wasn't to make the average person lose out by preventing them from mining on their gaming GPU, but to make miners less inclined to buy gaming GPUs. They also released a series of crypto-mining GPUs around the same time.
So fairly typical market segmentation.
https://videocardz.com/newz/nvidia-geforce-rtx-3060-anti-min...
NVIDIA could make two separate products: a GPU for gamers and an FP accelerator for HPC.
Thus everybody would pay for what they want.
The problem is that neither NVIDIA nor AMD wants to make an FP accelerator of reasonable size, sold at a profit margin similar to their consumer GPUs, like AMD did until a decade ago (and NVIDIA stopped doing a few years earlier).
Instead, they want to sell only very big FP accelerators at huge profit margins, preferably at 5-digit prices.
This makes it impossible for small businesses and individual users to use such FP accelerators.
Those are accessible only to big companies, which can buy them in bulk and negotiate below retail prices, and which can also keep them busy close to 24/7 in order to amortize the excessive profit margins of the "datacenter" GPU vendors.
A decade and a half ago, the market segmentation was not yet excessive, so I was happy to buy "professional" GPUs, with unlocked FP64 throughput, at about twice the price of consumer GPUs.
Nowadays I can no longer afford such a thing, because the equivalent GPUs are no longer 2 times more expensive, but 20 to 50 times more expensive.
So during the last two decades I first shifted much of my computation from CPUs to GPUs, but then I had to shift it back to CPUs, because there are no upgrades for my old GPUs; any newer GPU is slower, not faster.
Throughout this article you have been voicing a desire for affordable and high-throughput fp64 processors, blaming vendors for not building the product you desire at a price you are willing to pay.
We hear you: your needs are not being met. Your use case is not profitable enough to justify paying the sky-high prices they now demand. In particular, because you don't need to run the workload 24/7.
What alternatives have you looked into? For example, Blackwell nodes are available from the likes of AWS.
I think that you might have confused me with the author of the article.
American companies have a pronounced preference for business-to-business products, where they can sell large quantities in bulk and at very large profit margins that would not be accepted by small businesses or individual users, who spend their own money, instead of spending the money of an anonymous employer.
If that is the only way for them to be profitable, good for them. However such policies do not deserve respect. They demonstrate the inefficiencies in the management of these companies, which prevent them from competing efficiently in markets for low-margin commodity products.
From my experience, I am pretty certain that a smaller die version of the AMD "datacenter" GPUs could be made and it could be profitable, like such GPUs were a decade ago, when AMD was still making them. However today they no longer have any incentive to do such things, as they are content with selling a smaller number of units, but with much higher margins, and they do not feel any pressure to tighten their costs.
Fortunately, at least in CPUs there has been steady progress, and AMD Zen 5 was a great leap in floating-point throughput, exceeding the performance of older GPUs.
I am not blaming vendors for not building the product I desire, but I am disappointed that years ago they fooled me into wasting time porting applications to their products, which I bought instead of spending the money on something else, and then they discontinued such products with no upgrade path.
Because I am old enough to remember what happened 15 to 20 years ago, I am annoyed by the hypocrisy of some of the NVIDIA CEO's speeches, repeated for several years after CUDA's introduction, which amounted to promises that NVIDIA's goal was to put a "supercomputer" on everyone's desk, only for him to pivot completely away from those claims and remove FP64 from "consumer" GPUs in order to sell "enterprise" GPUs at inflated prices. This soon prompted AMD to imitate the same strategy.
FP64 performance is limited on consumer GPUs because the US government deems it important to nuclear weapons research.
Past a certain threshold of FP64 throughput, your chip goes in a separate category and is subject to more regulation about who you can sell to and know-your-customer. FP32 does not matter for this threshold.
It's surprising that this restriction continues to linger at all. The newest nuclear warhead models in the US arsenal were developed in the 1970s, when supercomputer performance was well below 1 gigaflop. When the US stopped testing nuclear warheads in 1992, top end supercomputers were under 10 gigaflops. The only thing the US arsenal needs faster computers for is simulating the behavior of its aging warhead stockpile without physical tests, which is not going to matter to a state building its first nuclear weapons.
Can't wait until they update this to also include export controls around FP8 and FP4 etc. in order to combat deepfakes, and then we all of a sudden can't buy increasingly powerful consumer GPUs.
This is so interesting, especially given that it is in theory possible to emulate FP64 using FP32 operations.
I do think though that Nvidia generally didn't see much need for more FP64 in consumer GPUs since they wrote in the Ampere (RTX3090) white paper: "The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly, including FP64 Tensor Core code."
I'll try adding an additional graph where I plot the APP values for all consumer GPUs up to 2023 (when the export control regime changed) to see if the argument of Adjusted Peak Performance for FP64 has merit.
Do you happen to know though if GPUs count as vector processors or not under these regulations since the weighing factor changes depending on the definition?
https://www.federalregister.gov/documents/2018/10/24/2018-22...
What I found so far is that under Note 7 it says: "A ‘vector processor’ is defined as a processor with built-in instructions that perform multiple calculations on floating-point vectors (one-dimensional arrays of 64-bit or larger numbers) simultaneously, having at least 2 vector functional units and at least 8 vector registers of at least 64 elements each."
Nvidia GPUs have only 32 threads per warp, so I suppose they don't count as a vector processor (which seems a bit weird but who knows)?
Only two of these examples meet the definition of vector processor, and these are very clearly classical vector processor computers, the Cray X1E and the NEC SX-8 (as in, if you're preparing a guide on historical development of vector processing, you're going to be explicitly including these systems or their ancestors as canonical examples of what you mean by a vector super computer!). And the definition is pretty clearly tailored to make sure that SIMD units in existing CPUs wouldn't qualify for the definition of vector processor.
The interesting case to point out is the last example, a "Hypothetical coprocessor-based Server", which describes something extremely similar to what GPGPU-based HPC systems turned out to be: "The host microprocessor is a quad-core (4 processors) chip, and the coprocessor is a specialized chip with 64 floating-point engines operating in parallel, attached to the host microprocessor through a specialized expansion bus (HyperTransport or CSI-like)." This hypothetical system is not a "vector processor," it goes on to explain.
From what I can find, it seems that neither NVidia nor the US government considers the GPUs to count as vector processors and thus give it the 0.3 rather than the 0.9 weight.
> it is in theory possible to emulate FP64 using FP32 operations
I'd say it's better than theory: you can definitely use float2 pairs of fp32 floats to emulate higher precision, and quad precision too, using float4. Here's the code: https://andrewthall.com/papers/df64_qf128.pdf
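For anyone curious what the float2 trick boils down to, here is a minimal, non-IEEE-compliant sketch of double-single addition based on Knuth's two-sum error-free transformation, written with NumPy float32 scalars purely for illustration (a real implementation would live in CUDA or GLSL, and multiplication additionally needs an FMA-based two-product or Dekker splitting):

```python
import math
import numpy as np

F = np.float32  # keep every intermediate in FP32

def two_sum(a, b):
    """Error-free FP32 addition: returns (s, err) with a + b == s + err exactly."""
    s = F(a + b)
    bb = F(s - a)
    err = F(F(a - F(s - bb)) + F(b - bb))
    return s, err

def ds_add(x, y):
    """Add two double-single values, each stored as a (hi, lo) pair of FP32."""
    s, e = two_sum(x[0], y[0])
    e = F(e + F(x[1] + y[1]))
    return two_sum(s, e)  # renormalize so that |lo| is tiny relative to hi

# Split pi into a (hi, lo) pair: hi holds the top 24 bits, lo the remainder.
pi_hi = F(math.pi)
pi_lo = F(math.pi - float(pi_hi))

hi, lo = ds_add((pi_hi, pi_lo), (pi_hi, pi_lo))
print(float(hi) + float(lo))    # ~6.28318530717..., very close to 2*pi
print(float(F(pi_hi + pi_hi)))  # 6.2831854820251465, plain FP32 accuracy
```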
Also note it’s easy to emulate fp64 using entirely integer instructions. (As a fun exercise, I attempted both doubles and quads in GLSL: https://www.shadertoy.com/view/flKSzG)
While it's relatively easy to do, these approaches are a lot slower than fp64 hardware. My code is not optimized, not IEEE compliant, and not bug-free, but the emulated doubles are at least an order of magnitude slower than fp32, and the quads are two orders of magnitude slower. I don't think Andrew Thall's df64 can achieve a 1:4 float-to-double perf ratio either.
And not sure, but I don’t think CUDA SMs are vector processors per se, and not because of the fixed warp size, but more broadly because of the design & instruction set. I could be completely wrong though, and Tensor Cores totally might count as vector processors.
What is easy to do is to emulate FP128 with FP64 (double-double) or even FP256 with FP64.
The reason is that the exponent range of FP64 is typically sufficient to avoid overflows and underflows in most applications.
On the other hand, the exponent range of FP32 is insufficient for most scientific-technical computing.
For an adequate exponent range, you must use either three FP32 per FP64, or two FP32 and an integer. In this case the emulation becomes significantly slower than the simplistic double-single emulation.
With the simpler double-single emulation, you cannot expect to just plug it into most engineering applications, e.g. SPICE for electronic circuit simulation, and see the application work. Some applications could be painstakingly modified to work with such an implementation, but that is not normally an option.
So to be interchangeable with the use of standard FP64 you really must also emulate the exponent range, at the price of much slower emulation.
I did this at some point in the past, but today it makes no sense in comparison with the available alternatives.
Today, the best FP64 performance per dollar by far is achieved with a Ryzen 9950X or Ryzen 9900X, in combination with Intel Battlemage B580 GPUs.
When money does not matter, you can use AMD Epyc in combination with AMD "datacenter" GPUs, which would achieve much better performance per watt, but the performance per dollar would be abysmally low.
Oh yes I forgot to mention it, you’re absolutely right, Thall’s method for df64 and qf128 gives you double/quad precision mantissa with single-precision exponent ranges, and the paper is clear about that.
FWIW, my own example (emulating doubles/quads with ints) gives the full exponent range with no wasted bits since I’m just emulating IEEE format directly.
Of course there are also bignum libraries that can do arbitrary precision. I guess one of the things I meant to convey but didn't say directly is that using double precision isn't export controlled, as one might interpret from the top of this thread, but a certain level of fp64 performance might be.
Yep, I do GPU passthrough to virtual machines because I would not let Windows touch my bare metal. You have to patch your ROM headers and hide the fact that you're in a VM from the OS.
So even as an end-user, a single person, I cannot naturally use my card how I please without significant technical investment. Imagine buying a $1000 piece of equipment and then being told what you can and can't do with it.
They did it by limiting the supply of cards. Even if you were ready to pay 4x MSRP, you couldn't buy 100 of the cards at once. Many consumers bought one GPU at 2-4x MSRP.
A question that has been bugging me for a while is what will NVIDIA do with its HPC business? By HPC I mean clusters intended for non-AI workloads. Are they going to cater to them separately, or are they going to tell them to just emulate FP64?
Hopper had 60 TF FP64, Blackwell has 45 TF, and Rubin has 33 TF.
It is pretty clear that Nvidia is sunsetting FP64 support, and they are selling a story that no serious computational scientist I know believes, namely that you can use low precision operations to emulate higher precision.
This is kind of amazing - I still have a bunch of Titan V's (2017-2018) that do 7 TF FP64. 8 years old and managing 1/4 of what Rubin does, and the numbers are probably closer if you divide by the power draw.
(Needless to say, the FP32 / int8 / etc. numbers are rather different.)
For a long time AMD has been offering much better FP64 performance than NVIDIA, in their CDNA GPUs (which continue the older AMD GCN ISA, instead of resembling the RDNA used in gaming GPUs).
Nevertheless, the AMD GPUs continue to have their old problems: weak software support, so-so documentation, and software incompatibility with the cheap GPUs that a programmer could use directly for developing applications.
There is a promise that AMD will eventually unify the ISA of their "datacenter" and gaming GPUs, like NVIDIA has always done, but it is unclear when this will happen.
Thus they are a solution only for big companies or government agencies.
This article is so dumb. NVIDIA delivered what the market wanted: gamers don't need FP64, so they don't waste silicon on it. Now enterprise doesn't want FP64 anymore, and they are reducing the silicon for it too.
Weird way to frame delivering exactly what the consumer wants as some big market-segmentation, fuck-the-user conspiracy.
Your framing is what's backwards. NVIDIA artificially nerfed FP64 for a long time before they started making multiple specialized variants of their architectures. It's not a conspiracy theory; it's historical fact that they shipped the same die with drastically different levels of FP64 capability. In a very real way, consumers were paying for transistors they couldn't use, subsidizing the pro parts.
This isn't really true, and it wouldn't be a big deal even if it was.
Die areas for consumer card chips are smaller than die areas for datacenter card chips, and this has held for a few generations now. They can't possibly be the same chips, because they are physically different sizes. The lowest-end consumer dies are less than 1/4 the area of datacenter dies, and even the highest-end consumer dies are only like 80% the area of datacenter dies. This implies there must be some nontrivial differentiation going on at the silicon level.
Secondly, you are not paying for the die area anyway. Whether a chip is obtained from being specially made for that exact model of GPU, or it is obtained from being binned after possibly defective areas get fused off, you are paying for the end-result product. If that product meets the expected performance, it is doing its job. This is not a subsidy (at least, not in that direction), the die is just one small part of what makes a usable GPU card, and excess die area left dark isn't even pure waste, as it helps with heat dissipation.
The fact that nVidia excludes decent FP64 from all of its prosumer offerings (*) can still be called "artificial" insofar as it was indeed done on purpose for market segmentation purposes, but it's not some trivial trick. They really are just not putting it into the silicon. This has been the case for longer than it wasn't by now, even.
* = The Quadro line of "professional" workstation cards nowadays are just consumer cards with ECC RAM and special drivers
> consumers were paying for transistors they couldn’t use
This is Econ 101 these days. It’s cheaper to design and manufacture 1 product than 2. Many many products have features that are enabled for higher paying customers, from software to kitchen appliances to cars, and much much more.
The combined product design is also subsidizing some of the costs for everyone, so be careful what you wish for. If you could use all the transistors you have, you’d be paying more either way, either because design and production costs go up, or because you’re paying for the higher end model and being the one subsidizing the existence of the high end transistors other people don’t use.
It's amazing to step back and look at how much of NVIDIA's success has come from unforeseen directions. For their original purpose of making graphics chips, the consumer vs pro divide was all about CAD support and optional OpenGL features that games didn't use. Programmable shaders were added for the sake of graphics rendering needs, but ended up spawning the whole GPGPU concept, which NVIDIA reacted to very well with the creation and promotion of CUDA. GPUs have FP64 capabilities in the first place because back when GPGPU first started happening, it was all about traditional HPC workloads like numerical solutions to PDEs.
Fast forward several years, and the cryptocurrency craze drove up GPU prices for many years without even touching the floating-point capabilities. Now, FP64 is out because of ML, a field that's almost unrecognizable compared to where it was during the first few years of CUDA's existence.
NVIDIA has been very lucky over the course of their history, but have also done a great job of reacting to new workloads and use cases. But those shifts have definitely created some awkward moments where their existing strategies and roadmaps have been upturned.
Maybe some luck. But there’s also a principle that if you optimize the hell out of something and follow customer demand, there’s money to be made.
Nvidia did a great job of avoiding the “oh we’re not in that market” trap that sunk Intel (phones, GPUs, efficient CPUs). Where Intel was too big and profitable to cultivate adjacent markets, Nvidia did everything they could to serve them and increase demand.
I don't think it was luck. I think it was inevitable.
They positioned the company on high performance computing, even if maybe they didn't think they were a HPC company, and something was bound to happen in that market because everybody was doing more and more computing. Then they executed well with the usual amount of greed that every company has.
The only risk for well positioned companies is being too ahead of times: being in the right market but not surviving long enough to see a killer app happen.
They were also bailed out by Sega.
When they couldn't deliver the console GPU they promised for the Dreamcast (the NV2), Shoichiro Irimajiri, the Sega CEO at the time let them keep the cash in exchange for stock [0].
Without it Nvidia would have gone bankrupt months before Riva 128 changed things.
Sega console arm went bust not that it mattered. But they sold the stock for about $15mn (3x).
Had they held it, Jensen Huang ,estimated itd be worth a trillion[1]. Obviously Sega and especially it's console arm wasn't really into VC but...
My wet dream has always been what if Sega and Nvidia stuck together and we had a Sega tegra shield instead of a Nintendo switch? Or even what if Sega licensed itself to the Steam Deck? You can tell I'm a sega fan boy but I can't help that the Mega Drive was the first console I owned and loved!
[0] https://www.gamespot.com/articles/a-5-million-gift-from-sega...
[1] https://youtu.be/3hptKYix4X8?t=5483&si=h0sBmIiaduuJiem_
Most people don't appreciate how many dead end applications NVIDIA explored before finding deep learning. It took a very long time, and it wasn't luck.
It was luck that a viable non-graphics application like deep learning existed which was well-suited to the architecture NVIDIA already had on hand. I certainly don't mean to diminish the work NVIDIA did to build their CUDA ecosystem, but without the benefit of hindsight I think it would have been very plausible that GPU architectures would not have been amenable to any use cases that would end up dwarfing graphics itself. There are plenty of architectures in the history of computing which never found a killer application, let alone three or four.
Even that is arguably not lucky, it just followed a non-obvious trajectory. Graphics uses a fair amount of linear algebra, so people with large scale physical modeling needs (among many) became interested. To an extent the deep learning craze kicked off because of developments in computation on GPUs enabled economical training.
Nvidia started their GPGPU adventure by acquiring a physics engine and porting it over to run on their GPUs. Supporting linear algebra operations was pretty much the goal from the start.
They were also full of lies when they have started their GPGPU adventure (like also today).
For a few years they have repeated continuously how GPGPU can provide about 100 times more speed than CPUs.
This has always been false. GPUs are really much faster, but their performance per watt has oscillated during most of the time around 3 times and sometimes up to 4 times greater in comparison with CPUs. This is impressive, but very far from the "100" factor originally claimed by NVIDIA.
Far more annoying than the exaggerated performance claims, is how the NVIDIA CEO was talking during the first GPGPU years about how their GPUs will cause a democratization of computing, giving access for everyone to high-throughput computing.
After a few years, these optimistic prophecies have stopped and NVIDIA has promptly removed FP64 support from their price-acceptable GPUs.
A few years later, AMD has followed the NVIDIA example.
Now, only Intel has made an attempt to revive GPUs as "GPGPUs", but there seems to be little conviction behind this attempt, as they do not even advertise the capabilities of their GPUs. If Intel will also abandon this market, than the "general-purpose" in GPGPUs will really become dead.
GPGPU is doing better than ever.
Sure FP64 is a problem and not always available in the capacity people would like it to be, but there are a lot of things you can do just fine with FP32 and all of that research and engineering absolutely is done on GPU.
The AI-craze also made all of it much more accessible. You don't need advanced C++ knowledge anymore to write and run a CUDA project anymore. You can just take Pytorch, JAX, CuPy or whatnot and accelerate your numpy code by an order of magnitude or two. Basically everyone in STEM is using Python these days and the scientific stack works beautifully with nvidia GPUs. Guess which chip maker will benefit if any of that research turns out to be a breakout success in need of more compute?
> GPGPU can provide about 100 times more speed than CPUs
Ok. You're talking about performance.
> their performance per watt has oscillated during most of the time around 3 times and sometimes up to 4 times greater in comparison with CPUs
Now you're talking about perf/W.
> This is impressive, but very far from the "100" factor originally claimed by NVIDIA.
That's because you're comparing apples to apples per apple cart.
For determining the maximum performance achievable, the performance per watt is what matters, as the power consumption will always be limited by cooling and by the available power supply.
Even if we interpret the NVIDIA claim as referring to the performance available in a desktop, the GPU cards had power consumptions at most double in comparison with CPUs. Even with this extra factor there has been more than an order of magnitude between reality and the NVIDIA claims.
Moreover I am not sure whether around 2010 and before that, when these NVIDIA claims were frequent, the power permissible for PCIe cards had already reached 300 W, or it was still lower.
In any case the "100" factor claimed by NVIDIA was supported by flawed benchmarks, which compared an optimized parallel CUDA implementation of some algorithm with a naive sequential implementation on the CPU, instead of comparing it with an optimized multithreaded SIMD implementation on that CPU.
At the time, desktop power consumption was never a true limiter. Even for the notorious GTX 480, TDP was only 250 W.
That aside, it still didn't make sense to compare apples to apples per apple cart...
There's something of a feedback loop here, in that the reason that transformers and attention won over all the other forms of AI/ML is that they worked very well on the architecture that NVIDIA had already built, so you could scale your model size very dramatically just by throwing more commodity hardware at it.
It was definitely luck, greg. And Nvidia didn't invent deep learning, deep learning found nvidias investment in CUDA.
I remember it differently. CUDA was built with the intention of finding/enabling something like deep learning. I thought it was unrealistic too and took it on faith in people more experienced than me, until I saw deep learning work.
Some of the near misses I remember included bitcoin. Many of the other attempts didn't ever see the light of day.
Luck in english often means success by chance rather than one's own efforts or abilities. I don't think that characterizes CUDA. I think it was eventual success in the face of extreme difficulty, many failures, and sacrifices. In hindsight, I'm still surprised that Jensen kept funding it as long as he did. I've never met a leader since who I think would have done that.
CUDA was profitable very early because of oil and gas code, like reverse time migration and the like. There was no act of incredible foresight from jensen. In fact, I recall him threatening to kill the program if large projects that made it not profitable failed, like the Titan super computer at oak ridge.
Nobody cared about deep learning back in 2007, when CUDA released. It wasn't until the 2012 AlexNet milestone that deep neural nets start to become en vogue again.
I clearly remember Cuda being made for HPC and scientific applications. They added actual operations for neural nets years after it was already a boom. Both instances were reactions, people already used graphics shaders for scientific purposes and cuda for neural nets, in both cases Nvidia was like oh cool money to be made.
Parallel computing goes back to the 1960s (at least). I've been involved in it since the 1980s. Generally you don't create an architecture and associated tooling for some specific application. The people creating the architecture only have a sketchy understanding of application areas and their needs. What you do is have a bright idea/pet peeve. Then you get someone to fund building that thing you imagined. Then marketing people scratch their heads as to who they might sell it to. It's at that point you observed "this thing was made for HPC, etc" because the marketing folks put out stories and material that said so. But really it wasn't. And as you note, it wasn't made for ML or AI either. That said in the 1980s we had "neural networks" as a potential target market for parallel processing chips so it's aways there as a possibility.
So it could just as easily have been Intel or AMD, despite them not having CUDA or any interest in that market? Pure luck that the one large company that invested to support a market reaped most of the benefits?
It was luck, but that doesn't mean they didn't work very hard too.
Luck is when preparation meets opportunity.
The counter question is: why have AMD been so bad by comparison?
Because GPUs require a lot on the software side, and AMD sucks at software. They are a CPU company that bought a GPU company. ATI should have been left alone.
The whole GPU history is off and being driven by finance bros as well. Everyone believes Nvidia kicked off the GPU AI craze when Ilya Sutskever cleaned up on AlexNet with an Nvidia GPU back in 2012, or when Andrew Ng and team at Stanford published their "Large Scale Deep Unsupervised Learning using Graphics Processors" in 2009, but in 2004, a couple of Korean researches were the first to implement neural networks on a GPU, using ATI Radeons (now AMD): https://www.sciencedirect.com/science/article/abs/pii/S00313...
I remember ATI and Nvidia were neck-and-neck to launch the first GPUs around 2000. Just so much happening so fast.
I'd also say Nvidia had the benefit of AMD going after and focusing on Intel both at the server level as well as the integrated laptop processors, which was the reason they bought ATI.
While implementing double-precision by double-single may be a solution in some cases, the article fails to mention the overflow/underflow problem, which is critical in scientific/technical computing (a.k.a. HPC).
With the method from the article, the exponent range remains the same as in single precision, instead of being increased to that of double precision.
There are a lot of applications for which such an exponent range would cause far too frequent overflows and underflows. This could be avoided by introducing a lot of carefully-chosen scaling factors in all formulae, but this tedious work would remove the main advantage of floating-point arithmetic, i.e. the reason why computations are not done in fixed-point.
The general solution of this problem is to emulate double-precision with 3 numbers, 2 FP32 for the significand and a third number for the exponent, either a FP number or an integer number, depending on which format is more convenient for a given GPU.
This is possible, but it lowers considerably the achievable ratio between emulated FP64 throughput and hardware FP32 throughput, but the ratio is still better than the vendor-enforced 1:64 ratio.
Nevertheless, for now any small business or individual user can achieve a much better performance per dollar for FP64 throughput by buying Intel Battlemage GPUs, which have a 1:8 FP64/FP32 throughput ratio. This is much better than you can achieve by emulating FP64 on NVIDIA or AMD GPUs.
Intel B580 is a small GPU, so it has only a FP64 throughput about equal to a Ryzen 9 9900X and smaller than a Ryzen 9 9950X. However it provides that throughput at a much lower price. Thus if you start with a PC with a 9900X/9950X, you can double or almost double the FP64 throughput for a low additional price with an Intel GPU. Multiple GPUs will proportionally multiply the throughput.
The sad part is that with the current Intel CEO and with NVIDIA being a shareholder of Intel, it is unclear whether Intel will continue to compete in the GPU market, or they will abandon it, leaving us at the mercy of NVIDIA and AMD, which both refuse to provide products with good FP64 support to small businesses and individual users.
Yeah fair enough. The exponent of an FP32 has only 8 bits instead of 11 bits. I'll make an edit to make this explicit.
It's also fairly interesting how Nvidia handles this for the Ozaki scheme: https://docs.nvidia.com/cuda/cublas/#floating-point-emulatio.... They generally need to align all numbers in a matrix row to the maximum exponent (of a number in the row) but depending on scale difference of two numbers this might not be feasible without extending the number of mantissa bits significantly. So they dynamically (Dynamic Mantissa Control) decide if they use Ozaki's scheme or execute on native FP64 hardware. Or they let the user decide on the number of mantissa bits (Fixed Mantissa Control) which is faster but has no longer the guarantees for FP64 precision.
Yeah, double-word floating-point loses many of the desirable properties of the usual floating-point.
No mention of the Radeon VII from 2019 where for some unfathomable reason AMD forgot about the segmentation scam and put real FP64 into a gaming GPU. From this 2023 list, it's still faster at FP64 than any other consumer GPU by a wide margin (enterprise GPU's aren't in the list). Scroll all the way to the end.
https://www.eatyourbytes.com/list-of-gpus-by-processing-powe...
They did a mild segmentation with that one, by reducing the throughput from 1:2 to 1:4 in the consumer variant, with the hope of forcing people to buy the "professional" version.
Even with the throughput reduction, Radeon VII had a performance somewhat better than the previous best FP64 product, AMD Hawaii, due to the large and fast memory. Most later consumer GPUs from NVIDIA and AMD have never approached again such a high memory interface throughput.
Radeon VII has remained for many years the champion of FP64 performance per dollar. I am still using one bought in 2019, 7 years ago.
Last year was the first time when a GPU with good FP64 performance per dollar has appeared again after Radeon VII: Intel Battlemage B580. Unfortunately it is a small GPU, but nonetheless the performance per dollar is excellent.
Thats because Radeon VIIs were just AMD Instinct MI50 server gpus which didn't make the cut or were left over.
I'm not sure why the article dismisses cost.
Let's say X=10% of the GPU area (~75mm^2) is dedicated to FP32 SIMD units. Assume FP64 units are ~2-4x bigger. That would be 150-300mm^2, a huge amount of area that would increase the price per GPU. You may not agree with these assumptions. Feel free to change them. It is an overhead that is replicated per core. Why would gamers want to pay for any features they don't use?
Not to say there isn't market segmentation going on, but FP64 cost is higher for massively parallel processors than it was in the days of high frequency single core CPUs.
> Assume FP64 units are ~2-4x bigger.
I'm pretty sure that's not a remotely fair assumption to make. We've seen architectures that can eg. do two FP32 operations or one FP64 operation with the same unit, with relatively low overhead compared to a pure FP32 architecture. That's pretty much how all integer math units work, and it's not hard to pull off for floating point. FP64 units don't have to be—and seldom have been—implemented as massive single-purpose blocks of otherwise-dark silicon.
When the real hardware design choice is between having a reasonable 2:1 or 4:1 FP32:FP64 ratio vs having no FP64 whatsoever and designing a completely different core layout for consumer vs pro, the small overhead of having some FP64 capability has clearly been deemed worthwhile by the GPU makers for many generations. It's only now that NVIDIA is so massive that we're seeing them do five different physical implementations of "Blackwell" architecture variants.
A FP64 unit can share most of two FP32 units.
Only the multiplier is significantly bigger, up to 4 times. Some shifters may also be up to twice bigger. The adders are slightly bigger, due to bigger carry-look-ahead networks.
So you must count mainly the area occupied by multipliers and shifters, which is likely to be much less than 10%.
There is an area increase, but certainly not of 50% (300 m^2). Even an area increase of 10% (e.g. 60-70 mm^2 for the biggest GPUs seems incredibly large).
Reducing the FP64/FP32 throughput ratio from 1:2 to 1:4 or at most to 1:8 is guaranteed to make the excess area negligible. I am sure that the cheap Intel Battlemage with 1:8 does not suffer because of this.
Any further reductions, from 1:16 in old GPUs until 1:64 in recent GPUs cannot have any other explanation except the desire for market segmentation, which eliminates small businesses and individual users from the customers who can afford the huge prices of the GPUs with FP64 support.
> Assume FP64 units are ~2-4x bigger.
I'm not a hardware guy, but an explanation I've seen from someone who is says that it's not much extra hardware to add to a 2×f32 FMA unit the capability to do 1×f64. You already have all of the per-bit logic, you mostly just need to add an extra control line to make a few carries propagate. So the size overhead of adding FP64 to the SIMD units is more like 10-50%, not 100-300%.
Most of the logic can be reused, but the FP64 multiplier is up to 4 times larger. Also some shifters are up to 2 times larger (because they need more stages, even if they shift the same number of bits). Small size increases occur in other blocks.
Even so, the multipliers and shifters occupy only a small fraction of the total area, a fraction that is smaller then implied by their number of gates, because they have very regular layouts.
A reduction from the ideal 1:2 FP64/FP32 throughput to 1:4 or in the worst case to 1:8 should be enough to make negligible the additional cost of supporting FP64, while still keeping the throughput of a GPU competitive with a CPU.
The current NVIDIA and AMD GPUs cannot compete in FP64 performance per dollar or per watt with Zen 5 Ryzen 9 CPUs. Only Intel B580 is better in FP64 performance per dollar than any CPU, though its total performance is exceeded by CPUs like 9950X.
Why would gamers want to pay for any features they don't use?
Obviously they don't want to. Now flip it around and ask why HPC people would want to force gamers to pay for something that benefits the HPC people... Suddenly the blog post makes perfect sense.
Similar to when Nvidia released LHR GPUs that nerfed performance for Ethereum mining.
NVIDIA GeForce RTX 3060 LHR which tried to hinder mining at the bios level.
The point wasn't to make the average person lose out by preventing them mining on their gaming GPU. But to make miners less inclined to buy gaming GPUs. They also released a series of crypto mining GPUs around the same time.
So fairly typical market segregation.
https://videocardz.com/newz/nvidia-geforce-rtx-3060-anti-min...
NVIDIA could make 2 separate products, a GPU for gamers and a FP accelerator for HPC.
Thus everybody would pay for what they want.
The problem is that both NVIDIA and AMD do not want to make, like AMD did until a decade ago and NVIDIA stopped doing a few years earlier, a FP accelerator of reasonable size and which would be sold at a similar profit margin with their consumer GPUs.
Instead of this, they want to sell only very big FP accelerators and at huge profit margins, preferably at 5-digit prices.
This makes impossible for small businesses and individual users to use such FP accelerators.
Those are accessible only for big companies, who can buy them in bulk and negotiate lower prices than the retail prices, and who will also be able to keep them busy for close to 24/7, in order to be able to amortize the excessive profit margins of the "datacenter" GPU vendors.
A decade and a half ago, the market segmentation was not yet excessive, so I was happy to buy "professional" GPUs with unlocked FP64 throughput at roughly twice the price of the consumer GPUs.
Nowadays I can no longer afford such a thing, because the equivalent GPUs are no longer 2 times more expensive, but 20 to 50 times more expensive.
So over the last two decades I first shifted much of my computation from CPUs to GPUs, and then had to shift it back to CPUs, because there is no upgrade path for my old GPUs: any newer GPU available to me is slower at FP64, not faster.
Throughout this article you have been voicing a desire for affordable, high-throughput FP64 processors, blaming vendors for not building the product you desire at a price you are willing to pay.
We hear you: your needs are not being met. Your use case is not profitable enough to justify paying the sky-high prices they now demand, particularly because you don't need to run the workload 24/7.
What alternatives have you looked into? For example, Blackwell nodes are available from the likes of AWS.
I think that you might have confused me with the author of the article.
American companies have a pronounced preference for business-to-business products, where they can sell large quantities in bulk at very large profit margins that would not be accepted by small businesses or individual users, who spend their own money rather than an anonymous employer's.
If that is the only way for them to be profitable, good for them. However, such policies do not deserve respect. They demonstrate inefficiencies in the management of these companies, which prevent them from competing effectively in markets for low-margin commodity products.
From my experience, I am pretty certain that a smaller-die version of the AMD "datacenter" GPUs could be made and sold profitably, as such GPUs were a decade ago when AMD was still making them. Today, however, they no longer have any incentive to do so: they are content to sell fewer units at much higher margins, and they feel no pressure to tighten their costs.
Fortunately, at least in CPUs there has been steady progress, and AMD Zen 5 is a great leap in floating-point throughput, exceeding the performance of older GPUs.
I am not blaming vendors for not building the product that I desire, but I am disappointed that years ago they fooled me into wasting time porting applications to their products, which I bought instead of spending the money on something else, only for them to discontinue those products with no upgrade path.
Because I am old enough to remember what happened 15 to 20 years ago, I am annoyed by the hypocrisy of some of the NVIDIA CEO's speeches, repeated for several years after the introduction of CUDA, which amounted to promises that NVIDIA's goal was to put a "supercomputer" on everyone's desk, only for him to pivot completely away from those claims and strip FP64 from "consumer" GPUs in order to sell "enterprise" GPUs at inflated prices. This soon prompted AMD to imitate the same strategy.
Off the top of my head, the overhead is around 10% or so.
https://www.youtube.com/watch?v=lEBQveBCtKY
Apparently FP80, which is even wider than FP64, is beneficial for pathfinding algorithms in games.
Pathfinding for hundreds of units is a task worth putting on the GPU.
Has FP80 ever existed anywhere other than x87?
10% sounds implausibly high. Even on GPUs, most of the area is various memories and interconnect.
FP64 performance is limited on consumer GPUs because the US government deems it important to nuclear weapons research.
Past a certain threshold of FP64 throughput, your chip goes into a separate export category and is subject to more regulation about who you can sell to, including know-your-customer requirements. FP32 throughput does not count toward this threshold.
https://en.wikipedia.org/wiki/Adjusted_Peak_Performance
It is not a market segmentation tactic and has been around since 2006. It's part of the mind-numbing annual export control training I get to take.
It's surprising that this restriction continues to linger at all. The newest nuclear warhead models in the US arsenal were developed in the 1970s, when supercomputer performance was well below 1 gigaflop. When the US stopped testing nuclear warheads in 1992, top end supercomputers were under 10 gigaflops. The only thing the US arsenal needs faster computers for is simulating the behavior of its aging warhead stockpile without physical tests, which is not going to matter to a state building its first nuclear weapons.
Can't wait until they update this to also include export controls around FP8 and FP4 etc. to combat deepfakes, and then all of a sudden we can't buy increasingly powerful consumer GPUs.
This is so interesting, especially given that it is in theory possible to emulate FP64 using FP32 operations.
I do think though that Nvidia generally didn't see much need for more FP64 in consumer GPUs since they wrote in the Ampere (RTX3090) white paper: "The small number of FP64 hardware units are included to ensure any programs with FP64 code operate correctly, including FP64 Tensor Core code."
I'll try adding an additional graph where I plot the APP values for all consumer GPUs up to 2023 (when the export control regime changed) to see if the argument of Adjusted Peak Performance for FP64 has merit.
Do you happen to know, though, whether GPUs count as vector processors under these regulations? The weighting factor changes depending on the definition.
https://www.federalregister.gov/documents/2018/10/24/2018-22... What I found so far is that under Note 7 it says: "A ‘vector processor’ is defined as a processor with built-in instructions that perform multiple calculations on floating-point vectors (one-dimensional arrays of 64-bit or larger numbers) simultaneously, having at least 2 vector functional units and at least 8 vector registers of at least 64 elements each."
Nvidia GPUs have only 32 threads per warp, so I suppose they don't count as a vector processor (which seems a bit weird but who knows)?
Wikipedia links to this guide to the APP, published in December 2006 (much closer to when the rule itself came out): https://web.archive.org/web/20191007132037/https://www.bis.d.... At the end of the guide is a list of examples.
Only two of these examples meet the definition of vector processor, and these are very clearly classical vector computers: the Cray X1E and the NEC SX-8 (as in, if you were preparing a guide on the historical development of vector processing, you would explicitly include these systems or their ancestors as canonical examples of what you mean by a vector supercomputer). And the definition is pretty clearly tailored to make sure that the SIMD units in existing CPUs wouldn't qualify as vector processors.
The interesting case to point out is the last example, a "Hypothetical coprocessor-based Server" which hypothetically describes something that is actually extremely similar to the result of GPGPU-based HPC systems: "The host microprocessor is a quad-core (4 processors) chip, and the coprocessor is a specialized chip with 64 floating-point engines operating in parallel, attached to the host microprocessor through a specialized expansion bus (HyperTransport or CSI-like)." This hypothetical system is not a "vector processor," it goes on to explain.
From what I can find, it seems that neither NVIDIA nor the US government considers GPUs to be vector processors, and thus they get the 0.3 rather than the 0.9 weight.
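To make the weighting concrete, here is a tiny C++ sketch of how the (pre-2023) APP number comes out under that assumption. The TFLOPS figure is a placeholder, and the formula is my reading of the rule (64-bit peak FLOPS times the weighting factor, summed over processors, expressed in Weighted TeraFLOPS):

    #include <cstdio>

    // Adjusted Peak Performance as I read the pre-2023 rule: 64-bit peak FLOPS
    // times a weighting factor, summed over processors, in Weighted TeraFLOPS (WT).
    // W = 0.9 for "vector processors", 0.3 for everything else.
    static double app_wt(double fp64_tflops, bool is_vector_processor) {
        double w = is_vector_processor ? 0.9 : 0.3;
        return fp64_tflops * w;
    }

    int main() {
        // Placeholder throughput, not any particular product's number.
        double tflops = 10.0;
        printf("Counted as non-vector: %.2f WT\n", app_wt(tflops, false)); // 3.00 WT
        printf("Counted as vector:     %.2f WT\n", app_wt(tflops, true));  // 9.00 WT
        return 0;
    }

So the vector-processor question matters quite a bit: the same chip sits three times further below (or above) the control threshold depending on how it is classified.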
> it is in theory possible to emulate FP64 using FP32 operations
I’d say it’s better than theory: you can definitely use float2 pairs of fp32 floats to emulate higher precision, and quad precision too, using float4. Here’s the code: https://andrewthall.com/papers/df64_qf128.pdf
Also note it’s easy to emulate fp64 using entirely integer instructions. (As a fun exercise, I attempted both doubles and quads in GLSL: https://www.shadertoy.com/view/flKSzG)
While it’s relatively easy to do, these approaches are a lot slower than fp64 hardware. My code is not optimized, not IEEE compliant, and not bug-free, but the emulated doubles are at least an order of magnitude slower than fp32, and the quads are two orders of magnitude slower. I don’t think Andrew Thall’s df64 can achieve a 1:4 float-to-double perf ratio either.
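If it helps anyone reading along, here is a minimal CPU-side sketch of the float-pair idea in plain C++. It is not Thall's exact code, just the classic error-free two_sum / two_prod building blocks that df64-style libraries are built from, with a toy demo in main (compile without -ffast-math, which would break the error-free transforms):

    #include <cmath>
    #include <cstdio>

    // A value stored as an unevaluated sum hi + lo of two floats, giving roughly
    // twice the significand bits of a single float (but NOT a wider exponent range).
    struct ds { float hi, lo; };

    // Error-free addition (Knuth's two-sum): s.hi + s.lo == a + b exactly.
    static ds two_sum(float a, float b) {
        float s = a + b;
        float bb = s - a;
        float e = (a - (s - bb)) + (b - bb);
        return {s, e};
    }

    // Error-free multiplication via FMA: p.hi + p.lo == a * b exactly.
    static ds two_prod(float a, float b) {
        float p = a * b;
        float e = std::fmaf(a, b, -p);
        return {p, e};
    }

    // Renormalize so that lo is a small correction to hi.
    static ds renorm(float hi, float lo) {
        float s = hi + lo;
        return {s, lo - (s - hi)};
    }

    static ds ds_add(ds a, ds b) {
        ds s = two_sum(a.hi, b.hi);
        return renorm(s.hi, s.lo + a.lo + b.lo);
    }

    static ds ds_mul(ds a, ds b) {
        ds p = two_prod(a.hi, b.hi);
        return renorm(p.hi, p.lo + a.hi * b.lo + a.lo * b.hi);
    }

    // Split a double constant into a float pair, and recombine for printing.
    static ds from_double(double x) {
        float hi = (float)x;
        return {hi, (float)(x - (double)hi)};
    }
    static double to_double(ds a) { return (double)a.hi + (double)a.lo; }

    int main() {
        ds third = from_double(1.0 / 3.0);
        ds sq = ds_mul(third, third);            // (1/3)^2 in float-pair arithmetic
        ds twothirds = ds_add(third, third);     // 1/3 + 1/3 in float-pair arithmetic
        printf("float:         %.17g\n", (double)((1.0f / 3.0f) * (1.0f / 3.0f)));
        printf("double-single: %.17g\n", to_double(sq));
        printf("double:        %.17g\n", (1.0 / 3.0) * (1.0 / 3.0));
        printf("2/3 as pair:   %.17g\n", to_double(twothirds));
        return 0;
    }

On a GPU you would write the same thing in GLSL or CUDA with float2; the arithmetic is identical, which is also why it inherits float's exponent range.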
And I'm not sure, but I don't think CUDA SMs are vector processors per se: not because of the fixed warp size, but more broadly because of the design and instruction set. I could be completely wrong though, and Tensor Cores totally might count as vector processors.
What is easy to do is to emulate FP128 with FP64 (double-double) or even FP256 with FP64.
The reason is that the exponent range of FP64 is typically sufficient to avoid overflows and underflows in most applications.
On the other hand, the exponent range of FP32 is insufficient for most scientific-technical computing.
For an adequate exponent range, you must use either three FP32 per FP64, or two FP32 and an integer. In this case the emulation becomes significantly slower than the simplistic double-single emulation.
With the simpler double-single emulation, you cannot expect to just plug it into most engineering applications, e.g. SPICE for electronic circuit simulation, and see that the application works. Some applications could be painstakingly modified to work with such an implementation, but that is not normally an option.
So to be interchangeable with the use of standard FP64 you really must also emulate the exponent range, at the price of much slower emulation.
I did this at some point in the past, but today it makes no sense in comparison with the available alternatives.
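To make the exponent-range limitation concrete, here is a tiny illustrative C++ sketch; the value 1e60 is just an arbitrary magnitude that standard FP64 handles trivially:

    #include <cfloat>
    #include <cstdio>

    int main() {
        // A float pair inherits float's exponent range (FLT_MAX is about 3.4e38),
        // while real FP64 reaches about 1.8e308.
        double x = 1e60;                       // arbitrary large intermediate value
        float hi = (float)x;                   // overflows to +inf on IEEE hardware
        float lo = (float)(x - (double)hi);    // becomes -inf; the pair is now useless

        printf("FLT_MAX = %g, DBL_MAX = %g\n", (double)FLT_MAX, DBL_MAX);
        printf("1e60 as a float pair: hi = %g, lo = %g\n", (double)hi, (double)lo);
        return 0;
    }

Carrying a separate integer exponent, or a third float, fixes this, at the cost in speed described above.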
Today, the best FP64 performance per dollar, by far, is achieved with a Ryzen 9950X or Ryzen 9900X, in combination with Intel Battlemage B580 GPUs.
When money does not matter, you can use AMD Epyc in combination with AMD "datacenter" GPUs, which would achieve much better performance per watt, but the performance per dollar would be abysmally low.
Oh yes, I forgot to mention it; you’re absolutely right: Thall’s method for df64 and qf128 gives you a double/quad-precision mantissa with a single-precision exponent range, and the paper is clear about that.
FWIW, my own example (emulating doubles/quads with ints) gives the full exponent range with no wasted bits since I’m just emulating IEEE format directly.
Of course there are also bignum libraries that can do arbitrary precision. I guess one of the things I meant to convey but didn’t say directly is that using double precision isn’t export controlled, as one might read the top of this thread to imply, but a certain level of fp64 performance might be.
To me it is crazy that NVIDIA somehow got away with telling owners of consumer-grade hardware that it cannot be used in datacenters.
My understanding is this was not enforceable in Europe, and maybe elsewhere
> My understanding is this was not enforceable in Europe, and maybe elsewhere
"Not enforceable" just means they can't sue you. It doesn't mean they can't say "We won't sell to you anymore".
Code 43 was a worldwide thing before we found a workaround.
Yep, I do GPU passthrough to virtual machines because I would not let Windows touch my bare metal. You have to patch your ROM headers and hide the fact that you're in a VM from the OS.
So even as an end-user, a single person, I cannot naturally use my card how I please without significant technical investment. Imagine buying a $1000 piece of equipment and then being told what you can and can't do with it.
They did it by limiting the supply of cards. Even if you are ready to pay 4x MSRP, you can't buy 100 cards at once. Many consumers bought a single GPU at 2-4x MSRP.
Table comparing Blackwell Ultra B300 to B200 (-97% FP64 performance): https://www.forum-3dcenter.org/vbulletin/showpost.php?p=1380...
I hope for their fall. I invest in their success.
A question that has been bugging me for a while is: what will NVIDIA do with its HPC business? By HPC I mean clusters intended for non-AI workloads. Are they going to cater to them separately, or are they going to tell them to just emulate FP64?
Hopper had 60 TF FP64, Blackwell has 45 TF, and Rubin has 33 TF.
It is pretty clear that Nvidia is sunsetting FP64 support, and they are selling a story that no serious computational scientist I know believes, namely that you can use low precision operations to emulate higher precision.
See for example, https://www.theregister.com/2026/01/18/nvidia_fp64_emulation...
It seems the emulation approach is slower, has more errors, and applies only to FP64 matrix operations, not to FP64 vector operations.
This is kind of amazing - I still have a bunch of Titan V's (2017-2018) that do 7 TF FP64. 8 years old and managing 1/4 of what Rubin does, and the numbers are probably closer if you divide by the power draw.
(Needless to say, the FP32 / int8 / etc. numbers are rather different.)
For a long time AMD has been offering much better FP64 performance than NVIDIA, in their CDNA GPUs (which continue the older AMD GCN ISA, instead of resembling the RDNA used in gaming GPUs).
Nevertheless, AMD GPUs continue to have their old problems: weak software support, so-so documentation, and software incompatibility with the cheap GPUs that a programmer could use directly for developing applications.
There is a promise that AMD will eventually unify the ISA of their "datacenter" and gaming GPUs, as NVIDIA has always done, but it is unclear when this will happen.
Thus they are a solution only for big companies or government agencies.
AMD MI430X is taking that market.
This article is so dumb. NVIDIA delivered what the market wanted: gamers don't need FP64, so they don't waste silicon on it. Now enterprise doesn't want FP64 anymore, and they are reducing the silicon for it too.
It's a weird way to frame delivering exactly what the consumer wants as some big market-segmentation, fuck-the-user conspiracy.
Your framing is what's backwards. NVIDIA artificially nerfed FP64 for a long time before they started making multiple specialized variants of their architectures. It's not a conspiracy theory; it's historical fact that they shipped the same die with drastically different levels of FP64 capability. In a very real way, consumers were paying for transistors they couldn't use, subsidizing the pro parts.
> subsidizing the pro parts.
You got this wrong way around. It's the high margin (pro) products subsidizing low margin (consumer) products.
In general, yes, but when consumer parts are spending silicon area on features they can't use, it is happening in the other direction too.
This isn't really true, and it wouldn't be a big deal even if it was.
Die areas for consumer card chips are smaller than die areas for datacenter card chips, and this has held for a few generations now. They can't possibly be the same chips, because they are physically different sizes. The lowest-end consumer dies are less than 1/4 the area of datacenter dies, and even the highest-end consumer dies are only like 80% the area of datacenter dies. This implies there must be some nontrivial differentiation going on at the silicon level.
Secondly, you are not paying for the die area anyway. Whether a chip was made specifically for that exact model of GPU, or was binned from a larger design after possibly defective areas were fused off, you are paying for the end product. If that product meets the expected performance, it is doing its job. This is not a subsidy (at least, not in that direction); the die is just one small part of what makes a usable GPU card, and excess die area left dark isn't even pure waste, as it helps with heat dissipation.
The fact that NVIDIA excludes decent FP64 from all of its prosumer offerings (*) can still be called "artificial" insofar as it is indeed done on purpose for market segmentation, but it's not some trivial trick. They really are just not putting it into the silicon. By now this has been the case for longer than it wasn't.
* = The Quadro line of "professional" workstation cards is nowadays just consumer cards with ECC RAM and special drivers.
> consumers were paying for transistors they couldn’t use
This is Econ 101 these days. It’s cheaper to design and manufacture 1 product than 2. Many many products have features that are enabled for higher paying customers, from software to kitchen appliances to cars, and much much more.
The combined product design is also subsidizing some of the costs for everyone, so be careful what you wish for. If you could use all the transistors you have, you’d be paying more either way, either because design and production costs go up, or because you’re paying for the higher end model and being the one subsidizing the existence of the high end transistors other people don’t use.