I find it hard to believe that it actually is a microcode issue.
Mostly because Intel has way too much motivation to pass it off as a microcode issue: they can fix a microcode issue for free by pushing out a patch, whereas an actual hardware issue would force them to recall all the faulty CPUs, which could cost them billions.
The other reason is that it took them way too long to give details. If it's as simple as a buggy microcode requesting an out-of-spec voltage from the motherboard, they should have been able to diagnose the problem extremely quickly and fix it in just a few weeks. They would have detected the issue as soon as they put voltage logging on the motherboard's VRM. And according to some sources, Intel have apparently been shipping non-faulty CPUs for months now (since April, from memory), and those don't have an updated microcode.
This long delay and silence feels like they spent months of R&D trying to create a workaround: a new voltage spec that provides the lowest voltage possible, low enough to work around a hardware fault on as many units as possible, without too large a performance regression or new errors on other CPUs from undervolting.
I suspect that this microcode update will only "fix" the crashes for some CPUs. My prediction is that in another month Intel will claim there are actually two completely independent issues, and reluctantly issue a recall for anything not fixed by the microcode.
As I understand it, there are multiple voltages inside the CPU, so just monitoring the motherboard VRM won't cut it.
That said, I too am very skeptical. I just issued a moratorium on the purchase of anything Intel 13th/14th gen in our company and am waiting for some actual proof that the issue is fully resolved.
On Raptor Lake, there are a few integrated voltage regulators which provide voltages for specialised uses (like the E cores' L2 cache, parts of the DDR memory IO, and the PCI-E IO), but the current draw on those regulators is pretty low. The bulk of the power comes directly from the motherboard VRMs on one of several rails with no internal regulation. Most of the power draw is grouped onto just two rails: VccGT for the GPU, and VccCore (also known as VccIA in other generations), which powers all the P-cores, all the E-cores, the ring bus, and the last-level cache.
Which means all cores share the same voltage, and it's trivial to monitor externally.
I guess it's possible the bug could be with only one of the integrated voltage regulators, but those seem to only power various IO devices, and I struggle to see how they could trigger this type of instability.
Keep in mind that the L2 cache is the last level cache for the E cores, and is shared by the entire cluster of four E cores. (One of the two clusters connects to the ring bus and shares the main L3, the other goes directly to main memory)
I'm guessing Intel can shut down VccCore entirely (which wipes every other cache), while keeping just enough voltage to maintain the E core L2 cache. By keeping valid data in L2, they can resume execution on an E core much quicker.
And as long as the reason for waking is a small periodic housekeeping task, they don't even need to wake up main memory. All the data fits in the 2MB of L2 cache. This makes resuming even faster and saves even more power. Finally, quick resumes allow the task to complete quicker and shut down VccCore again, which saves even more power.
This extreme level of power saving isn't really useful for desktops, but very useful for laptops and tablets. BTW, I'm not talking about a sleep mode here; the CPU will ideally be able to enter this mode anytime there are no tasks to run for at least the next millisecond, so it can save power even when the user is actively using the system.
It's most likely both a hardware issue and a microcode issue.
Making CPUs is kind-of like sorting eggs. When they're made, they all have slightly different characteristics and get placed into bins (IE, "binned") based on how they meet the specs.
To oversimplify, the cough "better" chips are sold at higher prices because they can run at higher clock speeds and/or handle higher voltages. If there's a spec of dust on the die, a feature gets turned off and the chip is sold for a lower price.
In this case, this is most likely an edge case that would not be a defect if shipping microcode already handled it. (Although it is appropriate to ask whether affected chips should go into a lower-price bin.)
> If there's a spec of dust on the die, a feature gets turned off and the chip is sold for a lower price.
Do you mean that if a 13900KS CPU has a manufacturing defect, it gets downgraded and sold as 13900F or something else according to the nature of the defect?
For any named product (such as Raptor Lake), Intel only makes 1-3 unique silicon dies. Any product in the lineup is one of those dies, binned and with features disabled to hit the target SKU.
Alder Lake only had two dies, 8P+8E and 6P+0E [1]. Every single SKU comes from those two dies; if it has E cores, it's the 8P+8E die. Which means Alder Lake-N is actually the 8P+8E die with all the P cores disabled.
The laptop versions, Alder Lake-P (20w) and Alder Lake-U (9 and 15w) are also the 8P+8E die, they couldn't use the 6P+0E die, because it has no E cores at all.
Raptor Lake is only one die with 8 P cores and 16 E cores, which they sell as every i9 and i7, along with the two top i5 designs. In the 13th generation, the remaining i5s are the Alder Lake 8P+8E die and the i3s are all Alder Lake 6P+0E dies.
The manufacturing defects aren't binary, it's not a simple pass/fail. It's all very analog: Some dies are simply able to reach higher clock speeds, or use more or less power. They test every single die and bin it based on its capabilities. The ones with the best power consumption go to the P and U SKUs. The ones which can reach the highest clock speeds are labeled as 13900KS, dies which just miss that get sold as 13900K, the rest get spread over all remaining SKUs based on their capabilities.
Intel couldn't decide to exclusively make 13900KS dies even if they wanted to, because those are simply the top 0.1% of dies. They are forced to make 1000 dies, use the best one, and sell the rest as lower SKUs.
Silicon lottery was when you as a customer could get dies of varying quality, some of which could be clocked higher than others. For the manufacturer it's not a lottery at all, because at their scale the yields for the various bins are mostly predictable. Binning also means that you as a customer are much less likely to get a chip that is significantly better than specced, although it still happens when chips are sold as a lower bin for market segmentation purposes.
The months of R&D to create a workaround could simply be because the subset of motherboards which trigger this issue are doing something borderline/unexpected with their voltage management, and finding a workaround for that behaviour in CPU microcode is non-trivial. Not all motherboard models appear to trigger the fault, which suggests that motherboard behaviour is at least a contributing factor to the problem.
Towards the middle of the video it brings up some very interesting evidence, from online game server farms that use 13900 and 14900 variants for their high single-core performance for the cost, but with server-grade motherboards and chipsets that do not do any overclocking, and would be considered "conservative". But these environments show a very high statistical failure rate for these particular CPU models. This suggests that some high percentage of CPUs produced are affected, and it's long run-time over which the problem can develop, not just enthusiast/gamer motherboards pushing high power levels.
All modern CPUs come out of the factory with many many bugs. The errata you see published are only the ones that they find after shipping (if you're lucky, they might not even publish all errata). Many bugs are fixed in testing and qualification before shipping.
That's how CPU design goes. The way that is done is by pushing as much to firmware as possible, adding chicken switches and fallback paths, and all sorts of ways to intercept regular operation and replace it with some trap to microcode or flush or degraded operation.
Applying fixes and workaround might cost quite a bit of performance (think spectre disabling of some kinds of branch predictors for an obvious very big one). And in some cases you even see in published errata they leave some theoretical correctness bugs unfixed entirely. Where is the line before accepting returns? Very blurry and unclear.
Almost certainly, huge parts of their voltage regulation (which goes along with frequency, thermal, and logic throttling) will be highly configurable. Quite likely it's run by entirely programmable microcontrollers on chip. Things that are baked into silicon might be voltage/droop sensors, temperature sensors, etc., and those could behave unexpectedly, although even then there might be redundancy or ways to compensate for small errors.
I don't see they "passed it off" as a microcode issue, just said that a microcode patch could fix it. As you see it's very hard from the outside to know if something can be reasonably fixed by microcode or to call it a "microcode issue". Most things can be fixed with firmware/microcode patches, by design. And many things are. For example if some voltage sensor circuit on the chip behaved a bit differently than expected in the design but they could correct it by adding some offsets to a table, then the "issue" is that silicon deviates from the model / design and that can not be changed, but firmware update would be a perfectly good fix, to the point they might never bother to redo the sensor even if they were doing a new spin of the masks.
On the voltage issue, they did not say it was requesting an out of spec voltage, they said it was incorrect. This is not necessarily detectable out of context. Dynamic voltage and frequency scaling and all the analog issues that go with it are fiendishly complicated; the voltage requested from a regulator is not what gets seen at any given component of the chip, and loads, switching, capacitance, frequency, temperature, etc., can all conspire to change these things. And modern CPUs run as close to the absolute minimum voltage/timing guard bands as possible to improve efficiency, and they boost up to as high a voltage as they can to increase performance. A small bug or error in some characterization data in this very complicated algorithm of many variables and large multi-dimensional tables could easily cause voltage/timing to go out of spec and cause instability. And it does not necessarily leave some nice log you can debug, because you can't measure voltage from all billion components in the chip on a continuous basis.
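To make that concrete, here's a toy sketch (made-up numbers and structure, nothing like Intel's actual pcode) of how a single bad characterization entry or a mis-signed offset can push a requested voltage out of the safe range without anything obviously failing:

    # Toy DVFS-style lookup; the table values and guard band are invented for illustration.
    VF_TABLE = {          # (frequency GHz, temperature C) -> base volts
        (5.0, 50): 1.25,
        (5.5, 50): 1.35,
        (6.0, 50): 1.45,
    }
    GUARD_BAND = 0.02     # margin for droop, aging, sensor error

    def requested_voltage(freq_ghz, temp_c, offset=0.0):
        # The regulator only sees the final number; it cannot tell whether
        # the table entry or the offset was characterized wrong.
        return VF_TABLE[(freq_ghz, temp_c)] + GUARD_BAND + offset

    print(f"{requested_voltage(6.0, 50):.2f} V")              # 1.47 V, intended
    print(f"{requested_voltage(6.0, 50, offset=0.10):.2f} V") # 1.57 V, silently out of spec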
And some bugs just take a while to find and fix. I'm not a tester per se but I found a logic bug in a CPU (not Intel but commercial CPU) that was quickly reproducible and resulted in a very hard lockup of a unit in the core, but it still took weeks to find it. Imagine some ephemeral analog bug lurking in a dusty corner of their operating envelope.
Then you actually have to develop the fix, then you have to run that fix through quite a rigorous testing process and get reasonable confidence that it solves the problem, before you would even make this announcement to say you've solved it. Add N more weeks for that.
So, not to say a dishonest or bad motivation from Intel is out of the question. But it seems impossible to make such speculations from the information we have. This announcement would be quite believable to me.
I agree with most of what you said, so cherry picking one thingy to reply to isn't my intention, but
"And some bugs just take a while to find and fix."
I think it's less that it took a while to find the bug, and more that they've been pretty much radio silent for six months. AMD had the issue with burning 7000 series CPUs; they were quick to at least put out a statement that they'd make customers whole again.
Well as it comes to Intel executive management and PR, I'm entirely unqualified to make any educated comment or speculation about it. I can't say I'm aware of Intel ever having great renown for its handling of product defects though.
The thing is, "incorrect" implies the existence of a static "correct". Which I interpret as a static spec which a microcode bug violated and could be fixed back to that static spec with a simple microcode update.
I do find your suggested scenario to be very plausible. That Intel have discovered their original voltage algorithm was flawed, leading to instability. And it is very feasible that simply updating the microcode is the correct fix for such an issue.
If Intel had explicitly stated that the original voltage algorithm spec was wrong, and the new one fixes the issue, I'd be pretty willing to believe them, and probably wouldn't have written that comment.
I'm not saying your interpretation of "incorrect voltage" as meaning "voltage that we now know causes instability" is wrong. It's an ambiguous statement and either interpretation is valid. But I have experience working with PR people, and they know how to avoid ambiguous statements.
PR people are also experts at using ambiguous statements to their advantage: crafting statements where not only are there multiple possible interpretations, but where the average reader will tend to interpret them in the best possible way. I have experience in helping PR people craft such statements. There are a few other examples of "ambiguous statements" in that statement, which leads me to question the honesty of the whole thing.
I believe that the waters may be muddied enough that they won't have to do a full recall, and will only replace chips if you 'provide evidence' the system is still crashing.
Except normally the result of a microcode workaround is that the chip no longer performs at its claimed/previously-measured level. Not "as good" by any standard.
For example, Intel CPU + Spectre mitigation is not "as good" as a CPU that didn't have the vulnerability in the first place.
Microcode changes don't have to affect performance negatively. Do you have any evidence this one will? If it's a voltage algorithm failure, then I would expect that they could run it as advertised with corrected microcode. Unstable power is a massive issue for electronics like this and I have no problem believing their explanation. Bad power causes all sorts of weird issues.
If it was a microcode bug to begin with, fixing the bug wouldn't need to degrade performance. If it was e.g. a bad sensor, that you can "correct" well enough by postprocessing, it doesn't need to degrade performance. But if it's essentially incorrect binning -- the hardware can't function as they thought it would, use microcode to limit e.g. voltage to the range where it works right -- then that will degrade performance.
At least with spectre applying the mitigation was a choice. You could turn it off and game at full speed, while turning it on for servers and web browsing for safety.
"Unfortunately for John, the branches made a pact with Satan
and quantum mechanics [...] In exchange for their last remaining
bits of entropy, the branches cast evil spells on future genera-
tions of processors. Those evil spells had names like “scaling-
induced voltage leaks” and “increasing levels of waste heat”
[...] the branches,
those vanquished foes from long ago, would have the last laugh."
"John was terrified by the collapse of the parallelism bubble,
and he quickly discarded his plans for a 743-core processor
that was dubbed The Hydra of Destiny and whose abstract
Platonic ideal was briefly the third-best chess player in Gary,
Indiana. Clutching a bottle of whiskey in one hand and a shot-
gun in the other, John scoured the research literature for ideas
that might save his dreams of infinite scaling. He discovered
several papers that described software-assisted hardware
recovery. The basic idea was simple: if hardware suffers more
transient failures as it gets smaller, why not allow software to
detect erroneous computations and re-execute them? This idea
seemed promising until John realized THAT IT WAS THE
WORST IDEA EVER. Modern software barely works when the
hardware is correct, so relying on software to correct hardware
errors is like asking Godzilla to prevent Mega-Godzilla from
terrorizing Japan. THIS DOES NOT LEAD TO RISING PROP-
ERTY VALUES IN TOKYO. It’s better to stop scaling your
transistors and avoid playing with monsters in the first place,
instead of devising an elaborate series of monster checks-
and-balances and then hoping that the monsters don’t do what
monsters are always going to do because if they didn’t do those
things, they’d be called dandelions or puppy hugs."
> According to my dad, flying in airplanes used to be fun... Everybody was attractive ....
this is how I feel about electric car supercharging stations at the moment. There is definitely a privilege aspect, which some attractive people are beneficiaries of in a predictable way, along with other expensive maintenance of their health and attractiveness.
so I could see myself saying the same thing to my children
Remains to be seen how the microcode patch affects performance, and how these CPUs that have been affected by over-voltage to the point of instability will have aged in 6 months, or a few years from now.
More voltage generally improves stability, because there is more slack to close timing. Instability with high voltage suggests dangerous levels. A software patch can lower the voltage from this point on, but it can't take back any accumulated fatigue.
I was recently looking at building and buying a couple systems. I've always liked Intel. I went AMD this time.
It seemed like the base frequencies vs boost frequencies were much farther apart on Intel than with most of the AMDs. This was especially true on the laptops, where cooling is a larger concern. So I suspect they were pushing limits.
Also, the performance core vs efficiency core stuff seemed kind of gimmicky with so few performance cores and so many efficiency cores. Like look at this 20 core processor! Oh wait, it's really an 8 core when it comes to performance. Hard to compare that to a 12 core 3D cached Ryzen with even higher clock...
I will say, it seems intel might still have some advantages. It seems AMD had an issue supporting ECC with the current chipsets. I almost went Intel because of it. I ended up deciding that DDR5 built in error correction was enough for me. The performance graphs also seem to indicate a smoother throughput suggesting more efficient or elegant execution (less blocking?). But on the average the AMDs seem to be putting out similar end results even if the graph is a bit more "spikey".
> It seems AMD had an issue supporting ECC with the current chipsets.
AMD has the advantage with regards to ECC. Intel doesn't support ECC at all on consumer chips, you need to go Xeon. AMD supports it on all chips, but it is up to the motherboard vendor to (correctly) implement. You can get consumer-class AM4/5 boards that have ECC support.
There was a strange happening with AMD laptop CPUs (“APUs”): the non-soldered DDR5 variants of the 7x40’s were advertised to support ECC RAM on AMD’s website up until a couple months before any actual laptops were sold, then that was silently changed and ECC is only on the PRO models now. I still don’t know if this is a straightforward manufacturing or chipset issue of some kind or a sign of market segmentation to come.
(I’m quite salty I couldn’t get my Framework 13 with ECC RAM because of this.)
Unfortunately not. I can't say for current gen, but the 5000 series APUs like the 5600G do not support ECC. I know, I tried...
But yes, most Ryzen CPUs do have ECC functionality, and have had it since the 1000 series, even if not officially supported. Official support for ECC is only on Ryzen PRO parts.
Intel has always randomly supported ECC on desktop CPUs. Sometimes it was just a few low end SKUs, sometimes higher end SKUs. For 14th gen it appears i9s and i7s do (didn't check i5s), but i3s did not.
My understanding is that it's screwed up for multiple vendors and chipsets. The boards might say they support it, but there are some updates saying it's not. It seemed extremely hard to find any that actually supported it. It was actually easier to find new Intel boards supporting ECC.
yeah wendell put out a video a few weeks ago exploring a bunch of problems with asrock rack-branded server-market B650 motherboards and basically the ECC situation was exactly what everyone warns about: the various BIOS versions wandered between "works, but doesn't forward the errors", "doesn't work, and doesn't forward the errors", and (excitingly) "doesn't work and doesn't even post". We are a year and a half after zen4 launched and there barely are any server-branded boards to begin with, and even those boards don't work right.
I don't know how many times it has to be said but "doesn't explicitly disable" is not the same thing as "support". There are lots of other enablement steps that are required to get ECC to work properly, and they really need to be explicitly tested with each release (which if it is "not explicitly disabled", it's not getting tested). Support means you can complain to someone when it doesn't work right.
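For what it's worth, one way to check at least part of this on Linux is to see whether the memory controller has registered with the kernel's EDAC subsystem and is exposing error counters; it won't prove correction works end to end, but it does catch the "doesn't forward the errors" failure mode. A minimal sketch:

    import glob, pathlib

    # EDAC exposes one mc* directory per registered memory controller,
    # each with corrected (ce_count) and uncorrected (ue_count) error counters.
    controllers = glob.glob("/sys/devices/system/edac/mc/mc*")
    if not controllers:
        print("No EDAC memory controller registered - ECC reporting is likely not working")
    for mc in controllers:
        ce = pathlib.Path(mc, "ce_count").read_text().strip()
        ue = pathlib.Path(mc, "ue_count").read_text().strip()
        print(f"{mc}: corrected={ce} uncorrected={ue}")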
AMD churns AGESA really, really hard and it breaks all the time. Partners have to try and chase the upstream and sometimes it works and sometimes it doesn't. Elmor (Asus's Bios Guy) talked about this on Overclock.net back around 2017-2018 when AMD was launching X399 and talked about some of the troubles there and with AM4.
That said, the current situation has seemingly lit a fire under the board partners, with Intel out of commission and all these customers desperate for an alternative to their W680/raptor lake systems (which do support ecc officially, btw) in these performance-sensitive niches or power-limited datacenter layouts, they are finally cleaning up the mess like, within the last 3 weeks or so. They've very quickly gone from not caring about these boards to seeing a big market opportunity.
can't believe how many times I've explained in the last month that yes, people do actually run 13700Ks in the datacenter... with ECC... and actually it's probably some pretty big names in fact. A previous video dropped the tidbit that one of the major affected customers is Citadel Capital - and yeah, those are the guys who used to get special EVEREST and BLACK OPS skus from intel for the same thing. Client platform is better at that, the very best sapphire rapids or epyc -F or -X3D sku is going to be like 75% of the performance at best. It's also the fastest thing available for serving NVMe flash storage (and Intel specifically targeted this, the Xeon E-2400 series with the C266 chipset can talk NVMe SAS natively on its chipset with up to 4 slimsas ports...)
Yeah I think that’s the bright spot, now that there’s a branded offering for server-flavored Ryzen now maybe there is a permanent justification for doing proper validation.
I just feel vindicated lol, it always comes up that “well works fine for me!” and the reality is it’s a total crapshoot with even server-branded boards often not working. There is zero chance your gigabyte UD3 or whatever is going to be consistently supported across bios and often it will not be.
And AMD is really really tied to AGESA releases, so it’s fairly important on that side. Although I guess maybe we’re seeing now what happens if you let too much be abstracted away… but on the other hand partners were blowing up AMD chips last year too.
If you’re comfortable always testing, and always having the possibility of there being some big AGESA problem and ecc being broken on the new versions… ok I guess.
There is a reason the i3 chips were perennial favorites for edge servers and NASs. And I think it's really, really hard to overstate the long-term damage from reputation loss here. Intel, meltdown aside, was always no-drama in terms of reliability. Other than C2000/C3000, I guess.
or at least... maybe on the CPU side they were no-drama. Other than C2000/C3000. Granted the powervr graphics on the atoms way back did suck... and meltdown... and avx-512 being rolled back... /phillip j fry counting on his fingers
maybe "blue-chip coded" is a better way to express it ig
but like, there is a notable decline in the quality of execution of intel overall, pretty much across the board, and cpu was always their core vertical, right? That was their business redoubt. intel is blue chip chips, especially CPUs. And now it's falling - really it's been falling for a while. Meltdown I can generally excuse (yes, shush), nobody appreciated sidechannels back then even if they were theoretically known. C2000/C3000 is another fuckup. yeah it's the super-io/serial bus controller... technically not their IP but it happens to be in a critical path, on their node, killing their processor. They fucked up the validation there, evidently.
I-225V had three steppings and I-226V is still not fully fixed (Windows/Linux have just turned off the EEE/802.3az feature instead). Puma was a god damned mess.
Sapphire rapids was late, still a huge mess, and actually the -W platform had not only insane power draw, but also insaner transients. 750W average, spiking up to 1500W under load, with pretty steep holdup requirements. And actually that was locked behind a "water cooled" bios option, the processor just "refused to all-core turbo" otherwise. And Intel didn't wanna actually say that the "water cooled" behavior was the spec or intentional turbo limits etc. In hindsight hmmm, that all took a bit of a different tone, didn't it?
Supposedly there is going to be a SPR-W refresh with a new stepping to fix this... emerald rapids is also very power-hungry and there were some unconfirmed murmurs suggesting it might have the same crash problems.
Intel's in some real danger especially with AMD ascendant like this. Like it doesn't take very long of this real damage to customers etc and that "we're blue-chip!" thing will cease to be, and that is the last prop keeping intel's finances above the water here. Sure, it will take a while to fully wind down but... this is a great example of how intel's fuckups are driving their clients literally into the arms of the competition. A month or two ago, Asrock Rack didn't give a shit about the B650-2L2T or whatever. Guess what? Now Epyc Mini exists and oems are going to be paying attention to that. Oops.
Damn, didn't realise that was still being problematic too. :(
And yeah, Intel's current stumble with 13th/14th gen cpus seems like worst possible timing for such an extreme fuck up. That's not going to go well for future planning/purchase decisions by business customers.
ECC support wasn't good initially on AM5, but there are now Epyc branded chips for the AM5 socket which officially support ECC DDR5. They come in the same flavors as the Ryzen 7xx0 chips, but are branded as Epyc.
More E-cores are reasonable for multithreaded application performance. They're efficient in both power and die area, as the name indicates, so Intel can fit more E-cores than P-cores into the same power/area budget. It's not suitable for workloads that need many high single-thread-performance cores, like a VM server, but I don't know of any major consumer use case that requires that.
I can sort of see that. The way I saw it explained was that they're much lower clocked and have a pretty small shared cache. I could see E cores as being great for running background processes and stuff. All the benchmarks seem to show the AMDs with 2/3rds of the cores being around the same performance and with similar power draw. I'm not putting them down. I'm just saying it seems gimmicky to say "look at our 20 core!" with the implicit idea that people will compare that with an AMD 12 core, seeing 20>12 but not seeing the other factors like cost and benchmarks.
> so few performance cores and so many efficiency cores
I was baffled by this too, but what they don't make clear is that the performance cores have hyperthreading while the efficiency cores do not.
So what they call 2P+4E actually becomes an 8 core system as far as something like /proc/cpuinfo is concerned. They're also the same architecture so code compiled for a particular architecture will run on either core set and can be moved from one to the other as the scheduler dictates.
> They're also the same architecture so code compiled for a particular architecture will run on either core set and can be moved from one to the other as the scheduler dictates.
I don't know if that has done more good than harm, since they ripped AVX-512 out for multiple generations to ensure parity.
A major differentiator is that Intel CPUs with E cores don’t allow the use of AVX-512, but all current AMD CPUs do. The new Zen 5 chips will run circles around Intel for any such workload. Video encoding, 3D rendering, and AI come to mind. For developers: many database engines can use AVX-512 automatically.
> Like look at this 20 core processor! Oh wait, it's really an 8 core when it comes to performance.
The E cores are about half as fast as the P cores depending on use case, at about 30% of the size. If you have a program that can use more than 8 cores, then that 8P+12E CPU should approach a 14P CPU in speed. (And if it can't use more than 8 cores then P versus E doesn't matter.) (Or if you meant 4P+16E then I don't think those exist.)
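Back-of-envelope with those numbers (the ~0.5x figure is the parent's rough estimate, not a measured constant):

    P_CORES, E_CORES = 8, 12
    E_RELATIVE_SPEED = 0.5   # assumed: an E core does roughly half the work of a P core
    effective = P_CORES + E_CORES * E_RELATIVE_SPEED
    print(effective)         # 14.0 "P-core equivalents" for embarrassingly parallel work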
> Hard to compare that to a 12 core 3D cached Ryzen with even higher clock...
Only half of those cores properly get the advantage of the 3D cache. And I doubt those cores have a higher clock.
AMD's doing quite well but I think you're exaggerating a good bit.
> If you have a program that can use more than 8 cores, then that 8P+12E CPU should approach a 14P CPU in speed
Only if you use work stealing queues or (this is ridiculously unlikely) run multithreaded algorithms that are aware of the different performance and split the work unevenly to compensate.
It’s a common strategy for small tasks where the overhead of dispatching the task greatly exceeds the computation of it. It’s also a better way to maximize L1/L2 cache hit rates by improving memory locality.
E.g. you have 100M rows and you want to cluster them by a distance function (naively): running dist(arr[i], arr[j]) is crazy fast; the problem is just that you have so many of them. It is faster to run it on one core than to dispatch it from one queue to multiple cores, but best to assign the work ahead of time to n cores and have them crunch the numbers.
It has always been a bad idea to dispatch so naively and dispatch to the same number of threads as you have cores. What if a couple cores are busy, and you spend almost twice as much time as you need waiting for the calculation to finish? I don't know how much software does that, and most of it can be easily fixed to dispatch half a million rows at a time and get better performance on all computers.
Also on current CPUs it'll be affected by hyperthreading and launch 28 threads, which would probably work out pretty well overall.
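A minimal sketch of that chunked-dispatch idea (the names and chunk size are illustrative, not from this thread): cut the rows into far more chunks than workers and let whichever worker is free pull the next one, so faster P-cores naturally chew through more chunks than E-cores without any architecture-specific code.

    from concurrent.futures import ProcessPoolExecutor
    import os

    CHUNK = 500_000  # rows per task: many more tasks than cores

    def process_chunk(rows):
        # stand-in for the real per-row work, e.g. pairwise distance checks
        return sum(len(str(r)) for r in rows)

    def run(all_rows):
        chunks = (all_rows[i:i + CHUNK] for i in range(0, len(all_rows), CHUNK))
        # one worker per logical CPU; idle (faster) workers keep pulling new chunks
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            return sum(pool.map(process_chunk, chunks))

    if __name__ == "__main__":
        print(run(list(range(2_000_000))))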
If you don't pin them to cores, the OS is still free to assign threads to cores as it pleases. Assuming the scheduler is somewhat fair, threads will progress at roughly the same rate.
Then you no longer have 14 cores in this example, but only len(P) cores. Also most code written in the wild isn’t going to use an architecture-specific library for this.
Yeah, the 20 core Intels are benchmarking about the same as the 12 core AMD X3Ds. But many people just see 20>12. Either one is more than fine for most people.
"Oh wait, it's really an 8 core when it comes to performance [cores]". So yes, should not be an 8 core all together, but like you said about 14 cores, or 12 with the 3D cache.
"And I doubt those cores have a higher clock."
I'm not sure what we're comparing them to. They should be capable of higher clock than the E cores. I thought all the AMD cores had the ability to hit the max frequency (but not necessarily at the same time). And some of the cores might not be able to take advantage of the 3D cache, but that doesn't limit their frequency, from my understanding.
It’s kind of funny and reminiscent of the AMD bulldozer days where they had a ton of cores compared to the contemporary Intel chips, especially at low/mid price points but the AMD chips were laughably underwhelming for single core performance which was even more important then.
I can’t speak to the Intel chips because I’ve been out of the Intel game for a long time but my 5700X3D does seem to happily run all cores at max clock speed.
> I'm not sure what we're comparing them to. They should be capable of higher clock than the E cores.
Oh, just higher clocked than the E cores. Yeah that's true, but if you're using that many cores at once you probably only care about total speed.
You said 12 core with higher clock versus 8, so I thought you were comparing to the performance cores.
> I thought all the AMD cores had the ability to hit the max frequency (but not necessarily at the same time).
The cores under the 3D cache have a notable clock penalty on existing CPUs.
> And some of the cores might not be able to take advantage of the 3D cache, but that doesn't limit their frequency, from my understanding.
Right, but my point is it's misleading to call out higher core count and the advantages of 3D stacking. The 3D stacking mostly benefits the cores it's on top of, which is 6-8 of them on existing CPUs.
It's due to the stacked cache being harder to cool and not supporting as high a voltage. So the 3D CCD clocks lower, but for some workloads it's still faster (mainly ones dealing with large buffers, like games; most compute-heavy benchmarks fit in normal caches and the non-3D V-Cache variants take the win).
Maybe a stretch - but this reminds me of blood sugar regulation for people with type 1 diabetes.
Too low is dangerous because you lose rational thought, and the ability to maintain consciousness or self-recover. However, despite not having the immediate dangers of being low, having high blood sugar over time is the condition which causes long-term organ damage.
I think it's telling that they are delaying the microcode patch until after all the reviewers publish their Zen5 reviews and the comparisons of those chips against current Raptorlake performance.
Because the benchmarks will still exist on the sites after the microcode is released and a lot of the sites won't bother to go back and update them with the accurate performance level.
Reminds me of Sudden Northwood Death Syndrome, 2002.
Looks like history may be repeating itself, or at least rhyming somewhat.
Back then, CPUs ran on fixed voltages and frequencies and only overclockers discovered the limits. Even then, it was rare to find reports of CPUs killed via overvolting, unless it was to an extreme extent --- thermal throttling, instability, and shutdown (THERMTRIP) seemed to occur before actual damage, preventing the latter from happening.
Now, with CPU manufacturers attempting to squeeze all the performance they can, they are essentially doing this overclocking/overvolting automatically and dynamically in firmware (microcode), and it's not surprising that some bug or (deliberate?) ignorance that overlooked reliability may have pushed things too far. Intel may have been more conservative with the absolute maximum voltages until recently, and of course small process sizes with higher potential for electromigration are a source of increased fragility.
Also anecdotal, but I have an 8th-gen mobile CPU that has been running hard against the thermal limits (100C) 24/7 for over 5 years (stock voltage, but with power limits all unlocked), and it is still 100% stable. This and other stories of CPUs in use for many years with clogged or even detached heatsinks seem to add to the evidence that high voltage is what kills CPUs, not heat or frequency.
Edit: I just looked up the VCore maximum for the 13th/14th processors - the datasheet says 1.72V! That is far more than I expected for a 10nm process. For comparison, a 1st-gen i7 (45nm) was specified at 1.55V absolute maximum, and in the 32nm version they reduced that to 1.4V; then for the 22nm version it went up slightly to 1.52V.
> Back then, CPUs ran on fixed voltages and frequencies and only overclockers discovered the limits. Even then, it was rare to find reports of CPUs killed via overvolting, unless it was to an extreme extent --- thermal throttling, instability, and shutdown (THERMTRIP) seemed to occur before actual damage, preventing the latter from happening.
Oh the memories. I had a Thunderbird-core Athlon with a stock frequency of (IIRC) 1050MHz. It was stable at 1600MHz, and I ran it that way for years. I was able to get it to 1700MHz, but then my CPU's stability depended on ambient temperatures. When the room got hot in the summer my workstation would randomly kernel panic.
Interesting, I hadn’t heard about the Pentium overclocking issues. My theory on the current issue is that running chips for long periods of time at 100C is not good for chip longevity, but voltages could also be an issue. I came up with this theory last summer when I built my rig with a 13900k, though I was doing it with the intention of trying to set things up so the CPU could last 10 years.
Anecdotally, my CPU has been a champ and I haven’t noticed any stability issues despite doing both a lot of gaming and a lot of compiling on it. I lost a bit of performance but not much setting a power limit of 150W.
The mobile issue seems more anecdote than data? Almost as if people on Reddit heard the 13/14 CPUs were bad, then their laptop crashed, and they decided "it happened to me too".
The problem may exist, but Alderon Games' report on the mobile chip is more of an anecdote here because there's not enough data points (unlike their desktop claims), and the only SKU they give (13900HX) is actually a desktop chip in a mobile package (BGA instead of LGA, so we're back into the original issue). So in the end, even with Alderon's claims, there's really not enough data points to come to a conclusion on the mobile side of things.
> "The laptops crash in the exact same way as the desktop parts including workloads under Unreal Engine, decompression, ycruncher or similar. Laptop chips we have seen failing include but not limited to 13900HX etc.," Cassells said.
> "Intel seems to be down playing the issues here most likely due to the expensive costs related to BGA rework and possible harm to OEMs and Partners," he continued. "We have seen these crashes on Razer, MSI, Asus Laptops and similar used by developers in our studio to work on the game. The crash reporting data for my game shows a huge amount of laptops that could be having issues."
I'm not denying that the problem exists, but I don't think Alderon provided enough data to come to a conclusion, unlike on the desktop, where it's supported by other parties in addition to Alderon's data (where you can largely point to 14900KS/K/non-K/T, 13900KS/K/non-K/T, 14700K, and 13700K being the one affected)
Right now, the only example given is HX (which is a repackaged desktop chip[^], as mentioned), so I'm not denying that the problem is happening on HX based on their claims (and it makes a lot of sense that HX is affected! See below), but what about H CPUs? What about P CPUs? What about U CPUs? The difference in impact between "only HX is impacted" and "HX/H/P/U parts are all affected" is a few orders of magnitude (a very top-end 13th Gen mobile SKUs versus every 13th Gen mobile SKUs). Currently, we don't have enough data how widespread the issue is, and that makes it difficult to assess who is impacted by this issue from this data alone.
[^]: HX is the only mobile CPU with B0 stepping, which is the same as desktop 13th/14th Gen, while the mobile H/P/U family are J0 and Q0, which are essentially a higher clocked 12th Gen (i.e., using Golden Cove rather than Raptor Cove)
Alderon are the people claiming 100% of units fail which doesn’t seem supported by anyone else either. Wendell and GN seem to have scoped the issue to around 10-25% across multiple different sources.
Like they are the most extreme claimants at this point. Are they really credible?
I haven't watched the video in its entirety, but it does feel like single core boost might be the main culprit in this scenario. That actually makes a lot of sense as to why game servers are the ones affected the most. Though that makes me wonder about HX CPUs failing, since those SKUs don't have TVB, but given Wendell's 10-25% failure rate on all-core load, it does seem like Intel may actually have multiple issues with RPL here. Things definitely don't look good.
For server CPUs is there not a similar problem, or do they realize server purchasers may be less willing to tolerate it? I'm not all that thrilled with the prospect of buying Intel, especially when hoping to stretch replacements out to 5 years compared to a few generations ago, but AMD server choices can be a bit limited and I'm not really sure how to evaluate whether there may be more surprises across the board.
Are you talking about Xeon Scalable? Although they share the same core design as the desktop counterpart (Xeon Scalable 4th Gen shares the same Golden Cove as 12th Gen, Xeon Scalable 5th Gen shares the same Raptor Cove as 13th/14th Gen), they're very different from the desktop counterpart (monolithic vs tile/EMIB-based, ring bus vs mesh, power gate vs FIVR), and often running in a more conservative configuration (lower max clock, more conservative V/F curves, etc.). There has been a rumor about Xeon Scalable 5th Gen having the same issue, but it's more of a gossip rather than a data point.
The issue does happen with desktop chips that are being used in a server context when pairing with workstation chipset such as W680. However, there haven't been any reports of Xeon E-2400/E-3400 (which is essentially a desktop chip repurposed as a server) with C266 having these issues, though it may be because there hasn't been a large deployment of these chips on the server just yet (or even if there are, it's still too early to tell).
Do note that even without this particular issue, Xeon Scalable 4th Gen (Sapphire Rapids) is not a good chip (speaking from experience, I'm running w-3495x). It has plenty of issues such as slow clock ramp, high latency, high idle power draw, and the list goes on. While Xeon Scalable 5th Gen (Emerald Rapids) seems to have fixed most of these issues, Zen 4 EPYC is still a much better choice.
After watching https://youtube.com/watch?v=gTeubeCIwRw and some related content, I personally don't believe it's an issue fixable with microcode. I guess we'll see.
Because HN doesn't provide link previews, I'd recommend adding some information about the content to your comment. Otherwise we have to click through to YouTube for the comment to make any sense.
That said, the video is the GamersNexus one where they talk about an unverified claim that this is a fabrication process issue caused by oxidation between atomic deposition layers. If that's the case, then yeah, microcode can only do so much. But like Steve says in the video, the oxidation theory has yet to be proven and they're just reporting what they have so far ahead of the Zen 5 reviews coming soon.
GN mentioned shipping a few samples to a lab (number dependent on the price quote from said lab), so I hope we’ll have some closure regarding this hypothesis.
This is the most pressing question. If it was just a microcode issue, a cooloff and power cycle ought to at least reset things, but according to Wendell from Level1Techs, that doesn't seem to always be the case.
The problem is that running at too high of a voltage for sustained periods can cause physical degradation of the chip in some cases. Hopefully not here!
Just want to say, I'm incredibly happy with my 7800X3D. It runs ~70C max like Intel chips used to and with a $35 air cooler and it's on average the fastest chip for gaming workloads right now.
Same, in my BIOS I can activate an "ECO Mode", which lets me decide whether I want to run my 7950X at the full 170W TDP, 105W TDP, or 60W TDP.
I benchmarked it: the difference between 170W and 105W is basically zero, and the difference at 60W is just a few percent of a performance hit, but well worth it, as electricity is ~0.3€/kWh over here.
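Quick back-of-envelope on why that's worth it (the 6 hours/day of sustained load is my assumption, not the parent's, and it treats the TDP limit as average draw, which is rough):

    PRICE_EUR_PER_KWH = 0.30
    HOURS_PER_DAY = 6                      # assumed sustained-load hours
    for tdp_w in (170, 105, 60):
        cost = tdp_w / 1000 * HOURS_PER_DAY * 365 * PRICE_EUR_PER_KWH
        print(f"{tdp_w} W TDP: ~{cost:.0f} EUR/year")   # ~112, ~69, ~39 EUR/year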
you might want to check a tool called PBO2 Tuner (https://www.cybermania.ws/apps/pbo2-tuner/); you can tweak values like EDC, TDC and PPT (power limit) from the GUI, and it also accepts command line commands so you can automate those tasks.
I made scripts that "cap" the power consumption of the cpu based on what applications are running. (i.e. only going all in on certain games, dynamically swaping between 65-90-120-180w handmade profiles)
i made with power saving in mind given the idle power consumption is rather high on modern ryzens.
edit: actually made a mistake given that PBO2Tunner is for Zen3 cpus, and you mentioned Zen4.
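In the same spirit, a rough sketch of that kind of per-application cap (the profile numbers and the set_power_limit_w helper are placeholders for whatever command line your tuner actually exposes; it just polls running processes and switches profiles):

    import subprocess, time
    import psutil  # third-party; pip install psutil

    PROFILES = {              # process name -> power cap in watts (example values)
        "heavy_game.exe": 180,
        "blender.exe": 120,
    }
    DEFAULT_W = 65

    def set_power_limit_w(watts):
        # Placeholder: swap in whatever CLI your power tuner actually provides.
        subprocess.run(["pbo-tuner-cli", "--ppt", str(watts)], check=False)

    last = None
    while True:
        running = {(p.info["name"] or "").lower() for p in psutil.process_iter(["name"])}
        target = max((w for name, w in PROFILES.items() if name in running), default=DEFAULT_W)
        if target != last:
            set_power_limit_w(target)
            last = target
        time.sleep(10)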
I was concerned this would happen to them, given how much power was being pushed through their chips to keep them competitive. I get the impression their innovation has either truly slowed down, or AMD thought enough 'moves' ahead with their tech/marketing/patents to paint them into a corner.
I don't think Intel is done though, at least not yet.
The first two require a lot more effort in video editing than creating a forum post. Plus, it’s just going to be digested and regurgitated for the masses by people much better at communicating technical information.
Based on what I know about corporations, it's entirely plausible that the folks posting the information don't actually have access to the communication channels you are referring to. I don't even know how I would issue an official communication at my own company if the need ever came up... so you go with what you have.
The amount of current their chips pull on full boost is pretty crazy. It would definitely not surprise me if some could get damaged by extensive boosting.
I built a system last fall with an i9-13900K and have been having the weirdest crashing problems with certain games that I never had problems with before. NEVER been able to track it down, no thermal issues, no overclocking, all updated drivers and BIOS. Maybe this is finally the answer I've been looking for.
It was for me. Check for BIOS updates - most motherboard vendors have them. Look for and enable something labeled Intel Baseline Profile and then check. That cured it for me.
I'll try that, thanks. Although the current cohort of games I play seems more stable now. If I ever go back to EVE Online then it'd be more of an issue - that thing crashed constantly.
Dumb question: let’s say I am in charge of procurement for a significant amount of machines, do I not have the option of ordering machines from three generations back? Are older (proven reliable) processors just not available because they’re no longer made, like my 1989 Camry?
Nice that Intel acknowledges there are problems with that CPU generation. If I read this right, the CPUs have been supplied with a too-high voltage across the board, with some tolerating the higher voltages for longer, others not so much.
Curious to see how this develops in terms of fixing defective silicon.
Very true and that's why it is odd that microcode has been mentioned here. Surely they mean PCU software (Pcode), or code for whatever they are calling the PCU these days.
Well, do they? The operating system can provide microcode updates to a running CPU. Can the operating system patch the PCU, too?
When I look at a "BIOS update" it usually seems to include UEFI, peripheral option ROMs, ME updates, and microcode. So if the PCU is getting patched I would think of it as a BIOS update. I think the ergonomics will be indistinguishable for end users.
Good for Intel to finally "figure it out", but I'm not 100% sure microcode is 100% of the problem. As in everything complex enough, the "problem" can actually be many compounded problems; MB vendors' "special" tuning comes to mind.
But this is already a mess that is very hard to clean up, since I feel many of these CPUs will die in a year or two because of these problems today, but by then nobody will remember this and an RMA will be "difficult" to say the least.
You're right - at least partly. If the issue is that Intel was too aggressive with voltages, they can use microcode updates as 1) an excuse to rejigger the power levels and voltages the BIOS uses as part of the update, and 2) they can have the processor itself be more conservative with the voltages and clocking it calculates itself.
Anything Intel announces, in my experience, is half true, so I'm interested to see what's actually true and what Intel will just forget to mention or will outright hide.
Is there any info on how to diagnose this problem? Having just put together a computer with the 14900KF, I really don't want to swap it out if not necessary.
There is no reliable way to diagnose this issue with the 14th gen: the chip slowly degrades over time and you start getting more and more crashes (usually GPU driver crashes under Windows). I believe the easy way might be to run decompression stress tests, if I remember correctly from Wendell's (Level1Techs) video.
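Not a substitute for the real stress tools in that video, but as a quick smoke test you can hammer compress/decompress round trips across all cores and look for mismatches (a rough sketch; the names, sizes, and iteration count are arbitrary):

    import os, zlib, hashlib
    from concurrent.futures import ProcessPoolExecutor

    def round_trip(seed):
        # compress and decompress a pseudo-random buffer, check it survives intact
        data = hashlib.sha256(str(seed).encode()).digest() * 4096   # ~128 KiB
        return seed if zlib.decompress(zlib.compress(data)) != data else None

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
            bad = [s for s in pool.map(round_trip, range(20_000)) if s is not None]
        print("mismatches:", len(bad))   # anything non-zero is a red flag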
I highly recommend going into your motherboard right now and manually setting your configurations to the current intel recommendation to prevent it from degrading to the point where you'd need to RMA it. I have a 14900K and it took about 2.5 months before it started going south and it was getting worse by the DAY for me. Intel has closed my RMA ticket since changing the bios settings to very-low-compared-to-what-the-original-is has made the system stable again, so I guess I have a 14900K that isn't a high end chip anymore.
Below are the configs intel provided to me on my RMA ticket that have made my clearly degraded chip stable again:
Tbh, XMP is probably the cause of most modern crashes on gaming rigs. It does not guarantee stability. After finding a stable CPU frequency, enable XMP and roll back the memory frequency until you have no errors in OCCT. The whole thing can be done in 20 minutes and your machine will have 24/7/365 uptime.
This is good advice for overclocking, but how does it help with the 13th/14th Gen issue? The issue is not due to clocks, or at least doesn't appear to be.
it’s also a terrible stability test these days, for the same reasons Wendell talks about with Cinebench in his video with Ian (and Ian agrees too). It doesn’t exercise like 90% of the chip - it’s purely a cache/AVX benchmark. You can have a completely unstable frontend and it’ll just work fine, because prime95 fits in icache and doesn’t need the decoder, and it’s just vector op, vector op, vector op forever.
You can have a system that’s 24/7 prime95 stable that crashes as soon as you exit out, because it tests so very little of it. That’s actually not uncommon due to the changes in frequency state that happen once the chip idles down… and it’s been this way for more than a decade, speedstep used to be one of the things overclockers would turn off because it posed so many problems vs just a stable constant frequency load.
Prime95 also completely ignores the high-clock domain btw so it can also be completely prime95 stable yet fail completely on a routine task that boosts up! So it’s technically not even a full test of core stability either.
If I didn’t just recently invest in 128gb of DDR4 I’d jump ship to AMD/AM5. My 13900k has been (knock on wood) solid though - with 24/7 uptime since July 2023.
I guess you’re lucky. I own 2 machines for small scale CNN training, one 13900K and one 14900K. I have to throttle the CPU performance to 90% for stable running. This costs me about 1 hour per 100 hours of training.
Are you using any motherboard overclocking stuff? A lot of mobo’s are pushing these chips pretty hard right out of the box.
I have mine at a factory setting that Intel would suggest, not the asus multi core enhancement crap. noctua dh15 cooler. It’s really been a stable setup.
My 13900k has definitely degraded over time. I was running BIOS defaults for everything and the PC was fine for several months. When I started getting crashes it took me a long time to diagnose it as a CPU problem. Changing the mobo vdroop setting made the problem go away for a while, but it came back. I then got it stable again by dropping the core multipliers down to 54x, but then a couple months later I had to drop to 53x. I just got an RMA replacement and it has made it 12 hours without issue.
I evaluated DDR4 vs DDR5 a year ago, and it wasn't worth it: chasing FPS, the cost to hit the same speed in DDR5 was just too high, and I'm glad I stayed. I'm on a 13700k and I'm also very stable. However, with the stock XMP profile for my RAM I was very much not stable and was getting errors and BSODs within minutes in an OCCT burn-in test. All I had to do was roll back the memory clock speed a few hundred MHz.
We've already seen examples of this happening on non-OC'd server-style motherboards that perfectly adhere to the intel spec. This isn't like ASUS going 'hur dur 20% more voltage' and frying chips. If that's all it was it would be obvious.
Lowering voltage may help mitigate the problem, but it sure as shit isn't the cause.
It's worth noting that W680 boards are not a server board, they're a workstation board, and often times they're overclockable (or even overclocked by default). Wendell actually showed the other day that the ASUS W680 board was feeding 253W into a 35W (106W boost) 13700T CPU by default[1].
Supermicro and ASRock Rack do sell W680 as a server (because it took Intel a really long time to release C266), but while they're strictly to the spec, some boards are really not meant for K CPUs. For example, the Supermicro MBI-311A-1T2N is only certified for a non-TVB E/T CPUs, and trying to run the K CPU on these can result in the board plumbing 1.55V into the CPU during the single core load (where 1.4V would already be on the higher side)[2].
In this particular case, the "non-OC'd server-style motherboard" doesn't really mean anything (even more so in the context of this announcement).
They also admit a microcode algorithm produces incorrect requests for voltages; it doesn't sound like they're trying to shift the blame. ASUS doesn't write that microcode.
Specifically I think the concerns are around idle voltage and overshoot at this point, which is indeed something configured by OEMs.
edit: BZ just put out a video talking about running Minecraft servers destroying CPUs reliably, topping out at 83C, normally in the 50s, running 3600 speeds. Which is a clear issue with low-thread loads.
A recent YouTube video by GamersNexus speculated the cause of instability might be a manufacturing issue. The employee's response follows.
Questions about manufacturing or Via Oxidation as reported by Tech outlets:
Short answer: We can confirm there was a via Oxidation manufacturing issue (addressed back in 2023) but it is not related to the instability issue.
Long answer: We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue.
For the Instability issue, we are delivering a microcode patch which addresses exposure to elevated voltages which is a key element of the Instability issue. We are currently validating the microcode patch to ensure the instability issues for 13th/14th Gen are addressed
So they were producing defective CPUs, identified & addressed the issue but didn’t issue a recall, defect notice or public statement relating to the issue?
What makes you think there was a "defective batch"? What makes you think all the CPUs affected by that production issue will fail from it?
That description sounds to me like it affected the entire production line for months. It's only worth a recall if a sufficient percent of those CPUs will fail. (I don't want to argue about what particular percent that should be.)
My CPU was unstable for months. I spent tens of hours and hundreds on equipment to troubleshoot (I _never_ thought my CPU would be the cause). Had I known this, I would have scrutinised the CPU a lot sooner than I did.
Intel could have made a public statement about potentially defective products with good PR spin: ‘we detected an issue, believe the defect rate will be < 0.25%, here’s a test suite you can run, call if you think you’re one of the 0.25%!’ But they didn’t.
I’m never buying an intel product again. Fuck intel.
This comment chain is talking about the oxidation in particular, and specifically the situation where the oxidation is not the cause of the instability in the title. That's the only way they "identified & addressed the issue but didn’t issue a recall".
Do you have a reason to think the oxidation is the cause of your problems?
Did you not read my first post trying to clarify the two separate issues?
Oxidisation in the context of CPU fabrication sounds pretty bad; I find it hard to believe it would have no impact on CPU stability, regardless of what Intel's PR team says to minimise the actual impact.
Edit: it sounds like Intel have been aware of stability issues for some time and said nothing. I’m not sure we have any reason to trust anything they say moving forward, relating to oxidisation or any other claims they make.
Well they didn't notice it for a good while, so it's really hard to say how much impact it had.
And at a certain point, if you barely believe anything they say, then you shouldn't be using their statement as something to get mad about. The complaint you're making depends on very particular parts of their statement being true but other very particular parts being not true. I don't think we have the evidence for that right now.
> Well they didn't notice it for a good while, so it's really hard to say how much impact it had.
That negates any arguments you had related to failure rates.
> The complaint you're making depends on very particular parts of their statement being true but other very particular parts being not true
Er, I’m not even sure how to respond to this. GamersNexus indicated they knew about the oxidisation issue, and Intel *subsequently* confirmed it was known internally, but no public statement was made until now. I’m not unreasonably cherry-picking parts of their statement and then drawing unreasonable conclusions. Intel have very clearly demonstrated they would have preferred not to disclose an issue in their fabrication processes which very probably caused defective CPUs, and they have demonstrated untrustworthy behaviour related to this entire thing (L1Techs and GN are breaking the defective CPU story following leaks from major Intel clients who have indicated that Intel is basically refusing to cooperate).
Intel has known about these issues for some time and said nothing. They have cost organisations and individuals time and money. Nothing they say now can be trusted unless it involves them admitting fault.
> That negates any arguments you had related to failure rates.
I mean it's hard for us to say, without sufficient data. But Intel might have that much data.
Also what argument about failure rates? The one where I said "if" about failure rates?
> Er, I’m not even sure how to respond to this. GamersNexus has indicated they know about the oxidisation issue, intel subsequently confirm it was known internally but no public statement was made until now.
GamersNexus thinks the oxidation might be the cause of the instability everyone is having. Intel claims otherwise.
Intel has no reason to lie about this detail. It doesn't matter if the issue is oxidation versus something else.
Also the issue Intel admits to can't be the problem with 14th gen, because it only happened to 13th gen chips.
> Intel has known about these issues for some time and said nothing. Nothing they say now can be trusted unless it involves them admitting fault.
If you don't trust what Intel said today at all, then you can't make good claims about what they knew or didn't know. You're picking and choosing what you believe to an extent I can't support.
Intel cannot afford to be anything but outstanding in terms of customer experience right now. They are getting assaulted on all fronts and need to do a lot to improve their image to stay competitive.
Intel should take a page out of HP's book when it came to dealing with a bug in the HP-35 (first pocket scientific calculator):
> The HP-35 had numerical algorithms that exceeded the precision of most mainframe computers at the time. During development, Dave Cochran, who was in charge of the algorithms, tried to use a Burroughs B5500 to validate the results of the HP-35 but instead found too little precision in the former to continue. IBM mainframes also didn't measure up. This forced time-consuming manual comparisons of results to mathematical tables. A few bugs got through this process. For example: 2.02 ln e^x resulted in 2 rather than 2.02. When the bug was discovered, HP had already sold 25,000 units which was a huge volume for the company. In a meeting, Dave Packard asked what they were going to do about the units already in the field and someone in the crowd said "Don't tell?" At this Packard's pencil snapped and he said: "Who said that? We're going to tell everyone and offer them a replacement. It would be better to never make a dime of profit than to have a product out there with a problem". It turns out that less than a quarter of the units were returned. Most people preferred to keep their buggy calculator and the notice from HP offering the replacement.
I wonder if Mr. Packard's answer would have been different if a recall would have bankrupted the company or necessitated layoff of a substantial percentage of staff.
I can't speak for Dave Packard (or Bill Hewlett) - but I will try to step into their shoes:
1) HP started off in test and measurement equipment (voltmeters, oscilloscopes etc.) and built a good reputation up. This was their primary business at the time.
2) The customer base of the HP-35 and test and measurement equipment would have a pretty good overlap.
Suppose the bug had been covered up, found, and then the news about the cover up came to light? Would anyone trust HP test and measurement equipment after that? It would probably destroy the company.
I assume they're referring to Steve Jobs' comments in this (Robert Cringely IIRC) interview: https://www.youtube.com/watch?v=l4dCJJFuMsE (not a great copy, but should be good enough)
Oh yeah, this got rehashed as builders versus talkers too. Yeah, there's a lot of this creative-vibe-type dividing. It's pretty complicated; I don't even think individual people operate the same when placed in a different context. Usually their output is a result of their incentives, so typically it's a management failure or a technical architect failure.
I would argue the fabrication process people at Intel are core to their business. Without the ability to reliably manufacture chips, they're dead in the water.
Without those weirdos do you think Intel would be doing anything about this in public?
And tell us how customers that bought the most expensive part from their lineup should feel about knowing that their CPU has been overvolted from day one of operation.
Yeah sure, calling out Intel for lack of any good updates over the crashing laptop/desktop CPUs and demanding a recall after giving them such a long time to come up with a reasonable solution is definitely "weirdo" territory.
FWIW, I have connections who splurged on these only to deal with BSODs all the friggin' time. Some of them even work at Intel.
"Linus Tech Tips" for the gaming crowd situation (loss of "paid for" premium performance) and Torvalds for the hardware vendor lack of transparency with the community.
Saying "elevated voltage causes damage" is not attributing blame to anyone. In the very next sentence, they then attribute the reason for that elevated voltage to their own microcode, and so it is responsible for the damage. I literally do not know how they could be any clearer on that.
So it's a 737 MAX problem: the software is running a control loop that doesn't have deflection limits. So it tells the stabilizer (or voltage reg in this case) to go hard nose down.
The voltage supplied by the motherboard isn't supposed to be constant. The CPU is continuously varying the voltage it's requesting, based primarily on the highest frequency any of the CPU cores are trying to run at. The motherboard is supposed to know what the resistive losses are from the VRMs to the CPU socket, so that it can deliver the requested voltage at the CPU socket itself. There's room for either party to screw up: the CPU could ask for too much voltage in some scenarios, or the motherboard's voltage regulation could be poorly calibrated (or deliberately skewed by overclocking presets).
On top of all this mess: these products were part of Intel's repeated attempts to move the primary voltage rail (the one feeding the CPU cores) to use on-die voltage regulators (DLVR). They're present in silicon but unused. So it's not entirely surprising if the fallback plan of relying solely on external voltage regulation wasn't validated thoroughly enough.
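To make the external-regulation path above concrete, here is a minimal Python sketch of a VID request plus a VRM loadline. Every number is invented for illustration; these are not Intel VID tables or any real board's loadline calibration values.

    # Sketch: voltage at the socket = requested VID minus resistive droop (V = VID - I*R).
    # A board preset that flattens the loadline delivers more voltage than the
    # CPU's request assumed. All values here are made up.

    def socket_voltage(vid_v, current_a, loadline_mohm):
        return vid_v - current_a * (loadline_mohm / 1000.0)

    vid = 1.35        # hypothetical voltage the CPU requests for a boost state
    load = 180.0      # hypothetical amps drawn under an all-core load
    spec_ll = 1.1     # mOhm of droop the request implicitly assumes
    flat_ll = 0.3     # mOhm under an aggressive "performance" preset

    print(socket_voltage(vid, load, spec_ll))  # ~1.15 V, what the request intends
    print(socket_voltage(vid, load, flat_ll))  # ~1.30 V, effectively an overvolt

Either side of that arithmetic can go wrong: a bad VID coming out of microcode, or a board that under-droops.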
Modern CPUs are incredibly complex machines with a ridiculously large number of possible configuration states (too large to exhaustively test after manufacture or simulate during design), e.g. a vector multiply in flight with an AES encode in flight with an x87 sincos, etc. Each operation is going to draw a certain amount of current. It is impractical to guarantee the required current to every functional unit simultaneously, so the supply rails are sized for a "reasonable worst case".
Perhaps an underestimate was mistakenly made somewhere and not caught until recently. Therefore the fix might be to modify the instruction dispatcher (via microcode) to guarantee that certain instruction configurations cannot happen (e.g. let the x87 sincos stall until the vector multiply is done) to reduce pressure on the voltage regulator.
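A toy sketch of the kind of dispatch guard described above, with hypothetical unit names and per-cycle current budgets; real hardware would do this in silicon and microcode, not in software.

    # Hypothetical: stall ops whose combined current draw would exceed a
    # per-cycle budget. Unit names and numbers are invented for illustration.

    UNIT_CURRENT_A = {"vector_mul": 9.0, "aes": 4.0, "x87_sincos": 6.0}
    BUDGET_A = 15.0

    def dispatch(window):
        issued, stalled, drawn = [], [], 0.0
        for op in window:
            need = UNIT_CURRENT_A[op]
            if drawn + need <= BUDGET_A:
                issued.append(op)
                drawn += need
            else:
                stalled.append(op)  # retried next cycle, easing VRM pressure
        return issued, stalled

    print(dispatch(["vector_mul", "aes", "x87_sincos"]))
    # (['vector_mul', 'aes'], ['x87_sincos'])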
It's worse than that, thermal management is part of the puzzle. Think of that as heat generation happening across three dimensions (X + Y + time) along with diffusion in 3D through the package.
The claim seems to be that the microcode on the CPU is in certain circumstances requesting the wrong (presumably too high) voltage from the motherboard. If that is the case fixing the microcode will solve the issue going forward but won’t help people whose chips have already been damaged by excessive voltage.
It's complicated.
What's special about the E core's L2 cache such that it gets on-chip regulated voltage?
I suspect it's for one of the low power modes.
It's most likely both a hardware issue and a microcode issue.
Making CPUs is kind of like sorting eggs. When they're made, they all have slightly different characteristics and get placed into bins (i.e., "binned") based on how they meet the specs.
To oversimplify, the cough "better" chips are sold at higher prices because they can run at higher clock speeds and/or handle higher voltages. If there's a speck of dust on the die, a feature gets turned off and the chip is sold for a lower price.
In this case, this is most likely an edge case that would not be a defect if the shipping microcode already handled it. (Although it is appropriate to ask whether affected chips should have gone into a lower-price bin.)
> If there's a spec of dust on the die, a feature gets turned off and the chip is sold for a lower price.
Do you mean that if a 13900KS CPU has a manufacturing defect, it gets downgraded and sold as 13900F or something else according to the nature of the defect?
It's way more extreme than that.
For any named product (such as Raptor Lake), Intel only makes 1-3 unique silicon dies. Alder Lake only had two dies, 8P+8E and 6P+0E [1]. Every single SKU comes from those two dies: if it has E cores, it's the 8P+8E die. Which means Alder Lake-N is actually the 8P+8E die with all the P cores disabled.
The laptop versions, Alder Lake-P (20W) and Alder Lake-U (9 and 15W), are also the 8P+8E die; they couldn't use the 6P+0E die, because it has no E cores at all.
Raptor Lake is only one die with 8 P cores and 16 E cores, which they sell as every i9 and i7, along with the two top i5 designs. In the 13th generation, the remaining i5s are the Alder Lake 8P+8E die and the i3s are all Alder Lake 6P+0E dies.
The manufacturing defects aren't binary; it's not a simple pass/fail. It's all very analog: some dies are simply able to reach higher clock speeds, or use more or less power. They test every single die and bin it based on its capabilities. The ones with the best power consumption go to the P and U SKUs. The ones which can reach the highest clock speeds are labeled as 13900KS, dies which just miss that get sold as 13900K, and the rest get spread over all remaining SKUs based on their capabilities.
Intel couldn't decide to exclusively make 13900KS dies if they wanted to, because they are simply the top 0.1% of dies. They are forced to make 1000 dies, use the best one and sell the rest as lower SKUs.
[1] Wikichip has photos of the two dies: https://en.wikichip.org/wiki/intel/microarchitectures/alder_...
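A toy model of that sorting flow, just to illustrate "same die, binned into SKUs by test results". The thresholds, the distribution, and the SKU labels here are stand-ins, not Intel's actual screens.

    # Every die is the same design; each one just tests to a different maximum
    # stable clock. Thresholds and the distribution are invented.

    import random
    from collections import Counter

    def bin_die(max_stable_ghz):
        if max_stable_ghz >= 6.0:
            return "13900KS"
        if max_stable_ghz >= 5.8:
            return "13900K"
        if max_stable_ghz >= 5.4:
            return "13700K"
        return "lower SKU (possibly with units fused off)"

    random.seed(0)
    dies = [random.gauss(5.6, 0.2) for _ in range(1000)]
    print(Counter(bin_die(d) for d in dies))
    # Only a small tail of dies ever qualifies for the top bin.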
It's been almost 20 years since I worked in the industry, so I don't want to make assumptions about specific products.
When I was in the industry, it would be things like disabling caches, disabling cores, etc. I don't remember specific products, though.
Likewise, some dies can handle higher voltages, clock speeds, etc.
Yes. It’s called the silicon lottery.
Silicon lottery was when you as a customer could get dies of varying quality, some of which could be clocked higher than others. For the manufacturer it's not a lottery at all, because at their scale the yields for the various bins are mostly predictable. Binning also means that you as a customer are much less likely to get a chip that is significantly better than specced, although it still happens when chips are sold as a lower bin for market-segmentation purposes.
The months of R&D to create a workaround could simply be because the subset of motherboards which trigger this issue are doing something borderline/unexpected with their voltage management, and finding a workaround for that behaviour in CPU microcode is non-trivial. Not all motherboard models appear to trigger the fault, which suggests that motherboard behaviour is at least a contributing factor to the problem.
I think this issue was sort of cracked-open and popularized recently by this particular video from Level1Techs: https://www.youtube.com/watch?v=QzHcrbT5D_Y
Towards the middle of the video it brings up some very interesting evidence from online game server farms that use 13900 and 14900 variants for their high single-core performance for the cost, but with server-grade motherboards and chipsets that do not do any overclocking and would be considered "conservative". Yet these environments show a very high statistical failure rate for these particular CPU models. This suggests that some high percentage of the CPUs produced are affected, and that it's long run-time over which the problem develops, not just enthusiast/gamer motherboards pushing high power levels.
All modern CPUs come out of the factory with many many bugs. The errata you see published are only the ones that they find after shipping (if you're lucky, they might not even publish all errata). Many bugs are fixed in testing and qualification before shipping.
That's how CPU design goes. The way that is done is by pushing as much to firmware as possible, adding chicken switches and fallback paths, and all sorts of ways to intercept regular operation and replace it with some trap to microcode or flush or degraded operation.
Applying fixes and workarounds might cost quite a bit of performance (think of Spectre disabling some kinds of branch predictors, for an obviously very big one). And in some cases you even see in published errata that they leave some theoretical correctness bugs unfixed entirely. Where is the line before accepting returns? Very blurry and unclear.
Almost certainly, huge parts of their voltage regulation (which goes along with frequency, thermal, and logic throttling) will be highly configurable. Quite likely it's run by entirely programmable microcontrollers on chip. Things that are baked into silicon might be voltage/droop sensors, temperature sensors, etc., and those could behave unexpectedly, although even then there might be redundancy or ways to compensate for small errors.
I don't see they "passed it off" as a microcode issue, just said that a microcode patch could fix it. As you see it's very hard from the outside to know if something can be reasonably fixed by microcode or to call it a "microcode issue". Most things can be fixed with firmware/microcode patches, by design. And many things are. For example if some voltage sensor circuit on the chip behaved a bit differently than expected in the design but they could correct it by adding some offsets to a table, then the "issue" is that silicon deviates from the model / design and that can not be changed, but firmware update would be a perfectly good fix, to the point they might never bother to redo the sensor even if they were doing a new spin of the masks.
On the voltage issue, they did not say it was requesting an out of spec voltage, they said it was incorrect. This is not necessarily detectable out of context. Dynamic voltage and frequency scaling and all the analog issues that go with it are fiendishly complicated, voltage requested from a regulator is not what gets seen at any given component of the chip, loads, switching, capacitance, frequency, temperature, etc., can all conspire to change these things. And modern CPUs run as close to absolute minimum voltage/timing guard bands as possible to improve efficiency, and they boost up to as high voltages as they can to increase performance. A small bug or error in some characterization data in this very complicated algorithm of many variables and large multi dimensional tables could easily cause voltage/timing to go out of spec and cause instability. And it does not necessarily leave some nice log you can debug because you can't measure voltage from all billion components in the chip on a continuous basis.
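As a sketch of that kind of characterization data, here is a hypothetical V/F table plus guard band; the values and the erroneous offset are made up, and the only point is that a small error in such a table shifts the requested voltage without leaving an obvious log to debug from.

    # Hypothetical DVFS lookup: frequency -> characterized minimum voltage plus
    # a guard band. Table values and the bad offset are invented.

    VF_TABLE = [(4.0, 1.05), (5.0, 1.20), (5.7, 1.38), (6.0, 1.48)]  # (GHz, V)
    GUARD_BAND_V = 0.03

    def requested_voltage(freq_ghz, offset_v=0.0):
        for f, v in VF_TABLE:
            if freq_ghz <= f:
                return round(v + GUARD_BAND_V + offset_v, 3)
        raise ValueError("frequency above characterized range")

    print(requested_voltage(5.7))                # 1.41 V, the intended request
    print(requested_voltage(5.7, offset_v=0.1))  # 1.51 V if one offset is wrong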
And some bugs just take a while to find and fix. I'm not a tester per se but I found a logic bug in a CPU (not Intel but commercial CPU) that was quickly reproducible and resulted in a very hard lockup of a unit in the core, but it still took weeks to find it. Imagine some ephemeral analog bug lurking in a dusty corner of their operating envelope.
Then you actually have to develop the fix, then you have to run that fix through quite a rigorous testing process and get reasonable confidence that it solves the problem, before you would even make this announcement to say you've solved it. Add N more weeks for that.
So, not to say a dishonest or bad motivation from Intel is out of the question. But it seems impossible to make such speculations from the information we have. This announcement would be quite believable to me.
I agree with most of what you said, so cherry picking one thingy to reply to isn't my intention, but
"And some bugs just take a while to find and fix."
I think it's less that it took a while to find the bug, and more that they've been pretty much radio silent for six months. AMD had the issue with burning Ryzen 7000-series CPUs, and they were at least quick to put out a statement that they'd make customers whole again.
Well as it comes to Intel executive management and PR, I'm entirely unqualified to make any educated comment or speculation about it. I can't say I'm aware of Intel ever having great renown for its handling of product defects though.
Oh, I'm certainly the same, just some rando enjoying my popcorn.
> As you see it's very hard from the outside to know if something can be reasonably fixed by microcode or to call it a "microcode issue"
They claimed:
> a microcode algorithm resulting in incorrect voltage requests to the processor.
I was responding in context of OP's theory that their statement may not be entirely truthful.
The thing is, "incorrect" implies the existence of a static "correct". Which I interpret as a static spec which a microcode bug violated and could be fixed back to that static spec with a simple microcode update.
I do find your suggested scenario to be very plausible. That Intel have discovered their original voltage algorithm was flawed, leading to instability. And it is very feasible that simply updating the microcode is the correct fix for such an issue.
If Intel had explicitly stated that the original voltage algorithm spec was wrong, and the new one fixes the issue, I'd be pretty willing to believe them, and probably wouldn't have written that comment.
I'm not saying your interpretation of "incorrect voltage" as meaning "voltage that we now know causes instability" is wrong. It's an ambiguous statement and either interpretation is valid. But I have experience working with PR people; they know how to avoid accidentally ambiguous statements.
PR people are also experts at using ambiguous statements to their advantage: crafting statements where not only are there multiple possible interpretations, but where the average reader will tend to interpret them in the best possible way. I have experience in helping PR people craft such statements. There are a few other examples of "ambiguous statements" in that statement, which leads me to question the honesty of the whole thing.
I believe that the waters may be muddied enough that they won't have to do a full recall, and will only act if you 'provide evidence' the system is still crashing.
> I find it hard to believe that it actually is a microcode issue.
They learned a lot from the Pentium disaster; even if it's a hardware issue, they can at least address it with microcode, which is just as good.
Except normally the result of a microcode workaround is that the chip no longer performs at its claimed/previously-measured level. Not "as good" by any standard.
For example, Intel CPU + Spectre mitigation is not "as good" as a CPU that didn't have the vulnerability in the first place.
Microcode changes don't have to affect performance negatively. Do you have any evidence this one will? If it's a voltage algorithm failure, then I would expect that they could run it as advertised with corrected microcode. Unstable power is a massive issue for electronics like this and I have no problem believing their explanation. Bad power causes all sorts of weird issues.
If it was a microcode bug to begin with, fixing the bug wouldn't need to degrade performance. If it was e.g. a bad sensor, that you can "correct" well enough by postprocessing, it doesn't need to degrade performance. But if it's essentially incorrect binning -- the hardware can't function as they thought it would, use microcode to limit e.g. voltage to the range where it works right -- then that will degrade performance.
> If it was a microcode bug to begin with, fixing the bug wouldn't need to degrade performance.
This is both a completely untrue statement, and a judgement on a fix that hasn't been released yet.
At least with Spectre, applying the mitigation was a choice. You could turn it off and game at full speed, while turning it on for servers and web browsing for safety.
This is busted or working.
https://scholar.harvard.edu/files/mickens/files/theslowwinte...
"Unfortunately for John, the branches made a pact with Satan and quantum mechanics [...] In exchange for their last remaining bits of entropy, the branches cast evil spells on future genera- tions of processors. Those evil spells had names like “scaling- induced voltage leaks” and “increasing levels of waste heat” [...] the branches, those vanquished foes from long ago, would have the last laugh."
"John was terrified by the collapse of the parallelism bubble, and he quickly discarded his plans for a 743-core processor that was dubbed The Hydra of Destiny and whose abstract Platonic ideal was briefly the third-best chess player in Gary, Indiana. Clutching a bottle of whiskey in one hand and a shot- gun in the other, John scoured the research literature for ideas that might save his dreams of infinite scaling. He discovered several papers that described software-assisted hardware recovery. The basic idea was simple: if hardware suffers more transient failures as it gets smaller, why not allow software to detect erroneous computations and re-execute them? This idea seemed promising until John realized THAT IT WAS THE WORST IDEA EVER. Modern software barely works when the hardware is correct, so relying on software to correct hardware errors is like asking Godzilla to prevent Mega-Godzilla from terrorizing Japan. THIS DOES NOT LEAD TO RISING PROP- ERTY VALUES IN TOKYO. It’s better to stop scaling your transistors and avoid playing with monsters in the first place, instead of devising an elaborate series of monster checks- and-balances and then hoping that the monsters don’t do what monsters are always going to do because if they didn’t do those things, they’d be called dandelions or puppy hugs."
I haven't read this piece before but I just knew it was going to be written by Mickens about halfway through your comment.
The "mickens" in the URL on the first line was a dead giveaway :-)
> According to my dad, flying in airplanes used to be fun... Everybody was attractive ....
This is how I feel about electric car supercharging stations at the moment. There is definitely a privilege aspect, which some attractive people are beneficiaries of in a predictable way, along with other expensive maintenance for their health and attractiveness.
so I could see myself saying the same thing to my children
I'm ruining that trend by charging my E-Transit in nice places and dressing poorly.
Thanks - it is rather funny.
Remains to be seen how the microcode patch affects performance, and how these CPUs that have been affected by over-voltage to the point of instability will have aged in 6 months, or a few years from now.
More voltage generally improves stability, because there is more slack to close timing. Instability with high voltage suggests dangerous levels. A software patch can lower the voltage from this point on, but it can't take back any accumulated fatigue.
> Remains to be seen how the microcode patch affects performance
Intel is claiming a 4% performance hit on the final patch https://youtu.be/wkrOYfmXhIc?t=308
I was recently looking at building and buying a couple systems. I've always liked Intel. I went AMD this time.
It seemed like the base frequencies vs boost frequencies were much farther apart on Intel than with most of the AMDs. This was especially true on the laptops, where cooling is a larger concern. So I suspect they were pushing limits.
Also, the performance core vs efficiency core stuff seemed kind of gimmicky with so few performance cores and so many efficiency cores. Like look at this 20 core processor! Oh wait, it's really an 8 core when it comes to performance. Hard to compare that to a 12 core 3D cached Ryzen with even higher clock...
I will say, it seems Intel might still have some advantages. It seems AMD had an issue supporting ECC with the current chipsets. I almost went Intel because of it. I ended up deciding that DDR5's built-in error correction was enough for me. The performance graphs also seem to indicate smoother throughput, suggesting more efficient or elegant execution (less blocking?). But on average the AMDs seem to be putting out similar end results, even if the graph is a bit more "spikey".
> It seems AMD had an issue supporting ECC with the current chipsets.
AMD has the advantage with regards to ECC. Intel doesn't support ECC at all on consumer chips; you need to go Xeon. AMD supports it on all chips, but it is up to the motherboard vendor to (correctly) implement it. You can get consumer-class AM4/AM5 boards that have ECC support.
> AMD supports [ECC RAM] on all chips
There was a strange happening with AMD laptop CPUs (“APUs”): the non-soldered DDR5 variants of the 7x40’s were advertised to support ECC RAM on AMD’s website up until a couple months before any actual laptops were sold, then that was silently changed and ECC is only on the PRO models now. I still don’t know if this is a straightforward manufacturing or chipset issue of some kind or a sign of market segmentation to come.
(I’m quite salty I couldn’t get my Framework 13 with ECC RAM because of this.)
> AMD supports it on all chips
Unfortunately not. I can't say for current gen, but the 5000 series APUs like the 5600G do not support ECC. I know, I tried...
But yes, most Ryzen CPUs do have ECC functionality, and have had it since the 1000 series, even if not officially supported. Official support for ECC is only on Ryzen PRO parts.
You need W680 boards (starting at around 500 bucks) for ECC on desktop Intel chips.
I was seeing them around $400 (still expensive).
Actually some of the 13th and 14th gen Intel Core processors support ECC.
Intel has always randomly supported ECC on desktop CPUs. Sometimes it was just a few low-end SKUs, sometimes higher-end SKUs. For 14th gen it appears i9s and i7s do; I didn't check i5s, but i3s did not.
My understanding is that it's screwed up for multiple vendors and chipsets. The boards might say they support it, but there are some updates saying it's not actually supported. It seemed extremely hard to find any that actually supported it. It was actually easier to find new Intel boards supporting ECC.
Yeah, Wendell put out a video a few weeks ago exploring a bunch of problems with ASRock Rack-branded server-market B650 motherboards, and basically the ECC situation was exactly what everyone warns about: the various BIOS versions wandered between "works, but doesn't forward the errors", "doesn't work, and doesn't forward the errors", and (excitingly) "doesn't work and doesn't even post". We are a year and a half after Zen 4 launched, there are barely any server-branded boards to begin with, and even those boards don't work right.
https://youtu.be/RdYToqy05pI?t=503
I don't know how many times it has to be said but "doesn't explicitly disable" is not the same thing as "support". There are lots of other enablement steps that are required to get ECC to work properly, and they really need to be explicitly tested with each release (which if it is "not explicitly disabled", it's not getting tested). Support means you can complain to someone when it doesn't work right.
AMD churns AGESA really, really hard and it breaks all the time. Partners have to try and chase the upstream, and sometimes it works and sometimes it doesn't. Elmor (Asus's BIOS guy) talked about this on Overclock.net back around 2017-2018 when AMD was launching X399, covering some of the troubles there and with AM4.
That said, the current situation has seemingly lit a fire under the board partners. With Intel out of commission and all these customers desperate for an alternative to their W680/Raptor Lake systems (which do support ECC officially, btw) in these performance-sensitive niches or power-limited datacenter layouts, they are finally cleaning up the mess, like, within the last 3 weeks or so. They've very quickly gone from not caring about these boards to seeing a big market opportunity.
https://www.youtube.com/watch?v=n1tXJ8HZcj4
Can't believe how many times I've explained in the last month that yes, people do actually run 13700Ks in the datacenter... with ECC... and it's probably some pretty big names in fact. A previous video dropped the tidbit that one of the major affected customers is Citadel Capital - and yeah, those are the guys who used to get special EVEREST and BLACK OPS SKUs from Intel for the same thing. The client platform is better at that; the very best Sapphire Rapids or EPYC -F or -X3D SKU is going to be like 75% of the performance at best. It's also the fastest thing available for serving NVMe flash storage (and Intel specifically targeted this: the Xeon E-2400 series with the C266 chipset can talk NVMe/SAS natively on its chipset with up to 4 SlimSAS ports...)
it's somewhere in this one I think: https://www.youtube.com/watch?v=5KHCLBqRrnY
The new EPYC processors for AM5 look like they'll be OK for ECC RAM though, at least from the coming months onwards.
Yeah, I think that's the bright spot: now that there's a branded offering for server-flavored Ryzen, maybe there is a permanent justification for doing proper validation.
I just feel vindicated lol. It always comes up that “well, works fine for me!” and the reality is it’s a total crapshoot, with even server-branded boards often not working. There is zero chance your Gigabyte UD3 or whatever is going to be consistently supported across BIOS versions, and often it will not be.
And AMD is really really tied to AGESA releases, so it’s fairly important on that side. Although I guess maybe we’re seeing now what happens if you let too much be abstracted away… but on the other hand partners were blowing up AMD chips last year too.
If you’re comfortable always testing, and always having the possibility of there being some big AGESA problem and ecc being broken on the new versions… ok I guess.
There is a reason the i3 chips were perennial favorites for edge servers and NASs. And I think it's really, really hard to overstate the long-term damage from reputation loss here. Intel, meltdown aside, was always no-drama in terms of reliability. Other than C2000/C3000, I guess.
...and puma and i-225V chipsets.
or at least... maybe on the CPU side they were no-drama. Other than C2000/C3000. Granted the powervr graphics on the atoms way back did suck... and meltdown... and avx-512 being rolled back... /phillip j fry counting on his fingers
maybe "blue-chip coded" is a better way to express it ig
But like, there is a notable decline in the quality of execution at Intel overall, pretty much across the board, and CPUs were always their core vertical, right? That was their business redoubt. Intel is blue-chip chips, especially CPUs. And now it's falling - really it's been falling for a while. Meltdown I can generally excuse (yes, shush), nobody appreciated sidechannels back then even if they were theoretically known. C2000/C3000 is another fuckup. Yeah, it's the super-IO/serial bus controller... technically not their IP, but it happens to be in a critical path, on their node, killing their processor. They fucked up the validation there, evidently.
I-225V had three steppings and I-226V is still not fully fixed (Windows/Linux have just turned off the EEE/802.3az feature instead). Puma was a god damned mess.
Sapphire Rapids was late, still a huge mess, and actually the -W platform had not only insane power draw but also even more insane transients: 750W average, spiking up to 1500W under load, with pretty steep holdup requirements. And actually that was locked behind a "water cooled" BIOS option; the processor just "refused to all-core turbo" otherwise. And Intel didn't wanna actually say whether the "water cooled" behavior was the spec or intentional turbo limits, etc. In hindsight, hmmm, that all took on a bit of a different tone, didn't it?
Supposedly there is going to be an SPR-W refresh with a new stepping to fix this... Emerald Rapids is also very power-hungry, and there were some unconfirmed murmurs suggesting it might have the same crash problems.
(yes, yes, please just listen to the guest here.) https://www.youtube.com/watch?v=_HJu5xt43iQ&t=3603s
https://wccftech.com/intel-xeon-w-3500-w-2500-sapphire-rapid...
Intel's in some real danger, especially with AMD ascendant like this. It doesn't take very long of this kind of real damage to customers before that "we're blue-chip!" thing ceases to be, and that is the last prop keeping Intel's finances above water here. Sure, it will take a while to fully wind down, but... this is a great example of how Intel's fuckups are driving their clients literally into the arms of the competition. A month or two ago, ASRock Rack didn't give a shit about the B650-2L2T or whatever. Guess what? Now Epyc Mini exists and OEMs are going to be paying attention to that. Oops.
> I-226V is still not fully fixed
Damn, didn't realise that was still being problematic too. :(
And yeah, Intel's current stumble with 13th/14th gen CPUs seems like the worst possible timing for such an extreme fuck up. That's not going to go well for future planning/purchase decisions by business customers.
ECC support wasn't good initially on AM5, but there are now Epyc branded chips for the AM5 socket which officially support ECC DDR5. They come in the same flavors as the Ryzen 7xx0 chips, but are branded as Epyc.
More E-cores are reasonable for multi-threaded application performance. They're efficient in power and die area, as the name indicates, so they can implement more E-cores than P-cores in the same power/area budget. They're not suitable for those who need many high-single-thread-performance cores, like a VM server, but I don't know of any major consumer usage that requires such performance.
> but I don't know is there any major consumer usage requires such performance.
Gaming.
There are some games that will benefit from greater single-core performance.
I can sort of see that. The way I saw it explained was that they're much lower clocked and have a pretty small shared cache. I could see E cores as being great for running background processes and stuff. All the benchmarks seem to show the AMDs with 2/3rds the cores being around the same performance and with similar power draw. I'm not putting them down. I'm just saying it seems gimmicky to say "look at our 20 core!" with the implicit idea that people will compare that with an AMD 12 core, seeing 20>12, but not seeing the other factors like cost and benchmarks.
It's the megahertz wars all over again!
Computers have taught us the rubric of truth: Numbers Go Up.
> so few performance cores and so many efficiency cores
I was baffled by this too, but what they don't make clear is that the performance cores have hyperthreading and the efficiency cores do not.
So what they call 2P+4E actually becomes an 8 core system as far as something like /proc/cpuinfo is concerned. They're also the same architecture so code compiled for a particular architecture will run on either core set and can be moved from one to the other as the scheduler dictates.
> They're also the same architecture so code compiled for a particular architecture will run on either core set and can be moved from one to the other as the scheduler dictates.
I don't know if that has done more good than harm, since they ripped AVX-512 out for multiple generations to ensure parity.
A major differentiator is that Intel CPUs with E cores don’t allow the use of AVX-512, but all current AMD CPUs do. The new Zen 5 chips will run circles around Intel for any such workload. Video encoding, 3D rendering, and AI come to mind. For developers: many database engines can use AVX-512 automatically.
> Like look at this 20 core processor! Oh wait, it's really an 8 core when it comes to performance.
The E cores are about half as fast as the P cores depending on use case, at about 30% of the size. If you have a program that can use more than 8 cores, then that 8P+12E CPU should approach a 14P CPU in speed. (And if it can't use more than 8 cores then P versus E doesn't matter.) (Or if you meant 4P+16E then I don't think those exist.)
> Hard to compare that to a 12 core 3D cached Ryzen with even higher clock...
Only half of those cores properly get the advantage of the 3D cache. And I doubt those cores have a higher clock.
AMD's doing quite well but I think you're exaggerating a good bit.
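For the throughput point above, the back-of-the-envelope arithmetic (using the rough "an E core is worth about half a P core" figure from the parent comment, which of course varies by workload) looks like:

    # P-core-equivalent throughput, ignoring scheduling overhead. The 0.5
    # factor is the parent's rough estimate, not a measured constant.

    def p_equivalents(p_cores, e_cores, e_ratio=0.5):
        return p_cores + e_cores * e_ratio

    print(p_equivalents(8, 12))  # 14.0 "P equivalents" for an 8P+12E part
    print(p_equivalents(8, 16))  # 16.0 for the 8P+16E i9 die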
> If you have a program that can use more than 8 cores, then that 8P+12E CPU should approach a 14P CPU in speed
Only if you use work stealing queues or (this is ridiculously unlikely) run multithreaded algorithms that are aware of the different performance and split the work unevenly to compensate.
Or if you use a single queue... which I would expect to be the default.
Blindly dividing work units across cores sounds like a terrible strategy for a general program that's sharing those cores with who-knows-what.
It’s a common strategy for small tasks where the overhead of dispatching the task greatly exceeds the computation of it. It’s also a better way to maximize L1/L2 cache hit rates by improving memory locality.
Eg you have 100M rows and you want to cluster them by a distance function (naively), running dist(arr[i], arr[j]) is crazy fast, the problem is just that you have so many of them. It is faster to run it on one core than dispatch it from one queue to multiple cores, but best to assign the work ahead of time to n cores and have them crunch the numbers.
It has always been a bad idea to dispatch so naively and dispatch to the same number of threads as you have cores. What if a couple cores are busy, and you spend almost twice as much time as you need waiting for the calculation to finish? I don't know how much software does that, and most of it can be easily fixed to dispatch half a million rows at a time and get better performance on all computers.
Also on current CPUs it'll be affected by hyperthreading and launch 28 threads, which would probably work out pretty well overall.
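For what it's worth, a minimal sketch of the "one queue, fixed-size chunks" approach being argued for here, using only the standard library; the chunk size, worker count, and per-row work are arbitrary stand-ins.

    # Workers pull the next chunk as they finish, so a busy core simply takes
    # fewer chunks instead of stalling the whole job. A static split into
    # exactly one piece per core would instead finish when the slowest worker does.

    from concurrent.futures import ProcessPoolExecutor

    def process(rows):
        return sum(x * x for x in rows)  # stand-in for the real per-row work

    def run_chunked(rows, workers=4, chunk=100_000):
        chunks = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(process, chunks))

    if __name__ == "__main__":
        print(run_chunked(list(range(2_000_000))))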
> What if a couple cores are busy
If you don't pin them to cores, the OS is still free to assign threads to cores as it pleases. Assuming the scheduler is somewhat fair, threads will progress at roughly the same rate.
I would not assume it's sufficiently fair to make that a good algorithm.
Even a small bias could turn a 5 minute calculation into a 6 or 7 minute calculation as the stragglers finish up.
> run multithreaded algorithms that are aware of the different performance and split the work unevenly to compensate.
This is what the Intel Thread Director [0] solves.
For high-intensity workloads, it will prioritize assigning them to P-cores.
[0] https://www.intel.com/content/www/us/en/support/articles/000...
Then you no longer have 14 cores in this example, but only len(P) cores. Also most code written in the wild isn’t going to use an architecture-specific library for this.
The P cores being presented as two logical cores and E cores presented as a single logical core results in this kind of split already.
Yeah, the 20 core Intels are benchmarking about the same as the 12 core AMD X3Ds. But many people just see 20>12. Either one is more than fine for most people.
"Oh wait, it's really an 8 core when it comes to performance [cores]". So yes, should not be an 8 core all together, but like you said about 14 cores, or 12 with the 3D cache.
"And I doubt those cores have a higher clock."
I'm not sure what we're comparing them to. They should be capable of higher clock than the E cores. I thought all the AMD cores had the ability to hit the max frequency (but not necessarily at the same time). And some of the cores might not be able to take advantage of the 3D cache, but that doesn't limit their frequency, from my understanding.
It’s kind of funny and reminiscent of the AMD Bulldozer days, where they had a ton of cores compared to the contemporary Intel chips, especially at low/mid price points, but the AMD chips were laughably underwhelming for single-core performance, which was even more important then.
I can’t speak to the Intel chips because I’ve been out of the Intel game for a long time but my 5700X3D does seem to happily run all cores at max clock speed.
> I'm not sure what we're comparing them to. They should be capable of higher clock than the E cores.
Oh, just higher clocked than the E cores. Yeah that's true, but if you're using that many cores at once you probably only care about total speed.
You said 12 core with higher clock versus 8, so I thought you were comparing to the performance cores.
> I thought all the AMD cores had the ability to hit the max frequency (but not necessarily at the same time).
The cores under the 3D cache have a notable clock penalty on existing CPUs.
> And some of the cores might not be able to take advantage of the 3D cache, but that doesn't limit their frequency, from my understanding.
Right, but my point is it's misleading to call out higher core count and the advantages of 3D stacking. The 3D stacking mostly benefits the cores it's on top of, which is 6-8 of them on existing CPUs.
"The cores under the 3D cache have a notable clock penalty on existing CPUs."
Interesting. I can't find any info on that. It seems to make sense though, since the 7900X has a 50W higher TDP than the 7900X3D.
"Right, but my point is it's misleading to call out higher core count and the advantages of 3D stacking"
Yeah, that makes sense. I didn't realize there was a clock penalty on some of the cores with the 3D cache and that only some cores could use it.
It's due to the stacked cache being harder to cool and not supporting as high of a voltage. So the 3D CCD clocks lower, but for some workloads it's still faster (mainly ones dealing with large buffers, like games; most compute-heavy benchmarks fit in normal caches and the non-3D V-Cache variants take the win).
Maybe a stretch - but this reminds me of blood sugar regulation for people with type 1 diabetes.
Too low is dangerous because you lose rational thought, and the ability to maintain consciousness or self-recover. However, despite not having the immediate dangers of being low, having high blood sugar over time is the condition which causes long-term organ damage.
I think it's telling that they are delaying the microcode patch until after all the reviewers publish their Zen 5 reviews and the comparisons of those chips against current Raptor Lake performance.
Why even publish a comparison? Raptor Lake processors aren't a functioning product to benchmark against.
Because the benchmarks will still exist on the sites after the microcode is released and a lot of the sites won't bother to go back and update them with the accurate performance level.
Because if publishers don't publish then they don't make money.
Reminds me of Sudden Northwood Death Syndrome, 2002.
Looks like history may be repeating itself, or at least rhyming somewhat.
Back then, CPUs ran on fixed voltages and frequencies and only overclockers discovered the limits. Even then, it was rare to find reports of CPUs killed via overvolting, unless it was to an extreme extent --- thermal throttling, instability, and shutdown (THERMTRIP) seemed to occur before actual damage, preventing the latter from happening.
Now, with CPU manufacturers attempting to squeeze all the performance they can, they are essentially doing this overclocking/overvolting automatically and dynamically in firmware (microcode), and it's not surprising that some bug or (deliberate?) ignorance that overlooked reliability may have pushed things too far. Intel may have been more conservative with the absolute maximum voltages until recently, and of course small process sizes with higher potential for electromigration are a source of increased fragility.
Also anecdotal, but I have an 8th-gen mobile CPU that has been running hard against the thermal limits (100C) 24/7 for over 5 years (stock voltage, but with power limits all unlocked), and it is still 100% stable. This and other stories of CPUs in use for many years with clogged or even detached heatsinks seem to add to the evidence that high voltage is what kills CPUs, not heat or frequency.
Edit: I just looked up the VCore maximum for the 13th/14th processors - the datasheet says 1.72V! That is far more than I expected for a 10nm process. For comparison, a 1st-gen i7 (45nm) was specified at 1.55V absolute maximum, and in the 32nm version they reduced that to 1.4V; then for the 22nm version it went up slightly to 1.52V.
> Back then, CPUs ran on fixed voltages and frequencies and only overclockers discovered the limits. Even then, it was rare to find reports of CPUs killed via overvolting, unless it was to an extreme extent --- thermal throttling, instability, and shutdown (THERMTRIP) seemed to occur before actual damage, preventing the latter from happening.
Oh, the memories. I had a Thunderbird-core Athlon with a stock frequency of (IIRC) 1050MHz. It was stable at 1600MHz, and I ran it that way for years. I was able to get it to 1700MHz, but then my CPU's stability depended on ambient temperatures. When the room got hot in the summer my workstation would randomly kernel panic.
Interesting, I hadn’t heard about the Pentium overclocking issues. My theory on the current issue is that running chips for long periods of time at 100C is not good for chip longevity, but voltages could also be an issue. I came up with this theory last summer when I built my rig with a 13900K, though I was doing it with the intention of trying to set things up so the CPU could last 10 years.
Anecdotally, my CPU has been a champ and I haven’t noticed any stability issues despite doing both a lot of gaming and a lot of compiling on it. I lost a bit of performance but not much setting a power limit of 150W.
I believe the first round of Intel excuses here blamed the motherboard manufacturers for trying to "auto" overclock these CPUs.
There was recently[1] some talk about how the 13th/14th gen mobile chips also had similar issues, though Intel insisted it's something else.
Will be interesting to see how that pans out.
[1]: https://news.ycombinator.com/item?id=41026123
The mobile issue seems more anecdote than data? Almost as if people on Reddit heard the 13/14 CPUs were bad, then their laptop crashed, and they decided "it happened to me too".
Well it's not just[1] redditors from what I can gather:
Now Alderon Games reports that Raptor Lake crashes impact Intel's 13th and 14th-Gen processors in laptops as well.
"Yes we have several laptops that have failed with the same crashes. It's just slightly more rare then the desktop CPU faults," the dev posted.
These are the guys who publicly claimed[2] Intel sold defective chips based on the desktop chips crashing.
[1]: https://www.tomshardware.com/pc-components/cpus/dev-reports-...
[2]: https://www.tomshardware.com/pc-components/cpus/game-publish...
The problem may exist, but Alderon Games' report on the mobile chip is more of an anecdote here because there's not enough data points (unlike their desktop claims), and the only SKU they give (13900HX) is actually a desktop chip in a mobile package (BGA instead of LGA, so we're back into the original issue). So in the end, even with Alderon's claims, there's really not enough data points to come to a conclusion on the mobile side of things.
Why are you downplaying it too?
> "The laptops crash in the exact same way as the desktop parts including workloads under Unreal Engine, decompression, ycruncher or similar. Laptop chips we have seen failing include but not limited to 13900HX etc.," Cassells said.
> "Intel seems to be down playing the issues here most likely due to the expensive costs related to BGA rework and possible harm to OEMs and Partners," he continued. "We have seen these crashes on Razer, MSI, Asus Laptops and similar used by developers in our studio to work on the game. The crash reporting data for my game shows a huge amount of laptops that could be having issues."
https://old.reddit.com/r/hardware/comments/1e13ipy/intel_is_...
I'm not denying that the problem exists, but I don't think Alderon provided enough data to come to a conclusion, unlike on the desktop, where it's supported by other parties in addition to Alderon's data (where you can largely point to 14900KS/K/non-K/T, 13900KS/K/non-K/T, 14700K, and 13700K being the one affected)
Right now, the only example given is HX (which is a repackaged desktop chip[^], as mentioned), so I'm not denying that the problem is happening on HX based on their claims (and it makes a lot of sense that HX is affected! See below), but what about H CPUs? What about P CPUs? What about U CPUs? The difference in impact between "only HX is impacted" and "HX/H/P/U parts are all affected" is a few orders of magnitude (a handful of very top-end 13th Gen mobile SKUs versus every 13th Gen mobile SKU). Currently, we don't have enough data on how widespread the issue is, and that makes it difficult to assess who is impacted by this issue from this data alone.
[^]: HX is the only mobile CPU with B0 stepping, which is the same as desktop 13th/14th Gen, while the mobile H/P/U family are J0 and Q0, which are essentially a higher clocked 12th Gen (i.e., using Golden Cove rather than Raptor Cove)
Alderon are the people claiming 100% of units fail which doesn’t seem supported by anyone else either. Wendell and GN seem to have scoped the issue to around 10-25% across multiple different sources.
Like they are the most extreme claimants at this point. Are they really credible?
Ok, I take it back, this looks pretty indicative of a low-load problem and evidently failure rates are much higher in that scenario.
https://www.youtube.com/watch?v=yYfBxmBfq7k
I haven't watched the video in its entirety, but it does feel like single-core boost might be the main culprit in this scenario. That actually makes a lot of sense as to why game servers are the ones affected the most. Though that makes me wonder about HX CPUs failing, since those SKUs don't have TVB, but given Wendell's 10-25% failure rate on all-core loads, it does seem like Intel may actually have multiple issues with RPL here. Things definitely don't look good.
For server CPUs, is there not a similar problem, or do they realize server purchasers may be less willing to tolerate it? I'm not all that thrilled with the prospect of buying Intel, especially when planning on stretching replacement out to 5 years compared to a few generations ago, but AMD server choices can be a bit limited, and I'm not really sure how to evaluate whether there may be increasing surprises across the board.
Are you talking about Xeon Scalable? Although they share the same core design as their desktop counterparts (Xeon Scalable 4th Gen shares the same Golden Cove as 12th Gen, Xeon Scalable 5th Gen shares the same Raptor Cove as 13th/14th Gen), they're very different from the desktop parts (monolithic vs tile/EMIB-based, ring bus vs mesh, power gate vs FIVR), and often run in a more conservative configuration (lower max clock, more conservative V/F curves, etc.). There has been a rumor about Xeon Scalable 5th Gen having the same issue, but it's more gossip than a data point.
The issue does happen with desktop chips that are being used in a server context when paired with a workstation chipset such as W680. However, there haven't been any reports of Xeon E-2400/E-3400 (which is essentially a desktop chip repurposed as a server part) with C266 having these issues, though that may be because there hasn't been a large deployment of these chips in servers just yet (or even if there has been, it's still too early to tell).
Do note that even without this particular issue, Xeon Scalable 4th Gen (Sapphire Rapids) is not a good chip (speaking from experience, I'm running w-3495x). It has plenty of issues such as slow clock ramp, high latency, high idle power draw, and the list goes on. While Xeon Scalable 5th Gen (Emerald Rapids) seems to have fixed most of these issues, Zen 4 EPYC is still a much better choice.
After watching https://youtube.com/watch?v=gTeubeCIwRw and some related content, I personally don't believe it's an issue fixable with microcode. I guess we'll see.
Because HN doesn't provide link previews, I'd recommend adding some information about the content to your comment. Otherwise we have to click through to YouTube for the comment to make any sense.
That said, the video is the GamersNexus one where they talk about an unverified claim that this is a fabrication process issue caused by oxidation between atomic deposition layers. If that's the case, then yeah, microcode can only do so much. But like Steve says in the video, the oxidation theory has yet to be proven and they're just reporting what they have so far ahead of the Zen 5 reviews coming soon.
GN mentioned shipping a few samples to a lab (number dependent on the price quote from said lab), so I hope we’ll have some closure regarding this hypothesis.
Hopefully Intel ships them, and allows them to test and publish benchmarks with the current pre-release microcode revision for review comparison.
Are the CPUs that received elevated operating voltage permanently damaged?
This is the most pressing question. If it were just a microcode issue, a cool-off and power cycle ought to at least reset things, but according to Wendell from Level1Techs, that doesn't seem to always be the case.
The problem is that running at too high of a voltage for sustained periods can cause physical degradation of the chip in some cases. Hopefully not here!
> can cause physical degradation of the chip in some cases.
Not in some cases. Chips always physically degrade regardless of voltage. Higher voltages will make it happen faster.
Yes, but usually this happens slowly enough that the chip will be long obsolete before the degradation becomes an issue.
Why do chips degrade? Is this due to the whiskers I’ve heard about?
> Why do chips degrade? Is this due to the whiskers I’ve heard about?
No, tin whiskers are a separate issue, which happens mostly outside the chips. The keyword you're looking for is electromigration (https://en.wikipedia.org/wiki/Electromigration).
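For a rough intuition on why current and heat matter here: Black's equation is the standard rule-of-thumb model for electromigration lifetime, MTTF = A * J^-n * exp(Ea / (k*T)). A small Python sketch with illustrative constants (the activation energy and exponent are typical textbook values, not anything specific to Raptor Lake):

    import math

    K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

    def mttf_relative(j_rel, temp_k, n=2.0, ea_ev=0.9,
                      j_ref=1.0, temp_ref_k=350.0):
        """MTTF relative to a reference operating point (Black's equation as a ratio)."""
        current_term = (j_rel / j_ref) ** (-n)
        thermal_term = math.exp(ea_ev / (K_BOLTZMANN_EV * temp_k)
                                - ea_ev / (K_BOLTZMANN_EV * temp_ref_k))
        return current_term * thermal_term

    # ~20% more current density at the same temperature -> roughly 0.7x the lifetime
    print(mttf_relative(1.2, 350.0))
    # same current density but 20 K hotter -> roughly 0.2x the lifetime
    print(mttf_relative(1.0, 370.0))

The point being: degradation is always happening, but the rate is very sensitive to how hard (and how hot) the chip is being pushed.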
Not instantly it seems, but there have been reports of degradation over time. It will be a case-by-case thing.
Possible electromigration damage, yes.
Just want to say, I'm incredibly happy with my 7800X3D. It runs at ~70C max with a $35 air cooler, like Intel chips used to, and it's on average the fastest chip for gaming workloads right now.
I'm also very happy with my 5800X3D, it was wonderful value back when AM5 had just released and DDR5/Motherboards still cost an arm and a leg.
The energy efficiency is much appreciated in the UK with our absurd price of electricity.
Same, in my BIOS I can activate an "ECO Mode", which lets me decide if I want to run my 7950X at the full 170W TDP, at 105W TDP, or at 60W TDP.
I benchmarked it: the difference between 170W and 105W is basically zero, and dropping to 60W is just a few percent of a performance hit, but well worth it, as electricity is ~0.3€/kWh over here.
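Quick back-of-the-envelope on what that cap is worth at ~0.30 €/kWh (the 6 hours/day of loaded use is my own assumption, and TDP is only a proxy for actual package power):

    EUR_PER_KWH = 0.30
    HOURS_PER_DAY = 6      # assumed hours of loaded use per day
    DAYS_PER_YEAR = 365

    def yearly_cost_eur(package_watts):
        kwh = package_watts / 1000 * HOURS_PER_DAY * DAYS_PER_YEAR
        return kwh * EUR_PER_KWH

    for tdp in (170, 105, 60):
        print(f"{tdp} W -> ~{yearly_cost_eur(tdp):.0f} EUR/year")
    # 170 W -> ~112, 105 W -> ~69, 60 W -> ~39 EUR/year of CPU package power alone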
(if you are running Windows)
You might want to check a tool called PBO2 Tuner (https://www.cybermania.ws/apps/pbo2-tuner/); you can tweak values like EDC, TDC, and PPT (power limit) from the GUI, and it also accepts command-line arguments so you can automate those tasks.
I made scripts that "cap" the power consumption of the CPU based on what applications are running (i.e., only going all-in on certain games, dynamically swapping between handmade 65/90/120/180W profiles).
I made it with power saving in mind, given that idle power consumption is rather high on modern Ryzens.
edit: actually, I made a mistake, given that PBO2 Tuner is for Zen 3 CPUs and you mentioned Zen 4.
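For what it's worth, a rough sketch of the kind of per-application power-cap automation described above (Python with psutil; the executable path, the profile values, and especially the command-line argument order are placeholders I made up, so check what your build of the tool actually accepts before relying on it):

    import subprocess
    import time
    import psutil  # pip install psutil

    PBO2_TUNER = r"C:\Tools\PBO2 tuner.exe"  # placeholder path

    # hypothetical profiles: process name -> (PPT W, TDC A, EDC A)
    PROFILES = {
        "somegame.exe": (180, 120, 180),
        "blender.exe":  (120, 90, 140),
    }
    DEFAULT = (65, 60, 90)  # low-power profile when nothing demanding is running

    def pick_profile():
        running = {p.info["name"].lower() for p in psutil.process_iter(["name"])
                   if p.info["name"]}
        for exe, limits in PROFILES.items():
            if exe in running:
                return limits
        return DEFAULT

    def apply(limits):
        ppt, tdc, edc = limits
        # argument order/format is an assumption -- adjust to the tool's actual CLI
        subprocess.run([PBO2_TUNER, str(ppt), str(tdc), str(edc)], check=False)

    if __name__ == "__main__":
        current = None
        while True:
            limits = pick_profile()
            if limits != current:
                apply(limits)
                current = limits
            time.sleep(30)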
I was concerned this would happen to them, given how much power was being pushed through their chips to keep them competitive. I get the impression their innovation has either truly slowed down, or AMD thought enough 'moves' ahead with their tech/marketing/patents to paint them into a corner.
I don't think Intel is done though, at least not yet.
Curious why Intel announced this on their community forums, rather than somewhere more official.
That’s probably where people are most likely to understand it. A lot of companies do this, especially while they’re still learning things.
These days people are more likely to see the announcement on YouTube, TikTok, or Twitter.
The first two require a lot more effort in video editing than creating a forum post. Plus, it’s just going to be digested and regurgitated for the masses by people much better at communicating technical information.
So weird hearing high-noise channels as the preferred distribution.
they did that too
https://youtu.be/wkrOYfmXhIc
Optics / stock price
Based on what I know about corporations, it's entirely plausible that the folks posting the information don't actually have access to the communication channels you are referring to. I don't even know how I would issue an official communication at my own company if the need ever came up... so you go with what you have.
Note how they mentioned it's still going to be tested with various partners before release.
I.e., we think this might solve it, but if it doesn't, we can roll back with the least amount of PR attention.
The amount of current their chips pull at full boost is pretty crazy. It would definitely not surprise me if some could get damaged by extensive boosting.
I built a system last fall with an i9-13900K and have been having the weirdest crashing problems with certain games that I never had problems with before. NEVER been able to track it down, no thermal issues, no overclocking, all updated drivers and BIOS. Maybe this is finally the answer I've been looking for.
It was for me. Check for BIOS updates - most motherboard vendors have them. Look for and enable something labeled Intel Baseline Profile and then check. That cured it for me.
For Asus: https://www.pcgamer.com/hardware/motherboards/asus-adds-inte...
I'll try that, thanks. Although the current cohort of games I play seems more stable now. If I ever go back to EVE Online then it'd be more of an issue - that thing crashed constantly.
Dumb question: let’s say I am in charge of procurement for a significant number of machines, do I not have the option of ordering machines from three generations back? Are older (proven reliable) processors just not available because they’re no longer made, like my 1989 Camry?
Yeah, 12th gen is probably still available.
Nice that Intel acknowledges there are problems with that CPU generation. If I read this right, the CPUs have been supplied with a too-high voltage across the board, with some tolerating the higher voltages for longer, others not so much.
Curious to see how this develops in terms of fixing defective silicon.
They already tried BIOS updates when they pushed out the "Intel defaults" a couple of months ago...
Except they didn't. https://www.pcworld.com/article/2326812/intel-is-not-recomme...
Firmware and microcode aren't the same thing.
Very true and that's why it is odd that microcode has been mentioned here. Surely they mean PCU software (Pcode), or code for whatever they are calling the PCU these days.
I assume Intel's "microcode" updates include the PCU code, maybe some ME code, and whatever other little cores are hiding in the chip.
Well, do they? The operating system can provide microcode updates to a running CPU. Can the operating system patch the PCU, too?
When I look at a "BIOS update" it usually seems to include UEFI, peripheral option ROMs, ME updates, and microcode. So if the PCU is getting patched I would think of it as a BIOS update. I think the ergonomics will be indistinguishable for end users.
firmware can include microcode though
Good for Intel to finally "figure it out", but I'm not 100% sure microcode is 100% of the problem. As with anything complex enough, the "problem" can actually be many compounded problems; motherboard vendors' "special" tuning comes to mind.
But this is already a mess that's very hard to clean up, since I suspect many of these CPUs will die in a year or two because of today's problems, but by then nobody will remember this and an RMA will be "difficult", to say the least.
You're right - at least partly. If the issue is that Intel was too aggressive with voltages, they can use microcode updates as 1) an excuse to rejigger the power levels and voltages the BIOS uses as part of the update, and 2) a way to have the processor itself be more conservative with the voltages and clocks it calculates for itself.
Anything Intel announces, in my experience, is half true, so I'm interested to see what's actually true and what Intel will just forget to mention or will outright hide.
> Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages.
That’s great news for intel. If that’s correct. If not that’ll be a PR bloodbath
Is there any info on how to diagnose this problem? Having just put together a computer with the 14900KF, I really don't want to swap it out if not necessary.
There is no reliable way to diagnose this issue on 14th Gen; the chip slowly degrades over time and you start getting more and more crashes (usually GPU driver crashes under Windows). If I remember Wendell's (Level1Techs) video correctly, the easiest check might be to run decompression stress tests.
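A minimal sketch of what a decompression-style soak test can look like (this is not Wendell's actual methodology, just a standard-library round-trip loop of my own; on a healthy CPU it should never report a mismatch or a zlib error):

    import os
    import zlib

    def soak(iterations=50_000, block_size=4 * 1024 * 1024):
        errors = 0
        for i in range(iterations):
            data = os.urandom(block_size)          # fresh random block each pass
            packed = zlib.compress(data, level=6)  # compress...
            try:
                ok = zlib.decompress(packed) == data  # ...and verify the round trip
            except zlib.error:
                ok = False
            if not ok:
                errors += 1
                print(f"iteration {i}: decompression round-trip failed")
        print(f"done, {errors} errors")

    if __name__ == "__main__":
        soak()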
I highly recommend going into your motherboard settings right now and manually setting your configuration to the current Intel recommendations, to prevent the chip from degrading to the point where you'd need to RMA it. I have a 14900K and it took about 2.5 months before it started going south, and it was getting worse by the DAY for me. Intel has closed my RMA ticket since changing the BIOS settings to very-low-compared-to-the-original values has made the system stable again, so I guess I have a 14900K that isn't a high-end chip anymore.
Below are the configs Intel provided to me on my RMA ticket that have made my clearly degraded chip stable again:
CEP (Current Excursion Protection): Enable
eTVB (Enhanced Thermal Velocity Boost): Enable
TVB (Thermal Velocity Boost): Enable
TVB Voltage Optimization: Enable
ICCMAX Unlimited Bit: Disable
TjMAX Offset: 0
C-States (including C1E): Enable
ICCMAX: 249A
ICCMAX_APP: 200A
Power Limit 1 (PL1): 125W
Power Limit 2 (PL2): 188W
OCCT burn-in test with AVX and XMP disabled.
Tbh, XMP is probably the cause of most modern crashes on gaming rigs. It does not guarantee stability. After finding a stable CPU frequency, enable XMP and roll back the memory frequency until you have no errors in OCCT. The whole thing can be done in 20 minutes and your machine will have 24/7/365 uptime.
This is good advice for overclocking, but how does it help with the 13th/14th Gen issue? The issue is not due to clocks, or at least doesn't appear to be.
Running a full memtest overnight and a day of Prime95 with validation is the traditional way of sussing out instability.
It's also a terrible stability test these days, for the same reasons Wendell talks about with Cinebench in his video with Ian (and Ian agrees). It doesn't exercise something like 90% of the chip - it's purely a cache/AVX benchmark. You can have a completely unstable frontend and it'll work fine, because Prime95 fits in the icache and doesn't need the decoder, and it's just vector op, vector op, vector op forever.
You can have a system that's 24/7 Prime95 stable that crashes as soon as you exit out, because it tests so very little of the chip. That's actually not uncommon due to the changes in frequency state that happen once the chip idles down… and it's been this way for more than a decade; SpeedStep used to be one of the things overclockers would turn off because it posed so many problems versus a stable, constant-frequency load.
Prime95 also completely ignores the high-clock domain, btw, so a system can be completely Prime95 stable yet fail on a routine task that boosts up. So it's technically not even a full test of core stability.
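A toy illustration of that point: a load pattern that keeps bouncing between idle and an all-core burst forces the chip to keep crossing frequency/voltage states and boosting back up, which a constant Prime95-style load never does. This is just a sketch of the idea (plain Python, my own naming), not a validated stability test:

    import multiprocessing as mp
    import time

    def burn(seconds):
        # trivial floating-point spin to load one core for a short burst
        x = 0.0
        end = time.time() + seconds
        while time.time() < end:
            x = (x + 1.0) * 1.0000001

    def burst_cycle(burst_s=2.0, idle_s=2.0, cycles=300):
        for _ in range(cycles):
            procs = [mp.Process(target=burn, args=(burst_s,))
                     for _ in range(mp.cpu_count())]
            for p in procs:
                p.start()
            for p in procs:
                p.join()
            time.sleep(idle_s)  # let the chip idle down before the next boost

    if __name__ == "__main__":
        burst_cycle()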
Hmm, mid August is after the new Ryzens are out, I wonder how bad of a performance hit this microcode update will bring?
And will it actually fix the issue?
https://www.youtube.com/watch?v=QzHcrbT5D_Y
(updated from other post about mobile crashes)
Related:
Complaints about crashing 13th,14th Gen Intel CPUs now have data to back them up
https://news.ycombinator.com/item?id=40962736
Intel is selling defective 13-14th Gen CPUs
https://news.ycombinator.com/item?id=40946644
Intel's woes with Core i9 CPUs crashing look worse than we thought
https://news.ycombinator.com/item?id=40954500
Warframe devs report 80% of game crashes happen on Intel's Core i9 chips
https://news.ycombinator.com/item?id=40961637
That one is mobile, this one is desktop, which they claim are different causes.
Not a dupe.
If I didn’t just recently invest in 128gb of DDR4 I’d jump ship to AMD/AM5. My 13900k has been (knock on wood) solid though - with 24/7 uptime since July 2023.
I guess you’re lucky. I own two machines for small-scale CNN training, one 13900K and one 14900K. I have to throttle CPU performance to 90% for stable running. This costs me about 1 hour per 100 hours of training.
Are you using any motherboard overclocking stuff? A lot of mobo’s are pushing these chips pretty hard right out of the box.
I have mine at factory settings that Intel would suggest, not the ASUS Multi-Core Enhancement crap. Noctua NH-D15 cooler. It’s really been a stable setup.
I didn’t set up anything in the BIOS. But my motherboard is from ASUS. I will look into this. Thanks for your suggestion.
Make sure you update the BIOS and enable the Intel Baseline Profile - that cleared up the crashing issues I was having with my 14th Gen i9.
https://www.pcgamer.com/hardware/motherboards/asus-adds-inte...
My 13900K has definitely degraded over time. I was running BIOS defaults for everything and the PC was fine for several months. When I started getting crashes it took me a long time to diagnose it as a CPU problem. Changing the mobo vdroop setting made the problem go away for a while, but it came back. I then got it stable again by dropping the core multipliers down to 54x, but a couple of months later I had to drop to 53x. I just got an RMA replacement and it has made it 12 hours without issue.
I evaluated DDR4 vs DDR5 a year ago and it wasn’t worth it: chasing FPS, the cost to hit the same effective speed on DDR5 was just too high, and I’m glad I stayed put. I’m on a 13700K and I’m also very stable. However, with the stock XMP profile for my RAM I was very much not stable, getting errors and BSODs within minutes in an OCCT burn-in test. All I had to do was roll back the memory clock speed a few hundred MHz.
by "microcode" i assume they meant "pcode" for the PCU? (but they decided not to make that distinction here for whatever reason?)
"Elevated operating voltage" my foot.
We've already seen examples of this happening on non-OC'd server-style motherboards that perfectly adhere to the Intel spec. This isn't like ASUS going 'hur dur 20% more voltage' and frying chips. If that's all it was, it would be obvious.
Lowering voltage may help mitigate the problem, but it sure as shit isn't the cause.
It's worth noting that W680 boards are not server boards; they're workstation boards, and oftentimes they're overclockable (or even overclocked by default). Wendell actually showed the other day that the ASUS W680 board was feeding 253W into a 35W (106W boost) 13700T CPU by default[1].
Supermicro and ASRock Rack do sell W680 as a server platform (because it took Intel a really long time to release C266), but while those boards are strictly to spec, some are really not meant for K CPUs. For example, the Supermicro MBI-311A-1T2N is only certified for non-TVB E/T CPUs, and trying to run a K CPU on it can result in the board pumping 1.55V into the CPU during single-core loads (where 1.4V would already be on the high side)[2].
In this particular case, the "non-OC'd server-style motherboard" doesn't really mean anything (even more so in the context of this announcement).
[1]: https://x.com/tekwendell/status/1814329015773086069
[2]: https://x.com/Buildzoid1/status/1814520745810100666
They also admit a microcode algorithm produces incorrect voltage requests; it doesn't sound like they're trying to shift the blame. ASUS doesn't write that microcode.
Specifically I think the concerns are around idle voltage and overshoot at this point, which is indeed something configured by OEMs.
edit: Buildzoid just put out a video talking about running Minecraft servers reliably destroying CPUs, topping out at 83C, normally in the 50s, running at 3600 speeds. Which points to a clear issue with low-thread loads.
https://m.youtube.com/watch?v=yYfBxmBfq7k
An Intel employee is posting on reddit: https://www.reddit.com/r/intel/comments/1e9mf04/intel_core_1...
A recent YouTube video by GamersNexus speculated the cause of instability might be a manufacturing issue. The employee's response follows.
> Questions about manufacturing or Via Oxidation as reported by Tech outlets:
> Short answer: We can confirm there was a via Oxidation manufacturing issue (addressed back in 2023) but it is not related to the instability issue.
> Long answer: We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue.
> For the Instability issue, we are delivering a microcode patch which addresses exposure to elevated voltages which is a key element of the Instability issue. We are currently validating the microcode patch to ensure the instability issues for 13th/14th Gen are addressed.
So they were producing defective CPUs, identified & addressed the issue but didn’t issue a recall, defect notice or public statement relating to the issue?
Good to know.
It sounds like their analysis is that the oxidation issue is comfortably below the level of "defective".
No product will ever be perfect. You don't need to do a recall for a sufficiently rare problem.
And in case anyone skims, I will be extra clear, this is based on the claim that the oxidation is separate from the real problem here.
They could recall the defective batch. All of the CPUs with that defect will fail from it. They seem to have been content to hope no one noticed.
What makes you think there was a "defective batch"? What makes you think all the CPUs affected by that production issue will fail from it?
That description sounds to me like it affected the entire production line for months. It's only worth a recall if a sufficient percent of those CPUs will fail. (I don't want to argue about what particular percent that should be.)
My CPU was unstable for months; I spent tens of hours and hundreds on equipment to troubleshoot (I _never_ thought my CPU would be the cause). Had I known this, I would have scrutinised the CPU a lot sooner than I did.
Intel could have made a public statement about the potentially defective products with good PR spin: 'we detected an issue, we believe the defect rate will be < 0.25%, here's a test suite you can run, call us if you think you're one of the 0.25%!' But they didn't.
I’m never buying an intel product again. Fuck intel.
This comment chain is talking about the oxidation in particular, and specifically the situation where the oxidation is not the cause of the instability in the title. That's the only way they "identified & addressed the issue but didn’t issue a recall".
Do you have a reason to think the oxidation is the cause of your problems?
Did you not read my first post trying to clarify the two separate issues?
Am I misunderstanding something?
Oxidisation in the context of CPU fabrication sounds pretty bad; I find it hard to believe it would have no impact on CPU stability, regardless of what Intel's PR team says while minimizing any actual impact.
Edit: it sounds like Intel has been aware of stability issues for some time and said nothing. I'm not sure we have any reason to trust anything they say moving forward, relating to oxidisation or any other claims they make.
Well they didn't notice it for a good while, so it's really hard to say how much impact it had.
And at a certain point, if you barely believe anything they say, then you shouldn't be using their statement as the thing to get mad about. The complaint you're making depends on very particular parts of their statement being true but other very particular parts being not true. I don't think we have the evidence to do that right now.
> Well they didn't notice it for a good while, so it's really hard to say how much impact it had.
That negates any arguments you had related to failure rates.
> The complaint you're making depends on very particular parts of their statement being true but other very particular parts being not true
Er, I'm not even sure how to respond to this. GamersNexus indicated they knew about the oxidisation issue, and Intel *subsequently* confirmed it was known internally, but no public statement was made until now. I'm not unreasonably cherry-picking parts of their statement and then drawing unreasonable conclusions. Intel have very clearly demonstrated they would have preferred not to disclose an issue in their fabrication processes which very probably caused defective CPUs, and they have demonstrated untrustworthy behaviour throughout this entire thing (L1Techs and GN are breaking the defective-CPU story following leaks from major Intel clients, who have indicated that Intel is basically refusing to cooperate).
Intel has known about these issues for some time and said nothing. They have cost organisations and individuals time and money. Nothing they say now can be trusted unless it involves them admitting fault.
> That negates any arguments you had related to failure rates.
I mean it's hard for us to say, without sufficient data. But Intel might have that much data.
Also what argument about failure rates? The one where I said "if" about failure rates?
> Er, I’m not even sure how to respond to this. GamersNexus has indicated they know about the oxidisation issue, intel subsequently confirm it was known internally but no public statement was made until now.
GamersNexus thinks the oxidation might be the cause of the instability everyone is having. Intel claims otherwise.
Intel has no reason to lie about this detail. It doesn't matter if the issue is oxidation versus something else.
Also the issue Intel admits to can't be the problem with 14th gen, because it only happened to 13th gen chips.
> Intel has known about these issues for some time and said nothing. Nothing they say now can be trusted unless it involves them admitting fault.
If you don't trust what Intel said today at all, then you can't make good claims about what they knew or didn't know. You're picking and choosing what you believe to an extent I can't support.
GN called out the lack of further details relating to oxidisation, fyi.
It is the Pentium FDIV drama all over again! [1] It is even in chapter 4 of Andrew Grove's book!
[1] https://en.wikipedia.org/wiki/Pentium_FDIV_bug
Dude's gonna be canned so hard.
Intel cannot afford to be anything but outstanding in terms of customer experience right now. They are getting assaulted on all fronts and need to do a lot to improve their image to stay competitive.
Intel should take a page out of HP's book when it came to dealing with a bug in the HP-35 (first pocket scientific calculator):
> The HP-35 had numerical algorithms that exceeded the precision of most mainframe computers at the time. During development, Dave Cochran, who was in charge of the algorithms, tried to use a Burroughs B5500 to validate the results of the HP-35 but instead found too little precision in the former to continue. IBM mainframes also didn't measure up. This forced time-consuming manual comparisons of results to mathematical tables. A few bugs got through this process. For example: 2.02 ln e^x resulted in 2 rather than 2.02. When the bug was discovered, HP had already sold 25,000 units which was a huge volume for the company. In a meeting, Dave Packard asked what they were going to do about the units already in the field and someone in the crowd said "Don't tell?" At this Packard's pencil snapped and he said: "Who said that? We're going to tell everyone and offer them a replacement. It would be better to never make a dime of profit than to have a product out there with a problem". It turns out that less than a quarter of the units were returned. Most people preferred to keep their buggy calculator and the notice from HP offering the replacement.
https://www.hpmuseum.org/hp35.htm
I wonder if Mr. Packard's answer would have been different if a recall would have bankrupted the company or necessitated layoff of a substantial percentage of staff.
I can't speak for Dave Packard (or Bill Hewlett) - but I will try to step in to their shoes:
1) HP started off in test and measurement equipment (voltmeters, oscilloscopes etc.) and built a good reputation up. This was their primary business at the time.
2) The customer base of the HP-35 and test and measurement equipment would have a pretty good overlap.
Suppose the bug had been covered up, found, and then the news about the cover up came to light? Would anyone trust HP test and measurement equipment after that? It would probably destroy the company.
Or the potential of killing a couple hundred passengers, or a few astronauts. Oh, wait...
Their acquisition of Altera seemed to harm both companies irreparably.
Any company can reach a state where the Process people take over, and the Product people end up at other firms.
Intel could have grown a pair, and spun the 32 core RISC-V DSP SoC + gpu for mobile... but there is little business incentive to do so.
Like any rotting whale, they will be stinking up the place for a long time yet. =)
Could you elaborate on the process people versus product people?
I assume they're referring to Steve Jobs' comments in this (Robert Cringely IIRC) interview: https://www.youtube.com/watch?v=l4dCJJFuMsE (not a great copy, but should be good enough)
Partly true, Steve Jobs had a charismatic tone when describing these problems in public.
Have a great day, =3
Oh yeah, this got rehashed as builders versus talkers too. Yeah, there's a lot of this creative vibe type dividing. It's pretty complicated, I don't even think individual people operate the same when placed in a different context. Usually their output is a result of their incentives, so typically management failure or technical architect failure.
I would argue the fabrication process people at Intel are core to their business. Without the ability to reliably manufacture chips, they're dead in the water.
You mean manufacturing "working chips" is supposed to be their business.
It is just performance art with proofing wafers unless the designs work =3
It is an old theory that accurately points out Marketing/Sales division people inevitably out-compete product innovation people in a successful firm.
https://en.wikipedia.org/wiki/Competitive_exclusion_principl...
And yes, the Steve Jobs interview does document how this almost destroyed Apple's core business. =)
Just to clarify do you mean employees marketing and selling their innovation skills or people literally in marketing and sales?
Shameless self-promotion is usually not a problem in most commercial settings. Sad, but true... lol =)
Letting Marketing/Finance people set technological product trajectories sooner or later becomes detrimental to large firms.
i.e. the product line becomes disconnected from the consumers actual experience of utility, novelty, and perceived scarcity. =)
[flagged]
Without those weirdos do you think Intel would be doing anything about this in public?
And tell us how customers who bought the most expensive part in the lineup should feel about knowing that their CPU has been overvolted from day one of operation.
Yeah sure, calling out Intel for lack of any good updates over the crashing laptop/desktop CPUs and demanding a recall after giving them such a long time to come up with a reasonable solution is definitely "weirdo" territory.
FWIW, I have connections who splurged on these only to deal with BSODs all the friggin' time. Some of them even work at Intel.
We have yet to see
- How much lifespan of these CPUs has already been lost and cannot be recovered by the microcode patch.
- How much of a performance hit these CPUs will get after applying the patch.
Wonder what Linus has to say on this. Dude knows how to rip into crappy Intel products
Torvalds or the Youtube guy?
Yes
I can imagine both will bash intel a bit.
"Linus Tech Tips" for the gaming crowd situation (loss of "paid for" premium performance) and Torvalds for the hardware vendor lack of transparency with the community.
So on one hand they are saying it's voltage (i.e. something external, not their fault, bad mainboard manufacturers!).
On the other hand they are saying they will fix it in microcode. How is that even possible?
Are they saying that their CPUs are signaling the mainboards to give them too much voltage?
Can someone make sense of this? It reminds me of Steve Jobs' You Are Holding It Wrong moment.
Saying "elevated voltage causes damage" is not attributing blame to anyone. In the very next sentence, they then attribute the reason for that elevated voltage to their own microcode, and so it is responsible for the damage. I literally do not know how they could be any clearer on that.
> Are they saying that their CPUs are signaling the mainboards to give them too much voltage?
Yes that's exactly what they said.
So it's a 737 MAX problem: the software is running a control loop that doesn't have deflection limits. So it tells the stabilizer (or voltage reg in this case) to go hard nose down.
lol what a stretch of an analogy
The voltage supplied by the motherboard isn't supposed to be constant. The CPU is continuously varying the voltage it's requesting, based primarily on the highest frequency any of the CPU cores are trying to run at. The motherboard is supposed to know what the resistive losses are from the VRMs to the CPU socket, so that it can deliver the requested voltage at the CPU socket itself. There's room for either party to screw up: the CPU could ask for too much voltage in some scenarios, or the motherboard's voltage regulation could be poorly calibrated (or deliberately skewed by overclocking presets).
On top of all this mess: these products were part of Intel's repeated attempts to move the primary voltage rail (the one feeding the CPU cores) to use on-die voltage regulators (DLVR). They're present in silicon but unused. So it's not entirely surprising if the fallback plan of relying solely on external voltage regulation wasn't validated thoroughly enough.
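A toy numeric sketch of that request/compensation dance (the V/F points and effective loadline resistance below are made-up illustrative numbers, not Intel's tables, and real boards use more involved AC/DC loadline settings):

    VF_CURVE = [        # (core frequency in GHz, requested voltage in V) -- illustrative
        (3.0, 0.90),
        (4.5, 1.10),
        (5.5, 1.35),
        (6.0, 1.50),
    ]
    LOADLINE_OHMS = 0.0011  # assumed effective resistance from VRM to socket

    def requested_voltage(freq_ghz):
        """Linear interpolation over the toy V/F curve."""
        if freq_ghz <= VF_CURVE[0][0]:
            return VF_CURVE[0][1]
        for (f0, v0), (f1, v1) in zip(VF_CURVE, VF_CURVE[1:]):
            if f0 <= freq_ghz <= f1:
                return v0 + (freq_ghz - f0) / (f1 - f0) * (v1 - v0)
        return VF_CURVE[-1][1]

    def vrm_setpoint(freq_ghz, current_a):
        """Voltage the VRM targets so the socket still sees the request under load."""
        return requested_voltage(freq_ghz) + current_a * LOADLINE_OHMS

    # single-core boost: high requested voltage, low current, little droop to make up
    print(requested_voltage(6.0), vrm_setpoint(6.0, 50))    # 1.50 V requested, ~1.56 V set
    # all-core load: lower requested voltage, huge current, lots of droop compensation
    print(requested_voltage(4.5), vrm_setpoint(4.5, 300))   # 1.10 V requested, ~1.43 V set

The failure modes in the thread map onto the two knobs: the CPU can request too much at the top of the curve, or the board can over-compensate for droop.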
My guess is something like the following:
Modern CPUs are incredibly complex machines with a ridiculously large number of possible configuration states (too large to exhaustively test after manufacture or simulate during design), e.g. a vector multiply in flight with an AES encode in flight with an x87 sincos, etc. Each operation is going to draw a certain amount of current. It is impractical to guarantee every functional unit its maximum required current simultaneously, so the supply rails are sized for a "reasonable worst case".
Perhaps an underestimate was mistakenly made somewhere and not caught until recently. Therefore the fix might be to modify the instruction dispatcher (via microcode) to guarantee that certain instruction configurations cannot happen (e.g. let the x87 sincos stall until the vector multiply is done) to reduce pressure on the voltage regulator.
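A toy model of the kind of constraint being guessed at here (the unit names, currents, and budget are invented for illustration; real power management is far more involved):

    # invented per-unit current estimates in amps
    UNIT_CURRENT_A = {
        "vector_mul": 40,
        "aes":        12,
        "x87_sincos":  8,
        "load_store": 15,
    }
    BUDGET_A = 55  # invented dispatch-time current budget

    def can_dispatch(in_flight, candidate):
        """Refuse to issue an op whose unit would push the total draw past the budget."""
        draw = sum(UNIT_CURRENT_A[u] for u in in_flight) + UNIT_CURRENT_A[candidate]
        return draw <= BUDGET_A

    # a vector multiply plus an AES encode already draws 52 A, so the x87 sincos stalls...
    print(can_dispatch(["vector_mul", "aes"], "x87_sincos"))   # False -> stall a cycle
    # ...but with cheaper ops in flight it issues immediately
    print(can_dispatch(["aes", "load_store"], "x87_sincos"))   # True -> dispatch

A microcode change that tightens that kind of budget would trade a little throughput in pathological instruction mixes for lower peak current and voltage demand.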
It's worse than that, thermal management is part of the puzzle. Think of that as heat generation happening across three dimensions (X + Y + time) along with diffusion in 3D through the package.
It's an interesting idea, but there's a caveat: time flows in just one direction.
The claim seems to be that the microcode on the CPU is in certain circumstances requesting the wrong (presumably too high) voltage from the motherboard. If that is the case fixing the microcode will solve the issue going forward but won’t help people whose chips have already been damaged by excessive voltage.
The “you’re holding it wrong!” angle is all your take. They don’t make that claim.
"OK, great, let’s give everybody a case" lives on