My desktop PC seems to have recently trapped itself in Machine Check Heck, so I come to the Cemetech hivemind for opinions on what to do now.

My desktop PC is based on a Skylake Core i7-6700K CPU, running Windows 10 with 32GB of DDR4-2133 memory and a Radeon Vega 56 GPU. Despite being fairly old at this point, the performance is satisfactory for my gaming and other needs, but yesterday it bluescreened on me while doing some light gaming and videoconferencing with a WHEA_UNCORRECTABLE_ERROR stop code. Today it's been doing that with much greater frequency (4ish times today as of this writing?). Since WHEA_UNCORRECTABLE_ERROR usually corresponds to a machine check exception indicating some kind of hardware fault and the onset seems to have been rapid and not corresponding to any particular software changes, I'm inclined to think it's time to seriously consider upgrading my 6-year-old CPU to something more modern.


To further investigate the bluescreens I loaded up a memory dump in WinDbg on another machine (an even-older Thinkpad T430s) and got the following information out:
Code:
Windows 10 Kernel Version 19041 MP (8 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Edition build lab: 19041.1.amd64fre.vb_release.191206-1406
Machine Name:
Kernel base = 0xfffff801`44800000 PsLoadedModuleList = 0xfffff801`4542a190
Debug session time: Sat Dec  4 16:11:03.329 2021 (UTC - 8:00)
System Uptime: 0 days 0:02:02.986
Loading Kernel Symbols
...............................................................
................................................................
................................................................
.........................................
Loading User Symbols

Loading unloaded module list
.........
For analysis of this file, run !analyze -v
2: kd> !analyze
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon. Try !errrec Address of the WHEA_ERROR_RECORD structure to get more details.
Arguments:
Arg1: 0000000000000010, Error Source Type
Arg2: ffff8c0665cf9028
Arg3: ffff8c064a0a7bcc
Arg4: ffff8c064e12e1a0

Debugging Details:
------------------


BUGCHECK_CODE:  124

BUGCHECK_P1: 10

BUGCHECK_P2: ffff8c0665cf9028

BUGCHECK_P3: ffff8c064a0a7bcc

BUGCHECK_P4: ffff8c064e12e1a0

PROCESS_NAME:  System

MODULE_NAME: GenuineIntel

IMAGE_NAME:  GenuineIntel.sys

FAILURE_BUCKET_ID:  0x124_16_GenuineIntel__UNKNOWN_IMAGE_GenuineIntel.sys

FAILURE_ID_HASH:  {37af9407-4a3e-0b08-acdd-dadffdc34c3c}

Followup:     MachineOwner
---------

2: kd> !whea
Error Source Table @ fffff801454daed8
5 Error Sources
Error Source 0 @ ffff8c064a0a7b40
   Notify Type      : Unknown
   Type             : 0x10 (Invalid)
   Error Count      : 1
   Record Count     : 1
   Record Length    : 29f8
   Error Records    : wrapper @ ffff8c064a0a8000  record @ ffff8c064a0a8028
   Descriptor       : @ ffff8c064a0a7ba0
      Length                     : 3cc
      Max Raw Data Length        : d2c
      Num Records To Preallocate : 1
      Max Sections Per Record    : 3
      Error Source ID            : 0
      Flags                      : 00000000
Error Source 1 @ ffff8c065004b920
   Notify Type      : MCE (INT18)
   Type             : 0x0 (MCE)
   Error Count      : 0
   Record Count     : 8
   Record Length    : de8
   Error Records    : wrapper @ ffff8c065005f000  record @ ffff8c065005f028
                    : wrapper @ ffff8c065005fde8  record @ ffff8c065005fe10
                    : wrapper @ ffff8c0650060bd0  record @ ffff8c0650060bf8
                    : wrapper @ ffff8c06500619b8  record @ ffff8c06500619e0
                    : wrapper @ ffff8c06500627a0  record @ ffff8c06500627c8
                    : wrapper @ ffff8c0650063588  record @ ffff8c06500635b0
                    : wrapper @ ffff8c0650064370  record @ ffff8c0650064398
                    : wrapper @ ffff8c0650065158  record @ ffff8c0650065180
   Descriptor       : @ ffff8c065004b980
      Length                     : 3cc
      Max Raw Data Length        : 141
      Num Records To Preallocate : 8
      Max Sections Per Record    : 8
      Error Source ID            : 1
      Flags                      : 80000000
Error Source 2 @ ffff8c065004a920
WHEA_NOTIFICATION_DESCRIPTOR @ 0xffff8c065004a9b0
   Notify Type      : CMCI (Local Interrupt)
   Type             : 0x1 (CMC)
   Error Count      : 0
   Record Count     : 8
   Record Length    : de8
   Error Records    : wrapper @ ffff8c06500c7000  record @ ffff8c06500c7028
                    : wrapper @ ffff8c06500c7de8  record @ ffff8c06500c7e10
                    : wrapper @ ffff8c06500c8bd0  record @ ffff8c06500c8bf8
                    : wrapper @ ffff8c06500c99b8  record @ ffff8c06500c99e0
                    : wrapper @ ffff8c06500ca7a0  record @ ffff8c06500ca7c8
                    : wrapper @ ffff8c06500cb588  record @ ffff8c06500cb5b0
                    : wrapper @ ffff8c06500cc370  record @ ffff8c06500cc398
                    : wrapper @ ffff8c06500cd158  record @ ffff8c06500cd180
   Descriptor       : @ ffff8c065004a980
      Length                     : 3cc
      Max Raw Data Length        : 141
      Num Records To Preallocate : 8
      Max Sections Per Record    : 8
      Error Source ID            : 2
      Flags                      : 80000000
Error Source 3 @ ffff8c0650049920
   Notify Type      : NMI (INT2)
   Type             : 0x3 (NMI)
   Error Count      : 0
   Record Count     : 1
   Record Length    : 6c0
   Error Records    : wrapper @ ffff8c06500d4720  record @ ffff8c06500d4748
   Descriptor       : @ ffff8c0650049980
      Length                     : 3cc
      Max Raw Data Length        : 100
      Num Records To Preallocate : 1
      Max Sections Per Record    : 3
      Error Source ID            : 3
      Flags                      : 80000000
Error Source 4 @ ffff8c0650048920
   Notify Type      : Polled
   Type             : 0x7 (BOOT)
   Error Count      : 0
   Record Count     : 0
   Record Length    : 0
   Error Records    :    Descriptor       : @ ffff8c0650048980
      Length                     : 3cc
      Max Raw Data Length        : 1000
      Num Records To Preallocate : 1
      Max Sections Per Record    : 8
      Error Source ID            : 4
      Flags                      : 80000000

The value of 0x10 for BUGCHECK_P1 indicates an error flagged by a device driver; in this case it was GenuineIntel.sys. I'm not entirely clear on how to interpret multiple sources in the decoded error source table, but it looks like something triggered an NMI that was then interpreted to be a machine check exception. I also tried using !errrec to decode the individual MCE records, but the debugger spat out thousands of records that didn't seem to have anything interesting in them. Since an Intel driver seems to have flagged the MCE, I'm guessing that's a huge pile of information that might be interesting to somebody who knows more about the Intel-specific bits going on, but Windows doesn't ship with that information.
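As a rough aid for reading Arg1, here is a minimal Python sketch that maps the error source type to a name. The table contains only the codes that the !whea output above prints, plus 0x10 as the post reads it; it is not the complete Windows enumeration, and the function name is made up for illustration.

```python
# Error source type codes as printed by the !whea output above, plus 0x10,
# which the post reads as "device driver". This is NOT the full Windows enum.
WHEA_SOURCE_TYPES = {
    0x00: "MCE",
    0x01: "CMC",
    0x03: "NMI",
    0x07: "BOOT",
    0x10: "device driver (per the post; !whea prints 'Invalid')",
}


def describe_source_type(arg1: str) -> str:
    """Map a bugcheck 0x124 Arg1 hex string to a readable source type."""
    code = int(arg1, 16)
    return WHEA_SOURCE_TYPES.get(code, f"unknown (0x{code:x})")


# Arg1 exactly as shown in the !analyze output.
print(describe_source_type("0000000000000010"))
```

Any code missing from the table comes back as "unknown", so a gap in the table cannot be mistaken for a real answer.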

At a guess, I think this is most likely to indicate power delivery problems in my system- if something in the motherboard's power delivery subsystems has deteriorated and just started going out of spec, then rapid onset of symptoms at fairly high frequency seems plausible.


Having established that it seems I have hardware problems (but please do comment if you disagree or have other ideas!), the further question is what I should do to bring the system back into good working order. The options seem to be either
  • Replace the motherboard and hope that fixes it
  • Replace the motherboard and CPU, assuming the rest of the system is okay

Replacing the motherboard alone could be a more inexpensive option but may not actually fix my issues, and it might be difficult to get a new board that is compatible with a Skylake CPU. According to Ark, [ZHQ][12]70 chipsets support it, where my current board is based on the Z170 platform.

Browsing PCPartPicker, the few compatible boards that anybody has in stock run more than $200. There are a few compatible boards listed from various AliExpress sellers as well that run down to as low as $130, but that doesn't really seem any better after accounting for potential concerns around shipping time and product quality.

For upgrading the whole lot, there are twoish options:
  • AMD Zen 3 (Ryzen 5000)
  • Intel Alder Lake ("12th generation Core")
  • Alder Lake with DDR5
In either case I need a Micro ATX board, and would like 2.5-gigabit Ethernet (or faster!) to enable better transfer speeds to and from my home server at whatever time I opt to upgrade that and my network switching. As this is a machine used mostly for desktop applications, minimizing idle power isn't critical (it suspends when not in use) nor is minimizing power under load (it doesn't spend much time doing major number crunching either). By a similar token, single-threaded performance is probably more useful than a large improvement in throughput for multithreaded workloads, since I don't spend a lot of time doing big number crunching.

For AMD, I'm thinking a Ryzen 7 5700G (8C/16T) paired with a B550 motherboard for a total cost of around $660. While I could gain a little CPU performance at slightly higher cost with a 5800X (with a 100 MHz higher boost clock and twice as much L3 cache), I like the idea of having integrated graphics available should I need them at any time.

For Alder Lake, I like the look of the i5-12600K (6+4C/16T) and a Z690 motherboard, which run just under $800 (with the motherboard accounting for the majority of the cost increase over AMD). If I wanted to go with DDR5 memory, there's only one board matching my needs and that one costs $200 more than the others- factoring in that I would also need to buy new RAM for that, it seems like far too dear a price for a pretty small improvement over DDR4.

I'm leaning towards Alder Lake right now, since the performance is a little higher than Zen 3 although power consumption and cost are also higher. But what do you think is the better option, gallant readers?
You could try some other troubleshooting options first. Well, maybe you already did, but you didn't write about it :)

Besides the PSU, my first suggestion would be to double-check for thermal and dust issues. Thermal issues can cause MCEs - they did, multiple times, on an Athlon II X4 640 I recently replaced. At first, they only occurred about twice a year, every time the heatsink was full of dust. Later, the issue occurred even when there was little to no dust, so I changed the thermal paste with Arctic MX-2 (that's all the local computer shop had, it's a quite good choice anyway; I later bought a larger batch of the better MX-4 and now there's the MX-5), which fixed the MCEs.
You could unseat, dust off and reseat the RAM sticks - and if you're changing the thermal paste, that's a good opportunity to unseat and reseat the processor.

As for replacing the CPU + MB + RAM: you're right, DDR5 is way too expensive for now, with a minimal performance gain, partially due to the much higher latency, which the bandwidth increase can't hide. Should be better by 14th-gen Intel Core / Zen 5 in 2023-2024.
I recently went for the AMD option, with Ryzen 7 Pro 5750GE (8C, 35W TDP, integrated graphics), Gigabyte B550M DS3H (4 memory slots) and... uh, DDR4-2133 SO-DIMMs mounted onto DIMM adapters (!), pending the delivery of a specific model of DDR4-3600 CL14 RAM, one of the few part numbers which appear on the MB's QVL as supported in 4 sticks mode, should I need it in the future. It's already been two months since my order...
Being powered on 24/7, I wanted a low-power option for a change: Athlon II X4 640 has 95W TDP. The 8 cores are consistently running above 4 GHz on distributed computing at World Community Grid, I disabled SMT. I also shut down another old computer running 24/7 thanks to the new computer, which further decreased the power consumption.
As you indicated, Intel has marginally higher performance for significantly higher power consumption over the 5700G / 5800X... and has nothing that matches the performance / consumption ratio of the "GE" versions.

For your use case, I'd probably still choose the 5700G over the i5-12600K, for TCO reasons, although I understand well that power consumption isn't as much of an issue for you as it is for me. For light enough usage, the possibly slightly lower single core performance shouldn't be a really noticeable issue. Both models will be running in circles around your current Skylake equipment anyway :)

Oh, I forgot: if your current computer's woes are somehow fixed, for half a year at least, by improving thermal management or changing the PSU, then you may be able to wait for the Zen 4 series. I haven't checked whether it's supposed to keep compatibility with DDR4, but I surely hope so, given the situation with DDR5.
I expect to get some poking-and-prodding type debugging done today, but I'm not confident it'll help. Yesterday was mostly consumed by trying to make some backups (and then being foiled by BSODs) and doing non-computer things.
Lionel Debroux wrote:
Besides the PSU, my first suggestion would be to double-check for thermal and dust issues. Thermal issues can cause MCEs - they did, multiple times, on an Athlon II X4 640 I recently replaced.
I don't think thermals are likely to be a problem, since I haven't experienced any problems until very recently and the issues seem to be occurring even under light load. I think I'll boot it up and watch temperatures however, and indeed try reseating things (including the CPU if I can dig up some thermal paste).

I booted the system (which had stayed off overnight) and started a backup again while watching temperatures, which never exceeded 40 degrees. So thermals seem perfectly fine.

Quote:
As for replacing the CPU + MB + RAM: you're right, DDR5 is way too expensive for now, with a minimal performance gain, partially due to the much higher latency, which the bandwidth increase can't hide. Should be better by 14th-gen Intel Core / Zen 5 in 2023-2024.
In DDR5's favor, each DIMM is now two 32-bit channels rather than one 64-bit channel which helps improve performance on multithreaded workloads; but yes the higher latency hurts singlethreaded tasks slightly.

Quote:
I recently went for the AMD option, with Ryzen 7 Pro 5750GE (8C, 35W TDP, integrated graphics),
I'm surprised you found such a chip, since AFAICT the -GE variants are reduced-TDP configurations of the -G chips sold only to OEMs. Apparently the same characteristics can be achieved with a -G chip simply by applying a power limit in firmware.

Quote:
For your use case, I'd probably still choose the 5700G over the i5-12600K, for TCO reasons, although I understand well that power consumption isn't as much of an issue for you as it is for me. For light enough usage, the possibly slightly lower single core performance shouldn't be a really noticeable issue.
When I do load the system down it is at least some of the time in singlethreaded workloads, and load power consumption isn't terribly important since those heavy tasks are mostly interactive. For your use case the nice and efficient Zen-with-low-TDP configuration sounds like a great choice, but is indeed not so important to me.

It's encouraging to note that Anandtech's latest suggestions include either an AMD 5800X or an Intel i5-12600K. I don't expect to do any major upgrades to the system in the near future (so compatibility of the platform is not important), so those considerations are kind of moot, but it certainly seems that either the AMD or Intel choice will serve me well.

Quote:
Oh, I forgot: if your current computer's woes are somehow fixed, for half a year at least, by improving thermal management or changing the PSU, then you may be able to wait for the Zen 4 series. I haven't checked whether it's supposed to keep compatibility with DDR4, but I surely hope so, given the situation with DDR5.
My understanding is that Zen 4 is expected to adopt a new socket (AM5) and DDR5, and that's one of the reasons some commentators suggest waiting to upgrade right now: an AM5 board or CPU is expected to be supported by new counterparts for a long time following its release.
ACK about thermal issues being unlikely (EDIT: I replied to the first version of your post, I see that your later tests pretty much ruled them out - at least if the CPU temperature is correctly reported, which it wasn't on my Athlon II X4 - I routinely got temperatures well below ambient temperature at idle, and 40 was precisely around the danger zone).
Few models of thermal paste last even 5 years or more (the Arctic pastes I mentioned supposedly last 8 years, but they're an exception), so if Skylake was reasonably current when your computer was built, it's more likely than not that your CPU's thermal paste is dead anyway, although it may very well not be the cause of your issues, due to how brutally they started happening and how frequently they now appear.

I'm thinking now that if your computer were used 24/7, the fan might be a reasonable culprit, as many fans are rated for only 50K hours, which is less than 6 years, and a sudden fan failure could explain both the sudden appearance of CPU thermal issues and the computer failing even under relatively light load. But your computer has a much lower duty cycle, and a well-behaved computer should warn you about the CPU fan being out of order...
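The "less than 6 years" figure can be checked with a quick calculation. This is a small sketch: the 50K-hour rating comes from the paragraph above, while the 8 hours per day duty cycle is a made-up value for a desktop that suspends when idle, not something measured on this machine.

```python
def years_to_reach(rated_hours: float, hours_per_day: float) -> float:
    """Years of use before a fan accumulates its rated hours (365-day years)."""
    return rated_hours / (hours_per_day * 365)


# Running 24/7, as in the scenario described above.
print(round(years_to_reach(50_000, 24), 2))  # 5.71

# Hypothetical lighter duty cycle of 8 hours per day (an assumption).
print(round(years_to_reach(50_000, 8), 2))  # 17.12
```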
A year ago or so, a work computer's PSU failed after 8 years of 24/7 usage. On that occasion, I noticed that both the case and CPU fans were dead. As long as they kept spinning, we could have noticed only after a while, since that computer was relatively infrequently rebooted; however, spinning back up was another matter: they would turn only a handful of degrees after being pushed by hand. As we know, fans in proper condition run for seconds on their own after being launched by a finger. The thermal paste was completely dead, too - replaced by MX-4.

A couple months ago, there was no lack of sellers selling -GE variants in brand-new condition, in purchase units of 1, by splitting larger kits, indeed. Some of these sellers were out of stock when I tried to buy my CPU, but the one from which I ended up buying had dozens.

Yup, Zen 4 will bring a new socket.
I'm confident the thermal sensors are accurate (they're reading slightly above ambient at power-on), and whatever thermal paste is on there is some kind of aftermarket solution- possibly even Arctic Silver; I have an aftermarket tower-style cooler on it.

I'm continuing to run some diagnostics, but nothing very useful thus far. I'll update this list as I go.
  • Windows memory test passed
  • Intel Processor Diagnostic Tool passed (and temperatures stayed low throughout)
  • Removing one stick of RAM and running on a single one in the slot nearest the CPU socket didn't do anything (still crashed)
  • The sensor on my AVCC3 power supply rail shows large variance on that 3.3V rail anywhere from 0 to 6V, but I believe that's simply an incorrect reading rather than some actual problem. If the 3.3V rail were that far out of spec I'd expect more obvious problems.
  • Swapping the other RAM stick into slot 0 (nearest the CPU socket) seemed to work; it lasted 3 hours while I ran a backup
  • Installing the other stick (that failed when run alone) into slot 2 (which is the other memory channel) crashed quickly
  • Moving the DIMM from slot 2 to slot 1 (same memory channel as slot 0, where the other DIMM is installed) seems to be working okay, but I'll give it more time

Since moving memory around seems to have some effect, I'll call the DIMMs X and Y, with DIMM slots 0 and 1 as memory channel A while 2 and 3 are channel B. Summarizing configurations I've tested:
  • X0: seems stable
  • Y0: unstable
  • X0Y2: unstable
  • X0Y1: seems stable
  • X0Y2: seems stable? Since I believe this configuration failed quickly before, either the fault is more intermittent and harder to pin down than I thought, or I didn't accurately record the earlier test that I thought was this configuration.
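One way to keep these results straight is to tabulate them. The following Python sketch copies the outcomes from the list above, flags any configuration recorded with conflicting outcomes, and counts how often each DIMM was present in an unstable run. The helper names are made up for illustration, and with only five runs the counts are suggestive at best.

```python
from collections import defaultdict

# (configuration, outcome) pairs copied from the list above, in order.
results = [
    ("X0", "stable"),
    ("Y0", "unstable"),
    ("X0Y2", "unstable"),
    ("X0Y1", "stable"),
    ("X0Y2", "stable"),
]


def placements(config):
    """Split a label such as 'X0Y2' into [('X', 0), ('Y', 2)]."""
    return [(config[i], int(config[i + 1])) for i in range(0, len(config), 2)]


# Collect every outcome seen for each configuration.
outcomes = defaultdict(set)
for config, outcome in results:
    outcomes[config].add(outcome)

# Configurations that were recorded as both stable and unstable.
contradictory = sorted(c for c, seen in outcomes.items() if len(seen) > 1)
print(contradictory)  # ['X0Y2']

# How many unstable runs each DIMM took part in.
unstable_by_dimm = defaultdict(int)
for config, outcome in results:
    if outcome == "unstable":
        for dimm, _slot in placements(config):
            unstable_by_dimm[dimm] += 1
print(dict(unstable_by_dimm))  # {'Y': 2, 'X': 1}
```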
Do you have other RAM that you can run - or perhaps some RAM testing software besides the Windows stuff?

Might also pay to blow out your GPU heatsink as well if there is dust in it.

On a super rare occasion I had a metal shaving of some kind on the board that was crossing some pins, so it might be worth a gentle motherboard clean too.
I kind of suspect the RAM is something of a red herring and perhaps messing with it changed behavior just because I was pressing down on the board to reinstall it which may have shifted something else slightly.

After the last test of moving RAM around I haven't had any bluescreens (in around a day and a half) so things seem okay for now but I'm somewhat tempted to upgrade the system anyway since this seems like a good excuse to do so and I'm not confident anything that might have been broken is actually fixed.
At least, good to see that there was a change for the better, even if it might, or might not, last...

At this stage, unless the computer breaks down in the meantime, and even if you buy Alder Lake or Zen 3 hardware, you could wait for the sales at the beginning of the year.
I came back to the machine this evening and it was off rather than simply suspended, prompting me to look in the event log and find that it had crashed with the same 0x124 bugcheck, three times in about 30 minutes:
Representative logged event wrote:
The computer has rebooted from a bugcheck. The bugcheck was: 0x00000124 (0x0000000000000010, 0xffffcc0c624ea028, 0xffffcc0c36a8809c, 0xffffcc0c38b301a0). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: eb4964b5-7e0b-4aa0-a0c3-f904bc7d12df.


Perhaps this is related to the ambient temperature since today was somewhat cooler than yesterday was, but suffice to say I can now tell the problem is not fixed.
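For comparing logged events like the one above against the earlier dump, the bugcheck code and its four parameters can be pulled out of the message text. This is a minimal sketch that assumes the event message always follows the wording quoted above; the regular expression and function name are my own, not part of any Windows tooling.

```python
import re

# Message text copied from the logged event quoted above (dump path omitted).
message = (
    "The computer has rebooted from a bugcheck. The bugcheck was: "
    "0x00000124 (0x0000000000000010, 0xffffcc0c624ea028, "
    "0xffffcc0c36a8809c, 0xffffcc0c38b301a0)."
)

# Assumed message shape: "The bugcheck was: <code> (<p1>, <p2>, <p3>, <p4>)".
pattern = re.compile(r"The bugcheck was: (0x[0-9a-fA-F]+) \(([^)]*)\)")


def parse_bugcheck(text):
    """Return (code, [params]) as integers, or None if the text does not match."""
    match = pattern.search(text)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


code, params = parse_bugcheck(message)
print(hex(code), hex(params[0]))  # 0x124 0x10
```

The first parameter is again 0x10, the same error source type as in the original dump.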
The tests do seem to point towards RAM but at the same time the X0Y1 test seeming stable seems to point towards it maybe being something else.

Have you tried doing any dedicated memory or stress testing such as memtest86 or prime95 memory heavy testing?

As for if you do go for the upgrade, the AMD Ryzen 7 5700G does seem like the best option, but last I saw, the Cyber Monday deals on the -G series processors caused them to be out of stock at many places, though stock may return fairly quickly.
I mentioned above that I ran the Windows memory diagnostic (which reboots and does a full memory test) as well as Prime95 memory-heavy FFTs, neither of which had any problems. I'm convinced it's some motherboard problem.
Any news, Tari? Are you thinking of a new mobo + CPU, which seems like the most reasonable first step to me?
Still haven't been able to pin down a cause for the crashes, or even situations in which they seem to occur. There was one evening last week where it was very unstable (crashed multiple times on me) so I just left it turned off, but then it was fine after I left it off for a few hours. I thought it might be temperature-dependent but the data from the temperature sensor that happens to be fairly near the computer doesn't seem conclusive either:

The low temperatures on the 10th were around when I had a number of issues, but I also had problems on the 5th and 8th when it was warmer. Humidity doesn't seem to relate much either.

I do have a new (well, used; but you get the idea) Z270 motherboard on the way that wasn't terribly expensive so I can see if that might correct any problems, though if I'm not experiencing anything it'll be hard to verify if a motherboard replacement helps. I'm still strongly considering doing an upgrade anyway (at which point I hopefully have a working motherboard+CPU that's being replaced to find a home for), but it doesn't seem urgent right now.
Perhaps heat expansion+contraction is at play somewhere?
In ambivalent news, I had another bluescreen today with a continued absence of obvious triggers. The timeline looks like this now:

Updated: there was another one not long after I posted this.

In more useful news, a Chinese Z270 motherboard arrived today so I can experiment with a motherboard swap.
I'm sure I need to point out that you are currently in Australia (I think you are still?) where we use a more sane date format ... :P

But yeah I think there is definitely something sinister at play so hopefully the new mobo will rectify the problem.
I did some surgery last night, so fingers crossed that I don't see any more crashes; it'll be difficult to say with any certainty that it's fixed though, given I never figured out a way to cause crashes reliably:

New board is at the bottom; you can tell from the dust on the other one. It's a bit of an upgrade since it has two M.2 slots (rather than just one) and a USB-C port, but fewer USB-A ports on the rear panel.

While doing that I was somewhat amused to observe that for all that my day job involves literally building computers, it's been a while since I "built a computer" in the more conventional sense.
Fingers crossed that that does indeed do the trick! How's it looking after the first week?
No issues with the new board so far, so I'm optimistic. The earlier data includes an 11-day interval though, so I don't think I'll be prepared to declare anything fixed until it's been stable for 3 weeks or so.
Fingers crossed, also it didn't come with a backup battery?
  