* A new REX-like prefix that extends the number of addressable GPRs to 32 from 16. This only supports instructions that have one-byte opcodes or the 0f prefix, so recent GPR instructions like ADCX or BLSR aren't supported with this format.
* The EVEX prefix (used for AVX-512) is also extended to be usable for GPR instructions instead of just vector instructions. This allows three-address instructions to be defined (see the sketch after this list).
* The EVEX prefix for GPR also has a dedicated bit for "do you want to set flags as a result of this instruction."
* New instructions that push/pop 2 GPRs at once
* New instructions that let you conditionally set flags (basically you can do OR/AND on the hardware flags; this sounds useful for compilers).
* New instructions for predicated loads.
* New 64-bit absolute jump instruction
* Also, implementation of the predicated stuff in AVX-512, but for 256-bit vectors. With this note:
> A “converged” version of Intel AVX10 with maximum vector lengths of 256 bits and 32-bit opmask registers will be supported across all Intel processors, while 512-bit vector registers and 64-bit opmasks will continue to be supported on some P-core processors.
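To make the three-address point concrete, here's a minimal sketch (my own illustration, not from Intel's announcement) of the register-move elimination it enables:

```c
// With today's destructive two-operand ADD, `d = s1 + s2` typically needs
// a copy first (or an LEA special case):
//     mov rax, rdi    ; copy s1 so the add doesn't clobber it
//     add rax, rsi
// With an EVEX-encoded three-operand ADD the copy disappears
// (illustrative syntax):
//     add rax, rdi, rsi
long sum(long s1, long s2) { return s1 + s2; }
```

Intel's claim of around 10% fewer instructions presumably comes largely from eliminating moves like that one.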
MMX didn't add new registers. It uses the same FP registers as those present all the way back to the Intel 8087 in 1980. You could say the MMX registers are older than the 32-bit x86 architecture, and therefore older than EAX, EBX, ESP etc.
Not exactly, especially in the context of this issue. You could use the MMX registers as 'traditional' registers (or at least a scratchpad), not quite like the x87 stack.
There are 8 64-bit MMX registers. To avoid having to add new registers, they were made to overlap with the FPU stack register. This means that the MMX instructions and the FPU instructions cannot be used simultaneously. MMX registers are addressed directly, and do not need to be accessed by pushing and popping in the same way as the FPU registers. [1]
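A minimal intrinsics sketch of that aliasing constraint (illustrative; MMX is essentially obsolete today):

```c
#include <mmintrin.h>   // MMX intrinsics

// MM0-MM7 alias the x87 stack registers, so after any MMX work you must
// execute EMMS before running x87 floating-point code again.
__m64 add4x16(__m64 a, __m64 b) {
    return _mm_add_pi16(a, b);   // four packed 16-bit adds in one op
}

void done_with_mmx(void) {
    _mm_empty();                  // EMMS: resets the x87 tag word
}
```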
Apparently Intel's marketing had forgotten that they already had a product called iAPX (standing for “Advanced Performance arCHitecture”), and it did not exactly go well :)
> extends the number of addressable GPRs to 32 from 16
I have always been curious as to why the number of GPRs was limited for so long on x86, given that the instruction set is already variable length and CPUs typically have a very large number of internal physical registers that could be cheaply addressed.
Having looked at the pain of developing a good register allocator in LLVM, and how critical memory access can be in hot/tight loops, I would have loved to have even more registers, something closer to 64 or 128, and let the CPU manage the spilling internally.
> I have always been curious as to why the number of GPRs was limited for so long on x86, given that the instruction set is already variable length and CPUs typically have a very large number of internal physical registers that could be cheaply addressed.
Because it's hard-ish to wrench in the encoding bits. x86 registers are addressed via the ModR/M byte, which gives you 4 addressing modes × 2 register operands (if you have 8 registers). Note that non-register-register addressing modes include a second SIB byte which means you have up to three registers encoded in an instruction. x86-64 extended the number to 16 by adding a prefix that has 4 bits, which encoded a 32-bit/64-bit selector and 1 bit for each of the three possible registers (R, X, B). Note that to keep the prefix to only a 1-byte length, they had to reclaim 16 possible opcodes, whose space is pretty limited.
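For the curious, a quick sketch of how little room that prefix has (the bit assignments below are the documented REX layout):

```c
#include <stdbool.h>
#include <stdint.h>

// REX occupies opcodes 0x40-0x4F (the old one-byte INC/DEC encodings).
// The fixed 0100b high nibble is exactly those 16 reclaimed opcodes.
typedef struct { bool w, r, x, b; } Rex;

bool decode_rex(uint8_t byte, Rex *out) {
    if ((byte & 0xF0) != 0x40) return false;  // not a REX prefix
    out->w = byte & 0x08;   // W: 64-bit operand size
    out->r = byte & 0x04;   // R: extends ModR/M.reg
    out->x = byte & 0x02;   // X: extends SIB.index
    out->b = byte & 0x01;   // B: extends ModR/M.rm or SIB.base
    return true;
}
```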
Encoding large register numbers gets pretty chonky in instruction space (3 register ids of 5 bits each is 15 bits of your instruction for operands, and that's before you start considering possible opcodes). It also increases the size of context switches (especially thread contexts), since you have to spill all of those registers even if they're not filled with useful data.
8 registers is definitely too few; it's not clear to me if the extra working set size afforded by 32 registers is worth it over the fatter instructions.
They're not really limited since the instruction decoder is actually doing a bunch of work to assign to physical registers. The 16 GPRs are an abstraction, iirc.
Even though it's only one extra bit to index 16 registers vs 8, it takes an entire extra byte in the encoding (you have to add the REX prefix with the R, X, and B bits set appropriately for the MSB of the registers you're modifying in the ModR/M byte). And you only get 4 bits to use out of that byte because the leading four are a way to disambiguate it (also iirc, this stuff is terribly documented).
And paying an extra byte per GPR instruction is actually kind of expensive. If you look at your compiler output you'll see a bunch of Exx instructions even in 64-bit long mode when optimizing for instruction decode/size.
It turns out a variable length encoding is really hard to extend without getting weird with it.
> And you only get 4 bits to use out of that byte because the leading four are a way to disambiguate it (also iirc, this stuff is terribly documented).
No, it has been extremely well documented since 2000 (three years before AMD had the first 64-bit chip ready).
Well, originally there were 7 (well, 6 really, as frame pointers were pretty much required) general-purpose registers. AMD64 bumped that to 14-15. That's a huge change with clear performance benefits. Going to 32 is already into diminishing returns for most use cases and has non-trivial encoding, function-calling and context-switching implications.
Shared libs on the 386 stole one more register, because IP-relative addressing wasn't a thing until AMD64 (except for jmp/call). Shared libs need to refer to their own data, plus they may need to form pointers to code within them.
(Shared libraries need to have a pointer to their code and data. This is done by creating a small table of entry code in front of the "real" code that sets a register to point to the library code/data memory before jumping to the real code. This code will look different for each process that has loaded the shared library so it gets its own little memory block. The rest of the code/data will be the same in all processes so it can be shared between them -- unless/until some of the data is written to, of course.)
So code in shared libs really only had 5 registers unless the frame pointer was omitted. That's why it was necessary to tell the compiler whether to build normal code or shared library code.
AMD64 not only has more registers, it also has an RIP-relative addressing mode, so it doesn't need to steal that register for shared libraries.
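A small C example of the difference (the generated-code patterns in the comments are what GCC typically emits, give or take):

```c
// A global defined in another module of a shared library:
extern int counter;

int read_counter(void) {
    // gcc -m32 -fPIC: EBX must first be loaded with the GOT address
    //   (call __x86.get_pc_thunk.bx / add ebx, _GLOBAL_OFFSET_TABLE_),
    //   then: mov eax, counter@GOT(ebx) ... one register permanently busy.
    // gcc -m64 -fPIC: a single RIP-relative access, no register stolen:
    //   mov rax, [rip + counter@GOTPCREL] / mov eax, [rax]
    return counter;
}
```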
These registers are caller-saved, so no need to save them for system calls. Thread switching might be performance neutral if the OS uses the xsave / xrstor instructions.
So .... when will it ship? No mention of physical products anywhere.
I wonder why such long delays are still necessary. In the old days, yes, as there were so many parties to coordinate, but nowadays, in theory, Intel could release hardware, the new ISA, and compiler/OS patches and binaries on the same day.
> I wonder why such long delays are still necessary. In the old days, yes, as there were so many parties to coordinate, but nowadays, in theory, Intel could release hardware, the new ISA, and compiler/OS patches and binaries on the same day.
Most new ISA support requires at least minimal support from the OS to use correctly (e.g., something like saving state on context switching, or reporting hardware support bits correctly). Releasing patches on the same day you release hardware means the new features are literally unusable for all of your customers, and it generally takes several months to go from a patch to a usable OS release. You could in theory release the patches on the same day as the new ISA documentation, but in practice, it's likely to take some time because the people writing the patches aren't the people writing the new ISA whitepapers and approving their publication and it takes time to go from "oh, I can talk about this now" to actually doing so.
Taking a little time with these sorts of things isn't bad. This will essentially be a new epoch where AMD and Intel may end up being incompatible, at least for a while; there are tooling changes to make, OS support, all sorts of things. If they're serious about dropping 16-bit and 32-bit from the architecture, doing it during a change like this might make some sense. Do it wrong and they could Osborne themselves, but an Apple-like migration strategy would be nice and I'd appreciate it.
They announce it in time so that llvm, gcc, MSVC, Java, .NET and the browser JS engine vendors can update their stacks to be ready when the silicon ships.
After all why spend a shitload of money on developing a feature you'll want to market when there won't be any reason for customers to buy your new hardware?
Intel employs people who work on Java and LLVM at least, so they could release their own versions of those. I guess V8 wouldn't be hard to add.
The main reason to do a coordinated release would be competitive advantage, I guess, plus to ensure that there are actually compilers that support their new instructions available right from the start. Intel already releases their own Linux distro optimized for their chips because other vendors were targeting the lowest common denominator.
Making sure the thing actually works? At least I hope they're still doing that... and it's ironic to see this comment at the same time a rather horrible bug was discovered in AMD's CPUs.
> "In addition, legacy integer instructions now can also use EVEX to encode a dedicated destination register operand – turning them into three-operand instructions and reducing the need for extra register move instructions."
Overall, APX is providing 10% fewer instructions, 10% fewer loads and more than 20% fewer stores.
Also adding pop2/push2 instructions for moving state faster.
And adding more powerful conditional instructions (loads/stores/compares) and flag-suppression.
Oh missed your comment and posted essentially the same. These are all interesting changes, predication certainly, but the thing that actually got me the most excited was the press release comment about:
"The processor tracks these new instructions internally and fast-forwards register data between matching PUSH2 and POP2 instructions without going through memory."
I wonder if this implies that pushes don't have to commit to memory if they are popped soon enough? It has always bothered me that we have these huge physical register files but force all the spill and restore to go through memory because of silly anachronistic processor semantics. With a more flexible PUSH/POP semantics we could essentially get the register windows for free.
That was neither my question nor my point. They hit memory and, as I have since learned, they still do after this. The wording in the press release was ambiguous. In other words, the news here is just being able to push/pop two in one µop.
10% fewer instructions but average instruction is longer, so code density is the same, they claim. This still leaves their ISA with the worst code density of any non-obsolete ISA.
I've only read one study [1] and its follow-up [2] with code-density benchmarks, but according to them (one source), x86-64 is actually one of the denser contemporary ISAs ... provided that the compiler/programmer is smart enough to adapt to the ISA's quirks.
> "legacy integer instructions now can also use EVEX to encode a dedicated destination register operand – turning them into three-operand instructions"
x86 ISA is growing more RISC-like. Definitely saving on stack spilling is a Good Thing™
So it sounds like (among other things) they're adding 3-address integer instructions to an instruction encoding only used for vector instructions today.
I was not familiar with the AVX vector instructions at this level of detail.
Have you noticed that some of the bits in the EVEX prefix (after the 62h byte) are inverted?
That's because it hides inside the BOUND instruction in a way that allows it to be used outside of 64-bit mode code. The BOUND instruction never existed in 64-bit mode, so Intel was free to do whatever they wanted with the 62h opcode there, but they also wanted to enable EVEX in 16/32-bit code.
The BOUND instruction must take a memory operand -- so the MOD bits can never be 11b (which would specify a register operand). The MOD bits are the upper two bits of the modrm byte (the byte right after the 62h opcode). So, if we don't allow the "upper" registers in 32-bit mode and only the "lower" 8 registers AND if we invert bit 3 of their register numbers and put those extra register specifier bits in bit 7/6/5 of the byte after the 62h opcode THEN we can fit EVEX into 32-bit mode, because those bits will always be 111b in 32-bit mode!
So outside of 64-bit mode, if we have a 62h opcode with a modrm byte with MOD!=11b: it's a BOUND instruction. If MOD=11b: it's an EVEX prefix.
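Which means the decoder-side check is tiny. A sketch of that rule (my own paraphrase in C, not decoder source):

```c
#include <stdint.h>

// Outside 64-bit mode, BOUND and EVEX share the leading byte 0x62.
// The top two (MOD) bits of the next byte settle it: BOUND requires a
// memory operand, so MOD can never be 11b there.
int is_evex_outside_64bit(const uint8_t *insn) {
    if (insn[0] != 0x62) return 0;     // not the shared opcode at all
    return (insn[1] >> 6) == 0x3;      // MOD=11b => must be an EVEX prefix
}
```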
+ Two-byte REX2 prefix that replaces the REX prefix (and 0F prefix).
REX2 is D5 + a byte that is a lot like the lower nibble of a REX prefix, only twice as big.
It has two extra bits for the up to three registers that can be named in normal x86 instructions: either two registers or a register operand and a memory operand that can use two registers + a displacement for the memory address.
It also contains the W bit like REX does (64-bit).
It also has a bit called M0 that indicates whether the instruction is in opcode map 0 (primary opcode map) or opcode map 1 (0F opcode map).
That means that a REX2 instruction from opcode map 1 takes the same number of bytes as a REX instruction from opcode map 1. Instructions from opcode map 0 are one byte longer with REX2 than with REX.
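A sketch of that payload byte as I read the APX spec (treat the exact bit positions as my assumption and double-check against the published encoding):

```c
#include <stdint.h>

// REX2 = 0xD5 + one payload byte. Assumed layout: the low nibble mirrors
// REX (W R3 X3 B3); the high nibble holds M0 plus the new R4/X4/B4 bits.
typedef struct { unsigned m0, w, r, x, b; } Rex2;

Rex2 decode_rex2_payload(uint8_t p) {
    Rex2 out;
    out.m0 = (p >> 7) & 1;                           // 0 = map 0, 1 = 0F map
    out.w  = (p >> 3) & 1;                           // 64-bit operand size
    out.r  = (((p >> 6) & 1) << 1) | ((p >> 2) & 1); // R4:R3, 2-bit extension
    out.x  = (((p >> 5) & 1) << 1) | ((p >> 1) & 1); // X4:X3
    out.b  = (((p >> 4) & 1) << 1) | ((p >> 0) & 1); // B4:B3
    return out;
}
```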
Some of the normal ALU instructions also get an EVEX encoding (in map 4). That allows for a different destination than before (separate from the source operand(s)). It also allows for ALU instructions that don't change the flags, which must be really nice for the out-of-order/data-forwarding circuitry.
No, the E-cores will implement only a 256-bit subset of AVX-512, which halves the size of the vector registers to 256-bit and the size of the mask registers to 32-bit. The same subset will be implemented on the P-cores combined with E-cores.
This subset, AVX10/256, is the reason for this new specification. It is Intel's response to AMD's Zen 4.
With their competitor supporting AVX-512 on all products, Intel had to do something to remain competitive. Because they believe that supporting the full AVX-512 on their E-cores is too expensive, they have created a subset of AVX-512, including only the instructions with an operand size of up to 256 bits.
Even though, since Skylake Server, the AVX-512 ISA has included scalar, 128-bit vector, 256-bit vector and 512-bit vector instructions, it was not possible to implement any subset that omitted the 512-bit vector instructions, because there were no means for a program to discover that the 512-bit instructions were missing.
Now, a different method has been defined for discovering through CPUID which AVX-512 a.k.a. AVX10 features are implemented, so only now has it become possible to implement an up-to-256-bit subset.
Moreover, when 128-bit vector and 256-bit vector instructions were added to AVX-512, 2 bits of the EVEX prefix that were previously used for rounding control were reused to encode the length of the vector operands.
Because of this, only the 512-bit vector instructions and the scalar instructions can specify the rounding control. So if the 512-bit vector instructions are deleted, there is no longer any way to specify the rounding control for vector instructions.
To solve this problem, in the first CPUs that will implement the 256-bit subset of AVX-512, new encodings will be used for the 256-bit instructions with rounding control.
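For reference, embedded rounding is exposed in today's intrinsics only at 512-bit (and scalar) width, e.g. (a minimal sketch; compile with -mavx512f):

```c
#include <immintrin.h>

// Static rounding override without touching MXCSR: only the 512-bit and
// scalar forms accept a rounding argument, for the EVEX-encoding reason
// described above.
__m512 add_round_to_zero(__m512 a, __m512 b) {
    return _mm512_add_round_ps(a, b, _MM_FROUND_TO_ZERO | _MM_FROUND_NO_EXC);
}
```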
Also, the XSAVE and XRSTOR instructions had to be modified to correctly save and restore the new vector registers and mask registers.
So implementing the 256-bit subset of AVX-512, a.k.a. AVX10/256, is not as simple as a microcode update: it requires changes in the instruction decoders, in the structure of the CPUID registers, and other smaller changes.
If I read the note correctly, P-cores won't have 512-bit vector registers, but they will have the other fancy stuff added by AVX-512 (namely, vector predication stuff, static rounding mode instructions, new vector instructions like complex multiply or half-precision float, and 32 vector registers), just only for 128-bit and 256-bit vectors. Which, to be fair, is arguably the more useful parts of AVX-512 anyways; the maximum vector length being upped isn't all that interesting.
> Part of making AVX10 suitable for both P and E cores is that the converged version has a maximum vector length of 256 bits, as found with the E cores, while P cores will have optional 512-bit vector use.
512-bit support will be optional, so maybe not every P-core will have it… maybe it'll be restricted to higher-end processors? But it sounds like some will have it, or it wouldn't be an option at all.
Presumably it will be the server chips that are all P-cores that have it. To me it seems like a dumb choice, but it is consistent with what Intel is doing now.
"A “converged” version of Intel AVX10 with maximum vector lengths of 256 bits and 32-bit opmask registers will be supported across all Intel processors, while 512-bit vector registers and 64-bit opmasks will continue to be supported on some P-core processors."
So all future Intel CPUs starting in 2025 will support a 256-bit subset of AVX-512, where AVX-512 is rebranded as AVX10.
Only some P-core processors will support the full 512-bit AVX-512 a.k.a. AVX10, which is to be understood as: only those server CPUs that contain only P-cores, i.e. the successors of Granite Rapids and Granite Rapids D, will support 512-bit registers and instructions (and 64-bit mask registers instead of 32-bit mask registers).
It is clear this is at least 2-3 years in the making. It doesn't seem any of the 2024 products on Intel's roadmap will have APX (this could be wrong), so I assume the earliest is 2025.
The question is when AMD will adopt it. Zen 5 is done and Zen 6 may be too late for these changes; Zen 6 is already looking at 2026. If they wait until Zen 7, it will be at least 2028.
Intel is still 35% behind Apple in terms of perf/clock on Geekbench.
> Intel® APX demonstrates the advantage of the variable-length instruction encodings of x86 – new features enhancing the entire instruction set can be defined with only incremental changes to the instruction-decode hardware. This flexibility has allowed Intel® architecture to adapt and flourish over four decades of rapid advances in computing – and it enables the innovations that will keep it thriving into the future.
"Intel® APX demonstrates the advantage of the variable-length instruction encodings of x86 – new features enhancing the entire instruction set can be defined with only incremental changes to the instruction-decode hardware."
In other words, their initial sloppiness was a Feature™: our initial mess was so bad that changes like these don't make it any worse!
I don't know much about CPUs, but isn't this going to increase decoder complexity and binary size? The reduction in memory accesses seems great, but can anybody tell me if there are better ways of achieving this? x86 gets more complex each day.
These changes reduce the number of instructions you need in the first place: less stack shuffling with more registers and push2/pop2, less register shuffling with nondestructive operations.
So, in about 30 years, when the majority of CPUs have this, we can use it. Assuming Intel does not gate this to Xeon only for no reason whatsoever, like they did with AVX-512?
Performance-intensive code (hot loops) could be compiled multiple times for each architecture extension and switched with CPUID. That's how all the various SSE and AVX extensions were rolled out in multimedia code.
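For example, GCC and Clang can generate and dispatch the variants for you (a minimal sketch; the function and its name are made up for illustration):

```c
#include <stddef.h>

// One clone per listed target plus a baseline; the loader picks the best
// match once, via CPUID, through an IFUNC resolver.
__attribute__((target_clones("avx2", "sse4.2", "default")))
void scale(float *x, size_t n, float k) {
    for (size_t i = 0; i < n; i++)   // each clone gets autovectorized
        x[i] *= k;
}
```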
AFAIK there was AVX512 in higher-end desktop SKUs, but not the latest E-core designs in Intel's client chips. The main problems are that:
- Operations at 512-bit register widths have significant power draw. Intel chips that support AVX512 have to downclock themselves on AVX512 workloads until their voltage regulators have boosted up to a higher voltage.
- The AVX512 register file is too big to physically fit in the E-core[0] footprint.
Incidentally I do remember Linus Torvalds specifically complaining that AVX512 was being used to implement memcpy in gcc, because it meant running certain programs would lower system performance. So these new architectures tend to be used a lot sooner than the time it takes for it to be safe to make them your minimum compile target.
[0] The BIOS on my Framework laptop refers to these as "Atom cores" - no clue if the current E-core design is derived from Atom or if this is a miscommunication or nickname AMI picked.
No, operations at 512-bit register widths do not have significant power draw, as demonstrated by AMD Zen 4.
What has significant power draw is the use of double 512-bit floating-point multipliers, as implemented in the Intel server CPUs (though one core with such multipliers draws significantly less power than two cores having the same throughput).
AMD uses only double 256-bit floating-point multipliers and in general it uses exactly the same execution units for both 256-bit and 512-bit operations, so the AVX-512 operations do not increase the power draw even when they use 512-bit registers.
Also, the AVX512 register file is not too big to physically fit in the E-core. Even if the AVX512 register file is 4 times greater than the AVX register file, the E-cores have a much larger physical register file used to rename the architecturally visible registers.
Despite these facts, Intel still believes that implementing the full AVX-512 ISA in the E-cores is too expensive, so they have created this new specification of AVX10/256, which is just a subset of AVX-512 including the instructions with an operand size up to 256 bits, and which will be implemented in all future E-cores after some date, perhaps starting in 2025.
Moreover, besides their history, the E-cores continue to be sold using the Atom brand, for instance "Intel Atom® x7425E" (the industrial variant of the Intel N100, one of the Alder Lake N models).
Modern compilers already allow for conditional code execution depending on CPU feature sets, and at least as far as JVM implementations and the CLR are concerned, their JITs are clever enough to already use parts of AVX512; while they are not perfect, it is better than not using them at all.
This is Gentoo's whole shtick. It's the core feature. It's been supported since day 0, over 20 years ago.
I'm honestly a little amazed that more people don't either use Gentoo or adopt its model, given the smorgasbord of mutually incompatible instruction set extensions. It seems very strange that people will buy a CPU that has 32 64-byte ZMM registers with 3 operand instructions, and then use that CPU to run code that operates on 8 16-byte XMM registers with 2 operand instructions.
Because building from source is extremely time/CPU consuming and also unreliable. Time is valuable. In my last Gentoo attempt, I had to manually fix a few build recipes before I threw in the towel.
This also means "riskier" methods (like LTO) have to be omitted by default.
Gentoo is great for libre software, security, manual patches, embedded computing and such. But for pure desktop performance, the Clear Linux way is best: aggressive compilation flags/libraries, tested by the package maintainers, shipped in 3-4 tiers. And as the Clear Linux devs said, most of the native instructions dont even matter, as the compilers can't use them.
Because most people have better shit to do than fix their now-broken system every other time they upgrade their packages. I used to run Gentoo, was told on IRC after like the 9,000th such breakage to go use something else if I didn't like their perpetually broken free distribution, and I took their advice.
On Linux distros, the package manager downloads different binaries based on your CPU. Skylake would be x86-64-v3, Zen 4 would be x86-64-v4, for example.
And there are different schemes for multiple architectures in the same program, like hwcaps.
The extensions can be kinda broken down into 4 levels. Basically ancient, old (SSE 4.2), reasonably new (AVX2, Haswell/Zen 1 and up), and baseline AVX512.
There is discussion of a fifth level. Someone in the Intel Clear Linux IRC said a fifth level wasn't "worth it" for Sapphire Rapids because most of the new AVX512 extensions were not autovectorized by compilers, but that a new level would be needed in the future. Perhaps they were thinking of APX, but couldn't disclose it.
Work out what it would cost to compile - say - a terabyte of C code at typical cloud spot prices.
A large VM with 128 cores can compile the 100 MB Linux kernel source tree in about 30 seconds. So… 200 MB/minute or 12 GB/hour. This would take 80 hours for a terabyte.
A 120 core AMD server is about 50c per hour on Azure (Linux spot pricing).
So… about $40 to compile an entire distro. Not exactly breaking the bank.
You'd have to separate out compiling and linking at a bare minimum to get even a semi-accurate model. Plus a lot of userspace is C++, which is much, much slower.
In the end it will be like any other modern hardware appliance:
The hardware is the same design for cost-saving purposes, but different features are unlocked for $$$ by a software license key.
You want AVX-512? Pay up to unlock the feature in your CPU and you can now use it. This could also enable pay-as-you-go license schemes for CPUs, creating recurring revenue for Intel.
From the hardware perspective: the same silicon, but different features sold separately.
Yup. It's one of their theoretical advantages that's about to become a lot less theoretical. Historically it hasn't made much difference because optional instructions were hard for JIT compilers for most languages to use (in particular high level JITd languages tend not to support vector instructions very well). But a doubling of registers is the sort of extension that any kind of code can immediately profit from.
Arguably it will be only JITd languages that benefit from this for quite a while. These sorts of fundamental changes are basically a new ISA and the infrastructure isn't really geared up to make doing that easy. Everyone would have to provide two versions of every app and shared library to get the most benefit; maybe you even get combinatorial complexity if people want to upgrade the inter-library calling conventions too. For native AOT-compiled code it's going to just be a mess.
It seems like most of these new instructions and registers correspond to the original ARMv8 base ISA. I'm going to go out on a limb here and suppose that's not an accident. Does anyone know why Intel thinks x86 needs them?
Is the goal here to increase the decode bandwidth of Intel CPUs?
Is the goal to reduce demands on load-store units by increasing the number of registers?
Are they hoping to make it easier to port or JIT armv8 asm to Intel CPUs?
Most new instructions are not inspired by Armv8, they just implement the traditional 3-address format for 32 registers, which predates Armv8 by a few decades.
Nevertheless, there are a few instructions inspired by Armv8, mainly PUSH2 and POP2, which correspond to the load register pair and store register pair of Aarch64.
The new CCMP conditional-compare instruction is also equivalent to ARMv8’s instruction of the same name. If a condition passes, compare two registers; if not, set the condition bits to an arbitrary value.
On ARM that instruction is a pain in the ass when reading disassemblies, because the on-fail condition bits are just specified as a number from 0 to 15; the disassembler doesn’t bother to label which bits are specified, let alone what conditions they correspond to. Unfortunately it seems like Intel is doing the same thing in their assembly syntax, at least if I’m reading the document correctly.
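For anyone who hasn't met it, a rough C model of what CCMP computes (NZCV naming from ARM; this is my paraphrase of the semantics, not real intrinsics):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool n, z, c, v; } Flags;

// If the tested condition holds, flags come from comparing a and b;
// otherwise they are set to the immediate baked into the instruction.
Flags ccmp(bool cond_holds, int64_t a, int64_t b, Flags on_fail) {
    if (!cond_holds) return on_fail;           // the "arbitrary value" case
    uint64_t d = (uint64_t)a - (uint64_t)b;    // wrap-around subtract
    Flags f = {
        .n = (int64_t)d < 0,                   // sign of the result
        .z = d == 0,                           // operands equal
        .c = (uint64_t)a >= (uint64_t)b,       // no borrow (ARM convention)
        .v = ((a < 0) != (b < 0)) && ((a < 0) != ((int64_t)d < 0)),
    };                                         // signed overflow
    return f;
}
```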
Yeah, exactly. Instruction fusing can turn mov+op into a 3-register operation, or push+push into push2. But adding new instructions allows increasing the frontend throughput too.