How many registers does an x86-64 CPU have? (2020)

(blog.yossarian.net)

71 points | by tosh 7 hours ago

40 comments

  • noelwelsh 4 hours ago

    This is how many registers the ISA exposes, but not the number of registers actually in the CPU. Typical CPUs have hundreds of registers. For example, Zen 4's integer register file has 224 registers, and the FP/vector register file has 192 registers (per Wikipedia). This is useful to know because it can affect behavior. E.g. I've seen results where doing a register allocation pass with a large number of registers, followed by a pass with the number of registers exposed in the ISA, leads to better performance.

  • rep_lodsb an hour ago

    Nitpick (footnote 3): "64-bit kernels can run 32-bit userspace processes, but 64-bit and 32-bit code can’t be mixed in the same process. ↩"

    That isn't true on any operating system I'm aware of. If both modes are supported at all, there will be a ring 3 code selector defined in the GDT for each, and I don't think there would be any security benefit to hiding the "inactive" one. A program could even use the LAR instruction to search for them.

    At least on Linux, the kernel is perfectly fine with being called from either mode. FASM example code (with hardcoded selector, works on my machine):

        format elf executable at $1_0000
        entry start
        
        segment readable executable
        
        start:  mov     eax,4                   ;32-bit syscall# for write
                mov     ebx,1                   ;handle
                mov     ecx,Msg1                ;pointer
                mov     edx,Msg1.len            ;length
                int     $80
        
                call    $33:demo64
        
                mov     eax,4
                mov     ebx,1
                mov     ecx,Msg3
                mov     edx,Msg3.len
                int     $80
                mov     eax,1                   ;exit
                xor     ebx,ebx                 ;status
                int     $80
        
        use64
        demo64: mov     eax,1                   ;64-bit syscall# for write
                mov     edi,1                   ;handle
                lea     rsi,[Msg2]              ;pointer
                mov     edx,Msg2.len            ;length
                syscall
                retfd                           ;return to caller in 32 bit mode
    
        Msg1    db      "Hello from 32-bit mode",10
        .len=$-Msg1
        
        Msg2    db      "Now in 64-bit mode",10
        .len=$-Msg2
        
        Msg3    db      "Back to 32 bits",10
        .len=$-Msg3
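
A rough sketch of what the hardcoded $33 in the far call means. A segment selector packs a descriptor table index, a table indicator bit and a requested privilege level into 16 bits. The function name is made up for illustration, and the remark that 0x23 is the usual 32-bit user code selector on x86-64 Linux is an assumption that may not hold on every kernel.

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit x86 segment selector into its architectural fields."""
    return {
        "index": selector >> 3,                          # descriptor table entry number
        "table": "LDT" if selector & 0b100 else "GDT",   # TI bit
        "rpl": selector & 0b11,                          # requested privilege level
    }

# 0x33 is the selector hardcoded in the FASM example above.
print(decode_selector(0x33))  # {'index': 6, 'table': 'GDT', 'rpl': 3}
# 0x23 is assumed here to be the 32-bit user code selector.
print(decode_selector(0x23))  # {'index': 4, 'table': 'GDT', 'rpl': 3}
```

Both selectors request ring 3 from the GDT, which fits the point that the two user code segments sit side by side.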

  • dang 37 minutes ago

    Related. Others?

    How many registers does an x86-64 CPU have? (2020) - https://news.ycombinator.com/item?id=36807394 - July 2023 (10 comments)

    How many registers does an x86-64 CPU have? - https://news.ycombinator.com/item?id=25253797 - Nov 2020 (109 comments)

  • JonChesterfield 7 hours ago

    Good post! Stuff I didn't know x64 has. Sadly it doesn't answer the "how many registers are behind rax" question I was hoping for. I'd love to know how many outstanding writes one can have to the various architectural registers before the renaming machinery runs out and things stall. Not really for immediate application to life, just a missing part of my mental cost model for x64.
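
As a very rough first-order model of that question (a toy sketch, not how any real core is documented to behave): if every in-flight write needs a fresh physical register and the committed architectural state keeps one physical register each, the free list bounds the number of outstanding writes. The 224 figure is the Zen 4 integer register file size quoted elsewhere in this thread; the function name is hypothetical, and the model ignores the reorder buffer limit, SMT sharing, flag renaming and move elimination.

```python
def writes_before_stall(physical_regs: int, architectural_regs: int) -> int:
    """Toy bound: each in-flight write consumes one free physical register.

    Assumes every architectural register stays mapped to one physical
    register and nothing retires while the writes pile up.
    """
    return physical_regs - architectural_regs

# 224 integer physical registers (figure from this thread), 16 GPRs.
print(writes_before_stall(224, 16))  # 208
```

Real cores will stall earlier or later than this depending on the other structures that fill up first.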

  • Someone 3 hours ago

    FTA: “For design reasons that are a complete mystery to me, the MMX registers are actually sub-registers of the x87 STn registers”

    I think the main argument for doing that was that it meant that existing OSes didn’t need changes for the new CPU. Because they already saved the x87 registers on context switch, they automatically saved the MMX registers, and context switches didn’t slow down.

    It also may have decreased the amount of die space needed, but that difference can’t have been very large, I think.

  • fuhsnn 7 hours ago

    Intel's next gen will add 16 more general purpose registers. Can't wait for the benchmarks.

    • woadwarrior01 3 hours ago

      I looked it up. It's called APX (Advanced Performance Extensions)[1].

      [1]: https://www.intel.com/content/www/us/en/developer/articles/t...

    • Joker_vD 6 hours ago

      So every function call will need to spill even more call-clobbered registers to the stack!

      Like, I get that leaf functions with truly huge computational cores are a thing that would benefit from more ISA-visible registers, but... don't we have GPUs for that now? And TPUs? NPUs? Whatever those things are called?

      • cvoss 4 hours ago

        With an increase in available registers, every value that a compiler might newly choose to keep in a register was a value that would previously have lived in the local stack frame anyway.

        It's up to the compiler to decide how many registers it needs to preserve at a call. It's also up to the compiler to decide which registers shall be the call-clobbered ones. "None" is a valid choice here, if you wish.

      • jandrewrogers 6 hours ago

        Most function calls are aggressively inlined by the compiler such that they are no longer "function calls". More registers will make that even more effective.

        • burnt-resistor 4 hours ago

          That depends on whether something like LTO is possible and whether the function is declared to use one of the plethora of calling conventions. What it means is that new calling conventions will be needed and that this new platform will be able to use pass by register for higher arity functions.

      • throwaway17_17 6 hours ago

        Why does having more registers lead to spilling? I would assume, probably incorrectly, that more registers means less spill. Are you talking about calls inside other calls which cause the outer scope arguments to be preemptively spilled so the inner scope data can be pre-placed in registers?

        • BeeOnRope 5 hours ago

          More registers leads to less spilling not more, unless the compiler is making some really bad choices.

          An easy way to see that is that the system with more registers can always use the same register allocation as the one with fewer, ignoring the extra registers, if that's profitable (i.e. it's not forced into using extra caller-saved registers if it doesn't want to).
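
The dominance argument can be put in a few lines. This is a minimal sketch with made-up numbers: cost_using[k] stands for a hypothetical cost (say, loads plus stores) of the best allocation that uses exactly k registers, and the compiler is assumed free to use any count up to what the machine offers.

```python
def best_cost(cost_using: list[int], available: int) -> int:
    """Cheapest allocation when the compiler may use 0..available registers."""
    return min(cost_using[: available + 1])

# Hypothetical costs; using a 5th or 6th register is assumed to hurt here.
cost_using = [90, 70, 55, 45, 40, 42, 47]

costs = [best_cost(cost_using, n) for n in range(len(cost_using))]
print(costs)  # [90, 70, 55, 45, 40, 40, 40]
```

Because the minimum is taken over a growing set of options, the result can never get worse as registers are added, whatever the individual costs look like.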

        • Joker_vD 5 hours ago

          So, let's take a function with 40 alive temporaries at a point where it needs to call a helper function of, say, two arguments.

          On a 16 register machine with 9 call-clobbered registers and 7 call-invariant ones (one of which is the stack pointer) we put 6 temporaries into call-invariant registers (so there are 6 spills in the prologue of this big function), another 9 into the call-clobbered registers; 2 of those 9 are the helper function's arguments, but 7 other temporaries have to be spilled to survive the call. And the remaining 25 temporaries live on the stack in the first place.

          If we instead take a machine with 31 registers, 19 being call-clobbered and 12 call-invariant ones (one of which is the stack pointer), we can put 11 temporaries into call-invariant registers (so there are 11 spills in the prologue of this big function), and another 19 into the call-clobbered registers; 2 of those 19 are the helper function's arguments, so 17 other temporaries have to be spilled to survive the call. And the remaining 10 temporaries live on the stack in the first place.

          So, there seems to be more spilling/reloading whether you count pre-emptive spills or the on-demand-at-the-call-site spills, at least to me.
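
The arithmetic above can be reproduced with a small sketch. The function and parameter names are invented for illustration, and the counting scheme is only the one described in this comment (fill every register, spill call-invariant registers in the prologue, spill surviving call-clobbered registers at the call site), not a model of what a real compiler does.

```python
def spill_breakdown(temps: int, clobbered: int, invariant: int, call_args: int):
    """Count memory-resident values under the scheme described above.

    invariant includes the stack pointer, which cannot hold a temporary.
    """
    usable_invariant = invariant - 1
    prologue_spills = usable_invariant          # saved once in the prologue
    call_site_spills = clobbered - call_args    # clobbered temporaries that must survive
    stack_resident = temps - usable_invariant - clobbered
    return prologue_spills, call_site_spills, stack_resident

small = spill_breakdown(temps=40, clobbered=9, invariant=7, call_args=2)
large = spill_breakdown(temps=40, clobbered=19, invariant=12, call_args=2)
print(small, sum(small))  # (6, 7, 25) 38
print(large, sum(large))  # (11, 17, 10) 38
```

Under this particular counting the three categories add up to temps minus call_args on both machines, so which machine looks worse depends on which categories you count and how often each one is actually executed.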

          • csjh 4 hours ago

            You’re missing the fact that the compiler isn’t forced to fill every register in the first place. If it was less efficient to use more registers, the compiler simply wouldn’t use more registers.

            The actual counter proof here would be that in either case, the temporaries have to end up on the stack at some point anyways, so you’d need to look at the total number of loads/stores in the proximity of the call site in general.

          • dahart 3 hours ago

            This argument doesn’t make sense to me. Generally speaking, having more registers does not result in more spilling, it results in less spilling. Obviously, if you have 100 registers here, there’s no spilling at all. And think through what happens in your example with a 4 register machine or a 1 register machine: all values must spill. You can demonstrate the general principle yourself by limiting the number of registers and then increasing it using the -ffixed-reg flags. In CUDA you can set your register count and basically watch the number of spills go up by one every time you take away a register and go down by one every time you add a register.

          • mjevans 4 hours ago

            I recalled there were some new instructions added that greatly help with this. Unfortunately I'm not finding any good _webpages_ that describe the operation generally to give me a good overview / refresher. Everything seems to either directly quote published PDF documents or otherwise not actually present the information in its effective, end-use form. E.g. https://www.felixcloutier.com/x86/ -- However, availability is problematic for even slightly older silicon https://en.wikipedia.org/wiki/X86-64

            - XSAVE / XRSTOR

            - XSAVEOPT / XRSTOR

            - XSAVEC / XRSTOR

            - XSAVES / XRSTORS

          • cv5005 4 hours ago

            A good compiler will only do that if the register spilling is more efficient than using more stack variables, so I don't really see the problem.

        • CamelCaseCondo 6 hours ago

          OP is probably referring to the push all/pop all approach.

          • Joker_vD 5 hours ago

            No, I don't. I use a common "spill definitely reused call-invariant registers at the prologue, spill call-clobbered registers that need to survive a call at precisely the call site" approach, see the sibling comment for the arithmetic.

      • bjourne 4 hours ago

        Most modern compilers for modern languages do an insane amount of inlining so the problem you're mentioning isn't a big issue. And, basically, GPUs and TPUs can't handle branches. CPUs can.

    • vaylian 4 hours ago

      Those general purpose registers will also need to grow to twice their size once we get our first 128-bit CPU architecture. I hope Intel is thinking this through.

      • SAI_Peregrinus 3 hours ago

        That's a ways out. We're not even using all bits in addresses yet. Unless they want hardware pointer tagging a la CHERI there's not going to be a need to increase address sizes, but that doesn't expose the extra bits to the user.

        Data registers could be bigger. There's no reason `sizeof int` has to equal `sizeof intptr_t`, many older architectures had separate address & data register sizes. SIMD registers are already a case of that in x86_64.

        • hinkley 3 hours ago

          You can do a lot of pointer tagging in 64 bit pointers. Do we have CPUs with true 64 bit pointers yet? Looks like the Zen 4 is up to 57 bits. IIRC the original x86_64 CPUs were 48 bit addressing and the first Intel CPUs to dabble with larger pointers were actually only 40 bit addressing.
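
A minimal sketch of that kind of software pointer tagging, assuming 48-bit virtual addresses and a user-space (lower half) pointer. The names and the example address are made up. On x86-64 without an address-masking feature the tag has to be stripped before the pointer is dereferenced, because addresses must be canonical.

```python
ADDR_BITS = 48                        # assumed virtual-address width
ADDR_MASK = (1 << ADDR_BITS) - 1
TAG_LIMIT = 1 << (64 - ADDR_BITS)     # 16 spare bits under this assumption

def tag_pointer(addr: int, tag: int) -> int:
    """Store a small tag in the unused upper bits of a 64-bit pointer."""
    if not (0 <= addr <= ADDR_MASK and 0 <= tag < TAG_LIMIT):
        raise ValueError("address or tag does not fit")
    return (tag << ADDR_BITS) | addr

def untag_pointer(tagged: int) -> tuple[int, int]:
    """Return (address, tag); the address is what may be dereferenced."""
    return tagged & ADDR_MASK, tagged >> ADDR_BITS

tagged = tag_pointer(0x7FFF_DEAD_BEEF, 0xABCD)
print(hex(tagged))                    # 0xabcd7fffdeadbeef
print([hex(v) for v in untag_pointer(tagged)])
```

With 57-bit addressing the same scheme would leave only 7 spare bits, which is the trade-off being discussed here.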

      • rwmj 4 hours ago

        There's a first time for everything.

    • BobbyTables2 5 hours ago

      How are they adding GPRs? Won’t that utterly break how instructions are encoded?

      That would be a major headache — even if current instruction encodings were somehow preserved.

      It’s not just about compilers and assemblers. Every single system implementing virtualization has a software emulation of the instruction set - easily 10k lines of very dense code/tables.

      • toast0 3 hours ago

        x86 is broadly extendable. APX adds a REX2 prefix to address the new registers, and also allows using the EVEX prefix in new ways. And there are new conditional instructions where the encoding wasn't really described on the summary page.

        Presumably this is gated behind cpuid and/or model specific registers, so it would tend to not be exposed by virtualization software that doesn't support it. But yeah, if you decode and process instructions, it's more things to understand. That's a cost, but presumably the benefit outweighs the cost, at least in some applications.

        It's the same path as any x86 extension. In the beginning only specialty software uses it, at some point libraries that have specialized code paths based on processor features will support it, if it works well it becomes standard on new processors, eventually most software requires it. Or it doesn't work out and it gets dropped from future processors.

      • Joker_vD 5 hours ago

        The same way AMD added 8 new GPRs, I imagine: by introducing a new instruction prefix.
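
For reference, the prefix AMD introduced is REX: a byte of the form 0100WRXB whose R, X and B bits each supply a fourth bit for one of the 3-bit register fields. Below is a small sketch of how that plays out for a register-to-register MOV. The helper names are made up, and only the MOV r/m64, r64 form (opcode 0x89) is covered.

```python
def rex_prefix(w: int, r: int, x: int, b: int) -> int:
    """Build a REX prefix byte: 0100WRXB."""
    return 0x40 | (w << 3) | (r << 2) | (x << 1) | b

def mov_reg_reg64(dst: int, src: int) -> bytes:
    """Encode MOV r/m64, r64 (opcode 0x89) for register numbers 0..15."""
    rex = rex_prefix(w=1, r=src >> 3, x=0, b=dst >> 3)   # high bits go in the prefix
    modrm = 0xC0 | ((src & 7) << 3) | (dst & 7)          # mod=11: register-direct
    return bytes([rex, 0x89, modrm])

print(mov_reg_reg64(dst=0, src=3).hex())  # mov rax, rbx -> 4889d8
print(mov_reg_reg64(dst=8, src=0).hex())  # mov r8, rax  -> 4989c0
```

The only difference between addressing rax and r8 as the destination is the B bit in the prefix, which is why existing encodings kept working.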

  • jsrcout 4 hours ago

    Tried to answer this question years back for just the "basic" x86 registers. Quickly realized there was never going to be any single answer until I had mastered the entire ISA. Oh well.

  • diffuse_l 4 hours ago

    Some minor nitpicks, but hey, we're counting registers, it's already quite nitpicky :)

    As far as I can remember, you can't access the high/low 8 bits of si, di, sp. ip isn't accessible directly at all.

    The ancestry of x86 can actually be traced back to 8 bit cpus - the high/low bits of registers are remnants of an even older arch - but I'm not sure about that off the top of my head.

    I think most of the "weird" choices mentioned there boil down to limitations that seem absurd right now, but were real constraints - the x87 stack can probably be traced back to exposing a minimal interface to the host processor - 1 register instead of 8 can save quite a few data lines - although a multiplexer can probably solve this - so just a wild guess. MMX probably reused the register file of x87 to save die space.

    • rep_lodsb 3 hours ago

      The low 8 bits of SI, DI, BP and SP weren't accessible before, but now they are in 64-bit mode.

      The earliest ancestor of x86 was the CPU of the Datapoint 2200 terminal, implemented originally as a board of TTL logic chips and then by Intel in a single chip (the 8008). On that architecture, there was only a single addressing mode for memory: it used two 8-bit registers "H" and "L" to provide the high and low byte of the address to be accessed.

      Next came the 8080, which provided some more convenient memory access instructions, but the HL register pair was still important for all the old instructions that took up most of the opcode space. And the 8086 was designed to be somewhat compatible with the 8080, allowing automatic translation of 8080 assembly code.

      16-bit x86 didn't yet allow all GPRs to be used for addressing, only BX or BP as "base", and SI/DI as "index" (no scaling either). BP, SI and DI were 16-bit registers with no equivalent on the 8080, but BX took the place of the HL register pair, that's why it can be accessed as high and low byte.

      Also the low 8 bits of the x86 flag register (Sign,Zero,always 0,AuxCarry,always 0,Parity,always 1,Carry) are exactly identical to those of the 8080 - that's why those reserved bits are there, and why the LAHF and SAHF instructions exist. The 8080 "PUSH PSW" (Z80 "PUSH AF") instruction pushed the A register and flags to the stack, so LAHF + PUSH AX emulates that (although the byte order is swapped, with flags in the high byte whereas it's the low byte on the 8080).
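
The bit layout listed above (Sign, Zero, 0, AuxCarry, 0, Parity, 1, Carry, from bit 7 down to bit 0) can be decoded with a small sketch like this. The function name is made up; the input is the byte that LAHF would place in AH.

```python
# Bit position -> flag name for the low byte of FLAGS; bits 5, 3 and 1 are reserved.
FLAG_BITS = {7: "SF", 6: "ZF", 4: "AF", 2: "PF", 0: "CF"}

def decode_low_flags(byte: int) -> list[str]:
    """Return the names of the status flags set in the low byte of FLAGS."""
    return [name for bit, name in FLAG_BITS.items() if byte & (1 << bit)]

# 0x46: zero and parity set, plus the reserved bit 1 that always reads as 1.
print(decode_low_flags(0b0100_0110))  # ['ZF', 'PF']
```

A value like 0x46 is what you would expect to see after an operation that produces zero, such as XOR-ing a register with itself.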

  • burnt-resistor 4 hours ago

    Conservatively, though, here is another answer that does not count subset registers as distinct:

    16 GP

    2 state (flags + IP)

    6 seg

    4 TRs

    11 control

    32 ZMM0-31 (repurposes 8 FPU GP regs)

    1 MXCSR

    6 FPU state

    28 important MSRs

    7 bounds

    6 debug

    8 masks

    8 CET

    10 FRED

    =========

    145 total

    And don't forget another 10-20 for the local APIC.

    "The answer" depends upon the purpose and a specific set of optional extensions. Function call, task switching between processes in an OS, and emulation virtual machine process state have different requirements and expectations. YMMV.

    Here's a good list for reference: https://sandpile.org/x86/initial.htm
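
A quick check of the tally in the list above, using the category names and counts exactly as given in the comment (the dictionary keys are just labels):

```python
counts = {
    "GP": 16, "state": 2, "seg": 6, "TRs": 4, "control": 11,
    "ZMM": 32, "MXCSR": 1, "FPU state": 6, "MSRs": 28, "bounds": 7,
    "debug": 6, "masks": 8, "CET": 8, "FRED": 10,
}

total = sum(counts.values())
print(total)  # 145
```

Changing which categories count for a given purpose (function call, context switch, emulation) is then just a matter of summing a subset.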

  • jcalvinowens 4 hours ago

    Heh, am I the only one who was expecting an article about register renaming?

  • 1011101001000 5 hours ago

    x86-64 ISA general-purpose register containers: low-er 8 to 16 bits of the 64 bit GPR.

  • sylware 7 hours ago

    Don't forget x86_64, like ARM, is IP-locked; RISC-V is not.

    • dlcarrier 4 hours ago

      Fun fact: the AMD64 patents have expired, with AMD-V patents expiring this year, so there really isn't a need for an x86 license to do anything useful. All that's still protected is various AVX instruction sets, but those are generally used in heavily optimized software, like emulators and video encoders, that tend to be compiled to the specific processor instruction set anyway.

      • sylware 3 hours ago

        As far as I can remember, it is not only a "patent" issue. It seems there are other legal mechanisms.

        That said, I would not use a x86_64 CPU without AVX nowadays.

        • dlcarrier 2 hours ago

          As far as intellectual property protections go, you wouldn't be able to copy the layout of an old AMD or Intel processor without copyright infringement, not that anyone would want to, because it wouldn't be cost effective to use the exact same process decades later. There's no trademark protection, as AMD was unable to register the x86-64 trademark (https://tsdr.uspto.gov/#caseNumber=76032083).

          Other than protections against industrial espionage, that exhausts all forms of intellectual property rights in the US.

          • kens an hour ago

            The microcode is protected by copyright; see NEC Corporation v. Intel Corporation.