Why does Windows64 use a different calling convention from all other OSes on x86-64?

AMD has an ABI specification that describes the calling convention to use on x86-64. All OSes follow it, except for Windows, which has its own x86-64 calling convention. Why?

Does anyone know the technical, historical, or political reasons for this difference, or is it purely a matter of NIH syndrome?

I understand that different OSes may have different needs for higher level things, but that doesn’t explain why, for example, the register parameter passing order on Windows is rcx - rdx - r8 - r9 - rest on stack while everyone else uses rdi - rsi - rdx - rcx - r8 - r9 - rest on stack.

P.S. I am aware of how these calling conventions differ generally and I know where to find details if I need to. What I want to know is why.

Edit: for the how, see e.g. the wikipedia entry and links from there.

  • 4

    Well, just for the first register: rcx: ecx was the “this” parameter for the msvc __thiscall x86 convention. So probably just to ease porting their compiler to x64, they started with rcx as the first. That everything else would then be different too was just a consequence of that initial decision.


  • 1

    @Chris: I’ve added a reference to the AMD64 ABI supplement document (and some explanation of what it actually is) below.


  • 1

    I haven’t found a rationale from MS but I found some discussion here


Choosing four argument registers on x64 – common to UN*X / Win64

One of the things to keep in mind about x86 is that the register name to “reg number” encoding is not obvious; in terms of instruction encoding (the MOD R/M byte, see http://www.c-jump.com/CIS77/CPU/x86/X77_0060_mod_reg_r_m_byte.htm), register numbers 0…7 are – in that order – ?AX, ?CX, ?DX, ?BX, ?SP, ?BP, ?SI, ?DI.

Hence choosing A/C/D (regs 0..2) for return value and the first two arguments (which is the “classical” 32bit __fastcall convention) is a logical choice. As far as going to 64bit is concerned, the “higher” regs are ordered, and both Microsoft and UN*X/Linux went for R8 / R9 as the first ones.

Keeping that in mind, Microsoft’s choice of RAX (return value) and RCX, RDX, R8, R9 (arg[0..3]) is an understandable selection if you choose four registers for arguments.

I don’t know why the AMD64 UN*X ABI chose RDX before RCX.
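To make the register-number argument a bit more concrete, here is a small sketch in C. The enum values are the hardware register numbers from the encoding described above; the names `enum gpr`, `needs_rex` and `count_rex` are made up for illustration and are not part of any ABI document.

```c
#include <assert.h>
#include <stddef.h>

/* Hardware register numbers: 0..7 fit in the 3-bit ModR/M fields,
   8..15 need an extra REX extension bit. */
enum gpr {
    RAX = 0, RCX = 1, RDX = 2,  RBX = 3,
    RSP = 4, RBP = 5, RSI = 6,  RDI = 7,
    R8  = 8, R9  = 9, R10 = 10, R11 = 11,
    R12 = 12, R13 = 13, R14 = 14, R15 = 15
};

/* Integer argument registers of both conventions, in argument order. */
static const enum gpr win64_int_args[] = { RCX, RDX, R8, R9 };
static const enum gpr sysv_int_args[]  = { RDI, RSI, RDX, RCX, R8, R9 };

/* Hypothetical helper: 1 if the register number does not fit in 3 bits. */
int needs_rex(enum gpr r)
{
    assert((int)r >= 0 && (int)r <= 15);
    return r >= R8;
}

/* Hypothetical helper: how many registers in a list need the REX bit. */
size_t count_rex(const enum gpr *regs, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)needs_rex(regs[i]);
    return count;
}
```

Both lists contain exactly two registers (R8 and R9) that need the extension bit; the UN*X list simply has two more "legacy" registers in front.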

Choosing six argument registers on x64 – UN*X specific

UN*X, on RISC architectures, has traditionally done argument passing in registers – specifically, for the first six arguments (that’s so on PPC, SPARC, MIPS at least). Which might be one of the major reasons why the AMD64 (UN*X) ABI designers chose to use six registers on that architecture as well.

So if you want six registers to pass arguments in, and it’s logical to choose RCX, RDX, R8 and R9 for four of them, which other two should you pick?

The “higher” regs require an additional instruction prefix byte to select them and therefore have a bigger instruction size footprint, so you wouldn’t want to choose any of those if you have options. Of the classical registers, RBP and RSP aren’t available due to their implicit meaning, and RBX traditionally has a special use on UN*X (global offset table) which seemingly the AMD64 ABI designers didn’t want to needlessly become incompatible with.
Ergo, the only choices were RSI / RDI.

So if you have to take RSI / RDI as argument registers, which arguments should they be?

Making them arg[0] and arg[1] has some advantages. See cHao’s comment.
?SI and ?DI are string instruction source / destination operands, and as cHao mentioned, their use as argument registers means that with the AMD64 UN*X calling conventions, a bare-bones block copy routine taking (destination, source, count) needs little more than the CPU instructions mov rcx, rdx; rep movsb; ret, because the destination/source addresses have already been put into RDI / RSI by the caller and only the count has to be moved from RDX into RCX. (A strcpy() would additionally have to find the string length first.) There is, particularly in low-level and compiler-generated “glue” code (think, for example, some C++ heap allocators zero-filling objects on construction, or the kernel zero-filling heap pages on sbrk(), or copy-on-write page faults) an enormous amount of block copy/fill, hence it’ll be useful for code so frequently used to save the two or three CPU instructions that’d otherwise load such source/target address arguments into the “correct” registers.

So in a way, UN*X and Win64 are only different in that UN*X “prepends” two additional arguments, in purposefully chosen RSI/RDI registers, to the natural choice of four arguments in RCX, RDX, R8 and R9.

Beyond that …

There are more differences between the UN*X and Windows x64 ABIs than just the mapping of arguments to specific registers. For the overview on Win64, check:

http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx

Win64 and AMD64 UN*X also strikingly differ in the way stack space is used; on Win64, for example, the caller must allocate stack space for function arguments even though args 0…3 are passed in registers. On UN*X, on the other hand, a leaf function (i.e. one that doesn’t call other functions) is not even required to allocate stack space at all if it needs no more than 128 bytes of it, the so-called red zone (yes, you own and can use a certain amount of stack without allocating it … well, unless you’re kernel code, a source of nifty bugs). All these are particular optimization choices; most of the rationale for them is explained in the full ABI references that the original poster’s Wikipedia reference points to.
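As a rough illustration of the two stack rules just mentioned, here is a simplified model in C. It assumes every argument occupies one 8-byte slot (integers or pointers only) and ignores the 16-byte stack alignment both ABIs also require; the function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    SLOT_BYTES          = 8,   /* one stack slot per integer/pointer argument */
    WIN64_REG_ARGS      = 4,   /* args 0..3 travel in registers on Win64 */
    SYSV_RED_ZONE_BYTES = 128  /* usable below RSP without adjusting it */
};

/* Win64: the caller reserves slots for all arguments, and never fewer than
   four, even though the first four are passed in registers. */
size_t win64_outgoing_arg_bytes(size_t nargs)
{
    assert(nargs <= SIZE_MAX / SLOT_BYTES);  /* guard against overflow */
    size_t slots = nargs < WIN64_REG_ARGS ? WIN64_REG_ARGS : nargs;
    return slots * SLOT_BYTES;
}

/* AMD64 UN*X: a leaf function only has to move RSP if its locals do not
   fit into the red zone. Returns 1 if an adjustment is needed. */
int sysv_leaf_must_adjust_rsp(size_t local_bytes)
{
    return local_bytes > SYSV_RED_ZONE_BYTES;
}
```

Under these assumptions, a Win64 call with two arguments still reserves 32 bytes, while a UN*X leaf function with 128 bytes of locals needs no stack adjustment at all.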

IDK why Windows did what they did. See the end of this answer for a guess. I was curious about how the SysV calling convention was decided on, so I dug into the mailing list archive and found some neat stuff.

It’s interesting reading some of those old threads on the AMD64 mailing list, since AMD architects were active on it. e.g. choosing register names was one of the hard parts: AMD considered renaming the original 8 registers r0-r7, or calling the new registers UAX etc.

Also, feedback from kernel devs identified things that made the original design of syscall and swapgs unusable. AMD then updated the instructions to get this sorted out before releasing any actual chips. It’s also interesting that in late 2000, the assumption was that Intel probably wouldn’t adopt AMD64.


The SysV (Linux) calling convention, and the decision on how many registers should be callee-preserved vs. caller-save, was made initially in Nov 2000, by Jan Hubicka (a gcc developer). He compiled SPEC2000 and looked at code size and number of instructions. That discussion thread bounces around some of the same ideas as answers and comments on this SO question. In a 2nd thread, he proposed the current sequence as optimal and hopefully final, generating smaller code than some alternatives.

He’s using the term “global” to mean call-preserved registers, that have to be push/popped if used.

The choice of rdi, rsi, rdx as the first three args was motivated by:

  • minor code-size saving in functions that call memset or other C string function on their args (where gcc inlines a rep string operation?)
  • rbx is call-preserved because having two call-preserved regs accessible without REX prefixes (rbx and rbp) is a win. Presumably chosen because they’re the only “legacy” registers that aren’t implicitly used by any common instruction. (rep string, shift count, and mul/div outputs/inputs touch everything else).
  • None of the registers that common instructions force you to use are call-preserved (see prev point), so a function that wants to use a variable-count shift or division might have to move function args somewhere else, but doesn’t have to save/restore the caller’s value. cmpxchg16b and cpuid need RBX, but are rarely used so not a big factor. (cmpxchg16b wasn’t part of original AMD64, but RBX would still have been the obvious choice. cmpxchg8b exists but was obsoleted by qword cmpxchg)
  • We are trying to avoid RCX early in the sequence, since it is register
    used commonly for special purposes, like EAX, so it has same purpose to be
    missing in the sequence.
    Also it can’t be used for syscalls and we would like to make syscall sequence
    to match function call sequence as much as possible.

(background: syscall / sysret unavoidably destroy rcx (with rip) and r11 (with RFLAGS), so the kernel can’t see what was originally in rcx when syscall ran.)

The kernel system-call ABI was chosen to match the function call ABI, except for r10 instead of rcx, so libc wrapper functions like mmap(2) can just mov %rcx, %r10 / mov $0x9, %eax / syscall.
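A small sketch of why such a wrapper needs only a single register move: the two register sequences are written out as tables and compared slot by slot. The table and function names are invented for this example; the register orders are the ones described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { NUM_REG_ARGS = 6 };

/* Integer argument registers for a normal function call vs. a Linux
   x86-64 system call, in argument order. */
static const char *const sysv_call_regs[NUM_REG_ARGS] =
    { "rdi", "rsi", "rdx", "rcx", "r8", "r9" };
static const char *const linux_syscall_regs[NUM_REG_ARGS] =
    { "rdi", "rsi", "rdx", "r10", "r8", "r9" };

/* Hypothetical helper: register holding syscall argument i. */
const char *linux_syscall_reg(size_t i)
{
    assert(i < NUM_REG_ARGS);
    return linux_syscall_regs[i];
}

/* Hypothetical helper: number of register-to-register moves a wrapper
   needs before executing syscall, for a call with nargs arguments. */
size_t count_moves_for_wrapper(size_t nargs)
{
    assert(nargs <= NUM_REG_ARGS);
    size_t moves = 0;
    for (size_t i = 0; i < nargs; i++)
        if (strcmp(sysv_call_regs[i], linux_syscall_regs[i]) != 0)
            moves++;
    return moves;
}
```

Wrappers for system calls with up to three arguments need no moves at all in this model, and a six-argument call like mmap needs exactly one (rcx to r10).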


Note that the SysV calling convention used by i386 Linux sucks compared to Windows’ 32-bit __vectorcall. It passes everything on the stack, and only returns in edx:eax for int64, not for small structs. It’s no surprise little effort was made to maintain compatibility with it. Where there was no reason not to, they did keep things the same, like keeping rbx call-preserved, since they decided that having another call-preserved register in the original 8 (that don’t need a REX prefix) was good.

Making the ABI optimal is much more important long-term than any other consideration. I think they did a pretty good job. I’m not totally sure about returning structs packed into registers, instead of different fields in different regs. I guess code that passes them around by value without actually operating on the fields wins this way, but the extra work of unpacking seems silly. They could have had more integer return registers, more than just rdx:rax, so returning a struct with 4 members could return them in rdi, rsi, rdx, rax or something.
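To show what “packed into registers” means here, this is a sketch that models the value RAX would hold when a struct of two 32-bit members is returned by value under the SysV convention: the first member ends up in the low half, the second in the high half. It uses explicit shifts instead of looking at real registers, and `struct pair`, `pack_pair` and `unpack_pair` are names invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* An 8-byte struct: small enough to come back in a single register. */
struct pair {
    uint32_t a;
    uint32_t b;
};

/* Model of the packed return value: a in bits 0..31, b in bits 32..63. */
uint64_t pack_pair(struct pair p)
{
    return (uint64_t)p.a | ((uint64_t)p.b << 32);
}

/* The "extra work of unpacking" a caller does to get at single fields. */
struct pair unpack_pair(uint64_t rax)
{
    struct pair p = { (uint32_t)rax, (uint32_t)(rax >> 32) };
    return p;
}

/* Round trip: packing and unpacking gives back the same fields. */
void demo_roundtrip(void)
{
    struct pair in = { 1u, 2u };
    struct pair out = unpack_pair(pack_pair(in));
    assert(out.a == in.a && out.b == in.b);
}
```

Code that just passes the struct along can move the single 64-bit value around untouched; only code that looks at `b` has to pay for the shift.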

They considered passing integers in vector regs, because SSE2 can operate on integers. Fortunately they didn’t do that. Integers are used as pointer offsets very often, and a round-trip to stack memory is pretty cheap. Also SSE2 instructions take more code bytes than integer instructions.


I suspect Windows ABI designers might have been aiming to minimize differences between 32 and 64bit for the benefit of people that have to port asm from one to the other, or that can use a couple #ifdefs in some ASM so the same source can more easily build a 32 or 64bit version of a function.

Minimizing changes in the toolchain seems unlikely. An x86-64 compiler needs a separate table of which register is used for what, and what the calling convention is. Having a small overlap with 32bit is unlikely to produce significant savings in toolchain code size / complexity.

Remember that Microsoft was initially “officially noncommittal toward the early AMD64 effort” (from “A History of Modern 64-bit Computing” by Matthew Kerner and Neil Padgett) because they were strong partners with Intel on the IA64 architecture. I think that this meant that even if they would have otherwise been open to working with GCC engineers on an ABI to use both on Unix and Windows, they wouldn’t have done so as it would mean publicly supporting the AMD64 effort when they hadn’t yet officially done so (and would have probably upset Intel).

On top of that, back in those days Microsoft had absolutely no leanings toward being friendly with open source projects. Certainly not Linux or GCC.

So why would they have cooperated on an ABI? I’d guess that the ABIs are different simply because they were designed at more or less the same time and in isolation.

Another quote from “A History of Modern 64-bit Computing”:

In parallel with the Microsoft collaboration, AMD also engaged the
open source community to prepare for the chip. AMD contracted with
both Code Sorcery and SuSE for tool chain work (Red Hat was already
engaged by Intel on the IA64 tool chain port). Russell explained that
SuSE produced C and FORTRAN compilers, and Code Sorcery produced a
Pascal compiler. Weber explained that the company also engaged with
the Linux community to prepare a Linux port. This effort was very
important: it acted as an incentive for Microsoft to continue to
invest in the AMD64 Windows effort, and also ensured that Linux, which
was becoming an important OS at the time, would be available once the
chips were released.

Weber goes so far as to say that the Linux work was absolutely crucial
to AMD64’s success, because it enabled AMD to produce an end-to-end
system without the help of any other companies if necessary. This
possibility ensured that AMD had a worst-case survival strategy even
if other partners backed out, which in turn kept the other partners
engaged for fear of being left behind themselves.

This indicates that even AMD didn’t feel that cooperation was necessarily the most important thing between MS and Unix, but that having Unix/Linux support was very important. Maybe even trying to convince one or both sides to compromise or cooperate wasn’t worth the effort or risk(?) of irritating either of them? Perhaps AMD thought that even suggesting a common ABI might delay or derail the more important objective of simply having software support ready when the chip was ready.

Speculation on my part, but I think the major reason the ABIs are different was the political reason that MS and the Unix/Linux sides just didn’t work together on it, and AMD didn’t see that as a problem.

Win32 has its own uses for ESI and EDI, and requires that they not be modified (or at least that they be restored before calling into the API). I’d imagine 64-bit code does the same with RSI and RDI, which would explain why they’re not used to pass function arguments around.

I couldn’t tell you why RCX and RDX are switched, though.
