inline assembly improvements

Open andrewrk opened this issue 9 years ago • 48 comments

This inline assembly does exit(0) on x86_64 linux:

    asm volatile ("syscall"
        : [ret] "={rax}" (-> usize)
        : [number] "{rax}" (60),
            [arg1] "{rdi}" (0)
        : "rcx", "r11");

Here are some flaws:

  • 60 and 0 are number literals and need to be cast to a type to be valid. Leaving out the cast currently causes an assertion failure in the compiler. Assembly syntax should include types for inputs.
  • [number], [arg1], and [ret] are unused, which is awkward.
  • need multiple return values (see #83)
  • do we really need this complicated constraint syntax? maybe we can operate on inputs and outputs.
  • let's go digging into some real world inline assembly code to see the use cases.
  • ~when we get errors from parsing assembly, we don't attach them to the offset from within the assembly string.~ #2080
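
For comparison, here is a minimal sketch of how GCC-style extended asm in C deals with the first flaw: operand types come from the C expressions bound to each constraint, so there are no untyped literals. The helper name `raw_syscall1` is hypothetical. The non-x86_64 fallback through `syscall(2)` is an assumption added only so the sketch builds on other targets.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a one-argument Linux syscall. */
static long raw_syscall1(long number, long arg1)
{
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    __asm__ volatile ("syscall"
        : "=a"(ret)                 /* output: rax */
        : "a"(number), "D"(arg1)    /* inputs: rax and rdi, typed as long */
        : "rcx", "r11", "memory");  /* the syscall instruction overwrites rcx and r11 */
    return ret;
#else
    /* Fallback for other targets (assumption, not part of the issue). */
    return syscall(number, arg1);
#endif
}
```

Calling `raw_syscall1(60, 0)` would perform the same exit(0) as the Zig snippet above; a harmless call such as `raw_syscall1(SYS_getpid, 0)` is easier to try out.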

andrewrk avatar Nov 18 '16 06:11 andrewrk

One idea:

const result = asm volatile ("rax" number: usize, "rdi" arg1: usize, "rcx", "r11")
    -> ("rax" ret: usize)  "syscall" (60, 0);

This shuffles the syntax around and makes it more like a function call. Clobbers are extra "inputs" that don't have a name and a type. The register names are still clunky.

This proposal also operates on the assumption that all inline assembly can operate on inputs and outputs.

andrewrk avatar Nov 18 '16 06:11 andrewrk

@ofelas can I get your opinion on this proposal?

andrewrk avatar Nov 18 '16 07:11 andrewrk

Right, you really made me think here. I haven't done that much asm in Zig yet, but here are a few snippets that I've used on x86. They primarily struggle with the issue of multiple return values. The examples below may not be correct; I always end up spending some time reading the GCC manuals when doing inline asm in C, as it isn't always straightforward.

I just skimmed through the discussion over at Rust users and Rust inline assembly, they seem to have similar discussions and it seems that the asm feature may not be used that much. If you really need highly optimized or complex asm wouldn't you break out to asm (or possibly llvm ir)?

I guess what we have to play with is what LLVM provides, at least as long as zig has a tight connection to it (It seems there are discussions on also supporting Cretonne in Rust according to the LLVM Weekly).

With the above proposal, would I write the PPC eieio (and isync, sync) like this: _ = asm volatile () -> () "eieio" (); and in the old style: _ = asm volatile ("eieio");? This may typically be available as an intrinsic barrier, I guess. I think I read somewhere that _ would be the same as Nim's discard; it may not be needed here, as this asm doesn't return anything.

Not sure I answered your question...

inline fn rdtsc() -> u64 {
    var low: u32 = undefined;
    var high: u32 = undefined;
    // output in eax and edx, could probably movl edx, fingers x'ed...
    low = asm volatile ("rdtsc" : [low] "={eax}" (-> u32));
    high = asm volatile ("movl %%edx,%[high]" : [high] "=r" (-> u32)); 
    ((u64(high) << 32) | (u64(low)))
}

The above is obviously a kludge. I initially hoped to write it more like this. It does, however, feel strange having to specify the outputs twice, both on the lhs and inside the asm outputs, with the potential of mixing up the order, which may be important.

inline fn rdtsc() -> u64 {
    // output in eax and edx
    var low: u32 = undefined;
    var high:u32 = undefined;
    low, high = asm
        // no sideeffects
        ("rdtsc"
         : [low] "={eax}" (-> u32), [high] "={edx}" (-> u32)
         : // No inputs
         : // No clobbers
         );
    ((u64(high) << 32) | (u64(low)))
}

Or possibly like this, without having to initialise the output-only variables to undefined or 0:

inline fn rdtsc() -> u64 {
    // output in eax and edx
    const (low: u32, high: u32) = asm
        // no sideeffects
        ("rdtsc"
         : [low] "={eax}" (-> u32), [high] "={edx}" (-> u32)
         : // No inputs
         : // No clobbers
         );
    ((u64(high) << 32) | (u64(low)))
}
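
For reference, GCC-style extended asm in C already accepts several outputs in one statement, which is what the two variants above are reaching for. A minimal sketch, assuming an x86 target; the name `read_tsc` and the time-based fallback for other targets are illustrative only.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read the x86 time-stamp counter. */
static uint64_t read_tsc(void)
{
#if defined(__x86_64__) || defined(__i386__)
    uint32_t low, high;
    /* rdtsc writes the low half to eax and the high half to edx;
     * volatile keeps the compiler from merging separate reads. */
    __asm__ volatile ("rdtsc" : "=a"(low), "=d"(high));
    return ((uint64_t)high << 32) | low;
#else
    /* Fallback so the sketch builds elsewhere; this is not a cycle counter. */
    return (uint64_t)time(NULL);
#endif
}
```

Each output is named once, next to its register constraint, so there is no separate left-hand side whose order could get out of step with the outputs.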

I've also tinkered with the cpuid instruction which is particularly nasty;

inline fn cpuid(f: u32) -> u32 {
    // See: https://en.wikipedia.org/wiki/CPUID, there's a boatload of variations...
    var id: u32 = 0;
    if (f == 0) {
        // Multiple outputs (as an ASCII string) which we mark as clobbered and just leave untouched
        return asm volatile ("cpuid" : [id] "={eax}" (-> u32): [eax] "{eax}" (f) : "ebx", "ecx", "edx");
    } else {
        return asm volatile ("cpuid" : [id] "={eax}" (-> u32): [eax] "{eax}" (f));
    }
}

ofelas avatar Nov 18 '16 16:11 ofelas

With the proposal, rdtsc would look like this in zig:

fn rdtsc() -> u64 {
    const low, const high = asm () -> ("eax" low: u32, "edx" high: u32) "rdtsc" ();
    ((u64(high) << 32) | (u64(low)))
}

This seems like an improvement.

cpuid with the proposal. I propose that instead of naming the function after the assembly instruction, we name it after the information we want. So let's choose one of the use cases, get vendor id.

fn vendorId() -> (result: [12]u8) {
    const a: &u32 = (&u32)(&result[0 * @sizeOf(u32)]);
    const b: &u32 = (&u32)(&result[1 * @sizeOf(u32)]);
    const c: &u32 = (&u32)(&result[2 * @sizeOf(u32)]);
   *a, *b, *c = asm () -> ("ebx" a: u32, "ecx" b: u32, "edx" c: u32) "cpuid" ();
}

Once again volatile not necessary here. cpuid doesn't have side effects, we only want to extract information from the assembly.

So far, so good. Any more use cases?

andrewrk avatar Nov 18 '16 17:11 andrewrk

Yes, that ain't too shabby, so with the correct input in eax it is;

fn vendorId() -> (result: [12]u8) {
    const a: &u32 = (&u32)(&result[0 * @sizeOf(u32)]);
    const b: &u32 = (&u32)(&result[1 * @sizeOf(u32)]);
    const c: &u32 = (&u32)(&result[2 * @sizeOf(u32)]);
   // in eax=0, out: eax=max accepted eax value(clobbered/ignored), string in ebx, ecx, edx
   *a, *b, *c = asm ("eax" func: u32) -> ("ebx" a: u32, "ecx" b: u32, "edx" c: u32, "eax") "cpuid" (0);
}
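
As a cross-check, here is a rough C sketch of the same vendor-id query using GCC-style constraints. One detail worth noting: CPUID leaf 0 returns the 12-byte string in EBX, EDX, ECX order, so mapping a, b, c to ebx, ecx, edx as in the snippets above would swap the last two groups of four bytes. The name `cpu_vendor_id` and the placeholder string for non-x86_64 targets are assumptions for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write the 12-byte vendor string plus a NUL into out. */
static void cpu_vendor_id(char out[13])
{
#if defined(__x86_64__)
    uint32_t eax = 0, ebx, ecx, edx;
    /* eax is both input (leaf 0) and output (highest supported leaf). */
    __asm__ ("cpuid"
        : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx));
    memcpy(out + 0, &ebx, 4);
    memcpy(out + 4, &edx, 4); /* edx holds the middle four bytes */
    memcpy(out + 8, &ecx, 4);
#else
    /* Placeholder so the sketch builds on other targets. */
    memcpy(out, "not-x86-cpu!", 12);
#endif
    out[12] = '\0';
}
```

Listing eax as a read-write operand covers the "clobbered/ignored" case from the comment in the Zig version without a separate clobber entry.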

Would something like this be possible, ignoring my formatting?

result = asm ( // inputs
        "=r" cnt: usize = count,
        "=r" lhs: usize = &left,
        "=r" rhs: usize = &right,
        "=r" res: u8 = result,
        // clobbers
        "al", "rcx", "cc")
        -> ( // outputs
        "=r" res)
        // multiline asm string
        \\movq %[count], %rcx
        \\1:
        \\movb -1(%[lhs], %rcx, 1), %al
        \\xorb -1(%[rhs], %rcx, 1), %al
        \\orb %al, %[res]
        \\decq %rcx
        \\jnz 1b
        // args/parameters
        (count, &left, &right, result);

ofelas avatar Nov 18 '16 21:11 ofelas

> Yes, that ain't too shabby, so with the correct input in eax it is;

Ah right, nice catch.

I like putting the values of the inputs above as you did. Then we don't need them below.

Is the count arg necessary just to feed the movq instruction? It seems like we could pass it in a register instead.

And then finally result should be an output instead of an input, right?

So it would look like this:

const result = asm ( // inputs
        "{rcx}" cnt: usize = count,
        "=r" lhs: usize = &left,
        "=r" rhs: usize = &right,
        // clobbers
        "al", "rcx", "cc")
        -> ( // outputs
        "=r" res: u8)
        // multiline asm string
        \\1:
        \\movb -1(%[lhs], %rcx, 1), %al
        \\xorb -1(%[rhs], %rcx, 1), %al
        \\orb %al, %[res]
        \\decq %rcx
        \\jnz 1b
);

This is a good example of why we should retain the constraint syntax, since we might want {rcx} or =r.

andrewrk avatar Nov 19 '16 00:11 andrewrk

I'm not too familiar with x86 asm; I nicked that example from the Rust discussions. In this case rcx (and ecx in 32-bit mode) is a loop counter, somewhat similar to ctr on PowerPC, so the movq, decq, jnz sequence drives the loop. As long as that condition is met, it probably doesn't matter how the count gets there. Maybe it could have been done with the loop instruction, which decrements and tests at the same time.

result is both an input and an output, as when updating a checksum or similar, where you feed in an initial or intermediate value that you want to update.
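
To make the in/out point concrete, here is a minimal sketch of the loop in GCC-style C, where a read-write operand is spelled with a `+` constraint. It assumes x86_64 with AT&T syntax; the name `xor_accumulate` and the plain C fallback are hypothetical additions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: OR the byte-wise XOR of two buffers into acc.
 * acc comes back unchanged when the buffers are equal. */
static uint8_t xor_accumulate(const uint8_t *lhs, const uint8_t *rhs,
                              size_t count, uint8_t acc)
{
    if (count == 0)
        return acc; /* the asm loop below always runs at least once */
#if defined(__x86_64__)
    __asm__ (
        "1:\n\t"
        "movb -1(%[lhs], %[cnt], 1), %%al\n\t"
        "xorb -1(%[rhs], %[cnt], 1), %%al\n\t"
        "orb %%al, %[res]\n\t"
        "decq %[cnt]\n\t"
        "jnz 1b"
        : [res] "+&r"(acc),   /* read-write: initial value in, result out */
          [cnt] "+&c"(count)  /* pinned to rcx, counts down to zero */
        : [lhs] "r"(lhs), [rhs] "r"(rhs)
        : "rax", "cc", "memory");
#else
    /* Same computation in plain C for other targets. */
    while (count--)
        acc |= (uint8_t)(lhs[count] ^ rhs[count]);
#endif
    return acc;
}
```

The sketch mixes both kinds of constraint discussed above: the counter is tied to a specific register, while the pointers and the accumulator are left to the register allocator.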

Are you planning to support all the various architecture specific input/output/clobber constraints and indirect inputs/outputs present in LLVM?

ofelas avatar Nov 19 '16 09:11 ofelas

Another avenue to go down is the MSVC way of doing inline assembly. M$ does a smart augmented assembly, where you can transparently access C/C++ variables from the assembly. An example would be a memcpy implementation:

void
CopyMemory(u8* Dst, u8* Src, memory_index Length)
{
	__asm {
		mov rsi, Src
		mov rdi, Dst
		mov rcx, Length
		rep movsb
	}
}

It provides a really nice experience. However, MSVC isn't smart about the registers, so all registers used are backed up to the stack before emitting the assembly, and are then restored after the assembly. This avoids the mess of having to specify clobbered registers, but at the cost of a fair bit of performance.

The smart syntax is awesome, but it might be hard to fit with an LLVM backend if you do not want to write an entire assembler as well.
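
For contrast, a rough GCC/Clang counterpart of the same copy is sketched below. Here the register use is declared through constraints, so the compiler places the values itself and does not need to save and restore everything around the block. The name `copy_memory` and the `memcpy` fallback for non-x86 targets are assumptions for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical GCC/Clang counterpart of the MSVC CopyMemory example. */
static void copy_memory(void *dst, const void *src, size_t length)
{
#if defined(__x86_64__) || defined(__i386__)
    /* rep movsb advances rdi and rsi and counts rcx down to zero,
     * so all three are read-write operands. */
    __asm__ volatile ("rep movsb"
        : "+D"(dst), "+S"(src), "+c"(length)
        :
        : "memory");
#else
    memcpy(dst, src, length);
#endif
}
```

The price is exactly the constraint clutter this thread is trying to reduce: three operands and a clobber for a single instruction.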

kiljacken avatar Dec 09 '16 07:12 kiljacken

As kiljacken says, I personally really, really enjoy the Intel syntax over GAS, as D has done it (except for GDC, which is based on GCC). I can only assume an MSVC-style inline assembly feature would be harder to implement.

dd86k avatar Oct 19 '17 01:10 dd86k

The end game is we will have our own assembly syntax, like D, which will end up being compiled to llvm compatible syntax. It's just a lot of work.

I at first tried to use the Intel syntax but llvm support for it is buggy and some of the instructions are messed up to the point of having silent bugs.

andrewrk avatar Oct 19 '17 01:10 andrewrk

Points 1 and 2 in the OP seem to be solved.

SamTebbs33 avatar Aug 31 '19 09:08 SamTebbs33

OUTDATED

This has been split off into #5241. This comment will no longer be updated.

New Inline Asm Syntax

asm (arches) (bindings, clobbers) (:return_register|void|noreturn) { local_labels body } (else ...)? + config? (somewhere)

Arches

An optional list of target architectures. If this is null, the block is assumed to be for all architectures (an assembler error is always a compile error). Otherwise, one of these must match builtin.arch, or an else branch must be present. This is a list rather than a single value as some architectures have mutually compatible subsets (e.g. 8086/x86/x86_64, MIPS/RISC-V).

Bindings and Clobbers

Bindings have the form "register" name: type = init_value. name can be _, if the register is desired only for initialisation. name can also be a variable in scope, in which case type and init_value are omitted, and changes to this register's value are taken as changes to the variable. init_value can be undefined, in which case type can be omitted (it doesn't matter much in assembly anyway), unless name is the return register (more on this later). Clobbers are simply "register".

Return Register

A binding can be nominated as the return value, with :name. (Allowing :"register" would cause parsing ambiguity, and this can be trivially done with a binding anyway.) void and noreturn are also allowed. Reaching the end of a noreturn block is safety-checked UB.

Local Labels

A list of local labels. Formatted as strings.

Local labels are unique to the block: %(label) matches %(label) within the block, and is guaranteed not to match anything else in the program. They are listed within the braces of the body because they really don't make sense outside that context.

Body

The assembly code itself, as a string. If this fails to assemble, it's a compile error.

The following macros are defined:

  • %[name] Register, as specified in bindings section.
  • %(label) Label, as listed in local labels section.
  • @[variable] Pre-mangled global variable name. Used to reference globals. See #5211.
  • @(function) Pre-mangled function name. Used to call functions. See #5211.

A literal % or @ is escaped with another one: %% or @@. Strictly speaking, if we're substituting text, only one of @[] and @() is needed -- but, if we want to integrate the assembler with the compiler, the distinction may be important, so I've listed both.

Else

If arches is non-null and none of the listed architectures match builtin.arch, this is compiled instead. Can be used to switch on architectures, optimise a specific architecture only, or simply @compileError. If this is not present, a target mismatch is a compile error.

N.B.: An else branch is only allowed if arches is non-null. This decision was made because, when you set arches to null, either you know execution will never reach this point on the wrong architecture, or you only care about compiling for a specific architecture. In the former case, you definitely want an unexpected architecture to be a compile error; and in the latter, to support a new architecture, the laziest thing you can do is start caring.

Config

Configuration is passed in a pragma (#5239) with the following fields:

  • impure This block has side effects.
  • stack(n) This block allocates n bytes on the stack. Defaults to 0.
  • calls(funcs) This block calls the functions listed in funcs. Defaults to .{}.

Example

const builtin = @import("builtin");

const fib_asm = fn (n: u32) u32 {
    return asm (.{ builtin.Arch.riscv64, builtin.Arch.riscv32 }) @{
        stack(12),
        calls(.{ fib_iter }),
    } (
        "a0" this  : u32 = 0,
        "a1" next  : u32 = 1,
        "=r" to_go : u32 = n,
    ) :this {
        .{ "loop", "end" }

        \\%(loop):
        \\  beqz %[to_go], %(end)
        // We can do function pro/epi at callsite!
        \\  addi sp, sp, -12
        \\  sd ra, 0(sp)
        \\  sw %[to_go], 8(sp)
        \\  call @(fib_iter)
        \\  lw %[to_go], 8(sp)
        \\  ld ra, 0(sp)
        \\  addi sp, sp, 12
        \\  addi %[to_go], %[to_go], -1
        \\  j %(loop)
        \\%(end):
    } else @compileError("Your machine could be better");
};

// Actually returns two values, but the compiler has no way to express that
const fib_iter = fn @{callconv(.Naked)} (this: u32, next: u32) void {
    // No need to check architecture -- we'll only call this from fib_asm
    asm (null) @{impure} (
        "a0" this,
        "a1" next,
        "=r" temp = undefined,
    ) void {
        .{}

        \\  add %[temp], %[this], %[next]
        \\  mv %[this], %[next]
        \\  mv %[next], %[temp]
    };
};

TL;DR: Benefits over Status Quo

  • If any of the sections are missed, the compiler can detect exactly which ones
  • Order of mandatory components has a logical progression, just like function declaration
  • Option to tie to target architecture
  • Registers have types
  • Can express non-returning and valueless assembly
  • Can reference global variables and call functions
  • Won't unexpectedly jump to random points in the program
  • Communicates metadata to compiler, but does not require it
  • Provides alternative for unsupported architectures
  • Can be automatically distinguished from status quo, albeit with some lookahead
  • Can be automatically derived from status quo

ghost avatar May 01 '20 04:05 ghost

Ok, sorry, I changed it. I can't help it, I'm a perfectionist.

ghost avatar May 01 '20 09:05 ghost

Ok, it's a living document. I'll admit it.

ghost avatar May 01 '20 11:05 ghost

I've split it off into its own issue. See above.

ghost avatar May 01 '20 15:05 ghost

Hey @andrewrk -- given the emphasis on stabilisation in this release cycle, should we take the time to get this right now, so we're not stuck with it forever?

ghost avatar May 08 '20 07:05 ghost

Hey, I did a fairly major rework of #5241 recently. Now there's a more powerful constraint syntax.

ghost avatar May 10 '20 17:05 ghost

Possible inspiration from Rust: New inline assembly syntax available in nightly

andrewrk avatar Jun 09 '20 02:06 andrewrk

For those who want to look further into that, there's more here.

There's a lot of good stuff there. The two deal-breakers for me are contextually repurposed syntax (out is not a function, reg is not a variable) and behind-the-scenes non-configurable action (assigning outputs). I've updated #5241 with the good stuff.

ghost avatar Jun 10 '20 14:06 ghost

Another idea is to use the Keystone assembler framework. It is under GPLv2 + FOSS License Exception, which is compatible with MIT.

It is also based on LLVM (the MC), but has already extended it:

> The section below highlights the areas where Keystone shines.
>
>   • [...]
>
>   • Framework: llvm-mc is a tool, but not a framework. Therefore, it is very tricky to build your own assembler tools on of LLVM, while this is the main purpose of Keystone. Keystone's API makes it easy to handle errors, report internal status of its core or change compilation mode at runtime, etc.
>
>   • [...]
>
>   • Flexibility: LLVM's assembler is much more restricted. For example, it only accepts assembly in LLVM syntax. On the contrary, Keystone is going to support all kind of input, ranging from Nasm, Masm, etc.
>
>   • Capability: LLVM is for compiling & linking, so (understandably) some of its technical choices are not inline with an independent assembler like Keystone. For example, LLVM always put code and data in separate sections. However, it is very common for assembly to mix code and data in the same section (think about shellcode). Keystone is made to handle this kind of input very well.

Well, I don't know the architecture of either Zig or Keystone, so this suggestion may not be practical.


In any case, I am all for MSVC-like syntax (to be used for different CPU archs). It is so elegant!

sskras avatar Jan 01 '22 16:01 sskras

OK, some criticism from the link given in the last comment of @EleanorNB: https://github.com/Amanieu/rfcs/blob/inline-asm/text/0000-inline-asm.md#implement-an-embedded-dsl

> While this is very convenient on the user side in that it requires no specification of inputs, outputs, or clobbers, it puts a major burden on the implementation. The DSL needs to be implemented for each supported architecture, and full knowledge of the side-effect of every instruction is required.

Well, side-effects could just come from some lookup table containing a list of the registers being clobbered by every opcode. It would be interesting to dig and see if @keystone implements something similar.

> This huge implementation overhead is likely one of the reasons MSVC only provides this capability for x86, while D at least provides it for x86 and x86-64. It should also be noted that the D reference implementation falls slightly short of supporting arbitrary assembly. E.g. the lack of access to the RIP register makes certain techniques for writing position independent code impossible.

Wow, the latter statement (re no RIP access) looks like a very important insight.

sskras avatar May 16 '22 16:05 sskras

> ... In any case, I am all for MSVC-like syntax (to be used for different CPU archs). It is so elegant!

I would highly doubt that Zig will have MSVC-like assembly syntax. It is very ambiguous and performance-wise inefficient because the compiler doesn't know anything about what you're doing so it can't optimize anything. It's nice to call cpuid with it, but it's not nice doing loops with aesenc etc...

eLeCtrOssSnake avatar Aug 10 '22 20:08 eLeCtrOssSnake

@eLeCtrOssSnake commented 27 minutes ago:

> It is very ambiguous

Ummm, any example of this (at least a theoretical one)?

> and performance-wise inefficient because the compiler doesn't know anything about what you're doing so it can't optimize anything.

TBH, if I go forward with inline asm, I would like the compiler to avoid doing any optimizations near my asm code.

sskras avatar Aug 10 '22 20:08 sskras

> I would like the compiler to avoid doing any optimizations near to my asm code.

You have basically said it yourself. MSVC asm syntax makes asm statement a black box and cannot efficiently call it. It has to isolate it and it is very expensive to do so. Inline asm calls to aesenc (in zig crypt) for example want optimizations because we want fast encryption, not a slow one?

eLeCtrOssSnake avatar Aug 10 '22 20:08 eLeCtrOssSnake

> MSVC asm syntax makes asm statement a black box and cannot efficiently call it.

Well, for me this remains to be proved by specific, particular examples of C+asm & disasm-dump pairs/combos (maybe along with some profiling results that can be compared).

IOW, I am interested in the details that make the box black. E.g. register clobbering can be worked out from the asm block, and the variable name mapping can be performed on the asm block too. So the box is not exactly black.

> Inline asm calls to aesenc (in zig crypt) for example want optimizations because we want fast encryption, not a slow one?

Yes, but the optimization you mean here and optimizations in general (which I [previously thought you] meant) might be different things. I would like to see something like the aforementioned detailed examples before drawing further conclusions.

sskras avatar Aug 11 '22 07:08 sskras

> > MSVC asm syntax makes asm statement a black box and cannot efficiently call it.
>
> Well, for me this remains to be proved by specific, particular examples of C+asm & disasm-dump pairs/combos (maybe along with some profiling results that can be compared).
>
> IOW, I am interested in the details that make the box black. E.g. register clobbering can be worked out from the asm block, and the variable name mapping can be performed on the asm block too. So the box is not exactly black.
>
> > Inline asm calls to aesenc (in zig crypt) for example want optimizations because we want fast encryption, not a slow one?
>
> Yes, but the optimization you mean here and optimizations in general (which I [previously thought you] meant) might be different things. I would like to see something like the aforementioned detailed examples before drawing further conclusions.

I don't remember precisely. I remember tinkering with hw support for crc32c on x86(_64) and my inline asm implementation with cl.exe was as bad as software one. On clang I didn't have that problem. Tinkering with msvc inline asm in godbolt helps to understand it I guess?

eLeCtrOssSnake avatar Aug 11 '22 09:08 eLeCtrOssSnake

> I remember tinkering with hw support for crc32c on x86(_64) and my inline asm implementation with cl.exe was as bad as software one. On clang I didn't have that problem.

That's a constructive point! Thanks:)

> Tinkering with msvc inline asm in godbolt helps to understand it I guess?

Oh, I had forgotten about godbolt, and I certainly didn't know it supports MSVC too! Thanks again.

sskras avatar Aug 12 '22 10:08 sskras

Could Zig start supporting Intel's assembler syntax like LLVM's clang does? You can see an example of how I use it here. These are the options I add to clang to compile the said syntax: clang ..... -fasm-blocks -masm=intel -fasm .....

isoux avatar Feb 16 '23 18:02 isoux

> Could Zig start supporting Intel's assembler syntax like LLVM's clang does? You can see an example of how I use it here. These are the options I add to clang to compile the said syntax: clang ..... -fasm-blocks -masm=intel -fasm .....

As far as I know, Intel asm syntax support is broken in LLVM, and GCC has an incomplete implementation too. I forget exactly what Intel asm support in GCC lacks, but I've stumbled on that issue and had to revert to AT&T syntax. If I remember correctly, it was dereferencing of inline asm constraints that doesn't work in GCC/LLVM.

eLeCtrOssSnake avatar Feb 18 '23 16:02 eLeCtrOssSnake

That's right man. I got an answer that can partially help me for Zig... Like this:

asm volatile(
  \\.intel_syntax noprefix
  \\mov rax, rbx
  \\lea rax, [rax + 10]
);
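
The same directive trick can be sketched in GCC-style C inline asm. With GCC the compiler's own output around the block is AT&T syntax, so this sketch switches back with `.att_syntax` before the block ends; whether Zig needs the same restore is not something this thread confirms. The name `add_ten`, the fixed rax/rbx register choice and the plain C fallback are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: returns x + 10 using Intel-syntax instructions. */
static uint64_t add_ten(uint64_t x)
{
#if defined(__x86_64__)
    uint64_t result;
    __asm__ (
        ".intel_syntax noprefix\n\t"
        "mov rax, rbx\n\t"
        "lea rax, [rax + 10]\n\t"
        ".att_syntax"            /* restore the default dialect */
        : "=a"(result)           /* rax */
        : "b"(x));               /* rbx */
    return result;
#else
    /* Plain C equivalent for other targets. */
    return x + 10;
#endif
}
```

Registers are named directly rather than through `%0`-style operands, which sidesteps the operand-substitution problems with Intel syntax mentioned above.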

isoux avatar Feb 18 '23 17:02 isoux