svd2rust icon indicating copy to clipboard operation
svd2rust copied to clipboard

Register access is not inlined during compile

Open agalakhov opened this issue 6 years ago • 11 comments

Hi,

I recently found out that the code generated by svd2rust (the one of stm32f1 crate) is very slow if compiled for STM32F103 as --release (about five times slower than expected). Disassembling the resulting code has shown that:

  • All Deref implementations for all registers are called as functions and not inlined.
  • All .ptr() methods of all registers are called as functions and not inlined.

Adding #[inline(always)] manually to these methods solved the issue (I got correct timings and well-optimized assembly code).

Perhaps svd2rust should generate #[inline] or even #[inline(always)] attribute for such calls, as otherwise the compiler sometimes fails to optimize here.

Regards, Alexey

agalakhov avatar Mar 27 '19 19:03 agalakhov

That is very odd. Can you provide some assembly for comparison?

therealprof avatar Mar 27 '19 19:03 therealprof

Of course. Relevant part of the code being compiled:

let mut led = gpioc.pc13.into_push_pull_output(&mut gpioc.crh);
loop {
    led.set_high();
    led.set_low();
}

Compiling with rustc 1.33 (stable) for thumbv7m-none-eabi with --release option. The following assembler code is generated:

80001f6:       f000 fa1d       bl      8000634 <_ZN7stm32f19stm32f1035GPIOC3ptr17he8d61d57514c9137E>
80001fa:       6105            str     r5, [r0, #16]
80001fc:       f000 fa1a       bl      8000634 <_ZN7stm32f19stm32f1035GPIOC3ptr17he8d61d57514c9137E>
8000200:       6104            str     r4, [r0, #16]
8000202:       e7f8            b.n     80001f6 <main+0xc6>

In more complex examples (involving timer delays) I also observed calls to deref instead of ptr. The today's nightly version generates the same code.

Adding #[inline] to ptr() like that:

impl GPIOC {
    #[doc = r" Returns a pointer to the register block"]
    #[inline]
     pub fn ptr() -> *const gpioa::RegisterBlock { 
        1073811456 as *const _
    }
}

resulted in the following machine code:

80001fa:       6020            str     r0, [r4, #0]
80001fc:       6025            str     r5, [r4, #0]
80001fe:       e7fc            b.n     80001fa <main+0xca>

which is the expected one. For more complex example I had to add #[inline] to Deref as well.

In my case it was enough to add just #[inline], there seems to be no need to use #[inline(always)].

agalakhov avatar Mar 28 '19 01:03 agalakhov

Upd: enabling lto fixes the problem as well, but I believe it is expected to work without lto too.

agalakhov avatar Mar 28 '19 01:03 agalakhov

The same thing happens with register bit set() and clear(): they are called as separate functions. I don't think that it's ok, because several bit-sets can be merged into one OR operation.

Disasm avatar Mar 28 '19 08:03 Disasm

@Disasm merging register access is generally not a good idea since the hardware may behave differently depending on the order of operations. I.e. pulling data high, then pulling clock high is not the same as pulling data and clock high together. Changing multiple bits at once must be done in a single function call.

agalakhov avatar Mar 28 '19 10:03 agalakhov

@agalakhov Yes, but I am talking about situation like this: https://github.com/mvirkkunen/stm32f103xx-usb/blob/57d23751367461bec5f39322727bdd65e5c2aa30/src/bus.rs#L162-L169

Disasm avatar Mar 28 '19 10:03 Disasm

Which optimization level is this happening with? I believe --release uses opt-level=3 by default, which optimizes for speed. opt-level=s or z should optimize for size. (not saying that these call shouldn't be inlined at opt-level=3, just noting that it might be worth playing around with this).

hannobraun avatar Mar 28 '19 12:03 hannobraun

@hannobraun it is opt-level=3 --release with enable_lto=false. While it is "optimizing by speed", it has problems exactly with the speed. Enabling LTO fixes the issue, but unfortunately LTO is not always applicable and off by default. I believe the reason is, these function calls reside in separate compiler units, so the compiler just does not know that they could be inlined unless we add an inline hint. @Disasm makes sense, you're observing the same issue.

agalakhov avatar Mar 28 '19 15:03 agalakhov

I'm running into the same issue. The resulting (optimized) binary contains lots of ptr and deref and quite a few other functions, that should've been inlined.

Rustc doesn't inline functions across crate boundaries, unless there are generics involved, or #[inline] was specified.

m-ou-se avatar Oct 09 '19 08:10 m-ou-se

I think there are no non-#[inline] functions left anymore after #389 and #390. All fns inside quote!{}s are now preceded by #[inline].

m-ou-se avatar Oct 09 '19 08:10 m-ou-se

Related pull request https://github.com/rust-embedded/svd2rust/pull/515

rcls avatar May 07 '21 07:05 rcls