
Arm64 tick timer fixes

Open jlaitine opened this issue 1 year ago • 5 comments

Summary

It has been observed in our testing on the imx9 platform that nxsem_tickwait_uninterruptible randomly wakes up too early. As this function is commonly used in device drivers as a timeout timer, the early wakeups cause premature timeouts, resulting in many errors. Further investigation revealed that the tick timer in arch_alarm/oneshot_callback sometimes advances two ticks in a single call. This causes the wdog timer to time out immediately even when it was supposed to wait for one more tick.

The issue was root-caused to the following:

  • The arm64 tick timer drifts because the next compare value is set only after the tick has been handled, based on the timer value at that moment.
  • The current tick retrieved from the arm64 timer changes during the tick period, depending on the tick start counter value, due to division rounding.
  • A signal can wake up nxsem_tickwait_uninterruptible one tick too early.
  • arch_alarm/oneshot_callback doesn't read the current tick and start the new tick atomically with regard to the free-running HW timer.

This PR proposes a solution to these issues by

  • aligning the tick start to a value that is evenly divisible by the tick period.
  • fixing the signal wakeup issue for nxsem_tickwait_uninterruptible.

Aligning the tick start fixes both the tick timer drift and the current tick retrieval issues. It then becomes unnecessary to add an interface for the oneshot alarm to support an atomic get-and-start of the tick timer.

This patchset also includes a cleanup of the arch_alarm oneshot_callback. It shouldn't fix any bug per se, but since that patch was included in all of our testing, it is included here as well. What it fixes in theory is that oneshot_callback could then be called more frequently than the tick period without errors — but nothing relies on this at the moment.

Impact

Fixes random errors in several device drivers on the imx9 platform during longer test runs.

Testing

Tested on i.MX93 platform with some added debug instrumentation.

jlaitine avatar Sep 27 '24 10:09 jlaitine

[Experimental Bot, please feedback here]

Let's break down this pull request (PR) against the NuttX requirements:

Summary:

  • Why? This PR addresses a critical issue where the nxsem_tickwait_uninterruptible function, crucial for device driver timeouts, prematurely wakes up on the imx9 platform.
  • What? The PR tackles the root causes of this issue: timer drift in the arm64 architecture, inconsistent tick retrieval, early signal wake-ups, and non-atomic tick handling in arch_alarm/oneshot_callback.
  • How? The PR fixes the timer drift and tick retrieval by aligning the tick start. It also resolves the signal wake-up issue and cleans up the oneshot_callback.
  • Issues: The PR should ideally reference a specific NuttX issue number where this problem is discussed (e.g., "Fixes #1234").

Impact:

  • New Feature: NO
  • Existing Feature Changed: YES, the timing behavior of nxsem_tickwait_uninterruptible and the underlying timer mechanism are fixed.
  • User Impact: Potentially YES. While this fixes bugs, users relying on the previous (incorrect) timing behavior might experience changes. This should be clearly documented.
  • Build Impact: Possibly YES. If the cleanup in oneshot_callback affects its interface, there might be build implications for code using it. This needs clarification.
  • Hardware Impact: YES, specifically for the imx9 platform where the issue manifests.
  • Documentation Impact: YES, documentation for nxsem_tickwait_uninterruptible and potentially oneshot_callback should be updated to reflect the fixes and any behavioral changes.
  • Security Impact: Possibly YES. Timing issues can sometimes have security implications. While the PR aims to fix a problem, it's crucial to analyze if the changes introduce any new vulnerabilities.
  • Compatibility Impact: Potentially YES, as mentioned under User Impact. Backwards compatibility needs careful consideration.
  • Anything else? The PR mentions additional debug instrumentation used for testing. Consider removing this instrumentation or clearly marking it as such before merging.

Testing:

  • Insufficient: The PR lacks details about the build hosts and target configurations used for testing. Provide specific information like OS, CPU architecture, compiler, and the exact imx9 board and configuration.
  • Logs Missing: The PR needs to include actual testing logs from before and after the changes. These logs are essential for reviewers to understand the problem and verify the fix.

Conclusion:

This PR addresses a significant issue but requires further refinement to meet NuttX's standards:

  1. Link to NuttX Issue: Reference the relevant issue.
  2. Clarify Impact: Provide specifics on build, user, and compatibility impacts.
  3. Detail Testing: Include comprehensive testing information and logs.
  4. Security Review: Given the nature of the fix, a security review is recommended.

By addressing these points, the PR will be well-positioned for review and merging into NuttX.

nuttxpr avatar Sep 27 '24 10:09 nuttxpr

However, we were unable to reproduce the issue on qemu/x86_64 and qemu/armv8a (commit d47076). Here is the test code:


#include <nuttx/config.h>

#include <fcntl.h>
#include <stdio.h>
#include <semaphore.h>

#include <nuttx/clock.h>
#include <nuttx/semaphore.h>  /* nxsem_tickwait(), nxsem_tickwait_uninterruptible() */

/****************************************************************************
 * Public Functions
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  sem_t sem;
  clock_t tick;
  uint32_t wait;

  /* Table of tick counts to wait for */

  static const uint32_t waittick_tbl[] =
  {
    0, 1, 10, 100, 1000, 10000, 100000, 1000000
  };

  sem_init(&sem, 0, 0);

  for (uint32_t i = 0; i < sizeof(waittick_tbl) / sizeof(uint32_t); i++)
    {
      wait = waittick_tbl[i];

      tick = clock_systime_ticks();
      nxsem_tickwait(&sem, wait);
      tick = clock_systime_ticks() - tick;
      printf("tick1 %u : %lu\n", wait, (unsigned long)tick);
      /* assert(wait <= tick); */

      tick = clock_systime_ticks();
      nxsem_tickwait_uninterruptible(&sem, wait);
      tick = clock_systime_ticks() - tick;
      printf("tick2 %u : %lu\n", wait, (unsigned long)tick);
      /* assert(wait <= tick); */
    }

  return 0;
}

Fix-Point avatar Sep 29 '24 01:09 Fix-Point

I actually never tested whether this is reproducible in qemu; that would depend on the timer accuracy in qemu/linux. It is really easy to reproduce in real hw within approx 10 minutes of running. I can check the qemu reproducibility tomorrow. Thanks!

Oh! And the test above wouldn't even show the issue; it does count the correct number of ticks, just too fast (sometimes two ticks at once). So you need to measure the real time spent over the tickwait. So take the clock_gettime or up_timer_gettime before and after the tickwait and check that the correct amount of time has passed, not the amount of ticks :)

jlaitine avatar Sep 29 '24 06:09 jlaitine

> I actually never tested whether this is reproducible in qemu; that would depend on the timer accuracy in qemu/linux. It is really easy to reproduce in real hw within approx 10 minutes of running. I can check the qemu reproducibility tomorrow. Thanks!
>
> Oh! And the test above wouldn't even show the issue; it does count the correct number of ticks, just too fast (sometimes two ticks at once). So you need to measure the real time spent over the tickwait. So take the clock_gettime or up_timer_gettime before and after the tickwait and check that the correct amount of time has passed, not the amount of ticks :)

Thank you for providing more information. However, after changing to the clock_gettime, I still cannot reproduce the issue on qemu/aarch64. I will try to conduct further investigations on a real armv8 machine.

Here is my test code:

#include <nuttx/config.h>

#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <semaphore.h>
#include <time.h>

#include <nuttx/arch.h>
#include <nuttx/clock.h>
#include <nuttx/semaphore.h>

/* Return the elapsed time in nanoseconds.  int64_t avoids overflow on
 * platforms where time_t is 32-bit.
 */

static inline int64_t diff_time(struct timespec *ts_start,
                                struct timespec *ts_end)
{
  return (ts_end->tv_sec - ts_start->tv_sec) * NSEC_PER_SEC +
         (ts_end->tv_nsec - ts_start->tv_nsec);
}

/****************************************************************************
 * Public Functions
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  sem_t sem;
  clock_t tick;
  uint32_t wait;
  int64_t diff;
  struct timespec ts_start;
  struct timespec ts_end;

  /* Table of tick counts to wait for */

  static const uint32_t waittick_tbl[] =
  {
    0, 1, 10, 100, 1000, 10000, 100000, 1000000
  };

  sem_init(&sem, 0, 0);

  for (uint32_t i = 0; i < 10000000; i++)
    {
      wait = waittick_tbl[i % 2];

      tick = clock_systime_ticks();
      clock_gettime(CLOCK_REALTIME, &ts_start);
      nxsem_tickwait(&sem, wait);
      clock_gettime(CLOCK_REALTIME, &ts_end);
      tick = clock_systime_ticks() - tick;
      diff = diff_time(&ts_start, &ts_end);
      printf("tick1 %u : %lu, %lld ns\n", wait, (unsigned long)tick,
             (long long)diff);
      assert(wait <= tick && diff >= 0);

      tick = clock_systime_ticks();
      clock_gettime(CLOCK_REALTIME, &ts_start);
      nxsem_tickwait_uninterruptible(&sem, wait);
      clock_gettime(CLOCK_REALTIME, &ts_end);
      tick = clock_systime_ticks() - tick;
      diff = diff_time(&ts_start, &ts_end);
      printf("tick2 %u : %lu, %lld ns\n", wait, (unsigned long)tick,
             (long long)diff);
      assert(wait <= tick && diff >= 0);
    }

  return 0;
}

Fix-Point avatar Sep 29 '24 12:09 Fix-Point

The issue also reproduces in qemu, at least with this simple test.

#include <nuttx/nuttx.h>
#include <nuttx/config.h>

#include <assert.h>
#include <inttypes.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>

#include <nuttx/semaphore.h>

#define read_sysreg(reg)                         \
  ({                                             \
    uint64_t __val;                              \
    __asm__ volatile ("mrs %0, " STRINGIFY(reg)  \
                    : "=r" (__val) :: "memory"); \
    __val;                                       \
  })

static inline uint64_t arm64_arch_timer_count(void)
{
  return read_sysreg(cntvct_el0);
}

static inline uint64_t arm64_arch_timer_get_cntfrq(void)
{
  return read_sysreg(cntfrq_el0);
}

static inline uint64_t time_us(void)
{
  return arm64_arch_timer_count() * 1000000 / arm64_arch_timer_get_cntfrq();
}

static inline uint64_t minutes_in_ticks(uint64_t mins)
{
  return mins * 60 * 1000000 / CONFIG_USEC_PER_TICK;
}

int main(int argc, FAR char *argv[])
{
  sem_t sem;
  uint64_t t1;
  uint64_t t2;

  sem_init(&sem, 0, 0);

  for (int64_t i = 0; i < minutes_in_ticks(10); i++)
    {
      t1 = time_us();
      nxsem_tickwait(&sem, 1);
      t2 = time_us();

      if (i % 100 == 0)
        {
          printf("%" PRIu64 " %" PRIu64 " %" PRIu64 "\n", t1, t2, t2 - t1);
        }

      if (t2 - t1 < CONFIG_USEC_PER_TICK)
        {
          printf("ERROR: slept %" PRIu64 " us instead of minimum %u\n",
                 t2 - t1, CONFIG_USEC_PER_TICK);
        }

      assert(t2 - t1 >= CONFIG_USEC_PER_TICK);
    }

  return 0;
}


jlaitine avatar Sep 29 '24 14:09 jlaitine

I added minimal test code above, which can be used both on a real target and in qemu; it typically shows the major issue within 1 minute.

jlaitine avatar Sep 30 '24 07:09 jlaitine

@Fix-Point Were you able to reproduce the issue with the code I pasted above?

jlaitine avatar Sep 30 '24 07:09 jlaitine