CanOpenSTM32 icon indicating copy to clipboard operation
CanOpenSTM32 copied to clipboard

Nested CO_LOCK_OD causing permanently disabled interrupts.

Open somlioy opened this issue 2 years ago • 17 comments

Hi

After the update in CanopenNode where OD is always locked on a SDO call (Always call OD lock macros on SDO operations), my code broke down; the interrupts isnt enabled again.

This change caused nested CO_LOCK_OD to happen and since there is only one variable for storing the primask in the CO_CANmodule_t struct, the interrupts will never be enabled again.

The result was that the code stopped working on a HAL_Delay (infinite while loop) in anothert part of the application since CO_LOCK/__disable_irq apparantly also disable systicks.

I noticed this during a eeprom save (0x1010) because I had CO_LOCK_OD/UNLOCK on the OD-entry to be saved in the "storeEeprom"-function.

Is there any better way to implement this? Ie. not disabling global interrupts, only the timer in charge of the interrupts and some other way of storing primask?

somlioy avatar Jun 20 '23 10:06 somlioy

Very technically, primask shall always be stored as local variable. Can you point out how your call stack looks like when this happens?

MaJerle avatar Jun 20 '23 11:06 MaJerle

Or maybe disable CO_CONFIG_FLAG_SDO_ALWAYS_LOCK_OD instead?

MaJerle avatar Jun 20 '23 11:06 MaJerle

Yeah primask stored in a local variable would be the optimal choice for such an event, altought I'm not sure how that would be done when using macros like that. A local variable inside the do-while would be limited to access within that scope.

See Call stack below (breakpoint at the point where I call CO_LOCK_OD (in storeEeprom). So CO_LOCK_OD is first called by CO_SDOserver_process() and then by me in storeEeprom() image

I've already updated CANopenNode to latest version, so CO_CONFIG_FLAG_SDO_ALWAYS_LOCK_OD is deprecated. See commit attached in first post.

Technically I guess it safe to assume storeEeprom() and restoreEeprom() never would be called from anything else than via SDO such that the LOCKs I have in store-/restoreEeprom safely can be removed.

However there's still the "issue" where systicks are disabled doing a LOCK causing HAL_Delay to be stuck in a while-loop. Encountered this during my first implementation of the eeprom-writing function where HAL_Delay was used to allow for eeprom-page writing to be finished, thought this was fixed by polling the eeprom wether it is ready or not (HAL_I2C_IsDeviceReady).

somlioy avatar Jun 20 '23 13:06 somlioy

I'm seeing this problem is well.

CO_LOCK_OD has a __disable_irq() function inside: #define CO_LOCK_OD(CAN_MODULE) \ do { \ (CAN_MODULE)->primask_od = __get_PRIMASK(); \ __disable_irq(); \ } while (0)

Where as CO_UNLOCK_OD only set the mask: #define CO_UNLOCK_OD(CAN_MODULE) __set_PRIMASK((CAN_MODULE)->primask_od)

Should there be an enable irq inside CO_UNLOCK_OD ?

skpang avatar Jul 07 '23 08:07 skpang

No. __set_PRIMASK will enable the interrupts, if these have been enabled before __disable_irq() has been called.

MaJerle avatar Jul 07 '23 08:07 MaJerle

Ok, somehow the __set_PRIMASK is not re-enabling the interrupts for me.

skpang avatar Jul 07 '23 08:07 skpang

This will happen if this happens:

CO_LOCK_OD();
CO_LOCK_OD(); <--- previous value is overwritten
CO_UNLOCK_OD(); <--- invalid previous value 
CO_UNLOCK_OD(); <--- invalid previous value

So is there a case when LOCK is called twice in a sequence

MaJerle avatar Jul 07 '23 08:07 MaJerle

Closing this issue due to a long period of inactivity. Please feel free to reopen it if this is still a problem.

HamedJafarzadeh avatar May 28 '25 10:05 HamedJafarzadeh

I hit this problem, too. By my analysis this does not happen because of nested locking, but because the mainline thread gets interrupted by the realtime thread while in the locking macro. Looking at the primask values in CO_CANmodule_t, this always happened in CO_LOCK_CAN_SEND. The mainline thread wants to lock can send:

#define CO_LOCK_CAN_SEND(CAN_MODULE)                                                                                   \
    do {                                                                                                               \
        (CAN_MODULE)->primask_send = __get_PRIMASK();                                                                  \
        // gets interrupted here \
        __disable_irq();                                                                                               \
    } while (0)
#define CO_UNLOCK_CAN_SEND(CAN_MODULE) __set_PRIMASK((CAN_MODULE)->primask_send)

The realtime thread then immediately locks the OD, so the whole realtime thread runs with disabled interrupts. Inside the realtime thread CO_LOCK_CAN_SEND is called and overwrites the primask stored in the mainline thread. On return, the mainline thread disables interrupts and on unlocking restores the value from the realtime thread (interrupts disabled). I fixed this by first writing primask to a local variable, then copy it to global after interrupts are disabled:

#define CO_LOCK_CAN_SEND(CAN_MODULE) \
    do { \
        uint32_t primask = __get_PRIMASK(); \
        __disable_irq(); \
        (CAN_MODULE)->primask_send = primask; \
    } while (0)

Hasn't happened since. Using CO_LOCK_CAN_SEND as an example here, this should be done in all locking macros.

jsbhth avatar May 28 '25 11:05 jsbhth

@jsbhth Thanks for the insights, is there any specific setup to follow to reproduce this issue ?

HamedJafarzadeh avatar May 28 '25 12:05 HamedJafarzadeh

I found this a few months ago, so I hope i get this right. What I did was remove the 1 ms intervall from the mainline thread (using a 1 Mhz timer as time source), having it constantly running. It will then reliably happen after some time, provided the device is sending a lot of messages. Sometimes after 5 minutes, sometimes after hours. I than changed the macro to something like

#define CO_LOCK_CAN_SEND(CAN_MODULE) \
    do { \
        uint32_t primask = __get_PRIMASK(); \
        (CAN_MODULE)->primask_send = primask; \
        __disable_irq(); \
        if (primask != (CAN_MODULE)->primask_send) { \
            __BKPT(0); \
        } \
    } while (0)

When the breakpoint is hit, you know the primask variable has been changed in another thread. Won't get you the location where it was changed, though.

jsbhth avatar Jun 04 '25 18:06 jsbhth

My comment is still valid. The primask state should always be stored in the local variable. This is the most correct approach to the problem. With the solution of @jsbhth it will get overwritten by another option, because there are interrupt enabled between getting the primask and setting the primask.

The ONLY correct way is to:

  • Use only local variables to keep the primask value, meaning that CO_LOCK_CAN_SEND is never called twice in a row. If this cannot be achieved, then use the local variable to store the primask, and convert the primask_send to an array that has a stack of all primasks ever used, plus a pointer that increases the array stack.
  • Get the primask, then lock interrupts, then restore the primask from local value (or from array as mentioned above)

I usually do a step further to ensure that you never forget or wrongly call the lock/unlock.

/* Critical section macros */
#define CRITICAL_SECT_DEFINE                 uint32_t primask = __get_PRIMASK();
#define CRITICAL_SECT_LOCK()                 __disable_irq()
#define CRITICAL_SECT_UNLOCK()               __set_PRIMASK(primask)

/* New defines to enter/exit critical section with interrupts */
#define CRITICAL_ENTER()                                                                                               \
    do {                                                                                                               \
        CRITICAL_SECT_DEFINE;                                                                                          \
        CRITICAL_SECT_LOCK();
#define CRITICAL_EXIT()                                                                                                \
    CRITICAL_SECT_UNLOCK();                                                                                            \
    }                                                                                                                  \
    while (0)

Then user must do:

do_something();
/* Here is the state whatever it is, interrupts might be protected, or not */
CRITICAL_ENTER();
/* Here you are IRQ protected....*/
CRITICAL_EXIT();
/* Here you are restored from state before

This allows to ensure there is a compilation error if you forget to unlock:

CRITICAL_ENTER();
/* somethin protected */
/* CRITICAL_EXIT(); */ /* Forget to exit is an error */

It also allows you to always use local variables if you call ENTER twice:

/* Original state */
CRITICAL_ENTER(); /* state before */
CRITICAL_ENTER(); /* Already locked */
CRITICAL_EXIT(); /* It was locked before, so it will stay locked */
CRITICAL_EXIT(); /* It will restore from the original state */

MaJerle avatar Jun 04 '25 18:06 MaJerle

The primask state should always be stored in the local variable. This is the most correct approach to the problem.

While I agree with this in general, I currently cannot think of a way how to implement this. Doing it as in your example breaks current CANopenNode (e.g. here) and likely also some user projects because it puts the critical section in a scope that gets destroyed when unlocking.

With the solution of @jsbhth it will get overwritten by another option, because there are interrupt enabled between getting the primask and setting the primask.

Could you explain or give an example how this could happen because I cannot imagine how. The primask is stored locally while interrupts are enabled. The global variable is only used after interrupts have been disabled and before primask is restored. In this period you can neither get interrupted, nor can you have interrupted other code having the variable in use. I admit that this does not support nested locking, but it was like this in the first place and i assumed it's just not supported. Also couldn't find any occurrence of that.

jsbhth avatar Jun 06 '25 09:06 jsbhth

@jsbhth the break you are referring in your link, if I copy the code, looks like:

/* write data to Object Dictionary */
CO_LOCK_OD(SDO_C->CANdevTx);
ODR_t odRet = SDO_C->OD_IO.write(&SDO_C->OD_IO.stream, buf, (OD_size_t)count, &countWritten);
CO_UNLOCK_OD(SDO_C->CANdevTx);

If CO_LOCK_OD is defined same as my CRITICAL_ENTER and CO_UNLOCK_OD is the same as my CRITICAL_EXIT, one guarantees that:

  • call to unlock wasn't forgotten
  • variable is local inside the do {} while statement

In your another example:

#define CO_LOCK_CAN_SEND(CAN_MODULE) \
    do { \
        uint32_t primask = __get_PRIMASK(); \
        (CAN_MODULE)->primask_send = primask; \ //this is the problem, what happens if interrupt happens just after this instruction
           \  //interrupt is here, and it overrides the ->primask_send for whatever reason
        __disable_irq(); \
        if (primask != (CAN_MODULE)->primask_send) { \
            __BKPT(0); \
        } \
    } while (0)

MaJerle avatar Jun 06 '25 10:06 MaJerle

@MaJerle

@jsbhth the break you are referring in your link, if I copy the code, looks like:

You have to look at the following lines. odRet is declared in the critical section and used after leaving it. But it doesn't exist anymore, because the scope was left.

In your another example:

This was intended to not work to show whats currently wrong. That was a response to @HamedJafarzadeh request for a way to reproduce this. Please look at the code in my first comment

jsbhth avatar Jun 06 '25 10:06 jsbhth

You have to look at the following lines. odRet is declared in the critical section and used after leaving it. But it doesn't exist anymore, because the scope was left.

You are right, the variable definition shall be done before the function call.

Your first comment is better, however it does not support the nested calls of functions which all disable the interrupts. This would fail:

void my_function() {
    CO_LOCK_OD(SDO_C->CANdevTx);
    do something()
    CO_UNLOCK_OD(SDO_C->CANdevTx);
}

//Actual code
CO_LOCK_OD(SDO_C->CANdevTx);
my_function()...
CO_UNLOCK_OD(SDO_C->CANdevTx);

MaJerle avatar Jun 06 '25 10:06 MaJerle

I've added this pull request: https://github.com/CANopenNode/CANopenNode/pull/582

MaJerle avatar Jun 06 '25 17:06 MaJerle