libcoap icon indicating copy to clipboard operation
libcoap copied to clipboard

Several bugs in the lwIP TCP context

Open Kev-Jiang opened this issue 4 months ago • 26 comments

Environment

  • Build System: [Make]
  • Operating System: [FreeRTOS]
  • Operating System Version: [ ]
  • Hosted Environment: [LwIP]

libcoap Configuration Summary

#define WITH_LWIP 1 #define COAP_IPV4_SUPPORT 1 #define COAP_CONSTRAINED_STACK 1 #define COAP_DISABLE_TCP 0 #define COAP_ASYNC_SUPPORT 1 #define COAP_WITH_OBSERVE_PERSIST 0 #define COAP_WS_SUPPORT 0 #define COAP_Q_BLOCK_SUPPORT 0 #define COAP_THREAD_SAFE 1 #define COAP_THREAD_RECURSIVE_CHECK 0

Problem Description

Issue 1: Timeout Handling Problem in coap_io_process

The timeout logic in coap_io_process (located in coap_io_lwip.c) appears problematic. Here's the issue:

  • sys_timeout() is used to set a timeout callback and send a semaphore signal.
  • sys_arch_sem_wait() blocks for the timeout period when no signals are available.

Observed Behavior: Both timeout and semaphore signal completion can happen simultaneously, causing the following call to coap_io_process() to immediately return without performing the intended timeout handling.

Current Code Snippet

if (timeout) {
  sys_timeout(timeout, coap_io_process_timeout, context);
  context->timer_configured = 1;
}

#if NO_SYS == 0
} else {
  coap_lock_callback_release(sys_arch_sem_wait(&coap_io_timeout_sem, timeout),
                             return 0);
#endif /* NO_SYS == 0 */
}

Example:

If coap_io_process() is invoked repeatedly (e.g., every 5s in a loop):

  1. sys_arch_sem_wait() times out after 5 seconds and returns.
  2. coap_io_process_timeout() simultaneously signals the semaphore.
  3. Re-calling coap_io_process() immediately after uses the prior semaphore signal and returns instantly.
while (1) {
    coap_io_process(ctx, 5000); // 5-second timeout
}

Suggested Fix: Remove the use of sys_timeout() and rely solely on sys_arch_sem_wait.


Issue 2: Non-Immediate Return for TCP Events

When using TCP transport, coap_io_process() does not immediately process incoming events. UDP transport (coap_dgrm_lwip.c) correctly utilizes calls like sys_sem_signal to notify event readiness:

#if NO_SYS == 0
  sys_sem_signal(&coap_io_timeout_sem);
#endif /* NO_SYS == 0 */

However, TCP-related code in coap_strm_lwip.c lacks these sys_sem_signal calls. At least six places may need signaling:

  1. do_tcp_err
  2. coap_tcp_recv
  3. do_tcp_connected
  4. do_tcp_accept (plus other relevant return points in the TCP flow)

Suggested Fix:

Add sys_sem_signal calls at these critical return points to ensure proper TCP event notification.


Issue 3: TCP Window Not Restored After Data Read

Failure to restore the TCP window after data read leads to issues such as:

  • When the remote endpoint closes the connection, tcp_close_shutdown() erroneously checks if pcb->rcv_wnd != TCP_WND_MAX(pcb) and leaves the pcb unfreed.
  • This issue limits the number of active PCBs once the limit is reached, preventing new connections.

Areas impacted:

  1. coap_tcp_recv
  2. coap_socket_read

Suggested Fix:

Call tcp_recved() to restore the TCP window.

Example Fix in coap_tcp_recv:

} else if (err != ERR_OK) {
  /* cleanup, for unknown reason */
  if (p != NULL) {
    pbuf_free(p);
    tcp_recved(sock->tcp_pcb, p->tot_len);
  }
}

Example Fix in coap_socket_read:

ssize_t
coap_socket_read(coap_socket_t *sock, uint8_t *data, size_t data_len) {
  if (sock->p) {
    if (data_len < sock->p->len) {
      uint8_t *ptr = (uint8_t *)sock->p->payload;

      /* Handle partial read */
      memcpy(data, sock->p->payload, data_len);
      tcp_recved(sock->tcp_pcb, data_len);
      sock->p->payload = &ptr[data_len];
      sock->p->len -= data_len;
      return data_len;
    } else {
      data_len = sock->p->len;
      memcpy(data, sock->p->payload, sock->p->len);
      pbuf_free(sock->p);
      tcp_recved(sock->tcp_pcb, sock->p->len);
      sock->p = NULL;
      return data_len;
    }
  }
  return 0;
}

Issue 4: Improper TCP Connection Closure in coap_lwip_tcp_close

Closing a TCP connection via coap_lwip_tcp_close (located in coap_strm_lwip.c) does not clear the TCP hooks (tcp_err, tcp_arg).

This leaves callbacks like do_tcp_err active post-session cleanup, potentially causing crashes.

Suggested Fix:

  1. Clear tcp_err and tcp_arg.
  2. Validate the session before processing in do_tcp_err.

Example Fix for coap_lwip_tcp_close:

static void
coap_lwip_tcp_close(void *ctx) {
  struct tcp_pcb *tcp_pcb = (struct tcp_pcb *)ctx;

  if (tcp_pcb) {
    tcp_recv(tcp_pcb, NULL);
    tcp_err(tcp_pcb, NULL);
    tcp_arg(tcp_pcb, NULL);
    tcp_close(tcp_pcb);
  }
}

Safety Check Example for do_tcp_err:

static void
do_tcp_err(void *arg, err_t err) {
  coap_session_t *session = (coap_session_t *)arg;

  if (session) {
    (void)err;

    coap_lock_lock(return);
    coap_handle_event_lkd(session->context, COAP_EVENT_TCP_FAILED, session);
    /*
     * As per TCP documentation, the associated PCB is already freed 
     * when this callback is called. Prevent double-free during cleanup.
     */
    session->sock.tcp_pcb = NULL;
    coap_session_disconnected_lkd(session, COAP_NACK_NOT_DELIVERABLE);

#if NO_SYS == 0
    sys_sem_signal(&coap_io_timeout_sem);
#endif 
    coap_lock_unlock();
  }
}

Conclusion

In an LwIP environment, issues observed in libcoap with TCP transport include:

  1. Timeout handling logic conflicts.
  2. Lack of immediate event processing for TCP.
  3. TCP window restoration issues.
  4. Improper TCP connection cleanup.

Suggested fixes should address these issues, improving libcoap's robustness and compatibility in LwIP environments.

Kev-Jiang avatar Dec 06 '25 16:12 Kev-Jiang

Thanks for raising and detailing these issues. I will take a look at them and get fixes out as appropriate.

mrdeep1 avatar Dec 06 '25 16:12 mrdeep1

Another question: Why use LwIP Raw API instead of Socket API? Using Socket API might improve compatibility and portability.

Kev-Jiang avatar Dec 06 '25 16:12 Kev-Jiang

Why use LwIP Raw API instead of Socket API?

Good question for which I don't have a proper answer. libcoap had LwIP support before I started to work on it and this was based on the Raw API.. Certainly non-blocking i/o is required and the libcoap design is to be using callbacks whenever any TCP/IP data arrives, which may also have to pass through a DTLS layer (currently for TinyDTLS).

mrdeep1 avatar Dec 06 '25 16:12 mrdeep1

I feel that libcoap has many issues when running on lwIP, especially with TCP. Although I am working with common usage scenarios, it seems quite challenging to use it directly. However, if these four issues are resolved, basic communication should be able to function properly.

Kev-Jiang avatar Dec 07 '25 13:12 Kev-Jiang

I have taken most of the suggestions, am running some soak tests and tracking down some anomalies.

mrdeep1 avatar Dec 08 '25 11:12 mrdeep1

Issue Summary: I discovered a new lwip issue where CoAP experiences probabilistic crashes during TCP connection establishment.

Log Fragment:

tcp_bind: bind to port 49916\n
tcp_connect to port 5683\n
tcp_enqueue_flags: queueing 10230066:10230067 (0x2)\n
tcp_output: snd_wnd 12288, cwnd 1, wnd 1, effwnd 0, seq 10230066, ack 10230066\n
tcp_slowtmr: processing active pcb\n
tcp_slowtmr: polling application\n
tcp_output: snd_wnd 12288, cwnd 1, wnd 1, effwnd 0, seq 10230066, ack 10230066\n
tcp_output: snd_wnd 12288, cwnd 1, wnd 1, effwnd 0, seq 10230066, ack 10230066, i 0\n
tcp_output_segment: 10230066:10230066\n
ethernet_output: sending packet 0x4229748\n
lwip_protocol_send L:426 run
send len = $ 58 $\r\n
tcp_output: snd_wnd 12288, cwnd 1, wnd 1, effwnd 0, seq 10230066, ack 10230066, i 0\n
tcp_output_segment: 10230066:10230066\n
ethernet_output: sending packet 0x4229748\n
lwip_protocol_send L:426 run
send len = $ 58 $\r\n
exception_mode:0xec00ec\r\n

Problem Analysis: Initial Call: tcp_connect invokes tcp_output Interrupt Occurrence: During tcp_output execution, the tcp_slowtmr timer interrupt triggers Reentrant Call: Within tcp_slowtmr's polling process, tcp_output is called again (via TCP_EVENT_POLL -> tcp_output(prev)) State Conflict: Both tcp_output calls operate on the same PCB, causing data structure state inconsistency Crash: When returning to the first tcp_output call, the PCB state has been corrupted, resulting in an exception

Kev-Jiang avatar Dec 08 '25 13:12 Kev-Jiang

Thanks for this. Not currently sure whether this is a LwIP library specific issue, or because it is caused by a timer condition set up by libcoap.

mrdeep1 avatar Dec 08 '25 14:12 mrdeep1

Issues 2, 3 and 4 addressed in #1827. Please check if this works for you.

I do not think 1 is an issue at this point as just an additional call is made to coap_io_process() (which will do nothing).

With your new issue, what is your input function when you are doing a netif_add() ? I found my examples were incorrect and we had two threads tapif_thread and tcpip_thread doing the tcpip work.

mrdeep1 avatar Dec 08 '25 20:12 mrdeep1

For new issue, I need to first emphasize and add a detail: this issue is probabilistic in nature. I later conducted a 4-hour stress test and successfully reproduced the problem. The reproduction process is as follows:

The libcoap server works in call-home mode. After the session is established and communication is normal, every 30 seconds the server side actively calls coap_session_disconnected (connection disconnects and session is subsequently released). Then after another 10 seconds, coap_new_client_session is called again to establish a TCP connection. This process is repeated continuously during stress testing, and there's a probability that during the execution of tcp_output in tcp_connect, it gets preempted by tcp_slowtmr, ultimately leading to a crash.

Regarding the issue you're concerned about: "With your new issue, what is your input function when you are doing a netif_add()?"

During netif_init, I set tcpip_input for netif_add and also set the NETIF_FLAG_ETHARP flag.

Kev-Jiang avatar Dec 09 '25 07:12 Kev-Jiang

Regarding your modification in https://github.com/obgm/libcoap/pull/1827, there's one part I don't quite understand: why do you need to increment the window by 1 (tcp_recved(sock->tcp_pcb, 1);) when p==NULL in coap_tcp_recv? I haven't seen similar usage in lwip's apps.

static err_t
coap_tcp_recv(void *arg, struct tcp_pcb *tpcb, struct pbuf *p, err_t err) {
  coap_session_t *session = (coap_session_t *)arg;
  coap_socket_t *sock = session ? &session->sock : NULL;
  coap_tick_t now;

  (void)tpcb;
  if (!session)
    return ERR_ARG;

  if (p == NULL) {
    /* remote host closed connection */
    tcp_recved(sock->tcp_pcb, 1); /* why do this? */
    tcp_close(sock->tcp_pcb);
    tcp_arg(sock->tcp_pcb, NULL);
    tcp_recv(sock->tcp_pcb, NULL);

Kev-Jiang avatar Dec 09 '25 07:12 Kev-Jiang

Yes, you are right, tcp_recved(sock->tcp_pcb, 1); should not be there. I had added it in when troubleshooting when two threads were incorrectly being used to handle TCP/IP I/O and had not removed it. Updated code pushed to #1827.

mrdeep1 avatar Dec 09 '25 10:12 mrdeep1

For new issue, I need to first emphasize and add a detail

Thanks for providing further information. Not sure why LwIP is not handling tcp_slowtmr in active tcp_connect. I will look further into this,

mrdeep1 avatar Dec 09 '25 10:12 mrdeep1

If you make something similar to the following change to the LwIP code, then the logging will report the thread id so we can see if there is a thread clash usage when you get the failure.

diff --git a/src/include/lwip/arch.h b/src/include/lwip/arch.h
index c659a80c..de021c27 100644
--- a/src/include/lwip/arch.h
+++ b/src/include/lwip/arch.h
@@ -78,7 +78,7 @@
  * systems, this should be defined to something less resource-consuming.
  */
 #ifndef LWIP_PLATFORM_DIAG
-#define LWIP_PLATFORM_DIAG(x) do {printf x;} while(0)
+#define LWIP_PLATFORM_DIAG(y) do {printf("%lx: ", xTaskGetCurrentTaskHandle()) ; printf y;} while(0)
 #include <stdio.h>
 #include <stdlib.h>
 #endif

mrdeep1 avatar Dec 09 '25 10:12 mrdeep1

I discovered another probabilistic issue: Although coap_lwip_tcp_close already contains tcp_err(tcp_pcb, NULL); tcp_arg(tcp_pcb, NULL); But when the server is in call-home mode, if it fails to connect from the beginning, 6 seconds after the session closes, lwip's slowtmr still triggers do_tcp_err (inferred from the logs), and subsequently causes an unlock assertion crash in coap_io_process.

Below is a log excerpt:

14:17:46.178684        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425f10c: created outgoing session\n
14:17:51.179809        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : timeout waiting for first response\n
14:17:51.179824        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: delayed\n
14:17:51.179833        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n
14:17:52.488122        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:52.488135        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: not transmitted after disconnect\n
14:17:52.488147        v:Reliable c:Ping {} [ ]\n
14:17:52.488154        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session disconnected (COAP_NACK_NOT_DELIVERABLE)\n
14:17:52.488163        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:52.488175        DEBG:***EVENT: COAP_EVENT_SESSION_FAILED\n
14:17:52.488190        DEBG:***EVENT: COAP_EVENT_SERVER_SESSION_DEL\n
14:17:52.488203        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425bfbc: closed\
14:17:58.441081        etharp_timer\n
14:17:58.476451        tcp_slowtmr: processing active pcb\n
14:17:58.476478        tcp_slowtmr: max SYN retries reached\n
14:17:58.476492        tcp_pcb_purge\n
14:17:58.476503        tcp_pcb_purge: data left on ->unacked\n
14:17:58.476519        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:58.476581        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: not transmitted after disconnect\n
14:17:58.476643        v:Reliable c:Ping {} [ ]\n
14:17:58.476680        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session disconnected (COAP_NACK_NOT_DELIVERABLE)\n
14:17:58.476737        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:58.476801        DEBG:***EVENT: COAP_EVENT_SESSION_FAILED\n
14:17:58.495531        exception_mode:0xec00ec\r\n
14:17:58.998484        assert failed: coap_thread_pid == global_lock.pid, file: libcoap/src/coap_threadsafe.c, line: 69

Kev-Jiang avatar Dec 09 '25 10:12 Kev-Jiang

If you make something similar to the following change to the LwIP code, then the logging will report the thread id so we can see if there is a thread clash usage when you get the failure.

diff --git a/src/include/lwip/arch.h b/src/include/lwip/arch.h
index c659a80c..de021c27 100644
--- a/src/include/lwip/arch.h
+++ b/src/include/lwip/arch.h
@@ -78,7 +78,7 @@
  * systems, this should be defined to something less resource-consuming.
  */
 #ifndef LWIP_PLATFORM_DIAG
-#define LWIP_PLATFORM_DIAG(x) do {printf x;} while(0)
+#define LWIP_PLATFORM_DIAG(y) do {printf("%lx: ", xTaskGetCurrentTaskHandle()) ; printf y;} while(0)
 #include <stdio.h>
 #include <stdlib.h>
 #endif

Thank you, I will try to reproduce it later.

Kev-Jiang avatar Dec 09 '25 10:12 Kev-Jiang

All the changes so far have been consolidated into #1812. I will look into the above.

mrdeep1 avatar Dec 09 '25 11:12 mrdeep1

Re https://github.com/obgm/libcoap/issues/1826#issuecomment-3631578485, I think that this is missing

diff --git a/src/coap_strm_lwip.c b/src/coap_strm_lwip.c
index 7b9879be..d6b50203 100644
--- a/src/coap_strm_lwip.c
+++ b/src/coap_strm_lwip.c
@@ -420,6 +420,9 @@ coap_lwip_endpoint_tcp_close(void *ctx) {
   struct tcp_pcb *tcp_pcb = (struct tcp_pcb *)ctx;

   if (tcp_pcb) {
+    tcp_err(tcp_pcb, NULL);
+    tcp_arg(tcp_pcb, NULL);
+    tcp_recv(tcp_pcb, NULL);
     tcp_close(tcp_pcb);
   }
 }

What puzzles me is this has a short time difference between becoming a server and then TCP fails.

14:17:51.179833        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n
14:17:52.488122        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n

and appears to be the result of a call to do_tcp_err(). I am unable to reproduce this.

It also appears that you are not invoking coap_context_set_session_reconnect_time(). Correct?

mrdeep1 avatar Dec 09 '25 16:12 mrdeep1

Re #1826 (comment), I think that this is missing

diff --git a/src/coap_strm_lwip.c b/src/coap_strm_lwip.c
index 7b9879be..d6b50203 100644
--- a/src/coap_strm_lwip.c
+++ b/src/coap_strm_lwip.c
@@ -420,6 +420,9 @@ coap_lwip_endpoint_tcp_close(void *ctx) {
   struct tcp_pcb *tcp_pcb = (struct tcp_pcb *)ctx;

   if (tcp_pcb) {
+    tcp_err(tcp_pcb, NULL);
+    tcp_arg(tcp_pcb, NULL);
+    tcp_recv(tcp_pcb, NULL);
     tcp_close(tcp_pcb);
   }
 }

Oh, regarding coap_lwip_endpoint_tcp_close, I think you are right. This part is indeed missing in https://github.com/obgm/libcoap/pull/1827. However, I noticed that coap_lwip_endpoint_tcp_close and coap_lwip_tcp_close should have the same behavior, but I'm not sure why they need to be separated into two functions. Additionally, I noticed that in coap_session_set_type_server_lkd, an endpoint is created and stored in sock (ep->sock.endpoint = ep;), but in coap_session_set_type_client_lkd, the endpoint in sock is not cleared (session->sock = session->endpoint->sock;). I'm not sure if this could cause some potential issues, and whether there are other unconsidered matters that need to be handled when switching between client and server.

What puzzles me is this has a short time difference between becoming a server and then TCP fails.

14:17:51.179833        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n
14:17:52.488122        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n

I believe this is because the client is not listening at that moment, so the server immediately receives an RST, which is exactly the scenario I'm testing.

and appears to be the result of a call to do_tcp_err(). I am unable to reproduce this.

It also appears that you are not invoking coap_context_set_session_reconnect_time(). Correct?

You are correct. I am currently not enabling reconnect time because I had already implemented my own reconnection logic at the upper layer before you solved the reconnection issue in https://github.com/obgm/libcoap/pull/1812. Another consideration is that, from what I recall, libcoap's native reconnection mechanism doesn't seem to notify the upper layer during the reconnection process - it only notifies the upper layer when the reconnection attempt has a result (am I correct about this?). If the upper layer cannot know that the lower layer is reconnecting, my upper-layer business logic might encounter certain exceptions (or it might not, but this risk exists). So I temporarily haven't enabled libcoap's underlying reconnection mechanism. If libcoap notifies the upper layer when reconnection is in progress, I will reconsider switching to libcoap's native reconnection mechanism in the future.

Kev-Jiang avatar Dec 10 '25 02:12 Kev-Jiang

I noticed that coap_lwip_endpoint_tcp_close and coap_lwip_tcp_close should have the same behavior

Good spot - the difference is tcp_accept() which needs to be done on a listening endpoint.

but in coap_session_set_type_client_lkd, the endpoint in sock is not cleared

This needs to be left there for UDP sessions as the sock is common across all inbound traffic which have individual sessions. The session->sock here is empty so when coap_socket_dgrm_close() is invoked it does nothing. I agree there may be some unintended consequences though and I will have a look at this..

libcoap's native reconnection mechanism doesn't seem to notify the upper layer during the reconnection process

3 new events are in the latest code

  /*
   * Reconnect events
   */
  /** Triggered when a session failed, and a reconnect is going to be attempted */
  COAP_EVENT_RECONNECT_FAILED = 0x9001,
  /** Triggered when a session failed, and a reconnect is successful */
  COAP_EVENT_RECONNECT_SUCCESS = 0x9002,
  /** Triggered when a session failed, and retry reconnect attempts failed */
  COAP_EVENT_RECONNECT_NO_MORE = 0x9003,

mrdeep1 avatar Dec 10 '25 10:12 mrdeep1

@mrdeep1

Regarding the interruption of tcp_output in tcp_connect by tcp_slowtmr, I have some thoughts:

Issue: Race Conditions in TCP Operations and Missing LOCK_TCPIP_CORE() Protection in lwIP Integration

Environment Configuration

  • libcoap uses lwIP raw API
  • coap_io_process() runs in a dedicated FreeRTOS task (similar to coap_io_process_loop in POSIX)
  • Other libcoap interfaces are called from different threads
  • libcoap initialization does not configure coap_lwip_set_input_wait_handler
  • lwIP is also used by other business logic through lwIP socket API (I understand all events are routed to tcpip_thread for single-threaded processing via mbox)

Problem Description

I encountered an issue where tcp_connect was interrupted by tcp_slowtmr during tcp_output execution. This seems related to missing thread synchronization protection.

Analysis

  1. Threading Architecture:

    • Socket API operations are serialized through tcpip_thread (single-threaded via mbox)
    • tcpip_thread calls sys_check_timeouts(), triggering tcp_slowtmr
    • libcoap directly uses raw API, and coap_io_process_lkd() also calls sys_check_timeouts()
  2. Inconsistent Lock Protection:

    • UDP connections (coap_socket_connect_udp) use LOCK_TCPIP_CORE() (Should sys_check_timeouts wait for UNLOCK_TCPIP_CORE() to execute after coap_socket_connect_udp locks LOCK_TCPIP_CORE?)
    • TCP connections (coap_socket_connect_tcp) do NOT use LOCK_TCPIP_CORE()
  3. Potential Race Conditions:

    • libcoap interfaces and tcpip_thread run in different threads
    • Both sides might operate on PCBs simultaneously without proper synchronization
    • Other scenarios might also lack LOCK_TCPIP_CORE() / UNLOCK_TCPIP_CORE() protection

Questions

  1. Is my analysis regarding missing LOCK_TCPIP_CORE() in TCP operations correct?
  2. Should other libcoap lwIP operations also be protected with LOCK_TCPIP_CORE()?
  3. Why do only write and close operations use tcpip_try_callback()? (My understanding of lwIP is still quite limited)
  4. Is it necessary for coap_io_process_lkd() to call sys_check_timeouts() when tcpip_thread already seems to handle it?

Expected Behavior

TCP operations should have proper synchronization protection to prevent race conditions between libcoap thread raw API, other business thread socket API, and lwIP's tcpip_thread.

Kev-Jiang avatar Dec 10 '25 11:12 Kev-Jiang

3 new events are in the latest code

  /*
   * Reconnect events
   */
  /** Triggered when a session failed, and a reconnect is going to be attempted */
  COAP_EVENT_RECONNECT_FAILED = 0x9001,
  /** Triggered when a session failed, and a reconnect is successful */
  COAP_EVENT_RECONNECT_SUCCESS = 0x9002,
  /** Triggered when a session failed, and retry reconnect attempts failed */
  COAP_EVENT_RECONNECT_NO_MORE = 0x9003,

I previously saw the Reconnect events you added in #1812, but as I described, they still seem to be only reconnection results. What I need is something like COAP_EVENT_RECONNECTING that indicates the underlying libcoap is in the process of reconnecting.

Kev-Jiang avatar Dec 10 '25 11:12 Kev-Jiang

something like COAP_EVENT_RECONNECTING

Has been added to the (not yet pushed) code, the event fires when the reconnect is initiated.

mrdeep1 avatar Dec 10 '25 15:12 mrdeep1

TCP connections (coap_socket_connect_tcp) do NOT use LOCK_TCPIP_CORE()

I added LOCK_TCPIP_CORE() in coap_socket_connect_tcp1 and ran stress tests overnight (about 14 hours), and haven't encountered this issue again so far.

@mrdeep1

Regarding the interruption of tcp_output in tcp_connect by tcp_slowtmr, I have some thoughts:

Issue: Race Conditions in TCP Operations and Missing LOCK_TCPIP_CORE() Protection in lwIP Integration

Environment Configuration

  • libcoap uses lwIP raw API
  • coap_io_process() runs in a dedicated FreeRTOS task (similar to coap_io_process_loop in POSIX)
  • Other libcoap interfaces are called from different threads
  • libcoap initialization does not configure coap_lwip_set_input_wait_handler
  • lwIP is also used by other business logic through lwIP socket API (I understand all events are routed to tcpip_thread for single-threaded processing via mbox)

Problem Description

I encountered an issue where tcp_connect was interrupted by tcp_slowtmr during tcp_output execution. This seems related to missing thread synchronization protection.

Analysis

  1. Threading Architecture:

    • Socket API operations are serialized through tcpip_thread (single-threaded via mbox)
    • tcpip_thread calls sys_check_timeouts(), triggering tcp_slowtmr
    • libcoap directly uses raw API, and coap_io_process_lkd() also calls sys_check_timeouts()
  2. Inconsistent Lock Protection:

    • UDP connections (coap_socket_connect_udp) use LOCK_TCPIP_CORE() (Should sys_check_timeouts wait for UNLOCK_TCPIP_CORE() to execute after coap_socket_connect_udp locks LOCK_TCPIP_CORE?)
    • TCP connections (coap_socket_connect_tcp) do NOT use LOCK_TCPIP_CORE()
  3. Potential Race Conditions:

    • libcoap interfaces and tcpip_thread run in different threads
    • Both sides might operate on PCBs simultaneously without proper synchronization
    • Other scenarios might also lack LOCK_TCPIP_CORE() / UNLOCK_TCPIP_CORE() protection

Questions

  1. Is my analysis regarding missing LOCK_TCPIP_CORE() in TCP operations correct?
  2. Should other libcoap lwIP operations also be protected with LOCK_TCPIP_CORE()?
  3. Why do only write and close operations use tcpip_try_callback()? (My understanding of lwIP is still quite limited)
  4. Is it necessary for coap_io_process_lkd() to call sys_check_timeouts() when tcpip_thread already seems to handle it?

Expected Behavior

TCP operations should have proper synchronization protection to prevent race conditions between libcoap thread raw API, other business thread socket API, and lwIP's tcpip_thread.

Kev-Jiang avatar Dec 11 '25 01:12 Kev-Jiang

What puzzles me is this has a short time difference between becoming a server and then TCP fails. I believe this is because the client is not listening at that moment, so the server immediately receives an RST, which is exactly the scenario I'm testing.

I previously thought it was caused by RST, but this assumption was incorrect. I re-extracted and analyzed more CoAP and lwIP related logs, and I believe I have found the root cause:

Log Summary:

14:17:39.972765        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425bfbc: created outgoing session\n
14:17:44.976421        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : timeout waiting for first response\n
14:17:44.976433        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: delayed\n
14:17:44.976443        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n
14:17:46.178684        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425f10c: created outgoing session\n
14:17:51.179809        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : timeout waiting for first response\n
14:17:51.179824        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: delayed\n
14:17:51.179833        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n
14:17:52.421792        etharp_timer\n
14:17:52.488110        tcp_slowtmr: processing active pcb\n
14:17:52.488113        tcp_slowtmr: max SYN retries reached\n
14:17:52.488117        tcp_pcb_purge\n
14:17:52.488119        tcp_pcb_purge: data left on ->unacked\n
14:17:52.488122        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:52.488135        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: not transmitted after disconnect\n
14:17:52.488147        v:Reliable c:Ping {} [ ]\n
14:17:52.488154        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session disconnected (COAP_NACK_NOT_DELIVERABLE)\n
14:17:52.488163        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:52.488175        DEBG:***EVENT: COAP_EVENT_SESSION_FAILED\n
14:17:52.488190        DEBG:***EVENT: COAP_EVENT_SERVER_SESSION_DEL\n
14:17:52.488203        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425bfbc: closed\
14:17:58.441081        etharp_timer\n
14:17:58.476451        tcp_slowtmr: processing active pcb\n
14:17:58.476478        tcp_slowtmr: max SYN retries reached\n
14:17:58.476492        tcp_pcb_purge\n
14:17:58.476503        tcp_pcb_purge: data left on ->unacked\n
14:17:58.476519        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:58.476581        DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: not transmitted after disconnect\n
14:17:58.476643        v:Reliable c:Ping {} [ ]\n
14:17:58.476680        DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session disconnected (COAP_NACK_NOT_DELIVERABLE)\n
14:17:58.476737        DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:58.476801        DEBG:***EVENT: COAP_EVENT_SESSION_FAILED\n
14:17:58.495531        exception_mode:0xec00ec\r\n
14:17:58.998484        assert failed: coap_thread_pid == global_lock.pid, file: libcoap/src/coap_threadsafe.c, line: 69

Code snippet:

int server_call_home_start(coap_context_t *ctx, coap_address_t *src, coap_address_t *dst, coap_proto_t proto)
{
    session = coap_new_client_session(g_ctx, src, dst, proto);
    if (!session)
    {
        return -1;
    }

    coap_session_set_type_server(session);
    return 0;
}

The problem flow is as follows: When the connection receives no response, coap_new_client_session returns immediately, but coap_session_set_type_server needs to wait until "timeout waiting for first response" before it can return (I believe this is because coap_session_set_type_server is waiting for a lock, which causes my server_call_home_start to block - is this by design in libcoap?).

After coap_session_set_type_server returns, I initiated a new connection without waiting for the event callback to notify COAP_EVENT_TCP_FAILED connection status.

After the second coap_new_client_session and coap_session_set_type_server return, the first connection's COAP_EVENT_TCP_FAILED event callback occurs (I believe this is triggered by do_tcp_err from tcp_slowtmr in the tcpip_thread).

Finally, the second COAP_EVENT_TCP_FAILED event callback occurs (I believe this is triggered by do_tcp_err from tcp_slowtmr in coap_io_process_lkd).

Further Analysis: While I could modify the code to wait for the COAP_EVENT_TCP_FAILED event before initiating a new connection, I don't think this is a fundamental solution. Therefore, I further analyzed why the do_tcp_err triggered by tcp_slowtmr in coap_io_process_lkd causes an assert failure when coap_io_process unlocks (assert failed: coap_thread_pid == global_lock.pid).

The function call sequence when the problem occurs is as follows:

  1. Enter coap_io_process
  2. coap_lock_lock(return 0);
  3. Enter coap_io_process_lkd
  4. Enter sys_check_timeouts
  5. Enter handler(arg); (tcp_slowtmr)
  6. Enter TCP_EVENT_ERR (do_tcp_err)
  7. coap_lock_lock(return);

Here there are two scenarios: 7.1. If sys_mutex_new implementation defaults to RecursiveMutex, then sys_mutex_lock continues execution (my FreeRTOS is currently configured this way) 7.2. If sys_mutex_new implementation defaults to non-RecursiveMutex, then sys_mutex_lock will deadlock!!! (because step 2 already acquired the lock)

  1. Enter coap_lock_unlock(); (under condition 7.1)
  2. global_lock.pid = 0;
  3. Return all the way back to coap_io_process
  4. Enter coap_lock_unlock();
  5. assert(coap_thread_pid == global_lock.pid);

At this point, global_lock.pid is 0, which is not equal to coap_thread_pid, so assert failed!!!

Solution: I think we can remove the use of sys_check_timeouts in coap_io_process_lkd, because tcpip_thread already calls sys_check_timeouts. Correct?

  UNLOCK_TCPIP_CORE();

  if (context->input_wait) {
    coap_lock_callback_release(context->input_wait(context->input_arg, timeout),
                               return 0);
#if NO_SYS == 0
  } else {
    coap_lock_callback_release(sys_arch_sem_wait(&coap_io_timeout_sem, timeout),
                               return 0);
#endif /* NO_SYS == 0 */
  }

  coap_lock_invert(LOCK_TCPIP_CORE(),
                   UNLOCK_TCPIP_CORE(); return 0);

#if 0
  sys_check_timeouts(); /* remove this? */
#endif

  UNLOCK_TCPIP_CORE();

  coap_ticks(&now);
  return (int)(((now - before) * 1000) / COAP_TICKS_PER_SECOND);
}

Kev-Jiang avatar Dec 11 '25 10:12 Kev-Jiang

For https://github.com/obgm/libcoap/issues/1826#issuecomment-3641327458. Thanks for spending the time troubleshooting this and providing the detail.

I think we can remove the use of sys_check_timeouts in coap_io_process_lkd

Yes, they can be removed for NO_SYS == 0. Fixed in #1812.

coap_session_set_type_server needs to wait until "timeout waiting for first response"

coap_session_set_type_server () initiates a Ping to start communications to the remote peer (more relevant to UDP based protocols or TLS) and is waiting (for up to 5 seconds) for the CoAP TCP CMS packet exchange.

For https://github.com/obgm/libcoap/issues/1826#issuecomment-3639632738. Thanks for spending the time troubleshooting this and providing the detail.

I added LOCK_TCPIP_CORE() in coap_socket_connect_tcp1 and ran stress tests overnight (about 14 hours), and haven't encountered this issue again so far.

That is good news. Fix in #1812.

For https://github.com/obgm/libcoap/issues/1826#issuecomment-3636592413. Thanks for spending the time troubleshooting this and providing the detail with good questions.

I encountered an issue where tcp_connect was interrupted by tcp_slowtmr during tcp_output execution.

This should now be fixed in #1812 by removing sys_check_timeouts() in coap_io_process_lkd()

Re Questions

Is my analysis regarding missing LOCK_TCPIP_CORE() in TCP operations correct?

Yes. I have made some changes in #1812.

Should other libcoap lwIP operations also be protected with LOCK_TCPIP_CORE()?

Yes. I have made some changes in #1812.

Why do only write and close operations use tcpip_try_callback()? (My understanding of lwIP is still quite limited)

It is safer to pass the write / close requests over to the tcpip_thread as these are more likely to frequently occur. Having to do the lock inversion coap_lock_invert(LOCK_TCPIP_CORE(), goto err_unlock); (so we don't get global_lock->tcpip_lock and tcpip_lock->global_lock in 2 threads) means that global_lock has to get temporarily unlocked which may change the state of things by another thread. Doing this in coap_socket_connect_tcp1() should be fine.

Is it necessary for coap_io_process_lkd() to call sys_check_timeouts() when tcpip_thread already seems to handle it?

It only needs to be called if NO_SYS == 1. Fixed in #1812.

mrdeep1 avatar Dec 11 '25 12:12 mrdeep1

coap_session_set_type_server () initiates a Ping to start communications to the remote peer (more relevant to UDP based protocols or TLS) and is waiting (for up to 5 seconds) for the CoAP TCP CMS packet exchange.

Oh, I got it. The blocking is caused by coap_client_delay_first being called in coap_session_set_type_server.

int
coap_client_delay_first(coap_session_t *session) {
#if COAP_CLIENT_SUPPORT
  if (session->type == COAP_SESSION_TYPE_CLIENT && session->doing_first) {
    int timeout_ms = 5000;
    coap_session_state_t current_state = session->state;

    if (session->delay_recursive) {
      return 0;
    } else {
      session->delay_recursive = 1;
    }
    /*
     * Need to wait for first request to get out and response back before
     * continuing.. Response handler has to clear doing_first if not an error.
     */
    coap_session_reference_lkd(session);
    while (session->doing_first != 0) {
      int result = coap_io_process_lkd(session->context, 1000);

      if (result < 0) {
        session->doing_first = 0;
        session->delay_recursive = 0;
        coap_session_release_lkd(session);
        return 0;
      }

And thank you very much for your fix in https://github.com/obgm/libcoap/pull/1812. I will pull and use it again after this PR is merged into the develop branch.

Kev-Jiang avatar Dec 11 '25 14:12 Kev-Jiang

@mrdeep1 Note that https://github.com/obgm/libcoap/pull/1812 has been merged. I'm currently testing and found a compilation failure issue caused by a typo: coap_mutex_internal.h

#define caop_thread_pid_t     TaskHandle_t

Should be:

#define coap_thread_pid_t     TaskHandle_t

Kev-Jiang avatar Dec 15 '25 11:12 Kev-Jiang

Apologies for that. #1836 will fix this.

mrdeep1 avatar Dec 15 '25 11:12 mrdeep1

@mrdeep1 I used the latest develop branch code and noticed a change in the server-side behavior under call-home mode: in the current version, coap_session_set_type_server no longer remains in a waiting state, and the server also triggers the COAP_EVENT_SERVER_SESSION_CONNECTED event, behaving similarly to the client. In previous versions, the server would only trigger the COAP_EVENT_SESSION_CONNECTED event, and only the client would additionally trigger a COAP_EVENT_SERVER_SESSION_CONNECTED event after COAP_EVENT_SESSION_CONNECTED.

Currently, I execute coap_session_set_type_client in the callback for COAP_EVENT_SERVER_SESSION_CONNECTED, which is causing some issues. I'd like to confirm whether the current server-side behavior in call-home mode is the intended logic as per your design. If this is the intended behavior, I will adjust the upper-layer code to adapt to libcoap's new logic. If not, this might be a bug.

Dec 16 21:23:45.480 DEBG ***0.0.0.0:0 <-> 10.56.65.14:5683 TCP : session 0x558613df40: created outgoing session
Dec 16 21:23:45.485 DEBG ***0.0.0.0:0 <-> 10.56.65.14:5683 TCP : session now is type Server
Dec 16 21:23:45.518 DEBG ***EVENT: COAP_EVENT_TCP_CONNECTED
Dec 16 21:23:45.523 DEBG ***10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : sending CSM
Dec 16 21:23:45.537 DEBG *  10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : netif: sent    7 bytes
v:Reliable c:CSM {} [ Max-Message-Size:8388864, Block-Wise-Transfer: ]
Dec 16 21:23:45.542 DEBG *  10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : netif: recv    7 bytes
v:Reliable c:CSM {} [ Max-Message-Size:8388864, Block-Wise-Transfer: ]
Dec 16 21:23:45.542 DEBG ***10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : session connected
Dec 16 21:23:45.542 DEBG ***EVENT: COAP_EVENT_SESSION_CONNECTED
Dec 16 21:23:45.544 DEBG ***EVENT: COAP_EVENT_SERVER_SESSION_CONNECTED
Dec 16 21:23:45.555 DEBG ***10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : session now is type Client

Kev-Jiang avatar Dec 16 '25 13:12 Kev-Jiang

On Actual-Server, you create a client session and immediately set the session type using coap_session_set_type_server() so now on, the session is of type server and will be receiving requests. Once the server session completes, it will raise COAP_EVENT_SERVER_SESSION_CONNECTED. However, this event is only to be processed and not used to call coap_session_set_type_client() - that should only be done on Actual-Client once the (starting as) Server session has completed and also raises COAP_EVENT_SERVER_SESSION_CONNECTED to signal this.

So, only invoke coap_session_set_type_client() on the Actual-Client, not Actual-Server on receipt of COAP_EVENT_SERVER_SESSION_CONNECTED.

For Actual-Server, as the call-home client session has been converted to type Server, I think it right that when the the session (as a type Server) finishes the setup, it should report COAP_EVENT_SERVER_SESSION_CONNECTED. for consistency.

mrdeep1 avatar Dec 16 '25 14:12 mrdeep1