Several bugs in the lwIP TCP context
Environment
- Build System: [Make]
- Operating System: [FreeRTOS]
- Operating System Version: [ ]
- Hosted Environment: [LwIP]
libcoap Configuration Summary
#define WITH_LWIP 1 #define COAP_IPV4_SUPPORT 1 #define COAP_CONSTRAINED_STACK 1 #define COAP_DISABLE_TCP 0 #define COAP_ASYNC_SUPPORT 1 #define COAP_WITH_OBSERVE_PERSIST 0 #define COAP_WS_SUPPORT 0 #define COAP_Q_BLOCK_SUPPORT 0 #define COAP_THREAD_SAFE 1 #define COAP_THREAD_RECURSIVE_CHECK 0
Problem Description
Issue 1: Timeout Handling Problem in coap_io_process
The timeout logic in coap_io_process (located in coap_io_lwip.c) appears problematic. Here's the issue:
-
sys_timeout()is used to set a timeout callback and send a semaphore signal. -
sys_arch_sem_wait()blocks for the timeout period when no signals are available.
Observed Behavior:
Both timeout and semaphore signal completion can happen simultaneously, causing the following call to coap_io_process() to immediately return without performing the intended timeout handling.
Current Code Snippet
if (timeout) {
sys_timeout(timeout, coap_io_process_timeout, context);
context->timer_configured = 1;
}
#if NO_SYS == 0
} else {
coap_lock_callback_release(sys_arch_sem_wait(&coap_io_timeout_sem, timeout),
return 0);
#endif /* NO_SYS == 0 */
}
Example:
If coap_io_process() is invoked repeatedly (e.g., every 5s in a loop):
-
sys_arch_sem_wait()times out after 5 seconds and returns. -
coap_io_process_timeout()simultaneously signals the semaphore. - Re-calling
coap_io_process()immediately after uses the prior semaphore signal and returns instantly.
while (1) {
coap_io_process(ctx, 5000); // 5-second timeout
}
Suggested Fix:
Remove the use of sys_timeout() and rely solely on sys_arch_sem_wait.
Issue 2: Non-Immediate Return for TCP Events
When using TCP transport, coap_io_process() does not immediately process incoming events.
UDP transport (coap_dgrm_lwip.c) correctly utilizes calls like sys_sem_signal to notify event readiness:
#if NO_SYS == 0
sys_sem_signal(&coap_io_timeout_sem);
#endif /* NO_SYS == 0 */
However, TCP-related code in coap_strm_lwip.c lacks these sys_sem_signal calls. At least six places may need signaling:
-
do_tcp_err -
coap_tcp_recv -
do_tcp_connected -
do_tcp_accept(plus other relevant return points in the TCP flow)
Suggested Fix:
Add sys_sem_signal calls at these critical return points to ensure proper TCP event notification.
Issue 3: TCP Window Not Restored After Data Read
Failure to restore the TCP window after data read leads to issues such as:
- When the remote endpoint closes the connection,
tcp_close_shutdown()erroneously checks ifpcb->rcv_wnd != TCP_WND_MAX(pcb)and leaves thepcbunfreed. - This issue limits the number of active PCBs once the limit is reached, preventing new connections.
Areas impacted:
-
coap_tcp_recv -
coap_socket_read
Suggested Fix:
Call tcp_recved() to restore the TCP window.
Example Fix in coap_tcp_recv:
} else if (err != ERR_OK) {
/* cleanup, for unknown reason */
if (p != NULL) {
pbuf_free(p);
tcp_recved(sock->tcp_pcb, p->tot_len);
}
}
Example Fix in coap_socket_read:
ssize_t
coap_socket_read(coap_socket_t *sock, uint8_t *data, size_t data_len) {
if (sock->p) {
if (data_len < sock->p->len) {
uint8_t *ptr = (uint8_t *)sock->p->payload;
/* Handle partial read */
memcpy(data, sock->p->payload, data_len);
tcp_recved(sock->tcp_pcb, data_len);
sock->p->payload = &ptr[data_len];
sock->p->len -= data_len;
return data_len;
} else {
data_len = sock->p->len;
memcpy(data, sock->p->payload, sock->p->len);
pbuf_free(sock->p);
tcp_recved(sock->tcp_pcb, sock->p->len);
sock->p = NULL;
return data_len;
}
}
return 0;
}
Issue 4: Improper TCP Connection Closure in coap_lwip_tcp_close
Closing a TCP connection via coap_lwip_tcp_close (located in coap_strm_lwip.c) does not clear the TCP hooks (tcp_err, tcp_arg).
This leaves callbacks like do_tcp_err active post-session cleanup, potentially causing crashes.
Suggested Fix:
- Clear
tcp_errandtcp_arg. - Validate the session before processing in
do_tcp_err.
Example Fix for coap_lwip_tcp_close:
static void
coap_lwip_tcp_close(void *ctx) {
struct tcp_pcb *tcp_pcb = (struct tcp_pcb *)ctx;
if (tcp_pcb) {
tcp_recv(tcp_pcb, NULL);
tcp_err(tcp_pcb, NULL);
tcp_arg(tcp_pcb, NULL);
tcp_close(tcp_pcb);
}
}
Safety Check Example for do_tcp_err:
static void
do_tcp_err(void *arg, err_t err) {
coap_session_t *session = (coap_session_t *)arg;
if (session) {
(void)err;
coap_lock_lock(return);
coap_handle_event_lkd(session->context, COAP_EVENT_TCP_FAILED, session);
/*
* As per TCP documentation, the associated PCB is already freed
* when this callback is called. Prevent double-free during cleanup.
*/
session->sock.tcp_pcb = NULL;
coap_session_disconnected_lkd(session, COAP_NACK_NOT_DELIVERABLE);
#if NO_SYS == 0
sys_sem_signal(&coap_io_timeout_sem);
#endif
coap_lock_unlock();
}
}
Conclusion
In an LwIP environment, issues observed in libcoap with TCP transport include:
- Timeout handling logic conflicts.
- Lack of immediate event processing for TCP.
- TCP window restoration issues.
- Improper TCP connection cleanup.
Suggested fixes should address these issues, improving libcoap's robustness and compatibility in LwIP environments.
Thanks for raising and detailing these issues. I will take a look at them and get fixes out as appropriate.
Another question: Why use LwIP Raw API instead of Socket API? Using Socket API might improve compatibility and portability.
Why use LwIP Raw API instead of Socket API?
Good question for which I don't have a proper answer. libcoap had LwIP support before I started to work on it and this was based on the Raw API.. Certainly non-blocking i/o is required and the libcoap design is to be using callbacks whenever any TCP/IP data arrives, which may also have to pass through a DTLS layer (currently for TinyDTLS).
I feel that libcoap has many issues when running on lwIP, especially with TCP. Although I am working with common usage scenarios, it seems quite challenging to use it directly. However, if these four issues are resolved, basic communication should be able to function properly.
I have taken most of the suggestions, am running some soak tests and tracking down some anomalies.
Issue Summary: I discovered a new lwip issue where CoAP experiences probabilistic crashes during TCP connection establishment.
Log Fragment:
tcp_bind: bind to port 49916\n
tcp_connect to port 5683\n
tcp_enqueue_flags: queueing 10230066:10230067 (0x2)\n
tcp_output: snd_wnd 12288, cwnd 1, wnd 1, effwnd 0, seq 10230066, ack 10230066\n
tcp_slowtmr: processing active pcb\n
tcp_slowtmr: polling application\n
tcp_output: snd_wnd 12288, cwnd 1, wnd 1, effwnd 0, seq 10230066, ack 10230066\n
tcp_output: snd_wnd 12288, cwnd 1, wnd 1, effwnd 0, seq 10230066, ack 10230066, i 0\n
tcp_output_segment: 10230066:10230066\n
ethernet_output: sending packet 0x4229748\n
lwip_protocol_send L:426 run
send len = $ 58 $\r\n
tcp_output: snd_wnd 12288, cwnd 1, wnd 1, effwnd 0, seq 10230066, ack 10230066, i 0\n
tcp_output_segment: 10230066:10230066\n
ethernet_output: sending packet 0x4229748\n
lwip_protocol_send L:426 run
send len = $ 58 $\r\n
exception_mode:0xec00ec\r\n
Problem Analysis: Initial Call: tcp_connect invokes tcp_output Interrupt Occurrence: During tcp_output execution, the tcp_slowtmr timer interrupt triggers Reentrant Call: Within tcp_slowtmr's polling process, tcp_output is called again (via TCP_EVENT_POLL -> tcp_output(prev)) State Conflict: Both tcp_output calls operate on the same PCB, causing data structure state inconsistency Crash: When returning to the first tcp_output call, the PCB state has been corrupted, resulting in an exception
Thanks for this. Not currently sure whether this is a LwIP library specific issue, or because it is caused by a timer condition set up by libcoap.
Issues 2, 3 and 4 addressed in #1827. Please check if this works for you.
I do not think 1 is an issue at this point as just an additional call is made to coap_io_process() (which will do nothing).
With your new issue, what is your input function when you are doing a netif_add() ? I found my examples were incorrect and we had two threads tapif_thread and tcpip_thread doing the tcpip work.
For new issue, I need to first emphasize and add a detail: this issue is probabilistic in nature. I later conducted a 4-hour stress test and successfully reproduced the problem. The reproduction process is as follows:
The libcoap server works in call-home mode. After the session is established and communication is normal, every 30 seconds the server side actively calls coap_session_disconnected (connection disconnects and session is subsequently released). Then after another 10 seconds, coap_new_client_session is called again to establish a TCP connection. This process is repeated continuously during stress testing, and there's a probability that during the execution of tcp_output in tcp_connect, it gets preempted by tcp_slowtmr, ultimately leading to a crash.
Regarding the issue you're concerned about: "With your new issue, what is your input function when you are doing a netif_add()?"
During netif_init, I set tcpip_input for netif_add and also set the NETIF_FLAG_ETHARP flag.
Regarding your modification in https://github.com/obgm/libcoap/pull/1827, there's one part I don't quite understand: why do you need to increment the window by 1 (tcp_recved(sock->tcp_pcb, 1);) when p==NULL in coap_tcp_recv? I haven't seen similar usage in lwip's apps.
static err_t
coap_tcp_recv(void *arg, struct tcp_pcb *tpcb, struct pbuf *p, err_t err) {
coap_session_t *session = (coap_session_t *)arg;
coap_socket_t *sock = session ? &session->sock : NULL;
coap_tick_t now;
(void)tpcb;
if (!session)
return ERR_ARG;
if (p == NULL) {
/* remote host closed connection */
tcp_recved(sock->tcp_pcb, 1); /* why do this? */
tcp_close(sock->tcp_pcb);
tcp_arg(sock->tcp_pcb, NULL);
tcp_recv(sock->tcp_pcb, NULL);
Yes, you are right, tcp_recved(sock->tcp_pcb, 1); should not be there. I had added it in when troubleshooting when two threads were incorrectly being used to handle TCP/IP I/O and had not removed it. Updated code pushed to #1827.
For new issue, I need to first emphasize and add a detail
Thanks for providing further information. Not sure why LwIP is not handling tcp_slowtmr in active tcp_connect. I will look further into this,
If you make something similar to the following change to the LwIP code, then the logging will report the thread id so we can see if there is a thread clash usage when you get the failure.
diff --git a/src/include/lwip/arch.h b/src/include/lwip/arch.h
index c659a80c..de021c27 100644
--- a/src/include/lwip/arch.h
+++ b/src/include/lwip/arch.h
@@ -78,7 +78,7 @@
* systems, this should be defined to something less resource-consuming.
*/
#ifndef LWIP_PLATFORM_DIAG
-#define LWIP_PLATFORM_DIAG(x) do {printf x;} while(0)
+#define LWIP_PLATFORM_DIAG(y) do {printf("%lx: ", xTaskGetCurrentTaskHandle()) ; printf y;} while(0)
#include <stdio.h>
#include <stdlib.h>
#endif
I discovered another probabilistic issue:
Although coap_lwip_tcp_close already contains tcp_err(tcp_pcb, NULL); tcp_arg(tcp_pcb, NULL);
But when the server is in call-home mode, if it fails to connect from the beginning, 6 seconds after the session closes, lwip's slowtmr still triggers do_tcp_err (inferred from the logs), and subsequently causes an unlock assertion crash in coap_io_process.
Below is a log excerpt:
14:17:46.178684 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425f10c: created outgoing session\n
14:17:51.179809 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : timeout waiting for first response\n
14:17:51.179824 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: delayed\n
14:17:51.179833 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n
14:17:52.488122 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:52.488135 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: not transmitted after disconnect\n
14:17:52.488147 v:Reliable c:Ping {} [ ]\n
14:17:52.488154 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session disconnected (COAP_NACK_NOT_DELIVERABLE)\n
14:17:52.488163 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:52.488175 DEBG:***EVENT: COAP_EVENT_SESSION_FAILED\n
14:17:52.488190 DEBG:***EVENT: COAP_EVENT_SERVER_SESSION_DEL\n
14:17:52.488203 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425bfbc: closed\
14:17:58.441081 etharp_timer\n
14:17:58.476451 tcp_slowtmr: processing active pcb\n
14:17:58.476478 tcp_slowtmr: max SYN retries reached\n
14:17:58.476492 tcp_pcb_purge\n
14:17:58.476503 tcp_pcb_purge: data left on ->unacked\n
14:17:58.476519 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:58.476581 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: not transmitted after disconnect\n
14:17:58.476643 v:Reliable c:Ping {} [ ]\n
14:17:58.476680 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session disconnected (COAP_NACK_NOT_DELIVERABLE)\n
14:17:58.476737 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:58.476801 DEBG:***EVENT: COAP_EVENT_SESSION_FAILED\n
14:17:58.495531 exception_mode:0xec00ec\r\n
14:17:58.998484 assert failed: coap_thread_pid == global_lock.pid, file: libcoap/src/coap_threadsafe.c, line: 69
If you make something similar to the following change to the LwIP code, then the logging will report the thread id so we can see if there is a thread clash usage when you get the failure.
diff --git a/src/include/lwip/arch.h b/src/include/lwip/arch.h index c659a80c..de021c27 100644 --- a/src/include/lwip/arch.h +++ b/src/include/lwip/arch.h @@ -78,7 +78,7 @@ * systems, this should be defined to something less resource-consuming. */ #ifndef LWIP_PLATFORM_DIAG -#define LWIP_PLATFORM_DIAG(x) do {printf x;} while(0) +#define LWIP_PLATFORM_DIAG(y) do {printf("%lx: ", xTaskGetCurrentTaskHandle()) ; printf y;} while(0) #include <stdio.h> #include <stdlib.h> #endif
Thank you, I will try to reproduce it later.
All the changes so far have been consolidated into #1812. I will look into the above.
Re https://github.com/obgm/libcoap/issues/1826#issuecomment-3631578485, I think that this is missing
diff --git a/src/coap_strm_lwip.c b/src/coap_strm_lwip.c
index 7b9879be..d6b50203 100644
--- a/src/coap_strm_lwip.c
+++ b/src/coap_strm_lwip.c
@@ -420,6 +420,9 @@ coap_lwip_endpoint_tcp_close(void *ctx) {
struct tcp_pcb *tcp_pcb = (struct tcp_pcb *)ctx;
if (tcp_pcb) {
+ tcp_err(tcp_pcb, NULL);
+ tcp_arg(tcp_pcb, NULL);
+ tcp_recv(tcp_pcb, NULL);
tcp_close(tcp_pcb);
}
}
What puzzles me is this has a short time difference between becoming a server and then TCP fails.
14:17:51.179833 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n
14:17:52.488122 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
and appears to be the result of a call to do_tcp_err(). I am unable to reproduce this.
It also appears that you are not invoking coap_context_set_session_reconnect_time(). Correct?
Re #1826 (comment), I think that this is missing
diff --git a/src/coap_strm_lwip.c b/src/coap_strm_lwip.c index 7b9879be..d6b50203 100644 --- a/src/coap_strm_lwip.c +++ b/src/coap_strm_lwip.c @@ -420,6 +420,9 @@ coap_lwip_endpoint_tcp_close(void *ctx) { struct tcp_pcb *tcp_pcb = (struct tcp_pcb *)ctx; if (tcp_pcb) { + tcp_err(tcp_pcb, NULL); + tcp_arg(tcp_pcb, NULL); + tcp_recv(tcp_pcb, NULL); tcp_close(tcp_pcb); } }
Oh, regarding coap_lwip_endpoint_tcp_close, I think you are right. This part is indeed missing in https://github.com/obgm/libcoap/pull/1827. However, I noticed that coap_lwip_endpoint_tcp_close and coap_lwip_tcp_close should have the same behavior, but I'm not sure why they need to be separated into two functions. Additionally, I noticed that in coap_session_set_type_server_lkd, an endpoint is created and stored in sock (ep->sock.endpoint = ep;), but in coap_session_set_type_client_lkd, the endpoint in sock is not cleared (session->sock = session->endpoint->sock;). I'm not sure if this could cause some potential issues, and whether there are other unconsidered matters that need to be handled when switching between client and server.
What puzzles me is this has a short time difference between becoming a server and then TCP fails.
14:17:51.179833 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n 14:17:52.488122 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
I believe this is because the client is not listening at that moment, so the server immediately receives an RST, which is exactly the scenario I'm testing.
and appears to be the result of a call to
do_tcp_err(). I am unable to reproduce this.It also appears that you are not invoking
coap_context_set_session_reconnect_time(). Correct?
You are correct. I am currently not enabling reconnect time because I had already implemented my own reconnection logic at the upper layer before you solved the reconnection issue in https://github.com/obgm/libcoap/pull/1812. Another consideration is that, from what I recall, libcoap's native reconnection mechanism doesn't seem to notify the upper layer during the reconnection process - it only notifies the upper layer when the reconnection attempt has a result (am I correct about this?). If the upper layer cannot know that the lower layer is reconnecting, my upper-layer business logic might encounter certain exceptions (or it might not, but this risk exists). So I temporarily haven't enabled libcoap's underlying reconnection mechanism. If libcoap notifies the upper layer when reconnection is in progress, I will reconsider switching to libcoap's native reconnection mechanism in the future.
I noticed that coap_lwip_endpoint_tcp_close and coap_lwip_tcp_close should have the same behavior
Good spot - the difference is tcp_accept() which needs to be done on a listening endpoint.
but in coap_session_set_type_client_lkd, the endpoint in sock is not cleared
This needs to be left there for UDP sessions as the sock is common across all inbound traffic which have individual sessions. The session->sock here is empty so when coap_socket_dgrm_close() is invoked it does nothing. I agree there may be some unintended consequences though and I will have a look at this..
libcoap's native reconnection mechanism doesn't seem to notify the upper layer during the reconnection process
3 new events are in the latest code
/*
* Reconnect events
*/
/** Triggered when a session failed, and a reconnect is going to be attempted */
COAP_EVENT_RECONNECT_FAILED = 0x9001,
/** Triggered when a session failed, and a reconnect is successful */
COAP_EVENT_RECONNECT_SUCCESS = 0x9002,
/** Triggered when a session failed, and retry reconnect attempts failed */
COAP_EVENT_RECONNECT_NO_MORE = 0x9003,
@mrdeep1
Regarding the interruption of tcp_output in tcp_connect by tcp_slowtmr, I have some thoughts:
Issue: Race Conditions in TCP Operations and Missing LOCK_TCPIP_CORE() Protection in lwIP Integration
Environment Configuration
- libcoap uses lwIP raw API
-
coap_io_process()runs in a dedicated FreeRTOS task (similar tocoap_io_process_loopin POSIX) - Other libcoap interfaces are called from different threads
- libcoap initialization does not configure
coap_lwip_set_input_wait_handler - lwIP is also used by other business logic through lwIP socket API (I understand all events are routed to
tcpip_threadfor single-threaded processing via mbox)
Problem Description
I encountered an issue where tcp_connect was interrupted by tcp_slowtmr during tcp_output execution. This seems related to missing thread synchronization protection.
Analysis
-
Threading Architecture:
- Socket API operations are serialized through
tcpip_thread(single-threaded via mbox) -
tcpip_threadcallssys_check_timeouts(), triggeringtcp_slowtmr - libcoap directly uses raw API, and
coap_io_process_lkd()also callssys_check_timeouts()
- Socket API operations are serialized through
-
Inconsistent Lock Protection:
- UDP connections (
coap_socket_connect_udp) useLOCK_TCPIP_CORE()(Shouldsys_check_timeoutswait forUNLOCK_TCPIP_CORE()to execute aftercoap_socket_connect_udplocksLOCK_TCPIP_CORE?) - TCP connections (
coap_socket_connect_tcp) do NOT useLOCK_TCPIP_CORE()
- UDP connections (
-
Potential Race Conditions:
- libcoap interfaces and
tcpip_threadrun in different threads - Both sides might operate on PCBs simultaneously without proper synchronization
- Other scenarios might also lack
LOCK_TCPIP_CORE()/UNLOCK_TCPIP_CORE()protection
- libcoap interfaces and
Questions
- Is my analysis regarding missing
LOCK_TCPIP_CORE()in TCP operations correct? - Should other libcoap lwIP operations also be protected with
LOCK_TCPIP_CORE()? - Why do only write and close operations use
tcpip_try_callback()? (My understanding of lwIP is still quite limited) - Is it necessary for
coap_io_process_lkd()to callsys_check_timeouts()whentcpip_threadalready seems to handle it?
Expected Behavior
TCP operations should have proper synchronization protection to prevent race conditions between libcoap thread raw API, other business thread socket API, and lwIP's tcpip_thread.
3 new events are in the latest code
/* * Reconnect events */ /** Triggered when a session failed, and a reconnect is going to be attempted */ COAP_EVENT_RECONNECT_FAILED = 0x9001, /** Triggered when a session failed, and a reconnect is successful */ COAP_EVENT_RECONNECT_SUCCESS = 0x9002, /** Triggered when a session failed, and retry reconnect attempts failed */ COAP_EVENT_RECONNECT_NO_MORE = 0x9003,
I previously saw the Reconnect events you added in #1812, but as I described, they still seem to be only reconnection results. What I need is something like COAP_EVENT_RECONNECTING that indicates the underlying libcoap is in the process of reconnecting.
something like COAP_EVENT_RECONNECTING
Has been added to the (not yet pushed) code, the event fires when the reconnect is initiated.
TCP connections (coap_socket_connect_tcp) do NOT use LOCK_TCPIP_CORE()
I added LOCK_TCPIP_CORE() in coap_socket_connect_tcp1 and ran stress tests overnight (about 14 hours), and haven't encountered this issue again so far.
Regarding the interruption of tcp_output in tcp_connect by tcp_slowtmr, I have some thoughts:
Issue: Race Conditions in TCP Operations and Missing LOCK_TCPIP_CORE() Protection in lwIP Integration
Environment Configuration
- libcoap uses lwIP raw API
coap_io_process()runs in a dedicated FreeRTOS task (similar tocoap_io_process_loopin POSIX)- Other libcoap interfaces are called from different threads
- libcoap initialization does not configure
coap_lwip_set_input_wait_handler- lwIP is also used by other business logic through lwIP socket API (I understand all events are routed to
tcpip_threadfor single-threaded processing via mbox)Problem Description
I encountered an issue where
tcp_connectwas interrupted bytcp_slowtmrduringtcp_outputexecution. This seems related to missing thread synchronization protection.Analysis
Threading Architecture:
- Socket API operations are serialized through
tcpip_thread(single-threaded via mbox)tcpip_threadcallssys_check_timeouts(), triggeringtcp_slowtmr- libcoap directly uses raw API, and
coap_io_process_lkd()also callssys_check_timeouts()Inconsistent Lock Protection:
- UDP connections (
coap_socket_connect_udp) useLOCK_TCPIP_CORE()(Shouldsys_check_timeoutswait forUNLOCK_TCPIP_CORE()to execute aftercoap_socket_connect_udplocksLOCK_TCPIP_CORE?)- TCP connections (
coap_socket_connect_tcp) do NOT useLOCK_TCPIP_CORE()Potential Race Conditions:
- libcoap interfaces and
tcpip_threadrun in different threads- Both sides might operate on PCBs simultaneously without proper synchronization
- Other scenarios might also lack
LOCK_TCPIP_CORE()/UNLOCK_TCPIP_CORE()protectionQuestions
- Is my analysis regarding missing
LOCK_TCPIP_CORE()in TCP operations correct?- Should other libcoap lwIP operations also be protected with
LOCK_TCPIP_CORE()?- Why do only write and close operations use
tcpip_try_callback()? (My understanding of lwIP is still quite limited)- Is it necessary for
coap_io_process_lkd()to callsys_check_timeouts()whentcpip_threadalready seems to handle it?Expected Behavior
TCP operations should have proper synchronization protection to prevent race conditions between libcoap thread raw API, other business thread socket API, and lwIP's
tcpip_thread.
What puzzles me is this has a short time difference between becoming a server and then TCP fails. I believe this is because the client is not listening at that moment, so the server immediately receives an RST, which is exactly the scenario I'm testing.
I previously thought it was caused by RST, but this assumption was incorrect. I re-extracted and analyzed more CoAP and lwIP related logs, and I believe I have found the root cause:
Log Summary:
14:17:39.972765 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425bfbc: created outgoing session\n
14:17:44.976421 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : timeout waiting for first response\n
14:17:44.976433 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: delayed\n
14:17:44.976443 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n
14:17:46.178684 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425f10c: created outgoing session\n
14:17:51.179809 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : timeout waiting for first response\n
14:17:51.179824 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: delayed\n
14:17:51.179833 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session now is type Server\n
14:17:52.421792 etharp_timer\n
14:17:52.488110 tcp_slowtmr: processing active pcb\n
14:17:52.488113 tcp_slowtmr: max SYN retries reached\n
14:17:52.488117 tcp_pcb_purge\n
14:17:52.488119 tcp_pcb_purge: data left on ->unacked\n
14:17:52.488122 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:52.488135 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: not transmitted after disconnect\n
14:17:52.488147 v:Reliable c:Ping {} [ ]\n
14:17:52.488154 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session disconnected (COAP_NACK_NOT_DELIVERABLE)\n
14:17:52.488163 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:52.488175 DEBG:***EVENT: COAP_EVENT_SESSION_FAILED\n
14:17:52.488190 DEBG:***EVENT: COAP_EVENT_SERVER_SESSION_DEL\n
14:17:52.488203 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session 0x425bfbc: closed\
14:17:58.441081 etharp_timer\n
14:17:58.476451 tcp_slowtmr: processing active pcb\n
14:17:58.476478 tcp_slowtmr: max SYN retries reached\n
14:17:58.476492 tcp_pcb_purge\n
14:17:58.476503 tcp_pcb_purge: data left on ->unacked\n
14:17:58.476519 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:58.476581 DEBG:** [192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : mid=0x0000: not transmitted after disconnect\n
14:17:58.476643 v:Reliable c:Ping {} [ ]\n
14:17:58.476680 DEBG:***[192.168.144.51]:0 <-> [192.168.144.1]:5683 TCP : session disconnected (COAP_NACK_NOT_DELIVERABLE)\n
14:17:58.476737 DEBG:***EVENT: COAP_EVENT_TCP_FAILED\n
14:17:58.476801 DEBG:***EVENT: COAP_EVENT_SESSION_FAILED\n
14:17:58.495531 exception_mode:0xec00ec\r\n
14:17:58.998484 assert failed: coap_thread_pid == global_lock.pid, file: libcoap/src/coap_threadsafe.c, line: 69
Code snippet:
int server_call_home_start(coap_context_t *ctx, coap_address_t *src, coap_address_t *dst, coap_proto_t proto)
{
session = coap_new_client_session(g_ctx, src, dst, proto);
if (!session)
{
return -1;
}
coap_session_set_type_server(session);
return 0;
}
The problem flow is as follows:
When the connection receives no response, coap_new_client_session returns immediately, but coap_session_set_type_server needs to wait until "timeout waiting for first response" before it can return (I believe this is because coap_session_set_type_server is waiting for a lock, which causes my server_call_home_start to block - is this by design in libcoap?).
After coap_session_set_type_server returns, I initiated a new connection without waiting for the event callback to notify COAP_EVENT_TCP_FAILED connection status.
After the second coap_new_client_session and coap_session_set_type_server return, the first connection's COAP_EVENT_TCP_FAILED event callback occurs (I believe this is triggered by do_tcp_err from tcp_slowtmr in the tcpip_thread).
Finally, the second COAP_EVENT_TCP_FAILED event callback occurs (I believe this is triggered by do_tcp_err from tcp_slowtmr in coap_io_process_lkd).
Further Analysis:
While I could modify the code to wait for the COAP_EVENT_TCP_FAILED event before initiating a new connection, I don't think this is a fundamental solution. Therefore, I further analyzed why the do_tcp_err triggered by tcp_slowtmr in coap_io_process_lkd causes an assert failure when coap_io_process unlocks (assert failed: coap_thread_pid == global_lock.pid).
The function call sequence when the problem occurs is as follows:
- Enter
coap_io_process -
coap_lock_lock(return 0); - Enter
coap_io_process_lkd - Enter
sys_check_timeouts - Enter
handler(arg);(tcp_slowtmr) - Enter
TCP_EVENT_ERR(do_tcp_err) -
coap_lock_lock(return);
Here there are two scenarios:
7.1. If sys_mutex_new implementation defaults to RecursiveMutex, then sys_mutex_lock continues execution (my FreeRTOS is currently configured this way)
7.2. If sys_mutex_new implementation defaults to non-RecursiveMutex, then sys_mutex_lock will deadlock!!! (because step 2 already acquired the lock)
- Enter
coap_lock_unlock();(under condition 7.1) -
global_lock.pid = 0; - Return all the way back to
coap_io_process - Enter
coap_lock_unlock(); -
assert(coap_thread_pid == global_lock.pid);
At this point, global_lock.pid is 0, which is not equal to coap_thread_pid, so assert failed!!!
Solution:
I think we can remove the use of sys_check_timeouts in coap_io_process_lkd, because tcpip_thread already calls sys_check_timeouts. Correct?
UNLOCK_TCPIP_CORE();
if (context->input_wait) {
coap_lock_callback_release(context->input_wait(context->input_arg, timeout),
return 0);
#if NO_SYS == 0
} else {
coap_lock_callback_release(sys_arch_sem_wait(&coap_io_timeout_sem, timeout),
return 0);
#endif /* NO_SYS == 0 */
}
coap_lock_invert(LOCK_TCPIP_CORE(),
UNLOCK_TCPIP_CORE(); return 0);
#if 0
sys_check_timeouts(); /* remove this? */
#endif
UNLOCK_TCPIP_CORE();
coap_ticks(&now);
return (int)(((now - before) * 1000) / COAP_TICKS_PER_SECOND);
}
For https://github.com/obgm/libcoap/issues/1826#issuecomment-3641327458. Thanks for spending the time troubleshooting this and providing the detail.
I think we can remove the use of sys_check_timeouts in coap_io_process_lkd
Yes, they can be removed for NO_SYS == 0. Fixed in #1812.
coap_session_set_type_server needs to wait until "timeout waiting for first response"
coap_session_set_type_server () initiates a Ping to start communications to the remote peer (more relevant to UDP based protocols or TLS) and is waiting (for up to 5 seconds) for the CoAP TCP CMS packet exchange.
For https://github.com/obgm/libcoap/issues/1826#issuecomment-3639632738. Thanks for spending the time troubleshooting this and providing the detail.
I added LOCK_TCPIP_CORE() in coap_socket_connect_tcp1 and ran stress tests overnight (about 14 hours), and haven't encountered this issue again so far.
That is good news. Fix in #1812.
For https://github.com/obgm/libcoap/issues/1826#issuecomment-3636592413. Thanks for spending the time troubleshooting this and providing the detail with good questions.
I encountered an issue where tcp_connect was interrupted by tcp_slowtmr during tcp_output execution.
This should now be fixed in #1812 by removing sys_check_timeouts() in coap_io_process_lkd()
Re Questions
Is my analysis regarding missing LOCK_TCPIP_CORE() in TCP operations correct?
Yes. I have made some changes in #1812.
Should other libcoap lwIP operations also be protected with LOCK_TCPIP_CORE()?
Yes. I have made some changes in #1812.
Why do only write and close operations use tcpip_try_callback()? (My understanding of lwIP is still quite limited)
It is safer to pass the write / close requests over to the tcpip_thread as these are more likely to frequently occur. Having to do the lock inversion coap_lock_invert(LOCK_TCPIP_CORE(), goto err_unlock); (so we don't get global_lock->tcpip_lock and tcpip_lock->global_lock in 2 threads) means that global_lock has to get temporarily unlocked which may change the state of things by another thread. Doing this in coap_socket_connect_tcp1() should be fine.
Is it necessary for coap_io_process_lkd() to call sys_check_timeouts() when tcpip_thread already seems to handle it?
It only needs to be called if NO_SYS == 1. Fixed in #1812.
coap_session_set_type_server () initiates a Ping to start communications to the remote peer (more relevant to UDP based protocols or TLS) and is waiting (for up to 5 seconds) for the CoAP TCP CMS packet exchange.
Oh, I got it. The blocking is caused by coap_client_delay_first being called in coap_session_set_type_server.
int
coap_client_delay_first(coap_session_t *session) {
#if COAP_CLIENT_SUPPORT
if (session->type == COAP_SESSION_TYPE_CLIENT && session->doing_first) {
int timeout_ms = 5000;
coap_session_state_t current_state = session->state;
if (session->delay_recursive) {
return 0;
} else {
session->delay_recursive = 1;
}
/*
* Need to wait for first request to get out and response back before
* continuing.. Response handler has to clear doing_first if not an error.
*/
coap_session_reference_lkd(session);
while (session->doing_first != 0) {
int result = coap_io_process_lkd(session->context, 1000);
if (result < 0) {
session->doing_first = 0;
session->delay_recursive = 0;
coap_session_release_lkd(session);
return 0;
}
And thank you very much for your fix in https://github.com/obgm/libcoap/pull/1812. I will pull and use it again after this PR is merged into the develop branch.
@mrdeep1 Note that https://github.com/obgm/libcoap/pull/1812 has been merged. I'm currently testing and found a compilation failure issue caused by a typo: coap_mutex_internal.h
#define caop_thread_pid_t TaskHandle_t
Should be:
#define coap_thread_pid_t TaskHandle_t
Apologies for that. #1836 will fix this.
@mrdeep1
I used the latest develop branch code and noticed a change in the server-side behavior under call-home mode: in the current version, coap_session_set_type_server no longer remains in a waiting state, and the server also triggers the COAP_EVENT_SERVER_SESSION_CONNECTED event, behaving similarly to the client. In previous versions, the server would only trigger the COAP_EVENT_SESSION_CONNECTED event, and only the client would additionally trigger a COAP_EVENT_SERVER_SESSION_CONNECTED event after COAP_EVENT_SESSION_CONNECTED.
Currently, I execute coap_session_set_type_client in the callback for COAP_EVENT_SERVER_SESSION_CONNECTED, which is causing some issues. I'd like to confirm whether the current server-side behavior in call-home mode is the intended logic as per your design. If this is the intended behavior, I will adjust the upper-layer code to adapt to libcoap's new logic. If not, this might be a bug.
Dec 16 21:23:45.480 DEBG ***0.0.0.0:0 <-> 10.56.65.14:5683 TCP : session 0x558613df40: created outgoing session
Dec 16 21:23:45.485 DEBG ***0.0.0.0:0 <-> 10.56.65.14:5683 TCP : session now is type Server
Dec 16 21:23:45.518 DEBG ***EVENT: COAP_EVENT_TCP_CONNECTED
Dec 16 21:23:45.523 DEBG ***10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : sending CSM
Dec 16 21:23:45.537 DEBG * 10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : netif: sent 7 bytes
v:Reliable c:CSM {} [ Max-Message-Size:8388864, Block-Wise-Transfer: ]
Dec 16 21:23:45.542 DEBG * 10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : netif: recv 7 bytes
v:Reliable c:CSM {} [ Max-Message-Size:8388864, Block-Wise-Transfer: ]
Dec 16 21:23:45.542 DEBG ***10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : session connected
Dec 16 21:23:45.542 DEBG ***EVENT: COAP_EVENT_SESSION_CONNECTED
Dec 16 21:23:45.544 DEBG ***EVENT: COAP_EVENT_SERVER_SESSION_CONNECTED
Dec 16 21:23:45.555 DEBG ***10.56.65.14:36823 <-> 10.56.65.14:5683 TCP : session now is type Client
On Actual-Server, you create a client session and immediately set the session type using coap_session_set_type_server() so now on, the session is of type server and will be receiving requests. Once the server session completes, it will raise COAP_EVENT_SERVER_SESSION_CONNECTED. However, this event is only to be processed and not used to call coap_session_set_type_client() - that should only be done on Actual-Client once the (starting as) Server session has completed and also raises COAP_EVENT_SERVER_SESSION_CONNECTED to signal this.
So, only invoke coap_session_set_type_client() on the Actual-Client, not Actual-Server on receipt of COAP_EVENT_SERVER_SESSION_CONNECTED.
For Actual-Server, as the call-home client session has been converted to type Server, I think it right that when the the session (as a type Server) finishes the setup, it should report COAP_EVENT_SERVER_SESSION_CONNECTED. for consistency.