ERR_QUEUE_FULL when replaying a log file over a Vector interface via python-can, while the same replay runs fine with the Vector CANoe replay block
I am trying to replay a BLF file using python-can over a Vector interface, using a MessageSync iterator and calling send() on the bus for each yielded message. It works as expected for 20-30 seconds, but after that every send raises an ERR_QUEUE_FULL exception. I have tried to handle this with can_bus.flush_tx_buffer() and can_bus.reset(), but to no effect. My understanding is that the transmit buffer fills up when messages are written too fast in a given segment of the log, causing a buffer overflow.
The same replay file runs just fine in a Vector CANoe replay block, without any buffer issues over the same replay duration.
Can anyone shed light on whether python-can and Vector CANoe (both running on a Win10 PC) configure the transmit queue buffer differently? Any suggestions on how to increase the transmit queue buffer used by python-can would be highly appreciated, along with how to handle such buffer overflows (since flush_tx_buffer has no effect). Note: in Vector Hardware Configuration, the transmit queue size is configured as 256 messages. I am not sure whether python-can uses the same configuration, and I would like to know that before I change it.
Additional context
OS and version: Win 10
Python version: Python 3
python-can version: 3.3.4
python-can interface/s (if applicable): Vector VN1630
Let me know if you need any further information. Sorry, I will not be able to share the trace file as it is proprietary, but I can get back to you with any details regarding its nature.
You probably have errors on the bus which prevent the messages from being sent. Is there another device on the CAN which can acknowledge your messages?
Is there another device on the CAN which can acknowledge your messages?
Yes. There is another real ECU that acknowledges the Tx messages. The replay runs fine if I keep a decent wait time (10 ms, the minimum that time.sleep can provide) between consecutive messages. The drawback is that with the injected wait time, the replay takes 6-7x the actual replay time. Regarding errors on the bus, I am also monitoring the CAN trace from Vector CANoe at the same time. It doesn't report any ack errors or the chip state going passive (due to multiple ack errors). As for error frames that might come from the replay file, I check whether each message is an error frame and filter those out before sending on the bus.
@akasonu: Are you filtering RX messages for the replay?
@jazi007 Assuming you are asking whether the Rx messages in the replay trace are filtered out: yes. The Rx messages are filtered out by message ID and only the Tx messages are sent, to avoid redundant CAN messages on the network. One point that might be relevant here is that the Tx traffic is high, with consecutive messages in the trace sent 1-5 ms apart in some cases.
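For reference, the filtering described above could look roughly like the sketch below. It is not the original poster's code: the function name select_tx_frames and the tx_ids set are hypothetical, and the stand-in objects only mimic the arbitration_id and is_error_frame attributes of a python-can Message.

```python
from types import SimpleNamespace


def select_tx_frames(messages, tx_ids):
    """Yield only the frames that should be replayed.

    Error frames are dropped, and only frames whose arbitration ID is in
    tx_ids (a hypothetical set of IDs this node transmits) are kept.
    """
    for msg in messages:
        if getattr(msg, "is_error_frame", False):
            continue  # never replay error frames onto the bus
        if msg.arbitration_id in tx_ids:
            yield msg


# Stand-in objects so the sketch runs without hardware or a log file.
demo_frames = [
    SimpleNamespace(arbitration_id=0x100, is_error_frame=False),
    SimpleNamespace(arbitration_id=0x200, is_error_frame=False),  # Rx ID, skipped
    SimpleNamespace(arbitration_id=0x100, is_error_frame=True),   # error frame, skipped
]
kept = list(select_tx_frames(demo_frames, tx_ids={0x100}))
```

Because the function is a generator, it can be wrapped around a log reader before the messages are handed to the pacing logic.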
Are you using can.player or have you created your own player using MessageSync class?
The Vector driver (used by python-can under the hood) does use the transmit queue you mention. However, if CANoe plays the file back correctly on the same PC, then python-can can probably play it back too.
Do you see any buffer full errors in the CANoe Write window? If so, then even CANoe can't play it back without running into errors. The difference is that CANoe will 'catch' that error (not sure if it drops frames or resends). If CANoe shows buffer full errors, then in Python you can try catching the VectorError raised on the send and then decide whether to stop, resend immediately, or resend after a delay (depending on your requirements).
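A minimal sketch of the "catch and resend after a delay" option follows. The helper send_with_retry, its parameters, and the FlakyBus and QueueFull stand-ins are all hypothetical and exist only so the example runs without hardware; with python-can, retry_on would presumably be can.CanError or the VectorError mentioned above.

```python
import time


def send_with_retry(bus, msg, retry_on, retries=5, delay=0.002, sleep=time.sleep):
    """Send msg, retrying after a short delay when the send raises retry_on.

    Returns the number of retries that were needed. Re-raises the last
    exception once the retry budget is used up, so the caller can decide
    whether to stop or to drop the frame.
    """
    for attempt in range(retries + 1):
        try:
            bus.send(msg)
            return attempt
        except retry_on:
            if attempt == retries:
                raise
            sleep(delay)  # give the hardware time to drain its queue


class QueueFull(Exception):
    """Stand-in for the driver's queue-full error (assumption)."""


class FlakyBus:
    """Fake bus that rejects the first `failures` sends."""

    def __init__(self, failures):
        self.failures = failures
        self.sent = []

    def send(self, msg):
        if self.failures > 0:
            self.failures -= 1
            raise QueueFull("ERR_QUEUE_FULL")
        self.sent.append(msg)


demo_bus = FlakyBus(failures=2)
retries_used = send_with_retry(demo_bus, "frame", QueueFull, sleep=lambda s: None)
```

Whether a retry actually clears the condition depends on why the queue filled up, so the retry count and delay here are placeholders to tune.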
Also, if CANoe is showing buffer full you can increase the transmit queue until it plays back in CANoe without errors. Then come back to your python implementation.
If CANoe doesn't show any buffer full errors on playback, then I'd suggest using can.player, as it should play back based on the timestamps in the logfile. I suspect that, if you're using your own implementation, your playback is trying to send the messages as fast as the code will run, without looking at the timestamps?
@dpatel20 I am using a MessageSync instance for the replay with timestamps=True and have experimented with different gap periods. CANoe on the same PC doesn't show buffer overflow errors at its end.
When running via python-can, ERR_QUEUE_FULL errors start popping up after around 20-30 seconds of playback. There is an exception handler in the Python code, where I tried the following to resolve the issue: flush_tx_buffer(), bus.reset(), and sleeping for some time. None of them resolves it at runtime; the only thing that works is resetting the bus from the CANoe application.
I did a little digging into the XL Driver Library and the DLL used by vxlapi.py. As far as I can tell, flush_tx_buffer() on Vector (under the hood) always returns success for new XL family devices. Also, the Vector bus.send() does not use a tx_queue_timeout (available in Kvaser), which would have made the call wait until the message is sent and ack'ed.
Buffer usage is a little unclear to me, since I have tried setting it to the maximum value as well, but python-can still breaks down, whereas CANoe works perfectly even with lower transmit buffer values. If there were a way to check the tx queue state, it could be used as a basis for evaluating whether the messages are being sent too fast.
Currently, as a workaround, I am using a bigger gap value, which spaces the messages out more than what is observed in the replay file. The only problem is that it increases the execution time to 6-7x the actual replay time.
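To make "play back based on the timestamps" concrete, here is a rough sketch of timestamp-driven pacing. It is not python-can's MessageSync or can.player implementation; replay_paced and FakeClock are hypothetical names, and the fake clock is only there so the example runs instantly and deterministically.

```python
import time
from types import SimpleNamespace


def replay_paced(messages, send, clock=time.perf_counter, sleep=time.sleep):
    """Call send(msg) for each message at its recorded offset from the first one.

    Each frame is scheduled against an absolute target time, so a late
    frame does not push every later frame back as well.
    """
    start_wall = None
    start_log = None
    for msg in messages:
        if start_log is None:
            start_log = msg.timestamp
            start_wall = clock()
        target = start_wall + (msg.timestamp - start_log)
        remaining = target - clock()
        if remaining > 0:
            sleep(remaining)  # wait only for the time still left
        send(msg)


class FakeClock:
    """Deterministic clock whose sleep() simply advances the time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
frames = [SimpleNamespace(timestamp=t) for t in (10.0, 10.001, 10.005)]
sent_at = []
replay_paced(frames, lambda m: sent_at.append(fake()), clock=fake, sleep=fake.sleep)
```

On a real system the achievable spacing is limited by how precisely sleep() can wait, so frames recorded only a few milliseconds apart may still go out in bursts.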
If CANoe is able to handle it without showing any buffer full errors, then it suggests an issue with your code.
Is there a reason you are not using the CAN player? If you can, I would start with that and see if it succeeds.
As far as I know, CANoe and python-can use the same underlying Vector driver. Both use the driver to place data into the Vector transmit buffer queue (size as per Vector Hardware Config). If CANoe is not showing errors, then the rate of sending in the replay BLF file is not the issue. So my guess would be that, somehow, your code is sending the data too fast and the buffer is getting full. Hence, I would try with the python-can player.
If that's not possible, then use CANoe in purely 'trace/log mode' and take a log while playing back the replay using Python. And then see if the time deltas between messages match that from your original replay BLF logfile. I suspect you'll find that your code is sending the data faster than was recorded in the BLF file.
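The comparison of time deltas could be sketched as follows. The function names and the 0.5 ms tolerance are assumptions for illustration, and the timestamp lists are made-up sample values; in practice they would come from the original BLF file and the CANoe log of the Python replay (for example from the timestamp attribute of messages read with can.BLFReader), restricted to the same message IDs.

```python
def inter_frame_gaps(timestamps):
    """Return the time difference between each pair of consecutive frames."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def gaps_replayed_too_fast(recorded, replayed, tolerance=0.0005):
    """List (index, recorded_gap, replayed_gap) where the replay gap was shorter.

    tolerance is the shortfall in seconds that is still accepted
    (0.5 ms here, an arbitrary choice).
    """
    pairs = zip(inter_frame_gaps(recorded), inter_frame_gaps(replayed))
    return [
        (index, rec_gap, rep_gap)
        for index, (rec_gap, rep_gap) in enumerate(pairs)
        if rec_gap - rep_gap > tolerance
    ]


# Made-up sample timestamps in seconds, not taken from any real trace.
recorded_ts = [0.000, 0.005, 0.010, 0.020]
replayed_ts = [0.000, 0.005, 0.006, 0.016]
too_fast = gaps_replayed_too_fast(recorded_ts, replayed_ts)
```

If many gaps show up in the result, the replay code is sending frames closer together than they were recorded, which would be consistent with the transmit queue filling up.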