server start failed for reason 'oversized message'
nri-daemon log
time="2024-08-10T06:23:28Z" level=info msg="Configuring plugin 05-Controller for runtime containerd/v1.7.16"
time="2024-08-10T06:23:28Z" level=info msg="Started plugin 05-Controller..."
time="2024-08-10T06:23:29Z" level=error msg="error receiving message" error="failed to discard after receiving oversized message: cannot allocate memory"
containerd log
time="2024-08-10T14:25:55.098248241+08:00" level=info msg="synchronizing plugin 05-Controller"
time="2024-08-10T14:25:57.098681078+08:00" level=info msg="failed to synchronize plugin: context deadline exceeded"
time="2024-08-10T14:25:57.098762853+08:00" level=info msg="plugin \"05-Controller\" connected"
I'm fairly sure the issue lies within the ttrpc repository: there is a check on the message size during reception within ttrpc. https://github.com/containerd/ttrpc/blob/655622931dab8c39a563e8c82ae90cdc748f72a1/channel.go#L126
However, there are questions:
- Is ttrpc the appropriate RPC framework for this context?
- Should the process exit when encountering such an error?
/cc @klihub
@yylt Is that a reproducible problem in your environment ? Can you give a bit more details ? I'd be interested at least in the number of pods and containers you have running in your system.
There are 150 pods, mostly in the same namespace ("default"). It might be related to the number of env variables.
The issue can be consistently reproduced in my environment; a certain number of pods is probably required to reproduce it.
And how many containers do you have altogether in those pods ?
The number of containers per pod does not seem to matter, as each sync operation is independent of the others.
Not per pod. The number of total containers in all pods. I assume we hit the ttrpc messageLengthMax limit with the sync request, so what matters is both the total number of pods and the total number of containers. That's why I'd like to know it. IOW, what does crictl ps | grep -v CONTAINER | wc -l report on the failing host?
Also it would be interesting to see the result of these:
- crictl pods -o json | wc -c
- crictl ps -o json | wc -c
# ctr -n k8s.io c ls |wc -l
643
# crictl ps | grep -v CONTAINER | wc -l
156
# crictl pods -o json | wc -c
256810
# crictl ps -o json | wc -c
200467
@yylt I have a branch with a fix that kicks plugins out if synchronization fails, which alone would provide more graceful behavior.
I also have an initial fix attempt for the size overflow, and a v1.7.16 containerd tree redirected to compile with those fixes. With that in place, the error my local test used to trigger is gone and the plugin registers successfully. Would you be able to give it a try: compile it and drop it into your test cluster to see if it gets rid of the problems on your side, too? I could then polish/finalize it a bit more and file PRs with the fixes.
OK. Is https://github.com/klihub/nri/tree/fixes/yylt-sync-failure the right branch?
@yylt Yes, but I have a directly patched v1.7.16 containerd tree pointing at that nri version and re-vendored here, so it's easier to just compile and use that...
https://github.com/klihub/containerd/tree/fixes/yylt-sync-failure
Oh, and you will need to recompile your plugin against that NRI tree as well. Otherwise the runtime-side will detect that the plugin does not have the necessary support compiled in and will kick it out during synchronization.
After replacing both nri-daemon and containerd, the sync error no longer occurs upon restart.
If nri-daemon is replaced individually, the issue still persists.
Yes, that is the expected behavior. And if you only update containerd, but run with an old plugin ('nri-daemon' I believe in your case), then the plugin should get disconnected during synchronization...
I wonder if we could provide a way to trim the data before containerd calls the NRI plugin, filtering out fields that users don't need.
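To make the trimming idea above concrete, here is a minimal sketch of filtering env entries out of a container description before it is serialized into the sync message. The Container struct and the keep-list API are hypothetical stand-ins; the real NRI container type lives in the NRI API and carries many more fields.

```go
package main

import (
	"fmt"
	"strings"
)

// Container is a hypothetical stand-in for the NRI container
// description; env vars are often the bulk of the payload.
type Container struct {
	ID  string
	Env []string // "KEY=value" entries
}

// trimEnv keeps only the env entries whose key is in keep, shrinking
// the container description before it is sent to the plugin. This
// filter is an illustration of the idea, not an existing NRI option.
func trimEnv(c *Container, keep map[string]bool) {
	out := c.Env[:0]
	for _, kv := range c.Env {
		if key, _, ok := strings.Cut(kv, "="); ok && keep[key] {
			out = append(out, kv)
		}
	}
	c.Env = out
}

func main() {
	c := &Container{
		ID:  "c1",
		Env: []string{"PATH=/usr/bin", "FOO=bar", "SECRET=x"},
	}
	trimEnv(c, map[string]bool{"PATH": true})
	fmt.Println(c.Env) // [PATH=/usr/bin]
}
```

Whether such filtering belongs in containerd or should be negotiated per plugin is a separate design question; a plugin-declared field mask would avoid dropping data some plugin actually needs.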
A fix for the oversized initial synchronization message has been merged (229236e40a59a1aa91759b6e2449663f0aac7ad9).
Fixed by #111.