server start failed for reason 'oversized message'
nri-daemon log
time="2024-08-10T06:23:28Z" level=info msg="Configuring plugin 05-Controller for runtime containerd/v1.7.16"
time="2024-08-10T06:23:28Z" level=info msg="Started plugin 05-Controller..."
time="2024-08-10T06:23:29Z" level=error msg="error receiving message" error="failed to discard after receiving oversized message: cannot allocate memory"
containerd log
time="2024-08-10T14:25:55.098248241+08:00" level=info msg="synchronizing plugin 05-Controller"
time="2024-08-10T14:25:57.098681078+08:00" level=info msg="failed to synchronize plugin: context deadline exceeded"
time="2024-08-10T14:25:57.098762853+08:00" level=info msg="plugin \"05-Controller\" connected"
I'm fairly sure the issue lies within the ttrpc repository: there is a check on the message size during reception within ttrpc. https://github.com/containerd/ttrpc/blob/655622931dab8c39a563e8c82ae90cdc748f72a1/channel.go#L126
However, there are questions:
- Is ttrpc the appropriate RPC framework for this context?
- Should the process exit when encountering such an error?
/cc @klihub
@yylt Is that a reproducible problem in your environment ? Can you give a bit more details ? I'd be interested at least in the number of pods and containers you have running in your system.
There are 150 pods, mostly in the same namespace ("default"). It might be related to the number of env variables.
The issue can be consistently reproduced in my environment; a certain number of pods is probably required to reproduce it.
And how many containers do you have altogether in those pods ?
The number of containers per pod does not seem to matter, as each sync operation is independent of the others.
Not per pod. The number of total containers in all pods. I assume we hit the ttrpc messageLengthMax limit with the sync request, so what matters is both the total number of pods and the total number of containers. That's why I'd like to know it. IOW, what does crictl ps | grep -v CONTAINER | wc -l report on the failing host?
Also it would be interesting to see the result of these:
- crictl pods -o json | wc -c
- crictl ps -o json | wc -c
# ctr -n k8s.io c ls |wc -l
643
# crictl ps | grep -v CONTAINER | wc -l
156
# crictl pods -o json | wc -c
256810
# crictl ps -o json | wc -c
200467
@yylt I have a branch with a fix that kicks plugins out if synchronization fails, which alone would provide more graceful behavior.
I also have an initial fix attempt for the size overflow, and a v1.7.16 containerd tree redirected to compile with those fixes. With that in place, the error my local test used to trigger is gone and the plugin registers successfully. Would you be able to give it a try: compile it and drop it into your test cluster to see if it gets rid of the problems on your side, too? I could then polish/finalize it a bit more and file PRs with the fixes.
OK. Is https://github.com/klihub/nri/tree/fixes/yylt-sync-failure the right branch?
@yylt Yes, but I have a directly patched v1.7.16 containerd tree pointing at that nri version and re-vendored here, so it's easier to just compile and use that...
https://github.com/klihub/containerd/tree/fixes/yylt-sync-failure
Oh, and you will need to recompile your plugin against that NRI tree as well. Otherwise the runtime-side will detect that the plugin does not have the necessary support compiled in and will kick it out during synchronization.
After replacing both nri-daemon and containerd, the sync error no longer occurs upon restart.
If nri-daemon is replaced individually, the issue still persists.
Yes, that is the expected behavior. And if you only update containerd, but run with an old plugin ('nri-daemon' I believe in your case), then the plugin should get disconnected during synchronization...
I wonder if we could provide a way to trim the data before containerd calls the NRI plugin, filtering out fields that users don't need.
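To make the trimming idea above concrete, here is a minimal sketch of filtering env entries out of a container description before it is serialized into the sync message. The Container struct and the keep-list API are hypothetical stand-ins; the real NRI container type lives in the NRI API and carries many more fields.

```go
package main

import (
	"fmt"
	"strings"
)

// Container is a hypothetical stand-in for the NRI container
// description; env vars are often the bulk of the payload.
type Container struct {
	ID  string
	Env []string // "KEY=value" entries
}

// trimEnv keeps only the env entries whose key is in keep, shrinking
// the container description before it is sent to the plugin. This
// filter is an illustration of the idea, not an existing NRI option.
func trimEnv(c *Container, keep map[string]bool) {
	out := c.Env[:0]
	for _, kv := range c.Env {
		if key, _, ok := strings.Cut(kv, "="); ok && keep[key] {
			out = append(out, kv)
		}
	}
	c.Env = out
}

func main() {
	c := &Container{
		ID:  "c1",
		Env: []string{"PATH=/usr/bin", "FOO=bar", "SECRET=x"},
	}
	trimEnv(c, map[string]bool{"PATH": true})
	fmt.Println(c.Env) // [PATH=/usr/bin]
}
```

Whether such filtering belongs in containerd or should be negotiated per plugin is a separate design question; a plugin-declared field mask would avoid dropping data some plugin actually needs.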
A fix for the oversized initial synchronization message has been merged (229236e40a59a1aa91759b6e2449663f0aac7ad9).
Fixed by #111.