Adding OpenTelemetry spans on the Go client's methods, CRI, GC and HTTP calls
What is the problem you're trying to solve
containerd 1.6 will have OpenTelemetry spans, but they are added automatically around gRPC calls. There are a few areas we should have spans to make troubleshooting easier.
- The Go client's methods
- CRI
- GC operations
- HTTP calls around container image operations
Describe the solution you'd like
We need to first define our naming conversion based on https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/trace/semantic_conventions. After that we can update each component accordingly.
Additional context
No response
Wahoo! My 2c is that spans should be on every entry/exit for tracing/debugging. So for example, CRI is all gRPC entry methods, it will automatically start a span. It does not need create internal sub-spans, (unless it wants to for some reason) but when it calls out to containerd via client (a new public API), or the task, or a windows/linux syscall, etc it may want to wrap that call in a span to show it's exit from the current code and entry into the next. This allows for entry and exit measurements, profiling, and tracing. Extremely helpful for both optimizing code, but also finding those hard to reach race conditions
Maybe another example since CRI is a "special" plugin. I think this idea holds for all of our services. Ex: We have a service api/local pattern. The span should be created at the service scope, once resolved to the local implementation we dont need a sub-span (most likely) but it is helpful for that service itself to have crated the child. This easily allows us to know when we have entered the "snapshotter service" etc. to do an operation and that it took 15 seconds.
add 1.7 milestone here cc @estesp
@kzys Hi, can I work on this issue.
@lengrongfu I did some work on this already and can have a PR ready by next week. I think we can reduce some overlap here since I am mainly focusing on CRI and related go client methods. Do you have any particular area of interest or if you can wait for my PR, that will be even better.
@lengrongfu I did some work on this already and can have a PR ready by next week. I think we can reduce some overlap here since I am mainly focusing on CRI and related go client methods. Do you have any particular area of interest or if you can wait for my PR, that will be even better.
I can pay attention to the point of "HTTP calls around container image operations", mainly where it interacts with the registry. There are also interactions with the registry in the CRI. Have you already dealt with it?
I can pay attention to the point of "HTTP calls around container image operations", mainly where it interacts with the
registry. There are also interactions with theregistryin the CRI. Have you already dealt with it?
Sure, I have not looked into that yet. Also you can take a look at GC operations as well.
@swagatbora90 Hi, i will be working around The Go client's methods, will this work be repeated?
@swagatbora90 Hi, i will be working around
The Go client's methods, will this work be repeated?
HI @lengrongfu yes I am making changes to CRI runtime apis and and also adding spans in the related client methods which mainly in client.go, task.go, process.goand container.go.
I am listing out some of the current and planned items for adding otel spans to Containerd. This should help streamline some of the work, and others can contribute as well.
Current
- [ ] GC operation - #7448
- [x] CRI Image service - #7453
- [ ] CRI Runtime service - #7616
- [ ] Containerd Go client: #7616
- [ ] client.go (7616)
- [ ] container.go (7616)
- [ ] process.go (7616)
- [ ] task.go (7616)
- [x] pull.go (7453)
- [x] HTTP calls around container image operations - (7453)
- [ ] CRI stream server - (#7692)
- [x] Add Unit/Integration test #7493
This should at least cover the areas listed in this tracking issue. Though there might be scope for additional areas, like tracing calls across containerd and runc boundary.
@swagatbora90 I can look into the Add Unit/Integration test TODO task if you don't mind.