ApplicationInsights-node.js icon indicating copy to clipboard operation
ApplicationInsights-node.js copied to clipboard

Correlation context is changed for long running operations

Open johnib opened this issue 3 years ago • 6 comments

Hi, this is a tough one, appreciate your help;

Setup

  • node: v16
  • application insights: v2.1.4
    • auto collection enabled (dependencies, requests, exceptions)
    • traces are sent via winston logger connected to application insights sink
  • winston: v3.3.3
  • winston-azure-application-insights: v3.0.0
  • winston-transport: v4.4.0

General info about the application

  • This is an express js app that serves as a worker
  • A queue message is translated by another component to an HTTP call on my express app, which executes the relevant middleware to handle the message.
  • The auto collection of requests is based on the processing of the express request -- which works great for me (except for the issue described below)

The issue

AI telemetry of long running operations (long running in my case is hours) are losing correlation and spread over multiple operation IDs (other than the original one). Thus mixing up traces of multiple, unrelated operations together.

When I mention "Operation" I mean a distinct express HTTP call

When that happens, when the operation succeeds, there's no request telemetry record being sent for this operation (something that works perfectly for short running operations).

Repro

Reproducing this issue is hard, because when it happens, it happens after running for hours. I tried reproducing this for operation that ran for about ~1h with no success. In addition, I believe there should be several more operations running concurrently for this bug to realize.

Notes

I tried looking into your code, but I wasn't able to understand the magic behind keeping the correlation context correct on nodejs apps. I'd appreciate if you provide a high-level explanation of how you aim to keep the context correct given the characteristics of a nodeJS app (single thread, etc.)

johnib avatar Apr 23 '22 07:04 johnib

@johnib correlation will be stored using cls-hooked or continuation-local-storage if using node.js < 8.2.x, these libraries will take care of multiple contexts by requests and associated callbacks, sounds like something is actually failing in the middle of the long request that is causing the issue, maybe some kind of timeout or similar, trying to get more internal logs could be useful if these can be associated to the specific request though.

hectorhdzg avatar May 03 '22 22:05 hectorhdzg

Hey, I have better conjecture on what's causing this. It's unrelated directly to the duration of the operation, it just happens more frequent for long running ones.

What I believe is happening is that when the caller cancels the request, and express ends the request, my operation may be still running and producing telemetry, in which will have unexpected correlation.

@hectorhdzg can you share a short summary of what is the procedure when express ends the request, from AI sdk's side? Does it clean up the correlation data from the correlation store?

johnib avatar Jun 17 '22 22:06 johnib

Application Insights code to handle the request "ending " is available here, it should end the correlation and generate the request telemetry, is there a specific HTTP event being triggered in your case when the request is cancelled?

hectorhdzg avatar Jun 17 '22 23:06 hectorhdzg

We'll, the thing is @hectorhdzg, the fact that the request ended, doesn't mean that all related telemetry has been generated yet, nor the triggered flows have been terminated.

In my case, even though the request can be canceled by the client, I may proceed with the operation it was waiting for.

In such case, all telemetry produced will be set with correlation to other operations (unrelated ones)

What can we do about this ?

johnib avatar Jun 21 '22 18:06 johnib

@johnib context should be maintained for all code executed in the request, the SDK will bind to cls-hooked session/workspace here, code I share before just end the request and create the telemetry is not actually ending the context, that will be done automatically by cls-hooked when all code is done, if you have some small repro we may be able to figure out what is going on exactly in your case.

hectorhdzg avatar Jun 21 '22 23:06 hectorhdzg

@johnib The project has been reconfigured to run on OpenTelemetry under the hood. As such, we expect issues with correlation may be resolved. https://opentelemetry.io/docs/languages/js/propagation/

JacksonWeber avatar Apr 25 '24 21:04 JacksonWeber