Mark McLoughlin

Results 63 comments of Mark McLoughlin

> FYI @markmc I changed the metrics logic now in this PR to not attempt to aggregate from multiple engines and just log them separately. For prometheus I added a...

I figured I'd review these metrics one by one, starting with **model_load_time**. (Incidentally, doing each of these in a separate PR might mean they get merged more quickly) A high-level...

> Provided as a log for CUDA devices Uh? How can an auto-scaler use a log message to "determine the pod autoscaling threshold and frequency" ?

It's not obvious to me that we need to calculate the sum of a number of individual timings in order to log a useful startup time timing Why not something...

It doesn't seem like we actually need an exact count of nans - we just want a signal that corruption is spiking? What happens the request when this happens? Is...

Some overlap with #18765 ... except I'm not sure this NaN case results in the request explicitly failing?

> I did not take into account Prometheus since I'm not too confident about the best design practice. I'm also hoping to get this PR merged soon so that people...

Oh, I also meant to say ... we should be a bit cautious about getting new APIs right because they're difficult/disruptive to remove later. Temporarily printing metrics as debug logging...

> Meanwhile, I kinda disagree printing metrics is the most helpful way for debugging Yep, I just mentioned it as a potential stop-gap solution if metrics take e.g. a week...