
Frontend retries genuine errors

Open gouthamve opened this issue 7 years ago • 6 comments

Relevant log line:

level=error ts=2018-10-04T09:20:28.637542898Z caller=frontend.go:203 msg="error processing request" try=4 err=null resp="&HTTPResponse{Code:500,Headers:[&Header{Key:Access-Control-Allow-Methods,Values:[GET, OPTIONS],} &Header{Key:Access-Control-Allow-Origin,Values:[*],} &Header{Key:Access-Control-Expose-Headers,Values:[Date],} &Header{Key:Content-Type,Values:[application/json],} &Header{Key:Content-Encoding,Values:[gzip],} &Header{Key:Access-Control-Allow-Headers,Values:[Accept, Authorization, Content-Type, Origin],}],Body:[31 139 8 0 0 0 0 0 0 255 68 200 65 10 2 49 12 5 208 171 132 191 118 46 208 115 120 129 140 243 209 66 155 104 147 34 131 120 119 17 145 89 61 120 47 68 106 206 64 1 199 240 129 211 207 243 126 39 10 170 37 135 105 251 55 10 186 218 190 164 47 95 165 107 94 110 213 174 98 158 162 173 249 147 91 57 182 233 202 22 210 103 164 172 148 105 245 49 41 110 226 70 137 186 17 239 15 0 0 0 255 255 1 0 0 255 255 49 130 136 63 129 0 0 0],}"

Me debugging it: https://play.golang.com/p/iqo4VwjVWrg

Turns out the error is:

{"status":"error","errorType":"internal","error":"many-to-many matching not allowed: matching labels must be unique on one side"}

We should also print the body properly and not make people jump through the hoops I did to figure it out.

gouthamve avatar Oct 04 '18 09:10 gouthamve

Also, we don't propagate the real error, but "query failed 5 times" or something.

tomwilkie avatar Oct 04 '18 10:10 tomwilkie

I'm going to try to tackle this one.

rfratto avatar Nov 20 '19 19:11 rfratto

I should've read Goutham's comment before digging into this, but it gave me an opportunity to understand the query frontend more. It does seem this has been fixed upstream; I can't reproduce the specific error myself.

It doesn't look like the byte slice gets logged anymore, either, so that issue is also fixed.

@gouthamve should this be closed?

rfratto avatar Nov 21 '19 18:11 rfratto

@rfratto can we have a unit test that confirms this?

tomwilkie avatar Nov 25 '19 19:11 tomwilkie

@tomwilkie I'm not sure we're missing any unit tests for this one. The retry middleware in pkg/querier/queryrange has sufficient tests to make sure only 500s get retried, and the underlying issue here was that the PromQL engine was returning a genuine error as a 5xx. Could you point me in the right direction for where you'd like to see more test coverage?

rfratto avatar Dec 02 '19 19:12 rfratto

Any news about this issue?

aymericDD avatar Mar 21 '22 17:03 aymericDD