
[Q] Metric dropouts when using GRAPHITE_CLUSTER_SERVERS

Open kgroshert opened this issue 3 years ago • 4 comments

I'm not sure how to debug this, any help would be appreciated.

I have a chain of docker-containers:

Grafana (Host A) -> Graphite-Web (Host B) -> Graphite-Web Cluster Servers (Host C, D, E, ...) with local Go-Graphite

If I graph multiple metrics, sometimes one metric is missing or stops right in the middle of data (not all datapoints are returned).

I tried to narrow it down by leaving out the graphite-web on Host B, in which case the problem never happens. I attached a screenshot: these are exactly the same panels and Graphite queries, but the first row goes through the graphite-web with CLUSTER_SERVERS and the second row connects Grafana directly to Graphite on Host C:

[screenshot: identical panels; first row via graphite-web with CLUSTER_SERVERS, second row Grafana connected directly to Graphite on Host C]

My graphite queries (per panel) look like this:

graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out
graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in

On initial dashboard load, sometimes something is missing; if I press refresh, it usually all works. Therefore I suspect some kind of caching mechanism in graphite-web.

Here are logs from 6 identical panels. 3 show in+out, 3 show only out (the 'in' metric is missing):

```
==> info.log <==
2022-05-03,13:29:59.889 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in" took 0.000227213s
2022-05-03,13:29:59.893 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out" took 0.000151157s
2022-05-03,13:29:59.978 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in" took 0.000159979s
2022-05-03,13:29:59.981 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out" took 0.000142097s
2022-05-03,13:30:00.067 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in" took 0.000159025s
2022-05-03,13:30:00.070 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out" took 0.000117064s

==> rendering.log <==
2022-05-03,13:29:59.885 :: Fetched data for [graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out, graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in] in 0.081730s
2022-05-03,13:29:59.894 :: json rendering time 0.000610
2022-05-03,13:29:59.894 :: Total request processing time 0.099860
2022-05-03,13:29:59.973 :: Fetched data for [graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out, graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in] in 0.071881s
2022-05-03,13:29:59.983 :: json rendering time 0.001315
2022-05-03,13:29:59.983 :: Total request processing time 0.087413
2022-05-03,13:30:00.064 :: Fetched data for [graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out, graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in] in 0.075265s
2022-05-03,13:30:00.071 :: json rendering time 0.001026
2022-05-03,13:30:00.071 :: Total request processing time 0.086037

==> cache.log <==
2022-05-03,13:29:59.795 :: Request-Cache miss [e73b89076867897730ee78b0c861a8e4]
2022-05-03,13:29:59.795 :: Data-Cache miss [54b2233a1b24554210cfbf27b0b888f7]
2022-05-03,13:29:59.896 :: Request-Cache miss [e73b89076867897730ee78b0c861a8e4]
2022-05-03,13:29:59.896 :: Data-Cache miss [54b2233a1b24554210cfbf27b0b888f7]
2022-05-03,13:29:59.986 :: Request-Cache miss [e73b89076867897730ee78b0c861a8e4]
2022-05-03,13:29:59.986 :: Data-Cache miss [54b2233a1b24554210cfbf27b0b888f7]
```

My config for the cluster servers looks like this:

GRAPHITE_CLUSTER_SERVERS="http://hostc:3443?format=msgpack,http://hostd:3443?format=msgpack,http://hoste:3443?format=msgpack,http://hostf:3443?format=msgpack,http://hostg:3443?format=msgpack"
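For reference, in plain local_settings.py terms this should correspond to something like the following (a sketch, assuming the docker image maps the GRAPHITE_CLUSTER_SERVERS env var straight onto graphite-web's CLUSTER_SERVERS setting; hosts and ports are the ones from the line above):

```python
# local_settings.py -- sketch of the cluster setup, assuming the
# GRAPHITE_CLUSTER_SERVERS env var is translated 1:1 into this setting.
# Each entry is a remote backend that graphite-web fans queries out to;
# the ?format=msgpack suffix requests msgpack-encoded responses.
CLUSTER_SERVERS = [
    "http://hostc:3443?format=msgpack",
    "http://hostd:3443?format=msgpack",
    "http://hoste:3443?format=msgpack",
    "http://hostf:3443?format=msgpack",
    "http://hostg:3443?format=msgpack",
]
```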

Another thing: it seems to affect only the second metric, never the first one (I never get a completely empty result).

Is there anything I can tune in local_settings to narrow down the problem?

Thanks, Kai

kgroshert avatar May 03 '22 14:05 kgroshert

To try something, I added this option to the frontend graphite-web container:

GRAPHITE_REMOTE_BUFFER_SIZE=0

and this has fixed the problem for now. Is this expected behaviour or a bug?
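If it helps with reproducing: as far as I understand, this should be equivalent to setting the following in local_settings.py (a sketch, assuming the image maps the env var onto graphite-web's REMOTE_BUFFER_SIZE setting, where 0 means remote responses are streamed rather than buffered in memory):

```python
# local_settings.py -- sketch, assuming GRAPHITE_REMOTE_BUFFER_SIZE maps
# directly onto this setting. 0 should disable buffering of responses from
# the cluster servers so they are parsed as they stream in.
REMOTE_BUFFER_SIZE = 0
```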

kgroshert avatar May 17 '22 09:05 kgroshert

Hi @kgroshert

I still have no explanation for the behaviour above, but if you're using go-carbon on hosts C, D, E, etc., you can theoretically omit graphite-web on those hosts and connect to the carbonserver interface of go-carbon directly, i.e.:

  • on host B:
GRAPHITE_CLUSTER_SERVERS="http://hostc:8080,http://hostd:8080,http://hoste:8080,http://hostf:8080,http://hostg:8080"
  • on hosts C, D, E, ... in go-carbon.conf:
[carbonserver]
listen = "0.0.0.0:8080"
enabled = true
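Before pointing the frontend at it, you could sanity-check that carbonserver answers on the new port with a small script like this (a sketch, assuming carbonserver exposes the usual /metrics/find/ endpoint there; host and query pattern are placeholders):

```python
# check_carbonserver.py -- sketch: verify go-carbon's carbonserver HTTP
# interface responds to a metrics/find query. "hostc" and the query
# pattern are placeholders for your own values.
import urllib.request

url = "http://hostc:8080/metrics/find/?format=json&query=*"
with urllib.request.urlopen(url, timeout=10) as resp:
    print(resp.status)        # expect 200 when [carbonserver] is enabled
    print(resp.read()[:500])  # first part of the body, just to eyeball it
```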

deniszh avatar May 22 '22 11:05 deniszh

After revisiting this, I think I spoke too soon: GRAPHITE_REMOTE_BUFFER_SIZE=0 did not fix it. I will try to reconfigure graphite-web to use carbonserver directly, as you suggested, and report back.

kgroshert avatar Jul 07 '22 06:07 kgroshert

Hi @deniszh,

Sorry for the late answer. I implemented your recommendation and connected the frontend graphite-web directly to the carbonservers on port 8080, and this fixes the problem.

kgroshert avatar Aug 10 '22 14:08 kgroshert

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 16 '22 10:10 stale[bot]