
[Q] Metric dropouts when using GRAPHITE_CLUSTER_SERVERS

Open kgroshert opened this issue 3 years ago • 4 comments

I'm not sure how to debug this, any help would be appreciated.

I have a chain of docker-containers:

Grafana (Host A) -> Graphite-Web (Host B) -> Graphite-Web Cluster Servers (Host C, D, E, ...) with local Go-Graphite

If I graph multiple metrics, sometimes one metric is missing or stops right in the middle of data (not all datapoints are returned).

I tried to narrow it down by leaving out the graphite-web on Host B, in which case the problem never happens. I attached a screenshot: these are exactly the same panels and Graphite queries, but the first row goes through the graphite-web with CLUSTER_SERVERS and the second row connects Grafana directly to Graphite on Host C:

[screenshot: identical panels; first row via graphite-web with CLUSTER_SERVERS, second row Grafana connected directly to Graphite on Host C]

My graphite queries (per panel) look like this:

graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out
graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in

On initial dashboard load, sometimes something is missing; if I press refresh, it usually all works. Therefore I suspect some kind of caching mechanism in graphite-web.

Here are logs from 6 identical panels. 3 show in+out, 3 show only out (the 'in' metric is missing):

```
==> info.log <==
2022-05-03,13:29:59.889 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in" took 0.000227213s
2022-05-03,13:29:59.893 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out" took 0.000151157s
2022-05-03,13:29:59.978 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in" took 0.000159979s
2022-05-03,13:29:59.981 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out" took 0.000142097s
2022-05-03,13:30:00.067 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in" took 0.000159025s
2022-05-03,13:30:00.070 :: graphite.render.datalib.fetchData :: lookup and merge of "graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out" took 0.000117064s

==> rendering.log <==
2022-05-03,13:29:59.885 :: Fetched data for [graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out, graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in] in 0.081730s
2022-05-03,13:29:59.894 :: json rendering time 0.000610
2022-05-03,13:29:59.894 :: Total request processing time 0.099860
2022-05-03,13:29:59.973 :: Fetched data for [graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out, graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in] in 0.071881s
2022-05-03,13:29:59.983 :: json rendering time 0.001315
2022-05-03,13:29:59.983 :: Total request processing time 0.087413
2022-05-03,13:30:00.064 :: Fetched data for [graphite_.hamburg01.Interface_GigabitEthernet1_0_10.out, graphite_.hamburg01.Interface_GigabitEthernet1_0_10.in] in 0.075265s
2022-05-03,13:30:00.071 :: json rendering time 0.001026
2022-05-03,13:30:00.071 :: Total request processing time 0.086037

==> cache.log <==
2022-05-03,13:29:59.795 :: Request-Cache miss [e73b89076867897730ee78b0c861a8e4]
2022-05-03,13:29:59.795 :: Data-Cache miss [54b2233a1b24554210cfbf27b0b888f7]
2022-05-03,13:29:59.896 :: Request-Cache miss [e73b89076867897730ee78b0c861a8e4]
2022-05-03,13:29:59.896 :: Data-Cache miss [54b2233a1b24554210cfbf27b0b888f7]
2022-05-03,13:29:59.986 :: Request-Cache miss [e73b89076867897730ee78b0c861a8e4]
2022-05-03,13:29:59.986 :: Data-Cache miss [54b2233a1b24554210cfbf27b0b888f7]
```

My config for the cluster servers looks like this:

GRAPHITE_CLUSTER_SERVERS="http://hostc:3443?format=msgpack,http://hostd:3443?format=msgpack,http://hoste:3443?format=msgpack,http://hostf:3443?format=msgpack,http://hostg:3443?format=msgpack"
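For reference, in plain local_settings.py terms this should correspond to something like the following (a sketch, assuming the docker image maps the GRAPHITE_CLUSTER_SERVERS env var straight onto graphite-web's CLUSTER_SERVERS setting; hosts and ports are the ones from the line above):

```python
# local_settings.py -- sketch of the cluster setup, assuming the
# GRAPHITE_CLUSTER_SERVERS env var is translated 1:1 into this setting.
# Each entry is a remote backend that graphite-web fans queries out to;
# the ?format=msgpack suffix requests msgpack-encoded responses.
CLUSTER_SERVERS = [
    "http://hostc:3443?format=msgpack",
    "http://hostd:3443?format=msgpack",
    "http://hoste:3443?format=msgpack",
    "http://hostf:3443?format=msgpack",
    "http://hostg:3443?format=msgpack",
]
```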

Another thing: it seems to affect only the second metric, never the first one (I never get a completely empty result).

Is there anything I can tune in local_settings to narrow down the problem?

Thanks, Kai

kgroshert avatar May 03 '22 14:05 kgroshert

To try something, I added this option to the frontend graphite-web container:

GRAPHITE_REMOTE_BUFFER_SIZE=0

and this has fixed the problem for now. Is this expected behaviour or a bug?
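If it helps with reproducing: as far as I understand, this should be equivalent to setting the following in local_settings.py (a sketch, assuming the image maps the env var onto graphite-web's REMOTE_BUFFER_SIZE setting, where 0 means remote responses are streamed rather than buffered in memory):

```python
# local_settings.py -- sketch, assuming GRAPHITE_REMOTE_BUFFER_SIZE maps
# directly onto this setting. 0 should disable buffering of responses from
# the cluster servers so they are parsed as they stream in.
REMOTE_BUFFER_SIZE = 0
```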

kgroshert avatar May 17 '22 09:05 kgroshert

Hi @kgroshert

I still have no explanation for the behaviour above, but if you're using go-carbon on hosts C, D, E, etc., you can theoretically omit graphite-web on those hosts and connect to the carbonserver interface of go-carbon directly, i.e.:

  • on host B:
GRAPHITE_CLUSTER_SERVERS="http://hostc:8080,http://hostd:8080,http://hoste:8080,http://hostf:8080,http://hostg:8080"
  • on hosts C, D, E, ... in go-carbon.conf:
[carbonserver]
listen = "0.0.0.0:8080"
enabled = true
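Before pointing the frontend at it, you could sanity-check that carbonserver answers on the new port with a small script like this (a sketch, assuming carbonserver exposes the usual /metrics/find/ endpoint there; host and query pattern are placeholders):

```python
# check_carbonserver.py -- sketch: verify go-carbon's carbonserver HTTP
# interface responds to a metrics/find query. "hostc" and the query
# pattern are placeholders for your own values.
import urllib.request

url = "http://hostc:8080/metrics/find/?format=json&query=*"
with urllib.request.urlopen(url, timeout=10) as resp:
    print(resp.status)        # expect 200 when [carbonserver] is enabled
    print(resp.read()[:500])  # first part of the body, just to eyeball it
```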

deniszh avatar May 22 '22 11:05 deniszh

After revisiting this, I think I spoke too soon: GRAPHITE_REMOTE_BUFFER_SIZE=0 did not fix it. I will try to reconfigure graphite-web to use carbonserver directly, as you suggested, and report back.

kgroshert avatar Jul 07 '22 06:07 kgroshert

Hi @deniszh,

Sorry for the late answer. I implemented your recommendation and connected the frontend graphite-web directly to the carbonservers on port 8080, and this fixes the problem.

kgroshert avatar Aug 10 '22 14:08 kgroshert

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 16 '22 10:10 stale[bot]