erlang-server-sdk icon indicating copy to clipboard operation
erlang-server-sdk copied to clipboard

Redis errors put the system in a permanant bad state

Open hexedpackets opened this issue 3 months ago • 2 comments

Is this a support request? No

Describe the bug When using the Redis backing store, network errors can take down that section of the supervisor tree without crashing the whole client. This leads to an irrecoverable state where every lookup will return the supplied default value - reconnecting to Redis will never be attempted. I observed this when seeing timeouts to Redis, not sure if the same behavior happens for other network issues but I suspect it does.

When eredis starts returning an error tuple of {:error, :no_connection}, the ldclient_storage_redis_server_default process fails to pattern match on it and crashes. This causes the supervising process, ldclient_storage_redis_sup_default, to restart it.

The supervisors are aggressive about giving up on crash recovery. The ldclient_storage_redis_sup process is set to terminate itself if the child process ever crashes - so both will never be restarted. Its parent, ldclient_instance_sup, will recover from a single crash in 5s - any more than that and it will also give up.

When ldclient_instance_sup gives up and terminates itself, its brought back up - but doesn't try to start redis again. The problem seems to be that redis tree is added as a child to ldclient_instance_sup in the start/3 function of ldclient_instance. That method is then never called again - when ldclient_instance_sup is restarted, only the children defined in its init/1 function are created.

Making this worse, ldclient_eval will discard errors coming from the lookup and return the supplied default. This means my application can't detect the failure and terminate itself or otherwise react to it.

I could be missing something (I found the various supervisors with names not matching their modules hard to reason about) but that's my best understanding of whats happening.

To reproduce

  • Configure the sdk to use redis with use_ldd: true
  • Inject latency greater than the 5s timeout. I have redis running in Docker and did this with pumba: pumba netem -d 5m delay -t 6000 redis
  • Query a flag, :ldclient.variation(flag, ld_context, false)
  • Immediately repeat the query when the first one times out
  • Wait for the second timeout/crash
  • The third+ attempts will immediately return the default with no errors

Expected behavior I would expect the supervisor to restart all children. Failing that, if the whole system is unusable the error should be propagated up instead of being silently discarded.

Logs

iex(6)> mods = [:ldclient_instance, :ldclient_instance_sup, :ldclient_sup, :ldclient_storage_redis]
[:ldclient_instance, :ldclient_instance_sup, :ldclient_sup,
 :ldclient_storage_redis]
iex(7)> context = :ldclient_context.new("abc")
%{key: "abc", kind: "user"}

# Query running successfully
iex(8)> :ldclient.variation("example-integer", context, false)
500

#
# Started network latency injection here
#

iex(9)> mods |> Enum.map(fn m -> {m, :_, :return_trace} end) |> :recon_trace.calls({100, 1000}, scope: :local)
35
iex(10)> :ldclient.variation("example-integer", context, false)

17:3:01.046082 <0.1149.0> ldclient_instance:update_processor_initialized(default)

17:3:01.046374 <0.1149.0> ldclient_instance:update_processor_initialized/1 --> false

17:3:01.046578 <0.1149.0> ldclient_instance:feature_store_initialized(default)

17:3:01.046755 <0.1149.0> ldclient_instance:feature_store_initialized/1 --> true

17:3:01.046873 <0.1149.0> ldclient_storage_redis:get(default, features, <<"example-integer">>)

17:3:01.047021 <0.1149.0> ldclient_storage_redis:get_local_reg_name(worker, default)

17:3:01.047155 <0.1149.0> ldclient_storage_redis:get_local_reg_name/2 --> ldclient_storage_redis_server_default
false
[error] module=gen_server GenServer :ldclient_storage_redis_server_default terminating
** (stop) exited in: :gen_server.call(#PID<0.601.0>, {:request, [["*", ~c"3", "\r\n"], [["$", ~c"4", "\r\n", "HGET", "\r\n"], ["$", ~c"23", "\r\n", "ld_Development:features", "\r\n"], ["$", ~c"15", "\r\n", "example-integer", "\r\n"]]]}, 5000)
    ** (EXIT) time out
    (stdlib 6.2.2.2) gen_server.erl:1222: :gen_server.call/3
    (ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis_server.erl:268: :ldclient_storage_redis_server.lookup_key/5
    (ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis_server.erl:163: :ldclient_storage_redis_server.handle_call/3
    (stdlib 6.2.2.2) gen_server.erl:2381: :gen_server.try_handle_call/4
    (stdlib 6.2.2.2) gen_server.erl:2410: :gen_server.handle_msg/6
    (stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.1149.0>): {:get, :features, "example-integer"}
State: %{tag: :default, prefix: ~c"ld_Development", client: #PID<0.601.0>, buckets: [:segments, :features]}
Client #PID<0.1149.0> is alive

    (iex 1.18.4) lib/iex/evaluator.ex:186: IEx.Evaluator.loop/1
    (iex 1.18.4) lib/iex/evaluator.ex:34: IEx.Evaluator.init/5
    (stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
[notice] Starting redis server with name :ldclient_storage_redis_server_default
iex(11)> :ldclient.variation("example-integer", context, false)

17:3:07.445872 <0.1149.0> ldclient_instance:update_processor_initialized(default)

17:3:07.446192 <0.1149.0> ldclient_instance:update_processor_initialized/1 --> false

17:3:07.446423 <0.1149.0> ldclient_instance:feature_store_initialized(default)

17:3:07.446579 <0.1149.0> ldclient_instance:feature_store_initialized/1 --> false

17:3:07.446759 <0.1149.0> ldclient_storage_redis:get_init(default)

17:3:07.446928 <0.1149.0> ldclient_storage_redis:get_local_reg_name(worker, default)

17:3:07.447058 <0.1149.0> ldclient_storage_redis:get_local_reg_name/2 --> ldclient_storage_redis_server_default
[warning] module=eredis_client eredis: Initial connect failed: {:authentication_error, :timeout}
[error] module=gen_server GenServer :ldclient_storage_redis_server_default terminating
** (MatchError) no match of right hand side value: {:error, :no_connection}
    (ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis_server.erl:340: :ldclient_storage_redis_server.get_init/2
    (ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis_server.erl:175: :ldclient_storage_redis_server.handle_call/3
    (stdlib 6.2.2.2) gen_server.erl:2381: :gen_server.try_handle_call/4
    (stdlib 6.2.2.2) gen_server.erl:2410: :gen_server.handle_msg/6
    (stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.1149.0>): {:get_init}
State: %{tag: :default, prefix: ~c"ld_Development", client: #PID<0.1160.0>, buckets: []}
Client #PID<0.1149.0> is alive

    (stdlib 6.2.2.2) gen.erl:260: :gen.do_call/4
    (stdlib 6.2.2.2) gen_server.erl:1138: :gen_server.call/2
    (ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_storage_redis.erl:92: :ldclient_storage_redis.get_init/1
    (ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_eval.erl:293: :ldclient_eval.get_initialization_state/2
    (ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient_eval.erl:69: :ldclient_eval.flag_key_for_context/4
    (ldclient 3.8.0) /my-repo/deps/ldclient/src/ldclient.erl:144: :ldclient.variation/4
    (elixir 1.18.4) elixir.erl:386: :elixir.eval_external_handler/3
    (stdlib 6.2.2.2) erl_eval.erl:919: :erl_eval.do_apply/7
    (elixir 1.18.4) elixir.erl:364: :elixir.eval_forms/4
    (elixir 1.18.4) lib/module/parallel_checker.ex:120: Module.ParallelChecker.verify/1
    (iex 1.18.4) lib/iex/evaluator.ex:336: IEx.Evaluator.eval_and_inspect/3
    (iex 1.18.4) lib/iex/evaluator.ex:310: IEx.Evaluator.eval_and_inspect_parsed/3
    (iex 1.18.4) lib/iex/evaluator.ex:299: IEx.Evaluator.parse_eval_inspect/4
    (iex 1.18.4) lib/iex/evaluator.ex:189: IEx.Evaluator.loop/1
    (iex 1.18.4) lib/iex/evaluator.ex:34: IEx.Evaluator.init/5
    (stdlib 6.2.2.2) proc_lib.erl:329: :proc_lib.init_p_do_apply/3
false

17:3:11.092550 <0.491.0> ldclient_instance_sup:start_link(ldclient_instance_default, ldclient_instance_stream_default, ldclient_update_null_server, ldclient_instance_events_default, default)
[notice] Terminating event service, reason: :shutdown
[notice] Terminating, reason: :shutdown; Pid none


17:3:11.095185 <0.1161.0> ldclient_instance_sup:init([ldclient_instance_stream_default,ldclient_update_null_server,
 ldclient_instance_events_default,default])
[notice] Starting instance supervisor for :default with name :ldclient_instance_default
[notice] Starting event supervisor for :default with name :ldclient_instance_events_default

17:3:11.095501 <0.1161.0> ldclient_instance_sup:children(ldclient_instance_stream_default, ldclient_update_null_server, ldclient_instance_events_default, default)
[notice] Starting event storage server for :default with name :ldclient_event_server_default
[notice] Starting event processor for :default with name :ldclient_event_process_server_default

17:3:11.096708 <0.1161.0> ldclient_instance_sup:children/4 --> [{ldclient_update_sup,
     {ldclient_update_sup,start_link,
         [ldclient_instance_stream_default,ldclient_update_null_server]},
     permanent,5000,supervisor,
     [ldclient_update_sup]},
 {ldclient_event_sup,
     {ldclient_event_sup,start_link,
         [ldclient_instance_events_default,default]},
     permanent,5000,supervisor,
     [ldclient_event_sup]}]

17:3:11.096988 <0.1161.0> ldclient_instance_sup:init/1 --> {ok,{{one_for_one,1,5},
     [{ldclient_update_sup,
          {ldclient_update_sup,start_link,
              [ldclient_instance_stream_default,ldclient_update_null_server]},
          permanent,5000,supervisor,
          [ldclient_update_sup]},
      {ldclient_event_sup,
          {ldclient_event_sup,start_link,
              [ldclient_instance_events_default,default]},
          permanent,5000,supervisor,
          [ldclient_event_sup]}]}}

17:3:11.098170 <0.491.0> ldclient_instance_sup:start_link/5 --> {ok,<0.1161.0>}

#
# All calls now return the default immediately
#
iex(12)> :ldclient.variation("example-integer", context, false)

17:3:12.229785 <0.1149.0> ldclient_instance:update_processor_initialized(default)
false

17:3:12.230080 <0.1149.0> ldclient_instance:update_processor_initialized/1 --> false

17:3:12.230291 <0.1149.0> ldclient_instance:feature_store_initialized(default)

17:3:12.230491 <0.1149.0> ldclient_instance:feature_store_initialized/1 --> false

17:3:12.230623 <0.1149.0> ldclient_storage_redis:get_init(default)

17:3:12.230738 <0.1149.0> ldclient_storage_redis:get_local_reg_name(worker, default)

17:3:12.230860 <0.1149.0> ldclient_storage_redis:get_local_reg_name/2 --> ldclient_storage_redis_server_default
iex(13)> 

SDK version Running from git, bb5bc812f0d4e2e5fafa131d969fdf43afaa64b1

Language version, developer tools Elixir 1.18.4 / OTP 27

hexedpackets avatar Oct 27 '25 21:10 hexedpackets

Hello @hexedpackets,

Thank you for the report. We will look into it.

Thank you, Ryan

Filed internally as SDK-1517

kinyoklion avatar Oct 27 '25 21:10 kinyoklion

Just to add a bit more clarity, here's the supervision tree at startup:

Image

and after the irrecoverable crash:

Image

hexedpackets avatar Oct 27 '25 21:10 hexedpackets

@hexedpackets 3.8.1 contains a number of changes to incrementally improve the redis persistent store implementation.

It doesn't yet address all the concerned that you have raised; it makes the store more durable, but doesn't provide insight into the store health.

I cannot provide an ETA, but the answer for that insight will be support for a data store status feature as we have in other SDKs. An example here: https://github.com/launchdarkly/dotnet-core/blob/21cfe9fafe21f2dbabb50b5e125e7918493edba4/pkgs/sdk/server/src/Interfaces/IDataStoreStatusProvider.cs#L22

Thank you, Ryan

kinyoklion avatar Nov 12 '25 17:11 kinyoklion