[Core feature] Allow flyteadmin to start even if OIDC is unavailable (Improve flyteadmin startup resiliency)

Open ddl-rliu opened this issue 1 year ago • 1 comments

Tracking issue

https://github.com/flyteorg/flyte/issues/5701

Why are the changes needed?

Today, the flyteadmin pod cannot start until the OIDC provider is healthy and available; until then the pod is stuck in the Error state. In some Kubernetes configurations, this erroring pod can cause deployment-wide issues. The startup behavior could be made more resilient.

(Note that this applies to configurations using useAuth=true)

What changes were proposed in this pull request?

A better approach in these configurations is to let flyteadmin start even if the OIDC provider is unavailable, then retry initializing the OIDC provider later in the deployment's lifespan. This approach is more resilient, and it can be made configurable.

Adds an onlyStartIfOIDCIsAvailable config which controls this behavior.

How was this patch tested?

A writeup is here which shows the "good" flow when onlyStartIfOIDCIsAvailable is enabled and OIDC is unhealthy for a period: https://gist.github.com/ddl-rliu/4c09862404f46a5adbc451025160e0eb

Setup process

Screenshots

Check all the applicable boxes

  • [ ] I updated the documentation accordingly.
  • [ ] All new and existing tests passed.
  • [x] All commits are signed-off.

Related PRs

Docs link

ddl-rliu avatar Aug 28 '24 19:08 ddl-rliu

Codecov Report

Attention: Patch coverage is 2.38095% with 41 lines in your changes missing coverage. Please review.

Project coverage is 36.17%. Comparing base (f075b34) to head (080a4cf). Report is 423 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| flyteadmin/auth/auth_context.go | 2.38% | 41 Missing :warning: |
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #5702       +/-   ##
===========================================
- Coverage   60.92%   36.17%   -24.76%     
===========================================
  Files         796     1302      +506     
  Lines       51689   109627    +57938     
===========================================
+ Hits        31494    39660     +8166     
- Misses      17288    65822    +48534     
- Partials     2907     4145     +1238     
| Flag | Coverage Δ |
|---|---|
| unittests-datacatalog | 51.37% <ø> (-17.95%) :arrow_down: |
| unittests-flyteadmin | 55.29% <2.38%> (-3.44%) :arrow_down: |
| unittests-flytecopilot | 12.17% <ø> (-5.62%) :arrow_down: |
| unittests-flytectl | 62.18% <ø> (-5.24%) :arrow_down: |
| unittests-flyteidl | 7.12% <ø> (-71.92%) :arrow_down: |
| unittests-flyteplugins | 53.34% <ø> (-8.51%) :arrow_down: |
| unittests-flytepropeller | 41.71% <ø> (-15.54%) :arrow_down: |
| unittests-flytestdlib | 55.35% <ø> (-10.25%) :arrow_down: |

codecov[bot] avatar Aug 28 '24 19:08 codecov[bot]

@Sovietaced brings up a good point regarding this change.

eapolinario avatar Dec 26 '24 20:12 eapolinario

Cleaning stale PRs. Please reopen if you want to discuss this further.

eapolinario avatar Mar 03 '25 19:03 eapolinario