cloud-sql-proxy-operator icon indicating copy to clipboard operation
cloud-sql-proxy-operator copied to clipboard

InitContainer Ordering Issue with CloudSQL Operator 1.6.0 in Istio-Managed Environments

Open utmaks opened this issue 1 year ago • 17 comments

Expected Behavior

Right order for initContainers / integration with Istio.

Actual Behavior

Description: During the upgrade of CloudSQL Operator to version 1.6.0, issues were encountered when deploying new instances of the application. While the existing application instances continued functioning as expected, attempts to start new pods failed due to critical container startup errors, which would be unacceptable in a production environment.

Details: After upgrading to version 1.6.0 to utilize the minSigtermDelay parameter, the following problem was observed:

  • New instances of the application failed to initialize due to startup probe errors from the CloudSQL container: Startup probe failed: Get "http://10.210.0.89:15020/app-health/csql-apps-apps/startupz": dial tcp 10.210.0.89:15020: connect: connection refused.
  • Existing pods already in operation continued to function correctly, with no observed disruptions in their database interactions.

The issue seems tied to Istio sidecar injection. Specifically:

  • New pods attempted to start the istio-init container but failed before Istio was fully initialized.
  • The behavior differs from previous versions of the operator, which allowed greater compatibility in Istio-managed environments.

Resolution Attempt: Disabling Istio temporarily resolved the problem, allowing the CloudSQL container to start correctly in the new pods. However, this is not an acceptable solution for production workloads where Istio is required.

Proposed Fix for CloudSQL Operator team: Reintroduce user-configurable options for sidecar compatibility (e.g., sidecarType), as seen in earlier commits, to prevent such failures in Istio environments and ensure robust behavior during deployment in production scenarios.

Steps to Reproduce the Problem

  1. Install Istio (1.19.3)
  2. Install Operator
  3. Create a Deployment with CloudSQL annotation in namespace managed by Istio

Specifications

  • Version: 1.6.0
  • Platform: v1.30.5-gke.1443001
  • Istio version: 1.19.3

utmaks avatar Nov 26 '24 18:11 utmaks

Thanks for this @utmaks 👏

@hessjcg is OOO this week but will take a look next week when he is back

jackwotherspoon avatar Nov 26 '24 18:11 jackwotherspoon

@utmaks,

This seems like a good reason to add a configuration parameter to the AuthProxyWorkload CRD to allow you and others to make the proxy run as a PodSpec.Container container instead of a PodSpec.InitContainer sidecar container.

I am curious to understand how Istio 1.19.3 and Operator 1.6.0 conspire to create pods that won't start. Would you mind posting the pod declaration for one of these failing pods (redacted, of course)?

I wonder if Istio adding its sidecar container to PodSpec.Container or PodSpec.InitContainer? What is the order of the containers, and is Istio starting first? Are there issues with the health checks?

hessjcg avatar Dec 02 '24 17:12 hessjcg

Using these annotations on the pod template fixes the issue - sidecar.istio.io/rewriteAppHTTPProbers: 'false', traffic.sidecar.istio.io/excludeInboundPorts: '9801'. The problem is that the istio-validation init container always starts first. If you have strict mTls mode enabled, healthchecks may fail due to the startup probe rewrite not being present.

azunna1 avatar Dec 23 '24 14:12 azunna1

For a more permanent fix, the csql sidecar containers should run as user 1337 as stated here: https://cloud.google.com/knowledge/kb/pod-are-fail-to-start-due-to-init-containers-not-starting-with-istio-cni-enabled-000007358 I tried using the AuthProxyWorkload's AuthProxyWorkloadSpec container field to set it but this overrides the default configuration. Would be nice if it did a merge instead. @hessjcg

azunna1 avatar Dec 23 '24 14:12 azunna1

Hi @hessjcg just wanted to check for updates on this?

azunna1 avatar Jan 28 '25 00:01 azunna1

Hi @azunna1,

Thank you for proposing a solution. I'm glad to hear that fixing the sidecar is as simple as setting the user ID on the container. I would like to update the operator to detect Istio and set user id '1337'. Do you have any suggestions on how the operator can query for istio?

I think that an istio-specific fix would be better than allowing any free-form modification to the proxy container by merging the Container field. While merging container specs would enable a convenient hacks like this, it could also make unexpected errors more likely when you upgrade the operator in the future.

hessjcg avatar Jan 31 '25 23:01 hessjcg

I believe to detect istio, you can check for the presence of some annotations in the pod such as istio.io/rev or labels such as sidecar.istio.io/inject=true and service.istio.io/canonical-name. Also istio sidecars have the container name istio-proxy. Not sure if this would be a useful approach for the cloudsql proxy mutating webhook.

azunna1 avatar Feb 06 '25 13:02 azunna1

Hi @hessjcg, would you like some assistance with this? My Golang is a bit rough but i could give this issue a try with some guidance. We use istio in our environment and i'm avoiding making an update to the versions that make use of the initContainer sidecar because it would break all our deployments.

azunna1 avatar Feb 13 '25 01:02 azunna1

@azunna1 By all means, give it a try! Here are some pointers to get you started.

I can see two approaches. We can implement both. Approach #1 allows the override of the SecurityContext on the proxy container. Approach #2 makes the container istio-compatible by default.

Approach 1: Allow override of container SecurityContext

This way, users can set their own SecurityContext for the proxy containers in the AuthProxyWorkload resource.

  1. Add a new SecurityContext field to AuthProxyContainerSpec similar to the AuthProxyContainerSpec.Resources field.
  2. Modify the applyContainerSpec() method to set the container.SecurityContext field to set the SecurityContext field. similar to how it is handled for Resources
  3. Add unit tests to podspec_update_tests.go, again similar to TestResourcesFromSpec.

Approach 2: Detect Istio

This way, the operator automatically configures the container when Istio is present.

  1. Add a field istioPresent bool the updateState struct.
  2. Detect if istio is present from the pod metadata in updateState.update(). On this line you can get the entire corev1.Pod resource as wl.Pod. I'm not sure how to determine from the Pod metadata whether Istio is installed. You will need to do some research.
  3. Modify the applyContainerSpec() method to set the container.SecurityContext field to set the user field appropriately if istioPresent is true. link
  4. Add unit tests to podspec_update_tests.go, You can copy TestPodUpdateWorkload() as a starting point.
  5. Test this end-to-end manually on your own Istio-enabled cluster and let us know the results.

hessjcg avatar Feb 13 '25 18:02 hessjcg

hi all, any updates on this ? Thanks in advance @hessjcg

joran-fonjallaz avatar Mar 26 '25 10:03 joran-fonjallaz

Hi all, unfortunately, we have not had the time to start work on this. A contribution from the community would be very welcome.

hessjcg avatar Mar 26 '25 15:03 hessjcg

Hi all, unfortunately, we have not had the time to start work on this. A contribution from the community would be very welcome.

Hey @hessjcg!

@bblackbelt and I are looking into this. So far, we have tried to set the uid to 1337, which is causing issues with our managed istio (or maybe we're doing something wrong). Hopefully, we'll raise a PR once we fix it.

Thanks for the clear guidance!

Fcmam5 avatar Apr 03 '25 16:04 Fcmam5

Hi all, unfortunately, we have not had the time to start work on this. A contribution from the community would be very welcome.

Hey @hessjcg!

@bblackbelt and I are looking into this. So far, we have tried to set the uid to 1337, which is causing issues with our managed istio (or maybe we're doing something wrong). Hopefully, we'll raise a PR once we fix it.

Thanks for the clear guidance!

Specifically, the cloud SQL proxy comes online but istio doesn't (and so the main container). We are using Gcp anthos managed. Let me know if you need which version of anthos and istio

Regards

bblackbelt avatar Apr 04 '25 06:04 bblackbelt

Hi @bblackbelt With Istio, you need to ensure that you set a health check for your containers. Can you share the full pod/deployment yaml?

azunna1 avatar Apr 09 '25 08:04 azunna1

Hi @bblackbelt With Istio, you need to ensure that you set a health check for your containers. Can you share the full pod/deployment yaml?

@azunna1 I have one using the cloudsql proxy which is experiencing the same issue as with the cloudsql proxy operator. Here is the deployment file (it is helm templated. Apologies for the formatting but the code block is not really working)

apiVersion: apps/v1 kind: Deployment metadata: name: {{ .Values.serviceName }} spec: selector: matchLabels: app.kubernetes.io/instance: {{ .Values.serviceName }} template: metadata: name: {{ .Values.serviceName }} spec: serviceAccountName: {{ .Values.ksa }} containers: - env: - name: DB_USERNAME value: "DB_USERRNAME" - name: POSTGRES_HOST value: {{ .Values.LOCALHOST_IP | quote }} image: {{ print .Values.main.image ":" .Chart.AppVersion | quote }} imagePullPolicy: {{ .Values.main.image.pullPolicy }} ports: - name: http-port containerPort: 8080 protocol: TCP startupProbe: httpGet: path: /health/live port: http-port scheme: HTTP initialDelaySeconds: 20 periodSeconds: 20 failureThreshold: 5 timeoutSeconds: 20 livenessProbe: httpGet: path: /health/live port: http-port scheme: HTTP initialDelaySeconds: 20 periodSeconds: 20 failureThreshold: 5 timeoutSeconds: 20 readinessProbe: httpGet: path: /health/ready port: http-port scheme: HTTP initialDelaySeconds: 20 periodSeconds: 20 failureThreshold: 3 timeoutSeconds: 20 initContainers: - name: cloud-sql-proxy securityContext: runAsUser: 1337 restartPolicy: Always image: {{ .Values.clodusqlImage }} # cloud-sql-proxy:2.15.0 startupProbe: failureThreshold: 10 httpGet: path: /startup port: 9801 periodSeconds: 1 initialDelaySeconds: 30 timeoutSeconds: 10 args: - "--exit-zero-on-sigterm" - "--auto-iam-authn" - "--structured-logs" - "--exit-zero-on-sigterm" - "--private-ip" - "--health-check" - "--quitquitquit" - "--http-address=0.0.0.0" - "--port={{ .Values.config_map.DB_PORT }}" - "--http-port={{ .Values.CLOUDSQL_PROXY_HEALTH_CHECK_PORT }}" - "{{ .Values.gcpProjectId }}:{{ .Values.region}}:{{ .Values.name}}" restartPolicy: Always terminationGracePeriodSeconds: 5

bblackbelt avatar Apr 09 '25 11:04 bblackbelt

Try adding a liveness probe to the init container. It's the only difference i can see between my setup and yours.

            livenessProbe:
               httpGet:
                 path: /liveness
                 port: 9801
                 scheme: HTTP
               timeoutSeconds: 10
               periodSeconds: 10
               successThreshold: 1
               failureThreshold: 3

azunna1 avatar Apr 09 '25 13:04 azunna1

@azunna1 @bblackbelt @hessjcg hi all, curious if anyone had the chance to make progress on this issue ? And if there's anything that can be done to help ? Thanks

joran-fonjallaz avatar May 06 '25 10:05 joran-fonjallaz

@joran-fonjallaz @bblackbelt I tried again and realised that removing all the probes works fine. You still have to set the UID to 1337.

Unfortunately, the latest istio features are not available yet on managed istio control plane. The native Sidecar implementation would have solved the problem but this isn't supported yet. If you are using in-cluster control plane, you can try enabling that feature - https://cloud.google.com/service-mesh/docs/supported-features-managed#proxyconfig

azunna1 avatar May 21 '25 23:05 azunna1

I think we may need to implement a feature to override the container security context. This will also help with other use cases too, see #694.

I will track the work to add a way to override the security context #694, but leave this bug open as well.

hessjcg avatar Aug 12 '25 17:08 hessjcg

I'm going to close this issue for now.

Newer versions of Istio work correctly with native sidecar containers, and with the addition of #694, you can now set the security context on proxy containers to what you need for your cluster.

Unfortunately, Cloud Service Mesh does not yet support native sidecar containers. For those of you using Cloud Service Mesh, the operator will not work yet. Please feel free to request this feature from the GKE Cloud Service Mesh team.

hessjcg avatar Sep 23 '25 20:09 hessjcg