InitContainer Ordering Issue with CloudSQL Operator 1.6.0 in Istio-Managed Environments
Expected Behavior
The operator should order initContainers correctly so that pods start in Istio-managed namespaces.
Actual Behavior
Description: During the upgrade of CloudSQL Operator to version 1.6.0, issues were encountered when deploying new instances of the application. While the existing application instances continued functioning as expected, attempts to start new pods failed due to critical container startup errors, which would be unacceptable in a production environment.
Details: After upgrading to version 1.6.0 to utilize the minSigtermDelay parameter, the following problem was observed:
- New instances of the application failed to initialize due to startup probe errors from the CloudSQL container:
`Startup probe failed: Get "http://10.210.0.89:15020/app-health/csql-apps-apps/startupz": dial tcp 10.210.0.89:15020: connect: connection refused`
- Existing pods already in operation continued to function correctly, with no observed disruptions in their database interactions.
The issue seems tied to Istio sidecar injection. Specifically:
- New pods attempted to start the istio-init container but failed before Istio was fully initialized.
- The behavior differs from previous versions of the operator, which allowed greater compatibility in Istio-managed environments.
Resolution Attempt: Disabling Istio temporarily resolved the problem, allowing the CloudSQL container to start correctly in the new pods. However, this is not an acceptable solution for production workloads where Istio is required.
Proposed Fix for CloudSQL Operator team: Reintroduce user-configurable options for sidecar compatibility (e.g., sidecarType), as seen in earlier commits, to prevent such failures in Istio environments and ensure robust behavior during deployment in production scenarios.
Steps to Reproduce the Problem
- Install Istio (1.19.3)
- Install Operator
- Create a Deployment with CloudSQL annotation in namespace managed by Istio
Specifications
- Version: 1.6.0
- Platform: v1.30.5-gke.1443001
- Istio version: 1.19.3
Thanks for this @utmaks 👏
@hessjcg is OOO this week but will take a look next week when he is back
@utmaks,
This seems like a good reason to add a configuration parameter to the AuthProxyWorkload CRD so that you and others can make the proxy run as a regular PodSpec.Container entry instead of a PodSpec.InitContainer sidecar container.
I am curious to understand how Istio 1.19.3 and Operator 1.6.0 conspire to create pods that won't start. Would you mind posting the pod declaration for one of these failing pods (redacted, of course)?
I wonder whether Istio adds its sidecar container to PodSpec.Container or PodSpec.InitContainer. What is the order of the containers, and is Istio starting first? Are there issues with the health checks?
Using these annotations on the pod template fixes the issue: `sidecar.istio.io/rewriteAppHTTPProbers: 'false'` and `traffic.sidecar.istio.io/excludeInboundPorts: '9801'`. The problem is that the istio-validation init container always starts first.
If you have strict mTLS mode enabled, health checks may fail due to the startup probe rewrite not being present.
For a more permanent fix, the csql sidecar containers should run as user 1337 as stated here: https://cloud.google.com/knowledge/kb/pod-are-fail-to-start-due-to-init-containers-not-starting-with-istio-cni-enabled-000007358
I tried using the AuthProxyWorkload's AuthProxyWorkloadSpec container field to set it but this overrides the default configuration. Would be nice if it did a merge instead. @hessjcg
Hi @hessjcg just wanted to check for updates on this?
Hi @azunna1,
Thank you for proposing a solution. I'm glad to hear that fixing the sidecar is as simple as setting the user ID on the container. I would like to update the operator to detect Istio and set user ID 1337. Do you have any suggestions on how the operator can check for Istio?
I think that an Istio-specific fix would be better than allowing free-form modification of the proxy container by merging the Container field. While merging container specs would enable convenient hacks like this one, it could also make unexpected errors more likely when you upgrade the operator in the future.
I believe that to detect Istio, you can check for the presence of certain annotations on the pod, such as istio.io/rev, or labels such as sidecar.istio.io/inject=true and service.istio.io/canonical-name. Also, Istio sidecars use the container name istio-proxy. I'm not sure whether this would be a useful approach for the Cloud SQL proxy mutating webhook.
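To make that concrete, here is a rough sketch of a detection helper the operator's mutating webhook could use, built only on the markers mentioned above. The function name is hypothetical, and depending on webhook ordering some of these markers may not exist yet when the operator's webhook runs, so treat it as a starting point rather than a proven approach.

```go
package workload

import (
	corev1 "k8s.io/api/core/v1"
)

// podUsesIstio (hypothetical helper) guesses whether a pod is managed by Istio
// by looking for common injection markers on its metadata and containers.
func podUsesIstio(pod *corev1.Pod) bool {
	// Revision annotation used by the Istio injector.
	if _, ok := pod.Annotations["istio.io/rev"]; ok {
		return true
	}
	// Explicit injection label or canonical-name label added by Istio.
	if pod.Labels["sidecar.istio.io/inject"] == "true" {
		return true
	}
	if _, ok := pod.Labels["service.istio.io/canonical-name"]; ok {
		return true
	}
	// Sidecar container injected by Istio, if injection has already run.
	for _, c := range pod.Spec.Containers {
		if c.Name == "istio-proxy" {
			return true
		}
	}
	return false
}
```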
Hi @hessjcg, would you like some assistance with this? My Golang is a bit rough, but I could give this issue a try with some guidance. We use Istio in our environment, and I'm avoiding updating to the versions that use the initContainer sidecar because it would break all our deployments.
@azunna1 By all means, give it a try! Here are some pointers to get you started.
I can see two approaches. We can implement both. Approach #1 allows the override of the SecurityContext on the proxy container. Approach #2 makes the container istio-compatible by default.
Approach 1: Allow override of container SecurityContext
This way, users can set their own SecurityContext for the proxy containers in the AuthProxyWorkload resource.
- Add a new `SecurityContext` field to `AuthProxyContainerSpec`, similar to the existing `AuthProxyContainerSpec.Resources` field.
- Modify the `applyContainerSpec()` method to set the container's `SecurityContext` field, similar to how it is handled for `Resources` (see the sketch after this list).
- Add unit tests to `podspec_update_tests.go`, again similar to `TestResourcesFromSpec`.
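A minimal sketch of what Approach 1 could look like, assuming a new optional `SecurityContext` field and a simplified `applyContainerSpec()` signature. In the real operator the API type and the pod-spec update logic live in separate packages and the method takes more parameters, so this illustrates the shape of the change rather than the actual diff.

```go
package v1

import (
	corev1 "k8s.io/api/core/v1"
)

// AuthProxyContainerSpec (excerpt): a new optional field alongside the
// existing Resources field. Only the new field is shown here.
type AuthProxyContainerSpec struct {
	// SecurityContext, when set, is copied onto the generated proxy container.
	// +optional
	SecurityContext *corev1.SecurityContext `json:"securityContext,omitempty"`
}

// applyContainerSpec (simplified): apply the override the same way Resources
// is applied today.
func applyContainerSpec(spec *AuthProxyContainerSpec, c *corev1.Container) {
	if spec.SecurityContext != nil {
		c.SecurityContext = spec.SecurityContext.DeepCopy()
	}
}
```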
Approach 2: Detect Istio
This way, the operator automatically configures the container when Istio is present.
- Add a field `istioPresent bool` to the `updateState` struct.
- Detect whether Istio is present from the pod metadata in `updateState.update()`, where you can get the entire `corev1.Pod` resource as `wl.Pod`. I'm not sure how to determine from the Pod metadata whether Istio is installed; you will need to do some research.
- Modify the `applyContainerSpec()` method to set the container's `SecurityContext` user field appropriately if `istioPresent` is true (see the sketch after this list).
- Add unit tests to `podspec_update_tests.go`. You can copy `TestPodUpdateWorkload()` as a starting point.
- Test this end-to-end manually on your own Istio-enabled cluster and let us know the results.
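A sketch of how Approach 2 might hang together, assuming a hypothetical `applyIstioCompatibility` helper. How `istioPresent` actually gets set (for example, with a detection check like the one sketched earlier in this thread) and where the helper is called from would need to follow the existing update flow in the operator.

```go
package workload

import (
	corev1 "k8s.io/api/core/v1"
)

// updateState (excerpt): the real struct carries more fields; only the new
// flag proposed above is shown.
type updateState struct {
	istioPresent bool
}

// applyIstioCompatibility (hypothetical) forces the proxy container to run as
// UID 1337, the user whose traffic Istio's redirection exempts, unless a UID
// has already been configured explicitly.
func (s *updateState) applyIstioCompatibility(c *corev1.Container) {
	if !s.istioPresent {
		return
	}
	istioProxyUID := int64(1337)
	if c.SecurityContext == nil {
		c.SecurityContext = &corev1.SecurityContext{}
	}
	if c.SecurityContext.RunAsUser == nil {
		c.SecurityContext.RunAsUser = &istioProxyUID
	}
}
```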
Hi all, any updates on this? Thanks in advance @hessjcg
Hi all, unfortunately, we have not had the time to start work on this. A contribution from the community would be very welcome.
Hey @hessjcg!
@bblackbelt and I are looking into this. So far, we have tried to set the UID to 1337, which is causing issues with our managed Istio (or maybe we're doing something wrong). Hopefully, we'll raise a PR once we fix it.
Thanks for the clear guidance!
Specifically, the Cloud SQL proxy comes online but Istio doesn't (and therefore neither does the main container). We are using GCP-managed Anthos. Let me know if you need the exact Anthos and Istio versions.
Regards
Hi @bblackbelt With Istio, you need to ensure that you set a health check for your containers. Can you share the full pod/deployment yaml?
@azunna1 I have a deployment using the Cloud SQL proxy directly that experiences the same issue as with the Cloud SQL Proxy Operator. Here is the deployment file (it is Helm templated):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Values.serviceName }}
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: {{ .Values.serviceName }}
  template:
    metadata:
      name: {{ .Values.serviceName }}
    spec:
      serviceAccountName: {{ .Values.ksa }}
      containers:
        - env:
            - name: DB_USERNAME
              value: "DB_USERRNAME"
            - name: POSTGRES_HOST
              value: {{ .Values.LOCALHOST_IP | quote }}
          image: {{ print .Values.main.image ":" .Chart.AppVersion | quote }}
          imagePullPolicy: {{ .Values.main.image.pullPolicy }}
          ports:
            - name: http-port
              containerPort: 8080
              protocol: TCP
          startupProbe:
            httpGet:
              path: /health/live
              port: http-port
              scheme: HTTP
            initialDelaySeconds: 20
            periodSeconds: 20
            failureThreshold: 5
            timeoutSeconds: 20
          livenessProbe:
            httpGet:
              path: /health/live
              port: http-port
              scheme: HTTP
            initialDelaySeconds: 20
            periodSeconds: 20
            failureThreshold: 5
            timeoutSeconds: 20
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http-port
              scheme: HTTP
            initialDelaySeconds: 20
            periodSeconds: 20
            failureThreshold: 3
            timeoutSeconds: 20
      initContainers:
        - name: cloud-sql-proxy
          securityContext:
            runAsUser: 1337
          restartPolicy: Always
          image: {{ .Values.clodusqlImage }} # cloud-sql-proxy:2.15.0
          startupProbe:
            failureThreshold: 10
            httpGet:
              path: /startup
              port: 9801
            periodSeconds: 1
            initialDelaySeconds: 30
            timeoutSeconds: 10
          args:
            - "--exit-zero-on-sigterm"
            - "--auto-iam-authn"
            - "--structured-logs"
            - "--exit-zero-on-sigterm"
            - "--private-ip"
            - "--health-check"
            - "--quitquitquit"
            - "--http-address=0.0.0.0"
            - "--port={{ .Values.config_map.DB_PORT }}"
            - "--http-port={{ .Values.CLOUDSQL_PROXY_HEALTH_CHECK_PORT }}"
            - "{{ .Values.gcpProjectId }}:{{ .Values.region}}:{{ .Values.name}}"
      restartPolicy: Always
      terminationGracePeriodSeconds: 5
```
Try adding a liveness probe to the init container. It's the only difference i can see between my setup and yours.
```yaml
livenessProbe:
  httpGet:
    path: /liveness
    port: 9801
    scheme: HTTP
  timeoutSeconds: 10
  periodSeconds: 10
  successThreshold: 1
  failureThreshold: 3
```
@azunna1 @bblackbelt @hessjcg hi all, curious if anyone has had the chance to make progress on this issue? And is there anything that can be done to help? Thanks
@joran-fonjallaz @bblackbelt I tried again and realised that removing all the probes works fine. You still have to set the UID to 1337.
Unfortunately, the latest Istio features are not available yet on the managed Istio control plane. The native sidecar implementation would have solved the problem, but it isn't supported yet. If you are using an in-cluster control plane, you can try enabling that feature - https://cloud.google.com/service-mesh/docs/supported-features-managed#proxyconfig
I think we may need to implement a feature to override the container security context. This will also help with other use cases; see #694.
I will track the work to add a way to override the security context #694, but leave this bug open as well.
I'm going to close this issue for now.
Newer versions of Istio work correctly with native sidecar containers, and with the addition of #694, you can now set the security context on proxy containers to what you need for your cluster.
Unfortunately, Cloud Service Mesh does not yet support native sidecar containers. For those of you using Cloud Service Mesh, the operator will not work yet. Please feel free to request this feature from the GKE Cloud Service Mesh team.