Deploying self-hosted Pixie Cloud fails when following the Production Readiness Guide

Open Pger-Y opened this issue 7 months ago • 3 comments

Describe the bug I'm encountering an issue while trying to deploy a self-hosted Pixie Cloud for a production environment by following the official guide: Production Readiness Guide.

The guide states:

If you are using the Self-Hosted installation,

  • Follow steps 1 - 5 and 8 in Deploy Pixie Cloud.
  • Comment lines 94 to 106 in ./scripts/create_cloud_secrets.sh
  • Execute the script as explained in step 9 of Deploy Pixie Cloud

Initial Problem with Script Lines: I found that the line numbers (94-106) referenced for commenting out in ./scripts/create_cloud_secrets.sh appear to be outdated in the current version of the script. I attempted to identify the correct corresponding lines by referencing an older tag of the script and commented out lines 96-108, which seemed to match the intended section for generating self-signed certs (which I intended to replace with my own).

Subsequent Issue with Custom Certificate and Internal Access: Since commenting out those lines would prevent the creation of the cloud-proxy-tls-certs secret, I manually created this secret using our own wildcard certificate (*.zhipuai-infra.cn, which correctly covers pixie.xxx.xxx).

When I access https://pixie.xxx.xxx (where xxx.xxx is my domain, correctly covered by the wildcard cert) through my Traefik Ingress, the external access seems to work up to a point, but I encounter an "Internal Server Error" from Pixie.

To investigate, I tried to curl the internal cloud-proxy-service (which is a LoadBalancer type service, IP 10.123.60.98 in my setup). This internal request fails with the following TLS error:

  • start date: Oct 15 00:00:00 2024 GMT
  • expire date: Oct 15 23:59:59 2025 GMT
  • subjectAltName does not match ipv4 address 10.123.60.98
  • SSL: no alternative certificate subject name matches target ipv4 address '10.123.60.98'
  • Closing connection
  • TLSv1.3 (OUT), TLS alert, close notify (256):

curl: (60) SSL: no alternative certificate subject name matches target ipv4 address '10.123.60.98'
More details here: https://curl.se/docs/sslcerts.html

This error suggests that the certificate being presented by the cloud-proxy pod/service at IP 10.123.60.98 (which is using the cloud-proxy-tls-certs secret I provided) does not have 10.123.60.98 or an appropriate internal DNS name as a Subject Alternative Name (SAN). My wildcard certificate is intended for external hostnames (like pixie.xxx.xxx) and naturally does not include internal IP addresses or Kubernetes service names (like cloud-proxy-service.namespace.svc.cluster.local) in its SANs.
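One way to separate "wrong certificate" from "right certificate, wrong name" is to connect to the internal IP while verifying against the public hostname. The following is a minimal sketch using only Python's standard library. The function names `san_entries` and `fetch_sans` are hypothetical helpers, and the IP and hostname are the (redacted) values from this report. It assumes the certificate chain is trusted by the local system store; otherwise `wrap_socket` raises `ssl.SSLCertVerificationError` before any SANs can be read.

```python
import socket
import ssl


def san_entries(peer_cert: dict) -> list[tuple[str, str]]:
    """Return (type, value) SAN pairs from the dict produced by SSLSocket.getpeercert()."""
    return [(kind, value) for kind, value in peer_cert.get("subjectAltName", ())]


def fetch_sans(ip: str, hostname: str, port: int = 443, timeout: float = 5.0):
    """Open a TCP connection to `ip`, but send `hostname` as SNI and verify against it."""
    context = ssl.create_default_context()  # chain and hostname verification enabled
    with socket.create_connection((ip, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return san_entries(tls.getpeercert())


# Example call (needs network access to the service, so it is left commented out):
# print(fetch_sans("10.123.60.98", "pixie.xxx.xxx"))
```

If this succeeds while a plain request to the IP fails, the certificate itself is fine and the failure only reflects that the client addressed the service by IP rather than by a name listed in the SANs.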

My Current Understanding/Question: It seems the cloud-proxy might be using the cloud-proxy-tls-certs for two purposes:

  1. TLS termination for external traffic coming via Ingress (for pixie.xxx.xxx).
  2. Possibly for internal communication or health checks that are trying to access it via its internal IP or service name, leading to the SAN mismatch.

Could you please provide guidance on:

  1. The correct lines to comment out in create_cloud_secrets.sh for the latest version if one intends to use their own external-facing TLS certificates?
  2. How TLS is intended to be handled for internal access to cloud-proxy or between internal Pixie Cloud components if the cloud-proxy-tls-certs secret is configured with a public-facing wildcard certificate? Should there be a separate internal CA and certificates for such internal traffic, or should the cloud-proxy-tls-certs also include SANs for internal service names/IPs?

Additional Context:

  • I am using Traefik as my Ingress controller, not Nginx. I have modified the Ingress YAML accordingly.
  • My traffic flow is: Client -> Load Balancer (External) -> Traefik Pod (TLS termination for pixie.xxx.xxx using my wildcard cert) -> Ingress Route -> cloud-proxy-service (K8s Service of type LoadBalancer, exposing cloud-proxy pods) -> cloud-proxy pod (IP 10.123.60.98).
  • Correction based on flow: It's possible Traefik is not terminating TLS and is passing it through to the cloud-proxy-service if the Ingress is configured for TLS passthrough, or if Traefik is terminating, then the communication from Traefik to cloud-proxy-service might be HTTP or attempting HTTPS with this cert mismatch. My current setup has Traefik handling TLS for pixie.xxx.xxx.

Thanks for your help!

ingress yaml:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: cloud-ingress-https
  namespace: plc
  annotations:
    traefik.ingress.kubernetes.io/service.serversscheme: "https"
    # traefik.ingress.kubernetes.io/service.serverstransport: "insecure@kubernetescrd"
spec:
  tls:
    - hosts:
        - pixie.xxx.com
        - work-pixie.xxx.com
  rules:
    - host: pixie.xxx.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: cloud-proxy-service
                port:
                  number: 443
    - host: work-pixie.xxx.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: cloud-proxy-service
                port:
                  number: 443

Expected behavior: Self-hosted Pixie Cloud deploys successfully.

Logs I'm deploying Pixie Cloud.

App information (please complete the following information):

  • Pixie version release/cloud/v0.1.9
  • K8s cluster version v1.31.1

Pger-Y, May 29 '25 02:05

I successfully deployed pixie-cloud using an external LoadBalancer pointing directly to the cloud-proxy-service. Since I only have one wildcard certificate for *.xxx.xxx, I used the domains pixie.xxx.xxx and pixie-work.xxx.xxx.

To make the deployment work, I modified a number of YAML files in the project—such as updating the domain in nginx.conf and setting the correct environment variables in the deployments.

However, when I try to log in to the Pixie UI, I encounter the following error:

Could not find hydra login state in cookie store: map[]

And in the logs from the api-server pod, I see this message:

api-server-76f9555554-kppqs time="2025-05-30T07:44:37Z" level=info msg="HTTP Request" req_method=GET req_path="/api/auth/oauth/login?hydra_login_state=0094a06f308aa054b83ea389816fb705a91470cae340b13c2a5e631f5a4686357140cffe271df43782fa7707275ec7a8&login_challenge=5be78f50f4454679952c8cf102e5736f" resp_code=500 resp_size=56 status="Internal Server Error" resp_time="29.504µs"

Any ideas on what might be going wrong or what I should check?

Thanks!

Pger-Y, May 30 '25 07:05

So many errors about certs :(

api-server-55d7bbc586-sprq6 time="2025-06-04T12:57:32Z" level=warning msg="[core] grpc: addrConn.createTransport failed to connect to {10.123.60.128:50600 plugin-server-c8d9bb85b-pzz9j 0 }. Err: connection error: desc = "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate is valid for *.plc, *.plc.svc.cluster.local, *.pl-nats.plc.svc, *.pl-nats, pl-nats, *.local, localhost, kratos, not plugin-server-c8d9bb85b-pzz9j"" func=google.golang.org/grpc/internal/grpclog.WarningDepth file="google.golang.org/[email protected]/internal/grpclog/grpclog.go:46" system=system

vzconn-server vzconn-server-b77459677-9sqzs 2025/06/04 12:58:48 http: TLS handshake error from 100.122.204.128:24727: read tcp 10.123.60.138:51600->100.122.204.128:24727: read: connection reset by peer

Pger-Y, Jun 04 '25 13:06

I ran into a similar issue. The error is telling you that the connection attempt uses TLS, but the name being dialed does not match the certificate. You are hitting "plugin-server-c8d9bb85b-pzz9j", and the cert is only valid for the names listed in the error: "transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate is valid for *.plc, *.plc.svc.cluster.local, *.pl-nats.plc.svc, *.pl-nats, pl-nats, *.local, localhost, kratos"

The name your deployment is connecting to isn't in that list. Sorry, though, I would need to check a few other things to try to help; I'll try to look a little later today. The issue is easy enough to find (the cert): in my API server logs it connects to plugin-service.plc:50600. That's all part of the base deployment, and I'm not sure why yours shows the pod name and not the service name with the namespace .plc included.
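To make the mismatch concrete, here is a simplified sketch of wildcard name matching, applied to the SAN list from the error above. It is not Go's actual x509 verifier, just an approximation of the usual rule that "*" stands for exactly one left-most label. The helper names are hypothetical, and the sketch assumes the list contains only DNS entries, so an IP address can never match.

```python
import ipaddress


def is_ip(name: str) -> bool:
    """True if `name` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(name)
        return True
    except ValueError:
        return False


def dns_name_matches(pattern: str, name: str) -> bool:
    """Simplified match: '*' may only replace the entire left-most label."""
    pattern_labels = pattern.lower().split(".")
    name_labels = name.lower().split(".")
    if len(pattern_labels) != len(name_labels):
        return False
    if pattern_labels[0] == "*":
        return pattern_labels[1:] == name_labels[1:]
    return pattern_labels == name_labels


def matches_any(sans: list[str], name: str) -> bool:
    """Check `name` against DNS-type SAN entries; IPs need a separate IP SAN."""
    if is_ip(name):
        return False
    return any(dns_name_matches(pattern, name) for pattern in sans)


# SAN list copied from the x509 error in the api-server log.
SANS = ["*.plc", "*.plc.svc.cluster.local", "*.pl-nats.plc.svc",
        "*.pl-nats", "pl-nats", "*.local", "localhost", "kratos"]

print(matches_any(SANS, "plugin-server-c8d9bb85b-pzz9j"))  # False: bare pod name
print(matches_any(SANS, "plugin-service.plc"))             # True: covered by *.plc
```

Under these assumptions, a bare pod name has no entry that covers it, while the service name qualified with the .plc namespace does, which is consistent with the connection working when the service address is used.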

My initial guess is that you might have changed some values in the pixie-cloud deployment. There's a ConfigMap called pl-scriptmgr-config that holds values for the components, including PL_PLUGIN_SERVICE. Can you confirm what's in that ConfigMap value?

ssurovich, Jun 09 '25 15:06