
VC-36510: Key Pair and Venafi Connection modes now use gzip compression

maelvls opened this issue 1 year ago • 4 comments

Summary:

  • This PR remediates the 413s returned by our NGINX gateway, whose request size limit was set too low (around 100 MB). See VC-36510 for details.
  • On 26 Sept, as a first attempt at remediating this, we bumped NGINX's client_max_body_size from 256 MB to 1024 MB, and client_body_buffer_size from 128 MB to 750 MB (gitlab MR, message).
    • Before this fix, NGINX returned a 413 as soon as the Content-Length exceeded 256 MB. With the new limits, request bodies above 750 MB are written to a temporary file on disk rather than buffered in memory.
    • Since this change, NGINX hasn't returned any 413s, even under stress testing.
  • Side note: NGINX still buffers requests. Keeping buffering on puts memory and disk pressure on NGINX, and I don't see a reason to buffer these requests. The setting in question is proxy_request_buffering, which is on by default.
  • The Venafi Control Plane API doesn't publicly state that GZIP compression is supported; I have tested it and it works, but we should probably talk to Prod Engineering.
  • Compression hasn't been added for Standalone (Jetstack Secure OAuth and Jetstack Secure API token modes). In this PR, I only added compression for the Key Pair and Venafi Connection modes.
  • Compression aims to lower the load on our intermediate NGINX (because it buffers requests) as well as the egress cost for customers. Sending over 100 MB over the wire looked bad; compression improves that.
  • To ease any pain caused by adding compression, we decided to add --disable-compression and let the Support team point customers to it. The flag is also useful for development purposes (e.g., when using mitmproxy).
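For reference, the NGINX settings mentioned above would look roughly like this. This is a sketch under stated assumptions: the real gateway config lives in the private GitLab repo, and the location and upstream names here are made up for illustration.

```nginx
http {
    # Values from the 26 Sept remediation (illustrative, not the actual gateway config).
    client_max_body_size 1024m;    # was 256m; requests above this limit get a 413
    client_body_buffer_size 750m;  # was 128m; bodies above this spill to a temp file on disk

    server {
        location /v1/tlspk/upload/ {
            proxy_request_buffering on;  # the default; "off" would stream bodies upstream instead
            proxy_pass http://upstream_api;  # hypothetical upstream name
        }
    }
}
```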

Testing

  • [x] Test 1: (manual) test that the compression works in Venafi Connection mode by using ./hack/e2e/test.sh. (see results below)
  • [x] Test 2: (manual) test that the compression works in Key Pair mode. (see results below)
  • [x] Test 3: test that the headers are correctly set in Key Pair mode. → unit test written.
  • [x] Test 4: test that the headers are correctly set in Venafi Connection mode. → unit test written.
  • [x] Test 5: test that no compression takes place when --disable-compression is used in Key Pair mode → unit test written
  • [x] Test 6: test that no compression takes place when --disable-compression is used in Venafi Connection mode. → unit test written
  • [ ] Test 7: (manual) test that the data is actually in the S3 bucket. @tfadeyi will check this.

Rollout Plan:

  • If Olu or somebody in Cloud Services is taking over the feature and releasing v1.2.0, they will assess how confident they are about releasing. Although this feature is important, it isn't critical, so they might wait until Richard is back (29th Oct) or Mael is back (4th Nov).
  • Release v1.2.0 for all customers.
  • venctl will not immediately get v1.2.0 for the reason explained in the #venctl channel. You can contact @SgtCoDFish or @tfadeyi to know more about that.
  • Let the Support Team (@hawksight) know that this change might affect customers, and that they can tell them to use --disable-compression if a problem occurs.

Before: the request body weighs 49 MB:

POST https://api.venafi.cloud/v1/tlspk/upload/clusterdata/no?description=S2luZCBjbHVzdGVyIG9uIE1hZWwmIzM5O3MgQW9ydXMgaG9tZSBtYWNoaW5l&name=kind-mael HTTP/2.0
accept: application/json
content-type: application/json
authorization: Bearer REDACTED
content-length: 49000269
accept-encoding: gzip
user-agent: Go-http-client/2.0

{"agent_metadata":{...

HTTP/2.0 200 
date: Tue, 15 Oct 2024 08:42:45 GMT
content-type: application/json
www-authenticate: Bearer realm="Echo"
vary: Accept-Encoding
server: envoy
content-length: 70

{"status":"ok","organization":"756db001-280e-11ee-84fb-991f3177e2d0"}

After: the request body weighs 5 MB:

POST https://api.venafi.cloud/v1/tlspk/upload/clusterdata/no?description=S2luZCBjbHVzdGVyIG9uIE1hZWwmIzM5O3MgQW9ydXMgaG9tZSBtYWNoaW5l&name=kind-mael HTTP/2.0
authorization: Bearer REDACTED
accept: application/json
content-type: application/json
content-length: 5125690
accept-encoding: gzip
user-agent: Go-http-client/2.0

\x8b������\xff̽ͮh\xb9\x8d...

HTTP/2.0 200 
date: Tue, 15 Oct 2024 08:42:45 GMT
content-type: application/json
www-authenticate: Bearer realm="Echo"
vary: Accept-Encoding
server: envoy
content-length: 70

{"status":"ok","organization":"756db001-280e-11ee-84fb-991f3177e2d0"}

^ note that the above request is missing the Content-Encoding: gzip header. I fixed this in 22def464afb6f5d80f5b8aa68ad0fe058c6d37dc.

(see the section below to know how to reproduce this)

Manual Tests

Test 1

For this test, I've used the tenant https://ven-tlspk.venafi.cloud/. To access the API key, log in as the user [email protected]; the password is visible on the Production Accounts page (private to Venafi). Then go to the settings, find the API key, and set it in the env var APIKEY.

The tenant https://ven-tlspk.venafi.cloud/ doesn't have the right tier to pull images, so I use an API key from the tenant https://glow-in-the-dark.venafi.cloud/, that's why I set the env var APIKEY_GLOW_IN_THE_DARK. Ask Atanas to get access to the tenant https://glow-in-the-dark.venafi.cloud/.

export APIKEY=...
export APIKEY_GLOW_IN_THE_DARK=...

Then:

export \
 OCI_BASE=ttl.sh/maelvls \
 VEN_API_KEY=$APIKEY \
 VEN_API_KEY_PULL=$APIKEY_GLOW_IN_THE_DARK \
 VEN_API_HOST=api.venafi.cloud \
 VEN_VCP_REGION=us \
 VEN_ZONE='tlspk-bench\Default' \
 CLOUDSDK_CORE_PROJECT=jetstack-mael-valais \
 CLOUDSDK_COMPUTE_ZONE=europe-west1-b \
 CLUSTER_NAME=test-secretless

Finally, run the smoke test Venafi Connection mode:

./hack/e2e/test.sh

Result:

2024/10/18 19:18:35 Posting data to:
2024/10/18 19:18:36 Data sent successfully.

Test 2

For this test, I've decided to use mitmproxy to see exactly what is being sent on the wire.

First, copy the following script to a file called bigjsonbody.go (written in Go because bash is super slow):

///usr/bin/true; exec /usr/bin/env go run "$0" "$@"

package main

import (
	"encoding/json"
	"flag"
	"fmt"
	"time"

	"github.com/google/uuid"
)

var (
	count = flag.Int("count", 5000, "Number of items to generate")
)

type Metadata struct {
	Name              string            `json:"name"`
	Namespace         string            `json:"namespace"`
	UID               string            `json:"uid"`
	CreationTimestamp string            `json:"creationTimestamp"`
	Labels            map[string]string `json:"labels"`
}

type Resource struct {
	Kind       string   `json:"kind"`
	APIVersion string   `json:"apiVersion"`
	Metadata   Metadata `json:"metadata"`
}

type Item struct {
	Resource Resource `json:"resource"`
}

type Data struct {
	Items []Item `json:"items"`
}

type Output struct {
	ClusterID    string    `json:"cluster_id"`
	DataGatherer string    `json:"data-gatherer"`
	Timestamp    time.Time `json:"timestamp"`
	Data         Data      `json:"data"`
}

func generateItem() Item {
	randomUID := uuid.New().String()
	return Item{
		Resource: Resource{
			Kind:       "Pod",
			APIVersion: "v1",
			Metadata: Metadata{
				Name:              "test-" + randomUID,
				Namespace:         "test",
				UID:               randomUID,
				CreationTimestamp: "2024-09-20T18:03:51Z",
				Labels: map[string]string{
					"k8s-app": "test",
				},
			},
		},
	}
}

func main() {
	flag.Parse()
	items := make([]Item, *count)

	for i := 0; i < *count; i++ {
		items[i] = generateItem()
	}

	output := Output{
		ClusterID:    "mael temp",
		DataGatherer: "k8s/pods",
		Timestamp:    time.Date(2024, time.September, 30, 16, 47, 32, 0, time.FixedZone("CET", 2*3600)),
		Data:         Data{Items: items},
	}

	outputJSON, err := json.MarshalIndent(output, "", "  ")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	fmt.Println(string(outputJSON))
}

Then:

chmod +x bigjsonbody.go

Finally:

$ ./bigjsonbody.go --count=200000 >input-data.json
$ du -h input-data.json
87M    input-data.json

For this test, I've used the tenant https://ven-tlspk.venafi.cloud/. To access the API key, log in as the user [email protected]; the password is visible on the Production Accounts page (private to Venafi). Then go to the settings, find the API key, and set it as an env var:

export APIKEY=...

Create the service account and key pair:

venctl iam service-account agent create --name "$USER temp" \
  --vcp-region US \
  --output json \
  --owning-team $(curl -sS https://api.venafi.cloud/v1/teams -H "tppl-api-key: $APIKEY" | jq -r '.teams[0].id') \
  --output-file /tmp/agent-credentials.json \
  --api-key $APIKEY

Now, make sure you have the entry 127.0.0.1 me in your /etc/hosts.

Then, run mitmproxy with:

curl -L https://raw.githubusercontent.com/maelvls/kubectl-incluster/main/watch-stream.py >/tmp/watch-stream.py
mitmproxy --mode regular@9090 --ssl-insecure -s /tmp/watch-stream.py --set client_certs=$(kubectl incluster --print-client-cert >/tmp/me.pem && echo /tmp/me.pem)

Run this:

cat <<EOF >minimal-config.yaml
cluster_id: "kind-mael"
cluster_description: "Kind cluster on Mael's machine"
server: "https://api.venafi.cloud/"
venafi-cloud:
  uploader_id: "no"
  upload_path: "/v1/tlspk/upload/clusterdata"
data-gatherers: []
period: 3m
EOF

Finally, run the Agent with:

go install github.com/maelvls/kubectl-incluster@latest
export HTTPS_PROXY=http://localhost:9090 KUBECONFIG=/tmp/kube && KUBECONFIG= HTTPS_PROXY= kubectl incluster --replace-ca-cert ~/.mitmproxy/mitmproxy-ca-cert.pem --sa=venafi/venafi-kubernetes-agent | sed 's|127.0.0.1|me|' >/tmp/kube

go run . agent -c minimal-config.yaml \
  --client-id $(jq -r .client_id /tmp/agent-credentials.json) \
  --private-key-path <(jq -r .private_key /tmp/agent-credentials.json) \
  --install-namespace=venafi \
  --input-path=input-data.json \
  --one-shot

Result:

2024/10/18 19:18:35 Posting data to: https://api.venafi.cloud/
2024/10/18 19:18:36 Data sent successfully.

maelvls avatar Oct 15 '24 08:10 maelvls

By the way, I don't understand why we have set a client timeout of 1 minute in Key Pair service account mode... Timeouts are a good practice in most cases, but I don't think this is such a case:

https://github.com/jetstack/jetstack-secure/blob/64a925584cc91e6b4ab977f70788bac0a9db67be/pkg/client/client_venafi_cloud.go#L114

I also noted that Venafi Kubernetes Agent's user agent is off:

user-agent: Go-http-client/2.0

It should be something like

user-agent: venafi-kubernetes-agent/1.1.0

maelvls avatar Oct 15 '24 16:10 maelvls

> Add some comments in the code, explaining why the collected data is compressed using GZIP... to reduce the likelihood of exceeding the maximum request body size limit imposed by the server or by load balancers and proxies in between.

Good point, I forgot to explain why Content-Encoding was added to the Key Pair mode and the VenafiConnection mode, but not the Jetstack Secure OAuth mode nor the Jetstack Secure API Token mode.

I've added a few comments in 22def46 to explain that.

> Show evidence that this still works with the old Jetstack TLSPK API server (does it handle requests with Content-Encoding: gzip?), or explain why that is not important.

I haven't added Content-Encoding: gzip to the Jetstack Secure OAuth and Jetstack Secure API Token modes, because I have heard that the 413 Entity Too Large error was only happening with the VCP API. I don't think it's worth changing the way data is pushed to the Jetstack Secure API ("Standalone") when we found no issue with it.

> Did you consider adding a flag to allow users to turn off the gzip content encoding, in case it doesn't work with a corporate proxy?

Good question... I haven't thought about that. I've looked around and found one instance of a person struggling with using IIS as a reverse proxy in front of SABnzbd; the problem seemed to be that SABnzbd's responses are GZIP-encoded; IIS would respond "406 Not Acceptable": https://github.com/sabnzbd/sabnzbd/issues/915.

I haven't found other evidence of GZIP-encoded requests being a problem for proxies. I conclude that the vast majority of transparent proxies, HTTP proxies, SOCKS proxies, and the like found in companies handle GZIP-encoded requests well.

I don't think there is a need to add a flag to disable GZIP encoding.

> Did you consider checking for HTTP server error 415 Unsupported Media Type and resending the request uncompressed?

I haven't considered this. It seems like some work to add this logic, and I'm not sure how useful it will be.

Note: Looks like 415 Unsupported Media Type isn't the only possible error that may show up: 406 Not Acceptable may also be one (as seen in the SABnzbd issue I linked above).

> Does the API documentation explain that the API supports gzip encoded requests?

I had not thought of doing that. I just tried the Google search site:developer.venafi.com gzip and didn't find anything; I also tried site:docs.venafi.cloud gzip and didn't find anything either.

I also maintain an ad-hoc internal documentation at API Documentation for /v1/tlspk/upload/clusterdata, nothing there either.

I conclude that the VCP API doesn't officially support gzip-encoded requests... should I let the Production Engineering team know about this? I'm not sure what to do about it.

maelvls avatar Oct 17 '24 14:10 maelvls

I realize that this PR is trickier than anticipated. @j-fuentes said: "if it is low-hanging fruit, just enable compression; if it is not, discuss further actions". I'll have a chat with Jose to know what to do next.

Update: we have decided to continue even though it is a little more difficult than anticipated. We have decided to go with --disable-compression instead of retrying in plain text on 415 errors (if they ever happen).

maelvls avatar Oct 17 '24 14:10 maelvls

@tfadeyi I'm done with the PR! I have added the necessary unit tests and have run a smoke test in Venafi Connection mode as well as in Key Pair mode. I am confident that this will work and will not break customers. You can proceed with the review and with merging and releasing v1.2.0 once you feel confident.

Before releasing, can you proceed with Test 7? I've sent data to the tenant https://ven-tlspk.venafi.cloud/ at around 2024/10/18 19:21:37 UTC; can you check that the bucket contains the data?

maelvls avatar Oct 18 '24 19:10 maelvls

> Before releasing, can you proceed with Test 7? I've sent data to the tenant https://ven-tlspk.venafi.cloud/ at around 2024/10/18 19:21:37 UTC; can you check that the bucket contains the data?

I've tested e2e (pushing data from this branch of the agent and checking the blob storage) and the data is successfully uploaded to the backend, both with --disable-compression and without. Note: the backend already compresses data before storing it.


I'm going to look into the failing unit tests.

When setting --install-namespace venafi I still see the following line at the start of the agent logs:

2024/10/23 13:57:21 error messages will not show in the pod's events because the POD_NAME environment variable is empty

tfadeyi avatar Oct 23 '24 14:10 tfadeyi