Split Performance Tests into their Own Sub-Command
Proposal / RFE
Is your feature request related to a problem?
When running performance tests through `cilium connectivity test --perf`, all of the deployments required for the non-perf connectivity checks are also created and validated, adding unnecessary bloat and execution time:
```
❯ ./cilium connectivity test --perf
ℹ️ Monitor aggregation detected, will skip some flow validation steps
⚠️ Each zone only has a single node - could impact the performance test results
--> (Needed for perf)⌛ [kind-kind] Waiting for deployments [perf-client] to become ready...
--> (Needed for perf)⌛ [kind-kind] Waiting for deployments [perf-server] to become ready...
⌛ [kind-kind] Waiting for CiliumEndpoint for pod cilium-test/client-668b56fb44-swmr9 to appear...
⌛ [kind-kind] Waiting for CiliumEndpoint for pod cilium-test/client2-bf46d8f5d-sb49v to appear...
⌛ [kind-kind] Waiting for pod cilium-test/client-668b56fb44-swmr9 to reach DNS server on cilium-test/echo-same-node-5557788d8-pvtr5 pod...
⌛ [kind-kind] Waiting for pod cilium-test/client2-bf46d8f5d-sb49v to reach DNS server on cilium-test/echo-same-node-5557788d8-pvtr5 pod...
⌛ [kind-kind] Waiting for pod cilium-test/client-668b56fb44-swmr9 to reach DNS server on cilium-test/echo-other-node-7dc896756-jsgr2 pod...
⌛ [kind-kind] Waiting for pod cilium-test/client2-bf46d8f5d-sb49v to reach DNS server on cilium-test/echo-other-node-7dc896756-jsgr2 pod...
⌛ [kind-kind] Waiting for pod cilium-test/client-668b56fb44-swmr9 to reach default/kubernetes service...
⌛ [kind-kind] Waiting for pod cilium-test/client2-bf46d8f5d-sb49v to reach default/kubernetes service...
--> (Needed for perf)⌛ [kind-kind] Waiting for CiliumEndpoint for pod cilium-test/perf-client-6b76f6645-jvnk4 to appear...
--> (Needed for perf)⌛ [kind-kind] Waiting for CiliumEndpoint for pod cilium-test/perf-client-other-node-55c54fb756-qhnlv to appear...
--> (Needed for perf)⌛ [kind-kind] Waiting for CiliumEndpoint for pod cilium-test/perf-server-74dc77c7-2xbns to appear...
⌛ [kind-kind] Waiting for CiliumEndpoint for pod cilium-test/echo-other-node-7dc896756-jsgr2 to appear...
⌛ [kind-kind] Waiting for CiliumEndpoint for pod cilium-test/echo-same-node-5557788d8-pvtr5 to appear...
⌛ [kind-kind] Waiting for Service cilium-test/echo-other-node to become ready...
⌛ [kind-kind] Waiting for Service cilium-test/echo-same-node to become ready...
⌛ [kind-kind] Waiting for NodePort 172.18.0.5:30760 (cilium-test/echo-other-node) to become ready...
⌛ [kind-kind] Waiting for NodePort 172.18.0.5:30547 (cilium-test/echo-same-node) to become ready...
⌛ [kind-kind] Waiting for NodePort 172.18.0.2:30760 (cilium-test/echo-other-node) to become ready...
⌛ [kind-kind] Waiting for NodePort 172.18.0.2:30547 (cilium-test/echo-same-node) to become ready...
⌛ [kind-kind] Waiting for NodePort 172.18.0.3:30760 (cilium-test/echo-other-node) to become ready...
⌛ [kind-kind] Waiting for NodePort 172.18.0.3:30547 (cilium-test/echo-same-node) to become ready...
⌛ [kind-kind] Waiting for NodePort 172.18.0.4:30547 (cilium-test/echo-same-node) to become ready...
⌛ [kind-kind] Waiting for NodePort 172.18.0.4:30760 (cilium-test/echo-other-node) to become ready...
ℹ️ Skipping IPCache check
🔭 Enabling Hubble telescope...
⚠️ Unable to contact Hubble Relay, disabling Hubble telescope and flow validation: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp [::1]:4245: connect: connection refused"
ℹ️ Expose Relay locally with:
cilium hubble enable
cilium hubble port-forward&
ℹ️ Cilium version: 1.12.2
🏃 Running tests...
[=] Test [network-perf]
```
Describe the solution you'd like
Splitting the performance tests into their own sub-command would help alleviate this issue, as well as others that may arise as the performance tests are expanded. I understand that a performance test is technically a connectivity test, but there are two main reasons why I think the split is worth doing:

- Isolate performance-test options from connectivity-test options. The performance test options are effectively "namespaced" already, prefixed with `--perf-`. As the suite grows, more of these options may need to be added, for example to add new test types or to configure the output format of test results. Mixing performance and non-performance options in the `connectivity test --help` output could make it difficult to read in the CLI. Additionally, not all of the non-perf options are implemented for the performance tests even when they intuitively still apply; for instance, `--multi-cluster` is a no-op.
- Help with code organization. If we wanted to fix the issue where non-performance resources are created when the performance tests are toggled on, then when deploying resources in `connectivity.ConnectivityTest.deploy` we could check whether performance tests are enabled in an `if` statement towards the top of the function, and return early from that branch to prevent the other resources from being created:
```go
// deploy ensures the test Namespace, Services and Deployments are running on the cluster.
func (ct *ConnectivityTest) deploy(ctx context.Context) error {
	if ct.params.Perf {
		// Deploy only the perf resources, then return early so the
		// non-perf resources below are never created.
		return nil
	}
	// deploy the non-perf resources as before
	return nil
}
```
(This `if` statement already exists; it's just in the middle of the function and doesn't have a return statement.)
But then we need to add one of these `if` statements at the top of `connectivity.ConnectivityTest.validateDeployment`, since that function checks for the resources we skipped deploying earlier in `deploy` and would fail the run:
```go
// validateDeployment checks if the Deployments we created have the expected Pods in them.
func (ct *ConnectivityTest) validateDeployment(ctx context.Context) error {
	if ct.params.Perf {
		// Validate only the perf resources, then return early so we
		// never check the non-perf resources we skipped deploying.
		return nil
	}
	// validate the non-perf resources as before
	return nil
}
```
This pattern is also repeated in `connectivity.Run`, which kicks off the performance tests if they are enabled and returns when they are done:
```go
func Run(ctx context.Context, ct *check.ConnectivityTest) error {
	// ---
	// some setup tasks
	// ---
	if ct.Params().Perf {
		// Run the performance tests, then return when they are done.
		return nil
	}
	// run the regular connectivity tests
	return nil
}
```
All of this essentially splits our core connectivity-test functions in half: one part for the performance tests and one for the connectivity tests. If we need to split the functions up anyway, I think it would make more sense to move the performance-related code into its own sub-command. That would help a lot with readability, and with preventing and finding bugs.
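To illustrate, here is a rough sketch of the shape after the split, in the same style as the snippets above; the `RunBenchmark` entry point and its contents are hypothetical, not existing functions:

```go
// After the split, each entry point is linear and the cross-cutting
// Perf guards disappear. (RunBenchmark is a hypothetical name.)
func RunBenchmark(ctx context.Context, ct *check.ConnectivityTest) error {
	// deploy and validate perf stuff only
	// run perf tests
	return nil
}

func Run(ctx context.Context, ct *check.ConnectivityTest) error {
	// deploy and validate connectivity stuff only
	// run connectivity tests
	return nil
}
```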
As an example of the kind of bug this would help with: for some reason I get a timeout error when running the performance tests on a fresh cluster, and I have to invoke the command twice to get them going:

```
❯ kubectl delete ns cilium-test
namespace "cilium-test" deleted
❯ ./cilium connectivity test --perf
...
connectivity test failed: timeout reached waiting lookup for localhost from pod cilium-test/client2-bf46d8f5d-r22q8 to server on pod cilium-test/echo-same-node-5557788d8-vbrkz to succeed
❯ ./cilium connectivity test --perf # now it works
...
ℹ️ Cilium version: 1.12.2
🏃 Running tests...
```
This doesn't happen for the connectivity tests themselves:
```
❯ kubectl delete ns cilium-test
namespace "cilium-test" deleted
❯ ./cilium connectivity test
...
ℹ️ Cilium version: 1.12.2
🏃 Running tests...
```
(I'm on kind v0.15.0 with image `kindest/node:v1.24.0`, cilium-cli commit d9371d3dc6dbe9f8d8437bd34ee848d7358ff4e0, and Cilium v1.12.1; the cluster has three workers and one control-plane node.)
I'd suggest we create a new folder under the root called `benchmark` and add a `benchmark` sub-command:
```
benchmark/              <--- new package with the benchmark implementation
internal/cli/cmd/
    benchmark.go        <--- $ cilium benchmark
```
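As a starting point, the new sub-command could be wired up like the existing ones. A minimal sketch, assuming the CLI keeps using spf13/cobra; the flag and the `runBenchmark` helper are hypothetical, not existing cilium-cli APIs:

```go
package cmd

import (
	"context"
	"time"

	"github.com/spf13/cobra"
)

func newCmdBenchmark() *cobra.Command {
	var duration time.Duration

	cmd := &cobra.Command{
		Use:   "benchmark",
		Short: "Run network performance benchmarks",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Deploy only the perf workloads and run the benchmarks,
			// without creating any connectivity-test resources.
			return runBenchmark(cmd.Context(), duration)
		},
	}

	// Flags no longer need a --perf- prefix; the sub-command itself
	// provides the namespace.
	cmd.Flags().DurationVar(&duration, "duration", 10*time.Second, "Duration of each benchmark run")

	return cmd
}

// runBenchmark is a hypothetical helper that would deploy the perf
// client/server pods and execute the benchmarks against them.
func runBenchmark(ctx context.Context, duration time.Duration) error {
	// deploy perf stuff, run tests, report results
	return nil
}
```

A nice side effect is that the performance flags drop their `--perf-` prefix, and `connectivity test --help` no longer mixes the two option sets.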
I think `benchmark` is a better choice than `perf`, as it's a bit more precise and may avoid confusion with profiling.
I'd love to take up this task; I just want to see what folks think and get a thumbs up first. Thanks! :tada:
CC: @christarazi