krane icon indicating copy to clipboard operation
krane copied to clipboard

NetworkPolicy resources are not predeployed when using calico as the network provider

Open trueskawka opened this issue 5 years ago • 11 comments

We've been using your tooling for a while, thanks for making it open-source <3 Appreciate all the work you put into it!

We recently went through the full update path from kubernetes-deploy in 0.25.0 and wanted to update to the newest krane, but this bug is preventing us from updating beyond major version 1.

Bug report

When using calico as the network provider, NetworkPolicy resources are no longer pre-deployed.

Expected behavior: NetworkPolicy resources are pre-deployed as priority.

Actual behavior: NetworkPolicy resources are being deployed with non-priority resources.

Version(s) affected: This has been introduced when bumping to major 2, in https://github.com/Shopify/krane/pull/735

Steps to Reproduce

You can test it in minikube using your test suite, after starting it with calico as the network provider

$ minikube start --network-plugin=cni --cni=calico
$ bundle exec rake test TEST=test/integration/krane_deploy_test.rb TESTOPTS="--name=test_network_policies_are_deployed_first -v"
Preparing test cluster... Done.

# Running tests with run options --name=test_network_policies_are_deployed_first -v --seed 3651:

F

Finished tests in 8.788226s, 0.1138 tests/s, 0.1138 assertions/s.


Failure:
KraneDeployTest#test_network_policies_are_deployed_first [/Users/alicjaraszkowska/Documents/Repos/krane/test/integration/krane_deploy_test.rb:1385]
Minitest::Assertion: 'Predeploying priority resources' not found in the expected sequence in the following logs:
[INFO][2020-12-14 19:02:12 -0800]
[INFO][2020-12-14 19:02:12 -0800]	------------------------------------Phase 1: Initializing deploy------------------------------------
[INFO][2020-12-14 19:02:13 -0800]	All required parameters and files are present
[INFO][2020-12-14 19:02:13 -0800]	Discovering resources:
[INFO][2020-12-14 19:02:15 -0800]	  - NetworkPolicy/allow-all-network-policy
[INFO][2020-12-14 19:02:16 -0800]
[INFO][2020-12-14 19:02:16 -0800]	----------------------------Phase 2: Checking initial resource statuses-----------------------------
[INFO][2020-12-14 19:02:16 -0800]	NetworkPolicy/allow-all-network-policy            Not Found
[INFO][2020-12-14 19:02:16 -0800]
[INFO][2020-12-14 19:02:16 -0800]	----------------------------------Phase 3: Deploying all resources----------------------------------
[INFO][2020-12-14 19:02:17 -0800]	Deploying NetworkPolicy/allow-all-network-policy (timeout: 30s)
[INFO][2020-12-14 19:02:21 -0800]	Successfully deployed in 3.7s: NetworkPolicy/allow-all-network-policy
[INFO][2020-12-14 19:02:21 -0800]
[INFO][2020-12-14 19:02:21 -0800]	------------------------------------------Result: SUCCESS-------------------------------------------
[INFO][2020-12-14 19:02:21 -0800]	Successfully deployed 1 resource
[INFO][2020-12-14 19:02:21 -0800]
[INFO][2020-12-14 19:02:21 -0800]	Successful resources
[INFO][2020-12-14 19:02:21 -0800]	NetworkPolicy/allow-all-network-policy            Created


Slowest tests:

8.719095s test_network_policies_are_deployed_first#KraneDeployTest

Feature request

  • [x] If the maintainers agree with the feature as described here, I intend to submit a Pull Request myself.1

Proposal: Changing the way resources are being prioritized. I'd prefer to go with approach 1, since it would make it more flexible, but I may not have enough context and maybe making a strong assumption about NetworkPolicy resources proposed in approach 2 is sufficient.

  1. Make predeploy sequence configurable Since the change that introduced this issue has been implemented to address a request for a reverse priority order (i.e. custom resource over core, https://github.com/Shopify/krane/issues/728), it seems like having a strong assumption on which resources should be prioritized written directly in code could potentially not fit all the use cases.

    Providing a configurable way for the user to define which resources should be predeployed would allow for allowing any use cases they need.

  2. Ensure NetworkPolicy resources are always predeployed Filter out NetworkPolicy resources from this hash:

    crs = cluster_resource_discoverer.crds.select(&:predeployed?).map { |cr| [cr.kind, { group: cr.group }] }
    

1 This is the quickest way to get a new feature! We reserve the right to close feature requests, even ones we like, if the proposer does not intend to contribute to the feature and it doesn't fit in our current roadmap.

trueskawka avatar Dec 15 '20 03:12 trueskawka

I'm surprised this is happening. By default CRs are pre-deployed. This behavior can be turned off by an annotation on the CRD https://github.com/Shopify/krane#customizing-behaviour-with-annotations krane.shopify.io/predeploye: false .

Can you add the example CR and CRD to a comment in this issue?

dturn avatar Dec 15 '20 19:12 dturn

We're setting up calico as a separate workload in every cluster. Whenever we deploy a different workload to a cluster, calico is already there and the krane discovery is going to find a CRD NetworkPolicy resource coming from it. That overrides the predeployment for the CR NetworkPolicy for any workload we're deploying.

That happens in every workload we deploy to the cluster, no matter what the CR NetworkPolicy is, because the hash for CRDs comes after the hash for CRs when concatenating in DeployTask.predeploysequence (https://github.com/Shopify/krane/blob/master/lib/krane/deploy_task.rb#L59-L78), specifically:

Hash[before_crs + crs + after_crs]

Examples

  1. CRD NetworkPolicy We're using calico CRDs, e.g.:
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: globalnetworkpolicies.crd.projectcalico.org
  labels:
    workload: calico
spec:
  scope: Cluster
  group: crd.projectcalico.org
  version: v1
  names:
    kind: GlobalNetworkPolicy
    plural: globalnetworkpolicies
    singular: globalnetworkpolicy
  1. CR NetworkPolicy Any CR NetworkPolicy will have this issue, a minimal example:
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-network-policy
spec:
  podSelector: {}
  ingress:
    - {}
  policyTypes:
    - Ingress

trueskawka avatar Dec 15 '20 20:12 trueskawka

Oh, I see NetworkPolicy exists in extensions / networking.k8s.io depending on k8s version.

When you use Calico it also adds a NetworkPolicy CRD and because of the way we construct the hash: https://github.com/Shopify/krane/blob/master/lib/krane/deploy_task.rb#L77 we only consider Calico's NetworkPolicy as something to predeploy. Yep this is 100% a bug.

So fixing this is a bit tricky. My two stabs at it are in https://github.com/Shopify/krane/pull/774, but neither are quite right. I've got a 3rd idea (we have to track the groups to pre-deploy and the groups not to pre-deploy), but might not finish it today.

dturn avatar Dec 16 '20 21:12 dturn

Yay, glad it makes sense with more context 🎉

I couldn't think of an elegant way to solve it, because in some cases you may want to predeploy CRDs and not CRs (seemed like that was the case for the feature request that the grouping was addressing, i.e. https://github.com/Shopify/krane/issues/728).

I was wondering if there is a definite set of cases for which resources you may or may not want to predeploy, or can it depend on the workload/how the cluster is set up?

trueskawka avatar Dec 16 '20 23:12 trueskawka

I was wondering if there is a definite set of cases for which resources you may or may not want to predeploy, or can it depend on the workload/how the cluster is set up?

Its up to each operator if they want to pre-deploy CRs or not. As an example: If you use a CR to do canary deploys, you wouldn't want that to be pre-deployed.

dturn avatar Dec 17 '20 22:12 dturn

Hi @dturn, happy new year 🎉

Did you end up making any progress on a different approach? Anything I could do to help?

Happy to provide more info and brainstorm :)

trueskawka avatar Jan 04 '21 17:01 trueskawka

Following up @dturn :) Anything I could do to help?

trueskawka avatar Jan 11 '21 19:01 trueskawka

I'm slowly making progress on #774 but when I went to write the test I realized this is going to take longer than expected. It turns out there are at least 3 places we assume Kinds are unique: predeploy, KubernetesResource#build, and ResourceCache.

The PR originally was trying to solve the first issue. Solving the second one is done locally, but still haven't had enough time to finish up the 3rd. Assuming no more surprises, I'm optimistic I'll get it done this week.

dturn avatar Jan 11 '21 19:01 dturn

@trueskawka you've probably figured this out, but I haven't had as much time to work on this as needed. I can't commit to getting this fixed in a reasonable time frame. If you're interested in taking over let me know and I'll do my best to support you. Otherwise, you can follow this this issue and I'll keep it updated.

dturn avatar Feb 02 '21 18:02 dturn

Hi @dturn!

We were able to work around it by doing our networking a bit differently and not having to depend on calico as much. I'm happy to try and PR a change to fix it for others who might be still running into this issue.

Gonna do a PR sometime in the next two weeks and @-mention you :)

trueskawka avatar Mar 31 '21 16:03 trueskawka

@dturn @trueskawka Has anything changed since the last comments here? I'm running into this same problem.

@dturn I know you stated there's three bugs to fix, but I wonder if it's necessary to fix them all together. If you have a solution for the specific issue described here, is it not possible to ship that on its own, acknowledging that other similar issues will remain?

benlangfeld avatar May 30 '22 18:05 benlangfeld