Refine CVE check in scs-0210-v2 test script.
The test script currently does not really check whether patch-level updates that target critical CVEs are deployed in time.
Furthermore, the standard is a bit vague about whether this part is actually required or recommended.
Thirdly, could you suggest how to best integrate with CVE check tools? For instance, the test script could accept a log file produced by one of these tools and just verify that the tool ran fine. You could then add this to the standard as a recommendation; I think we might get this in even with the now stable standard, because it wouldn't turn any compliant clouds non-compliant.
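For illustration only, a minimal sketch of such an integration hook (the function name, file format, and "JSON report" assumption are all hypothetical; the real contract would depend on whichever tool we end up recommending):

```python
# Hypothetical integration point: the test script accepts a report file produced by
# an external CVE check tool and only verifies that the tool ran and produced
# parseable output; judging the findings stays with the tool itself.
import json
import sys

def cve_report_usable(path: str) -> bool:
    try:
        with open(path) as report:
            json.load(report)          # assumes a JSON report; adjust per tool
    except (OSError, json.JSONDecodeError) as err:
        print(f"CVE scanner report not usable: {err}", file=sys.stderr)
        return False
    return True
```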
Note to myself: while working on #476 I noticed that the VersionRange class doesn't cover all use cases nicely. Without workarounds, it covers the cases:
- a single version is affected (`upper_version` is `None`)
- a range is affected, but both ends must be known
It doesn't cover the case that all versions prior to `upper_version` are affected, which I have worked around by using version 0.0.0 as the lower version. The same goes for the other direction (which would mean there is no patched version available/known).
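A possible shape for a less restrictive class, as a sketch only (class name and semantics are assumptions, not the current implementation): treat both ends as optional, reading the upper bound as "first fixed version".

```python
from dataclasses import dataclass
from typing import Optional

def _key(version: str) -> tuple:
    # naive comparison key for "1.27.2"-style versions
    return tuple(int(part) for part in version.lstrip("v").split("."))

@dataclass
class OpenVersionRange:
    lower_version: Optional[str] = None   # None: every version below upper_version is affected
    upper_version: Optional[str] = None   # None: no fixed version known yet

    def contains(self, version: str) -> bool:
        if self.lower_version is not None and _key(version) < _key(self.lower_version):
            return False
        if self.upper_version is not None and _key(version) >= _key(self.upper_version):
            return False
        return True

# e.g. "everything before 1.27.3 is affected":
# OpenVersionRange(upper_version="1.27.3").contains("1.27.2")  -> True
```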
And a general note: I would actually like to replace our custom CVE retrieval and parsing with an existing library, if possible. An obvious candidate I want to evaluate in this regard is cvelib.
@martinmo Yes, very well. Also (as stated in the description of this issue) we might require the use of some external CVE check tool, if you can find something appropriate.
Oh, and one other note: maybe in the course of this issue you can also try to make the check work for any given date, instead of just the current one, so that unit tests can work without monkeypatching.
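A sketch of what a date-aware signature could look like (function and parameter names are made up; `window_days` is just a placeholder, the actual deadlines come from the standard):

```python
import datetime
from typing import Optional

def patched_in_time(cve_published: datetime.date,
                    patch_deployed: Optional[datetime.date],
                    reference_date: Optional[datetime.date] = None,
                    window_days: int = 7) -> bool:
    """Check the deadline against an explicit reference date instead of 'today'."""
    today = reference_date if reference_date is not None else datetime.date.today()
    deadline = cve_published + datetime.timedelta(days=window_days)
    if patch_deployed is not None:
        return patch_deployed <= deadline
    return today <= deadline   # not patched yet, but the window may still be open
```

Unit tests can then simply pass a fixed `reference_date` instead of monkeypatching the clock.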
I did some research on the CVE/vulnerability scanning part of this issue.
An additional candidate for a Python CVE query library is nvdlib, which uses the "National Vulnerability Database". But there is a big caveat: the database is not reliable at the moment (see https://heise.de/-9656574, for example).
However, before digging deeper into the Python libraries, I decided to look for vulnerability scanning solutions in the K8s ecosystem:
- Because even if we have a good CVE query library, we still need to scan the K8s cluster ourselves and match the results. I am sure this is an already solved problem.
- Furthermore, while experimenting with this I noticed that our current approach has another shortcoming. We just compare the K8s version of one particular component when we connect to the cluster with the `kubernetes-asyncio` package, and not the complete cluster. (Nodes could, in theory, run slightly different versions of `kubelet` and of container images in the `kube-system` namespace, such as `kube-apiserver`.)
The proper way to address point 2 would be to create an inventory and check it, for example with the cluster-inventory plugin for sonobuoy or the KBOM ("Kubernetes Bill of Materials").
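Short of a full inventory, a first step toward a cluster-wide view could be to at least collect the kubelet version of every node. A sketch, assuming the test script already has a kubeconfig and uses kubernetes-asyncio:

```python
import asyncio
from kubernetes_asyncio import client, config

async def node_kubelet_versions(kubeconfig: str) -> dict:
    """Map node name -> kubelet version, so per-node version skew becomes visible."""
    await config.load_kube_config(config_file=kubeconfig)
    async with client.ApiClient() as api:
        nodes = await client.CoreV1Api(api).list_node()
        return {node.metadata.name: node.status.node_info.kubelet_version
                for node in nodes.items}

# print(asyncio.run(node_kubelet_versions("./kubeconfig.yaml")))
```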
A promising solution to tackle points 1 and 2 seems to be trivy, which conveniently is Apache-2.0 licensed. For example, the experimental `trivy k8s` subcommand can be used to scan a cluster. I successfully tried the following on a test cluster:
trivy k8s --report=summary --scanners=vuln cluster
trivy k8s --format=json --scanners=vuln --namespace=kube-system all
trivy k8s --scanners=vuln --namespace=kube-system --format=json -o result.json nodes,pods
JSON output is supported, which means we can further process the information.
Trivy can also be run in a K8s-native fashion as an operator (trivy-operator). However, I think this doesn't make sense if we only test short-lived clusters that exist only during the conformance tests.
Today I brought the question of which scanning tools could be used to the Team Container call. However, because of holidays/vacation, there were only two of us, so this couldn't be discussed with a broader audience (I'll try again next week if necessary).
In the meantime, I picked up another tool that I will evaluate: kubescape (https://kubescape.io/).
I performed some evaluation of more CVE scanner tools for K8s. Unfortunately, most of them are not suited for our purpose – they either do not scan cluster components (e.g., the kubelet or apiserver) or they cannot easily be included in a CI pipeline (some of them are nice UI dashboards):
- `kubescape` (https://kubescape.io) CLI only does vulnerability scanning of the container images
- clair (https://github.com/quay/clair) does the same
- kubeclarity (https://github.com/openclarity/kubeclarity, formerly known as kubei) uses `trivy` under the hood, which I have already written about above (and which is a good candidate)
- `kube-hunter` (https://github.com/aquasecurity/kube-hunter/) is EOL
So all in all, trivy seems to be the best option. One thing to keep in mind, though: the `trivy k8s` CLI is still experimental and the format of the JSON export may not be stable.
Furthermore, yesterday I prototyped with the Python libraries cvelib and nvdlib to see how much effort the library approach would be:
cvelib is not suitable: it doesn't provide a sophisticated search by product name. It is aimed more towards security professionals who want to assign/reserve/issue CVEs (e.g., the cve_api module provides functions to publish and reserve entries and to look up a specific CVE by its ID). Furthermore, an API key is needed to interact with the CVE Services API.
nvdlib could be used if the trivy solution doesn't work out. It is more effort than the trivy solution but still an improvement over our current custom solution. Some facts:
- There is rate limiting if used without an API key (a 6 s delay between requests is needed without one)
- We can use `searchCVE(…)` with the `cpeName` kwarg
- CPE (Common Platform Enumeration) is a standardized way to identify affected products
- The CPE Dictionary XML is the official listing where we can get the (partial) CPE (https://nvd.nist.gov/products/cpe)
For example, knowing that our cluster runs v1.27.2, with
import nvdlib
results = nvdlib.searchCVE(
cpeName='cpe:2.3:a:kubernetes:kubernetes:1.27.2',
isVulnerable=True,
cvssV3Severity="HIGH",
limit=10
)
we can get the CVEs this version is affected by. The library also wraps the CVE records data in a nice data structure.
According to my research, it should be sufficient to search only for
`cpe:2.3:a:kubernetes:kubernetes:<version>:<update>`
The `<version>` part is something like 1.27.2 and the `<update>` part is used for prereleases; it should be `-` instead of `*` (wildcard). I grepped through the CPE dictionary and only found that, historically, the apiserver had a separate `<product>` in its CPE (it was `cpe:2.3:a:kubernetes:apiserver`), but only until v1.25rc1.
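A tiny helper for building that CPE name from a cluster version could look like this (a sketch; the function name is made up), and its result can then be passed to `searchCVE(cpeName=…)` as above:

```python
def k8s_cpe_name(version: str, update: str = "-") -> str:
    """Build the CPE 2.3 name for a Kubernetes release, e.g. 'v1.27.2' -> ...:1.27.2:-"""
    return f"cpe:2.3:a:kubernetes:kubernetes:{version.lstrip('v')}:{update}"

assert k8s_cpe_name("v1.27.2") == "cpe:2.3:a:kubernetes:kubernetes:1.27.2:-"
```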
In today's Team Container call I brought the issue up again. Sven confirmed that trivy is a good approach. It was decided that I try the trivy approach with an MVP first. If it doesn't work out, I can still switch to the library approach.
FTR, we also had a short discussion whether the standard is feasible for CSPs, i.e., whether the timeframes that are set out are too short. It was concluded that it is feasible and that in reality a CSP needs to react quickly anyway. Also, in practice, critical K8s vulnerabilities do not appear often. (Nevertheless, this issue here can be tackled independently as it just deals with the implementation of the check.)
For the prototype using trivy with the `k8s` subcommand, my first goal was to "narrow"/filter the command invocation as much as possible. With the `--help` flag and some trial and error I arrived at:
trivy k8s --scanners=vuln --components=infra --report=summary \
--severity=HIGH --exit-code=1 --format=json -o trivy-cluster-infra-scan.yml cluster
The "narrowing" happens with --scanners=vuln, --components=infra and --severity=HIGH flags. I could not find a JSON schema for the resulting output, however the format is simple enough and is codified in the ConsolidatedReport struct (because of --report=summary) in https://github.com/aquasecurity/trivy/blob/v0.50.1/pkg/k8s/report/report.go.
Now I have two problems:
- Testing the `trivy` invocations, I quickly reached the Docker Hub rate limit (`TOOMANYREQUESTS: You have reached your pull rate limit.`). I can raise the limit a bit by using a Docker Hub account. However, I'm still concerned about this. It seems the images are not cached; flags such as `--offline-scan` and `--skip-db-update` didn't help.
- When tested against a cluster with the latest patch release (Kubernetes v1.27.12), I still get findings with severity "HIGH". For example, my `kube-proxy` pods run the image `registry.k8s.io/kube-proxy:v1.27.12` and get flagged as vulnerable because of an issue in `runc` (CVE-2024-21626). This is unexpected noise; here I am concerned about how to correctly filter/handle this (see the sketch below).
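One way to handle the second point could be a post-filter on the JSON report. The sketch below only assumes that each finding carries a `VulnerabilityID` field (as trivy's vulnerability entries do) and otherwise walks the structure generically; whether an explicit ignore list is the right policy is still open for discussion:

```python
import json

# CVEs that concern container runtime/base-image packages rather than Kubernetes
# components themselves.
IGNORED_CVES = {"CVE-2024-21626"}   # runc

def strip_ignored(node):
    """Recursively drop vulnerability entries whose VulnerabilityID is ignored."""
    if isinstance(node, dict):
        return {key: strip_ignored(value) for key, value in node.items()}
    if isinstance(node, list):
        return [strip_ignored(item) for item in node
                if not (isinstance(item, dict)
                        and item.get("VulnerabilityID") in IGNORED_CVES)]
    return node

with open("trivy-cluster-infra-scan.yml") as report_file:
    report = strip_ignored(json.load(report_file))
```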
FTR, the EOL check failed for the first time in the Zuul E2E tests for cluster-stacks because I did not update k8s-eol-data.yml after 1.30 was released. This approach alone is prone to this kind of error. Thus, a new requirement for the refinement worked on in this issue: add some kind of interpolation of EOLs if data is missing (releases happen every four months, on the 28th).
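A rough sketch of such an inter-/extrapolation, under the stated assumptions (four-month cadence, release on the 28th) plus the additional assumption that a missing minor release keeps the same spacing of EOL dates as the last recorded one:

```python
import datetime

def extrapolate_eol(minor: int, known_eols: dict) -> datetime.date:
    """known_eols maps a minor version number (e.g. 29) to its recorded EOL date.

    Only extrapolates forward from the newest recorded minor release.
    """
    if minor in known_eols:
        return known_eols[minor]
    latest = max(known_eols)                     # newest minor we have data for
    months_ahead = 4 * (minor - latest)          # assumed four-month release cadence
    base = known_eols[latest]
    month_index = base.month - 1 + months_ahead
    return datetime.date(base.year + month_index // 12, month_index % 12 + 1, 28)

# e.g. with known_eols = {29: datetime.date(2025, 2, 28)}:
# extrapolate_eol(30, known_eols) -> datetime.date(2025, 6, 28)
```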
During my research on this task I tried to dive into all of the mentioned technologies.
I started by setting up the K8s cluster with yaook, unfortunately without any success, so I moved to the kind approach, on which I installed OpenStack, and on that OpenStack I was able to use capi.
EDIT: after an attempt to create a Kubernetes cluster on top of OpenStack using yaook, there was an error, also found by @michal-gubricky:
TASK [bootstrap/ssh-known-hosts : Trigger SSH certificate renewal] **********************************************************************************************************************
fatal: [managed-k8s-gw-0]: FAILED! => changed=false
msg: |-
Unable to start service renew-ssh-certificates: Job for renew-ssh-certificates.service failed because the control process exited with error code.
See "systemctl status renew-ssh-certificates.service" and "journalctl -xeu renew-ssh-certificates.service" for details.
In the logs of the renew-ssh-certificates.service on the gateway instance, the following error was visible:
Error writing data to auth/yaook/nodes/login: Put "https://127.0.0.1:32769/v1/auth/yaook/nodes/login": dial tcp 127.0.0.1:32769: connect: connection refused
failed to log into vault!
What I have done right now is restore the original standard text and drop the changes based on the review comment.
Regarding the code, several changes were made:
- Integrated Trivy for scanning Kubernetes pod images for security vulnerabilities.
- Fixed issue with ClusterInfo object being incorrectly passed where kubeconfig path was expected.
- Added logging improvements to provide clearer insights during version compliance checks.
- Refined the code structure to handle K8s image scanning and cluster versioning in an async manner.
I found some problems with SSL certificates on my side as well; for macOS users there is a simple solution with
/Applications/Python\ 3.10/Install\ Certificates.command
What mostly stopped me was an error telling me `AttributeError: 'ClusterInfo' object has no attribute 'split'`; therefore I added a field with the kubeconfig variable to ensure that I am passing the path to the kubeconfig instead of the class, which indeed does not have a `split()` attribute.
EDIT: I haven't run into the problems with request quantity, but I was able to verify that pulls are:
- Limited to 100 pulls per 6 hours from a single IP address for unauthenticated users.
- Limited to 200 pulls per 6 hours for a single Docker Hub account.
The code is waiting for review.