
Prepare scs-0125-secure-connections for stabilization

Open mbuechse opened this issue 4 months ago • 4 comments

We have a draft in main: https://docs.scs.community/standards/scs-0125-v1-secure-connections

Feedback from C&H was positive. It might be desirable to also cover memcached, see https://gitlab.com/yaook/operator/-/issues/226
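As an aside on the memcached point: memcached can terminate TLS natively (since 1.5.13, when built with --enable-tls). A minimal sketch of such an invocation, with placeholder certificate paths:

```shell
# Hypothetical invocation; certificate/key paths are placeholders.
# Requires a memcached build with TLS support (--enable-tls).
memcached -Z \
  -o ssl_chain_cert=/etc/memcached/cert.pem \
  -o ssl_key=/etc/memcached/key.pem
```

Client-side TLS support in the consuming OpenStack services would need to be verified separately.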

mbuechse avatar Sep 12 '25 07:09 mbuechse

Let me give our (ScaleUp) perspective here:

Control plane traffic (e.g. API, database, messaging) is easy to encrypt and can therefore be done without any problem. We already implement most of the suggested measures.

External Neutron Traffic - within the data centre, the communication lines are physically secured quite well, but that all depends on the facility and its specs and certifications. Once traffic hits our (and probably most other providers') backbone, it is considered "the internet", which is unencrypted by nature. I think it is mainly the customer's job to secure their workload.

VPNaaS is a nice way to help customers, but I feel like this is out of scope for this. We don't offer it yet, and most customers opt for VM based solutions or even upstream hardware firewalls.

Internal Neutron Traffic - This is usually segmented with OVS/OVN and sent as VXLAN or GENEVE traffic across a network. The attack vector that could be mitigated here is MITM at the network level; the risk is the same as for a physical server setup connected to switches. Also, a VPN underlay would hurt much-needed performance.

VM Migration Traffic - could be classified as more security-sensitive than storage traffic, as it contains the RAM contents as well. Easy to implement.

Storage network traffic - I don't see this happening for Ceph (and possibly other providers), as it would probably hurt performance quite a lot. Also, ALL storage and compute nodes would need to be able to en- and decrypt traffic. If the customer needs encryption at the storage level, it is better to encrypt with e.g. LUKS.

A note about "secure" zones - Generally it makes sense to separate traffic by, e.g., air-gapping compute and storage nodes with defined jump nodes. Within a data centre, the connections can usually be trusted to a certain level, as physical security is usually very high (security guard, CCTV, badge with PIN, turnstile).

Also, some providers, including ScaleUp, offer interconnects between regions. Once private traffic leaves a site, it is no longer considered "secure" by us, unless a form of layer 2 encryption (e.g. MACsec) is used by the provider. This is often overlooked by customers.
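As an illustration of the layer 2 option: on Linux, a static-key MACsec link can be sketched with iproute2 as below. Keys, MAC address, and interface names are placeholders; production deployments would normally negotiate keys dynamically via MKA (e.g. with wpa_supplicant) rather than use static keys.

```shell
# Placeholder keys/addresses; both ends need mirrored configuration.
ip link add link eth0 macsec0 type macsec encrypt on
ip macsec add macsec0 tx sa 0 pn 1 on key 01 11111111111111111111111111111111
ip macsec add macsec0 rx port 1 address 52:54:00:12:34:56
ip macsec add macsec0 rx port 1 address 52:54:00:12:34:56 sa 0 pn 1 on key 02 22222222222222222222222222222222
ip link set dev macsec0 up
```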

Physical security is something that can only be mandated to a certain extent. It would probably make sense to look into the measures taken by certain providers, as this is part of the security concept as well, but I also think this is one of the areas where providers can differentiate themselves and appeal to a certain group of customers.

fzakfeld avatar Sep 12 '25 10:09 fzakfeld

@horazont @markus-hentsch Can you please comment on what Freerk wrote here?

mbuechse avatar Sep 16 '25 09:09 mbuechse

@horazont @markus-hentsch Can you please comment on what Freerk wrote here?

I think there is no disagreement with the standard draft. More details below:

Control plane traffic (e.g. API, database, messaging) is easy to encrypt and can therefore be done without any problem. We already implement most of the suggested measures.

This layer has the most existing upstream documentation and mechanisms already offered by its components. So, yes, this should be pretty straightforward and is also the only part of the standard with hard requirements.
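For illustration, much of this boils down to standard TLS settings in the service configuration files. A hedged sketch for nova.conf (hostnames, credentials, and CA paths are placeholders; equivalent options exist for the other services):

```ini
[database]
# MySQL over TLS via the SQLAlchemy/PyMySQL driver; ssl_ca enables
# server certificate verification against the given CA bundle.
connection = mysql+pymysql://nova:secret@db.example.org/nova?ssl_ca=/etc/pki/ca.pem

[oslo_messaging_rabbit]
# RabbitMQ (AMQP) over TLS; the transport_url should then point at
# the TLS port (5671 by default).
ssl = true
ssl_ca_file = /etc/pki/ca.pem
```

API endpoints are typically terminated with TLS at a load balancer or reverse proxy in front of the services.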

External Neutron Traffic - within the data centre, the communication lines are physically secured quite well, but that all depends on the facility and its specs and certifications. Once traffic hits our (and probably most other providers') backbone, it is considered "the internet", which is unencrypted by nature. I think it is mainly the customer's job to secure their workload.

VPNaaS is a nice way to help customers, but I feel like this is out of scope for this. We don't offer it yet, and most customers opt for VM based solutions or even upstream hardware firewalls.

This does not conflict with the current standard draft, as VPNaaS is classified as an entirely optional measure.

Internal Neutron Traffic - This is usually segmented with OVS/OVN and sent as VXLAN or GENEVE traffic across a network. The attack vector that could be mitigated here is MITM at the network level; the risk is the same as for a physical server setup connected to switches. Also, a VPN underlay would hurt much-needed performance.

Internal network structure and the overhead cost of encryption may differ greatly. That's why it is also classified as optional in the standard.
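For context on the overhead argument: one way to encrypt this tunnel traffic is Open vSwitch's IPsec integration, which wraps GENEVE/VXLAN tunnels in IPsec. A sketch with a pre-shared key (bridge name, remote IP, and key are placeholders; the openvswitch-ipsec service must be running):

```shell
# Placeholder values throughout; requires the openvswitch-ipsec package.
ovs-vsctl add-port br-tun tun0 -- \
  set interface tun0 type=geneve \
    options:remote_ip=192.0.2.2 \
    options:psk=swordfish
```

Every tunnelled packet then passes through ESP encryption on both ends, which is exactly where the performance concern comes from.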

VM Migration Traffic - could be classified as more security-sensitive than storage traffic, as it contains the RAM contents as well. Easy to implement.

Good.
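For reference, Nova supports QEMU-native TLS for the migration stream, which covers the RAM contents. A sketch of the relevant nova.conf setting, assuming libvirt/QEMU on the compute nodes are already provisioned with client/server certificates:

```ini
[libvirt]
# Encrypt the migration data stream (RAM pages included) with
# QEMU-native TLS; requires certificates on all compute nodes.
live_migration_with_native_tls = true
```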

Storage network traffic - I don't see this happening for Ceph (and possibly other providers), as it would probably hurt performance quite a lot. Also, ALL storage and compute nodes would need to be able to en- and decrypt traffic. If the customer needs encryption at the storage level, it is better to encrypt with e.g. LUKS.

It's pretty specific to the storage backend and network architecture used, and, like you mentioned, it might have a major performance impact. That's why encryption here is classified as a "may" instead of a "should" in the standard.
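Two hedged sketches of what the alternatives could look like in practice: Ceph's messenger v2 protocol has a "secure" mode that encrypts daemon traffic on the wire, while Cinder volume encryption types push the encryption to the compute node via LUKS, so data crosses the storage network already encrypted (the volume type name below is a placeholder):

```shell
# Ceph on-wire encryption via msgr2 "secure" mode
# (cluster-wide; expect CPU overhead on all nodes)
ceph config set global ms_cluster_mode secure
ceph config set global ms_service_mode secure
ceph config set global ms_client_mode secure

# Cinder: a volume type whose volumes are LUKS-encrypted
# front-end, i.e. on the compute node before hitting the network
openstack volume type create LUKS \
  --encryption-provider luks \
  --encryption-cipher aes-xts-plain64 \
  --encryption-key-size 256 \
  --encryption-control-location front-end
```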

A note about "secure" zones - Generally it makes sense to separate traffic by, e.g., air-gapping compute and storage nodes with defined jump nodes. Within a data centre, the connections can usually be trusted to a certain level, as physical security is usually very high (security guard, CCTV, badge with PIN, turnstile).

Also, some providers, including ScaleUp, offer interconnects between regions. Once private traffic leaves a site, it is no longer considered "secure" by us, unless a form of layer 2 encryption (e.g. MACsec) is used by the provider. This is often overlooked by customers.

Physical security is something that can only be mandated to a certain extent. It would probably make sense to look into the measures taken by certain providers, as this is part of the security concept as well, but I also think this is one of the areas where providers can differentiate themselves and appeal to a certain group of customers.

The air-gapping between compute and storage nodes sounds like an interesting approach. However, to be honest, I worry more about the link between the compute nodes and the control plane (e.g. a compute node being able to access the message queue), and I think this is hard to air-gap due to the tight coupling.

Mandating any kind of air gap or specific physical layout is out of scope for a core SCS standard, I think. Maybe a security-focused certification level above the basic ones could address that for high security requirements. Furthermore, attesting the fulfillment of those requirements would require an in-depth analysis of the internal cloud infrastructure and cannot be verified remotely, which is already a problem for some of the optional contents of this standard.

markus-hentsch avatar Sep 18 '25 11:09 markus-hentsch

@markus-hentsch Thanks a ton! Seems to me like we can go ahead and stabilize soon :)

mbuechse avatar Sep 18 '25 11:09 mbuechse