`GetCapacityResponse` should contain "total capacity"
GetCapacityResponse should contain "total capacity" in addition to available_capacity so that the caller can make decisions about provisioning.
But GetCapacity is a controller call; without attaching the volume somewhere, it may be tough for the storage provider to say how much capacity is available. For example, I don't recall the AWS EBS API supporting returning available capacity when we describe a volume.
This GetCapacity response is used for determining how much capacity is available for provisioning. It's not for reporting how much capacity is used/available in a single volume.
Some more detail on the motivation for being able to report a total capacity. If a plugin only reports the current available capacity, that limits how well the controller can use that information:
- Additional scheduling latency because you need to wait for the RPC round trip for the available capacity to be updated after a CreateVolume() call
- Parallelism of scheduling and provisioning is limited due to this round trip dependency
- Attempts to cache requested capacity for "in-flight" CreateVolume operations are challenging when you consider plugin restarts. It's not clear to an observer whether the reported available capacity includes the outstanding requests or not.
Thanks for providing additional context @msau42.
It's still not clear to me how including the "total" capacity helps resolve these things. If the CO can't reason about "available" capacity due to parallel operations then it's not obvious to me how having the "total" capacity helps: the CO is still in the same position re: being unable to reason about operations executing in parallel or that may/may-not have completed after a plugin restart.
Given that storage provisioning/quota policy parameters are likely governed by the backend storage system itself (and invisible to the CO), I think that relying on "stable" cached values for "total" capacity is probably fraught w/ error for some set of backends. I suppose the same could be said of "available" capacity - caching this value for very long might not be a very good idea.
With total capacity reported, the CO can keep track of what volumes it has created and what is outstanding. Plugin restarts are fine because that base number doesn't change, and the rest of the information can be persisted and reconstructed as needed. However, with only available capacity, we can't tell how many of the volumes we know about are accounted for in the reported capacity.
This does have the limitation that it assumes the reported total capacity is completely allocated to the CO cluster and not shared with other clusters or allocated out of band.
Let me try to convey the difficulty with an example.
1. When the plugin is being initialized, we can query it for the available capacity, and it returns 100 GB out of 500.
2. Say there are some 50 volume creation operations all in flight.
3. At the same time, an administrator decides to add more available capacity to the storage backend.
4. Also at the same time, volumes are getting deleted and their capacity will be added back to the pool.
As a CO, when I periodically query the plugin for available capacity, how do I know which operations have been accounted for in the number that the plugin gives me? There is a timing delay where the CO's view can be out of sync with the plugin's view.
If there were a total capacity field, then it wouldn't matter what the plugin's view is of 2) or 4). The CO can calculate available capacity based only on its view of the allocated volumes and operations in flight. When it queries the plugin for capacity, a change in total capacity means something like 3) occurred, which is also not as frequent an event as 2) or 4).
Let me know if this makes any more sense.
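To make the bookkeeping concrete, here is a minimal Go sketch of the CO-side calculation this would enable, using the numbers from the example; all type and field names are illustrative, not part of the CSI spec:

```go
package main

import "fmt"

// capacityTracker is a hypothetical CO-side record of capacity state.
type capacityTracker struct {
	totalCapacity int64            // last total reported by GetCapacity
	allocated     map[string]int64 // volume ID -> size of volumes the CO created
	inFlight      map[string]int64 // request ID -> size of outstanding CreateVolume calls
}

// available is computed purely from the CO's own records, so it stays
// consistent even while creations (2) and deletions (4) race against the
// query; only a change in totalCapacity (e.g. the admin grew the pool, 3)
// forces a refresh of the base number.
func (t *capacityTracker) available() int64 {
	avail := t.totalCapacity
	for _, sz := range t.allocated {
		avail -= sz
	}
	for _, sz := range t.inFlight {
		avail -= sz
	}
	return avail
}

func main() {
	t := &capacityTracker{
		totalCapacity: 500 << 30, // 500 GiB total reported by the plugin
		allocated:     map[string]int64{"vol-1": 400 << 30},
		inFlight:      map[string]int64{"req-1": 10 << 30},
	}
	fmt.Printf("CO-computed available: %d GiB\n", t.available()>>30)
}
```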
Thanks for the example, your use case is much more clear now. I think using total_capacity in the way that you want will break down for cases where storage backends can provision different flavors of volumes (depending on create params) such that those volumes consume "raw" storage capacity in different ratios. For example, if an LVM2 VG is a storage backend and supports carving both linear and RAID1 volumes, then total_capacity changes depending on the params passed to GetCapacity. So if there are two storage classes that consume storage in different ratios from the same backend (and the CO doesn't have the formulas to calculate how total_capacity will be affected), then what's the value of reporting total_capacity? In other words, linear volume creation consumes storage such that it would affect the total_capacity reported by GetCapacity(params=RAID1-flavor) and vice versa.
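A toy illustration of that coupling, assuming a simplified 2x mirroring factor for RAID1 (real LVM also reserves metadata space, which is ignored here):

```go
package main

import "fmt"

// totalCapacityFor shows how one VG's raw capacity maps to different
// total_capacity values depending on the requested volume flavor.
// Illustrative only; not how any real plugin computes it.
func totalCapacityFor(rawBytes int64, flavor string) int64 {
	switch flavor {
	case "linear":
		return rawBytes // 1:1 consumption of raw extents
	case "raid1":
		return rawBytes / 2 // every logical byte is mirrored onto two devices
	default:
		return 0
	}
}

func main() {
	raw := int64(1000 << 30) // 1000 GiB of raw VG space
	fmt.Println("GetCapacity(params=linear):", totalCapacityFor(raw, "linear")>>30, "GiB")
	fmt.Println("GetCapacity(params=raid1): ", totalCapacityFor(raw, "raid1")>>30, "GiB")
	// Creating a 100 GiB linear LV consumes 100 GiB of raw space, which also
	// shrinks the RAID1 view by 50 GiB -- the two "totals" are coupled.
}
```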
Agree with jdef. Storage class parameters will become richer over time, and it will be the job of the backend to optimally map volume requests to available "generalized capacity" - by which I mean not just storage capacity, but IOPs, network bandwidth, and many other constrained resources. Trying to report capacity as a single value isn't going to be meaningful.
The question is whether the capacity can be cached by the CO for volume scheduling or not.
From my understanding, there are two things here:
1. Adhere to one storage allocation way in one storage class
To calculate the available capacity, we need to know how the backend storage (e.g. a VG, a Ceph image pool) is consumed (e.g. linear, raid1, or filesystems with a fixed ratio of filesystem to block size).
If there are multiple ways to consume the backend storage for one storage class, the available_capacity cannot be calculated for that storage class either.
If there is only one way to allocate the volume from a storage class, then we can report the capacity for this storage class. The CO can cache them to make scheduling decisions.
I think this is the only way to do volume scheduling; otherwise, the capacity cannot be calculated and used by the CO, and the CO can only assume all nodes have enough capacity, which is the status of current volume scheduling in Kubernetes.
The storage driver or plugin can support carving multiple types of volumes (e.g. linear and raid1 volumes in LVM), but for each storage class, it should support only one allocation type.
2. It's best to report total capacity to improve the experience of volume scheduling
Described by @msau42 here. Without total capacity per topology segment per storage class (for local storage, each node is a topology segment), the scheduler may make bad decisions when its state is out of sync with the storage backends.
It is possible to recover by rescheduling. However, the scheduler may not choose the best-fit node when its state is out of sync (e.g. the storage of the best-fit node is occupied by terminating PVs).
As I clarified in 1), if we have available capacity per topology segment per storage class, there is only one allocation way for this storage class, and we can calculate the total capacity in most cases.
For linear volumes, it's easy. For raid1 volumes, LVM will allocate some space for metadata, and the ratio of metadata to total space is not fixed. For these kinds of volumes, we can reserve some space for metadata, e.g. <volume count limit> * copies * sizeof(extent). The number of copies determines the allocation way, which cannot be updated after the storage class is created. The volume count limit can be hard-coded or specified in storage class parameters too.
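A rough sketch of that reservation with assumed numbers (4 MiB is the default LVM extent size; the volume count limit and raw capacity are made up):

```go
package main

import "fmt"

func main() {
	const (
		extentSize       = 4 << 20 // 4 MiB per extent (LVM default)
		copies           = 2       // raid1 with two copies
		volumeCountLimit = 1000    // hard-coded or from storage class parameters
	)
	// The reservation formula from the comment above:
	// <volume count limit> * copies * sizeof(extent)
	reserved := int64(volumeCountLimit * copies * extentSize)
	fmt.Printf("reserve %d MiB for raid1 metadata\n", reserved>>20)

	// total_capacity for this storage class could then be reported as
	// (raw capacity / copies) - reserved, letting the fixed reservation
	// absorb the variable per-volume metadata overhead.
	raw := int64(1000 << 30) // 1000 GiB raw VG space (assumed)
	fmt.Printf("reported total: %d GiB\n", (raw/copies-reserved)>>30)
}
```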
To report total capacity, there are two ways:
- Return total capacity in the available_capacity field when a special parameter is given. Only one of available or total capacity is needed for volume scheduling.
- Add a total_capacity field.

I suggest adding the total_capacity field.
I think the total_capacity is the total allocatable capacity for the backend storage when using the allocation way specified by the GetCapacity parameters.
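For clarity, a hypothetical Go mirror of what the response shape could look like with the proposed field; the existing message only has available_capacity, and TotalCapacity here is the suggestion from this thread, not part of the spec:

```go
package csisketch

// GetCapacityResponse is a hand-written stand-in for the generated CSI
// message, shown only to make the proposed semantics concrete.
type GetCapacityResponse struct {
	// Existing field: capacity that can be used to provision volumes
	// matching the request's parameters and topology right now.
	AvailableCapacity int64
	// Proposed field: total capacity allocatable with the allocation way
	// implied by the same parameters, independent of in-flight
	// creations and deletions.
	TotalCapacity int64
}
```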
For each storage class, if we adhere to one allocation way in one storage class, we can cache the reported available capacity for storage classes in CO to do dynamic volume provisioning. If the driver has a way to report total capacity, the experience will be better.
> If there is only one way to allocate the volume from a storage class, then we can report the capacity for this storage class. The CO can cache them to make scheduling decisions.
Right, in Mesos we cache this result for brief periods, and requery every CSI plugin instance, for every StorageClass (we call them profiles), every couple of seconds (10s or 30s, I can't recall) to remain reasonably up-to-date.
> The storage driver or plugin can support carving multiple types of volumes (e.g. linear and raid1 volumes in LVM), but for each storage class, it should support only one allocation type.
This makes sense: a StorageClass definition is transformed into GetCapacityRequest.Parameters. Multiple StorageClass'es result in the CO sending multiple GetCapacity RPCs to the same CSI plugin instance and getting back a potentially different value for available_capacity for each StorageClass.
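A sketch of that per-StorageClass polling loop using the generated Go bindings from the CSI spec repo; the profile map, interval, and logging are illustrative:

```go
package copoller

import (
	"context"
	"log"
	"time"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

// pollCapacities issues one GetCapacity RPC per StorageClass (profile) on a
// fixed interval, mirroring the Mesos-style strategy described above. Each
// profile's parameters become GetCapacityRequest.Parameters.
func pollCapacities(ctx context.Context, ctrl csi.ControllerClient, profiles map[string]map[string]string) {
	ticker := time.NewTicker(10 * time.Second) // interval is illustrative
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for name, params := range profiles {
				resp, err := ctrl.GetCapacity(ctx, &csi.GetCapacityRequest{Parameters: params})
				if err != nil {
					log.Printf("GetCapacity(%s): %v", name, err)
					continue
				}
				// The same plugin instance may legitimately report a
				// different value for each StorageClass.
				log.Printf("profile %s: available=%d", name, resp.AvailableCapacity)
			}
		}
	}
}
```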
> For each storage class, if we adhere to one allocation way in one storage class, we can cache the reported available capacity for storage classes in CO to do dynamic volume provisioning. If the driver has a way to report total capacity, the experience will be better.
It sounds like the design you are proposing assumes the following limitation: available capacity for a given StorageClass cannot change at runtime other than through creating or deleting a volume of that StorageClass.
This is problematic.
1. Given two StorageClasses: StorageClass[raid1] and StorageClass[raid0]. Creating a volume of StorageClass[raid1] results in a RAID-1 LV being created on a specific VG. That implicitly reduces the available_capacity (or used capacity, if you rely on total_capacity, instead) of StorageClass[raid0].
2. This implicit relation between StorageClass'es and the Create/DeleteVolume RPCs needs to live somewhere.
2.1 It can live in the CSI plugin. In Mesos we chose to delegate that knowledge to the CSI plugin: we do no capacity calculation and instead reissue the GetCapacity RPC for every StorageClass, for every CSI plugin instance, at regular intervals.
2.2 It can live in the CO. The CO can encode knowledge of how Create/DeleteVolume influences the available_capacity of a StorageClass as well as related StorageClasses.
You can try and sidestep the issue by restricting every instance of a CSI plugin (e.g., every LVM Volume Group) to a single StorageClass. In that case, you will still make incorrect calculations of available_capacity as the per-volume overhead is not accounted for: 1x10GiB volume uses less storage than 2x5GiB volumes, since the 2 volumes each have some metadata overhead in addition to their 5GiB available volume size. The reason is that the CreateVolume RPC does not interpret the volume size as "the amount by how much available_capacity was reduced", but rather "what is the addressable size of the resulting volume".
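A toy calculation of that overhead effect, assuming 4 MiB of metadata per LV (the actual overhead varies by backend):

```go
package main

import "fmt"

func main() {
	const perVolumeOverhead = int64(4 << 20) // assumed 4 MiB metadata per LV

	// rawFor returns the raw storage consumed by n volumes of the given
	// addressable size, including the assumed per-volume overhead.
	rawFor := func(n, sizeEach int64) int64 {
		return n * (sizeEach + perVolumeOverhead)
	}

	fmt.Printf("1 x 10GiB consumes %d MiB raw\n", rawFor(1, 10<<30)>>20)
	fmt.Printf("2 x  5GiB consumes %d MiB raw\n", rawFor(2, 5<<30)>>20)
	// A CO computing available = total - sum(requested sizes) drifts by the
	// accumulated overhead, because CreateVolume's size is the addressable
	// size of the volume, not the amount of raw capacity consumed.
}
```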
I imagine that Create/DeleteVolume alone are also not enough to model correctly: Create/DeleteSnapshot and the soon-to-be-introduced volume resize functionality also have unexpected impact on capacity.
It is still possible to encode all this wisdom into the CO, but it would have to be done for every kind of CSI plugin, which defeats some of the purpose of the CSI specification.
Another issue with the CO calculating available_capacity as total_capacity minus the sum of volume sizes, which I think is more fundamental, is what to do in the case of a CSI plugin that performs inline data compression. In that case, not even the CSI plugin will know how much available_capacity will be left after it creates a volume, as not all data compresses equally well.
I think the proper solution must be to issue GetCapacity calls and to accept the reality that the CO does not have a perfect view of the amount of available capacity at any given time, but devise strategies to reduce that delta.
One such strategy is to periodically poll CSI plugins for available_capacity for every StorageClass. This does not scale well with the number of StorageClasses.
Such a strategy could be tempered by only requesting available capacity if some capacity-changing RPC like Create/Delete/ResizeVolume or Create/DeleteSnapshot has been performed against that CSI plugin instance since available capacity was last requested.
This has the disadvantage of CO state becoming outdated in the case where the CSI plugin instance's backing storage increases/decreases out-of-band, such as when the administrator extends the LVM VG or adds more disks to a Ceph installation, etc. Perhaps that's OK.
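A sketch of that tempered strategy, assuming the CO can hook its own capacity-changing RPCs; the slow fallback poll for out-of-band changes is omitted:

```go
package copoller

import "sync/atomic"

// refreshGate tracks whether any capacity-changing RPC
// (Create/Delete/ResizeVolume, Create/DeleteSnapshot) has been issued
// against a plugin instance since available capacity was last requested.
type refreshGate struct {
	dirty atomic.Bool
}

// markDirty is called by the CO after completing any capacity-changing RPC.
func (g *refreshGate) markDirty() { g.dirty.Store(true) }

// shouldRefresh is called by the periodic poller; it returns true (and
// resets the flag) only if a GetCapacity re-query is actually warranted.
func (g *refreshGate) shouldRefresh() bool {
	return g.dirty.Swap(false)
}
```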