DSC Handle resources that require restarts

Summary of the new feature / enhancement

Some changes to the system require a reboot before additional changes can be applied. dsc itself shouldn't perform a reboot, but we may need a standardized way for resources to report that a reboot is required, such as _rebootRequested: true, which dsc then surfaces as a standard result. Something else then performs the reboot when ready and re-runs the config until it completes or hits another reboot.
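
A minimal sketch of the loop described above, from the point of view of whatever sits on top of dsc. Everything here is an assumption for illustration: `run_config` and `reboot` are hypothetical callables supplied by the higher-level tool, the result shape (a list of dicts) is invented, and `max_reboots` is just a safety limit, not something proposed in this issue.

```python
from typing import Callable, Dict, List


def apply_until_settled(
    run_config: Callable[[], List[Dict]],
    reboot: Callable[[], None],
    max_reboots: int = 5,
) -> int:
    """Re-run the config, rebooting between runs, until no resource asks for one.

    Returns the number of reboots performed.
    """
    reboots = 0
    while True:
        results = run_config()
        # Hypothetical result shape: one dict per resource.
        if not any(r.get("_rebootRequested") is True for r in results):
            return reboots
        if reboots >= max_reboots:
            raise RuntimeError("reboot limit reached before the config settled")
        reboot()
        reboots += 1


# Simulated usage: a fake config that needs two reboots before it settles.
pending = [2]


def fake_run_config() -> List[Dict]:
    return [{"name": "driver", "_rebootRequested": pending[0] > 0}]


def fake_reboot() -> None:
    pending[0] -= 1


reboots_performed = apply_until_settled(fake_run_config, fake_reboot)
```

In this simulation the loop stops after the third run, once the fake resource no longer requests a reboot.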

Proposed technical implementation details (optional)

No response

Apr 04 '23 19:04 SteveL-MSFT

Do we have a spec to document output requirements? If not, maybe that is another ask to PM?

Jun 21 '23 18:06 mgreenegit

My understanding/preference for the output requirements of a DSC Resource is that it must return a JSON blob that is valid by its own schema.

So, using an arbitrary and idealized example:

{
    "$schema": "https://aka.ms/dsc/schemas/resource_manifest",
    "manifestVersion": "1.0",
    "type": "TSToy.Example/gotstoy",
    "version": "0.1.0",
    "get": {
        // truncated
    },
    "set": {
        // truncated
    },
    "schema": {
        "embedded": {
            "$schema": "https://aka.ms/dsc/schemas/resource_definition",
            "title": "Golang TSToy Resource",
            "type": "object",
            "required": [
                "scope"
            ],
            "properties": {
                // truncated property list
                "_rebootRequired": {
                    "$ref": "https://aka.ms/dsc/schemas/keywords/rebootRequired"
                }
            }
        }
    }
}

I think for now we can just define that resources that need to report on reboots must have a _rebootRequired key with a boolean value, but in the long term, letting people reference a well-defined keyword will help a lot. I've been thinking about this a bit with respect to other keywords, like _ensure and _sensitive, too. Having the published schema with some well-known keywords would make authoring easier and help ensure consistency across resource implementations, rather than relying on people to correctly implement those things by hand.
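
To illustrate the short-term rule (a `_rebootRequired` key with a boolean value), here is a rough sketch of how a consumer might read it from a resource's JSON output. The function name is hypothetical, and treating a missing key as `False` is my assumption rather than anything decided in this thread.

```python
import json


def reboot_required(resource_output: str) -> bool:
    """Return the _rebootRequired flag from a resource's JSON output."""
    data = json.loads(resource_output)
    if not isinstance(data, dict):
        raise ValueError("resource output must be a JSON object")
    # Assumption: a resource that omits the key does not need a reboot.
    value = data.get("_rebootRequired", False)
    if not isinstance(value, bool):
        raise ValueError("_rebootRequired must be a boolean")
    return value


needs_reboot = reboot_required('{"scope": "machine", "_rebootRequired": true}')
```

A shared keyword schema would make the hand-written type check above unnecessary, since the resource's own schema would already enforce the boolean.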

Jun 22 '23 14:06 michaeltlombardi

We have a few scenarios where we want the reboot behavior to be variable based on the current context.

Currently when a resource needs a reboot, we set $Global:DSCMachineStatus = 1 within the SET block.

Initial deployment / first config run / prior to the first full compliance -- We want DSC to reboot and continue the config in this scenario. For instance, during initial physical machine configuration, we want to make sure the firmware and drivers are up to date and Windows updates are installed, including reboots (as sometimes multiple reboots are needed). After the initial configuration, we don't want reboots handled automatically by DSC. (in DSC 1.1)
Disruptive changes - Some changes are disruptive during the SET operation. It would be great if we could flag a resource as being disruptive at SET time and subscribe the machines to a specific maintenance window or workflow, so the SET operations are only performed during the prescribed window. For example, updating Mellanox drivers / firmware causes a port reset.
Clustered operations - For changes that are not disruptive at SET invocation but still need a reboot: if we could set the pending reboot flag in DSC and have Cluster Aware Updating recognize that flag, we could do updates to clustered hosts non-disruptively (Drain / Reboot / Resume / Repeat). For example, BIOS settings changes / version updates.
Flighting changes - Similar to the maintenance subscriptions above. Flight this change out to x machines at a time. This might include pre- and post-change operations: before running SET for x resource, run a pre-change operation first, and run the post-change operation after. For example, before updating the bits for a webapp, set the flag that removes the node from the load balancer prior to the change, and add it back after the change. Only do x machines per LB cluster at a time and halt if there are issues.

Outside of reboots, variable resource execution timing would be great too. We don't really need to make sure the drivers are up to date every 15 minutes; once per week or once per month would be fine. We do, however, want the machines to be autocorrected for other things on a faster cadence and, in some instances, immediately. For example: someone enabled SQL extended stored procedures on a db node, turn it off immediately. Someone adds a new local administrator manually, remove it immediately. The config definition is the source of truth.

In summary, I feel rebootRequired is not descriptive enough to get robust behavior when using DSC for ongoing daily DevOps management. Some additional metadata, like disruptiveSet, subscribedWindow, preSetResourceInvoke, and postSetResourceInvoke, would also be useful.

Hope this feedback helps. Please reach out if you'd like some examples of how we've solved some of these challenges.

Jul 18 '23 00:07 jambar42

The actual reboot will need to be initiated by a higher-level agent (Machine Config, Ansible, Chef/Puppet, etc.) or even a script using task scheduler/cron. DSC should aggregate the reboot information into a top-level _rebootRequired as part of the output config (probably under a metadata section to keep it aligned with ARM).
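
A rough sketch of that aggregation, assuming a made-up output shape: each per-resource entry is a dict with a `name` and a `result` object, and the top-level flag lives under a `metadata` key. None of these names are settled; they are placeholders for illustration only.

```python
from typing import Any, Dict, List


def aggregate_reboot(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Roll per-resource _rebootRequired flags up into one top-level flag."""
    any_required = any(
        r.get("result", {}).get("_rebootRequired") is True for r in results
    )
    return {
        # Hypothetical placement: a metadata section at the top of the output.
        "metadata": {"_rebootRequired": any_required},
        "results": results,
    }


output = aggregate_reboot(
    [
        {"name": "firmware", "result": {"_rebootRequired": True}},
        {"name": "registry", "result": {"value": 1}},
    ]
)
```

A higher-level agent would then only need to read `output["metadata"]["_rebootRequired"]` instead of walking every resource result.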

Disruptive changes, like resetting hardware, are probably better discussed in a new issue, so I created https://github.com/PowerShell/DSC/issues/103

For cluster operations, we might just need a specific discussion with that team since it seems very domain specific.

Flighting is also better suited for higher level agents like Machine Config to handle as DSC wouldn't have knowledge of other systems.

Jul 18 '23 15:07 SteveL-MSFT

Even if a higher-level agent is managing the reboots, a configuration/resource still needs a way to indicate that the reboot must occur before the rest of the configuration is applied.

Aug 09 '23 19:08 ThomasNieto

The WG discussed this, and the current proposal is for resources that require a reboot to emit _rebootRequired=true as part of their output; dsc will then inform the user that a reboot is required. If the user ignores that and reruns the config, it is up to the resource to know whether the reboot has been satisfied, or to do nothing except re-emit _rebootRequired=true. dsc will simply re-run the config after the reboot, and since resources are expected to be idempotent, those already in the desired state will be no-ops until config execution gets past the reboot part.
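
One way a resource could "know if a reboot has been satisfied" is to remember which boot session it made the change in. The sketch below is only an illustration of that idea: the persisted `state` dict, the `current_boot_id` argument, and the `value` property are all hypothetical, and a real resource would get a boot identifier from the operating system.

```python
from typing import Dict


def set_resource(state: Dict, current_boot_id: str, desired: str) -> Dict:
    """Idempotent set that re-emits _rebootRequired until a reboot has happened."""
    if state.get("value") != desired:
        # First run: apply the change and remember the boot session it was made in.
        state["value"] = desired
        state["pending_boot_id"] = current_boot_id
        return {"value": desired, "_rebootRequired": True}
    if state.get("pending_boot_id") == current_boot_id:
        # Re-run without a reboot: change nothing, just re-emit the flag.
        return {"value": desired, "_rebootRequired": True}
    # A different boot session means the reboot has been satisfied.
    state.pop("pending_boot_id", None)
    return {"value": desired, "_rebootRequired": False}


state: Dict = {}
first = set_resource(state, "boot-1", "on")   # change applied
second = set_resource(state, "boot-1", "on")  # user re-ran without rebooting
third = set_resource(state, "boot-2", "on")   # re-run after the reboot
```

After the reboot the resource is a no-op and config execution can move on to whatever comes next.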

Feb 20 '24 21:02 SteveL-MSFT

Based on a new scenario that came up, I'm renaming this to restart instead of reboot to handle other cases like restarting a service, window manager, or even computer system.

In this case, my proposal is if a restart is required, we handle 3 cases (more can be added later, like clusters):

If a process needs to be restarted, the resource returns a _restartRequired property with contents:

_restartRequired:
  process:
    name: explorer.exe
    id: 1234

If a service needs to be restarted, the resource returns:

_restartRequired:
  service:
    name: sshd
    id: 1234

If the computer system needs to be restarted:

_restartRequired:
  computer:
    name: MyComputer

Resources that can handle restarting should have a property that indicates a restart should be performed if needed:

_restartIfNeeded: true

The default value is false, in which case the resource returns _restartRequired as above.
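
As a sketch of how a consumer might interpret the three proposed shapes, assuming the output has already been parsed into a dict. The `Restart` class and `parse_restart_required` function are hypothetical names, and rejecting unknown kinds is my assumption (the proposal notes that more kinds, like clusters, may be added later).

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

KINDS = ("process", "service", "computer")


@dataclass(frozen=True)
class Restart:
    kind: str
    name: str
    id: Optional[int] = None  # the computer case carries no id


def parse_restart_required(output: Dict[str, Any]) -> Optional[Restart]:
    """Return the requested restart, or None if the resource asked for none."""
    payload = output.get("_restartRequired")
    if payload is None:
        return None
    if not isinstance(payload, dict) or len(payload) != 1:
        raise ValueError("_restartRequired must contain exactly one kind")
    kind, target = next(iter(payload.items()))
    if kind not in KINDS:
        raise ValueError(f"unknown restart kind: {kind}")
    if not isinstance(target, dict) or "name" not in target:
        raise ValueError("restart target must have a name")
    return Restart(kind=kind, name=target["name"], id=target.get("id"))


process_restart = parse_restart_required(
    {"_restartRequired": {"process": {"name": "explorer.exe", "id": 1234}}}
)
computer_restart = parse_restart_required(
    {"_restartRequired": {"computer": {"name": "MyComputer"}}}
)
```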

Feb 21 '24 19:02 SteveL-MSFT

Is there a case for a single resource returning an array of processes or services that need to be restarted?

Feb 27 '24 22:02 michaeltlombardi

> Is there a case for a single resource returning an array of processes or services that need to be restarted?

I can't think of one, but we can allow a choice of an array if that situation comes up.
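
If the choice of an array were allowed, a consumer could normalize both forms to a list. This is only a sketch of that idea under the assumption that each kind maps to either one object or a list of objects; the function name is hypothetical.

```python
from typing import Any, Dict, List, Tuple


def restart_targets(payload: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Flatten a _restartRequired payload into (kind, name) pairs."""
    targets: List[Tuple[str, str]] = []
    for kind, value in payload.items():
        # Accept either a single object or an array of objects.
        items = value if isinstance(value, list) else [value]
        for item in items:
            targets.append((kind, item["name"]))
    return targets


single = restart_targets({"service": {"name": "sshd", "id": 1234}})
several = restart_targets(
    {"process": [{"name": "explorer.exe", "id": 1234}, {"name": "dwm.exe", "id": 5678}]}
)
```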

Feb 27 '24 22:02 SteveL-MSFT

I think it may be better to have this returned as metadata rather than as a property.

Jul 31 '24 05:07 SteveL-MSFT