LUKS encryption escrow taking longer than expected
Fleet version: Observed in Fleet's dogfood environment
💥 Actual behavior
customer-olympus is reporting that it takes up to 8 minutes to escrow the disk encryption key.
@allenhouchins has also seen this behavior in dogfood using a Linux workstation (real hardware, not a virtual machine).
🧑‍💻 Steps to reproduce
- Enroll a Linux workstation (real hardware, not a virtual machine)
- Turn on disk encryption in Fleet
- On the workstation's My device page, follow steps to create a disk encryption key
- Observe that after you enter your passphrase, sometimes you sit in front of the loading modal for 5-8 minutes.
🕯️ More info (optional)
N/A
🛠️ To fix
@noahtalerman: We want this process to be as fast as possible. Ideally the end user sits in front of the loading modal for less than 5 seconds.
Problem
My end users are experiencing a longer-than-expected wait when escrowing keys to the Fleet server. Users are not sure whether the process has stalled or is failing. We estimate the total time to complete at about 8 minutes.
What have you tried?
Leveraging the LUKS feature of Fleet on Ubuntu 22.04.05
Potential solutions
See the Slack channel with engineering team for potential areas to look at to improve.
What is the expected workflow as a result of your proposal?
This is not an expensive operation on the endpoints, so we would expect this process to complete more quickly.
FWIW dismissing the escrow progress spinner dialog doesn't cancel the escrow process, and the success dialog will show up once successful whether or not the progress dialog was showing earlier.
@iansltx Thank you for clarifying that!
Hey team! Please add your planning poker estimate with Zenhub @dantecatalfamo @jacobshandling @lucasmrod @sgress454
I just ran this on real hardware, it took ~2 minutes to complete. Will be looking into it
I've tracked down the source of the issue: the process is getting stuck when we shell out to cryptsetup, and appears to be caused by the CPUQuota we place on orbit and its children in the systemd unit file.
I ran a test here to confirm my results.
If we ignore the ~2 seconds it took me to type in the passphrase, we can see a massive time difference here. I believe under normal circumstances, cryptsetup will try to use multiple cores to speed up the password decryption, but we're currently limiting fleetd and all its child processes to 20% of a single core.
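For anyone who wants to reproduce a comparison like this, here is a minimal sketch of one way to time a passphrase check with and without a CPU quota. It is not necessarily the test that was run above. It assumes `cryptsetup open --test-passphrase` as the check and `systemd-run --scope -p CPUQuota=...` to apply the limit; the device path and the function names are hypothetical, and the commands need root.

```python
import shlex
import subprocess
import time


def build_test_command(device, cpu_quota=None):
    """Return the argv for a LUKS passphrase check, optionally under a CPU quota."""
    cmd = ["cryptsetup", "open", "--test-passphrase", device]
    if cpu_quota is not None:
        # systemd-run --scope runs the command in a transient scope unit
        # with the given resource-control property applied.
        cmd = ["systemd-run", "--scope", "-p", f"CPUQuota={cpu_quota}"] + cmd
    return cmd


def time_command(cmd):
    """Run cmd and return the elapsed wall-clock time in seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start


def compare(device):
    """Time the check unthrottled and at 20% of one core (the quota discussed here)."""
    for quota in (None, "20%"):
        cmd = build_test_command(device, quota)
        # The measured time includes however long it takes to type the passphrase.
        print(shlex.join(cmd), "->", f"{time_command(cmd):.1f}s")
```

Calling `compare("/dev/sdX2")` as root (with a real LUKS partition in place of the placeholder path) prompts for the passphrase twice and prints both timings.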
In my test, it took roughly 40 seconds to check the validity of the password by decrypting the LUKS header. Considering we do at least three operations on the LUKS container (confirm password, add password, confirm added password), that accounts for the slowdown we're seeing.
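As a rough back-of-the-envelope check of those numbers, the sketch below multiplies the measured per-operation time by the three operations. It assumes each operation costs about the same as the measured passphrase check and that the process is quota-bound the whole time; both are simplifications, and the function names are made up for illustration.

```python
SECONDS_PER_OPERATION = 40  # measured above: one passphrase check under the quota
CPU_QUOTA = 0.20            # fleetd and its children get 20% of a single core
LUKS_OPERATIONS = [
    "confirm password",
    "add password",
    "confirm added password",
]


def estimated_escrow_seconds(seconds_per_operation, operations):
    """Total wall-clock time if every operation takes the same time."""
    return seconds_per_operation * len(operations)


def implied_cpu_seconds(wall_seconds, quota):
    """CPU time actually consumed if the process was throttled to `quota` throughout."""
    return wall_seconds * quota


total = estimated_escrow_seconds(SECONDS_PER_OPERATION, LUKS_OPERATIONS)
print(f"estimated escrow time: {total} s (~{total / 60:.0f} min)")
print(f"implied CPU work per operation: "
      f"{implied_cpu_seconds(SECONDS_PER_OPERATION, CPU_QUOTA):.0f} CPU-seconds")
```

Under these assumptions the estimate comes to 120 seconds, which lines up with the ~2 minutes seen on real hardware earlier in this thread.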
@noahtalerman I see two options:
- Increase the amount of system resources orbit is allowed to consume
- Don't show progress to the user. Let the password prompt show up, and re-display it if the password was wrong, but otherwise hide the progress from the user.
I assume we have the CPUQuota to stop orbit from slamming the system during intensive queries, so I think the second is the better option. The loading bar is completely inconsequential in any case and doesn't need to be shown at all.
I'm going with option 2
> I've tracked down the source of the issue: the process is getting stuck when we shell out to cryptsetup, and appears to be caused by the CPUQuota we place on orbit and its children in the systemd unit file.
Great finding! (How did you know it was the CPUQuota?)
> I'm going with option 2
Totally agree.
@dantecatalfamo updated copy for the UI: https://www.figma.com/design/UFZuydNgCYKeILHmjju8Lu/-22074-Disk-encryption---key-escrow-for-Ubuntu-and-Fedora-Linux?node-id=5957-1428&t=L2qcOG4VCbmfVVdM-1
Encryption key delay, Fleet ensures swift response play, Secured, no dismay.
UPDATE: @noahtalerman: Re-opened because customer-leiden ran into this. Not sure when, but we can try to reproduce in the meantime. More context: https://fleetdm.slack.com/archives/C07NZ7B02AF/p1754526414598789
FYI @rachaelshaw @sharon-fdm @dantecatalfamo
IIRC we decided to not speed it up and instead remove the loading modal. This is so that the slow key escrowing happens in the background and the loading modal doesn't bother the user.
The reason we did not speed it up is that validating the user password and generating a new keyslot is a somewhat expensive operation, and Orbit by design runs with very little CPU available so that it does not cause performance issues on end-user devices. This low CPU allocation makes the key escrow very slow.
@dantecatalfamo please correct me if I'm wrong.
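To illustrate the "run it in the background, don't block on a loading modal" approach described above, here is a minimal sketch in Python. It is not Orbit's implementation (Orbit is written in Go); `start_escrow`, `escrow_fn`, `on_success`, and `on_failure` are hypothetical names used only for this example.

```python
import threading


def start_escrow(escrow_fn, on_success, on_failure):
    """Run a slow escrow operation in the background and report the outcome.

    The caller returns to the user immediately; no progress dialog is needed.
    """
    def worker():
        try:
            result = escrow_fn()
        except Exception as err:
            # e.g. wrong passphrase: the caller can show the prompt again
            on_failure(err)
        else:
            on_success(result)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def demo():
    """Exercise start_escrow with a stand-in for the slow cryptsetup work."""
    outcomes = []
    thread = start_escrow(
        escrow_fn=lambda: "escrowed",
        on_success=outcomes.append,
        on_failure=lambda err: outcomes.append(f"retry: {err}"),
    )
    thread.join()  # only the demo waits; a real caller would not block here
    return outcomes
```

The point of the design is that the slow, CPU-limited work and the user-facing prompt are decoupled: the user only sees a dialog again if the passphrase needs to be re-entered or once the escrow has succeeded.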
@noahtalerman @rachaelshaw We removed the dialog that the end user sees and updated the copy to reflect that the process will take 5-10 minutes. We cannot easily speed it up and only the IT admins will know it takes 5-10 minutes. The process takes a lot of CPU and orbit is limited to max 20% CPU usage.
Ah, I see the new copy:
@rachaelshaw, based on @getvictor's comment, it sounds like the end user doesn't have to wait 10 mins for Fleet to create the key. It will happen on Fleet's next refetch interval.
The IT admin cares about this because they don't want to make the end user do extra work. Right now it reads like Fleet is telling them they have to wait and do another thing.
What do y'all think about this copy instead?
Lets the user know they can be done. Fleet will handle clearing the banner on the next refetch interval. It assumes the end user is able to clear the banner sooner (up to 10 minutes)
Encryption key waits, Swift as clouds, Fleet will mend, Secures, not frustrates.