LUKS encryption escrow taking longer than expected

Open harrisonravazzolo opened this issue 1 year ago • 5 comments

Fleet version: Observed in Fleet's dogfood environment


💥  Actual behavior

customer-olympus is reporting that it takes up to 8 minutes to escrow the disk encryption key.

@allenhouchins has also seen this behavior in dogfood using a Linux workstation (real hardware not a virtual machine)

🧑‍💻  Steps to reproduce

  1. Enroll a Linux workstation (real hardware not a virtual machine)
  2. Turn on disk encryption in Fleet
  3. On the workstation's My device page, follow steps to create a disk encryption key
  4. Observe that after you enter your passphrase, sometimes you sit in front of the loading modal for 5-8 minutes.

🕯️ More info (optional)

N/A

🛠️ To fix

@noahtalerman: We want this process to be as fast as possible. Ideally the end user sits in front of the loading modal for less than 5 seconds.

harrisonravazzolo avatar Jan 22 '25 21:01 harrisonravazzolo

Problem

My end users are experiencing a longer than expected wait time when escrowing keys to the Fleet server. Users are not sure whether the process has stalled or is failing. We estimate the total time to complete at about 8 minutes.

What have you tried?

Leveraging the LUKS feature of Fleet on Ubuntu 22.04.05

Potential solutions

See the Slack channel with engineering team for potential areas to look at to improve.

What is the expected workflow as a result of your proposal?

This is not an expensive operation on the endpoints, so we would expect this process to complete more quickly.

noahtalerman avatar Jan 28 '25 19:01 noahtalerman

FWIW dismissing the escrow progress spinner dialog doesn't cancel the escrow process, and the success dialog will show up once successful whether or not the progress dialog was showing earlier.

iansltx avatar Jan 28 '25 23:01 iansltx

FWIW dismissing the escrow dialog doesn't cancel the escrow process, and the success dialog will show up once successful whether or not the progress dialog was showing earlier.

@iansltx Thank you for clarifying that!

allenhouchins avatar Jan 28 '25 23:01 allenhouchins

Hey team! Please add your planning poker estimate with Zenhub @dantecatalfamo @jacobshandling @lucasmrod @sgress454

sharon-fdm avatar Feb 05 '25 19:02 sharon-fdm

I just ran this on real hardware, it took ~2 minutes to complete. Will be looking into it

dantecatalfamo avatar Apr 10 '25 21:04 dantecatalfamo

I've tracked down the source of the issue: the process is getting stuck when we shell out to cryptsetup, and appears to be caused by the CPUQuota we place on orbit and its children in the systemd unit file.

I ran a test here to confirm my results.

Image

If we ignore the ~2 seconds it took me to type in the passphrase, we can see a massive time difference here. I believe under normal circumstances, cryptsetup will try to use multiple cores to speed up the password decryption, but we're currently limiting fleetd and all its child processes to 20% of a single core.

In my test, it took roughly 40 seconds to check the validity of the password by decrypting the LUKS header. Considering we do at least three operations on the LUKS container (confirm password, add password, confirm added password), that accounts for the slowdown we're seeing.
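A back-of-the-envelope check of the numbers above, as a minimal sketch. The function names are hypothetical and not part of Fleet or orbit. It assumes each of the three LUKS operations costs about the same 40 seconds, and that cryptsetup was pinned at the 20% quota for the whole run, which is a simplification.

```python
def estimate_escrow_seconds(seconds_per_luks_op: float, luks_ops: int = 3) -> float:
    """Rough wall-clock estimate: every LUKS operation repeats the slow passphrase check."""
    return seconds_per_luks_op * luks_ops


def cpu_seconds_of_work(wall_seconds: float, cpu_quota: float) -> float:
    """CPU time consumed if the process ran at its quota the whole time (assumption)."""
    return wall_seconds * cpu_quota


if __name__ == "__main__":
    # Three operations at ~40 s each, per the measurement in this comment.
    print(estimate_escrow_seconds(40))
    # CPU time behind one 40 s operation under a 20% single-core quota.
    print(cpu_seconds_of_work(40, 0.20))
```

Three operations at roughly 40 seconds each comes to about 2 minutes, which lines up with the ~2 minutes observed on real hardware earlier in the thread.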

dantecatalfamo avatar Apr 14 '25 20:04 dantecatalfamo

@noahtalerman I see two options:

  1. Increase the amount of system resources orbit is allowed to consume
  2. Don't show progress to the user. Let the password prompt show up, and re-display it if the password was wrong, but otherwise hide the progress from the user.

I assume we have the CPUQuota to stop orbit from slamming the system during intensive queries, so I think the second is the better option. The loading bar is completely inconsequential in any case and doesn't need to be shown at all
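For illustration only, option 2 could look roughly like the sketch below. This is not orbit's code (orbit is written in Go), and every name here is hypothetical: `prompt`, `passphrase_is_valid`, and `add_and_escrow_key` stand in for whatever the real dialog and cryptsetup calls are. The prompt is shown again only when the passphrase is rejected, and the remaining slow steps run in the background with no progress dialog.

```python
import threading
from typing import Callable


def escrow_without_progress(
    prompt: Callable[[], str],
    passphrase_is_valid: Callable[[str], bool],
    add_and_escrow_key: Callable[[str], None],
) -> threading.Thread:
    """Ask for the passphrase, re-prompt on failure, then escrow in the background."""
    passphrase = prompt()
    # Re-display the prompt until the passphrase is accepted.
    while not passphrase_is_valid(passphrase):
        passphrase = prompt()

    # The slow steps (add key slot, confirm, upload) run without any progress UI.
    worker = threading.Thread(
        target=add_and_escrow_key, args=(passphrase,), daemon=True
    )
    worker.start()
    return worker
```

The caller gets the thread back so it can show a success dialog once the work finishes, without blocking the user in the meantime. The passphrase check itself is still slow under the CPU quota, so a wrong passphrase is only reported after that delay.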

dantecatalfamo avatar Apr 14 '25 20:04 dantecatalfamo

I'm going with option 2

dantecatalfamo avatar Apr 14 '25 20:04 dantecatalfamo

I've tracked down the source of the issue: the process is getting stuck when we shell out to cryptsetup, and appears to be caused by the CPUQuota we place on orbit and its children in the systemd unit file.

Great finding! (How did you know it was the CPUQuota?)

I'm going with option 2

Totally agree.

lucasmrod avatar Apr 14 '25 21:04 lucasmrod

@dantecatalfamo updated copy for the UI: https://www.figma.com/design/UFZuydNgCYKeILHmjju8Lu/-22074-Disk-encryption---key-escrow-for-Ubuntu-and-Fedora-Linux?node-id=5957-1428&t=L2qcOG4VCbmfVVdM-1

rachaelshaw avatar Apr 15 '25 15:04 rachaelshaw

Encryption key delay, Fleet ensures swift response play, Secured, no dismay.

fleet-release avatar May 22 '25 19:05 fleet-release

UPDATE: @noahtalerman: Re-opened because customer-leiden ran into this. Not sure when but we can try to reproduce in the meantime. More context: https://fleetdm.slack.com/archives/C07NZ7B02AF/p1754526414598789

FYI @rachaelshaw @sharon-fdm @dantecatalfamo

noahtalerman avatar Aug 07 '25 16:08 noahtalerman

IIRC we decided to not speed it up and instead remove the loading modal. This is so that the slow key escrowing happens in the background and the loading modal doesn't bother the user.

The reason we did not speed it up is because validating the user password + generating a new keyslot is a somewhat expensive operation and Orbit runs by design with very little CPU available to not cause performance issues on end-user devices. This low allocation of CPU causes the key escrow to be very slow.

@dantecatalfamo please correct me if I'm wrong.

lucasmrod avatar Aug 07 '25 16:08 lucasmrod

@noahtalerman @rachaelshaw We removed the dialog that the end user sees and updated the copy to reflect that the process will take 5-10 minutes. We cannot easily speed it up and only the IT admins will know it takes 5-10 minutes. The process takes a lot of CPU and orbit is limited to max 20% CPU usage.

getvictor avatar Aug 13 '25 19:08 getvictor

Ah, I see the new copy:

Image

@rachaelshaw, based on @getvictor's comment, it sounds like the end user doesn't have to wait 10 mins for Fleet to create the key. It will happen on Fleet's next refetch interval.

The IT admin cares about this because they don't want to make the end user do extra work. Right now it reads like Fleet is telling them they have to wait and do another thing.

What do y'all think about this copy instead?

Image

Lets the user know they can be done. Fleet will handle clearing the banner on the next refetch interval. It assumes the end user is able to clear the banner sooner (up to 10 minutes)

noahtalerman avatar Aug 14 '25 21:08 noahtalerman

Encryption key waits, Swift as clouds, Fleet will mend, Secures, not frustrates.

fleet-release avatar Oct 17 '25 22:10 fleet-release