
Linux Install Script Queue

Open mason-buettner opened this issue 9 months ago • 1 comments

Gong snippet: none

Problem

customer-beethoven wants to queue multiple installations at once for their Linux hosts, but when they do so one will succeed and the rest fail because the package manager is busy. customer-beethoven can work around this by building a timeout into the install script. Is it possible for Fleet to queue the installs for the user?

What have you tried?

I have built a timeout into the script, but this results in an arbitrary wait time and an extra step in my scripts. This also requires me to edit the automatically generated install script for uploaded packages to include the timeout.
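For readers unfamiliar with the workaround, here is a minimal sketch of the "timeout in the install script" idea, written in Python rather than the shell the generated install scripts would normally use. It is not Fleet's script or the customer's. The function name `run_with_retry`, its parameters, and the retry-on-any-failure policy are all assumptions made for illustration.

```python
import subprocess
import time
from typing import Callable, Sequence


def run_with_retry(
    command: Sequence[str],
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    # Injectable hooks (hypothetical, for testing); defaults run the real command.
    runner: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Re-run `command` until it exits 0 or `timeout_s` has elapsed.

    Returns the exit code of the last attempt. Every non-zero exit is
    treated as "package manager busy", which is a blunt assumption: a real
    script would want to tell a held lock apart from a genuine failure.
    """
    deadline = clock() + timeout_s
    while True:
        code = runner(command)
        if code == 0 or clock() >= deadline:
            return code
        # Wait before trying again so the other install can finish.
        sleep(poll_interval_s)
```

A call might look like `run_with_retry(["apt-get", "install", "-y", "./package.deb"])`, where `./package.deb` is a placeholder path. The sketch also shows the drawback described above: the wait is arbitrary, and every script has to carry the retry logic itself.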

Potential solutions

What is the expected workflow as a result of your proposal?

1. I trigger multiple installation scripts for a Linux host.
2. Fleet queues the installations, triggering the next script after the previous either completes or otherwise exits.
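The requested workflow can be sketched as a simple first-in, first-out queue that starts each install only after the previous one has finished, whether it succeeded or not. This is an illustration of the behavior being asked for, not Fleet's or orbit's implementation. `InstallJob` and `SerialInstallQueue` are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Tuple


@dataclass
class InstallJob:
    """One queued installation (hypothetical type for this sketch)."""
    name: str
    run: Callable[[], int]  # runs the install script and returns its exit code


class SerialInstallQueue:
    """Runs queued installs strictly one at a time, in the order received."""

    def __init__(self) -> None:
        self._pending: Deque[InstallJob] = deque()

    def enqueue(self, job: InstallJob) -> None:
        self._pending.append(job)

    def drain(self) -> List[Tuple[str, int]]:
        """Run every pending job in order and return (name, exit_code) pairs.

        A failing or crashing job does not block the jobs behind it: the
        next job starts once the previous one "completes or otherwise exits".
        """
        results: List[Tuple[str, int]] = []
        while self._pending:
            job = self._pending.popleft()
            try:
                code = job.run()
            except Exception:
                code = -1  # sentinel chosen for this sketch: the job crashed
            results.append((job.name, code))
        return results
```

Because only one job runs at a time, two installs never compete for the package manager, and no script needs its own wait logic.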

mason-buettner avatar Apr 11 '25 21:04 mason-buettner

Added requisite tags so this gets moved through triage; as mentioned in Slack, this is absolutely a bug. @mason-buettner if you've repro'd this you can drop that tag.

iansltx avatar Apr 12 '25 04:04 iansltx

@mostlikelee, moved it to your team. TMWYT

sharon-fdm avatar Apr 16 '25 18:04 sharon-fdm

Original issue description here:

Gong snippet: none

Problem

customer-beethoven wants to queue multiple installations at once for their Linux hosts, but when they do so one will succeed and the rest fail because the package manager is busy. customer-beethoven can work around this by building a timeout into the install script. Is it possible for Fleet to queue the installs for the user?

What have you tried?

I have built a timeout into the script, but this results in an arbitrary wait time and an extra step in my scripts. This also requires me to edit the automatically generated install script for uploaded packages to include the timeout.

Potential solutions

What is the expected workflow as a result of your proposal?

1. I trigger multiple installation scripts for a Linux host.
2. Fleet queues the installations, triggering the next script after the previous either completes or otherwise exits.

noahtalerman avatar Apr 17 '25 13:04 noahtalerman

Fleet version: TODO

@mason-buettner thanks for tracking this! What version of Fleet is customer-beethoven on? That will help us with debugging and get to a fix sooner.

We think this means that Fleet is trying to install the packages simultaneously.
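To illustrate the hypothesis only (the thread does not confirm it as the root cause), the sketch below simulates several installs starting at the same moment and competing for one exclusive lock. The `threading.Lock` stands in for the package manager's lock, and an install that finds it held fails immediately instead of waiting. `simulate_concurrent_installs` and its parameters are made up for this example.

```python
import threading
import time
from typing import List


def simulate_concurrent_installs(n: int = 3, hold_s: float = 0.2) -> List[bool]:
    """Start `n` simulated installs at once; return per-install success flags."""
    package_manager_lock = threading.Lock()  # stand-in for the real lock
    start_together = threading.Barrier(n)    # release all installs at once
    results = [False] * n

    def install(index: int) -> None:
        start_together.wait()
        # Non-blocking acquire: a busy package manager means instant failure.
        if package_manager_lock.acquire(blocking=False):
            try:
                time.sleep(hold_s)  # pretend the install takes a while
                results[index] = True
            finally:
                package_manager_lock.release()

    threads = [threading.Thread(target=install, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    outcome = simulate_concurrent_installs()
    print(f"{sum(outcome)} of {len(outcome)} simulated installs succeeded")
```

How many installs succeed depends on thread timing, so the count can vary from run to run; at least one always gets the lock.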

This is definitely a bug. @lukeheath I'd argue this is a P2 that's worthy of disrupting the current sprint. We could be building on top of a shaky foundation.

@mostlikelee have we confirmed that this is only an issue on Linux hosts? Not macOS nor Windows?

cc @ksatter

noahtalerman avatar Apr 17 '25 13:04 noahtalerman

FYI @georgekarrv @mna this looks like a unified queue bug.

noahtalerman avatar Apr 17 '25 13:04 noahtalerman

Is this reproduced with the default .deb install script? I have a hard time believing this is a unified queue bug: even if the unified queue (UQ) sends everything to the script table to execute, the logic in orbit has always been to run one script at a time, unless that has changed.

georgekarrv avatar Apr 17 '25 13:04 georgekarrv

@mason-buettner Just want to make sure you've been able to reproduce this.

@noahtalerman I'll defer to @georgekarrv on the P2 designation for this one based on what we think the cause of the bug is.

lukeheath avatar Apr 17 '25 16:04 lukeheath

@lukeheath Yes, we've been able to reproduce. Including a screenshot of the error from the Fleet dashboard we observed for an install that failed:

Image

And the associated logs:

ubuntu-vm-software-fail.txt

Just to note, we've seen inconsistent failure rates - sometimes 2/3 installs will complete, sometimes 1/3.

mason-buettner avatar Apr 17 '25 21:04 mason-buettner

Hey team! Please add your planning poker estimate with Zenhub @iansltx @jahzielv @ksykulev

mostlikelee avatar Apr 18 '25 21:04 mostlikelee

@noahtalerman @lukeheath just to be explicit, right now this is top priority for next sprint unless you all think differently. Bringing a P2 bug into the current sprint will be disruptive.

mostlikelee avatar Apr 21 '25 23:04 mostlikelee

@mostlikelee Thanks for the clarity. Since the customer has a workaround for now, next sprint should be fine.

lukeheath avatar Apr 22 '25 15:04 lukeheath

I haven't been able to reproduce this on latest main, but Kathy and Mason are working on getting data from when they reproduced + asking the customer to try again to reproduce.

jahzielv avatar May 13 '25 19:05 jahzielv

@zayhanlon heads up, we're having trouble reproducing this issue. Waiting to hear back from the customer.

mostlikelee avatar May 14 '25 15:05 mostlikelee

@mostlikelee the customer said he'll get back to us at the end of the week

zayhanlon avatar May 14 '25 16:05 zayhanlon

@zayhanlon any updates here?

mostlikelee avatar May 19 '25 15:05 mostlikelee

@mostlikelee I think we need to update the issue. @mason-buettner will address it when he's online today. Here's the customer feedback - https://fleetdm.slack.com/archives/C0867SDM4F8/p1747624232615499?thread_ts=1747158858.308319&cid=C0867SDM4F8

zayhanlon avatar May 19 '25 15:05 zayhanlon

Hi folks! After some more research/discussion, we found that this issue was actually two separate bugs. Because of that, we're going to close here.

Once the new bugs are filed, we'll link them in the comments here.

cc @zayhanlon @mostlikelee @mason-buettner

jahzielv avatar May 22 '25 15:05 jahzielv

Linux hosts in sync,
One by one, installs take flight.
Fleet brings calm from chaos.

fleet-release avatar May 22 '25 15:05 fleet-release