Combine query and schedule features

Open noahtalerman opened this issue 3 years ago • 4 comments

Goal

Combine the query and schedule features to provide a single interface for creating, scheduling, and tweaking queries at the global and team level.

This way, as a team level user, I can create, schedule, and tweak queries that run on my teams' hosts without having to ask a global admin or other members of my team for assistance.

Parent epic

  • #6716

Requirements

  • Queries are global ("All teams") or on one team. This means, on upgrade:
    • If a query is not pointed at by any schedule or pack(s), then the query stays global.
    • If a query was created by a team user (on one or more teams), then the query is duplicated as a team query on any teams the user is a member of. The global query is kept (in case it's pointed at a pack).
    • If a query is pointed at by global schedule or pack(s), then the query stays global.
    • If a query is pointed at by one or more teams' schedules, then the query is duplicated as a team query on any teams the query is pointed at. The global query is kept (in case it's pointed at a pack).
  • Queries have automations on or off. This means, on upgrade:
    • If a query is pointed at a schedule (global or team), the query has automations "on".
    • If a query is not pointed at a schedule or is pointed at one or more packs, the query has automations "off".
  • Users can manage automations in UI and fleetctl.
  • Queries have a platform. This means, on upgrade:
    • If a query does not have platforms, the query is assigned the default platforms (all).
  • Queries have an interval. This means, on upgrade:
    • If a query does not have an interval, the query is assigned the default interval (never).
  • Users can manage interval and platform in the UI and fleetctl.
  • For backwards compatibility, any query that points to any schedule or pack(s) maintains the team, interval, snapshot, removed, platform, shard, and version properties. These properties can be managed in the API.
  • Team admins and team maintainers can only add, edit, and delete queries on their team.
    • Previously, these users could create, edit, and delete queries that they authored.
  • The Fleet API is backwards compatible. On upgrade, all schedule, query, and packs API routes work.
  • Remove the Schedule page from the Fleet UI.
    • Note, this removes all Packs pages from the UI.
  • Hide the Schedule tab on the Host details page.
  • Remove the Query button from the Host details page.
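
The upgrade rules in the requirements above can be read as a small function. The following is a minimal sketch in Python, not Fleet's actual migration code: the `LegacyQuery` and `MigratedQuery` types, the `"all"` platform value, and the use of `0` for "never" are all hypothetical. It also assumes one reading of the automation rules: a copy has automations "on" only when a schedule at that same scope (global or that team) points at the query, and pack membership never turns automations on.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALL_PLATFORMS = "all"  # hypothetical value for the default platforms
NEVER = 0              # hypothetical value for the default interval ("never")


@dataclass
class LegacyQuery:
    """Pre-upgrade query plus what pointed at it (hypothetical shape)."""
    name: str
    author_team_ids: List[int] = field(default_factory=list)    # teams the author is a member of
    in_global_schedule: bool = False
    team_schedule_ids: List[int] = field(default_factory=list)  # teams whose schedule uses the query
    platform: str = ""
    interval: Optional[int] = None


@dataclass
class MigratedQuery:
    name: str
    team_id: Optional[int]  # None means global ("All teams")
    automations_on: bool
    platform: str
    interval: int


def migrate(q: LegacyQuery) -> List[MigratedQuery]:
    platform = q.platform or ALL_PLATFORMS
    interval = q.interval if q.interval is not None else NEVER

    # The global query is always kept (in case a pack points at it).
    result = [MigratedQuery(q.name, None, q.in_global_schedule, platform, interval)]

    # One team copy per team: the author's teams plus any team whose
    # schedule pointed at the query.
    for team_id in sorted(set(q.author_team_ids) | set(q.team_schedule_ids)):
        on = team_id in q.team_schedule_ids
        result.append(MigratedQuery(q.name, team_id, on, platform, interval))
    return result
```

For example, a query authored by a member of team 2 and used in team 1's schedule would yield three queries under these assumptions: the global one (automations off), a team 1 copy (automations on), and a team 2 copy (automations off).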

Figma

https://www.figma.com/file/hdALBDsrti77QuDNSzLdkx/%F0%9F%9A%A7-Fleet-EE-(dev-ready%2C-scratchpad)?node-id=9588%3A315617

API

Current query API routes that are relevant:

  • GET /queries
  • GET /queries/{id}
  • POST /queries
  • PATCH /queries/{id}

Current schedule API routes that are relevant:

  • GET /schedule
  • GET /teams/{id}/schedule
  • POST /schedule
  • PATCH /schedule
  • POST /teams/{id}/schedule
  • PATCH /teams/{id}/schedule

Child issues

  • TODO Frontend issue
  • TODO Backend issue
  • #6024

noahtalerman avatar Sep 14 '22 18:09 noahtalerman

@lukeheath I assigned you this issue.

This is a large update. I think it introduces a difficult API problem: How will Fleet support the new UI and YAML while maintaining backwards compatibility with the current query and schedule API routes?

If it's helpful, I'm happy to hop on a call to discuss synchronously.

noahtalerman avatar Sep 14 '22 20:09 noahtalerman

Decision: Cut shard, logging, and osquery version settings from the UI and fleetctl.

Discussion on "Shard," logging, and osquery version settings from product design review 2022-09-16

Noah: Let's cut "Shard," “Logging” (snapshot, removed), and “osquery version” settings from the UI and fleetctl. This lets us decide later how to address these, if at all.

  • We’ve heard from many users/customers that they are confused about snapshot vs. differential. They thought snapshot was the default.
  • mikermcneil: Keeping settings in the API is necessary for compatibility
  • mikermcneil: Doesn't have to be included in the UI for now
  • mikermcneil: But it would live at the per-query automation level (like the checkbox for disable/enable automation)
  • mikermcneil: I think this will have to come back to the UI eventually- just a matter of whether we do it now or not

Noah: Let's cut “osquery version” settings from the UI and fleetctl. This lets us decide later how to address these, if at all.

  • Noah: Do we want osquery version to live on the automation level (per team)? Do we want osquery version at all? It may circumvent teams like "Shard."
  • Noah: Potential best practice use case: Only canary teams get the latest stable version of osquery (not possible today). Add queries that only run on new osquery versions to this team. Upgrade osquery version on all hosts and move queries to all hosts.
  • mikermcneil: On one hand:
    • Ideally it would "just work"
    • You type a query that is only supported on a certain version of osquery, (like platform compat.), and then the osquery version is auto-set.
    • But it's more complexity and.. is it even a good practice?
    • Have to support this for compatibility no matter what for now
    • But not necessarily the UI
  • mikermcneil: Versus Noah's thoughts:
    • Could we recommend setting osquery version per team and then remove this from the UI altogether?
    • After more thought, I agree with Noah on this. I think we could remove it from the UI altogether

noahtalerman avatar Sep 19 '22 14:09 noahtalerman

Decision: Schedule and query API routes will still work.

noahtalerman avatar Sep 19 '22 14:09 noahtalerman

Decision:

Add "Never" to Frequency options.

When we address "See cached query results" #7766, remove "Never" in such a way we can message why it's being removed.

  • Instead, canary teams are used to test queries. Or, you can just run the query as a live query and not save it.
    • Requisites:
      • Improvement for Fleet Premium users: Transfer query between teams (e.g., a checkbox in the list view)
      • Improvement for Fleet Free users: History of queries that you ran (can still play, reference old queries, without creating a mess)

(noahtalerman 2022-09-22)

noahtalerman avatar Sep 22 '22 15:09 noahtalerman

@mna I'm co-assigning this ticket to you to begin assisting with research and specifications for the epic. This is going to be a pretty substantial change. Before we take it into active development, I'd like to make sure we fully understand the steps we need to take and the order of operations to implement this change.

Spec'ing this epic will be a group effort, so I'm also leaving myself assigned. I'll be leaning on you to help us understand the implications of this change on the backend. Please review the specs and the associated Figma to answer the high-level questions below. We'll be discussing this ticket on the Wednesday estimation session this week.

  1. What changes will be necessary to support this change on the backend?
    • I've started a list of TODOs under the "Child Issues" header in the issue description. Feel free to add or modify. Once we have a list of TODOs we can create them as child issues and link them here.
  2. How much impact does maintaining backward compatibility have on implementation? Will it take a lot longer, or be more prone to bugs, if we attempt to keep the API backward compatible?
    • This would mean maintaining the schedule and packs endpoints and ensuring interoperability with the modified query endpoints/database.
  3. Would moving this change to v5 and accepting breaking API changes make the implementation substantially cleaner?
    • This would mean removing all schedule and pack endpoints.

lukeheath avatar Oct 10 '22 21:10 lukeheath

@lukeheath

Some early comments about this, I still have a lot of things to investigate:

  1. What changes will be necessary to support this change on the backend?

This is a very open-ended question; taken literally, it's basically the complete specs for the backend :D I can't really answer that without much more investigation and a better understanding of where we are and where we want to go. In the meantime, I've updated the ticket description with some missing API endpoints that are impacted by the change, and I added a list of impacted DB tables.

  2. How much impact does maintaining backward compatibility have on implementation?

As I understand it, today we can use the same query in multiple packs that target different teams, platforms, different schedules, etc. Supporting this behaviour is fundamentally at odds with the new requirements which imply that a query belongs to either a single team or none and has a single schedule. How would we list and display queries that do not fit this constraint? What schedule would we pick, what team, etc.? My guess is that it will have significant impact on implementation.

  3. Would moving this change to v5 and accepting breaking API changes make the implementation substantially cleaner?

Yes, definitely, and much less error-prone. We could have a one-time migration of existing queries, duplicating them as needed if used in multiple packs/schedules, and even do something with edge cases that we can't easily migrate for some reason (e.g. dumping in a temporary table with a UI to edit, or just to a YAML file they can edit and re-apply after fixes, etc.), without having to worry about those queries being used again in packs or any incompatible legacy feature. We can then enforce strict validation that would not be possible otherwise if we need to be backwards-compatible.
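
To illustrate the one-time migration idea above (duplicating a query when it is used in multiple packs/schedules), here is a minimal sketch in Python. It is an assumption-laden illustration, not Fleet code: the `Usage` record, the `plan_migration` function, and the `"name (n)"` naming scheme for duplicates are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple

Config = Tuple[Optional[int], int, str]  # (team_id, interval, platform)


class Usage(NamedTuple):
    """One place where a legacy pack or schedule uses a query (hypothetical)."""
    query_name: str
    team_id: Optional[int]  # None means global
    interval: int           # seconds
    platform: str


def plan_migration(usages: Iterable[Usage]) -> Dict[str, Config]:
    # Collect the distinct (team, interval, platform) configs per query,
    # keeping first-seen order.
    by_query = defaultdict(list)
    for u in usages:
        config = (u.team_id, u.interval, u.platform)
        if config not in by_query[u.query_name]:
            by_query[u.query_name].append(config)

    plan: Dict[str, Config] = {}
    for name, configs in by_query.items():
        if len(configs) == 1:
            # Fits the "one team, one schedule" constraint as-is.
            plan[name] = configs[0]
        else:
            # Conflicting usages: duplicate the query, one copy per config.
            for i, config in enumerate(configs, start=1):
                plan[f"{name} ({i})"] = config
    return plan
```

A query used with a single configuration is migrated unchanged, while a query used with several configurations is split into one copy per configuration. Edge cases that cannot be migrated automatically would still need separate handling, as the comment above suggests.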

mna avatar Oct 12 '22 14:10 mna

@noahtalerman @lukeheath It seems like we assume a relatively small number of queries (in the Figma, the Manage automations page reserves a small space to manually select by checkboxes which queries will send data to the log destination, no pagination). If this is the case, we should think about enforcing a limit on the number of queries that can be created by team or globally.

mna avatar Oct 12 '22 14:10 mna

@mna Thanks for your feedback! Based on your feedback, as well as the team's conversation during estimation, we are all in agreement that the best path forward is to include this change as part of Fleet v5 and not support backward compatibility. @noahtalerman and I will start moving that conversation forward so we can spec this epic around the assumption of v5 and no backward compatibility. I'll follow up in this thread when we have more info.

lukeheath avatar Oct 12 '22 16:10 lukeheath

Martin: For implementation, it would be best to leave the current packs and schedules endpoints in place and focus on adding functionality to the queries endpoints. Then, once we have queries in a good place, we can go back and remove the unnecessary endpoints.

lukeheath avatar Oct 12 '22 20:10 lukeheath

It seems like we assume a relatively small number of queries

@mna I think we'll have to update the Queries > Manage automations modal to support a large number of queries.

I prefer to not enforce a limit on the number of queries.

noahtalerman avatar Oct 19 '22 14:10 noahtalerman

@mna After speaking with Mike and Zach, we have the 👍 on including this in v5 with breaking changes.

lukeheath avatar Oct 19 '22 16:10 lukeheath

@lukeheath @noahtalerman

The Fleet API is backwards compatible. On upgrade, all schedule, query, and packs API routes work.

This is not true anymore, now that we're doing this as a v5 breaking change, right?

mna avatar Oct 26 '22 13:10 mna

@lukeheath @noahtalerman

For backwards compatibility, any query that points to any schedule or pack(s) maintains the team, interval, snapshot, removed, platform, shard, and version properties. These properties can be managed in the API.

As discussed on the v5 queries google doc, we won't support snapshot, removed, shard, and version properties anymore (just to confirm, and I'll remove them from the ticket).

mna avatar Oct 26 '22 13:10 mna

This is not true anymore, now that we're doing this as a v5 breaking change, right?

@mna Correct. The API will not be backward compatible. Migrations only need to migrate queries. How we're going to handle packs is still TBD.

lukeheath avatar Oct 26 '22 14:10 lukeheath

@RachelElysia: @noahtalerman @mike-j-thomas don't forget empty state and error state <3

Thanks! I think the error state for the Queries page will be the same as the current error state.

@mike-j-thomas can you please help us with a new empty state for the Queries page?

Linking to the current empty state in the deprecated Fleet EE current Figma here: https://www.figma.com/file/qpdty1e2n22uZntKUZKEJl/DEPRECATED---Archived-for-reference---see-dogfood.fleetdm.com-instead----%E2%9C%85-Fleet-EE-(current)?node-id=4388%3A88805

Here's some copy that you could use in the new empty state

  • Collect host telemetry in your log destination (via query automations)
  • Ask real-time questions about your hosts (via live queries)

noahtalerman avatar Oct 27 '22 13:10 noahtalerman

@mike-j-thomas can you please help us with a new empty state for the Queries page?

Thanks Mike T! I added you as an assignee and added this issue to the g-marketing board.

FYI Luke and Martin are also assigned because they are working on the engineering specifications for this issue.

noahtalerman avatar Oct 31 '22 13:10 noahtalerman

Hey @noahtalerman, it might be helpful if we could grab some time together to review the empty state. I have questions.

I've used our default messaging, but I've hidden the helper text about what a query is behind a tooltip to avoid having too much text to process.

I'm also unsure how the messaging changes as the users filter through the platforms. The message should probably switch to be platform-specific and maybe not include the link to GitHub.

I'm also unsure what should be displayed if there are no inherited queries. Would that section just be hidden?

mike-j-thomas avatar Nov 01 '22 08:11 mike-j-thomas

Mo:

We are declining to go forward with the query + [schedule] as currently spec’d. Instead, we’re going to break this issue down into smaller problems to be solved separately, with the hopes that that approach would generate a series of smaller iterative changes. In priority order (but needs to be prioritized against other priorities):

  1. “Activation”: How can we get new customers to quickly get running on Fleet?
     a. Maximize perception of value during POC
     b. Maximize perception of value during Fleet Sandbox
     c. Spread Fleet usage to other departments in the org
     d. Make new users more comfortable writing queries
  2. How do we encourage people to use scheduled queries? (Stop advertising packs in the UI and in the docs)
     a. How to still let people import packs from the internet? Import packs but strip out the interval. You can go schedule it later, if you want.
  3. Queries not grouped by teams
  4. It’s confusing what is query and what is schedule
  5. Allow small/medium-sized orgs to start using Fleet w/ less infrastructure requirements
  6. 2 am security incident, everyone’s offline problem
  7. How do we get existing customers to migrate out of packs?

Noah and I will be in touch about spinning up efforts to address those individual problems as they come up.

Tomas:

does this mean that we are not doing v5 at all?

Mo:

Not for this feature. But you can see we still want to refine how packs work in some way going forward.

So probably at some point still. But not for the foreseeable future

noahtalerman avatar Nov 02 '22 17:11 noahtalerman

Noah and I will be in touch about spinning up efforts to address those individual problems as they come up.

@zhumo @mikermcneil FYI I'm deprioritizing this issue for now. I removed it from the roadmap board.

noahtalerman avatar Nov 03 '22 14:11 noahtalerman

@mike-j-thomas heads up, we deprioritized this issue.

However, I think we should keep your awesome empty state 🔥 Soon, we can add this empty state to the existing Queries page.

Can you please move the empty state screens to a spiffier Figma page?

I'm also unsure how the messaging changes as the users filter through the platforms. The message should probably switch to be platform-specific and maybe not include the link to GitHub.

I don't think the messaging needs to change when the platform filter is active. If we want to have platform specific messaging, like "create a query for your macOS hosts" we can come back to this later.

I'm also unsure what should be displayed if there are no inherited queries. Would that section just be hidden?

Yes. I think the expected behavior today is that the inherited queries UI is hidden if there are no inherited queries.

noahtalerman avatar Nov 03 '22 14:11 noahtalerman

@noahtalerman, I moved the queries team empty state to this spiffier page

https://www.figma.com/file/hdALBDsrti77QuDNSzLdkx/%F0%9F%9A%A7-Fleet-EE-(dev-ready%2C-scratchpad)?node-id=10791%3A319238

mike-j-thomas avatar Nov 09 '22 05:11 mike-j-thomas

Update: Backend - 5 pts ( @lucasmrod ) Frontend - not estimated.

sharon-fdm avatar Mar 20 '23 18:03 sharon-fdm

Just reviewed the latest designs for this with @mikermcneil; here are the decisions that came out of it:

DECISION: We do not keep ≤2022 “packs” in the database, except as an artifact. This means:

  • Remove APIs for all pack CRUD, read and write. All of it.
  • Breaking change because create pack endpoint (etc) no longer works
  • Fleet 5, ships ASAP in July
  • Database tables related to packs remain, but are no longer accessible except in the database. This is just so there is a “just in case” backup, in case the queries converted after migration don’t match expectations
  • TODO: document every detail of migration in the spec (what will happen to queries in existing packs). There's a start here and in Figma; bring them to a single place - SSOT
  • Migration script will need to be very thoroughly tested

DECISION: To reiterate: A “pack” is just a word for a certain kind of yml file. Period. We will act as if it has always been that way, because it is simpler to understand that way, and the thing called “packs” in Fleet was actually a halfway-step towards something different, that never precisely fit the concept of “packs” from osquery.

(cc @zhumo @lukeheath)

rachaelshaw avatar Jun 26 '23 23:06 rachaelshaw

@rachaelshaw @mikermcneil Should we make the breaking change later and give people some time to make the changeover? That's been our messaging for a while now. Say, a quarter of transition time after we provide them the ability to schedule via yml and convert packs JSON files.

cc @zayhanlon

zhumo avatar Jun 27 '23 00:06 zhumo

@rachaelshaw FYI I added two new things related to the --policies-team flag in the main description:

  • Remove the --policies-team flag in fleetctl in favor of using the team assignment in the new yml. Add a deprecation note.
  • Update the CIS policies document to stop referencing policies-team and reference the new yml for how to assign the benchmarks to a specific team.

zhumo avatar Jun 27 '23 22:06 zhumo

@rachaelshaw I also put in Sept. 22 release date as the formal switchover to Fleet 5

zhumo avatar Jun 27 '23 23:06 zhumo

@rachaelshaw I added the above requirement:

  • Add 5, 10, and 30 min options in the UI for scheduling a query (note somewhere in the UI that the 5 and 10 min options may have performance impacts) #11996
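
A minimal sketch of how the new frequency options and the performance note might be modeled, in Python. The option labels, the warning text, and the choice of `0` seconds to represent "Never" are assumptions for illustration only; other frequency options are left out because they are not listed in this thread.

```python
from typing import Optional

# Hypothetical labels mapped to intervals in seconds.
FREQUENCY_OPTIONS = {
    "Never": 0,              # assumed representation of "never"
    "Every 5 minutes": 300,
    "Every 10 minutes": 600,
    "Every 30 minutes": 1800,
}

# Intervals that should carry a performance note in the UI.
PERFORMANCE_NOTE_INTERVALS = {300, 600}


def performance_warning(interval_seconds: int) -> Optional[str]:
    """Return a note for short intervals, or None when no note is needed."""
    if interval_seconds in PERFORMANCE_NOTE_INTERVALS:
        minutes = interval_seconds // 60
        return f"Running a query every {minutes} minutes may impact host performance."
    return None
```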

zhumo avatar Jun 29 '23 20:06 zhumo

I forgot, @sharon-fdm we need a ticket to update the website, specifically the permissions and the docs around our UI

RachelElysia avatar Jul 05 '23 18:07 RachelElysia

@sharon-fdm need more info about whether we're updating a specific host details schedule tab (/hosts/291/schedule)

RachelElysia avatar Jul 05 '23 18:07 RachelElysia

Moving this back to In Progress until all sub-tasks are "In review".

lukeheath avatar Jul 24 '23 18:07 lukeheath