Add AgentOps project type and vector search data preparation workflows
Overview
This PR adds a new project type "AgentOps" to the existing MLOps Stacks template. Users can now select between two project types when initializing a stack:
- `mlops` - Existing template; traditional ML pipeline for model training and batch inference
- `agentops` - NEW template; agent-specific workflows for data ingestion and, eventually, agent development/deployment

This PR also adds the vector search data ingestion pipeline for AgentOps projects.
Features
AgentOps Template Updates
1. Project type selection
- Added `input_project_type` parameter to `databricks_template_schema.json`
  - Options: `mlops` (default) or `agentops`
  - First-order parameter in template initialization
- Updated minimum Databricks CLI version to `v0.266.0` to support new features
- Default project name now reflects the selected project type: `my_{{ .input_project_type }}_project`
- Other changes:
  - Reordered parameters with `input_project_type` as order 1 and updated all subsequent parameter orders
  - Conditional parameter display (e.g., `input_include_models_in_unity_catalog` is skipped for `agentops`)
  - Updated default values to be project-type aware
2. Updating project structure layout
- Added conditional logic to `update_layout.tmpl` to generate the appropriate project structure based on `input_project_type`
  - Ensures MLOps-specific files are only generated for MLOps projects
- Added conditional logic to certain files:
  - Separate code structure sections for MLOps vs. AgentOps, rendered conditionally based on `input_project_type`
  - `requirements.txt.tmpl` - Adds dependencies (e.g., the vector search SDK)
  - `README.md.tmpl` - Adds basic documentation for AgentOps projects
  - `databricks.yml.tmpl` - Extends the bundle configuration to support AgentOps resources and adds data preparation workflow targets
  - All CI/CD pipelines (more on this below)
3. Updating CI/CD workflows
- Extended CI/CD pipelines to handle AgentOps projects and test the correct workflows:
  - GitHub Actions (`.github/workflows/{{.input_project_name}}-run-tests.yml.tmpl`)
  - Azure DevOps (`.azure/devops-pipelines/{{.input_project_name}}-tests-ci.yml.tmpl`)
  - GitLab CI (`.gitlab/pipelines/{{.input_project_name}}-bundle-ci.yml.tmpl`)
Data preparation with vector search for AgentOps
1. Data preparation code
- Notebook: `DataIngestion.py.tmpl` - Processes raw documentation from data source URLs and stores the data in Unity Catalog
  - Uses the utility function in `fetch_data.py.tmpl` for retrieval
- Notebook: `DataPreprocessing.py.tmpl` - Cleans and chunks the documentation to prepare it for vector search
  - Uses the utility function in `create_chunk.py.tmpl` for the chunking logic (see the sketch after this list)
  - Chunking configs are defined in `config.py.tmpl`
- Notebook: `VectorSearch.py.tmpl` - Creates the Vector Search endpoint and index using Delta Sync (TRIGGERED mode)
  - Uses utility functions in `vector_search_utils.py.tmpl` for endpoint management and waiting for the endpoint to be ready
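
For illustration, a minimal sketch of the kind of chunking helper `create_chunk.py.tmpl` could provide is shown below. The function name, config constants, and overlap strategy are assumptions for this example, not the exact template contents.

```python
# Illustrative sketch only; the real create_chunk.py.tmpl / config.py.tmpl may differ.
from typing import List

# Hypothetical config values (in the template these would live in config.py.tmpl).
CHUNK_SIZE = 1000      # max characters per chunk
CHUNK_OVERLAP = 200    # characters shared between consecutive chunks


def create_chunks(text: str,
                  chunk_size: int = CHUNK_SIZE,
                  chunk_overlap: int = CHUNK_OVERLAP) -> List[str]:
    """Split a document into overlapping character windows for vector search."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must be larger than chunk_overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - chunk_overlap
    return chunks


# Example: each chunk would become one row in the preprocessed data table.
doc_chunks = create_chunks("Databricks documentation page contents ...")
```

Fixed-size windows with overlap keep neighboring chunks contextually connected, which generally helps retrieval quality once the chunks are indexed.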
2. Workflow resource configuration
- Defined the data preparation workflow in `data-preparation-resource.yml.tmpl`, which includes each notebook as a separate task (sequential execution)
  - Notebook parameters are provided here (see the sketch below for how a task reads them)
  - Scheduled to run daily at 5:00 AM
  - Serverless environment and dependencies are also defined here
- Included the resource in the list of resources in `databricks.yml.tmpl`
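
As a minimal sketch of how the workflow's parameters reach the notebooks, assuming they are passed as task base parameters and read as notebook widgets (the parameter names mirror the bundle variables listed in the next section):

```python
# Runs inside a Databricks notebook task, where `dbutils` is predefined.
# The widget names below mirror the workflow parameters; the exact wiring
# in the template may differ.
catalog_name = dbutils.widgets.get("catalog_name")
schema = dbutils.widgets.get("schema")
raw_data_table = dbutils.widgets.get("raw_data_table")

# Fully qualified Unity Catalog table the ingestion step writes to.
raw_table_fqn = f"{catalog_name}.{schema}.{raw_data_table}"
print(f"Writing raw documentation to {raw_table_fqn}")
```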
3. Defined variables in databricks.yml
- Included variables that feed into the data preparation workflow parameters (see the sketch below for how the vector search values are consumed):
  - `catalog_name`
    - Defined uniquely for each deployment target using template input (e.g. `databricks_staging_workspace_host`)
  - `schema`
    - Defined the same for each deployment target using template input `input_schema_name`
  - `raw_data_table`
    - Automatically populated as "raw_documentation"
  - `preprocessed_data_table`
    - Automatically populated as "databricks_documentation"
  - `eval_table`
    - Automatically populated as "databricks_documentation_eval"
  - `vector_search_endpoint`
    - Automatically populated as "ai_agent_endpoint"
  - `vector_search_index`
    - Automatically populated as "databricks_documentation_vs_index"
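
To show how the vector search variables are consumed downstream, here is a minimal sketch of endpoint and index creation with the `databricks-vectorsearch` SDK, along the lines of what `vector_search_utils.py.tmpl` and `VectorSearch.py.tmpl` do. The catalog/schema qualifiers, primary key and text columns, embedding model endpoint, and status-polling fields are assumptions for this example.

```python
# Illustrative sketch only; the template's vector_search_utils.py.tmpl may differ.
import time
from databricks.vector_search.client import VectorSearchClient

# Inside a Databricks notebook the client picks up credentials from the context.
client = VectorSearchClient()

endpoint_name = "ai_agent_endpoint"  # ${var.vector_search_endpoint}
# Catalog and schema qualifiers below are hypothetical placeholders.
index_name = "main.agent_schema.databricks_documentation_vs_index"  # ${var.vector_search_index}
source_table = "main.agent_schema.databricks_documentation"         # ${var.preprocessed_data_table}

# Create the endpoint if it does not already exist.
try:
    client.get_endpoint(name=endpoint_name)
except Exception:
    client.create_endpoint(name=endpoint_name, endpoint_type="STANDARD")

# Wait until the endpoint reports it is ready (response field names are assumed).
for _ in range(60):  # up to ~30 minutes
    state = client.get_endpoint(name=endpoint_name).get("endpoint_status", {}).get("state")
    if state == "ONLINE":
        break
    time.sleep(30)

# Create a Delta Sync index in TRIGGERED mode over the preprocessed table.
index = client.create_delta_sync_index(
    endpoint_name=endpoint_name,
    index_name=index_name,
    source_table_name=source_table,
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",                                    # assumed primary key column
    embedding_source_column="chunk_text",                      # assumed text column
    embedding_model_endpoint_name="databricks-gte-large-en",   # assumed embedding endpoint
)
```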
What I have tested:
- Validated project generation for both `mlops` and `agentops` project types
- Tested the original mlops-stacks project and confirmed that the default behavior is unchanged
- Validated that the data preparation pipeline works end-to-end
- Validated that bundle variables are used properly by resources