Add AgentOps project type and vector search data preparation workflows

Open veenaramesh opened this issue 2 months ago • 0 comments

Overview

This PR adds a new project type "AgentOps" to the existing MLOps Stacks template. Users can now select between two project types when initializing a stack:

  • mlops - Existing template; traditional ML pipeline for model training and batch inference
  • agentops - NEW template; agent-specific workflows for data ingestion and, eventually, agent development and deployment

This PR also adds the vector search data ingestion pipeline for the AgentOps projects.

Features

AgentOps Template Updates

1. Project type selection

  • Added input_project_type parameter to databricks_template_schema.json

    • Options: mlops (default) or agentops
    • First-order parameter in template initialization
  • Updated minimum Databricks CLI version to v0.266.0 to support new features

  • Default project name now reflects selected project type: my_{{ .input_project_type }}_project

  • Other changes:

    • Reordered parameters with input_project_type as order 1
    • Updated all subsequent parameter orders
    • Conditional parameter display (e.g., input_include_models_in_unity_catalog skipped for agentops)
    • Updated default values to be project-type aware
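
As a rough illustration of the selection behavior described above, here is a minimal Python sketch. It is not the template's code (the real logic lives in databricks_template_schema.json and Go template expressions); the helper names default_project_name and visible_prompts are hypothetical, and the prompt list contains only parameters named in this PR.

```python
PROJECT_TYPES = ("mlops", "agentops")  # options named in this PR; mlops is the default


def _check(project_type: str) -> None:
    if project_type not in PROJECT_TYPES:
        raise ValueError(f"unknown project type: {project_type!r}")


def default_project_name(project_type: str = "mlops") -> str:
    """Mirror the template default my_{{ .input_project_type }}_project."""
    _check(project_type)
    return f"my_{project_type}_project"


def visible_prompts(project_type: str) -> list[str]:
    """Return the prompts shown for a project type, in order.

    Only the skipping of input_include_models_in_unity_catalog for agentops
    is stated in the PR; this short list is illustrative, not the full schema.
    """
    _check(project_type)
    prompts = ["input_project_type", "input_project_name"]
    if project_type == "mlops":
        prompts.append("input_include_models_in_unity_catalog")
    return prompts


if __name__ == "__main__":
    for project_type in PROJECT_TYPES:
        print(project_type, default_project_name(project_type), visible_prompts(project_type))
```

The point of the sketch is that input_project_type is asked first, so every later default and prompt can depend on it.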

2. Updating project structure layout

  • Added conditional logic to update_layout.tmpl to generate the appropriate project structure based on input_project_type

    • Ensures MLOps-specific files are only generated for MLOps projects
  • Added conditional logic to certain files:

  • Separate code structure sections for MLOps vs AgentOps, which conditionally render based on input_project_type

    • requirements.txt.tmpl

      • Adds dependencies (e.g. vector search SDK)
    • README.md.tmpl

      • Adds basic documentation for agentops project
    • databricks.yml.tmpl

      • Extends bundle configuration to support agentops resources
      • Adds data preparation workflow targets
    • All CI/CD pipelines (covered in section 3, Updating CI/CD workflows)

3. Updating CI/CD workflows

  • Extended CI/CD pipelines to handle AgentOps projects and test the correct workflows:

    • GitHub Actions (.github/workflows/{{.input_project_name}}-run-tests.yml.tmpl)
    • Azure DevOps (.azure/devops-pipelines/{{.input_project_name}}-tests-ci.yml.tmpl)
    • GitLab CI (.gitlab/pipelines/{{.input_project_name}}-bundle-ci.yml.tmpl)

Data preparation with vector search for AgentOps

1. Data preparation code

  • Notebook: DataIngestion.py.tmpl
    • Processes raw documentation from data source URLs and stores it in Unity Catalog (UC)
    • Uses the utility in fetch_data.py.tmpl for retrieval
  • Notebook: DataPreprocessing.py.tmpl
    • Cleans and chunks documentation to prepare for vector search
    • Uses the utility in create_chunk.py.tmpl for chunking logic
    • Defines the chunking configs in config.py.tmpl
  • Notebook: VectorSearch.py.tmpl
    • Creates Vector Search endpoint and index using delta sync (TRIGGERED mode)
    • Uses the utilities in vector_search_utils.py.tmpl for management and for waiting until the endpoint is ready
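
To make the chunking and endpoint-wait steps above more concrete, here is a minimal Python sketch. It does not reproduce create_chunk.py.tmpl or vector_search_utils.py.tmpl: the function names chunk_text and wait_until_ready, the character-based chunk sizes, and the "ONLINE" ready state are all assumptions for illustration. The status lookup is injected as a callable, so no Vector Search SDK call is made or implied here.

```python
import time
from typing import Callable


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only chunks
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def wait_until_ready(
    get_state: Callable[[], str],
    ready_state: str = "ONLINE",  # assumed name of the ready state
    timeout_s: float = 1200.0,
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll get_state() until it returns ready_state; return the number of polls."""
    deadline = time.monotonic() + timeout_s
    polls = 0
    while True:
        polls += 1
        state = get_state()
        if state == ready_state:
            return polls
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint still {state!r} after {timeout_s}s")
        sleep(poll_s)


if __name__ == "__main__":
    print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
    states = iter(["PROVISIONING", "ONLINE"])  # hypothetical state sequence
    print(wait_until_ready(lambda: next(states), sleep=lambda s: None))
```

In a real notebook, get_state would wrap whatever call reports the endpoint status, and the index would only be created once the wait returns.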

2. Workflow resource configuration

  • Defined the data preparation workflow in data-preparation-resource.yml.tmpl, which includes each notebook as a separate task (sequential execution)
    • Parameters for notebooks are given here
    • Scheduled to run every day at 5 a.m.
    • Serverless environment and dependencies are also defined here
  • Included resource in list of resources in databricks.yml.tmpl
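
The shape of the workflow described above can be sketched as a Python dict that mirrors the Jobs fields tasks, depends_on, notebook_task and schedule. This is only an illustration, not the contents of data-preparation-resource.yml.tmpl: the job name, task keys, notebook directory and the UTC timezone are assumptions, and notebook parameters and the serverless environment are omitted.

```python
import json

# Notebook names from this PR, in the order they run.
NOTEBOOKS = ["DataIngestion", "DataPreprocessing", "VectorSearch"]


def build_data_preparation_job(notebook_dir: str = "../data_preparation") -> dict:
    """Build a job definition where each notebook task depends on the previous one.

    notebook_dir is a hypothetical path; the real template defines its own.
    """
    tasks = []
    previous_key = None
    for name in NOTEBOOKS:
        task = {
            "task_key": name,
            "notebook_task": {"notebook_path": f"{notebook_dir}/{name}.py"},
        }
        if previous_key is not None:
            # Chaining depends_on is what makes execution sequential.
            task["depends_on"] = [{"task_key": previous_key}]
        tasks.append(task)
        previous_key = name
    return {
        "name": "data-preparation-job",  # hypothetical job name
        "schedule": {
            # Quartz cron: second 0, minute 0, hour 5, every day.
            "quartz_cron_expression": "0 0 5 * * ?",
            "timezone_id": "UTC",  # assumption; the PR does not state a timezone
        },
        "tasks": tasks,
    }


if __name__ == "__main__":
    print(json.dumps(build_data_preparation_job(), indent=2))
```

Each task names its predecessor in depends_on, so VectorSearch only starts after DataPreprocessing, which only starts after DataIngestion.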

3. Defined variables in databricks.yml

  • Included variables (that will feed into data preparation workflow parameters)
    • catalog_name
      • Defined uniquely for each deployment target using template input (e.g. databricks_staging_workspace_host)
    • schema
      • Defined identically for each deployment target using the template input input_schema_name
    • raw_data_table
      • Will automatically populate as "raw_documentation"
    • preprocessed_data_table
      • Will automatically populate as "databricks_documentation"
    • eval_table
      • Will automatically populate as "databricks_documentation_eval"
    • vector_search_endpoint
      • Will automatically populate as "ai_agent_endpoint"
    • vector_search_index
      • Will automatically populate as "databricks_documentation_vs_index"
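
To show how these variables might combine downstream, here is a minimal Python sketch. The default values are the ones listed above; the DataPrepVariables class, the qualified helper, and the example catalog and schema values are hypothetical, and composing three-level Unity Catalog names this way is my assumption about how the notebooks consume the parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPrepVariables:
    """Bundle variables that feed the data preparation workflow parameters."""

    catalog_name: str  # differs per deployment target
    schema: str        # same for every deployment target
    raw_data_table: str = "raw_documentation"
    preprocessed_data_table: str = "databricks_documentation"
    eval_table: str = "databricks_documentation_eval"
    vector_search_endpoint: str = "ai_agent_endpoint"
    vector_search_index: str = "databricks_documentation_vs_index"

    def qualified(self, name: str) -> str:
        """Return a three-level Unity Catalog name: catalog.schema.name."""
        return f"{self.catalog_name}.{self.schema}.{name}"


if __name__ == "__main__":
    # "staging" and "agent_docs" are made-up example values.
    variables = DataPrepVariables(catalog_name="staging", schema="agent_docs")
    print(variables.qualified(variables.raw_data_table))
    print(variables.qualified(variables.vector_search_index))
    print(variables.vector_search_endpoint)  # endpoints are not namespaced
```

Only catalog_name and schema need to be supplied per project; the remaining names fall back to the defaults listed above.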

What I have tested:

  • Validated project generation for both mlops and agentops project types
  • Tested original mlops-stacks project + confirmed that the default behavior is unchanged
  • Validated data preparation pipeline works end-to-end
  • Validated that bundle variables are used properly by resources

veenaramesh, Nov 19 '25 20:11