dash-network-deploy icon indicating copy to clipboard operation
dash-network-deploy copied to clipboard

fix: deployment failures in terraform AMI lookup, rolling restart, and elasticsearch (devnet/testnet)

Open vivekgsharma opened this issue 3 months ago • 1 comments

Summary

Fixes deployment failures that occur during initial network deployments.

Changes

  1. Terraform AMI lookup fix (terraform/aws/main.tf)

    • Fixed Ubuntu AMI name pattern from server* to server-* for both AMD64 and ARM64
    • Prevents "AMI not found" errors by correctly matching Ubuntu image names like ubuntu-jammy-22.04-amd64-server-20240927
  2. Rolling restart improvements (ansible/roles/dashmate/tasks/rolling_restart.yml)

    • Made --safe flag conditional via dashmate_safe_restart variable
    • Added failed_when: false to prevent DKG timeout failures on fresh deployments
  3. Elasticsearch certificate handling (ansible/roles/elastic_stack/tasks/main.yml)

    • Added file existence checks before fetching ca.zip and certs.zip
    • Prevents fetch errors when certificates haven't been generated yet

Testing

Tested on devnet-mahua deployment - all services deployed successfully without errors and chain progressing.

Summary by CodeRabbit

  • Chores
    • Enhanced restart command to support a conditional safety flag and adjusted error handling so non-zero exits are ignored
    • Added pre-validation checks for certificate archives to only fetch when present
    • Refined cloud image selection with more precise name-matching patterns for instances

vivekgsharma avatar Oct 19 '25 21:10 vivekgsharma

Walkthrough

Three configuration files updated: dashmate restart command is rebuilt as a multiline command and now ignores non‑zero exits; elastic_stack adds stat checks before fetching ca.zip and certs.zip; Terraform AMI name filters gain an extra hyphen for more specific matches.

Changes

Cohort / File(s) Summary
Dashmate restart logic
ansible/roles/dashmate/tasks/rolling_restart.yml
Rewrote restart command composition into a multiline block, conditionally injects --safe when dashmate_safe_restart is true, keeps conditional --platform and --verbose placement, and sets failed_when: false to ignore non-zero exit codes.
Elastic stack pre-flight checks
ansible/roles/elastic_stack/tasks/main.yml
Added ansible.builtin.stat checks for ca.zip and certs.zip (registered as ca_zip_file and certs_zip_file) and made the existing fetch tasks conditional on those results; unarchive/install logic unchanged.
AWS AMI filter refinement
terraform/aws/main.tf
Tightened AMI name filters from "ubuntu-jammy-22.04-*-server*" to "ubuntu-jammy-22.04-*-server-*" in the ubuntu_amd and ubuntu_arm data sources.

Sequence Diagrams

sequenceDiagram
    participant Task as Dashmate Restart Task
    participant Cond as Eval: dashmate_safe_restart
    participant Cmd as Build Command
    participant Exec as Execute Restart
    participant Handle as Result Handler

    Task->>Cond: Check dashmate_safe_restart
    alt safe = true
        Cond->>Cmd: include --safe
    else safe = false
        Cond->>Cmd: omit --safe
    end
    Cmd->>Cmd: ensure --verbose and conditional --platform
    Cmd->>Exec: run restart command
    Exec->>Handle: return exit code
    Note over Handle: failed_when: false -> ignore non-zero exit
sequenceDiagram
    participant Task as Elastic Stack Tasks
    participant StatCA as Stat: ca.zip
    participant StatCert as Stat: certs.zip
    participant FetchCA as Fetch CA (if missing)
    participant FetchCert as Fetch certs (if missing)
    participant Unarchive as Unarchive / Install

    Task->>StatCA: stat ca.zip
    StatCA-->>Task: exists? true/false
    Task->>StatCert: stat certs.zip
    StatCert-->>Task: exists? true/false

    alt ca.zip not exists
        Task->>FetchCA: fetch ca.zip
    end
    alt certs.zip not exists
        Task->>FetchCert: fetch certs.zip
    end
    FetchCA->>Unarchive: pass files
    FetchCert->>Unarchive: pass files
    Unarchive->>Unarchive: proceed with existing logic

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I nibbled flags and stitched a line,

--safe hops in when fences align,
I peeked in sacks before I fetch,
tuned AMI names to a tighter stretch,
a quiet restart, errors left behind.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The pull request title "fix: deployment failures in terraform AMI lookup, rolling restart, and elasticsearch (devnet/testnet)" directly aligns with the three main changes in the changeset. The title accurately identifies the three files modified (terraform/aws/main.tf, ansible/roles/dashmate/tasks/rolling_restart.yml, and ansible/roles/elastic_stack/tasks/main.yml) and clearly conveys that these changes address deployment failures. The title is concise, uses the conventional "fix:" prefix, specifies the scope (devnet/testnet), and avoids vague terminology or noise. A developer scanning git history would immediately understand that this PR fixes specific deployment issues across infrastructure provisioning and orchestration tasks.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • [ ] Create PR with unit tests
  • [ ] Post copyable unit tests in a comment
  • [ ] Commit unit tests in branch vivek/devnet-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

coderabbitai[bot] avatar Oct 19 '25 21:10 coderabbitai[bot]