fix: deployment failures in terraform AMI lookup, rolling restart, and elasticsearch (devnet/testnet)
Summary
Fixes deployment failures that occur during initial network deployments.
Changes
-
Terraform AMI lookup fix (
terraform/aws/main.tf)- Fixed Ubuntu AMI name pattern from
server*toserver-*for both AMD64 and ARM64 - Prevents "AMI not found" errors by correctly matching Ubuntu image names like
ubuntu-jammy-22.04-amd64-server-20240927
- Fixed Ubuntu AMI name pattern from
-
Rolling restart improvements (
ansible/roles/dashmate/tasks/rolling_restart.yml)- Made
--safeflag conditional viadashmate_safe_restartvariable - Added
failed_when: falseto prevent DKG timeout failures on fresh deployments
- Made
-
Elasticsearch certificate handling (
ansible/roles/elastic_stack/tasks/main.yml)- Added file existence checks before fetching ca.zip and certs.zip
- Prevents fetch errors when certificates haven't been generated yet
Testing
Tested on devnet-mahua deployment - all services deployed successfully without errors and chain progressing.
Summary by CodeRabbit
-
Chores
- Enhanced restart command to support a conditional safety flag and adjusted error handling so non-zero exits are ignored
- Added pre-validation checks for certificate archives to only fetch when present
- Refined cloud image selection with more precise name-matching patterns for instances
Walkthrough
Three configuration files updated: dashmate restart command is rebuilt as a multiline command and now ignores non‑zero exits; elastic_stack adds stat checks before fetching ca.zip and certs.zip; Terraform AMI name filters gain an extra hyphen for more specific matches.
Changes
| Cohort / File(s) | Summary |
|---|---|
Dashmate restart logic ansible/roles/dashmate/tasks/rolling_restart.yml |
Rewrote restart command composition into a multiline block, conditionally injects --safe when dashmate_safe_restart is true, keeps conditional --platform and --verbose placement, and sets failed_when: false to ignore non-zero exit codes. |
Elastic stack pre-flight checks ansible/roles/elastic_stack/tasks/main.yml |
Added ansible.builtin.stat checks for ca.zip and certs.zip (registered as ca_zip_file and certs_zip_file) and made the existing fetch tasks conditional on those results; unarchive/install logic unchanged. |
AWS AMI filter refinement terraform/aws/main.tf |
Tightened AMI name filters from "ubuntu-jammy-22.04-*-server*" to "ubuntu-jammy-22.04-*-server-*" in the ubuntu_amd and ubuntu_arm data sources. |
Sequence Diagrams
sequenceDiagram
participant Task as Dashmate Restart Task
participant Cond as Eval: dashmate_safe_restart
participant Cmd as Build Command
participant Exec as Execute Restart
participant Handle as Result Handler
Task->>Cond: Check dashmate_safe_restart
alt safe = true
Cond->>Cmd: include --safe
else safe = false
Cond->>Cmd: omit --safe
end
Cmd->>Cmd: ensure --verbose and conditional --platform
Cmd->>Exec: run restart command
Exec->>Handle: return exit code
Note over Handle: failed_when: false -> ignore non-zero exit
sequenceDiagram
participant Task as Elastic Stack Tasks
participant StatCA as Stat: ca.zip
participant StatCert as Stat: certs.zip
participant FetchCA as Fetch CA (if missing)
participant FetchCert as Fetch certs (if missing)
participant Unarchive as Unarchive / Install
Task->>StatCA: stat ca.zip
StatCA-->>Task: exists? true/false
Task->>StatCert: stat certs.zip
StatCert-->>Task: exists? true/false
alt ca.zip not exists
Task->>FetchCA: fetch ca.zip
end
alt certs.zip not exists
Task->>FetchCert: fetch certs.zip
end
FetchCA->>Unarchive: pass files
FetchCert->>Unarchive: pass files
Unarchive->>Unarchive: proceed with existing logic
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
🐇 I nibbled flags and stitched a line,
--safe hops in when fences align,
I peeked in sacks before I fetch,
tuned AMI names to a tighter stretch,
a quiet restart, errors left behind.
Pre-merge checks and finishing touches
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The pull request title "fix: deployment failures in terraform AMI lookup, rolling restart, and elasticsearch (devnet/testnet)" directly aligns with the three main changes in the changeset. The title accurately identifies the three files modified (terraform/aws/main.tf, ansible/roles/dashmate/tasks/rolling_restart.yml, and ansible/roles/elastic_stack/tasks/main.yml) and clearly conveys that these changes address deployment failures. The title is concise, uses the conventional "fix:" prefix, specifies the scope (devnet/testnet), and avoids vague terminology or noise. A developer scanning git history would immediately understand that this PR fixes specific deployment issues across infrastructure provisioning and orchestration tasks. |
| Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped. |
✨ Finishing touches
🧪 Generate unit tests (beta)
- [ ] Create PR with unit tests
- [ ] Post copyable unit tests in a comment
- [ ] Commit unit tests in branch
vivek/devnet-fixes
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Comment @coderabbitai help to get the list of available commands and usage tips.