Fix hostname parsing and improve decommission process
Summary of changes
Added
- Persistent Logging: Dual logging to both STDOUT and a persistent file with automatic rotation
  - New `--log-file` CLI flag (default: `/resource/heapdump/dc_util.log`)
  - Automatic file rotation when the log file approaches 1MB, to prevent disk space issues
  - Failsafe design: STDOUT logging continues even if file logging fails
  - Essential for debugging Kubernetes lifecycle hooks, where container logs may not be accessible
  - Creates the log directory structure if it doesn't exist
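The dual-logging behaviour listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: the helper names `newDualLogger` and `rotateIfNeeded`, the `.1` backup suffix, the exact threshold constant, and the demo path are all hypothetical.

```go
package main

import (
	"io"
	"log"
	"os"
	"path/filepath"
)

// maxLogSize is an assumed rotation threshold of 1 MB.
const maxLogSize = 1 << 20

// demoLogPath is a throwaway location used only by this sketch.
var demoLogPath = filepath.Join(os.TempDir(), "dc_util_demo", "dc_util.log")

// rotateIfNeeded renames the log file to "<path>.1" once it has reached
// maxSize bytes. A missing file is not treated as an error.
func rotateIfNeeded(path string, maxSize int64) error {
	info, err := os.Stat(path)
	if err != nil {
		if os.IsNotExist(err) {
			return nil
		}
		return err
	}
	if info.Size() < maxSize {
		return nil
	}
	return os.Rename(path, path+".1")
}

// newDualLogger returns a logger that writes to STDOUT and, when possible,
// to the file at path. If the file cannot be prepared, it falls back to
// STDOUT only. The returned function closes the file (if one was opened).
func newDualLogger(path string) (*log.Logger, func()) {
	stdout := log.New(os.Stdout, "", log.LstdFlags)
	noop := func() {}

	// Create the directory structure if it does not exist yet.
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		stdout.Printf("file logging disabled: %v", err)
		return stdout, noop
	}
	if err := rotateIfNeeded(path, maxLogSize); err != nil {
		stdout.Printf("log rotation failed: %v", err)
	}
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		stdout.Printf("file logging disabled: %v", err)
		return stdout, noop
	}
	// STDOUT comes first so it still receives each message if the file write fails.
	return log.New(io.MultiWriter(os.Stdout, f), "", log.LstdFlags), func() { f.Close() }
}

func main() {
	logger, closeLog := newDualLogger(demoLogPath)
	defer closeLog()
	logger.Println("decommission started")
}
```

Checking the size once at start-up keeps the sketch short; a long-running process would likely need to re-check on every write.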
- PostStart Hook Detection: Detects whether the StatefulSet defines a PostStart hook that resets routing
  - Automatically scans StatefulSet containers for PostStart hooks with `dc_util --reset-routing`
  - Prevents routing allocation changes when no PostStart hook exists to reset them
  - Solves a historical issue where the `NEW_PRIMARIES` routing allocation could not be reliably reset
  - Supports both single dash (`-reset-routing`) and double dash (`--reset-routing`) flag formats
  - Precise word boundary matching prevents false positives from similar flag names
  - Logs clear messages when PostStart hooks are found or missing
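One way the word-boundary flag matching described above could work is sketched below. The function name `commandResetsRouting`, the regular expression, and the choice to join the hook command into a single string are assumptions made for illustration; the actual detection in this PR operates on StatefulSet container specs and may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// resetRoutingFlag matches "-reset-routing" or "--reset-routing" only when the
// flag stands alone: preceded by start-of-string or whitespace, and followed by
// end-of-string, whitespace, "=", or a quote character.
var resetRoutingFlag = regexp.MustCompile(`(^|\s)--?reset-routing($|[\s="'])`)

// commandResetsRouting reports whether a hook command invokes dc_util with the
// reset-routing flag. The command is the exec-style argument list of the hook.
func commandResetsRouting(command []string) bool {
	joined := strings.Join(command, " ")
	return strings.Contains(joined, "dc_util") && resetRoutingFlag.MatchString(joined)
}

func main() {
	hooks := [][]string{
		{"/bin/sh", "-c", "./dc_util --reset-routing"},
		{"/bin/sh", "-c", "./dc_util -reset-routing"},
		// "--reset-routing-extra" is a made-up flag used to show a near miss.
		{"/bin/sh", "-c", "./dc_util --reset-routing-extra"},
	}
	for _, h := range hooks {
		fmt.Printf("%q resets routing: %v\n", h, commandResetsRouting(h))
	}
}
```

Anchoring the pattern on whitespace rather than on `\b` avoids treating the hyphen in a longer flag name as a word boundary.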
- Single Node Cluster Detection: Automatic detection and handling of single node clusters
  - Detects when StatefulSet has exactly 1 replica and skips decommission
  - Prevents unnecessary overhead and potential failures in single node deployments
  - Clear logging explains why decommission was skipped
  - Maintains existing behavior for multi-node clusters (≥2 replicas)
- Configurable Lock File Path: New `--lock-file` CLI flag
  - Default: `/resource/heapdump/dc_util.lock`
  - Allows customization for different deployment scenarios
  - All lock file operations now use the configurable path
- Enhanced Flag Support: Improved command-line flag handling
  - Both `-reset-routing` and `--reset-routing` formats now supported
  - Maintains backward compatibility with existing deployments
  - Better error handling and validation
- Multi-Architecture Support: Automatic CPU architecture detection in hook configurations
  - Hook examples now include automatic detection of x86_64/amd64 and aarch64/arm64 architectures
  - Downloads appropriate binary based on detected architecture (`dc_util-linux-amd64` or `dc_util-linux-arm64`)
  - Eliminates need for separate configuration files for different node architectures
  - Graceful error handling for unsupported architectures
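The hook examples themselves are shell snippets, but the architecture-to-binary mapping they perform can be illustrated in Go. The function name `binaryForMachine` and the error wording are hypothetical; only the architecture names and binary names come from the list above.

```go
package main

import "fmt"

// binaryForMachine maps a machine name (for example the output of `uname -m`)
// to the matching dc_util release binary. Unknown machines yield an error
// instead of a guess.
func binaryForMachine(machine string) (string, error) {
	switch machine {
	case "x86_64", "amd64":
		return "dc_util-linux-amd64", nil
	case "aarch64", "arm64":
		return "dc_util-linux-arm64", nil
	default:
		return "", fmt.Errorf("unsupported architecture: %q", machine)
	}
}

func main() {
	for _, m := range []string{"x86_64", "aarch64", "riscv64"} {
		name, err := binaryForMachine(m)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(m, "->", name)
	}
}
```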
Changed
- Routing Allocation Logic: Enhanced PreStop process with PostStart hook detection
  - Routing allocation changes now only occur when a corresponding PostStart hook exists
  - Prevents permanent cluster misconfiguration in deployments without PostStart hooks
  - Decisions are now based on the actual StatefulSet configuration
- Replica Count Handling: Improved logic for different cluster sizes
  - Zero replicas (scaled down): Skips decommission with clear logging
  - Single replica: Skips decommission to prevent failures
  - Multiple replicas: Proceeds with normal decommission process
  - Better log messages explaining the decision for each scenario
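The three replica-count cases above amount to a small decision function. The sketch below is illustrative only: the `decision` type, the function name `decideDecommission`, and the reason strings are assumptions, not the names or log messages used in dc_util.

```go
package main

import "fmt"

// decision records whether to decommission and why, so the reason can be logged.
type decision struct {
	decommission bool
	reason       string
}

// decideDecommission applies the replica-count rules: skip for zero or one
// replica, proceed for two or more.
func decideDecommission(replicas int32) decision {
	switch {
	case replicas <= 0:
		return decision{false, "StatefulSet is scaled down to zero replicas; skipping decommission"}
	case replicas == 1:
		return decision{false, "single node cluster; skipping decommission"}
	default:
		return decision{true, "multi-node cluster; proceeding with decommission"}
	}
}

func main() {
	for _, n := range []int32{0, 1, 3} {
		d := decideDecommission(n)
		fmt.Printf("replicas=%d decommission=%v (%s)\n", n, d.decommission, d.reason)
	}
}
```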
- Function Signatures: Updated internal functions to support configurable paths
  - `createLockFile()` now accepts lock file path parameter
  - `removeLockFile()` now accepts lock file path parameter
  - `lockFileExists()` now accepts lock file path parameter
  - `handleResetRouting()` now accepts lock file path parameter
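As a rough illustration of path-parameterised lock file helpers, the sketch below reuses the function names listed above. The return types, bodies, and demo path are assumptions rather than the code in this PR. `handleResetRouting()` is left out because its behaviour is not described here.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// demoLockPath is a throwaway location used only by this sketch.
var demoLockPath = filepath.Join(os.TempDir(), "dc_util_demo", "dc_util.lock")

// createLockFile creates (or truncates) an empty lock file at path,
// creating parent directories as needed.
func createLockFile(path string) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	return f.Close()
}

// removeLockFile deletes the lock file; a missing file is not an error.
func removeLockFile(path string) error {
	err := os.Remove(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil
	}
	return err
}

// lockFileExists reports whether a lock file is present at path.
func lockFileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	if err := createLockFile(demoLockPath); err != nil {
		fmt.Println("create failed:", err)
		return
	}
	fmt.Println("lock present:", lockFileExists(demoLockPath))
	if err := removeLockFile(demoLockPath); err != nil {
		fmt.Println("remove failed:", err)
		return
	}
	fmt.Println("lock present:", lockFileExists(demoLockPath))
}
```

Passing the path explicitly, rather than reading a package-level constant, is what lets a `--lock-file` value reach every lock operation and makes the helpers easy to test against a temporary directory.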
Improved
- Logging Experience: Comprehensive logging improvements
  - All log messages now appear in both STDOUT and persistent file
  - Better visibility into hook execution for debugging
  - Historical logs available even after pod restarts
  - Easier troubleshooting and operations monitoring
- Documentation: Extensively updated README.md
  - Added "Recent Updates" section highlighting new features
  - New "Replica Count Logic" section with examples
  - Updated CLI parameter table with new flags
  - Enhanced "PostStart Hook Detection" documentation
  - Added complete "Persistent Logging" section with usage examples
  - Updated sample logs sections to reflect new capabilities
  - All hook configuration examples now include automatic architecture detection
  - Clear separation between basic (preStop only) and complete (both hooks) configurations
- Testing: Comprehensive test coverage for all new features
  - `TestHasPostStartHookWithResetRouting`: PostStart hook detection with various scenarios
  - `TestPostStopRoutingAllocationIntegration`: Integration tests for routing allocation logic
  - `TestLoggingIntegration`: Dual logging functionality verification
  - `TestLogRotation`: File rotation behavior validation
  - `TestSingleNodeClusterBehavior`: Single node cluster detection tests
  - `TestReplicaCountBehavior`: Comprehensive replica count handling tests
  - All existing tests updated to work with new function signatures
Checklist
- [x] Link to issue this PR refers to: https://github.com/crate/cloud/issues/2755
- [x] Relevant changes are reflected in `CHANGES.rst`
- [x] Added or changed code is covered by tests
- [ ] Documentation has been updated if necessary
- [ ] Changed code does not contain any breaking changes (or this is a major version change)