[Feature Request] Enhanced Segment Reload Status Tracking with Failure Details
Problem Statement
Current Limitation
The existing segment reload status API (GET /segments/segmentReloadStatus/{jobId}) infers reload success from a timestamp heuristic: a segment is counted as reloaded if its segmentLoadTimeMs is at or after the job submission time (segmentLoadTimeMs >= jobSubmissionTime). This approach has several limitations:
- Cannot distinguish between states: Pending, in-progress, and failed segments all appear as "not successful"
- No failure visibility: When reloads fail, errors are logged but not queryable via API
- No error details: Users must access pod logs to understand why segments failed to reload
- False positives: Segments reloaded for unrelated reasons may be incorrectly counted as successful
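For context, the current check amounts to the comparison below (a simplified sketch with illustrative names, not the actual Pinot code): any segment loaded after the job was submitted, for any reason, passes it.

```java
// Simplified sketch of the existing timestamp heuristic (illustrative names only).
// A segment counts as "successfully reloaded" merely because its load time is newer
// than the job submission time; pending, in-progress, and failed segments all fail
// this check and look identical.
static boolean looksReloaded(long segmentLoadTimeMs, long jobSubmissionTimeMs) {
  return segmentLoadTimeMs >= jobSubmissionTimeMs;
}
```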
Impact
- Debugging difficulty: Users cannot determine which segments failed or why without log access
- Monitoring gaps: No programmatic way to alert on reload failures
- Incomplete observability: Controllers and clients lack visibility into reload operation health
Proposed Solution
Implement an in-memory status cache on Pinot servers to track reload failures with complete error details and aggregate success statistics, enabling comprehensive reload observability through the existing API.
Design Principles
- Memory-efficient: Store only failures individually, aggregate statistics for successes
- Failure-focused: Complete debugging context (stack traces) for failures only
- Scalable: Memory grows O(failures) not O(total_segments)
- Thread-safe: Handle concurrent reload operations safely
- Backward compatible: Enhance existing APIs without breaking changes
- Operationally friendly: Configurable limits and auto-cleanup
Implementation Plan
Phase 1: Failure Tracking with Aggregate Success Metrics (PR #17099)
Status: In review
Scope:
- In-memory cache tracking failures per reload job with full stack traces
- Aggregate success statistics (count, min/max/avg duration)
- Job ID preservation in Helix messages
- Enhanced server API returning failure details and performance metrics
- Controller aggregation of results
Data Model:
class ReloadJobStatus {
  // Only failed segments stored individually
  Map<String, SegmentFailureInfo> failedSegments;
  // Success metrics aggregated (fixed 28 bytes)
  ReloadSuccessStats successStats;
}

class SegmentFailureInfo {
  String segmentName;
  String message;     // Exception message
  String stackTrace;  // Full stack trace for debugging
}

class ReloadSuccessStats {
  int successCount;
  long minDurationMs;
  long maxDurationMs;
  double avgDurationMs; // Exponential moving average
}
API Enhancement:
{
  "jobId": "uuid",
  "successStats": {
    "successCount": 97,
    "minDurationMs": 45,
    "maxDurationMs": 1250,
    "avgDurationMs": 312.5
  },
  "failedSegments": [
    {
      "segmentName": "seg2",
      "message": "Connection timeout after 30 seconds",
      "stackTrace": "java.io.IOException: Connection timeout...\n at ..."
    }
  ],
  "totalSegments": 100,
  "failureCount": 3
}
Memory Budget (significantly optimized):
- Typical case (1% failure): 88 MB for 10,000 jobs (vs. 230 MB with per-segment tracking)
- Best case (0% failure): 4.4 MB for 10,000 jobs
- Worst case (10% failure): 805 MB for 10,000 jobs
- Memory reduction: 60% less than per-segment status tracking
- Scalability: Grows only with failures, not total segment count
Key Optimization:
- No storage for successful segments (99% of cases)
- Only aggregate statistics: 28 bytes fixed overhead regardless of success count
- 100 successful segments: 28 bytes vs. 12 KB = 99.8% reduction
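As a rough consistency check on the figures above (assuming ~100 segments per job, as in the API example above, and on the order of 8 KB per stored failure including message and stack trace): 0% failure leaves only the ~440 bytes per job of fixed overhead implied by the best case (4.4 MB / 10,000 jobs); 1% failure adds roughly one 8 KB failure entry per job (≈ 88 MB / 10,000 jobs); 10% failure adds roughly ten (≈ 805 MB / 10,000 jobs). The exact byte counts depend on stack trace length, but the key property is that memory scales with failure count, not segment count.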
Current PR: #17099
Phase 2: Complete Failure Path Coverage (Planned)
Scope:
- Fix gaps in failure tracking coverage:
  - Single segment reload path (currently untracked)
  - Config fetch failures (occur before the instrumented try-catch)
  - Semaphore acquire failures (partial coverage)
- Add safety net at message handler level to catch all failure types
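One possible shape for that safety net, as a sketch with hypothetical names (ReloadStatusCache, SegmentReloader, and the recordFailure signature are stand-ins, not the PR's actual types):

```java
import java.util.List;

// Hypothetical sketch of the Phase 2 safety net: wrap the whole reload message
// handler so failures that happen before per-segment tracking kicks in (config
// fetch, semaphore acquire, etc.) are still recorded against the job.
class ReloadMessageSafetyNet {
  private final ReloadStatusCache cache;   // stand-in for the Phase 1 status cache
  private final SegmentReloader reloader;  // stand-in for the existing reload logic

  ReloadMessageSafetyNet(ReloadStatusCache cache, SegmentReloader reloader) {
    this.cache = cache;
    this.reloader = reloader;
  }

  void handle(String jobId, List<String> segmentNames) throws Exception {
    try {
      reloader.reload(jobId, segmentNames);
    } catch (Exception e) {
      // Catch-all: record the failure for every segment in the message, then rethrow
      // so existing error handling is unchanged.
      for (String segmentName : segmentNames) {
        cache.recordFailure(jobId, segmentName, e);
      }
      throw e;
    }
  }

  interface ReloadStatusCache {
    void recordFailure(String jobId, String segmentName, Throwable cause);
  }

  interface SegmentReloader {
    void reload(String jobId, List<String> segmentNames) throws Exception;
  }
}
```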
Coverage Matrix (Current State):
| Failure Type | Single Segment | Batch Reload | Coverage |
|---|---|---|---|
| Config fetch | ❌ Not tracked | ❌ Not tracked | 0% |
| Semaphore acquire | ❌ Not tracked | ✅ Tracked | Partial |
| Download failures | ❌ Not tracked | ✅ Tracked | Partial |
| Index building | ❌ Not tracked | ✅ Tracked | Partial |
Phase 3: Enhanced Statistics and Metrics (Future)
Scope:
- Additional performance metrics (percentiles, histograms)
- Failure categorization by error type
- Retry tracking for failed segments
- Time-series trends for reload performance
Design Details
Cache Structure
import java.util.concurrent.TimeUnit;

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

Cache<String, ReloadJobStatus> cache =
    CacheBuilder.newBuilder()
        .maximumSize(10000)                   // Configurable
        .expireAfterWrite(30, TimeUnit.DAYS)  // Configurable
        .recordStats()                        // Enable metrics
        .build();
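Typical access pattern against this cache (a sketch; get(key, loader) and getIfPresent are standard Guava Cache methods, while the helper names are assumptions):

```java
// Create-or-get the per-job status before recording results.
ReloadJobStatus getOrCreate(String jobId) {
  try {
    return cache.get(jobId, ReloadJobStatus::new);
  } catch (java.util.concurrent.ExecutionException e) {
    throw new IllegalStateException(e); // the no-arg constructor cannot fail
  }
}

// Serving the status API: null means the job is unknown or has expired from the cache.
ReloadJobStatus lookupForApi(String jobId) {
  return cache.getIfPresent(jobId);
}
```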
Configuration
# Server configuration
pinot.server.table.reload.status.cache.max.size=10000
pinot.server.table.reload.status.cache.ttl.days=30
pinot.server.table.reload.status.stats.ema.alpha=0.1
Thread Safety
- Guava Cache protects job-level access
- Synchronized methods in ReloadJobStatus for updates
- ConcurrentHashMap for failed segments map
- Atomic statistics updates using synchronized recordSuccess/recordFailure
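A minimal sketch of how these pieces might fit together (field and method names are assumptions rather than the PR's actual code, and the success stats are folded into the class for brevity):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: failures go into a ConcurrentHashMap keyed by segment name; aggregate
// success statistics are updated under the object's lock via a synchronized method.
class ReloadJobStatus {
  private static final double ALPHA = 0.1; // EMA weight, configurable in practice

  private final Map<String, SegmentFailureInfo> failedSegments = new ConcurrentHashMap<>();
  private int successCount;
  private long minDurationMs = Long.MAX_VALUE;
  private long maxDurationMs;
  private double avgDurationMs;

  synchronized void recordSuccess(long durationMs) {
    minDurationMs = Math.min(minDurationMs, durationMs);
    maxDurationMs = Math.max(maxDurationMs, durationMs);
    // newAvg = alpha * newDuration + (1 - alpha) * oldAvg; seed with the first sample.
    avgDurationMs = successCount == 0 ? durationMs : ALPHA * durationMs + (1 - ALPHA) * avgDurationMs;
    successCount++;
  }

  void recordFailure(String segmentName, Throwable cause) {
    // ConcurrentHashMap makes this safe without taking the stats lock.
    failedSegments.put(segmentName, new SegmentFailureInfo(segmentName, cause));
  }

  static class SegmentFailureInfo {
    final String segmentName;
    final String message;

    SegmentFailureInfo(String segmentName, Throwable cause) {
      this.segmentName = segmentName;
      this.message = cause.getMessage();
      // The full stack trace string would also be captured here, per the data model above.
    }
  }
}
```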
Exponential Moving Average
- Configurable alpha parameter (default: 0.1)
- Formula: newAvg = alpha × newDuration + (1 - alpha) × oldAvg
- Balances recent vs. historical performance
- Single value, no history storage required
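- Example: with alpha = 0.1, a running average of 300 ms, and a new reload taking 400 ms, the new average is 0.1 × 400 + 0.9 × 300 = 310 ms, so one slow reload nudges the average rather than dominating it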
Benefits
- Improved Debugging: Full stack traces for failures queryable via API
- Memory Efficient: 60% less memory than per-segment tracking
- Better Monitoring: Programmatic access to failure details and performance metrics
- Operational Visibility: Clear insight into reload job health across the cluster
- Reduced MTTR: Faster incident diagnosis with accessible error details
- Performance Insights: Min/max/avg duration metrics for reload operations
- Scalable Design: Memory grows only with failures, not total segment count
Alternatives Considered
- Per-segment status tracking: Rejected due to memory overhead (99% waste for successful segments)
- ZooKeeper-based persistence: Rejected due to write overhead and ZK load concerns
- Compressed stack traces: Rejected for simplicity; full traces fit acceptable memory budget
- Optional cache: Rejected; making cache required simplifies code and improves reliability
- Status enum (PENDING/IN_PROGRESS): Rejected; only failures need tracking for debugging
Open Questions
- Should failed entries have longer TTL than successful ones for debugging?
- Should we expose cache admin APIs (clear, stats) for operational management?
- What metrics/alerts should be added for reload failure monitoring?
- Should EMA alpha be configurable per-table or global only?
References
- Related PR: #17099 (Phase 1 implementation)
- Design Principle: Memory-first with failure focus
- Inspiration: Similar tracking in other distributed systems (Kubernetes Job status, Spark task tracking)
Community Feedback Welcome
We welcome feedback on:
- Phase priorities and scope
- API response format
- Memory budget concerns
- Success statistics usefulness (min/max/avg)
- Alternative approaches
- Additional use cases
Note: This is a phased enhancement designed to incrementally improve reload observability while maintaining backward compatibility and operational stability. Phase 1 provides immediate value with minimal risk and a dramatically reduced memory footprint by focusing on what operators actually need: detailed failure information and aggregate performance metrics.