
[Feature Request] Enhanced Segment Reload Status Tracking with Failure Details

Status: Open · suvodeep-pyne opened this issue 3 months ago · 0 comments

Problem Statement

Current Limitation

The existing segment reload status API (GET /segments/segmentReloadStatus/{jobId}) determines reload success with a timestamp-based heuristic: a segment is counted as reloaded if segmentLoadTimeMs >= jobSubmissionTime. This approach has several limitations:

  1. Cannot distinguish between states: Pending, in-progress, and failed segments all appear as "not successful"
  2. No failure visibility: When reloads fail, errors are logged but not queryable via API
  3. No error details: Users must access pod logs to understand why segments failed to reload
  4. False positives: Segments reloaded for unrelated reasons may be incorrectly counted as successful
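
To make the heuristic concrete, the current check amounts to nothing more than the comparison below (a minimal sketch; parameter names are illustrative, not the actual controller code):

static boolean looksReloaded(long segmentLoadTimeMs, long jobSubmissionTimeMs) {
  // A segment reloaded for an unrelated reason (e.g. a server restart) also
  // passes this check, while pending, in-progress, and failed segments are
  // indistinguishable: all of them simply read as "not successful".
  return segmentLoadTimeMs >= jobSubmissionTimeMs;
}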

Impact

  • Debugging difficulty: Users cannot determine which segments failed or why without log access
  • Monitoring gaps: No programmatic way to alert on reload failures
  • Incomplete observability: Controllers and clients lack visibility into reload operation health

Proposed Solution

Implement an in-memory status cache on Pinot servers to track reload failures with complete error details and aggregate success statistics, enabling comprehensive reload observability through the existing API.

Design Principles

  1. Memory-efficient: Store only failures individually, aggregate statistics for successes
  2. Failure-focused: Complete debugging context (stack traces) for failures only
  3. Scalable: Memory grows O(failures) not O(total_segments)
  4. Thread-safe: Handle concurrent reload operations safely
  5. Backward compatible: Enhance existing APIs without breaking changes
  6. Operationally friendly: Configurable limits and auto-cleanup

Implementation Plan

Phase 1: Failure Tracking with Aggregate Success Metrics (PR #17099)

Status: In review

Scope:

  • In-memory cache tracking failures per reload job with full stack traces
  • Aggregate success statistics (count, min/max/avg duration)
  • Job ID preservation in Helix messages
  • Enhanced server API returning failure details and performance metrics
  • Controller aggregation of results

Data Model:

class ReloadJobStatus {
  // Only failed segments stored individually
  Map<String, SegmentFailureInfo> failedSegments;
  
  // Success metrics aggregated (fixed 28 bytes)
  ReloadSuccessStats successStats;
}

class SegmentFailureInfo {
  String segmentName;
  String message;        // Exception message
  String stackTrace;     // Full stack trace for debugging
}

class ReloadSuccessStats {
  int successCount;
  long minDurationMs;
  long maxDurationMs;
  double avgDurationMs;  // Exponential moving average
}
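
A rough sketch of how these classes might record outcomes is shown below. The recordSuccess/recordFailure names follow the Thread Safety section later in this proposal; everything else (field names, the assumed SegmentFailureInfo constructor) is illustrative, not the code in PR #17099.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ReloadJobStatus {
  // Only failures are stored individually; successes collapse into one stats object.
  private final Map<String, SegmentFailureInfo> _failedSegments = new ConcurrentHashMap<>();
  private final ReloadSuccessStats _successStats = new ReloadSuccessStats();

  void recordFailure(String segmentName, String message, String stackTrace) {
    // Assumes a (segmentName, message, stackTrace) constructor on SegmentFailureInfo.
    _failedSegments.put(segmentName, new SegmentFailureInfo(segmentName, message, stackTrace));
  }

  void recordSuccess(long durationMs) {
    _successStats.record(durationMs);
  }
}

class ReloadSuccessStats {
  private static final double ALPHA = 0.1;  // EMA weight; configurable in the proposal
  private int _successCount;
  private long _minDurationMs = Long.MAX_VALUE;
  private long _maxDurationMs;
  private double _avgDurationMs;

  synchronized void record(long durationMs) {
    _successCount++;
    _minDurationMs = Math.min(_minDurationMs, durationMs);
    _maxDurationMs = Math.max(_maxDurationMs, durationMs);
    // newAvg = alpha * newDuration + (1 - alpha) * oldAvg
    _avgDurationMs = _successCount == 1 ? durationMs : ALPHA * durationMs + (1 - ALPHA) * _avgDurationMs;
  }
}

Collapsing successes into a single stats object is what keeps the per-job overhead near the quoted fixed 28 bytes, no matter how many segments reload cleanly.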

API Enhancement:

{
  "jobId": "uuid",
  "successStats": {
    "successCount": 97,
    "minDurationMs": 45,
    "maxDurationMs": 1250,
    "avgDurationMs": 312.5
  },
  "failedSegments": [
    {
      "segmentName": "seg2",
      "message": "Connection timeout after 30 seconds",
      "stackTrace": "java.io.IOException: Connection timeout...\n  at ..."
    }
  ],
  "totalSegments": 100,
  "failureCount": 3
}

Memory Budget (significantly optimized):

  • Typical case (1% failure): 88 MB for 10,000 jobs (vs. 230 MB with per-segment tracking)
  • Best case (0% failure): 4.4 MB for 10,000 jobs
  • Worst case (10% failure): 805 MB for 10,000 jobs
  • Memory reduction: 60% less than per-segment status tracking
  • Scalability: Grows only with failures, not total segment count

Key Optimization:

  • No storage for successful segments (99% of cases)
  • Only aggregate statistics: 28 bytes fixed overhead regardless of success count
  • 100 successful segments: 28 bytes vs. 12 KB = 99.8% reduction

Current PR: #17099


Phase 2: Complete Failure Path Coverage (Planned)

Scope:

  • Fix gaps in failure tracking coverage:
    • Single segment reload path (currently untracked)
    • Config fetch failures (occurs before instrumented try-catch)
    • Semaphore acquire failures (partial coverage)
  • Add safety net at message handler level to catch all failure types
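
As a rough illustration of the safety-net idea, a wrapper at the message-handler level might look like the following (handler, reload, and lookup method names are hypothetical, not the actual Pinot APIs):

import org.apache.commons.lang3.exception.ExceptionUtils;

// Hypothetical safety net: any exception that escapes the reload path is
// recorded against the job, so config-fetch and semaphore failures are no
// longer visible only in server logs.
void handleReloadMessage(String jobId, String segmentName) throws Exception {
  ReloadJobStatus status = getOrCreateJobStatus(jobId);  // assumed cache-backed helper
  long startMs = System.currentTimeMillis();
  try {
    reloadSegment(segmentName);                          // existing reload path (placeholder)
    status.recordSuccess(System.currentTimeMillis() - startMs);
  } catch (Exception e) {
    status.recordFailure(segmentName, e.getMessage(), ExceptionUtils.getStackTrace(e));
    throw e;                                             // preserve existing Helix error handling
  }
}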

Coverage Matrix (Current State):

Failure Type      | Single Segment | Batch Reload   | Coverage
Config fetch      | ❌ Not tracked | ❌ Not tracked | 0%
Semaphore acquire | ❌ Not tracked | ✅ Tracked     | Partial
Download failures | ❌ Not tracked | ✅ Tracked     | Partial
Index building    | ❌ Not tracked | ✅ Tracked     | Partial

Phase 3: Enhanced Statistics and Metrics (Future)

Scope:

  • Additional performance metrics (percentiles, histograms)
  • Failure categorization by error type
  • Retry tracking for failed segments
  • Time-series trends for reload performance

Design Details

Cache Structure

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

Cache<String, ReloadJobStatus> cache =
    CacheBuilder.newBuilder()
        .maximumSize(10000)                       // Configurable (max tracked jobs)
        .expireAfterWrite(30, TimeUnit.DAYS)      // Configurable TTL
        .recordStats()                            // Enable cache hit/miss metrics
        .build();
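
On top of this cache, a get-or-create lookup per job ID could be as simple as the following (method names are illustrative; only the Guava calls are real):

import java.util.concurrent.ExecutionException;

// Get or create the per-job status when the first reload message for a job arrives.
ReloadJobStatus getOrCreateJobStatus(String jobId) throws ExecutionException {
  return cache.get(jobId, ReloadJobStatus::new);
}

// Lookup for the status API; null means the job is unknown or its entry has expired.
ReloadJobStatus getJobStatus(String jobId) {
  return cache.getIfPresent(jobId);
}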

Configuration

# Server configuration
pinot.server.table.reload.status.cache.max.size=10000
pinot.server.table.reload.status.cache.ttl.days=30
pinot.server.table.reload.status.stats.ema.alpha=0.1

Thread Safety

  • Guava Cache protects job-level access
  • Synchronized methods in ReloadJobStatus for updates
  • ConcurrentHashMap for failed segments map
  • Atomic statistics updates using synchronized recordSuccess/recordFailure

Exponential Moving Average

  • Configurable alpha parameter (default: 0.1)
  • Formula: newAvg = alpha × newDuration + (1 - alpha) × oldAvg
  • Balances recent vs. historical performance
  • Single value, no history storage required
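
For example, with alpha = 0.1, a current average of 300 ms, and a new reload that takes 400 ms, the update gives newAvg = 0.1 × 400 + 0.9 × 300 = 310 ms.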

Benefits

  1. Improved Debugging: Full stack traces for failures queryable via API
  2. Memory Efficient: 60% less memory than per-segment tracking
  3. Better Monitoring: Programmatic access to failure details and performance metrics
  4. Operational Visibility: Clear insight into reload job health across the cluster
  5. Reduced MTTR: Faster incident diagnosis with accessible error details
  6. Performance Insights: Min/max/avg duration metrics for reload operations
  7. Scalable Design: Memory grows only with failures, not total segment count

Alternatives Considered

  1. Per-segment status tracking: Rejected due to memory overhead (99% waste for successful segments)
  2. ZooKeeper-based persistence: Rejected due to write overhead and ZK load concerns
  3. Compressed stack traces: Rejected for simplicity; full traces fit acceptable memory budget
  4. Optional cache: Rejected; making cache required simplifies code and improves reliability
  5. Status enum (PENDING/IN_PROGRESS): Rejected; only failures need tracking for debugging

Open Questions

  1. Should failed entries have longer TTL than successful ones for debugging?
  2. Should we expose cache admin APIs (clear, stats) for operational management?
  3. What metrics/alerts should be added for reload failure monitoring?
  4. Should EMA alpha be configurable per-table or global only?

References

  • Related PR: #17099 (Phase 1 implementation)
  • Design Principle: Memory-first with failure focus
  • Inspiration: Similar tracking in other distributed systems (Kubernetes Job status, Spark task tracking)

Community Feedback Welcome

We welcome feedback on:

  • Phase priorities and scope
  • API response format
  • Memory budget concerns
  • Success statistics usefulness (min/max/avg)
  • Alternative approaches
  • Additional use cases

Note: This is a phased enhancement designed to incrementally improve reload observability while maintaining backward compatibility and operational stability. Phase 1 provides immediate value with minimal risk and a dramatically reduced memory footprint by focusing on what operators actually need: detailed failure information and aggregate performance metrics.

suvodeep-pyne · Oct 31 '25 18:10