
[Feature Request] Enhanced Segment Reload Status Tracking with Failure Details

Status: Open · suvodeep-pyne opened this issue 3 months ago · 0 comments

Problem Statement

Current Limitation

The existing segment reload status API (GET /segments/segmentReloadStatus/{jobId}) determines reload success with a timestamp-based heuristic: a segment is counted as reloaded if segmentLoadTimeMs >= jobSubmissionTime. This approach has several limitations:

  1. Cannot distinguish between states: Pending, in-progress, and failed segments all appear as "not successful"
  2. No failure visibility: When reloads fail, errors are logged but not queryable via API
  3. No error details: Users must access pod logs to understand why segments failed to reload
  4. False positives: Segments reloaded for unrelated reasons may be incorrectly counted as successful
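
To make the heuristic concrete, the current check amounts to nothing more than the comparison below (a minimal sketch; parameter names are illustrative, not the actual controller code):

static boolean looksReloaded(long segmentLoadTimeMs, long jobSubmissionTimeMs) {
  // A segment reloaded for an unrelated reason (e.g. a server restart) also
  // passes this check, while pending, in-progress, and failed segments are
  // indistinguishable: all of them simply read as "not successful".
  return segmentLoadTimeMs >= jobSubmissionTimeMs;
}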

Impact

  • Debugging difficulty: Users cannot determine which segments failed or why without log access
  • Monitoring gaps: No programmatic way to alert on reload failures
  • Incomplete observability: Controllers and clients lack visibility into reload operation health

Proposed Solution

Implement an in-memory status cache on Pinot servers to track reload failures with complete error details and aggregate success statistics, enabling comprehensive reload observability through the existing API.

Design Principles

  1. Memory-efficient: Store only failures individually, aggregate statistics for successes
  2. Failure-focused: Complete debugging context (stack traces) for failures only
  3. Scalable: Memory grows O(failures) not O(total_segments)
  4. Thread-safe: Handle concurrent reload operations safely
  5. Backward compatible: Enhance existing APIs without breaking changes
  6. Operationally friendly: Configurable limits and auto-cleanup

Implementation Plan

Phase 1: Failure Tracking with Aggregate Success Metrics (PR #17099)

Status: In review

Scope:

  • In-memory cache tracking failures per reload job with full stack traces
  • Aggregate success statistics (count, min/max/avg duration)
  • Job ID preservation in Helix messages
  • Enhanced server API returning failure details and performance metrics
  • Controller aggregation of results

Data Model:

class ReloadJobStatus {
  // Only failed segments stored individually
  Map<String, SegmentFailureInfo> failedSegments;
  
  // Success metrics aggregated (fixed 28 bytes)
  ReloadSuccessStats successStats;
}

class SegmentFailureInfo {
  String segmentName;
  String message;        // Exception message
  String stackTrace;     // Full stack trace for debugging
}

class ReloadSuccessStats {
  int successCount;
  long minDurationMs;
  long maxDurationMs;
  double avgDurationMs;  // Exponential moving average
}
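
A rough sketch of how these classes might record outcomes is shown below. The recordSuccess/recordFailure names follow the Thread Safety section later in this proposal; everything else (field names, the assumed SegmentFailureInfo constructor) is illustrative, not the code in PR #17099.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class ReloadJobStatus {
  // Only failures are stored individually; successes collapse into one stats object.
  private final Map<String, SegmentFailureInfo> _failedSegments = new ConcurrentHashMap<>();
  private final ReloadSuccessStats _successStats = new ReloadSuccessStats();

  void recordFailure(String segmentName, String message, String stackTrace) {
    // Assumes a (segmentName, message, stackTrace) constructor on SegmentFailureInfo.
    _failedSegments.put(segmentName, new SegmentFailureInfo(segmentName, message, stackTrace));
  }

  void recordSuccess(long durationMs) {
    _successStats.record(durationMs);
  }
}

class ReloadSuccessStats {
  private static final double ALPHA = 0.1;  // EMA weight; configurable in the proposal
  private int _successCount;
  private long _minDurationMs = Long.MAX_VALUE;
  private long _maxDurationMs;
  private double _avgDurationMs;

  synchronized void record(long durationMs) {
    _successCount++;
    _minDurationMs = Math.min(_minDurationMs, durationMs);
    _maxDurationMs = Math.max(_maxDurationMs, durationMs);
    // newAvg = alpha * newDuration + (1 - alpha) * oldAvg
    _avgDurationMs = _successCount == 1 ? durationMs : ALPHA * durationMs + (1 - ALPHA) * _avgDurationMs;
  }
}

Collapsing successes into a single stats object is what keeps the per-job overhead near the quoted fixed 28 bytes, no matter how many segments reload cleanly.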

API Enhancement:

{
  "jobId": "uuid",
  "successStats": {
    "successCount": 97,
    "minDurationMs": 45,
    "maxDurationMs": 1250,
    "avgDurationMs": 312.5
  },
  "failedSegments": [
    {
      "segmentName": "seg2",
      "message": "Connection timeout after 30 seconds",
      "stackTrace": "java.io.IOException: Connection timeout...\n  at ..."
    }
  ],
  "totalSegments": 100,
  "failureCount": 3
}

Memory Budget (significantly optimized):

  • Typical case (1% failure): 88 MB for 10,000 jobs (vs. 230 MB with per-segment tracking)
  • Best case (0% failure): 4.4 MB for 10,000 jobs
  • Worst case (10% failure): 805 MB for 10,000 jobs
  • Memory reduction: 60% less than per-segment status tracking
  • Scalability: Grows only with failures, not total segment count

Key Optimization:

  • No storage for successful segments (99% of cases)
  • Only aggregate statistics: 28 bytes fixed overhead regardless of success count
  • 100 successful segments: 28 bytes vs. 12 KB = 99.8% reduction

Current PR: #17099


Phase 2: Complete Failure Path Coverage (Planned)

Scope:

  • Fix gaps in failure tracking coverage:
    • Single segment reload path (currently untracked)
    • Config fetch failures (occurs before instrumented try-catch)
    • Semaphore acquire failures (partial coverage)
  • Add safety net at message handler level to catch all failure types
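
As a rough illustration of the safety-net idea, a wrapper at the message-handler level might look like the following (handler, reload, and lookup method names are hypothetical, not the actual Pinot APIs):

import org.apache.commons.lang3.exception.ExceptionUtils;

// Hypothetical safety net: any exception that escapes the reload path is
// recorded against the job, so config-fetch and semaphore failures are no
// longer visible only in server logs.
void handleReloadMessage(String jobId, String segmentName) throws Exception {
  ReloadJobStatus status = getOrCreateJobStatus(jobId);  // assumed cache-backed helper
  long startMs = System.currentTimeMillis();
  try {
    reloadSegment(segmentName);                          // existing reload path (placeholder)
    status.recordSuccess(System.currentTimeMillis() - startMs);
  } catch (Exception e) {
    status.recordFailure(segmentName, e.getMessage(), ExceptionUtils.getStackTrace(e));
    throw e;                                             // preserve existing Helix error handling
  }
}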

Coverage Matrix (Current State):

Failure Type      | Single Segment | Batch Reload   | Coverage
Config fetch      | ❌ Not tracked | ❌ Not tracked | 0%
Semaphore acquire | ❌ Not tracked | ✅ Tracked     | Partial
Download failures | ❌ Not tracked | ✅ Tracked     | Partial
Index building    | ❌ Not tracked | ✅ Tracked     | Partial

Phase 3: Enhanced Statistics and Metrics (Future)

Scope:

  • Additional performance metrics (percentiles, histograms)
  • Failure categorization by error type
  • Retry tracking for failed segments
  • Time-series trends for reload performance

Design Details

Cache Structure

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

Cache<String, ReloadJobStatus> cache =
    CacheBuilder.newBuilder()
        .maximumSize(10000)                       // Configurable (max tracked jobs)
        .expireAfterWrite(30, TimeUnit.DAYS)      // Configurable TTL
        .recordStats()                            // Enable cache hit/miss metrics
        .build();
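
On top of this cache, a get-or-create lookup per job ID could be as simple as the following (method names are illustrative; only the Guava calls are real):

import java.util.concurrent.ExecutionException;

// Get or create the per-job status when the first reload message for a job arrives.
ReloadJobStatus getOrCreateJobStatus(String jobId) throws ExecutionException {
  return cache.get(jobId, ReloadJobStatus::new);
}

// Lookup for the status API; null means the job is unknown or its entry has expired.
ReloadJobStatus getJobStatus(String jobId) {
  return cache.getIfPresent(jobId);
}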

Configuration

# Server configuration
pinot.server.table.reload.status.cache.max.size=10000
pinot.server.table.reload.status.cache.ttl.days=30
pinot.server.table.reload.status.stats.ema.alpha=0.1

Thread Safety

  • Guava Cache protects job-level access
  • Synchronized methods in ReloadJobStatus for updates
  • ConcurrentHashMap for failed segments map
  • Atomic statistics updates using synchronized recordSuccess/recordFailure

Exponential Moving Average

  • Configurable alpha parameter (default: 0.1)
  • Formula: newAvg = alpha × newDuration + (1 - alpha) × oldAvg
  • Balances recent vs. historical performance
  • Single value, no history storage required
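
For example, with alpha = 0.1, a current average of 300 ms, and a new reload that takes 400 ms, the update gives newAvg = 0.1 × 400 + 0.9 × 300 = 310 ms.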

Benefits

  1. Improved Debugging: Full stack traces for failures queryable via API
  2. Memory Efficient: 60% less memory than per-segment tracking
  3. Better Monitoring: Programmatic access to failure details and performance metrics
  4. Operational Visibility: Clear insight into reload job health across the cluster
  5. Reduced MTTR: Faster incident diagnosis with accessible error details
  6. Performance Insights: Min/max/avg duration metrics for reload operations
  7. Scalable Design: Memory grows only with failures, not total segment count

Alternatives Considered

  1. Per-segment status tracking: Rejected due to memory overhead (99% waste for successful segments)
  2. ZooKeeper-based persistence: Rejected due to write overhead and ZK load concerns
  3. Compressed stack traces: Rejected for simplicity; full traces fit acceptable memory budget
  4. Optional cache: Rejected; making cache required simplifies code and improves reliability
  5. Status enum (PENDING/IN_PROGRESS): Rejected; only failures need tracking for debugging

Open Questions

  1. Should failed entries have longer TTL than successful ones for debugging?
  2. Should we expose cache admin APIs (clear, stats) for operational management?
  3. What metrics/alerts should be added for reload failure monitoring?
  4. Should EMA alpha be configurable per-table or global only?

References

  • Related PR: #17099 (Phase 1 implementation)
  • Design Principle: Memory-first with failure focus
  • Inspiration: Similar tracking in other distributed systems (Kubernetes Job status, Spark task tracking)

Community Feedback Welcome

We welcome feedback on:

  • Phase priorities and scope
  • API response format
  • Memory budget concerns
  • Success statistics usefulness (min/max/avg)
  • Alternative approaches
  • Additional use cases

Note: This is a phased enhancement designed to incrementally improve reload observability while maintaining backward compatibility and operational stability. Phase 1 provides immediate value with minimal risk and a dramatically reduced memory footprint by focusing on what operators actually need: detailed failure information and aggregate performance metrics.

suvodeep-pyne · Oct 31 '25 18:10