
feat: Fix and Add byte size and time duration parsers with aggregation

Open • smresponsibilities opened this issue 10 months ago • 0 comments

Byte Size and Time Duration Parsers with Aggregation Support

This PR introduces comprehensive support for parsing and aggregating byte size and time duration values in Wrangler, making it easier to work with file sizes and time measurements in data pipelines.

Key Features Added

1. Token Types and Parsing Classes

  • Added BYTE_SIZE and TIME_DURATION token types to the TokenType enum
  • Implemented ByteSize class to parse values like "10KB", "1.5MB", "4.2GB"
  • Implemented TimeDuration class to parse values like "100ms", "2.5s", "1.5h"
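To make the parsing step concrete, here is a minimal sketch of how a string such as "1.5GB" could be turned into a byte count. It is not the PR's ByteSize class: the class name ByteSizeSketch, the method parseBytes, and the regular expression are hypothetical, and the 1024-based multipliers are an assumption inferred from the usage examples (1.5GB yields 1610612736 bytes).

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; not the PR's ByteSize class. */
final class ByteSizeSketch {
  // A non-negative number (optionally fractional) followed by a unit, e.g. "1.5MB".
  private static final Pattern FORMAT =
      Pattern.compile("^\\s*(\\d+(?:\\.\\d+)?)\\s*([A-Za-z]+)\\s*$");

  // Assumption: 1024-based multipliers, consistent with 1.5GB = 1610612736 bytes.
  private static final Map<String, Long> MULTIPLIERS = Map.of(
      "B", 1L,
      "KB", 1L << 10,
      "MB", 1L << 20,
      "GB", 1L << 30,
      "TB", 1L << 40,
      "PB", 1L << 50);

  /** Parses text such as "10KB" or "1.5gb" into a number of bytes. */
  static long parseBytes(String text) {
    if (text == null) {
      throw new IllegalArgumentException("Byte size is null");
    }
    Matcher matcher = FORMAT.matcher(text);
    if (!matcher.matches()) {
      throw new IllegalArgumentException("Malformed byte size: " + text);
    }
    // Upper-casing the unit gives case-insensitive matching.
    String unit = matcher.group(2).toUpperCase(Locale.ROOT);
    Long multiplier = MULTIPLIERS.get(unit);
    if (multiplier == null) {
      throw new IllegalArgumentException("Unknown byte unit: " + unit);
    }
    // Fractional inputs are rounded to the nearest whole byte.
    return Math.round(Double.parseDouble(matcher.group(1)) * multiplier);
  }
}
```

Rejecting unknown units with an exception, rather than defaulting to bytes, keeps malformed inputs from silently skewing later aggregates.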

2. Unit Conversion Utilities

  • Support converting between byte units (B, KB, MB, GB, TB, PB), using 1024-based multiples (1.5GB is 1610612736 bytes)
  • Support converting between time units (ns, ms, s, m, h, d)
  • Proper handling of fractional values (e.g., "1.5MB", "2.5s")
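One way to picture the time-unit conversion is to route every value through nanoseconds. The sketch below does that with the JDK's TimeUnit constants; the class name TimeDurationSketch and the methods nanosPerUnit and convert are hypothetical, and the PR's own TimeDuration class may be structured differently.

```java
import java.util.Locale;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for illustration; not the PR's TimeDuration class. */
final class TimeDurationSketch {
  /** Returns how many nanoseconds one of the given unit contains. */
  static long nanosPerUnit(String unit) {
    // Lower-casing the unit gives case-insensitive matching.
    switch (unit.toLowerCase(Locale.ROOT)) {
      case "ns": return 1L;
      case "ms": return TimeUnit.MILLISECONDS.toNanos(1);
      case "s":  return TimeUnit.SECONDS.toNanos(1);
      case "m":  return TimeUnit.MINUTES.toNanos(1);
      case "h":  return TimeUnit.HOURS.toNanos(1);
      case "d":  return TimeUnit.DAYS.toNanos(1);
      default:
        throw new IllegalArgumentException("Unknown time unit: " + unit);
    }
  }

  /** Converts a possibly fractional value from one unit to another. */
  static double convert(double value, String fromUnit, String toUnit) {
    return value * nanosPerUnit(fromUnit) / nanosPerUnit(toUnit);
  }
}
```

For example, TimeDurationSketch.convert(2.5, "m", "s") gives 150.0, which matches the "2.5m" usage example.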

3. Enhanced Directives

  • Added aggregate-stats directive for statistical analysis of byte sizes and time durations
  • Comprehensive validation and error handling for malformed inputs
  • Case-insensitive unit parsing for better usability
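The statistics that aggregate-stats reports for byte sizes can be approximated with the JDK's LongSummaryStatistics. This is a sketch only: the class AggregateStatsSketch and the method summarize are hypothetical, the input is assumed to be already converted to bytes, and the field names are borrowed from the sample output in the usage example rather than from the directive's implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.Map;

/** Hypothetical stand-in for illustration; not the aggregate-stats directive. */
final class AggregateStatsSketch {
  /** Summarizes sizes that have already been converted to bytes. */
  static Map<String, Object> summarize(List<Long> sizesInBytes) {
    LongSummaryStatistics stats =
        sizesInBytes.stream().mapToLong(Long::longValue).summaryStatistics();

    // LinkedHashMap keeps the fields in insertion order for readable output.
    Map<String, Object> result = new LinkedHashMap<>();
    result.put("count", stats.getCount());
    result.put("sum", stats.getSum());
    if (stats.getCount() > 0) {
      // min, max, and avg are only meaningful for non-empty input.
      result.put("min", stats.getMin());
      result.put("max", stats.getMax());
      result.put("avg", Math.round(stats.getAverage()));
    }
    // Assumption: 1024-based units, as in the usage examples.
    result.put("sum_kb", stats.getSum() / 1024.0);
    result.put("sum_mb", stats.getSum() / (1024.0 * 1024.0));
    result.put("sum_gb", stats.getSum() / (1024.0 * 1024.0 * 1024.0));
    return result;
  }
}
```

Aggregating on a single base unit (bytes here) avoids mixing KB, MB, and GB values in the same sum.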

4. Documentation and Testing

  • Comprehensive unit tests for all new functionality
  • Updated documentation with examples and usage guidelines
  • JavaDoc for all public methods and classes

Usage Examples

Working with ByteSize in Directives

// Parse the raw :data column as comma-separated values
parse-as-csv :data ","

// Aggregate statistics on file sizes
aggregate-stats :size :size_stats byte

// Results in size_stats containing:
// {
//   "count": 1000,
//   "sum": 2147483648,  // Total bytes
//   "min": 1024,        // Smallest file (1KB)
//   "max": 107374182,   // Largest file (102.4MB)
//   "avg": 2147484,     // Average size in bytes
//   "sum_kb": 2097152,  // Total in KB
//   "sum_mb": 2048,     // Total in MB
//   "sum_gb": 2,        // Total in GB
//   "units": {          // Count of each unit found
//     "KB": 250,
//     "MB": 700,
//     "GB": 50
//   }
// }

// Filter based on size statistics
filter-row-if-true exp:{ size > size_stats.avg }

ByteSize size = new ByteSize("1.5GB");

// Get values in different units
long bytes = size.getBytes();         // 1610612736
double kb = size.getKilobytes();      // 1572864.0
double mb = size.getMegabytes();      // 1536.0
double gb = size.getGigabytes();      // 1.5
double tb = size.getTerabytes();      // 0.001464...

// Get original properties
String unit = size.getUnit();         // "GB"
double value = size.getNumericValue(); // 1.5

TimeDuration duration = new TimeDuration("2.5m");

// Get values in different units
long nanos = duration.getNanos();     // 150000000000
double millis = duration.getMillis(); // 150000.0
double seconds = duration.getSeconds(); // 150.0
double minutes = duration.getMinutes(); // 2.5
double hours = duration.getHours();   // 0.041666...

// Get original properties (names differ from the ByteSize example to avoid redeclaration)
String durationUnit = duration.getUnit();     // "m"
double durationValue = duration.getNumericValue(); // 2.5

smresponsibilities, Apr 13 '25 18:04