wrangler
Merging the full correct implementation of Feature/adding size time parsing
Byte Size and Time Duration Parsers with Aggregation Support
This PR introduces comprehensive support for parsing and aggregating byte size and time duration values in Wrangler, making it easier to work with file sizes and time measurements in data pipelines.
Key Features Added
1. Token Types and Parsing Classes
- Added BYTE_SIZE and TIME_DURATION token types to the TokenType enum
- Implemented the ByteSize class to parse values like "10KB", "1.5MB", "4.2GB"
- Implemented the TimeDuration class to parse values like "100ms", "2.5s", "1.5h"
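As a rough illustration of what parsing a value like "1.5MB" involves, the sketch below splits the input into a number and a unit with a regular expression, then scales by a binary (1024-based) multiplier. It is not the ByteSize class from this PR: the class name ByteSizeSketch, the parseBytes method, and the rounding to the nearest byte are assumptions made for illustration. The 1024 base is inferred from the usage examples, where "1.5GB" corresponds to 1610612736 bytes.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not the PR's ByteSize class.
final class ByteSizeSketch {
  // A number (optionally fractional), optional whitespace, then a unit.
  private static final Pattern SIZE =
      Pattern.compile("^\\s*(\\d+(?:\\.\\d+)?)\\s*([A-Za-z]+)\\s*$");

  // Binary multipliers (1 KB = 1024 B), assumed from the usage examples.
  private static final Map<String, Long> MULTIPLIERS = Map.of(
      "B", 1L,
      "KB", 1L << 10,
      "MB", 1L << 20,
      "GB", 1L << 30,
      "TB", 1L << 40,
      "PB", 1L << 50);

  static long parseBytes(String text) {
    if (text == null) {
      throw new IllegalArgumentException("Byte size is null");
    }
    Matcher m = SIZE.matcher(text);
    if (!m.matches()) {
      throw new IllegalArgumentException("Malformed byte size: " + text);
    }
    // Upper-case the unit so "kb", "Kb" and "KB" are treated alike.
    String unit = m.group(2).toUpperCase(Locale.ROOT);
    Long multiplier = MULTIPLIERS.get(unit);
    if (multiplier == null) {
      throw new IllegalArgumentException("Unknown byte unit: " + unit);
    }
    // Round fractional inputs to the nearest whole byte.
    return Math.round(Double.parseDouble(m.group(1)) * multiplier);
  }

  public static void main(String[] args) {
    System.out.println(parseBytes("10KB"));  // 10240
    System.out.println(parseBytes("1.5MB")); // 1572864
  }
}
```

Malformed inputs and unknown units are rejected with an IllegalArgumentException here; the exception type used by the actual implementation may differ.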
2. Unit Conversion Utilities
- Support converting between byte units (B, KB, MB, GB, TB, PB)
- Support converting between time units (ns, ms, s, m, h, d)
- Proper handling of fractional values (e.g., "1.5MB", "2.5s")
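One common way to convert between the time units listed above is to go through a single base unit. The sketch below uses nanoseconds as that base and keeps values as doubles so fractional inputs such as 2.5 survive the conversion. The class TimeUnitSketch and its convert method are hypothetical names chosen for this example and are not claimed to match the PR's utilities.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not the PR's conversion utility.
final class TimeUnitSketch {
  // Nanoseconds per unit; every conversion goes through this base.
  private static final Map<String, Long> NANOS_PER_UNIT = Map.of(
      "ns", 1L,
      "ms", 1_000_000L,
      "s", 1_000_000_000L,
      "m", 60L * 1_000_000_000L,
      "h", 3_600L * 1_000_000_000L,
      "d", 86_400L * 1_000_000_000L);

  // Converts value from fromUnit to toUnit, e.g. 2.5 "m" to "s".
  static double convert(double value, String fromUnit, String toUnit) {
    return value * nanosPer(fromUnit) / nanosPer(toUnit);
  }

  private static long nanosPer(String unit) {
    if (unit == null) {
      throw new IllegalArgumentException("Time unit is null");
    }
    // Lower-case the unit so "MS" and "ms" are treated alike.
    Long nanos = NANOS_PER_UNIT.get(unit.trim().toLowerCase(Locale.ROOT));
    if (nanos == null) {
      throw new IllegalArgumentException("Unknown time unit: " + unit);
    }
    return nanos;
  }

  public static void main(String[] args) {
    System.out.println(convert(2.5, "m", "s")); // 150.0
    System.out.println(convert(1.5, "h", "m")); // 90.0
  }
}
```

The same table-driven approach would work for byte units with 1024-based multipliers.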
3. Enhanced Directives
- Added the aggregate-stats directive for statistical analysis of byte sizes and time durations
- Comprehensive validation and error handling for malformed inputs
- Case-insensitive unit parsing for better usability
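To give a feel for the kind of summary aggregate-stats produces, the sketch below computes count, sum, min, max, and average over a set of sizes already parsed to bytes, using the JDK's LongSummaryStatistics. The class AggregateStatsSketch and its summarize method are hypothetical; the key names mirror those in the usage example, but the directive's actual implementation is not shown in this document and may work differently.

```java
import java.util.LinkedHashMap;
import java.util.LongSummaryStatistics;
import java.util.Map;
import java.util.stream.LongStream;

// Hypothetical helper for illustration; not the aggregate-stats directive.
final class AggregateStatsSketch {
  // Summarizes sizes that have already been parsed to bytes.
  static Map<String, Double> summarize(long... sizesInBytes) {
    if (sizesInBytes.length == 0) {
      throw new IllegalArgumentException("No values to aggregate");
    }
    LongSummaryStatistics s = LongStream.of(sizesInBytes).summaryStatistics();

    // LinkedHashMap keeps the keys in insertion order for readable output.
    Map<String, Double> out = new LinkedHashMap<>();
    out.put("count", (double) s.getCount());
    out.put("sum", (double) s.getSum());
    out.put("min", (double) s.getMin());
    out.put("max", (double) s.getMax());
    out.put("avg", s.getAverage());
    // Derived totals, assuming 1024-based units.
    out.put("sum_kb", s.getSum() / 1024.0);
    out.put("sum_mb", s.getSum() / (1024.0 * 1024.0));
    return out;
  }

  public static void main(String[] args) {
    // Three files of 1 KB, 2 KB and 3 KB.
    System.out.println(summarize(1024L, 2048L, 3072L));
  }
}
```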
4. Documentation and Testing
- Comprehensive unit tests for all new functionality
- Updated documentation with examples and usage guidelines
- JavaDoc for all public methods and classes
Usage Examples
Working with ByteSize in Directives
// Parse file sizes in a data set
parse-as-csv :data ","
// Aggregate statistics on file sizes
aggregate-stats :size :size_stats byte
// Results in size_stats containing:
// {
// "count": 1000,
// "sum": 2147483648, // Total bytes
// "min": 1024, // Smallest file (1KB)
// "max": 107374182, // Largest file (102.4MB)
// "avg": 2147484, // Average size in bytes
// "sum_kb": 2097152, // Total in KB
// "sum_mb": 2048, // Total in MB
// "sum_gb": 2, // Total in GB
// "units": { // Count of each unit found
// "KB": 250,
// "MB": 700,
// "GB": 50
// }
// }
// Filter based on size statistics
filter-row-if-true exp:{ size > size_stats.avg }
Working with ByteSize in Java
ByteSize size = new ByteSize("1.5GB");
// Get values in different units
long bytes = size.getBytes(); // 1610612736
double kb = size.getKilobytes(); // 1572864.0
double mb = size.getMegabytes(); // 1536.0
double gb = size.getGigabytes(); // 1.5
double tb = size.getTerabytes(); // 0.001464...
// Get original properties
String unit = size.getUnit(); // "GB"
double value = size.getNumericValue(); // 1.5
Working with TimeDuration in Java
TimeDuration duration = new TimeDuration("2.5m");
// Get values in different units
long nanos = duration.getNanos(); // 150000000000
double millis = duration.getMillis(); // 150000.0
double seconds = duration.getSeconds(); // 150.0
double minutes = duration.getMinutes(); // 2.5
double hours = duration.getHours(); // 0.041666...
// Get original properties
String unit = duration.getUnit(); // "m"
double value = duration.getNumericValue(); // 2.5