Chris Crutchfield

9 comments by Chris Crutchfield

Hi @andylou2, I believe it is `RunStatsGenerators/GenerateSlicedStatisticsImpl/RunCombinerStatsGenerators/CombinePerKey`. I did try using `experimental_use_sketch_based_topk_uniques`, but unfortunately it still ran out of memory. I will try running it again with some of the...

I've tried a bunch of different things to try to fix this issue on my end, but haven't had too much success. The only thing that consistently seems to work...

The support case number is 28706271, and the case includes some more details. I can cc people on the ticket if need be.

Can you clarify whether there are any workarounds for this issue? We are pretty constantly seeing this OOM issue with 1.2.0. The only solution seems to be to change `StatsOptions.sample_rate`...

Hi @rcrowe-google, thanks for the heads up. Could you point to the specific changes or commits that should help with OOM errors? Or let me know if there are any...

I tried TFDV 1.4.0 and compared it to 1.2.0 on a relatively small dataset (~100M rows, ~5k columns) and still ran into OOM issues on Dataflow with `n2-highmem-64` machine types for...

Thanks for the prompt response, @ppwwyyxx. I think the key difference is that for `interleave` the input `map_func` is described as "A function mapping a dataset *element* to a *dataset*",...

Right, I think the utility here would be for very large files (shards of some large dataset), most likely hosted on some sort of cloud storage (S3 or GCS), such...

@ppwwyyxx as a stopgap, any suggestions for how I might go about trying to implement this myself? Just make an analogous class to `_ParallelMapData` that interleaves instead of maps?
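
To make the interleave semantics discussed above concrete, here is a minimal, single-threaded sketch in plain Python. It is not tensorpack's API and not an analogue of `_ParallelMapData`'s threading; the function name `interleave` and the `cycle_length` parameter are hypothetical names chosen for illustration (the latter borrowed loosely from `tf.data`). It only shows the core idea: `map_func` turns each source element (for example, a shard path) into its own iterable, and elements are drawn round-robin from a bounded number of open iterables. The exact output order may differ from `tf.data` in edge cases.

```python
from collections import deque
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def interleave(
    source: Iterable[T],
    map_func: Callable[[T], Iterable[U]],
    cycle_length: int = 2,
) -> Iterator[U]:
    """Round-robin over at most `cycle_length` sub-iterables opened from `source`.

    Hypothetical helper for illustration only; it runs in a single thread.
    """
    if cycle_length < 1:
        raise ValueError("cycle_length must be >= 1")

    pending = iter(source)
    # Open the first `cycle_length` sub-iterables (e.g. the first few shards).
    active = deque(iter(map_func(x)) for x in islice(pending, cycle_length))

    while active:
        current = active.popleft()
        try:
            item = next(current)
        except StopIteration:
            # This sub-iterable is exhausted: open the next source element,
            # if there is one, so the number of open iterables stays bounded.
            for x in pending:
                active.append(iter(map_func(x)))
                break
            continue
        # Still has data: move it to the back of the rotation.
        active.append(current)
        yield item
```

In a real pipeline, `map_func` would be whatever opens one shard and yields its records, and a parallel version would additionally need worker threads or processes pulling from the open iterables, which this sketch deliberately leaves out.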