digest icon indicating copy to clipboard operation
digest copied to clipboard

draft approach to exporting hash methods directly

Open pearsonca opened this issue 1 year ago • 11 comments

This is a rough in idea for how {digest} might approach exposing access to the available hashing methods. The rough idea is:

  • want to make the methods directly available, with minimal overhead (which means the user accepting more of the burden of doing checks)
  • ... also for more basic R types directly
  • ... in a way that's easy to maintain / extend
  • ... and that can be incorporated in a backwards compatible way that eliminates code duplication

Marking this as draft, as the idea is to get a feel for the acceptability of this kind of approach before investing a fair chunk of refactor effort.

pearsonca avatar Sep 18 '24 18:09 pearsonca

N.B.: I'm not super-pumped about the macro-based approach. I like it conceptually, but the actual language support is dicey in C and macros seem like the best (of bad) ways to do it - this sort of approach is better in C++ and is more manageable with the linguistic support for classes / templates / etc.

Introducing C++ seems like a heavier overall lift in terms of refactor, however.

pearsonca avatar Sep 18 '24 18:09 pearsonca

Thanks for picking this back up. I have the ability to access C-level functions in a packages directly from other packages in a few places so this is surely something we can add. This PR is already a little on the long side so let's if we can it simply and in stages.

I put two really quick comments in but I need to come back to this with a bit more time.

eddelbuettel avatar Sep 18 '24 18:09 eddelbuettel

Thanks for picking this back up. I have the ability to access C-level functions in a packages directly from other packages in a few places so this is surely something we can add. This PR is already a little on the long side so let's if we can it simply and in stages.

I put two really quick comments in but I need to come back to this with a bit more time.

Thanks, will be on the lookout for more notes.

Re stages, I could see:

  • can isolate spliting digest.c => digest.h + digest.c, so that init.c can know about exportable methods
  • then introducing the macros for consistent signatures (even if we don't use Register_ALGO approach - tho, i think the main upside of that is being able to define how we'd name the algorithms for export in one place)
  • then cut over a single algorithm/input source (md5 + integer vec?) + getting testing right
  • then cut over the rest of the algorithms
  • then cut over more basic SEXP types

pearsonca avatar Sep 18 '24 20:09 pearsonca

That bulleted list was helpful. I had not clicked why we needed digest.h. What you wrote makes sense.

What were you thinking in terms of minimal input? Don't we always start from raw vector we get from the serializating function? Wouldn't that make it type-agnostic as we don't S3 dispatch in C code? Or do you want the S3 dispatchers each pick up a matching C function? That sounds ... like a lot more work.

eddelbuettel avatar Sep 18 '24 21:09 eddelbuettel

What were you thinking in terms of minimal input? Don't we always start from raw vector we get from the serializating function? Wouldn't that make it type-agnostic as we don't S3 dispatch in C code? Or do you want the S3 dispatchers each pick up a matching C function? That sounds ... like a lot more work.

I think the serialization is a potential opportunity to reduce overhead, but could be wrong! Seems like it should be cheaper to use an int* directly cast as an unsigned char* than to serialize the SEXP contents => use that unsigned char*. But if serialization is 1) doable on the C side (seems likely) and 2) small incremental benefit for the raw types (more skeptical on this), then the reduced engineering + portability gain seems a reasonable tradeoff.

pearsonca avatar Sep 18 '24 21:09 pearsonca

See package RApiSerialize which I use in package RcppRedis to turn any R expression, variable, ... handed down as SEXP into raw bytes (which Redis stores, we would hash them). Not going through the R side for two calls should help. We could cover the most common parameterisation as a start.

eddelbuettel avatar Sep 18 '24 21:09 eddelbuettel

Got it, will have a look - but so should expect just shifting serialization to the C side will yield useful gain? Definitely much easier engineering lift.

(and can be its own isolated improvement to the current digest / vdigests)

edit: though one advantage to breaking out the base types would be increased practicality for comparison to independent tools? possibly not worth the proverbial squeeze.

pearsonca avatar Sep 18 '24 21:09 pearsonca

Well that is what I was thinking we should have a 'demo client' too. :grinning: But as I understand it your use case is (many, small, frequent) calls to digest from the level of C code ? In that case we want the 'digest me this "thing"' approach.

It remains to be seen if we then need some of the convenience of Rcpp for conversion from C/C++ types to SEXP. For now, and for (say) and int or double vector we can do by hand, no?

eddelbuettel avatar Sep 18 '24 22:09 eddelbuettel

Roughly correct re my use case of interest, but the library I'm building also provides flexibility to hash somewhat arbitrary (though always small) objects. The users should be providing a small distinguishing key, but that needn't be a series of integers - but one choice they should be able to make to improve how the library works for them is to switch to (e.g.) using a series of integers.

pearsonca avatar Sep 18 '24 22:09 pearsonca

"Objects" are still SEXP-representable. If you can call digest::digest()) on it we should be able to do it at the C level too, with fewer checks and faster. Or maybe I miss something?

eddelbuettel avatar Sep 19 '24 01:09 eddelbuettel

"Objects" are still SEXP-representable. If you can call digest::digest()) on it we should be able to do it at the C level too, with fewer checks and faster. Or maybe I miss something?

I think that's right, yes.

Alright, I'm going to proceed a bit on the bullet list above, with the tweaks subsequently discussed. That'll come down in separate PRs, but going to leave this one drafted until that series is resolved.

pearsonca avatar Sep 19 '24 06:09 pearsonca

This got a little stale so I think it is best to close this. I am going to make a maintenance release in the next few days, I would be up for calmly reviewing where we're at and where we'd want to take the C interface. (For what it is worth I just helped a user of the my related crc32c repo with an example usage package but as it turns out he was equally well served just calling digest(x, algo="crc32c", serialize=FALSE) ... It all depends. We can offer more at the C level and should find a way.

eddelbuettel avatar Oct 04 '25 01:10 eddelbuettel