mmaping files into bytevectors?

Open ecraven opened this issue 6 years ago • 8 comments

Is there a way to mmap some memory (a file, for example; the mmap call itself is simple to do via the FFI) and use that memory as a bytevector without copying? Some sort of function that sets a bytevector's storage and locks it, so the GC doesn't touch it?

ecraven avatar Jan 05 '20 18:01 ecraven

Simply put, no.

A bytevector is a tagged pointer that designates a block of memory. That block consists of an 8-byte header followed directly by the bytevector contents. It would be cool if there were a way to refer to a memory region indirectly, but that's not how it's done. So "mmapping" a bytevector would require mmapping a memory block, somehow using the 8 preceding bytes to store the header, and then forming a properly tagged pointer. I don't know how to do this reliably, but maybe you do.

Jarhmander avatar Feb 04 '21 21:02 Jarhmander

That is not supported because of the GC; see https://github.com/cisco/ChezScheme/issues/414.

amirouche avatar Feb 08 '21 06:02 amirouche

@ecraven please close the issue: given https://github.com/cisco/ChezScheme/issues/414#issuecomment-498920960, that will not be implemented (even though I am also interested in that feature).

One way to work around the absence of that feature is to create procedures memoryview-ref et al. that mimic the bytevector procedures with the help of foreign-ref.
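As a rough illustration of that workaround, here is a minimal sketch for the u8 case. The `memoryview` record, its fields, and the procedure names are hypothetical (they are not part of Chez Scheme); only `foreign-ref`, `foreign-set!`, `foreign-alloc`, and `foreign-free` are existing procedures. The memory here comes from `foreign-alloc` purely to keep the example self-contained; an address returned by mmap would be used the same way.

```scheme
(import (chezscheme))

;; Hypothetical record: a raw address plus a length used for bounds checks.
(define-record-type memoryview
  (fields address length))

;; Hypothetical helper: signal an error unless k is a valid index into mv.
(define (memoryview-check-index who mv k)
  (unless (and (fixnum? k) (fx<= 0 k) (fx< k (memoryview-length mv)))
    (assertion-violation who "index out of range" k)))

;; Analogue of bytevector-u8-ref for foreign memory.
(define (memoryview-u8-ref mv k)
  (memoryview-check-index 'memoryview-u8-ref mv k)
  (foreign-ref 'unsigned-8 (memoryview-address mv) k))

;; Analogue of bytevector-u8-set! for foreign memory.
(define (memoryview-u8-set! mv k byte)
  (memoryview-check-index 'memoryview-u8-set! mv k)
  (foreign-set! 'unsigned-8 (memoryview-address mv) k byte))

;; Usage: wrap 16 bytes of foreign memory, write one byte, read it back.
(define mv (make-memoryview (foreign-alloc 16) 16))
(memoryview-u8-set! mv 0 255)
(memoryview-u8-ref mv 0) ; => 255
(foreign-free (memoryview-address mv))
```

Because the view holds a plain address, the collector never sees or moves the underlying memory; the cost is that freeing or unmapping it is the caller's responsibility.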

amirouche avatar Feb 08 '21 06:02 amirouche

I tried something else related to this: I allocate a bytevector, then expose it to C with #%$object-address:

#!chezscheme
(library (blake3)
  (export blake3 make-blake3 blake3-update! blake3-finalize!)
  (import (chezscheme))

  (define libblake3 (load-shared-object "local/lib/libblake3.so"))


  (define-syntax define-syntax-rule
    (syntax-rules ()
      ((define-syntax-rule (keyword args ...) body)
       (define-syntax keyword
         (syntax-rules ()
           ((keyword args ...) body))))))

  (define (bytevector->pointer bv)
    (#%$object-address bv (+ (foreign-sizeof 'void*) 1)))

  (define-syntax-rule (foreign-procedure* return ptr args ...)
    (foreign-procedure __collect_safe ptr (args ...) return))


  (define-syntax-rule (with-lock obj body ...)
    (begin (lock-object obj)
           (call-with-values (lambda () body ...)
             (lambda out
               (unlock-object obj)
               (apply values out)))))

    (define-ftype %keyvalue
      (packed (struct
               (key void*)
               (key-length int)
               (value void*)
               (value-length int))))

    (define blake3-hasher-init
      (let ((func (foreign-procedure* void "blake3_hasher_init" void*)))
        (lambda (blake3)
          (func blake3))))

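    (define (make-blake3)
      (define bv (make-bytevector 1912)) ;; sizeof blake3_hasher
      ;; The returned address is only valid while bv is locked, so bv is
      ;; locked here and deliberately never unlocked (it is leaked).
      (lock-object bv)
      (blake3-hasher-init (bytevector->pointer bv))
      (bytevector->pointer bv))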

    (define blake3-update!
      (let ((func (foreign-procedure* void "blake3_hasher_update" void* void* size_t)))
        (lambda (blake3 bytevector)
          (with-lock bytevector
                     (func blake3
                           (bytevector->pointer bytevector)
                           (bytevector-length bytevector))))))

    (define blake3-finalize!
      (let ((func (foreign-procedure* void "blake3_hasher_finalize" void* void* size_t)))
        (lambda (blake3)
          (define bytevector (make-bytevector 32))
          (with-lock bytevector
                     (func blake3 (bytevector->pointer bytevector) 32))
          bytevector)))

    (define (blake3 bytevector)
      (define hasher (make-blake3))
      (blake3-update! hasher bytevector)
      (blake3-finalize! hasher)))

It works in cases where Scheme knows the size, or the maximum size, of the bytevector, such as in make-blake3.
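One caveat with the with-lock macro above: if the body exits non-locally (an error, or an escaping continuation), the object stays locked. A possible variant, sketched here under the hypothetical name `with-locked-object`, uses `dynamic-wind` so the unlock runs on every exit. It relies on the fact that Chez Scheme counts locks, so each `lock-object` is balanced by exactly one `unlock-object`.

```scheme
(import (chezscheme))

;; Hypothetical variant of with-lock: the object is unlocked on normal
;; return, on error, and on continuation escapes alike.
(define-syntax with-locked-object
  (syntax-rules ()
    ((_ obj body0 body ...)
     (let ((o obj)) ; evaluate obj only once
       (dynamic-wind
         (lambda () (lock-object o))
         (lambda () body0 body ...)
         (lambda () (unlock-object o)))))))

;; Usage: the body's values are returned unchanged.
(define bv (make-bytevector 32 0))
(with-locked-object bv
  (bytevector-u8-set! bv 0 7)
  (bytevector-u8-ref bv 0)) ; => 7
```

Unlike the original macro, this does not need `call-with-values` to pass multiple values through, since `dynamic-wind` returns whatever its body thunk returns.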

amirouche avatar Mar 11 '21 12:03 amirouche

I did a small benchmark again:

mmap:
(time (count (file-generator/mmap "CC-MAIN-20180618105733-20180618125538-00026.warc.wet")))
    no collections
    13.565826164s elapsed cpu time
    13.566017044s elapsed real time
    112 bytes allocated
369071864

port:
(time (count (file-generator/port "CC-MAIN-20180618105733-20180618125538-00026.warc.wet")))
    1 collection
    0.866473516s elapsed cpu time, including 0.000115993s collecting
    0.866486155s elapsed real time, including 0.000118302s collecting
    4334832 bytes allocated, including 8258736 bytes reclaimed
369071864

amirouche avatar Mar 24 '21 04:03 amirouche

Also, I read about mmap. It seems like this question does not make sense.

I think the next best thing to do if you need to peek into a (binary) file is to use read, somewhat like:

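;; bytevector->pointer is the helper defined in the blake3 library above.
(define file-read
  (let ((proc (foreign-procedure __collect_safe
                                 "read" (int void* size_t) ssize_t)))
    (lambda (fd bv size)
      ;; Lock bv so the collector cannot move it during the foreign call.
      (lock-object bv)
      (let ((out (proc fd (bytevector->pointer bv) size)))
        (unlock-object bv)
        out))))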

In that case there is no extra copy.

amirouche avatar Mar 24 '21 04:03 amirouche

Like I wrote on IRC, the other thing that would be possible, and that may prove useful for any memory region that is brought up by the C side, is a set of procedures memoryregion-foobar that mimic bytevector-foobar.

amirouche avatar Mar 24 '21 04:03 amirouche

It is worth noting that it would be fairly easy to adapt the foreign-ref and foreign-set! functions to provide the interface you describe.

For instance:

| bytevector | foreign-ref equivalent |
| --- | --- |
| (bytevector-u32-native-ref bv 0) | (foreign-ref 'unsigned-32 mm 0) |
| (bytevector-ieee-double-native-ref bv 0) | (foreign-ref 'double mm 0) |
| (bytevector-s56-native-ref bv 0) | (foreign-ref 'integer-56 mm 0) |
| etc. | |

| bytevector | foreign-set! equivalent |
| --- | --- |
| (bytevector-u32-native-set! bv 0 10) | (foreign-set! 'unsigned-32 mm 0 10) |
| (bytevector-ieee-double-native-set! bv 0 8.75) | (foreign-set! 'double mm 0 8.75) |
| (bytevector-s56-native-set! bv 0 75) | (foreign-set! 'integer-56 mm 0 75) |
| etc. | |

There are a few subtle differences here. For instance, foreign-ref and foreign-set! are more permissive in their offsets, which do not need to be a multiple of the type size as the bytevector versions require. The foreign-ref and foreign-set! functions also do not support specifying endianness, but if this were desired it would be fairly straightforward to implement versions that take endianness into account.
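To make the endianness point concrete, here is one possible sketch for the unsigned 32-bit case. The names `foreign-u32-ref` and `foreign-u32-set!` are hypothetical, and assembling the value byte by byte is just one straightforward approach, not necessarily how a built-in version would be implemented.

```scheme
(import (chezscheme))

;; Hypothetical: read an unsigned 32-bit integer at address + offset,
;; interpreting the four bytes as 'little or 'big endian.
(define (foreign-u32-ref address offset eness)
  (let ((b0 (foreign-ref 'unsigned-8 address offset))
        (b1 (foreign-ref 'unsigned-8 address (+ offset 1)))
        (b2 (foreign-ref 'unsigned-8 address (+ offset 2)))
        (b3 (foreign-ref 'unsigned-8 address (+ offset 3))))
    (case eness
      ((little) (+ b0 (ash b1 8) (ash b2 16) (ash b3 24)))
      ((big)    (+ b3 (ash b2 8) (ash b1 16) (ash b0 24)))
      (else (assertion-violation 'foreign-u32-ref "invalid endianness" eness)))))

;; Hypothetical: store n as four bytes in the requested byte order.
(define (foreign-u32-set! address offset n eness)
  (unless (memq eness '(little big))
    (assertion-violation 'foreign-u32-set! "invalid endianness" eness))
  (do ((i 0 (+ i 1)))
      ((= i 4))
    ;; Byte i is counted from the least significant end of n.
    (foreign-set! 'unsigned-8 address
                  (+ offset (if (eq? eness 'little) i (- 3 i)))
                  (logand (ash n (* -8 i)) #xff))))

;; Usage: write big-endian, then read back both ways.
(define mm (foreign-alloc 4))
(foreign-u32-set! mm 0 #x01020304 'big)
(foreign-u32-ref mm 0 'big)    ; => #x01020304
(foreign-u32-ref mm 0 'little) ; => #x04030201
(foreign-free mm)
```

Since only byte-sized accesses are used, this works at any offset, which matches the more permissive alignment behaviour of foreign-ref and foreign-set! noted above.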

akeep avatar Mar 28 '21 01:03 akeep