ChezScheme: mmapping files into bytevectors?

Is there a way to mmap some memory (a file, for example; I can do this simply via the FFI) and use that memory as a bytevector without copying? Some sort of function that sets a bytevector's storage and locks it, so the GC doesn't touch it?

Jan 05 '20 18:01 ecraven

Simply put, no.

A bytevector is a tagged pointer that designates a block of memory. That block consists of an 8-byte header followed directly by the bytevector contents. It would be nice to have a way to refer to a memory region indirectly, but that is not how it is implemented. So to "mmap" a bytevector, you would have to mmap a memory block, somehow use the 8 bytes preceding it to store the header, and then form a properly tagged pointer. I don't know how to do this reliably, but maybe you do.

Feb 04 '21 21:02 Jarhmander

That is not supported because of the GC; see https://github.com/cisco/ChezScheme/issues/414.

Feb 08 '21 06:02 amirouche

@ecraven please close the issue: given https://github.com/cisco/ChezScheme/issues/414#issuecomment-498920960, this will not be implemented (even though I am also interested in that feature).

One way to work around the absence of that feature is to create procedures such as memoryview-ref that mimic the bytevector procedures with the help of foreign-ref.
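A minimal sketch of what such procedures could look like, assuming the region arrives from the C side as a raw address plus a length in bytes. The `memoryview` record and the names `memoryview-check`, `memoryview-u8-ref`, `memoryview-u8-set!` and `memoryview-u32-native-ref` are hypothetical and not part of Chez Scheme; only `foreign-ref` and `foreign-set!` are built in.

```scheme
(import (chezscheme))

;; Hypothetical wrapper: base address of the region and its size in bytes.
(define-record-type memoryview
  (fields address length))

;; Raise an error unless [offset, offset + size) lies inside the region.
(define (memoryview-check who mv offset size)
  (unless (and (fixnum? offset)
               (fx>= offset 0)
               (<= (+ offset size) (memoryview-length mv)))
    (assertion-violation who "offset out of range" offset)))

(define (memoryview-u8-ref mv offset)
  (memoryview-check 'memoryview-u8-ref mv offset 1)
  (foreign-ref 'unsigned-8 (memoryview-address mv) offset))

(define (memoryview-u8-set! mv offset value)
  (memoryview-check 'memoryview-u8-set! mv offset 1)
  (foreign-set! 'unsigned-8 (memoryview-address mv) offset value))

(define (memoryview-u32-native-ref mv offset)
  (memoryview-check 'memoryview-u32-native-ref mv offset 4)
  (foreign-ref 'unsigned-32 (memoryview-address mv) offset))
```

To try it without an actual mmap, a block from `foreign-alloc` can stand in for the mapped region: `(make-memoryview (foreign-alloc 16) 16)`. The bounds check is the main thing a bytevector gives for free and a raw address does not.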

Feb 08 '21 06:02 amirouche

I tried something else related to this: I allocate a bytevector, then expose it to C with #%$object-address:

#!chezscheme
(library (blake3)
  (export blake3 make-blake3 blake3-update! blake3-finalize!)
  (import (chezscheme))

  (define libblake3 (load-shared-object "local/lib/libblake3.so"))


  (define-syntax define-syntax-rule
    (syntax-rules ()
      ((define-syntax-rule (keyword args ...) body)
       (define-syntax keyword
         (syntax-rules ()
           ((keyword args ...) body))))))

  (define (bytevector->pointer bv)
    (#%$object-address bv (+ (foreign-sizeof 'void*) 1)))

  (define-syntax-rule (foreign-procedure* return ptr args ...)
    (foreign-procedure __collect_safe ptr (args ...) return))


  (define-syntax-rule (with-lock obj body ...)
    (begin (lock-object obj)
           (call-with-values (lambda () body ...)
             (lambda out
               (unlock-object obj)
               (apply values out)))))

    (define-ftype %keyvalue
      (packed (struct
               (key void*)
               (key-length int)
               (value void*)
               (value-length int))))

    (define blake3-hasher-init
      (let ((func (foreign-procedure* void "blake3_hasher_init" void*)))
        (lambda (blake3)
          (func blake3))))

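    (define (make-blake3)
      (define bv (make-bytevector 1912)) ;; sizeof blake3_hasher
      ;; The returned address outlives this call, so bv stays locked:
      ;; unlocking it here would let the collector move or reclaim it.
      ;; As a consequence the hasher storage is never released.
      (lock-object bv)
      (blake3-hasher-init (bytevector->pointer bv))
      (bytevector->pointer bv))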

    (define blake3-update!
      (let ((func (foreign-procedure* void "blake3_hasher_update" void* void* size_t)))
        (lambda (blake3 bytevector)
          (with-lock bytevector
                     (func blake3
                           (bytevector->pointer bytevector)
                           (bytevector-length bytevector))))))

    (define blake3-finalize!
      (let ((func (foreign-procedure* void "blake3_hasher_finalize" void* void* size_t)))
        (lambda (blake3)
          (define bytevector (make-bytevector 32))
          (with-lock bytevector
                     (func blake3 (bytevector->pointer bytevector) 32))
          bytevector)))

    (define (blake3 bytevector)
      (define hasher (make-blake3))
      (blake3-update! hasher bytevector)
      (blake3-finalize! hasher)))

It works in cases where Scheme knows the size, or the maximum size, of the bytevector, such as in make-blake3.
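The core of the trick, reduced to a sketch: lock the bytevector, take its data address, and read through that address. This relies on the undocumented primitive `#%$object-address` and on the same header offset as the library above, which I assume holds for a 64-bit build; `peek-byte-via-address` is a hypothetical name used only for illustration.

```scheme
(import (chezscheme))

;; Same helper as in the library above; the offset skips the header
;; and assumes the layout that library relies on.
(define (bytevector->pointer bv)
  (#%$object-address bv (+ (foreign-sizeof 'void*) 1)))

;; Read byte `index` of `bv` through its raw address
;; instead of going through bytevector-u8-ref.
(define (peek-byte-via-address bv index)
  (lock-object bv) ; keep the collector from moving bv while we hold its address
  (let ((byte (foreign-ref 'unsigned-8 (bytevector->pointer bv) index)))
    (unlock-object bv)
    byte))
```

If the layout assumption holds, `(peek-byte-via-address bv i)` and `(bytevector-u8-ref bv i)` agree for every valid index, which is a cheap sanity check before handing the address to C.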

Mar 11 '21 12:03 amirouche

I did a small benchmark again:

mmap:
(time (count (file-generator/mmap "CC-MAIN-20180618105733-20180618125538-00026.warc.wet")))
    no collections
    13.565826164s elapsed cpu time
    13.566017044s elapsed real time
    112 bytes allocated
369071864

port:
(time (count (file-generator/port "CC-MAIN-20180618105733-20180618125538-00026.warc.wet")))
    1 collection
    0.866473516s elapsed cpu time, including 0.000115993s collecting
    0.866486155s elapsed real time, including 0.000118302s collecting
    4334832 bytes allocated, including 8258736 bytes reclaimed
369071864

Mar 24 '21 04:03 amirouche

Also, I read about mmap. It seems like this question does not make sense.

I think the next best thing to do if you need to peek into a (binary) file is to call read through the FFI, somewhat like:

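;; Read up to `size` bytes from file descriptor `fd` directly into `bv`.
;; Returns whatever read(2) returns: bytes read, 0 at EOF, -1 on error.
(define file-read
  (let ((proc (foreign-procedure __collect_safe
                                 "read" (int void* size_t) ssize_t)))
    (lambda (fd bv size)
      ;; Lock bv so the collector cannot move it during the call.
      (lock-object bv)
      ;; bytevector->pointer is the helper from the blake3 library above.
      (let ((out (proc fd (bytevector->pointer bv) size)))
        (unlock-object bv)
        out))))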

In that case there is no extra copy.

Mar 24 '21 04:03 amirouche

As I wrote on IRC, the other thing that would be possible, and may prove useful for any memory region that comes from the C side, is a set of procedures memoryregion-foobar that mimic bytevector-foobar.

Mar 24 '21 04:03 amirouche

It is worth noting that it would be fairly easy to adapt the foreign-ref and foreign-set! functions to provide the interface you describe.

For instance:

| `bytevector` | `foreign-ref` equivalent |
| --- | --- |
| `(bytevector-u32-native-ref bv 0)` | `(foreign-ref 'unsigned-32 mm 0)` |
| `(bytevector-ieee-double-native-ref bv 0)` | `(foreign-ref 'double mm 0)` |
| `(bytevector-s56-native-ref bv 0)` | `(foreign-ref 'integer-56 mm 0)` |

etc.

| `bytevector` | `foreign-set!` equivalent |
| --- | --- |
| `(bytevector-u32-native-set! bv 0 10)` | `(foreign-set! 'unsigned-32 mm 0 10)` |
| `(bytevector-ieee-double-native-set! bv 0 8.75)` | `(foreign-set! 'double mm 0 8.75)` |
| `(bytevector-s56-native-set! bv 0 75)` | `(foreign-set! 'integer-56 mm 0 75)` |

etc.

There are a few subtle differences here. For instance, foreign-ref and foreign-set! are more permissive about offsets, which do not need to be a multiple of the type size as they do for the bytevector versions. The foreign-ref and foreign-set! functions also do not support specifying endianness, but if this were desired it would be fairly straightforward to implement versions that take endianness into account.
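One possible shape for an endianness-aware reader, as a sketch only: fetch the bytes one at a time with `foreign-ref` and assemble them in the requested order. The name `foreign-u32-ref` and its argument order are hypothetical; an actual implementation might prefer a native read followed by a byte swap.

```scheme
(import (chezscheme))

;; Hypothetical helper: read an unsigned 32-bit value at address + offset,
;; interpreting the four bytes as 'little or 'big endian.
(define (foreign-u32-ref address offset endianness)
  (define (byte i)
    (foreign-ref 'unsigned-8 address (+ offset i)))
  (define (combine b3 b2 b1 b0) ; b3 is the most significant byte
    (bitwise-ior (bitwise-arithmetic-shift-left b3 24)
                 (bitwise-arithmetic-shift-left b2 16)
                 (bitwise-arithmetic-shift-left b1 8)
                 b0))
  (case endianness
    ((little) (combine (byte 3) (byte 2) (byte 1) (byte 0)))
    ((big)    (combine (byte 0) (byte 1) (byte 2) (byte 3)))
    (else (assertion-violation 'foreign-u32-ref
                               "unknown endianness" endianness))))
```

Because it reads byte by byte, this version also works at unaligned offsets on any platform, at the cost of four memory reads per value.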

Mar 28 '21 01:03 akeep