mmapping files into bytevectors?
Is there a way to mmap some memory (a file, for example; I can do this simply via the FFI) and use that memory as a bytevector without copying? Some sort of function that sets a bytevector's storage and locks it, so the GC doesn't touch it?
Simply put, no.
A bytevector is a tagged pointer that designates a block of memory. That block consists of an 8-byte header followed directly by the bytevector contents. It would be cool if there were a way to refer to a memory region indirectly, but that's not how it's done. So to "mmap" a bytevector, you would have to mmap a memory block, somehow use the 8 preceding bytes to store the header, and then form a properly tagged pointer. I don't know how to do this reliably, but maybe you do.
That is not supported because of the GC; see https://github.com/cisco/ChezScheme/issues/414.
@ecraven please close the issue: given https://github.com/cisco/ChezScheme/issues/414#issuecomment-498920960, this will not be implemented (even though I am also interested in that feature).
One way to work around the absence of that feature is to create procedures such as memoryview-ref that mimic the bytevector procedures with the help of foreign-ref.
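A minimal sketch of that workaround, assuming Chez Scheme. The names `memoryview`, `make-memoryview`, `memoryview-u8-ref` and friends are hypothetical, chosen only for illustration; the sketch relies on nothing beyond `foreign-ref`, `foreign-set!`, `foreign-alloc` and `foreign-free`. The address could just as well come from an mmap call made through the FFI.

```scheme
(import (chezscheme))

;; Hypothetical "memoryview": a raw base address plus a length in bytes.
;; The memory is owned by the C side; the collector never sees it.
(define-record-type memoryview
  (fields address length))

;; Bounds check, since foreign-ref itself does not know the region size.
(define (check-range who mv offset size)
  (unless (and (fixnum? offset)
               (fx>= offset 0)
               (fx<= (fx+ offset size) (memoryview-length mv)))
    (assertion-violation who "offset out of range" offset)))

(define (memoryview-u8-ref mv offset)
  (check-range 'memoryview-u8-ref mv offset 1)
  (foreign-ref 'unsigned-8 (memoryview-address mv) offset))

(define (memoryview-u8-set! mv offset value)
  (check-range 'memoryview-u8-set! mv offset 1)
  (foreign-set! 'unsigned-8 (memoryview-address mv) offset value))

(define (memoryview-u32-native-ref mv offset)
  (check-range 'memoryview-u32-native-ref mv offset 4)
  (foreign-ref 'unsigned-32 (memoryview-address mv) offset))

(define (memoryview-u32-native-set! mv offset value)
  (check-range 'memoryview-u32-native-set! mv offset 4)
  (foreign-set! 'unsigned-32 (memoryview-address mv) offset value))

;; Usage: wrap 16 bytes obtained from foreign-alloc (standing in for mmap).
(define example-address (foreign-alloc 16))
(define example-view (make-memoryview example-address 16))
(memoryview-u8-set! example-view 0 255)
(memoryview-u32-native-set! example-view 4 #xDEADBEEF)
(display (memoryview-u8-ref example-view 0))
(newline)
(foreign-free example-address)
```

The bounds check is the main thing the wrapper adds: a bytevector knows its own length, while a bare address does not.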
I tried something else related to this: I allocate a bytevector, then expose it to C with #%$object-address:
#!chezscheme
(library (blake3)
  (export blake3 make-blake3 blake3-update! blake3-finalize!)
  (import (chezscheme))

  (define libblake3 (load-shared-object "local/lib/libblake3.so"))

  (define-syntax define-syntax-rule
    (syntax-rules ()
      ((define-syntax-rule (keyword args ...) body)
       (define-syntax keyword
         (syntax-rules ()
           ((keyword args ...) body))))))

  ;; Address of the first data byte of bv (past the header).
  (define (bytevector->pointer bv)
    (#%$object-address bv (+ (foreign-sizeof 'void*) 1)))

  (define-syntax-rule (foreign-procedure* return ptr args ...)
    (foreign-procedure __collect_safe ptr (args ...) return))

  ;; Keep obj locked (not moved or reclaimed) while body runs.
  (define-syntax-rule (with-lock obj body ...)
    (begin (lock-object obj)
           (call-with-values (lambda () body ...)
             (lambda out
               (unlock-object obj)
               (apply values out)))))

  (define-ftype %keyvalue
    (packed (struct
              (key void*)
              (key-length int)
              (value void*)
              (value-length int))))

  (define blake3-hasher-init
    (let ((func (foreign-procedure* void "blake3_hasher_init" void*)))
      (lambda (blake3)
        (func blake3))))

  (define (make-blake3)
    (define bv (make-bytevector 1912)) ;; sizeof blake3_hasher
    (with-lock bv
      (blake3-hasher-init (bytevector->pointer bv)))
    ;; Caution: bv is unlocked again at this point, so the collector is free
    ;; to move or reclaim it; the address returned here is not pinned.
    (bytevector->pointer bv))

  (define blake3-update!
    (let ((func (foreign-procedure* void "blake3_hasher_update" void* void* size_t)))
      (lambda (blake3 bytevector)
        (with-lock bytevector
          (func blake3
                (bytevector->pointer bytevector)
                (bytevector-length bytevector))))))

  (define blake3-finalize!
    (let ((func (foreign-procedure* void "blake3_hasher_finalize" void* void* size_t)))
      (lambda (blake3)
        (define bytevector (make-bytevector 32))
        (with-lock bytevector
          (func blake3 (bytevector->pointer bytevector) 32))
        bytevector)))

  (define (blake3 bytevector)
    (define hasher (make-blake3))
    (blake3-update! hasher bytevector)
    (blake3-finalize! hasher)))
It works in cases where Scheme knows the size, or the maximum size, of the bytevector, such as in make-blake3.
I did a small benchmark again:
mmap:
(time (count (file-generator/mmap "CC-MAIN-20180618105733-20180618125538-00026.warc.wet")))
    no collections
    13.565826164s elapsed cpu time
    13.566017044s elapsed real time
    112 bytes allocated
369071864
port:
(time (count (file-generator/port "CC-MAIN-20180618105733-20180618125538-00026.warc.wet")))
    1 collection
    0.866473516s elapsed cpu time, including 0.000115993s collecting
    0.866486155s elapsed real time, including 0.000118302s collecting
    4334832 bytes allocated, including 8258736 bytes reclaimed
369071864
Also, I read about mmap. It seems like this question does not make sense.
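I think the next best thing to do if you need to peek into a (binary) file is to use read, somewhat like:
;; Assumes the C library providing "read" has already been loaded with
;; load-shared-object, and that bytevector->pointer is the helper defined above.
(define file-read
  (let ((proc (foreign-procedure __collect_safe
                "read" (int void* size_t) ssize_t)))
    (lambda (fd bv size)
      ;; Lock bv so the collector cannot move it during the call.
      (lock-object bv)
      (let ((out (proc fd (bytevector->pointer bv) size)))
        (unlock-object bv)
        out))))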
In that case there is no extra copy.
As I wrote on IRC, the other thing that would be possible, and may prove useful for any memory region that is handed over from the C side, is a set of procedures memoryregion-foobar that mimic bytevector-foobar.
It is worth noting that it would be fairly easy to adapt the foreign-ref and foreign-set! functions to provide the interface you describe.
For instance:
| bytevector | foreign-ref equivalent |
|---|---|
| (bytevector-u32-native-ref bv 0) | (foreign-ref 'unsigned-32 mm 0) |
| (bytevector-ieee-double-native-ref bv 0) | (foreign-ref 'double mm 0) |
| (bytevector-s56-native-ref bv 0) | (foreign-ref 'integer-56 mm 0) |
| etc. | |

| bytevector | foreign-set! equivalent |
|---|---|
| (bytevector-u32-native-set! bv 0 10) | (foreign-set! 'unsigned-32 mm 0 10) |
| (bytevector-ieee-double-native-set! bv 0 8.75) | (foreign-set! 'double mm 0 8.75) |
| (bytevector-s56-native-set! bv 0 75) | (foreign-set! 'integer-56 mm 0 75) |
| etc. | |
There are a few subtle differences here. For instance, foreign-ref and foreign-set! are more permissive about offsets: the offset does not need to be a multiple of the type size, as it does for the bytevector versions. The foreign-ref and foreign-set! functions also do not support specifying endianness, but if that were desired it would be fairly straightforward to implement versions that take endianness into account.
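One way such an endianness-aware reader might look, as a rough sketch: read the value one byte at a time with `foreign-ref 'unsigned-8` and assemble it in the requested order. The name `foreign-uint-ref` and its argument order are assumptions made for this example, not an existing Chez Scheme procedure.

```scheme
(import (chezscheme))

;; Hypothetical helper: read an unsigned integer of `size` bytes located at
;; address + offset, interpreting the bytes as 'big or 'little endian.
(define (foreign-uint-ref address offset size endianness)
  (let loop ((i 0) (acc 0))
    (if (= i size)
        acc
        ;; Always consume the most significant remaining byte first.
        (let* ((index (case endianness
                        ((big) i)
                        ((little) (- size i 1))
                        (else (assertion-violation 'foreign-uint-ref
                                "unknown endianness" endianness))))
               (byte (foreign-ref 'unsigned-8 address (+ offset index))))
          (loop (+ i 1) (+ (* acc 256) byte))))))

;; Usage: store the bytes 01 02 03 04 and read them back both ways.
(define example-buffer (foreign-alloc 4))
(foreign-set! 'unsigned-8 example-buffer 0 1)
(foreign-set! 'unsigned-8 example-buffer 1 2)
(foreign-set! 'unsigned-8 example-buffer 2 3)
(foreign-set! 'unsigned-8 example-buffer 3 4)
(display (foreign-uint-ref example-buffer 0 4 'big))    ; #x01020304
(newline)
(display (foreign-uint-ref example-buffer 0 4 'little)) ; #x04030201
(newline)
(foreign-free example-buffer)
```

Going byte by byte is slower than a single typed `foreign-ref`, but it works for any size and makes no assumption about the host byte order.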