
Improve performance for `PyByteArray::new`

Open ZHUI opened this issue 1 year ago • 5 comments

Hello, recently I found that loading with safetensors is more than 50% slower than loading with pickle.

https://github.com/huggingface/safetensors/issues/460

For a 1 GB numpy array, loading with pickle takes only about 1 s, but safetensors needs about 1.8 s:

safetensors load: 1.8608193397521973
pickle      load: 1.004188060760498

Then I profiled it and found that the call to PyByteArray::new is the slowest part, consuming almost all of the time! https://github.com/ZHUI/safetensors/tree/performance/load_speed https://github.com/ZHUI/safetensors/commit/889c43219a0e38731c3045cbb0c4cc9fbb458ce2

                let data =
                    &mmap[info.data_offsets.0 + self.offset..info.data_offsets.1 + self.offset];

                error!("get_tensor mmap done");

                let array: PyObject = Python::with_gil(|py| PyByteArray::new(py, data).into_py(py));

                error!("get_tensor PyByteArray");
[2024-04-08T03:40:56.506Z ERROR safetensors_rust] Open new statrt
load weight 1712547656.5070753
[2024-04-08T03:40:56.507Z ERROR safetensors_rust] 710 get_tensor
[2024-04-08T03:40:56.507Z ERROR safetensors_rust] get_tensor
[2024-04-08T03:40:56.507Z ERROR safetensors_rust] get_tensor mmap done
[2024-04-08T03:40:58.375Z ERROR safetensors_rust] get_tensor PyByteArray
[2024-04-08T03:40:58.375Z ERROR safetensors_rust] create_tensor done!
sf load: 1.9328322410583496

For PyByteArray::new(py, data), data points into the mmap, and PyByteArray::new simply calls PyByteArray_FromStringAndSize: https://github.com/python/cpython/blob/733e56ef9656dd79055acc2a3cecaf6054a45b6c/Objects/bytearrayobject.c#L134-L142

It seems to just memcpy the mmapped data into memory, so the copy is disk -> mem.

And I don't know why it is so slow. Is there an additional mem -> mem copy in this operation, making it disk -> mem -> mem?

Is there any way to create a PyByteArray from existing memory without copying the data?

ZHUI avatar Apr 08 '24 03:04 ZHUI

Is there any way to create a PyByteArray from existing memory without copying the data?

I don't think that is possible, especially with PyByteArray, which provides mutable access to the underlying data. Do you need mutability? If not, PyBytes might be a better choice, but even that type will copy to move ownership of the data under the control of the Python heap.
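
For illustration, a minimal sketch of the two options, assuming PyO3 0.21's Bound API; the export helper is hypothetical. Either way the data is memcpy'd exactly once into a Python-owned allocation:

    use pyo3::prelude::*;
    use pyo3::types::{PyByteArray, PyBytes};

    fn export(py: Python<'_>, data: &[u8]) -> (PyObject, PyObject) {
        // Immutable `bytes`: one memcpy into a Python-owned allocation.
        let bytes = PyBytes::new_bound(py, data).into_any().unbind();
        // Mutable `bytearray`: also one memcpy, into a resizable Python buffer.
        let byte_array = PyByteArray::new_bound(py, data).into_any().unbind();
        (bytes, byte_array)
    }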

I think a real zero-copy solution would be to implement a custom pyclass which implements the buffer protocol and is directly backed by the Mmap instance you already have. We recently discussed this in #4003 and we use something similar in rust-numpy, albeit as a NumPy base object, without the buffer protocol but used to perform type erasure.

adamreichold avatar Apr 08 '24 06:04 adamreichold

@adamreichold Thanks for your answer. I am not very familiar with Rust.

even that type will copy to move ownership of the data under the control of the Python heap.

Does that mean the data will always be copied when we pass it from Rust to Python?

I think a real zero-copy solution would be to implement a custom pyclass which implements the buffer protocol and is directly backed by the Mmap instance you already have. We recently discussed this in #4003 and we use something similar in rust-numpy, albeit as a NumPy base object, without the buffer protocol but used to perform type erasure.

https://github.com/PyO3/rust-numpy/blob/0832b28bf84a3f6053ca647a43625d715831db00/src/array.rs#L402-L421

I checked the PySliceContainer code. Is it going Rust data_ptr -> Python PySliceContainer data_ptr -> PY_ARRAY_API.PyArray_SetBaseObject to avoid the memcpy?

        let mut dims = dims.into_dimension();
        let ptr = PY_ARRAY_API.PyArray_NewFromDescr(
            py,
            PY_ARRAY_API.get_type_object(py, npyffi::NpyTypes::PyArray_Type),
            T::get_dtype_bound(py).into_dtype_ptr(),
            dims.ndim_cint(),
            dims.as_dims_ptr(),
            strides as *mut npy_intp,    // strides
            data_ptr as *mut c_void,     // data
            npyffi::NPY_ARRAY_WRITEABLE, // flag
            ptr::null_mut(),             // obj
        );

        PY_ARRAY_API.PyArray_SetBaseObject(
            py,
            ptr as *mut npyffi::PyArrayObject,
            container as *mut ffi::PyObject,
        );

        Bound::from_owned_ptr(py, ptr).downcast_into_unchecked()

Can I use your PySliceContainer directly? Or could you give me some pseudo-code to implement it?

ZHUI avatar Apr 08 '24 08:04 ZHUI

Does that mean the data will always be copied when we pass it from Rust to Python?

Basically, yes: someone has to free the data and know how to do that. In our case, the main task of PySliceContainer is to call the correct drop functions when Python's reference count reaches zero, because Python types like bytes do not know how to free Rust-allocated memory.

Can I use your PySliceContainer directly? Or could you give me some pseudo-code to implement it?

I don't think so, because this type is pretty much tied to backing a NumPy array by a Rust-allocated Vec (or boxed slice), but if I understand you correctly, you want bytes-like access to a Mmap instance.

I fear you will have to figure this out via https://pyo3.rs/v0.21.1/class/protocols.html?highlight=__getbuffer__#buffer-objects, which is a thin layer over the CPython FFI, until we provide a more ergonomic API to achieve this.
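
For illustration, a minimal sketch of that approach, assuming PyO3 0.21 and the memmap2 crate; the MmapView name and its constructor are hypothetical. The pyclass owns the Mmap and exports its pages through __getbuffer__, so Python-side consumers see the mapping without any copy:

    use std::ffi::{c_int, c_void};

    use memmap2::Mmap;
    use pyo3::exceptions::PyBufferError;
    use pyo3::prelude::*;
    use pyo3::ffi;

    #[pyclass]
    struct MmapView {
        mmap: Mmap,
    }

    #[pymethods]
    impl MmapView {
        #[new]
        fn new(path: &str) -> PyResult<Self> {
            let file = std::fs::File::open(path)?;
            let mmap = unsafe { Mmap::map(&file)? };
            Ok(Self { mmap })
        }

        unsafe fn __getbuffer__(
            slf: Bound<'_, Self>,
            view: *mut ffi::Py_buffer,
            flags: c_int,
        ) -> PyResult<()> {
            if (flags & ffi::PyBUF_WRITABLE) != 0 {
                return Err(PyBufferError::new_err("the mapping is read-only"));
            }
            let (buf, len) = {
                let this = slf.borrow();
                (this.mmap.as_ptr() as *mut c_void, this.mmap.len() as ffi::Py_ssize_t)
            };
            // PyBuffer_FillInfo takes a new reference to the exporter, which
            // keeps the Mmap alive for as long as the buffer is held.
            if ffi::PyBuffer_FillInfo(view, slf.as_ptr(), buf, len, 1, flags) == -1 {
                return Err(PyErr::fetch(slf.py()));
            }
            Ok(())
        }
    }

On the Python side, memoryview(MmapView(path)) or numpy.frombuffer(MmapView(path), dtype=numpy.uint8) would then read straight from the mapping.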

Thinking outside of the box, I am not sure how exactly you are using Rust here but you could also just use Python's mmap support directly and pass the resulting object to Rust as PyAny. But I am not sure how you want to access those bytes afterwards...
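
A hedged sketch of that route, assuming PyO3 0.21; the checksum function is just a placeholder. Rust can consume any buffer exporter, including Python's mmap.mmap objects, via pyo3::buffer::PyBuffer:

    use pyo3::buffer::PyBuffer;
    use pyo3::prelude::*;

    #[pyfunction]
    fn checksum(obj: &Bound<'_, PyAny>) -> PyResult<u64> {
        // Accepts any object implementing the buffer protocol, e.g. mmap.mmap.
        let buffer: PyBuffer<u8> = PyBuffer::get_bound(obj)?;
        let data = buffer.to_vec(obj.py())?; // copies; PyBuffer::as_slice could avoid it
        Ok(data.iter().map(|&b| u64::from(b)).sum())
    }

From Python this could be called as checksum(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)).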

adamreichold avatar Apr 08 '24 09:04 adamreichold

Thanks for your advice. For my case, copying out of the mmap is reasonable, since I need to load the weights into memory anyway.

The core problem is that memcpy from mmapped memory is very slow; see:

https://stackoverflow.com/questions/52845387/improving-mmap-memcpy-file-read-performance

For my case, open(filename); f.read() achieves 2 GB/s, while memcpy(mmap(filename)) achieves only 1.3 GB/s, which is much slower than reading the file directly.

ZHUI avatar Apr 10 '24 03:04 ZHUI

This seems reasonable: if you are going to read the whole file, f.read() knows this intent and can optimize accordingly, while memcpy(mmap(filename)) has to figure it out from the page faults incurred.

I guess you could reduce the difference by applying madvise(..., MADV_SEQUENTIAL) to the memory map to tell the kernel what you are going to do, but there is little point in modifying the process' address space if all you are going to do is pull the data into a buffer. Bypassing the page cache via direct I/O might be useful if you really read the data only once (and not repeatedly in an interactive session).
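
For reference, a minimal sketch of that hint, assuming the memmap2 crate on a Unix target; read_via_mmap is a hypothetical helper:

    use std::fs::File;

    use memmap2::{Advice, Mmap};

    fn read_via_mmap(path: &str) -> std::io::Result<Vec<u8>> {
        let file = File::open(path)?;
        let mmap = unsafe { Mmap::map(&file)? };
        // Equivalent to madvise(addr, len, MADV_SEQUENTIAL): tells the kernel
        // we will read the pages front to back, enabling aggressive readahead.
        mmap.advise(Advice::Sequential)?;
        Ok(mmap.to_vec()) // the disk -> mem copy discussed above
    }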

Using std::fs::read and IntoPyArray should also provide a performant approach to turn this into a NumPy array without additional copies.
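
A hedged sketch of that last suggestion, assuming rust-numpy 0.21; load_bytes is a hypothetical function name:

    use numpy::{IntoPyArray, PyArray1};
    use pyo3::prelude::*;

    #[pyfunction]
    fn load_bytes<'py>(py: Python<'py>, path: &str) -> PyResult<Bound<'py, PyArray1<u8>>> {
        let data = std::fs::read(path)?; // the single disk -> mem copy
        // into_pyarray_bound moves the Vec into the NumPy array's backing
        // store, so no additional mem -> mem copy happens here.
        Ok(data.into_pyarray_bound(py))
    }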

adamreichold avatar Apr 10 '24 17:04 adamreichold