PyClass Sequence of tuples or arrays not convertible to numpy array
Bug Description
When creating a custom pyclass that implements the sequence protocol with pyo3 0.16.x the output type is missing something (I haven't been able to figure out exactly what) that prevents numpy from interpreting it correctly to convert into an array. This only is an issue when using the creating a sequence type using #[pymethods]. It works as expected if you use the deprecated #[pyproto] with the PySequenceProtocol trait.
Steps to Reproduce
- create pyo3 pyclass that implement sequence protocol on array or tuple of numeric types, minimal example:
use pyo3::prelude::*;
use pyo3::exceptions::PyIndexError;
#[pyclass(module="reproduce")]
#[derive(Clone)]
pub struct MySequenceArray {
pub data: Vec<[usize; 2]>
}
#[pymethods]
impl MySequenceArray {
#[new]
fn new(data: Vec<[usize; 2]>) -> Self {
MySequenceArray {
data
}
}
fn __len__(&self) -> PyResult<usize> {
Ok(self.data.len())
}
fn __getitem__(&self, idx: isize) -> PyResult<[usize; 2]> {
if idx as usize >= self.data.len() {
Err(PyIndexError::new_err("out of bounds"))
} else {
Ok(self.data[idx as usize])
}
}
}
#[pyclass(module="reproduce")]
#[derive(Clone)]
pub struct MySequenceTuple {
pub data: Vec<(usize, usize)>
}
#[pymethods]
impl MySequenceTuple {
#[new]
fn new(data: Vec<(usize, usize)>) -> Self {
MySequenceTuple {
data
}
}
fn __len__(&self) -> PyResult<usize> {
Ok(self.data.len())
}
fn __getitem__(&self, idx: isize) -> PyResult<(usize, usize)> {
if idx as usize >= self.data.len() {
Err(PyIndexError::new_err("out of bounds"))
} else {
Ok(self.data[idx as usize])
}
}
}
#[pymodule]
fn reproduce(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
m.add_class::<MySequenceTuple>()?;
m.add_class::<MySequenceArray>()?;
Ok(())
}
- In python create instances of
MySequenceArrayorMySequenceTuple:
import reproduce
test_list = [[0, 1], [1, 2]]
reproduce_array = reproduce.MySequenceArray(test_list)
test_tuple = [(0, 1), (1, 2)]
reproduce_tuple = reproduce.MySequenceTuple(test_tuple)
- Build 2d numpy array from custom sequence objects:
import numpy as np
nd_array = np.asarray(reproduce_array, dtype=np.uintp)
nd_tuple = np.asarray(reproduce_tuple, dtype=np.uintp)
Backtrace
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'reproduce.MySequenceArray'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
Your operating system and version
Linux
Your Python version (python --version)
3.10.0
Your Rust version (rustc --version)
1.60.0
Your PyO3 version
0.16.x
How did you install python? Did you use a virtualenv?
pacman
Additional Info
This example worked as expected with the deprecated PySequenceProtocol trait and pyproto. So the pyclass code becomes:
use pyo3::prelude::*;
use pyo3::PySequenceProtocol;
use pyo3::exceptions::PyIndexError;
#[pyclass(module="reproduce")]
#[derive(Clone)]
pub struct MySequenceArray {
pub data: Vec<[usize; 2]>
}
#[pymethods]
impl MySequenceArray {
#[new]
fn new(data: Vec<[usize; 2]>) -> Self {
MySequenceArray {
data
}
}
}
#[pyproto]
impl PySequenceProtocol for MySequenceArray {
fn __len__(&self) -> PyResult<usize> {
Ok(self.data.len())
}
fn __getitem__(&'p self, idx: isize) -> PyResult<[usize; 2]> {
if idx as usize >= self.data.len() {
Err(PyIndexError::new_err("out of bounds"))
} else {
Ok(self.data[idx as usize])
}
}
}
#[pyclass(module="reproduce")]
#[derive(Clone)]
pub struct MySequenceTuple {
pub data: Vec<(usize, usize)>
}
#[pymethods]
impl MySequenceTuple {
#[new]
fn new(data: Vec<(usize, usize)>) -> Self {
MySequenceTuple {
data
}
}
}
#[pyproto]
impl PySequenceProtocol for MySequenceTuple {
fn __len__(&self) -> PyResult<usize> {
Ok(self.data.len())
}
fn __getitem__(&self, idx: isize) -> PyResult<(usize, usize)> {
if idx as usize >= self.data.len() {
Err(PyIndexError::new_err("out of bounds"))
} else {
Ok(self.data[idx as usize])
}
}
}
#[pymodule]
fn reproduce(_py: Python<'_>, m: &PyModule) -> PyResult<()> {
m.add_class::<MySequenceTuple>()?;
m.add_class::<MySequenceArray>()?;
Ok(())
}
then running:
import reproduce
import numpy as np
test_list = [[0, 1], [1, 2]]
reproduce_array = reproduce.MySequenceArray(test_list)
test_tuple = [(0, 1), (1, 2)]
reproduce_tuple = reproduce.MySequenceTuple(test_tuple)
nd_array = np.asarray(reproduce_array, dtype=np.uintp)
nd_tuple = np.asarray(reproduce_tuple, dtype=np.uintp)
print(nd_array)
print(nd_tuple)
outputs the expected:
[[0 1]
[1 2]]
[[0 1]
[1 2]]
As @georgios-ts pointed out in https://github.com/Qiskit/retworkx/issues/614#issuecomment-1132994184 the underlying issue might be that the new pymethods approach is setting mp_length and pyproto is setting sq_length.
Ugh, that's frustrating.
The problem I saw with sq_length is that if it's set then when indexing (from C only) with a negative index then CPython will add the result of sq_length to it. This is a subtle behavior that I perceived as potentially surprising for users.
TBH while annoying I've been coming round to the opinion that maybe it's the lesser evil compared to not setting sq_length.
What would your opinion be? Is CPython adding the length to negative indices a problem in practice?
TBH while annoying I've been coming round to the opinion that maybe it's the lesser evil compared to not setting
sq_length.What would your opinion be? Is CPython adding the length to negative indices a problem in practice?
I'm not the original reporter, but I've had a problem with my code not handling negative indexes correctly after migration from #[pyproto], solving which required adding several workarounds in my Rust code. On the contrary, I found the new PyO3 behavior surprising.
The "not handling correctly" just means that we aren't handling negative indices specially anymore, I think. This is intended.
I agree with @birkenfeld in that I felt having special handling for negative integers to be a footgun.
For example, imagine a sequence with length 5, and user attempts to index with -8.
If sq_len is implemented, then CPython will amend that -8 to -3 (by adding 5). This is done invisibly to the sq_item implementation. Not only would it be a bug for the sq_item implementation to amend that -3 to 2, but it's potentially done inconsistently; any other code which calls sq_item directly needs to reproduce CPython's behavior of conditionally adding sq_len to negative indices if present. CPython also doesn't adjust negative indices when trying the mp_subscript slot prior to trying sq_item.
Further, the sq_item implementation would probably not want to raise IndexError: -3 - I perceive that to be surprising to the user, given they had originally passed -8. Maybe the sq_item could correct the -3 to -8 before raising, but given the correction may not be applied consistently this won't help either.
To be honest, both having and not having sq_len have downsides, given we're now seeing the downsides of PyO3 having chosen to omit it.
My preferred solution would be to add a new type object flag to CPython to opt-out of this negative index adjustment so that implementations can handle it. That way PyO3 could return to implementing sq_len. I just haven't got around to proposing that.
Well, this sounds indeed like a problem in CPython but from my perspective pyo3 should not try to fight that (and it didn't a few releases back). It's best to fill in sq_length and document this behavior.
While I'm leaning towards this indeed, we need to be careful because anyone who is converting indices manually like @ravenexp will end up with broken code if we start emitting sq_len. If we revert this for PyO3 0.17, we need big bold text in the CHANGELOG to make it clear users need to check and fix behavior.
I was thinking about this a bit more, maybe in the interest of compatibility we can add an explicit sequence flag to pyclass like is already there for mapping. If it's set only the sequence slots, including sq_len, get filled. That way if the user doesn't specify anything it behaves as it does today, but if you want an explicit sequence type you can specify that just like you can with a mapping type.
I think that's a reasonable plan. If anyone's got availability to write a PR for this, would be much appreciated. Otherwise I'll get to it eventually.... (Maybe in a few weeks though)
Sequence users - we're thinking to change PySequence and PyMapping types to be based on isinstance(value, collections.abc.Sequence) and isinstance(value, collections.abc.Mapping) respectively - see #2072 for more.
I was wondering if you have any opinions on this?
Implementation for https://github.com/PyO3/pyo3/issues/2392#issuecomment-1146697738 now at #2567. @mtreinish are you willing to double-check that branch works for you before I merge it?
#2567 now merged, can test on main if desired. Release incoming soon, I will close this.
I can't get #[pyclass(sequence)] to work as documented. Negative indexes are still passed to __getitem__() as is even in PyO3 v0.17.1, even though the class is tagged as a sequence and __len__() is implemented.
Am I missing something?
I think https://github.com/PyO3/pyo3/issues/2392#issuecomment-1133955665 and your experience illustrates exactly why this CPython behaviour is very confusing. I will put a longer answer on #2601.