Add `take` specification for returning elements of an array along a specified axis
This PR

- adds a specification for `take`, an API for returning elements of an array along a specified axis. This API was initially proposed in https://github.com/data-apis/array-api/issues/177.
- supports only a single keyword argument: `axis`. This is derived from comparison data.
- requires that the `axis` keyword argument always be provided, as suggested in comment.
- does not currently support providing a multi-dimensional array of indices. Array libraries supporting multi-dimensional arrays for `indices` treat the input array as a flattened 1-dimensional array. If we were to follow similarly, we'd need to specify the flattened layout.
- does not currently support lists/tuples/sequences for the `indices` argument. NumPy et al. allow array-likes. Torch requires a `LongTensor`. It was not immediately clear to me how accepting we should be of non-arrays, so I went with the most conservative choice.
- for an integer `indices` argument, requires that behavior match multi-axis indexing semantics, as set forth in the specification's indexing guidance.
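The multi-axis indexing requirement in the last bullet can be sketched as follows. This is a hedged illustration using NumPy; the helper name `take_via_indexing` is mine, not part of the spec:

```python
import numpy as np

# Sketch of the proposed semantics: take(x, indices, axis) should match
# multi-axis indexing with `indices` placed at position `axis` and full
# slices everywhere else.
def take_via_indexing(x, indices, axis):
    # `axis % x.ndim` normalizes negative axes
    return x[(slice(None),) * (axis % x.ndim) + (indices,)]

x = np.arange(12).reshape(3, 4)
idx = np.array([3, 1])

# x[:, [3, 1]] and np.take(x, [3, 1], axis=1) agree
assert np.array_equal(take_via_indexing(x, idx, axis=1),
                      np.take(x, idx, axis=1))
```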
Questions

- Should the API be extended to support multi-dimensional `indices`?
- Should the `axis` argument be required?
- Should the `indices` argument be more liberal in the types of allowed arguments?
- Should this PR be blocked until an initial spec release?
Regarding `axis` and flattening, it's worth noting `np.put` doesn't have `axis`, so requiring it here would either break the symmetry with a potential `put` function, or require `put` to be modified from NumPy (see https://github.com/data-apis/array-api/issues/177#issuecomment-1078482747).
Two thoughts:

- We could make `axis` optional only in the case of 1-D arrays.
- I would not necessarily be limited by constraints on `np.put`, which in my opinion is much more niche functionality than `take`. I don't even know if we want `put` in the API standard.
@rgommers Before moving forward with this PR, we need to ensure buy-in from PyTorch. The current proposal deviates from supported PyTorch behavior in the following ways:

- PyTorch does not currently support an `axis` kwarg.
- PyTorch always flattens the input array.
We'd need to ensure that PyTorch is (1) okay with adding an `axis` kwarg to support taking along a specified axis and (2) okay with making the `axis` kwarg mandatory, thus breaking current behavior (i.e., under the current proposal, there would be no way to index into a flattened array using `take`, which is the current PyTorch behavior).
For NumPy-inspired libraries, the `axis` kwarg is supported and, provided the spec-compliant API is in a sub-namespace, `xp.take(x, indices, axis=-1)` can simply wrap a call to `np.take`. Dask already effectively requires an `axis` kwarg, as the default `axis` value is `0`.
Accordingly, PyTorch is the odd one out here at the moment, so it would be good to get a feel for potential specification constraints in order to ensure broad API compatibility, while also addressing fancy-indexing concerns (indexing arrays via arrays) and the proposal to require an `axis`.
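To make the behavioral gap concrete, here is a small NumPy-based sketch (example values are mine): omitting `axis` in `np.take` reproduces the flatten-first behavior that `torch.take` always applies, while the proposal requires an explicit `axis`:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# torch.take-style behavior: the input is treated as flattened.
# NumPy matches this when axis is omitted (axis=None):
assert np.take(x, [0, 3]).tolist() == [1, 4]

# The proposed spec instead always takes along an explicit axis:
assert np.take(x, [1], axis=0).tolist() == [[3, 4]]
assert np.take(x, [1], axis=1).tolist() == [[2], [4]]
```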
> For NumPy-inspired libraries, the `axis` kwarg is supported and, provided the spec-compliant API is in a sub-namespace, `xp.take(x, indices, axis=-1)` can simply wrap a call to `np.take`. Dask already effectively requires an `axis` kwarg, as the default `axis` value is `0`.
This requires a backwards-incompatible change in NumPy as well, when we want to move to support in the main namespace. It'd be a good change to make though. Dask not following the NumPy default is fairly exceptional for Dask, and was probably done because the flattening behavior is indeed not useful (EDIT: it was added in https://github.com/dask/dask/pull/26, no discussion there on axis default).
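The sub-namespace wrapping idea mentioned above can be sketched as follows. This is a minimal illustration under my own assumptions (the keyword-only, mandatory `axis` mirrors the proposal; it is not actual spec or library code):

```python
import numpy as np

# Hypothetical spec-compliant `take` living in a sub-namespace (`xp`):
# callers must always supply `axis`, so it is keyword-only with no default.
def take(x, indices, *, axis):
    return np.take(x, indices, axis=axis)

x = np.array([[1, 2, 3],
              [4, 5, 6]])

# Take columns 2 and 0 along the last axis:
assert take(x, np.array([2, 0]), axis=-1).tolist() == [[3, 1], [6, 4]]
```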
I just had a look at SciPy, and it has quite a few `np.take` usages but none with the default `axis=None`. Scikit-learn does have some; I need to look in a little more detail to verify whether those are all for 1-D arrays or not.
I looked at all usages in scikit-learn and most inputs to `np.take` are 1-D, in which case the `axis` argument is not used/needed, and for the one case with 2-D array input the code uses `axis=0`:
# `processed` is 1-D
sklearn/cluster/_optics.py: unproc = np.compress(~np.take(processed, indices), indices)
--
# `X` is 2-D, that's why this uses `axis=0`
sklearn/cluster/_optics.py: dists = pairwise_distances(P, np.take(X, unproc, axis=0),
--
# `reachability_` is 1-D
sklearn/cluster/_optics.py: improved = np.where(rdists < np.take(reachability_, unproc))
--
# `compute_class_weight` returns a 1-D array
sklearn/utils/class_weight.py: weight_k = np.take(compute_class_weight(class_weight_k,
sklearn/utils/class_weight.py- classes=classes_subsample,
sklearn/utils/class_weight.py- y=y_subsample),
sklearn/utils/class_weight.py- np.searchsorted(classes_subsample,
sklearn/utils/class_weight.py- classes_full),
sklearn/utils/class_weight.py- mode='clip')
--
# `labels_multiclass` and `labels_binary` are lists of strings (so 1-D)
sklearn/utils/estimator_checks.py: y_names_multiclass = np.take(labels_multiclass, y_multiclass)
sklearn/utils/estimator_checks.py: y_names_binary = np.take(labels_binary, y_binary)
sklearn/utils/estimator_checks.py: y_names_binary = np.take(labels_binary, y_binary)
--
# `nominal_attributes` is a dict, so first argument to `take` is a 1-D object array
sklearn/datasets/_openml.py: np.take(
sklearn/datasets/_openml.py- np.asarray(nominal_attributes.pop(col_name),
sklearn/datasets/_openml.py- dtype='O'),
sklearn/datasets/_openml.py- y[:, i:i + 1].astype(int, copy=False))
--
# `perm` is 1-D
sklearn/model_selection/tests/test_split.py: y_perm = np.take(perm, y)
--
# `perm` is 1-D
sklearn/metrics/tests/test_common.py: y_true_perm = np.take(perm, y_true)
--
# `labels_multiclass` is a list of strings (so 1-D)
sklearn/semi_supervised/tests/test_self_training.py: y_strings = np.take(labels_multiclass, y)
> does not currently support lists/tuples/sequences for the `indices` argument. NumPy et al. allow array-likes. Torch requires a `LongTensor`. It was not immediately clear to me how accepting we should be of non-arrays, so I went with the most conservative choice.
As a rule we don't allow array-like inputs in the array API standard, because (a) it's ill-defined, and (b) it turns out not to be the best idea to allow it anyway. So it's fine to be consistent with that here.
> We could make `axis` optional only in the case of 1-D arrays
This makes sense to me. And it seems useful, based on my analysis of how scikit-learn uses `np.take`.
> The current proposal deviates from supported PyTorch behavior in the following ways:
>
> 1. PyTorch does not currently support an `axis` kwarg.
> 2. PyTorch always flattens the input array.
>
> We'd need to ensure that PyTorch is (1) okay with adding an `axis` kwarg to support taking along a specified axis and (2) okay with making the `axis` kwarg mandatory, thus breaking current behavior (i.e., under the current proposal, there would be no way to index into a flattened array using `take`, which is the current PyTorch behavior).
I don't think there's an issue with progressing this PR:

- `torch.take` with two arguments is actually compatible with `numpy.take` with two arguments,
- adding an `axis` keyword is not controversial for PyTorch, and would be fully backwards compatible if it were to default to `None`,
- the introduction path would add `axis=None` (or better, `dim=None`) first, and then have the appropriate warnings raised before switching the default to `0`,
- the bc-breaking change would be to add `axis=0`; if, as discussed above, the `axis` keyword is mandatory for >=2-D array input, then there's still no risk of silent breakage, just some code churn. And this would be the same for PyTorch as for NumPy,
- I'll note that this will be low-prio for PyTorch to work on though, because there's already `take_along_dim` that does what we propose here (again similar to `np.take_along_axis`), so it's just some churn and doesn't add functionality.
> We could make `axis` optional only in the case of 1-D arrays
Seems OK, since it is compatible with `None` as a default. But is there much reason for it, as in that case `arr[index]` is practically identical (unless you need to ensure a copy and `index` is a scalar)?
(Plausibly, in a few cases sklearn uses it because many years ago `take` was faster and either the myth or the code stuck around. One of the examples uses `mode="clip"` though.)
I am not immediately sure whether `0` or `-1` would make the better default. `None` is useful for reduce-likes, but `take` is maybe not quite reduce-like enough (though it probably inherits the default from there).
> `take_along_dim` that does what we propose here (again similar to `np.take_along_axis`), so it's just some churn but doesn't add functionality.
`take_along_axis` is a very different functionality; it is only identical for 1-D inputs. `take_along_axis` specifically takes an `arr.ndim`-dimensional index array as input.
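The difference is easy to demonstrate in NumPy (illustrative values only): `np.take` applies one shared index list along an axis, while `np.take_along_axis` expects an index array with `arr.ndim` dimensions and selects per-slice:

```python
import numpy as np

a = np.array([[10, 20, 30],
              [40, 50, 60]])

# np.take selects whole slices along `axis` with a shared 1-D index list:
# every row gets columns 0 and 2.
assert np.take(a, [0, 2], axis=1).tolist() == [[10, 30], [40, 60]]

# np.take_along_axis needs an a.ndim-dimensional index array and selects
# per-row: column 0 from row 0, column 2 from row 1.
idx = np.array([[0],
                [2]])
assert np.take_along_axis(a, idx, axis=1).tolist() == [[10], [60]]
```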
> Seems OK, since it is compatible with `None` as a default. But is there much reason for it, as in that case `arr[index]` is practically identical (unless you need to ensure a copy and `index` is a scalar)?
Except integer array indexing is not in the standard (yet, at least), which is why there was so much interest in adding `take`.
> `take_along_axis` is a very different functionality,
Good point, I overlooked that.
As discussed in the Consortium meeting on 22 September 2022, consensus was to limit the `indices` argument to one-dimensional arrays.
For now, the API will not support multi-dimensional arrays for the `indices` kwarg. In the interim, array API consumers should be able to use `reshape` to achieve something similar. If, based on feedback and usage, multi-dimensional `indices` support is desired, we can revisit the specification and update accordingly.
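A sketch of the interim `reshape` workaround, assuming a spec-style `take` that only accepts 1-D `indices` (the helper name `take_nd` is hypothetical and the example assumes a non-negative `axis`):

```python
import numpy as np

def take_nd(x, indices, axis):
    # Hypothetical workaround for multi-dimensional `indices`:
    # flatten the index array, apply a 1-D take, then reshape the result
    # so the index array's shape replaces the taken axis.
    # Assumes `axis` is non-negative.
    flat = np.take(x, indices.ravel(), axis=axis)
    new_shape = x.shape[:axis] + indices.shape + x.shape[axis + 1:]
    return flat.reshape(new_shape)

x = np.arange(6)
idx = np.array([[0, 1],
                [4, 5]])
assert take_nd(x, idx, axis=0).tolist() == [[0, 1], [4, 5]]
```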
This PR has been updated to reflect current consensus.
As discussed in the 3 November 2022 meeting, no objections were raised. This PR should be ready for merge.
@rgommers would you mind taking a last look and then approving?
I went through the whole discussion here again, and it looks like we're all happy and there aren't any remaining blockers. This wasn't an easy one! In it goes, thanks all!