Add `take` specification for returning elements of an array along a specified axis
This PR

- adds a specification for `take`, an API for returning elements of an array along a specified axis. This API was initially proposed in https://github.com/data-apis/array-api/issues/177.
- supports only a single keyword argument: `axis`. This is derived from comparison data.
- requires that the `axis` keyword argument always be provided, as suggested in comment.
- does not currently support providing a multi-dimensional array of indices. Array libraries supporting multi-dimensional arrays for `indices` treat the input array as a flattened 1-dimensional array. If we were to follow similarly, we'd need to specify the flattened layout.
- does not currently support lists/tuples/sequences for the `indices` argument. NumPy et al. allow array-likes. Torch requires a `LongTensor`. It was not immediately clear to me how accepting we should be of non-arrays, so I went with the most conservative choice.
- for an integer `indices` argument, requires that behavior match multi-axis indexing semantics, as set forth in the specification's indexing guidance.
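The multi-axis indexing requirement in the last bullet can be sketched as follows. This is a hedged illustration using NumPy; the helper name `take_via_indexing` is mine, not part of the spec:

```python
import numpy as np

# Sketch of the proposed semantics: take(x, indices, axis) should match
# multi-axis indexing with `indices` placed at position `axis` and full
# slices everywhere else.
def take_via_indexing(x, indices, axis):
    # `axis % x.ndim` normalizes negative axes
    return x[(slice(None),) * (axis % x.ndim) + (indices,)]

x = np.arange(12).reshape(3, 4)
idx = np.array([3, 1])

# x[:, [3, 1]] and np.take(x, [3, 1], axis=1) agree
assert np.array_equal(take_via_indexing(x, idx, axis=1),
                      np.take(x, idx, axis=1))
```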
Questions

- Should the API be extended to support multi-dimensional `indices`?
- Should the `axis` argument be required?
- Should the `indices` argument be more liberal in the types of allowed arguments?
- Should this PR be blocked until an initial spec release?
Regarding `axis` and flattening, it's worth noting `np.put` doesn't have `axis`, so requiring it here would either break the symmetry with a potential `put` function, or require `put` to be modified from NumPy (see https://github.com/data-apis/array-api/issues/177#issuecomment-1078482747).
Two thoughts:

- We could make `axis` optional only in the case of 1-D arrays.
- I would not necessarily be limited by constraints on `np.put`, which in my opinion is much more niche functionality than `take`. I don't even know if we want `put` in the API standard.
@rgommers Before moving forward with this PR, we need to ensure buy-in from PyTorch. The current proposal deviates from supported PyTorch behavior in the following ways:

- PyTorch does not currently support an `axis` kwarg.
- PyTorch always flattens the input array.
We'd need to ensure that PyTorch is (1) okay with adding an `axis` kwarg to support taking along a specified axis and (2) okay with making the `axis` kwarg mandatory, thus breaking current behavior (i.e., under the current proposal, there would be no way to index into a flattened array using `take`, which is the current PyTorch behavior).
For NumPy-inspired libraries, the `axis` kwarg is supported and, provided the spec-compliant API is in a sub-namespace, `xp.take(x, indices, axis=-1)` can simply wrap a call to `np.take`. Dask already effectively requires an `axis` kwarg, as the default `axis` value is `0`.
Accordingly, PyTorch is the odd one out here at the moment, so it would be good to get a feel for potential specification constraints in order to ensure broad API compatibility, while also addressing fancy-indexing concerns (indexing arrays via arrays) and the proposal to require an `axis`.
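To make the behavioral gap concrete, here is a small NumPy-based sketch (example values are mine): omitting `axis` in `np.take` reproduces the flatten-first behavior that `torch.take` always applies, while the proposal requires an explicit `axis`:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# torch.take-style behavior: the input is treated as flattened.
# NumPy matches this when axis is omitted (axis=None):
assert np.take(x, [0, 3]).tolist() == [1, 4]

# The proposed spec instead always takes along an explicit axis:
assert np.take(x, [1], axis=0).tolist() == [[3, 4]]
assert np.take(x, [1], axis=1).tolist() == [[2], [4]]
```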
> For NumPy-inspired libraries, the `axis` kwarg is supported and, provided the spec-compliant API is in a sub-namespace, `xp.take(x, indices, axis=-1)` can simply wrap a call to `np.take`. Dask already effectively requires an `axis` kwarg, as the default `axis` value is `0`.
This requires a backwards-incompatible change in NumPy as well, when we want to move to support in the main namespace. It'd be a good change to make though. Dask not following the NumPy default is fairly exceptional for Dask, and was probably done because the flattening behavior is indeed not useful (EDIT: it was added in https://github.com/dask/dask/pull/26, no discussion there on axis default).
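The sub-namespace wrapping idea mentioned above can be sketched as follows. This is a minimal illustration under my own assumptions (the keyword-only, mandatory `axis` mirrors the proposal; it is not actual spec or library code):

```python
import numpy as np

# Hypothetical spec-compliant `take` living in a sub-namespace (`xp`):
# callers must always supply `axis`, so it is keyword-only with no default.
def take(x, indices, *, axis):
    return np.take(x, indices, axis=axis)

x = np.array([[1, 2, 3],
              [4, 5, 6]])

# Take columns 2 and 0 along the last axis:
assert take(x, np.array([2, 0]), axis=-1).tolist() == [[3, 1], [6, 4]]
```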
I just had a look at SciPy, and it has quite a few `np.take` usages but none with the default `axis=None`. Scikit-learn does have some; I need to look in a little more detail to verify whether those are all for 1-D arrays or not.
I looked at all usages in scikit-learn and most inputs to `np.take` are 1-D, in which case the `axis` argument is not used/needed, and for the one case with 2-D array input the code uses `axis=0`:
# `processed` is 1-D
sklearn/cluster/_optics.py: unproc = np.compress(~np.take(processed, indices), indices)
--
# `X` is 2-D, that's why this uses `axis=0`
sklearn/cluster/_optics.py: dists = pairwise_distances(P, np.take(X, unproc, axis=0),
--
# `reachability_` is 1-D
sklearn/cluster/_optics.py: improved = np.where(rdists < np.take(reachability_, unproc))
--
# `compute_class_weight` returns a 1-D array
sklearn/utils/class_weight.py: weight_k = np.take(compute_class_weight(class_weight_k,
sklearn/utils/class_weight.py- classes=classes_subsample,
sklearn/utils/class_weight.py- y=y_subsample),
sklearn/utils/class_weight.py- np.searchsorted(classes_subsample,
sklearn/utils/class_weight.py- classes_full),
sklearn/utils/class_weight.py- mode='clip')
--
# `labels_multiclass` and `labels_binary` are lists of strings (so 1-D)
sklearn/utils/estimator_checks.py: y_names_multiclass = np.take(labels_multiclass, y_multiclass)
sklearn/utils/estimator_checks.py: y_names_binary = np.take(labels_binary, y_binary)
sklearn/utils/estimator_checks.py: y_names_binary = np.take(labels_binary, y_binary)
--
# `nominal_attributes` is a dict, so first argument to `take` is a 1-D object array
sklearn/datasets/_openml.py: np.take(
sklearn/datasets/_openml.py- np.asarray(nominal_attributes.pop(col_name),
sklearn/datasets/_openml.py- dtype='O'),
sklearn/datasets/_openml.py- y[:, i:i + 1].astype(int, copy=False))
--
# `perm` is 1-D
sklearn/model_selection/tests/test_split.py: y_perm = np.take(perm, y)
--
# `perm` is 1-D
sklearn/metrics/tests/test_common.py: y_true_perm = np.take(perm, y_true)
--
# `labels_multiclass` is a list of strings (so 1-D)
sklearn/semi_supervised/tests/test_self_training.py: y_strings = np.take(labels_multiclass, y)
> does not currently support lists/tuples/sequences for the `indices` argument. NumPy et al. allow array-likes. Torch requires a `LongTensor`. It was not immediately clear to me how accepting we should be of non-arrays, so I went with the most conservative choice.
As a rule we don't allow array-like inputs in the array API standard, because (a) it's ill-defined, and (b) it turns out not to be the best idea to allow it anyway. So it's fine to be consistent with that here.
> We could make `axis` optional only in the case of 1-D arrays
This makes sense to me. And it seems useful, based on my analysis of how scikit-learn uses `np.take`.
> The current proposal deviates from supported PyTorch behavior in the following ways:
>
> 1. PyTorch does not currently support an `axis` kwarg.
> 2. PyTorch always flattens the input array.
>
> We'd need to ensure that PyTorch is (1) okay with adding an `axis` kwarg to support taking along a specified axis and (2) okay with making the `axis` kwarg mandatory, thus breaking current behavior (i.e., under the current proposal, there would be no way to index into a flattened array using `take`, which is the current PyTorch behavior).
I don't think there's an issue with progressing this PR:

- `torch.take` with two arguments is actually compatible with `numpy.take` with two arguments,
- adding an `axis` keyword is not controversial for PyTorch, and would be fully backwards compatible if it were to default to `None`,
- the introduction path would add `axis=None` (or better, `dim=None`) first, and then have the appropriate warnings raised before switching the default to `0`,
- the bc-breaking change would be to add `axis=0`; if, as discussed above, the `axis` keyword is mandatory for >=2-D array input, then there's still no risk of silent breakage, just some code churn. And this would be the same for PyTorch as for NumPy,
- I'll note that this will be low-prio for PyTorch to work on though, because there's already `take_along_dim` that does what we propose here (again similar to `np.take_along_axis`), so it's just some churn and doesn't add functionality.
> We could make `axis` optional only in the case of 1-D arrays
Seems OK, since it is compatible with `None` as a default. But is there much reason for it, as in that case `arr[index]` is practically identical (unless you need to ensure a copy and `index` is a scalar)?
(Plausibly, in a few cases sklearn uses it because many years ago `take` was faster and either the myth or the code stuck around. One of the examples uses `mode="clip"` though.)
I am not immediately sure whether `0` or `-1` would make the better default. `None` is useful for reduce-likes, but `take` is maybe not quite reduce-like enough (though it probably inherits the default from there).
> `take_along_dim` that does what we propose here (again similar to `np.take_along_axis`), so it's just some churn but doesn't add functionality.
`take_along_axis` is a very different functionality; it is only identical for 1-D inputs. `take_along_axis` specifically takes an `arr.ndim`-dimensional index array as input.
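The difference is easy to demonstrate in NumPy (illustrative values only): `np.take` applies one shared index list along an axis, while `np.take_along_axis` expects an index array with `arr.ndim` dimensions and selects per-slice:

```python
import numpy as np

a = np.array([[10, 20, 30],
              [40, 50, 60]])

# np.take selects whole slices along `axis` with a shared 1-D index list:
# every row gets columns 0 and 2.
assert np.take(a, [0, 2], axis=1).tolist() == [[10, 30], [40, 60]]

# np.take_along_axis needs an a.ndim-dimensional index array and selects
# per-row: column 0 from row 0, column 2 from row 1.
idx = np.array([[0],
                [2]])
assert np.take_along_axis(a, idx, axis=1).tolist() == [[10], [60]]
```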
> Seems OK, since it is compatible with `None` as a default. But is there much reason for it, as in that case `arr[index]` is practically identical (unless you need to ensure a copy and `index` is a scalar)?
Except integer array indexing is not in the standard (yet, at least), which is why there was so much interest in adding `take`.
> `take_along_axis` is a very different functionality,
Good point, I overlooked that.
As discussed in the Consortium meeting on 22 September 2022, consensus was to limit the `indices` argument to one-dimensional arrays.
For now, the API will not support multi-dimensional arrays for the `indices` kwarg. In the interim, array API consumers should be able to use `reshape` to achieve something similar. If, based on feedback and usage, multi-dimensional `indices` support is desired, we can revisit the specification and update accordingly.
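A sketch of the interim `reshape` workaround, assuming a spec-style `take` that only accepts 1-D `indices` (the helper name `take_nd` is hypothetical and the example assumes a non-negative `axis`):

```python
import numpy as np

def take_nd(x, indices, axis):
    # Hypothetical workaround for multi-dimensional `indices`:
    # flatten the index array, apply a 1-D take, then reshape the result
    # so the index array's shape replaces the taken axis.
    # Assumes `axis` is non-negative.
    flat = np.take(x, indices.ravel(), axis=axis)
    new_shape = x.shape[:axis] + indices.shape + x.shape[axis + 1:]
    return flat.reshape(new_shape)

x = np.arange(6)
idx = np.array([[0, 1],
                [4, 5]])
assert take_nd(x, idx, axis=0).tolist() == [[0, 1], [4, 5]]
```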
This PR has been updated to reflect current consensus.
As discussed in the 3 November 2022 meeting, no objections were raised. This PR should be ready for merge.
@rgommers would you mind taking a last look and then approving?
I went through the whole discussion here again, and it looks like we're all happy and there aren't any remaining blockers. This wasn't an easy one! In it goes, thanks all!