dgl icon indicating copy to clipboard operation
dgl copied to clipboard

[BugFIx] Support different dtype for `indptr`, `indices` and `data` in `COOToCSR`

Open Skeleton003 opened this issue 1 year ago β€’ 4 comments

πŸ”¨Work Item

IMPORTANT:

  • This template is only for dev team to track project progress. For feature request or bug report, please use the corresponding issue templates.
  • DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

Background πŸ—ΊοΈ

We now have 4 cpp functions responsible for converting COO to CSR, they are SortedCOOToCSR, UnSortedSmallCOOToCSR, UnSortedSparseCOOToCSR and UnSortedDenseCOOToCSR. The selection of the appropriate COOToCSR function among them is based on a heuristic approach (https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L749).

Bug πŸ›

Currently, all 4 COOToCSR functions are defined with the template template <class IdType>. Let us take UnSortedSparseCOOToCSR as an example. https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L413-L414

In the task of converting COO to CSR, the data type IdType is designated in https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/array.cc#L809 , indicating that IdType is actually equal to the data type of coo.row.

And then, in the current implementation of these COOToCSR functions, the constructed ret_indptr, ret_indices and ret_data are all set to be of dtype IdType. https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L426-L433

This is definitely not right because ret_indptr, ret_indices and ret_data do not necessarily have the same data type. Let's break them down in detail:

  1. ret_indices: Its dtype is the same as coo.row.dtype(IdType). This is the only correct part of the current implementation.
  2. ret_indptr: Its dtype depends on the number of non zero elements(NNZ). If IdType is int32 but NNZ exceeds INT32_MAX, then ret_indptr.dtype should be int64, not the same as ret_indices.
  3. ret_data: Its dtype should be exactly the same as coo.data. It could be int, float or even bool, not guaranteed to be the same as ret_indices.

Question ❓

  • Since the data types of ret_indptr, ret_indices and ret_data may be completely different, why do we set them all as IdType?
  • Furthermore, why do we need IdType? It's only applicable to ret_indices. The dtype of ret_indptr should be determined dynamically; and ret_data are not even guaranteed to have ID-like dtype.

Working Plan 🧭

  1. Remove template <class IdType>, making COOToCSR a non-template function.
  2. Dynamically determine the dtypes of ret_indptr, ret_indices and ret_data as follow.
    1. ret_indices.dtype <- coo.row.dtype,
    1. ret_indptr.dtype <- whether NNZ exceeds INT32_MAX (however, if coo.row is of int64, we set ret_indptr as int64 anyway),
    1. ret_data.dtype <- coo.data.dtype (if coo.data is null, set dtype the same as ret_indptr).

Reference πŸ”—

  1. #699 : This 5-year-old PR implemented the first COOToCSR function with template <DLDeviceType XPU, typename IdType, typename DType> .
  2. #1251 : This 4-year-old PR removed typename DType but kept typename IdType.
  3. #3326 : This 3-year-old PR extended COOToCSR to Sorted, UnSortedDense and UnSortedSparse versions, but still kept typename IdType.

None of these venerable PRs explained why we need IdType. πŸ€”

Acknowledgement πŸ‘

#7364 for reporting this bug.

Skeleton003 avatar Jun 12 '24 15:06 Skeleton003

@frozenbugs @Rhett-Ying @peizhou001 @mfbalin @BarclayII I'd be grateful for your opinions and suggestions on this issue. Because this involves modifying the code written many years ago, I'm afraid I may not have thought it through enough.

Skeleton003 avatar Jun 12 '24 16:06 Skeleton003

I think it's a historical issue that only IdType is used only that does not take more scenarios into consideration.

Dynamically determining the type of indptr, indices, data is the correct way. please go ahead with the changes required.

Before the change, is it possible to add a check for data type to throw exception in the scenario we hit in the issue?

Rhett-Ying avatar Jun 13 '24 00:06 Rhett-Ying

I think it's a historical issue that only IdType is used only that does not take more scenarios into consideration.

Dynamically determining the type of indptr, indices, data is the correct way. please go ahead with the changes required.

Before the change, is it possible to add a check for data type to throw exception in the scenario we hit in the issue?

Yes, please help review #7459 .

Skeleton003 avatar Jun 13 '24 08:06 Skeleton003

After further investigation, I've discovered a possible reason why we need IdType. It's because the approaches of cuda implementation of COOToCSR are different for int32 and int64. See https://github.com/dmlc/dgl/blob/ed50c170dda9627730cb8ee4c7110205b6ea09de/src/array/cuda/coo2csr.cu#L25 and https://github.com/dmlc/dgl/blob/ed50c170dda9627730cb8ee4c7110205b6ea09de/src/array/cuda/coo2csr.cu#L100 .

If there is no way to merge these 2 approaches into an Integral one, we have to keep COOToCSR a template function and keep IdType parameter. But we still need to Dynamically determine the dtypes of ret_indptr, ret_indices and ret_data.

Skeleton003 avatar Jun 17 '24 03:06 Skeleton003