lance.torch.data.LanceDataset: torch device produced depends on data type

Open jacketsj opened this issue 1 year ago • 0 comments

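Specifically, it seems vector (fixed-shape tensor) columns respect the default torch device, while other column types, such as the integer id column and _rowid, are returned on the CPU. I don't believe this is intended.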

Repro (requires cuda):

import torch
import lance.torch.data
import pyarrow as pa
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"default device: {device}")
torch.set_default_device(device)

table = pa.table({
   "id": pa.array([1, 2, 3]),
   "embedding": pa.FixedShapeTensorArray.from_numpy_ndarray(
       np.random.rand(3, 128).astype("float32"))
})
ds_path = "./temp.lance"
lance.write_dataset(table, ds_path, mode="overwrite")
ds = lance.dataset(ds_path)
torch_dataset = lance.torch.data.LanceDataset(
    ds,
    columns=["embedding", "id"],
    batch_size=1024,
    batch_readahead=8,
    with_row_id=True,
)
torch.set_default_device(device)
dataloader = torch.utils.data.DataLoader(torch_dataset)
for batch in dataloader:
    embedding = batch["embedding"][0]
    ids = batch["id"][0]
    rowids = batch["_rowid"][0]
    print(f"embedding device={embedding.device}")
    print(f"ids device={ids.device}")
    print(f"rowids device={rowids.device}")
    break

The above prints the following:

default device: cuda
embedding device=cuda:0
ids device=cpu
rowids device=cpu

Non-priority issue obviously, since .to(device) works just fine as a workaround.
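
For anyone hitting this in the meantime, the workaround can be sketched as a small helper that moves every tensor in a batch onto one device. Note that move_batch_to_device is a hypothetical name, not part of lance, and it assumes each batch is a dict mapping column names to tensors, as in the repro above:

```python
from typing import Any, Dict, Mapping


def move_batch_to_device(batch: Mapping[str, Any], device: Any) -> Dict[str, Any]:
    """Return a copy of `batch` with every tensor-like value moved to `device`.

    Hypothetical helper (not part of lance). Any value exposing a `.to()`
    method, such as a torch.Tensor, is moved; other values pass through as-is.
    """
    return {
        name: value.to(device) if hasattr(value, "to") else value
        for name, value in batch.items()
    }
```

In the repro loop this would be called as batch = move_batch_to_device(batch, device) before reading the columns, after which all three columns should report the same device.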

jacketsj • Aug 28 '24 00:08