lance.torch.data.LanceDataset: torch device produced depends on data type
Specifically, it seems vector columns (here, the fixed-shape tensor embedding) respect the default torch device, while other columns (id and _rowid) are returned on the CPU. I don't believe this is intended.
Repro (requires cuda):
import torch
import lance.torch.data
import pyarrow as pa
import numpy as np
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"default device: {device}")
torch.set_default_device(device)
table = pa.table({
    "id": pa.array([1, 2, 3]),
    "embedding": pa.FixedShapeTensorArray.from_numpy_ndarray(
        np.random.rand(3, 128).astype("float32"))
})
lance.write_dataset(table, "./temp.lance", mode="overwrite")
ds_path = "./temp.lance"
ds = lance.dataset(ds_path)
torch_dataset = lance.torch.data.LanceDataset(
    ds,
    columns=["embedding", "id"],
    batch_size=1024,
    batch_readahead=8,
    with_row_id=True,
)
torch.set_default_device(device)
dataloader = torch.utils.data.DataLoader(torch_dataset)
for batch in dataloader:
    embedding = batch["embedding"][0]
    ids = batch["id"][0]
    rowids = batch["_rowid"][0]
    print(f"embedding device={embedding.device}")
    print(f"ids device={ids.device}")
    print(f"rowids device={rowids.device}")
    break
The above prints the following:
default device: cuda
embedding device=cuda:0
ids device=cpu
rowids device=cpu
Non-priority issue obviously, since .to(device) works just fine as a workaround.
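For anyone hitting the same thing, here is a minimal sketch of that workaround applied to a whole batch at once. The helper name to_device is hypothetical (it is not part of lance or torch), and it assumes each batch is a dict, list, or tuple whose leaves expose a .to(device) method, as torch tensors do. It is duck-typed, so it does not import torch itself.

```python
from typing import Any


def to_device(batch: Any, device: Any) -> Any:
    """Recursively move every tensor-like value in `batch` to `device`.

    Hypothetical helper, not part of lance or torch. Any object with a
    callable `.to(device)` method (e.g. torch.Tensor) is moved; dicts,
    lists, and plain tuples are walked; other values are returned as-is.
    """
    to_method = getattr(batch, "to", None)
    if callable(to_method):
        return to_method(device)
    if isinstance(batch, dict):
        return {key: to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(to_device(value, device) for value in batch)
    return batch
```

In the repro loop this would be called as batch = to_device(batch, device) before reading the columns, so that embedding, id, and _rowid all end up on the same device regardless of which device the loader produced them on.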