Reading a sparse TileDB array abruptly exits
I have a large sparse array with about 400 million entries (3D points with x, y, z as dimensions and per-point information such as intensity as attributes). Unfortunately, when I try to read the point cloud with point_cloud = db.df[:], the Python instance just exits without an error.
My first thought was insufficient memory, but I have 64 GB of RAM and usage rose to only about 25 GB during processing, so I can rule out "out of memory".
I have new insights into this issue: the TileDB query only exits if I create the array with the following settings:
- tiledb.default_ctx({"sm.dedup_coords": "true"})
- allows_duplicates=False,
tiledb.default_ctx({"sm.dedup_coords": "true"})
dom = tiledb.Domain(
tiledb.Dim(
name="X",
domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
dtype=np.float64,
filters=filter_list,
),
tiledb.Dim(
name="Y",
domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
dtype=np.float64,
filters=filter_list,
),
tiledb.Dim(
name="Z",
domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
dtype=np.float64,
filters=filter_list,
),
)
# Create schema
schema = tiledb.ArraySchema(
domain=dom,
attrs=[
tiledb.Attr(name="Intensity", dtype=np.uint16),
tiledb.Attr(name="Label", dtype=np.uint16),
tiledb.Attr(name="PointId", dtype=np.uint64),
],
cell_order="hilbert",
tile_order=None,
capacity=100000,
sparse=True,
allows_duplicates=False,
)
# Create the TileDB array
tiledb.SparseArray.create(tiledb_array_name, schema)
Hi @weidinger-c, are you able to share any reproduction code? That would help us pinpoint the issue quickly. In any case, we'll try to reproduce with your schema using random data.
Hi @ihnorton, I'll try to upload an example script with some random test data.
Hi @ihnorton, here is an example script.
import numpy as np
import laspy
import tiledb
import pandas as pd
import os

if __name__ == "__main__":
    # Name of the array to create.
    tiledb_array_name = "writing_sparse_multiple"
    # Set TileDB default context
    tiledb.default_ctx({"sm.dedup_coords": "true"})
    # Create tiledb filters like in PDAL
    filter_list = tiledb.FilterList(
        [
            tiledb.FloatScaleFilter(factor=0.0001, offset=0, bytewidth=8),
            # tiledb.DeltaFilter(),
            # tiledb.BitShuffleFilter(),
            # tiledb.ZstdFilter(level=7),
        ]
    )
    dom = tiledb.Domain(
        tiledb.Dim(
            name="X",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
            filters=filter_list,
        ),
        tiledb.Dim(
            name="Y",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
            filters=filter_list,
        ),
        tiledb.Dim(
            name="Z",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
            filters=filter_list,
        ),
    )
    # Create schema
    schema = tiledb.ArraySchema(
        domain=dom,
        attrs=[
            tiledb.Attr(name="Intensity", dtype=np.uint16),
            tiledb.Attr(name="PointId", dtype=np.uint64),
        ],
        cell_order="hilbert",
        tile_order=None,
        capacity=100000,
        sparse=True,
        allows_duplicates=False,
    )
    # Delete the array if it exists
    if tiledb.object_type(tiledb_array_name) == "array":
        tiledb.remove(tiledb_array_name)
    print("tiledb.default_ctx().sm.dedup_coords")
    print(tiledb.default_ctx().config().get("sm.dedup_coords"))
    # Create the TileDB array
    tiledb.SparseArray.create(tiledb_array_name, schema)
    # Create some random las files
    num_files = 10
    num_points = 400000
    point_id = 0
    for i in range(num_files):
        # Create a las file
        las_file = laspy.create(file_version="1.4", point_format=1)
        las_file.header.offset = [50000.0, 0.0, 0.0]
        las_file.header.scale = [0.0001, 0.0001, 0.0001]
        las_file.add_extra_dim(laspy.ExtraBytesParams(name="PointId", type=np.uint64))
        x = np.random.uniform(45000.0, 55000.0, num_points)
        las_file.x = x
        del x
        y = np.random.uniform(-10000.0, 10000.0, num_points)
        las_file.y = y
        del y
        z = np.random.uniform(-10000.0, 10000.0, num_points)
        las_file.z = z
        del z
        las_file.intensity = np.random.randint(0, 65535, num_points)
        las_file.PointId = np.arange(point_id, point_id + num_points)
        point_id += num_points
        las_file.write(f"/media/weidingerc/T7/random/random_points_{i}.las")
    # Iterate over the las files
    las_files_dirpath = "/media/weidingerc/T7/random/"
    # Get the las files
    las_files = [las_files_dirpath + f for f in os.listdir(las_files_dirpath) if f.endswith(".las")]
    # Iterate over the las files
    for las_file in las_files:
        print(f"Processing {las_file}")
        # Read the las file
        las_file = laspy.read(las_file)
        # Get the las attributes
        x = las_file["x"]
        y = las_file["y"]
        z = las_file["z"]
        intensity = las_file["intensity"]
        point_id = las_file["PointId"]
        # Open the TileDB array
        A_write = tiledb.open(tiledb_array_name, "w")
        # Write the data to the array
        A_write[x, y, z] = {
            "Intensity": intensity,
            "PointId": point_id,
        }
        A_write.close()
    # Read the data from the array
    A_read = tiledb.open(tiledb_array_name, "r")
    print(A_read.schema)
    pd.set_option("display.float_format", lambda x: "%.4f" % x)
    query = A_read.df[:]
    print(query)
    # Get only some unique ids
    query = A_read.query(cond="attr('PointId') >= 0 and attr('PointId') < 1000000")
    print(query.df[:])
@ihnorton ok, I have run some further tests. It depends on the input data; I now have a set of files with which I can reproduce the behaviour. I uploaded them via WeTransfer: https://we.tl/t-sxopXjKA5p
Thank you, we are investigating.
https://app.shortcut.com/tiledb-inc/story/44758
@ihnorton Is there any news on this issue?
@weidinger-c Sorry for the delay. I believe the following PR will solve your issue: https://github.com/TileDB-Inc/TileDB/pull/4924.
Also, please note that there is something in your schema that could be fixed. For Hilbert, we order the data using a single 64-bit number; to compute it, we divide the 64 bits equally among the dimensions, then divide the domain of each dimension into equal buckets. In your case each dimension gets 21 bits, i.e. 2,097,152 buckets. Since the domain runs from the float64 minimum to the float64 maximum, the buckets are enormous: in the array you shared, every cell falls into the same bucket, negating the advantage of using Hilbert. I would recommend using tighter bounds on your domain if you already know its limits.
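To make that concrete, here is a rough sketch of the bucket arithmetic described above (illustrative only, not TileDB's actual implementation):

import numpy as np

dims = 3
bits_per_dim = 64 // dims      # 21 bits per dimension
buckets = 2 ** bits_per_dim    # 2,097,152 buckets per dimension

# With the domain at float64 min/max, each bucket spans ~1.7e302
# (dividing before subtracting avoids overflowing to inf), so any
# real-world coordinates all land in the same bucket.
fmin, fmax = np.finfo(np.float64).min, np.finfo(np.float64).max
print(fmax / buckets - fmin / buckets)   # ~1.71e+302

# With a tight domain the buckets become meaningful:
print((55000.0 - 45000.0) / buckets)     # ~0.0048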
@KiterLuc Thanks, I will try that out. I only used "Hilbert" and the min/max bounds because it is done the same way in the PDAL TileDB writer implementation, and I thought that was how it should be done for 3D point data. Is there any documentation on when to use which cell_order?
@KiterLuc I have now tried out the changes from your PR and your suggestion:
- Setting the cell order to "row-major" works, even with the domain at float64 min/max.
- Keeping the domain at float64 min/max and using Hilbert still leads to the unwanted exit (shouldn't your PR have fixed that by falling back to "row-major"?).
- Setting the domain to smaller values (e.g. the point cloud's min/max values) and using Hilbert works; see the sketch below.
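For the third bullet, a minimal sketch of deriving a tight domain from the data (the variable names are illustrative, and the small pad keeps boundary points inside the domain):

import numpy as np
import tiledb

# Illustrative stand-in for the real point-cloud X coordinates.
x = np.random.uniform(45000.0, 55000.0, 1000)

# Derive the dimension domain from the data instead of float64 min/max.
pad = 1.0
dim_x = tiledb.Dim(
    name="X",
    domain=(float(x.min()) - pad, float(x.max()) + pad),
    dtype=np.float64,
)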
Thanks @weidinger-c, we have very little documentation on Hilbert. I filed a ticket to get that fixed soon.
On the question in your second message: how did you rebuild TileDB? It should be fixed.
@KiterLuc I rebuilt my Docker image with the TileDB dev branch (which includes your PR).
The shell script that builds TileDB looks like this (nothing special):
# Install tiledb corelib
cd ${tiledb_dirpath}
git clone --branch dev https://github.com/TileDB-Inc/TileDB.git
cd TileDB
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local/lib/tiledb ..
make -j 4
make install-tiledb
# Install the TileDB Python API
pip install -U tiledb && python3 -c "import tiledb"
I wonder if that is not getting picked up by the Python package you have installed?
I installed the Python API via the line I just added above. But I would guess this is the latest one? Or what do you mean by "picked up"?
@ihnorton that's not going to work, right?
@weidinger-c Can you try installing TileDB-Py from source, e.g. via python3 -m pip install -v -e .? On my machine that picks up /usr/local/lib/libtiledb.so (note that I did not impose another directory), and ldconfig finds it there. You may still need sudo ldconfig after the (C++ library) make install-tiledb step.
@eddelbuettel I can try, but I am not sure what the aim of this would be. I rebuilt the Docker image, so there cannot be any remnants of previous installations, and I can import and use tiledb in Python. So I would guess there is nothing wrong with the installation?
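(One quick way to check which core library the Python package is actually linked against; if pip pulled the PyPI wheel, this reports the wheel's bundled libtiledb rather than the locally built dev one:)

import tiledb

# Version of the core libtiledb this Python package is linked against.
print(tiledb.libtiledb.version())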
@weidinger-c the simplest way to test the latest changes is probably to install a nightly from our conda channel: conda create -p test1 -c tiledb/label/nightlies tiledb-py -- this will give you a build of libtiledb and tiledb-py from last night, with the changes @KiterLuc referenced.
@weidinger-c this fix is available in the latest release on PyPI (and conda-forge). I'll close for now, please let us know if not resolved and we'll reopen to investigate.