
Reading sparse tiledb array abruptly exits

Open weidinger-c opened this issue 1 year ago • 17 comments

I have a large sparse array with about 400 million entries (3D point entries with x,y,z as dimensions and some point information like intensity, etc. as attributes). Unfortunately, when I want to read the point cloud with point_cloud = db.df[:], the Python instance just exits without an error. My first thought was not enough memory, but I have 64GB of memory and usage during processing rose to only about 25GB, so I can rule out "out of memory".

I have new insights into this issue: the TileDB query only exits if I create the array with the following settings:

  • tiledb.default_ctx({"sm.dedup_coords": "true"})
  • allows_duplicates=False,
tiledb.default_ctx({"sm.dedup_coords": "true"})

dom = tiledb.Domain(
    tiledb.Dim(
        name="X",
        domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
        dtype=np.float64,
        filters=filter_list,
    ),
    tiledb.Dim(
        name="Y",
        domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
        dtype=np.float64,
        filters=filter_list,
    ),
    tiledb.Dim(
        name="Z",
        domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
        dtype=np.float64,
        filters=filter_list,
    ),
)
# Create schema
schema = tiledb.ArraySchema(
    domain=dom,
    attrs=[
        tiledb.Attr(name="Intensity", dtype=np.uint16),
        tiledb.Attr(name="Label", dtype=np.uint16),
        tiledb.Attr(name="PointId", dtype=np.uint64),
    ],
    cell_order="hilbert",
    tile_order=None,
    capacity=100000,
    sparse=True,
    allows_duplicates=False,
)
# Create the TileDB array
tiledb.SparseArray.create(tiledb_array_name, schema)

weidinger-c avatar Apr 04 '24 08:04 weidinger-c

Hi @weidinger-c, are you able to share any reproduction code? That would help us pinpoint the issue quickly. In any case, we'll try to reproduce with your schema using random data.

ihnorton avatar Apr 04 '24 12:04 ihnorton

Hi @ihnorton, I'll try to upload an example script with some random test file.

weidinger-c avatar Apr 04 '24 12:04 weidinger-c

Hi @ihnorton, here is an example script.

import numpy as np
import laspy
import tiledb
import pandas as pd
import os

if __name__ == "__main__":

    # Name of the array to create.
    tiledb_array_name = "writing_sparse_multiple"

    # Set TileDB default context 
    tiledb.default_ctx({"sm.dedup_coords": "true"})

    # Create tiledb filters like in PDAL
    filter_list = tiledb.FilterList(
        [
            tiledb.FloatScaleFilter(factor=0.0001, offset=0, bytewidth=8),
            # tiledb.DeltaFilter(),
            # tiledb.BitShuffleFilter(),
            # tiledb.ZstdFilter(level=7),
        ]
    )
    dom = tiledb.Domain(
        tiledb.Dim(
            name="X",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
            filters=filter_list,
        ),
        tiledb.Dim(
            name="Y",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
            filters=filter_list,
        ),
        tiledb.Dim(
            name="Z",
            domain=(np.finfo(np.float64).min, np.finfo(np.float64).max),
            dtype=np.float64,
            filters=filter_list,
        ),
    )
    # Create schema
    schema = tiledb.ArraySchema(
        domain=dom,
        attrs=[
            tiledb.Attr(name="Intensity", dtype=np.uint16),
            tiledb.Attr(name="PointId", dtype=np.uint64),
        ],
        cell_order="hilbert",
        tile_order=None,
        capacity=100000,
        sparse=True,
        allows_duplicates=False,
    )

    # Delete the array if it exists
    if tiledb.object_type(tiledb_array_name) == "array":
        tiledb.remove(tiledb_array_name)

    print("tiledb.default_ctx().sm.dedup_coords")
    print(tiledb.default_ctx().config().get("sm.dedup_coords"))

    # Create the TileDB array
    tiledb.SparseArray.create(tiledb_array_name, schema)

    # Create some random las files
    num_files = 10
    num_points = 400000
    point_id = 0
    for i in range(num_files):
        # Create a las file
        las_file = laspy.create(file_version="1.4", point_format=1)
        las_file.header.offset = [50000.0, 0.0, 0.0]
        las_file.header.scale = [0.0001, 0.0001, 0.0001]
        las_file.add_extra_dim(laspy.ExtraBytesParams(name="PointId", type=np.uint64))
        x = np.random.uniform(45000.0, 55000.0, num_points)
        las_file.x = x
        del x
        y = np.random.uniform(-10000.0, 10000.0, num_points)
        las_file.y = y
        del y
        z = np.random.uniform(-10000.0, 10000.0, num_points)
        las_file.z = z
        del z
        las_file.intensity = np.random.randint(0, 65535, num_points)
        las_file.PointId = np.arange(point_id, point_id + num_points)
        point_id += num_points
        las_file.write(f"/media/weidingerc/T7/random/random_points_{i}.las")

    # Iterate over the las files
    las_files_dirpath = "/media/weidingerc/T7/random/"
    
    # Get the las files
    las_files = [las_files_dirpath + f for f in os.listdir(las_files_dirpath) if f.endswith(".las")]

    # Iterate over the las files
    for las_file in las_files:

        print(f"Processing {las_file}")

        # Read the las file
        las_file = laspy.read(las_file)

        # Get the las attributes
        x = las_file["x"]
        y = las_file["y"]
        z = las_file["z"]
        intensity = las_file["intensity"]
        point_id = las_file["PointId"]

        # Open the TileDB array
        A_write = tiledb.open(tiledb_array_name, "w")
        # Write the data to the array
        A_write[x, y, z] = {
            "Intensity": intensity,
            "PointId": point_id
        }
        A_write.close()

    # Read the data from the array
    A_read = tiledb.open(tiledb_array_name, "r")
    print(A_read.schema)
    pd.set_option("display.float_format", lambda x: "%.4f" % x)

    query = A_read.df[:]
    print(query)

    # Get only some unique ids
    query = A_read.query(cond="attr('PointId') >= 0 and attr('PointId') < 1000000")
    print(query.df[:])

weidinger-c avatar Apr 04 '24 13:04 weidinger-c
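An aside on the FloatScaleFilter(factor=0.0001, offset=0, bytewidth=8) used in the script above: it quantizes each float coordinate to an integer on write and reconstructs it on read, so precision is limited to one factor step. A minimal numpy sketch of that round-trip arithmetic (the exact rounding mode inside TileDB may differ):

```python
import numpy as np

# Sketch of the FloatScaleFilter round-trip: on write, each value is
# stored as round((x - offset) / factor) in an integer of the given
# bytewidth; on read it is reconstructed as stored * factor + offset.
factor, offset = 0.0001, 0.0
x = np.array([45000.12345678, 54999.99999, -0.00004], dtype=np.float64)

stored = np.round((x - offset) / factor).astype(np.int64)  # bytewidth=8
restored = stored * factor + offset

# Reconstruction error is bounded by about half a quantization step.
assert np.all(np.abs(restored - x) <= factor / 2 + 1e-12)
print(restored)
```

This is why the filter is lossy for coordinates that are not exact multiples of the factor, which matters when allows_duplicates=False: two distinct input points can quantize to identical coordinates.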

@ihnorton OK, I have run some further tests. It depends on the input data; I now have a set of files where I can reproduce the behaviour. I uploaded them here with WeTransfer https://we.tl/t-sxopXjKA5p

weidinger-c avatar Apr 05 '24 05:04 weidinger-c

Thank you, we are investigating.


https://app.shortcut.com/tiledb-inc/story/44758

ihnorton avatar Apr 05 '24 11:04 ihnorton

@ihnorton Are there any news on this issue?

weidinger-c avatar Apr 24 '24 08:04 weidinger-c

@weidinger-c Sorry for the delay. I believe the following PR will solve your issue: https://github.com/TileDB-Inc/TileDB/pull/4924.

Also, please note that there is something that could be fixed in your schema. For Hilbert, we order the data using a single 64-bit number. To compute this number, we divide the 64 bits equally amongst the dimensions, then divide the domain of each dimension into equal buckets. In your case, each dimension gets 21 bits, so 2097152 buckets. Since the domain goes from float min to float max, your buckets will be very large. In the array you shared, every cell falls into the same bucket, negating the advantage of using Hilbert. I would recommend using tighter bounds on your domain if you already know its limits.
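To make the bucket arithmetic concrete, here is a small back-of-the-envelope sketch (illustration only, not TileDB's actual Hilbert implementation; the 45000–55000 "tight" range is taken from the reproduction script above):

```python
import numpy as np

# 64 bits split equally across 3 dimensions.
total_bits, n_dims = 64, 3
bits_per_dim = total_bits // n_dims        # 21 bits per dimension
buckets_per_dim = 2 ** bits_per_dim        # 2097152 buckets

# Full float64 domain: computed as 2 * (max / buckets) because
# float64_max - float64_min overflows to inf.
full_width = 2 * (np.finfo(np.float64).max / buckets_per_dim)
print(f"bucket width over the full float64 domain: {full_width:.3e}")

# A tight domain like the 10 km X range in the reproduction script
# gives buckets of a few millimetres instead.
tight_width = (55000.0 - 45000.0) / buckets_per_dim
print(f"bucket width over a 10 km domain: {tight_width:.6f} m")
```

With the full float64 domain, the bucket width is on the order of 1e302, so all real-world coordinates land in one bucket; with tight bounds, neighbouring points get distinct Hilbert values.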

KiterLuc avatar Apr 30 '24 16:04 KiterLuc

@KiterLuc Thanks, I will try that out. I only used "Hilbert" and the min/max bounds because it is done the same way in the PDAL TileDB writer implementation, and I thought that was how it should be done for 3D point data. Is there any documentation on when to use which cell_order?

weidinger-c avatar May 01 '24 05:05 weidinger-c

@KiterLuc I now tried out the changes of your PR and suggestion:

  • Setting the cell order to "row-major" works even with the domain at the float min/max values.
  • Keeping the domain at the float min/max values and using Hilbert still leads to the unwanted exit (shouldn't your PR have fixed that by falling back to "row-major"?)
  • Setting the domain to smaller values (e.g. the point cloud's min/max values) and using Hilbert works.
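Regarding the third bullet, one way to get tighter bounds is to derive them from the data before creating the schema. A minimal numpy sketch (the tight_bounds helper is hypothetical, not a TileDB API); the resulting (lo, hi) tuple would be passed as the domain= argument of the corresponding tiledb.Dim:

```python
import numpy as np

def tight_bounds(coords: np.ndarray, pad: float = 1.0):
    """Return (lo, hi) covering coords with a small safety pad
    so edge points stay strictly inside the domain."""
    lo = float(coords.min()) - pad
    hi = float(coords.max()) + pad
    return lo, hi

# Example with random points like those in the reproduction script.
rng = np.random.default_rng(0)
x = rng.uniform(45000.0, 55000.0, 1000)

x_dom = tight_bounds(x)
print(x_dom)
```

If future writes may fall outside the observed range, the pad (or a known survey extent) has to account for that, since a sparse array's domain cannot be grown after creation.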

weidinger-c avatar May 02 '24 10:05 weidinger-c

Thanks @weidinger-c, we have very little documentation on Hilbert. I filed a ticket to get that fixed soon.

Regarding the question in your second message: how did you rebuild TileDB? It should be fixed.

KiterLuc avatar May 02 '24 13:05 KiterLuc

@KiterLuc I rebuilt my Docker image with the TileDB dev branch (which includes your PR).

The shell script which builds tiledb just looks like this (so nothing special):

# Install tiledb corelib
cd ${tiledb_dirpath}
git clone --branch dev https://github.com/TileDB-Inc/TileDB.git
cd TileDB
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local/lib/tiledb ..
make -j 4
make install-tiledb

# Install tileDB python API
pip install -U tiledb && python3 -c "import tiledb"

weidinger-c avatar May 02 '24 13:05 weidinger-c

I wonder if that doesn't get picked up by the python package you have installed?

KiterLuc avatar May 02 '24 13:05 KiterLuc

I installed the Python API via the line I have just added above. But I guess this should be the latest one? Or what do you mean by "picked up"?

weidinger-c avatar May 02 '24 13:05 weidinger-c

@ihnorton that's not going to work, right?

KiterLuc avatar May 02 '24 13:05 KiterLuc

@weidinger-c Can you try installing TileDB-Py from source, e.g. via python3 -m pip install -v -e .? On my machine that picks up /usr/local/lib/libtiledb.so (note that I did not impose another directory, and ldconfig finds it there). You may still need to run sudo ldconfig after the (C++ library) make install-tiledb step.

eddelbuettel avatar May 02 '24 13:05 eddelbuettel

@eddelbuettel I can try, but I am not sure what the aim of this would be. I have rebuilt a Docker image, so there cannot be any remains of previous installations, and I can import and use tiledb in Python. So I guess there should be nothing wrong with the installation?

weidinger-c avatar May 02 '24 13:05 weidinger-c

@weidinger-c the simplest way to test the latest changes is probably to install a nightly from our conda channel: conda create -p test1 -c tiledb/label/nightlies tiledb-py -- this will give you a build of libtiledb and tiledb-py from last night, with the changes @KiterLuc referenced.

ihnorton avatar May 03 '24 02:05 ihnorton

@weidinger-c this fix is available in the latest release on PyPI (and conda-forge). I'll close for now, please let us know if not resolved and we'll reopen to investigate.

ihnorton avatar Jun 15 '24 19:06 ihnorton