Example files for GEOMETRY and GEOGRAPHY logical type

Open paleolimbot opened this issue 11 months ago • 3 comments

As discussed on the mailing list, it's best to get example files early!

The code to generate the files is below. It requires https://github.com/apache/arrow/compare/main...paleolimbot:arrow:parquet-geo-write-files-from-geoarrow , which is a slightly more functional but less appropriate initial version of https://github.com/apache/arrow/pull/45459 . I've also added the full suite of geoarrow-data files (even the big ones) to the forthcoming geoarrow-data release: https://github.com/geoarrow/geoarrow-data .

import urllib.request
import json

import pyarrow as pa
from pyarrow import parquet
import geoarrow.pyarrow as ga

manifest_url = (
    "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0-rc4/manifest.json"
)
files = {}
with urllib.request.urlopen(manifest_url) as f:
    manifest = json.load(f)
    for group in manifest["groups"]:
        for file in group["files"]:
            if file["format"] == "arrows/wkb":
                files[group["name"] + "_" + file["name"]] = file["url"]

out_dir = "/Users/dewey/gh/parquet-testing/data/geospatial"
ones_that_didnt_work = []
for name, url in files.items():
    # Skip big files + one CRS example that includes a non-PROJJSON value
    # on purpose (allowed in GeoArrow), which is rightly rejected
    # by Parquet
    if (
        "microsoft-buildings" in name
        or ("ns-water" in name and name != "ns-water_water-point")
        or "wkt2" in name
    ):
        print(f"Skipping {name}")
        continue

    # Maintain chunking from IPC into Parquet
    out = f"{out_dir}/{name}.parquet"
    with (
        urllib.request.urlopen(url) as f,
        pa.ipc.open_stream(f) as reader,
        parquet.ParquetWriter(
            out,
            reader.schema,
            store_schema=False,
            compression="none",
            write_geospatial_logical_types=True,
        ) as writer,
    ):
        original_schema = reader.schema
        print(f"Reading {url}")
        for batch in reader:
            writer.write_batch(batch)
        print(f"Wrote {out}")
    
    # Read in original table for comparison
    with (
        urllib.request.urlopen(url) as f,
        pa.ipc.open_stream(f) as reader
    ):
        original_table = reader.read_all()

    print(f"Checking {out}")
    with parquet.ParquetFile(out, arrow_extensions_enabled=True) as f:
        if f.schema_arrow != original_table.schema:
            # Report the same schema that was used in the comparison above
            print(f"Schema mismatch:\n{f.schema_arrow}\nvs\n{original_table.schema}")
            continue

        reread = f.read()
        if reread != original_table:
            print("Table mismatch")
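
The skip rule in the loop above can be pulled out into a small helper so it can be checked without downloading anything. This is only a sketch: `should_skip` is a hypothetical name, and the file names in the usage note are made up for illustration (apart from `ns-water_water-point`, which appears in the script).

```python
def should_skip(name: str) -> bool:
    """Return True for files the generation script leaves out."""
    # Big files: all of microsoft-buildings, and every ns-water file
    # except the small water-point one
    is_big = "microsoft-buildings" in name or (
        "ns-water" in name and name != "ns-water_water-point"
    )
    # CRS example that intentionally carries a non-PROJJSON value
    is_non_projjson_crs = "wkt2" in name
    return is_big or is_non_projjson_crs
```

For example, `should_skip("ns-water_water-point")` is false while `should_skip("ns-water_water-line")` is true, matching the condition in the loop.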

paleolimbot avatar Feb 07 '25 22:02 paleolimbot

@Kontinuation @zhangfengcdt Can you give these a try from Java when you're ready? I'm fairly confident that they are correct, including the "crs" examples that dump the actual payload of the PROJJSON to the file metadata.

paleolimbot avatar Feb 21 '25 10:02 paleolimbot

I pushed an update to three files here: the original fields that PROJJSON CRSes were written to were very likely to collide with each other if you did things like read a Parquet file, filter it, then write it again 😬 . The new files add a hash of the value to the end of the key (e.g., projjson_crs_value_0ffad8372). Totally up for discussion whether that's a good idea or not 🙂 .
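
To make the idea concrete, here is a minimal sketch of deriving a key from a hash of the CRS value. The hash function (SHA-256), the 9-character truncation, the JSON serialization, and the helper name `crs_metadata_key` are all assumptions for illustration; the thread does not say how the hash in the actual files is computed.

```python
import hashlib
import json


def crs_metadata_key(projjson: dict, prefix: str = "projjson_crs_value_") -> str:
    """Build a metadata key whose suffix depends only on the CRS value."""
    # Serialize deterministically so equal CRS values map to equal keys
    payload = json.dumps(projjson, sort_keys=True, separators=(",", ":"))
    # Assumption: SHA-256 truncated to 9 hex digits; the real files may differ
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:9]
    return f"{prefix}{digest}"


def add_crs(metadata: dict, projjson: dict) -> str:
    """Store the CRS payload under its hashed key and return that key."""
    key = crs_metadata_key(projjson)
    # Identical CRS values reuse one entry; different values get different keys
    metadata.setdefault(key, json.dumps(projjson, sort_keys=True))
    return key
```

With a scheme like this, two columns that share a CRS end up pointing at the same key, and a column carried through a read, filter, and write cycle is very unlikely to land on a key that already holds a different CRS.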

paleolimbot avatar Feb 27 '25 22:02 paleolimbot

I updated these to be a bit more intentional about the corner cases we collectively ran into in https://github.com/apache/parquet-java/pull/2971 and https://github.com/apache/arrow/pull/45459. I'm not sure the Python files to generate them belong in this repo, but they do make it easier to see what the files contain. I also included CRS examples because those also required some thought in the C++ PR. Happy to remove or tweak any of these if I didn't get the spirit of the format change right 🙂 .

paleolimbot avatar Apr 04 '25 05:04 paleolimbot

Today at the Parquet sync @emkornfield said he might have some time to review this PR

alamb avatar Apr 30 '25 17:04 alamb

This all seems reasonable, going to merge.

emkornfield avatar Apr 30 '25 18:04 emkornfield

Thank you @emkornfield and @paleolimbot 🙏

alamb avatar Apr 30 '25 19:04 alamb

Thank you both!

paleolimbot avatar Apr 30 '25 19:04 paleolimbot