Example files for GEOMETRY and GEOGRAPHY logical type
As discussed on the mailing list, it's best to get example files early!
Code to generate the files is below (requires https://github.com/apache/arrow/compare/main...paleolimbot:arrow:parquet-geo-write-files-from-geoarrow , which is a slightly more functional but less appropriate initial version of https://github.com/apache/arrow/pull/45459 ). I've also added the full suite of geoarrow-data files (even the big ones) to that forthcoming release: https://github.com/geoarrow/geoarrow-data .
```python
import urllib.request
import json

import pyarrow as pa
from pyarrow import parquet
import geoarrow.pyarrow as ga

manifest_url = (
    "https://raw.githubusercontent.com/geoarrow/geoarrow-data/v0.2.0-rc4/manifest.json"
)

files = {}
with urllib.request.urlopen(manifest_url) as f:
    manifest = json.load(f)
    for group in manifest["groups"]:
        for file in group["files"]:
            if file["format"] == "arrows/wkb":
                files[group["name"] + "_" + file["name"]] = file["url"]

out_dir = "/Users/dewey/gh/parquet-testing/data/geospatial"
ones_that_didnt_work = []
for name, url in files.items():
    # Skip big files + one CRS example that includes a non-PROJJSON value
    # on purpose (allowed in GeoArrow), which is rightly rejected
    # by Parquet
    if (
        "microsoft-buildings" in name
        or ("ns-water" in name and name != "ns-water_water-point")
        or "wkt2" in name
    ):
        print(f"Skipping {name}")
        continue

    # Maintain chunking from IPC into Parquet
    out = f"{out_dir}/{name}.parquet"
    with (
        urllib.request.urlopen(url) as f,
        pa.ipc.open_stream(f) as reader,
        parquet.ParquetWriter(
            out,
            reader.schema,
            store_schema=False,
            compression="none",
            write_geospatial_logical_types=True,
        ) as writer,
    ):
        original_schema = reader.schema
        print(f"Reading {url}")
        for batch in reader:
            writer.write_batch(batch)
    print(f"Wrote {out}")

    # Read in original table for comparison
    with (
        urllib.request.urlopen(url) as f,
        pa.ipc.open_stream(f) as reader,
    ):
        original_table = reader.read_all()

    print(f"Checking {out}")
    with parquet.ParquetFile(out, arrow_extensions_enabled=True) as f:
        if f.schema_arrow != original_table.schema:
            print(f"Schema mismatch:\n{f.schema_arrow}\nvs\n{original_schema}")
            continue

        reread = f.read()
        if reread != original_table:
            print("Table mismatch")
```
@Kontinuation @zhangfengcdt Can you give these a try from Java when you're ready? I'm fairly confident that they are correct, including the "crs" examples that dump the actual payload of the PROJJSON to the file metadata.
I pushed an update to three files here - the original fields that PROJJSON CRSes were written to were very likely to collide with each other if you did things like read a Parquet file, filter it, then write it again 😬 . The new files add a hash of the value to the end of the key (e.g., projjson_crs_value_0ffad8372). Totally up for discussion whether that's a good idea or not 🙂 .
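For context, the key scheme could be derived along these lines. This is only a sketch: the hash function and the truncation length are assumptions here, not necessarily what the writer actually does, but the idea is that identical CRS values map to the same key while different values get different suffixes:

```python
import hashlib
import json

projjson = json.dumps({"type": "GeographicCRS", "name": "WGS 84"})

# Assumed scheme: append a truncated digest of the value to the key so that
# merging metadata from multiple files can't silently overwrite a different CRS
suffix = hashlib.sha256(projjson.encode()).hexdigest()[:9]
key = f"projjson_crs_value_{suffix}"
print(key)
```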
I updated these to be a bit more intentional about the corner cases we collectively ran into in https://github.com/apache/parquet-java/pull/2971 and https://github.com/apache/arrow/pull/45459. I'm not sure the Python files used to generate them belong in this repo, but they do make it easier to see what the example files contain. I also included CRS examples because that was also something that required some thinking about in the C++ PR...happy to remove or tweak any of these if I didn't get the spirit of the format change right 🙂 .
Today at the Parquet sync @emkornfield said he might have some time to review this PR
This all seems reasonable, going to merge.
Thank you @emkornfield and @paleolimbot 🙏
Thank you both!