PARQUET-2471: Add support for geometry logical type
This is the PoC of https://github.com/apache/parquet-format/pull/240
Jira
- [ ] My PR addresses the following Parquet Jira issues and references
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-XXX
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.
Tests
- [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason:
Commits
- [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines
from "How to write a good git commit message":
- Subject is separated from body by a blank line
- Subject is limited to 50 characters (not including Jira issue reference)
- Subject does not end with a period
- Subject uses the imperative mood ("add", not "adding")
- Body wraps at 72 characters
- Body explains "what" and "why", not "how"
Style
- [ ] My contribution adheres to the code style guidelines and Spotless passes.
- To apply the necessary changes, run
mvn spotless:apply -Pvector-plugins
- To apply the necessary changes, run
Documentation
- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does
Thanks @paleolimbot for the quick review! I've addressed your feedback and also added page index support for geometry stats. BTW, it is a convention that we need PoC on another Parquet implementation to move forward the proposal. I plan to add similar things to parquet-cpp. Do you happen to know any JTS-parity on the C++ side? Or do you have a plan to implement the PoC? I guess there should be a lot of existing code on the GeoParquet side.
@wgtmac The JTS-parity on C++ side is GEOS, which is a C++ port of JTS: https://github.com/libgeos/geos
Note that: GEOS is LGPL 2.1, which is NOT compatible with ASF
Core sedona contributors @Kontinuation and @zhangfengcdt , can you also review this?
@wgtmac Thanks for putting together this PR. I guess you might need some help from the Sedona community regarding the Java geometry implementation together with JTS. We can work together to implement this POC.
Thanks @jiayuasu for your valuable feedback! Yes, it would be great if we can get direct help from geospatial experts!
Do you happen to know any JTS-parity on the C++ side?
As Jia mentioned, it's GEOS, but this is not something the C++ implementation can/should take on as a dependency.
I guess there should be a lot of existing code on the GeoParquet side.
GDAL has quite a lot to draw from ( https://github.com/OSGeo/gdal/tree/master/ogr/ogrsf_frmts/parquet ), although I think pretty much all we need to do is walk WKB for bounding boxes and geometry type ( https://github.com/OSGeo/gdal/blob/12dab86ca2d8b1a65c4c085e137c62294682ac1d/ogr/ogr_wkb.cpp#L387-L584 ).
Or do you have a plan to implement the PoC?
I am happy to give it a shot (I'll make a start this week).
Close this in favor of https://github.com/apache/parquet-java/pull/2971