Add CQL to PostgreSQL provider (via pygeofilter and sqlalchemy)
Overview
This merge request adds CQL query functionality to the PostgreSQL provider. Using CQL allows more powerful queries than simple equality in a WHERE clause, including spatial operators such as INTERSECTS or CROSSES.
CQL is passed to the API as plain text via a new parameter (cql=) in the URL. The CQL is parsed into an abstract syntax tree (AST) in the API using pygeofilter, and parsing errors are handled there. In the backend, the query function calls a new function, query_cql, when CQL has been provided. This function uses SQLAlchemy's "reflect" capability to generate a TableModel for the table and uses pygeofilter to create a SQLAlchemy filter from the CQL AST. This filter is used to query the table for the matching data, which are then converted to GeoJSON-like dictionaries to be returned.
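For illustration, here is a minimal sketch of that flow outside pygeoapi, assuming the OSM test data described below; the connection string is a placeholder and the exact to_filter signature is worth double-checking against pygeofilter:

import geoalchemy2  # noqa: F401 - registers the PostGIS geometry type for reflection
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import Session
from pygeofilter.parsers.ecql import parse
from pygeofilter.backends.sqlalchemy.evaluate import to_filter

# Reflect the existing table into a SQLAlchemy model (SQLAlchemy 1.4-style automap)
engine = create_engine('postgresql://postgres:postgres@localhost:5432/test')
Base = automap_base()
Base.prepare(engine, reflect=True, schema='osm')  # the OSM test data sits in the osm schema
TableModel = Base.classes.hotosm_bdi_waterways

# Parse the CQL text into an AST, then turn the AST into a SQLAlchemy filter
cql_ast = parse("name ILIKE 'muha'")
field_mapping = {column.name: getattr(TableModel, column.name)
                 for column in TableModel.__table__.columns}
cql_filter = to_filter(cql_ast, field_mapping)

# Query the filtered rows; each row is then converted to a GeoJSON-like dict
with Session(engine) as session:
    features = session.query(TableModel).filter(cql_filter).limit(10).all()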
The new features work and are covered by tests (in test_api.py and test_postgresql.py) but this merge request is presented as a work-in-progress so that the best way to integrate the new functionality can be discussed.
To test
- Use the instructions in pygeoapi/provider/postgresql.py to start a local Docker container populated with OSM test data.
- Use the test config: export PYGEOAPI_CONFIG=tests/pygeoapi-test-config-postgresql.yml
- Start pygeoapi: pygeoapi serve
Try some queries:
- ILIKE: http://localhost:5000/collections/hot_osm_waterways/items?f=html&cql=name%20ILIKE%20%27muha%27
- BBOX: http://localhost:5000/collections/hot_osm_waterways/items?f=html&cql=BBOX(foo_geom,%2029,%20-2.8,%2029.2,%20-2.9)&limit=100
- CROSSES: http://localhost:5000/collections/hot_osm_waterways/items?f=html&cql=CROSSES(foo_geom,%20LINESTRING(28%20-2,%2030%20-4))&limit=100
- Bad query: http://localhost:5000/collections/hot_osm_waterways/items?cql=osm_id%20EATS%201
Additional Information
- ECQL guide (from GeoServer docs): https://docs.geoserver.org/stable/en/user/filter/ecql_reference.html#filter-ecql-reference
- CQL tutorial: https://docs.geoserver.org/latest/en/user/tutorials/cql/cql_tutorial.html
Questions
- Should we also accept CQL via a POST request like the ElasticSearch backend does?
- Where should the documentation for this go?
Using SQLAlchemy for the whole PostgreSQL backend
If we went fully SQLAlchemy for the PostgreSQL implementation, we could apply the WHERE and BBOX clauses from the API call and the CQL filter with the same code.
where_filter = ... # generated from API input
bbox_filter = ... # generated from API input
cql_filter = ... # generated from API cql_ast input
q = (session.query(TableModel)
.filter(where_filter)
.filter(bbox_filter)
.filter(cql_filter)
.order_by(*order_by_clauses)
.offset(offset)
.limit(limit))
If any of the provided filters is None it will (should) just be ignored.
Connection pooling
Do we want to have persistent connections? These can save 0.5 seconds per API request. A new PostgreSQLProvider is instantiated for each API request so these would need to held within some singleton item that belonged to the postgresql.py module (e.g. an _active_connections dictionary). We could use the connection parameters concatenated in a string as the key because it has to be hashable and unique.
Also, connections can time out, so we would have to check that the connection was still active before we tried to use it. The process to get a connection would be (see the sketch below):
- generate connection key
- check for connection in store
- if there is one, use it
- if there isn't one, or if an attempt to use it fails, create a new one, store it then use it
- if the attempt to create one fails, raise an error
We could pass "application_name" to the server on connection so that we can recognise the connection.
https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS
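A rough sketch of that process, assuming a psycopg2-style connection and purely illustrative names (_ACTIVE_CONNECTIONS and get_connection are not existing pygeoapi code):

import psycopg2

# Hypothetical module-level store of persistent connections, keyed by the
# concatenated connection parameters (hashable and unique per data source)
_ACTIVE_CONNECTIONS = {}


def get_connection(conn_params):
    """Return a working connection, reusing a stored one where possible.

    conn_params is assumed to hold only libpq keywords (host, dbname, user, password, ...).
    """
    key = ' '.join(f'{k}={v}' for k, v in sorted(conn_params.items()))

    connection = _ACTIVE_CONNECTIONS.get(key)
    if connection is not None:
        try:
            # Cheap liveness check; stored connections can time out
            with connection.cursor() as cursor:
                cursor.execute('SELECT 1')
            return connection
        except psycopg2.Error:
            pass  # stale connection; fall through and create a new one

    # No stored connection, or it failed: create a new one, store it, use it.
    # psycopg2.connect itself raises OperationalError if this fails.
    connection = psycopg2.connect(application_name='pygeoapi', **conn_params)
    _ACTIVE_CONNECTIONS[key] = connection
    return connection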
Notes on the specific SQLAlchemy / pygeofilter approach
pycsw creates SQLAlchemy table models from Base programmatically using the type() function. Our approach is to "reflect" the table model from the database.
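For comparison, a minimal sketch of the two styles (the records table and its columns are purely illustrative):

from sqlalchemy import Column, Text, create_engine
from sqlalchemy.ext.automap import automap_base
from sqlalchemy.orm import declarative_base

# pycsw style: build the table model programmatically with type()
Base = declarative_base()
Record = type('Record', (Base,), {
    '__tablename__': 'records',
    'identifier': Column(Text, primary_key=True),
    'title': Column(Text),
})

# Our style: reflect the table model from the existing database table
engine = create_engine('postgresql://postgres:postgres@localhost:5432/test')
AutomapBase = automap_base()
AutomapBase.prepare(engine, reflect=True, schema='osm')
Waterways = AutomapBase.classes.hotosm_bdi_waterways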
Contributions and Licensing
(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)
- [x] I'd like to contribute [feature] to pygeoapi. I confirm that my contributions to pygeoapi will be compatible with the pygeoapi license guidelines at the time of contribution.
- [ ] I have already previously agreed to the pygeoapi Contributions and Licensing Guidelines
Great work! Some additional comments:
- we need to add pygeofilter to requirements.txt and debian/control
- Using pygeofilter gives us the opportunity to extend this PR for HTTP POST CQL as JSON. Currently (depending on HTTP GET or POST) we have different codepaths. I wonder whether we can consolidate this behaviour through a single .../items handler to detect GET or POST and process accordingly. This would also allow us to lighten our use of https://github.com/geopython/pygeoapi/tree/master/pygeoapi/models, thereby pushing all CQL handling to pygeofilter proper. Now that I think about it more, we can cover this in another dedicated PR.
- we can remove pygeofilter_sqlalchemy_demo.py (as long as we have the proper tests, that should suffice)
cc @constantinius
Hi @tomkralidis,
Thanks for the rapid feedback! I'm still composing the main merge request text after we had a forking mishap and lost the previous draft that I had written. I'll let you know when it is final.
I've pushed some extra tweaks, too.
OK, the merge request is ready. I think @KoalaGeo will arrange a call where we can talk it through in more detail.
Hi,
@ximenesuk and I are booked onto other projects for the next few weeks. We will come back to this during the week of 19 September.
In response to some of the comments so far:
- [x] We will remove setup.sh
- [x] We will generalise exception handling
- [x] We will change the parameter name to filter and add a filter-lang option
- [x] We will separate the API tests into success and different failure conditions
In this simple implementation, if you specify cql then the other query parameters are ignored. The best way to handle both query options is to replace the entire query function with a SQLAlchemy-based version. We wanted to show this proof-of-concept before doing so, but I think that is the approach that we should take.
Hopefully we can have a call before then, too.
@tomkralidis @francbartoli what are your thoughts on CQL having a new param, e.g. cql: true/enable or false/disable?
There might be collections where a user doesn't want to offer that level of access/control
I would delegate that to an external service. CQL in pygeoapi proper should be supported by default (the exception being where a given provider does not handle CQL yet, which would throw an exception).
We are working on this this week. The changes that we plan to make are:
- [x] Update documentation to describe CQL with PostgreSQL provider
- [x] Refactor PostgreSQL backend to be pure SQLAlchemy instead of current hybrid form
- [x] Update API to allow POSTing of CQL as JSON to the PostgreSQL provider
@tomkralidis @francbartoli We have merged some work into the branch that addresses the points raised by @volcan01010 in https://github.com/geopython/pygeoapi/pull/964#issuecomment-1235265715. The changes also address two of the points in https://github.com/geopython/pygeoapi/pull/964#issuecomment-1252078191, namely:
- Update documentation to describe CQL with PostgreSQL provider
- Update API to allow POSTing of CQL as JSON to the PostgreSQL provider
This last part has raised a few issues relating to testing and code. The changes include tests for the POSTing of JSON to the API for a PostgreSQL provider. However, we are testing this in isolation, using just that one provider with specific data. How can we run these tests using another provider that accepts the POST requests (is ElasticSearch the only one?)? Can your pipelines be run to check for issues arising from this? Do your pipelines test the API for multiple providers? We may need to isolate the Postgres tests using pytest markers or some other way if there are issues here.
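If we do go down the marker route, it could look something like this (the postgres marker name is hypothetical and would need registering in setup.cfg or pytest.ini):

import pytest

# Hypothetical marker; every test in this module needs the PostgreSQL test database
pytestmark = pytest.mark.postgres


def test_post_cql_json_returns_filtered_features():
    ...  # POST CQL-JSON to /collections/hot_osm_waterways/items and check the response

Pipelines without the test database could then run pytest -m "not postgres" to exclude these tests.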
In adding support for the POST request to the API we have also needed to refactor parts of your code, in particular the work in this commit: https://github.com/geopython/pygeoapi/pull/964/commits/67b8ed6142846912e34240255bb404ceb8f8482c. We have slightly reworked the filter-lang handling to retain the variable name, to aid readability further down after the logging. We then use the provider to determine how the CQL is parsed.
The API POST, if the Postgres provider is set up according to https://github.com/geopython/pygeoapi/blob/master/pygeoapi/provider/postgresql.py#L33 , can be tested from the command line using:
curl --location \
--request POST 'http://localhost:5000/collections/hot_osm_waterways/items?f=json&limit=50&filter-lang=cql-json' \
--header 'Content-Type: application/query-cql-json' \
--data-raw '{"and": [{"between": {"value": {"property": "osm_id"}, "lower": 80800000, "upper": 80900000}}, {"isNull": {"property": "name"}}]}'
Any comments most welcome.
I have some questions about the provider.get() method. As part of the response it includes next and prev fields with the IDs of the adjacent records. For the first record in the table, the prev is set to the same record ID (i.e. the first record), and for the last record in the table next is set to the last record. See the code here:
https://github.com/geopython/pygeoapi/blob/e2e12e9bd4395b453cc382084cd7c33172662b3a/pygeoapi/provider/postgresql.py#L415
It seems to me that this will cause problems with looping as consumers of the data will never know when they have exhausted the table. The OGC API specs suggest that the next field should only be present if another record exists.
- Do you think that the response should be changed so that prev/next only exist if they are valid?
- If so, should I change that as part of this refactor or submit it as a separate issue?
- If it is changed, will it affect other providers / the HTML output?
Also, related to the previous questions: should the next and prev values be links? The server URL would need to be added by the API, as the provider doesn't know about it. It doesn't seem like they are used in the response anyway, e.g.
https://ogcapi.bgs.ac.uk/collections/historicalearthquakes/items/8?f=json
Hi @tomkralidis, @francbartoli,
We have finished our refactoring and think this is ready to merge now. As described above (https://github.com/geopython/pygeoapi/pull/964#issuecomment-1252078191) we have updated the documentation, added the ability to POST CQL as JSON to the PostgreSQL provider and refactored the entire backend to use SQLAlchemy.
The slight caveats are:
- We haven't tested the API changes against Elastic Search (https://github.com/geopython/pygeoapi/pull/964#issuecomment-1253471554)
- The next and previous behaviour is unchanged (https://github.com/geopython/pygeoapi/issues/990)
In terms of the backend refactor, the diff isn't much use as so much has changed, so it is easier to read the new file. The SQLAlchemy logic should make sense and the individual commits should show where different logic was added. The trickiest bit was generating the table model. That shouldn't need any future changes so further enhancements (e.g. adding editing capability) shouldn't need deep SQLAlchemy knowledge.
Here are some metrics on the changes.
| Metric | Previous | This branch |
|---|---|---|
| postgresql.py lines of code | 461 | 396 |
| Tests count (postgresql.py + api.py) | 45 | 76 |
| Coverage (postgresql.py) | 92% | 99% |
| Coverage (api.py) | 61% | 67% |
For information: I did some basic load tests on the new version to check that SQLAlchemy connection pooling was working. They showed that the new version is about 33% slower.
Number of iterations of a multi-endpoint GET script in 30 seconds:
| master | this branch |
|---|---|
| 60 | 40 |
This was a surprise to me as I expected the connection pooling would have sped things up. Perhaps getting a connection is not the bottleneck. I was running both pygeoapi and the k6 load tester on the same machine so this may have skewed the results as k6, pygeoapi and PostgreSQL were all using the same CPU / RAM. It would be interesting to see results on a fully-deployed system.
I also tried using curl for individual requests. With these, the response times were erratic and between 0.14 and 0.2 seconds for both the old and new implementations, so I couldn't really tell the difference.
curl --output /dev/null http://localhost:5000/collections/hot_osm_waterways/items/13990765?f=json -w 'Total: %{time_total}s\n'
k6 load tests
I used the k6 tool to run this script that queried multiple endpoints (test-postgresql config):
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  vus: 20,
  duration: '30s',
};

export default function () {
  http.get('http://localhost:5000/collections/hot_osm_waterways/items/13990765?f=json');
  http.get('http://localhost:5000/collections/hot_osm_waterways/items/30030186?f=json');
  http.get('http://localhost:5000/collections/hot_osm_waterways/items/80827787?f=json');
  http.get('http://localhost:5000/collections/hot_osm_waterways/items/124128967?f=json');
  http.get('http://localhost:5000/collections/hot_osm_waterways/items/150771186?f=json');
  http.get('http://localhost:5000/collections/hot_osm_waterways/items?f=json&resulttype=hits');
  http.get('http://localhost:5000/collections/hot_osm_waterways/items?f=json&limit=20&offset=100');
}
Run with k6 Docker container:
docker run --rm -i --network host loadimpact/k6 run - < k6_test.js
First of all @volcan01010, this is an impressive improvement! +1 to merge it but let's wait for more consensus since the rework with the provider is quite huge. Also, I'd like to test the performance given your previous comments
I wanted to see why the SQLAlchemy version was slower, so I wrote the following script to time multiple requests, only looking at the provider.
import os
import time

from pygeoapi.provider.postgresql import PostgreSQLProvider

NUM_REQUESTS = 100
PASSWORD = os.environ.get('POSTGRESQL_PASSWORD', 'postgres')


def instantiate_and_query():
    config = {
        'name': 'PostgreSQL',
        'type': 'feature',
        'data': {'host': '127.0.0.1',
                 'dbname': 'test',
                 'user': 'postgres',
                 'password': PASSWORD,
                 'search_path': ['osm', 'public']
                 },
        'id_field': 'osm_id',
        'table': 'hotosm_bdi_waterways',
        'geom_field': 'foo_geom'
    }
    provider = PostgreSQLProvider(config)
    _ = provider.get(29701937)


if __name__ == "__main__":
    start = time.time()
    for _ in range(NUM_REQUESTS):
        instantiate_and_query()
    elapsed = time.time() - start
    print(f"{NUM_REQUESTS} requests processed in {elapsed:.1f} seconds.")
I then ran it against the master and pygeofilter-sqlalchemy branches. They came out about the same:
# master
100 requests processed in 7.1 seconds.
100 requests processed in 6.6 seconds.
100 requests processed in 6.4 seconds.
100 requests processed in 6.3 seconds.
100 requests processed in 6.6 seconds.
# pygeofilter-sqlalchemy
100 requests processed in 6.7 seconds.
100 requests processed in 6.4 seconds.
100 requests processed in 6.0 seconds.
100 requests processed in 5.8 seconds.
100 requests processed in 6.0 seconds.
After that, I ran the script with pyinstrument to see where it was spending its time (pyinstrument bottleneck.py):
100 requests processed in 6.7 seconds.
_ ._ __/__ _ _ _ _ _/_ Recorded: 22:13:54 Samples: 4506
/_//_/// /_\ / //_// / //_'/ // Duration: 7.807 CPU time: 4.394
/ _/ v4.3.0
Program: bottleneck.py
7.807 <module> <string>:1
[4 frames hidden] <string>, runpy
7.807 _run_code runpy.py:64
└─ 7.807 <module> bottleneck.py:1
├─ 6.707 instantiate_and_query bottleneck.py:10
│ ├─ 3.637 get pygeoapi/provider/postgresql.py:185
│ │ ├─ 2.253 first sqlalchemy/orm/query.py:2795
│ │ │ [476 frames hidden] sqlalchemy, encodings, weakref, <buil...
│ │ │ 1.837 do_execute sqlalchemy/engine/default.py:731
│ │ │ └─ 1.836 [self]
│ │ ├─ 1.016 get <string>:1
│ │ │ [518 frames hidden] <string>, sqlalchemy, <built-in>, wea...
│ │ ├─ 0.130 query sqlalchemy/orm/session.py:2153
│ │ │ [124 frames hidden] sqlalchemy, weakref, <built-in>, logg...
│ │ ├─ 0.108 __exit__ sqlalchemy/orm/session.py:1166
│ │ │ [37 frames hidden] sqlalchemy, <built-in>, threading
│ │ └─ 0.086 _sqlalchemy_to_feature pygeoapi/provider/postgresql.py:298
│ └─ 3.067 __init__ pygeoapi/provider/postgresql.py:74
│ └─ 2.995 _reflect_table_model pygeoapi/provider/postgresql.py:258
│ ├─ 2.685 reflect sqlalchemy/sql/schema.py:4729
│ │ [1045 frames hidden] sqlalchemy, <string>, encodings, <bui...
│ └─ 0.287 prepare <string>:1
│ [317 frames hidden] <string>, sqlalchemy, abc, <built-in>...
├─ 0.706 <module> pygeoapi/provider/postgresql.py:49
│ ├─ 0.383 <module> geoalchemy2/__init__.py:1
│ │ [840 frames hidden] geoalchemy2, pkg_resources, <built-in...
│ ├─ 0.202 <module> sqlalchemy/__init__.py:8
│ │ [524 frames hidden] sqlalchemy, textwrap, <built-in>, ins...
│ └─ 0.079 <module> sqlalchemy/ext/automap.py:8
│ [185 frames hidden] sqlalchemy, textwrap, <built-in>, re,...
└─ 0.392 <module> pygeoapi/__init__.py:30
└─ 0.381 <module> pygeoapi/config.py:30
├─ 0.286 <module> pygeoapi/util.py:30
│ └─ 0.151 <module> shapely/geometry/__init__.py:1
│ [370 frames hidden] shapely, numpy, pathlib, <built-in>, ...
└─ 0.087 <module> jsonschema/__init__.py:1
[291 frames hidden] jsonschema, urllib, http, email, re, ...
These results show that initialising the class takes as long as running the "get" query because it has to "reflect" the table. This means each query makes two requests to the database. But the table model doesn't change each time. I'm going to try caching the table model.
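A minimal sketch of the kind of cache I have in mind (the store and helper names are illustrative, not the final code):

import geoalchemy2  # noqa: F401 - needed so reflection recognises geometry columns
from sqlalchemy import create_engine
from sqlalchemy.ext.automap import automap_base

# Hypothetical module-level cache: reflection only happens the first time a
# given (connection string, schema, table) combination is requested
_TABLE_MODEL_CACHE = {}


def get_engine_and_table_model(conn_str, schema, table):
    key = (conn_str, schema, table)
    if key not in _TABLE_MODEL_CACHE:
        engine = create_engine(conn_str)
        Base = automap_base()
        Base.prepare(engine, reflect=True, schema=schema)
        _TABLE_MODEL_CACHE[key] = (engine, getattr(Base.classes, table))
    return _TABLE_MODEL_CACHE[key]

Subsequent PostgreSQLProvider instances can then reuse the reflected table model (and the engine's built-in connection pool) instead of re-reflecting on every request.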
Thanks for this deep analysis, it makes a lot of sense to definitely go with caching.
As I hoped, this has made the query take about half as long, as we now only make two trips to the database on the first request.
100 requests processed in 2.8 seconds.
_ ._ __/__ _ _ _ _ _/_ Recorded: 22:57:42 Samples: 1724
/_//_/// /_\ / //_// / //_'/ // Duration: 3.920 CPU time: 1.759
/ _/ v4.3.0
Program: /tmp/bottleneck.py
3.920 <module> <string>:1
[4 frames hidden] <string>, runpy
3.920 _run_code runpy.py:64
└─ 3.920 <module> bottleneck.py:1
├─ 2.831 instantiate_and_query bottleneck.py:10
│ ├─ 2.680 get pygeoapi/provider/postgresql.py:184
│ │ ├─ 1.899 first sqlalchemy/orm/query.py:2795
│ │ │ [142 frames hidden] sqlalchemy, <built-in>, <string>, re
│ │ │ 1.793 do_execute sqlalchemy/engine/default.py:731
│ │ ├─ 0.612 get <string>:1
│ │ │ [173 frames hidden] <string>, sqlalchemy, <built-in>, wea...
│ │ └─ 0.068 __exit__ sqlalchemy/orm/session.py:1166
│ │ [35 frames hidden] sqlalchemy, <built-in>
│ └─ 0.143 __init__ pygeoapi/provider/postgresql.py:74
│ └─ 0.108 _get_engine_and_table_model pygeoapi/provider/postgresql.py:230
│ └─ 0.081 _reflect_table_model pygeoapi/provider/postgresql.py:261
│ └─ 0.076 reflect sqlalchemy/sql/schema.py:4729
│ [255 frames hidden] sqlalchemy, <string>, <built-in>, re,...
├─ 0.710 <module> pygeoapi/provider/postgresql.py:49
│ ├─ 0.354 <module> geoalchemy2/__init__.py:1
│ │ [786 frames hidden] geoalchemy2, pkg_resources, <built-in...
│ ├─ 0.234 <module> sqlalchemy/__init__.py:8
│ │ [537 frames hidden] sqlalchemy, textwrap, <built-in>, fun...
│ └─ 0.081 <module> sqlalchemy/ext/automap.py:8
│ [182 frames hidden] sqlalchemy, textwrap, re, <built-in>,...
└─ 0.370 <module> pygeoapi/__init__.py:30
└─ 0.358 <module> pygeoapi/config.py:30
├─ 0.272 <module> pygeoapi/util.py:30
│ └─ 0.144 <module> shapely/geometry/__init__.py:1
│ [290 frames hidden] shapely, numpy, pathlib, ntpath, <bui...
└─ 0.080 <module> jsonschema/__init__.py:1
[244 frames hidden] jsonschema, urllib, http, email, sock...
It seems to be consistent, too.
100 requests processed in 2.7 seconds.
100 requests processed in 2.7 seconds.
100 requests processed in 2.7 seconds.
100 requests processed in 2.7 seconds.
100 requests processed in 2.7 seconds.
And the load testing is better, too. With SQLAlchemy + caching, the iteration count is 40% higher than the current release.
| master | without table model cache | with table model cache |
|---|---|---|
| 60 | 40 | 84 |
I've made a couple of minor tweaks this morning, but I think that we are finished from our end. I have updated the example queries to use filter as the keyword instead of cql.
Example queries:
- ILIKE: http://localhost:5000/collections/hot_osm_waterways/items?f=html&filter=name%20ILIKE%20%27muha%27
- BBOX: http://localhost:5000/collections/hot_osm_waterways/items?f=html&filter=BBOX(foo_geom,%2029,%20-2.8,%2029.2,%20-2.9)&limit=100
- CROSSES: http://localhost:5000/collections/hot_osm_waterways/items?f=html&filter=CROSSES(foo_geom,%20LINESTRING(28%20-2,%2030%20-4))&limit=100
- Bad query: http://localhost:5000/collections/hot_osm_waterways/items?filter=osm_id%20EATS%201
Aaargh! I just tried out this example CURL command from @ximenesuk:
curl --location \
--request POST 'http://localhost:5000/collections/hot_osm_waterways/items?f=json&limit=50&filter-lang=cql-json' \
--header 'Content-Type: application/query-cql-json' \
--data-raw '{"and": [{"between": {"value": {"property": "osm_id"}, "lower": 80800000, "upper": 80900000}}, {"isNull": {"property": "name"}}]}'
It now fails with:
{
  "code": "InvalidParameterValue",
  "description": "Collection is not editable"
}
It seems that adding support for API transactions (in this commit: faa00e7c), which landed after we branched off to implement the CQL-JSON POST, has stopped us being able to POST CQL queries to the PostgreSQL provider.
Fixing in master now..
@volcan01010 before I push a few updates, does the following work for you:
diff --git a/pygeoapi/flask_app.py b/pygeoapi/flask_app.py
index 0dbcd66..1a9457f 100644
--- a/pygeoapi/flask_app.py
+++ b/pygeoapi/flask_app.py
@@ -195,12 +195,13 @@ def collection_items(collection_id, item_id=None):
                 api_.get_collection_items(request, collection_id))
         elif request.method == 'POST':  # filter or manage items
             if request.content_type is not None:
-                return get_response(
-                    api_.manage_collection_item(request, 'create',
-                                                collection_id))
-            else:
-                return get_response(
-                    api_.post_collection_items(request, collection_id))
+                if request.content_type == 'application/geo+json':
+                    return get_response(
+                        api_.manage_collection_item(request, 'create',
+                                                    collection_id))
+                else:
+                    return get_response(
+                        api_.post_collection_items(request, collection_id))
         elif request.method == 'DELETE':
             return get_response(
Yes, that has fixed it. I have applied the patch and pushed it up.
@volcan01010 I need to apply this change directly along with some additional small changes. If this works for you, let me know and I will push here, at which point you would be able to rebase.
Do you mean that you update master and I rebase here? Yes, that works for me.
+1. Pushing momentarily.
@volcan01010 you should be good to rebase now.
I don't think that it needs to. GitHub says there are no conflicts and I applied your patch (https://github.com/geopython/pygeoapi/pull/964/commits/6e62d8f67096e12ce14e0e05ec2e8735315e6561) so the POST request works.
@tomkralidis - I had a go at rebasing onto master but it was a complicated merge with many conflicts and I didn't manage to do it without breaking things. Instead, I have merged master into this branch. We will have the ugly merge commit in the history, but the advantage is that the next merge is clean. All the tests for postgresql.py and api.py are still green and the CURL command works.
Need to wait for https://github.com/geopython/pygeofilter/pull/56 or https://github.com/geopython/pygeofilter/pull/59 to be merged and closed, which will fix the outstanding bug:
  File "/pygeoapi/pygeoapi/provider/postgresql.py", line 60, in <module>
    from pygeofilter.backends.sqlalchemy.evaluate import to_filter
  File "/usr/local/lib/python3.8/dist-packages/pygeofilter/backends/sqlalchemy/__init__.py", line 1, in <module>
    from .evaluate import to_filter
  File "/usr/local/lib/python3.8/dist-packages/pygeofilter/backends/sqlalchemy/evaluate.py", line 4, in <module>
    from . import filters
  File "/usr/local/lib/python3.8/dist-packages/pygeofilter/backends/sqlalchemy/filters.py", line 8, in <module>
    from pygeoif.geometry import as_shape
ImportError: cannot import name 'as_shape' from 'pygeoif.geometry' (/usr/local/lib/python3.8/dist-packages/pygeoif/geometry.py)
ERROR: openapi.yml could not be generated ERROR
pygeoif / pygeofilter dependency bug is now fixed with https://github.com/geopython/pygeofilter/pull/61
However new bug discovered to be fixed before merge - https://github.com/BritishGeologicalSurvey/pygeoapi/issues/13
@tomkralidis & @francbartoli we've got this working internally but have two further PRs to make into our branch pygeofilter-sqlalchemy, then this will be ready to merge - thanks for your patience!