stac-server

Search limit greater than arbitrary value returns status code 502

Open klsmith-usgs opened this issue 4 years ago • 6 comments

Exceeding the limit for a query on a collection returns an unhelpful server error, leaving the user guessing what is wrong. The failure appears to be related to overall response size, since different limits succeed for different STAC collections.

Examples using pystac-client and https://earth-search.aws.element84.com/v0

Succeeds:

import pystac_client

# setup implied by the examples below
sentinel2 = pystac_client.Client.open('https://earth-search.aws.element84.com/v0')

search = sentinel2.search(collections=['sentinel-s2-l2a-cogs'],
                          bbox=(-120.23822859135915, 35.63894025515473, -118.19087145985415, 37.262086717429455),
                          datetime='2013-01-01/2020-12-31',
                          limit=500)

records = search.get_all_items_as_dict()

Fails:

search = sentinel2.search(collections=['sentinel-s2-l2a-cogs'],
                          bbox=(-120.23822859135915, 35.63894025515473, -118.19087145985415, 37.262086717429455),
                          datetime='2013-01-01/2020-12-31',
                          limit=650)

records = search.get_all_items_as_dict()
APIError: {"message": "Internal server error"}

Different collection

Succeeds:

search = sentinel2.search(collections=['sentinel-s2-l2a'],
                          bbox=(-120.23822859135915, 35.63894025515473, -118.19087145985415, 37.262086717429455),
                          datetime='2013-01-01/2020-12-31',
                          limit=750)

records = search.get_all_items_as_dict()

Fails:

search = sentinel2.search(collections=['sentinel-s2-l2a'],
                          bbox=(-120.23822859135915, 35.63894025515473, -118.19087145985415, 37.262086717429455),
                          datetime='2013-01-01/2020-12-31',
                          limit=800)

records = search.get_all_items_as_dict()
APIError: {"message": "Internal server error"}

klsmith-usgs avatar Nov 18 '21 20:11 klsmith-usgs

Status update on this ticket:

  • I looked in the CloudWatch logs for the Lambda, and there are no error logs that indicate where this is happening. TODO: look closer at other logs.
  • One important detail is that in the first example above, the server failure does not occur until the 3rd page with a limit of 650. It could be that the first two pages are just under some unknown size limit that's triggering this, or there could be an item on the 3rd page that is inordinately larger than the others, so retrieving any page containing it blows things up.
  • Next step is to query the API directly with requests to allow control over the exact page, etc.
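That next step could look roughly like the sketch below, which POSTs to the `/search` endpoint directly and prints the status and byte size of each page. It uses only the standard library (rather than the `requests` package mentioned above) so it is self-contained; the helper names `post_search`, `next_body`, and `probe` are illustrative, and the `"next"`-link paging shape follows the STAC API spec:

```python
import json
import urllib.request

SEARCH_URL = "https://earth-search.aws.element84.com/v0/search"

def post_search(body):
    """POST a search body and return (status code, raw response bytes)."""
    req = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status, resp.read()

def next_body(page_json, current_body):
    """Return the POST body for the next page (from the 'next' link), or None."""
    link = next((l for l in page_json.get("links", []) if l.get("rel") == "next"), None)
    if link is None:
        return None
    return link.get("body", current_body)

def probe(body):
    """Fetch pages one at a time, printing the status and byte size of each."""
    page = 1
    while body is not None:
        status, raw = post_search(body)
        print(f"page {page}: status {status}, {len(raw)} bytes")
        if status != 200:
            break
        body = next_body(json.loads(raw), body)
        page += 1
```

Calling `probe({"collections": ["sentinel-s2-l2a-cogs"], "limit": 325, ...})` with the bbox and datetime from the examples above would then show exactly which page tips the response over.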

philvarner avatar Feb 23 '22 16:02 philvarner

More updates:

  • The error is coming from API Gateway and (confusingly) is a 502 (Bad Gateway), but the body is {"message": "Internal server error"} 🙄

Running with a page size of 325, this is the size of each page:

| page | status | size (b) | sum of last 2 pages | page for limit 650 |
|------|--------|----------|---------------------|--------------------|
| 1    | 200    | 3176323  |                     |                    |
| 2    | 200    | 2397581  | 5573904             | 1                  |
| 3    | 200    | 2412505  |                     |                    |
| 4    | 200    | 2417225  | 4829730             | 2                  |
| 5    | 200    | 2816015  |                     |                    |
| 6    | 200    | 3297559  | 6113574             | 3                  |
| 7    | 200    | 3203616  |                     |                    |
| 8    | 200    | 3201786  | 6405402             | 4                  |
| 9    | 200    | 3219906  |                     |                    |
| 10   | 200    | 2102250  | 5322156             | 5                  |

Apparently, AWS Lambda has a hard 6MB limit on the response payload it can return, which applies here because the Lambda runs behind API Gateway.
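The "sum of last 2 pages" column can be reproduced from the per-page sizes in the table; each limit=650 page covers two consecutive limit=325 pages:

```python
# Per-page response sizes in bytes, measured with limit=325 (from the table above)
sizes = [3176323, 2397581, 2412505, 2417225, 2816015,
         3297559, 3203616, 3201786, 3219906, 2102250]

# Size each limit=650 page would have: the sum of each consecutive pair
pair_sums = [sizes[i] + sizes[i + 1] for i in range(0, len(sizes), 2)]
print(pair_sums)
```

The third and fourth sums are the ones brushing up against the 6MB ceiling, which matches the failure first appearing on the 3rd page with a limit of 650.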

philvarner avatar Feb 23 '22 20:02 philvarner

I believe the right approach here is that if the response body is going to be > 6MB, we return a 400 with

{
  "code": "0001",
  "description": "The response body that resulted from this query was too large to be returned by API Gateway. Try a smaller limit."
}
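A minimal sketch of that guard (stac-server itself is Node.js, so this Python is language-agnostic illustration only; `guard_response` and the exact byte threshold are assumptions):

```python
import json

# AWS Lambda's documented cap on the response payload for synchronous invocations
LAMBDA_RESPONSE_LIMIT = 6 * 1024 * 1024  # 6 MB

def guard_response(body_dict):
    """Return (status, body): the proposed 400 error payload when the
    serialized body would exceed the Lambda response limit, otherwise
    the body unchanged with a 200."""
    serialized = json.dumps(body_dict).encode("utf-8")
    if len(serialized) > LAMBDA_RESPONSE_LIMIT:
        return 400, {
            "code": "0001",
            "description": (
                "The response body that resulted from this query was too "
                "large to be returned by API Gateway. Try a smaller limit."
            ),
        }
    return 200, body_dict
```

The check has to happen on the serialized body, since the limit is on the bytes Lambda returns, not on the number of items.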

philvarner avatar Feb 23 '22 20:02 philvarner

#193 should hopefully fix this

marchuffnagle avatar Feb 23 '22 21:02 marchuffnagle

I think it's going to make it better, but it will still fail with a limit of 10000 (and a query that has at least that many results) -- 10k is the upper limit from the OGC API - Features Part 1 spec.

philvarner avatar Feb 23 '22 22:02 philvarner

Bumping this out to 0.5.0. Supporting gzipped responses will help increase the effective limit, but the 6MB response limit is in Lambda itself, so there's no good or easy way to get around it. We should probably document that this can happen and that the workaround is to decrease the limit.
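On the client side, that documented workaround can be automated: catch the failure and retry with a smaller page size. A sketch, assuming `client` is a pystac_client.Client and with the hypothetical helper name `search_with_backoff` (pystac-client raises its `APIError` during item retrieval, which the broad `except` here covers):

```python
def search_with_backoff(client, min_limit=50, **search_kwargs):
    """Retry a STAC search with the page-size limit halved each time the
    server fails (e.g. the 502 described above), until it succeeds or the
    limit drops to min_limit."""
    limit = search_kwargs.pop("limit", 500)
    while True:
        try:
            search = client.search(limit=limit, **search_kwargs)
            # The failure surfaces while pages are actually fetched
            return search.get_all_items_as_dict()
        except Exception:
            if limit <= min_limit:
                raise  # give up: even the smallest page size failed
            limit //= 2
```

This trades extra round trips for robustness; a smarter variant could remember the largest limit that worked for a given collection.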

philvarner avatar Feb 24 '22 20:02 philvarner