Search for products among multiple providers
Original request made by @geonux:
To be able to search for all available data on all providers over a given AOI
We decided that the best way to approach this was to add a providers list parameter to the search method. A list of all the providers can be retrieved with dag.available_providers(). But a user could also provide a subset of the available providers:
dag.search(args, providers=["theia", "sobloo"])
Note that there is already a provider kwarg that the user can pass to search.
To retrieve all the product types, we have to deal with the existing productType parameter: the search API would need to change so that productType accepts a list.
Note here that dag.list_product_types(provider) could come in handy if productType accepts a list of product types.
Here is a snippet that searches for all the products available on all the providers in a given area of interest (around Toulouse here) in early August 2020:
from eodag import EODataAccessGateway
from eodag.api.search_result import SearchResult
from eodag.utils.logging import setup_logging
setup_logging(verbose=1)
dag = EODataAccessGateway()
search_criteria = dict(
    start='2020-08-01',
    end='2020-08-10',
    geom=[0, 43, 2, 45],
)
all_prods = SearchResult([])
# Loop over ALL the providers
for provider in dag.available_providers():
    # Set it as the preferred one
    dag.set_preferred_provider(provider=provider)
    # Get the product type IDs (e.g. S2_MSI_L1C) for this provider
    product_types = (
        p["ID"]
        for p in dag.list_product_types(provider=provider)
        if p["ID"] != "GENERIC_PRODUCT_TYPE"
    )
    # Loop over them and search for all the products available
    for product_type in product_types:
        try:
            results = dag.search_all(productType=product_type, **search_criteria)
        except Exception:
            print(f"Failed to collect '{product_type}' products with '{provider}'")
            results = []
        print(f"Got {len(results)} '{product_type}' products with '{provider}'")
        all_prods.extend(results)
print(f"Got a total of {len(all_prods)} products.")
(I got 1090 products)
@geonux since you're at the origin of this issue, I would like to ask you a few questions about it if I may.
To be able to search for all available data on all providers over a given AOI
I think it can be interpreted in two different ways:
1. To get all the products from all the providers
2. To get all the unique products from all the providers
Indeed, there will almost certainly be duplicate products (providers A and B both offer the same product i) in a search over the same AOI and time period.
If you are interested in 1., the snippet above should get you what you want. Would it be enough to document it?
If you are interested in 2., this is trickier. We would need to remove duplicate products. We could rely on the product unique identifier, however, as shown in https://github.com/CS-SI/eodag/issues/136#issuecomment-808082569, we can't always make sure that different providers use the same id (surprisingly!). So there may still be some duplicates after an id filter. We could also rely on a combination of properties, and remove duplicates if 2 or more products share the same combination of, for instance, product_type / geometry / start date / end date.
Removing duplicates based on the id can be done as follows:
almost_unique_prods = SearchResult({p.properties["id"]: p for p in all_prods[::-1]}.values())
An attempt to remove potential duplicate products from almost_unique_prods could be done as follows:
unique_prods = SearchResult({
    (p.properties["startTimeFromAscendingNode"], p.geometry.wkt, p.product_type): p
    for p in almost_unique_prods[::-1]
}.values())
If we implement 2., internally we could add __eq__ (and __hash__?) to the EOProduct class, to specify how we define whether two products are the same or not.
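As a rough sketch of what such an equality definition could look like, the block below uses a hypothetical stand-in class, `FakeProduct`, rather than the real EOProduct. The fields it compares (product type, start date, geometry as WKT) are an assumption borrowed from the combination suggested above, not a decided design.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: we write __eq__ and __hash__ ourselves
class FakeProduct:
    """Hypothetical stand-in for EOProduct, for illustration only."""

    provider: str
    product_type: str
    start: str  # would map to the startTimeFromAscendingNode property
    geometry_wkt: str

    def _identity(self):
        # The provider is deliberately left out: the same product offered
        # by two providers should compare equal.
        return (self.product_type, self.start, self.geometry_wkt)

    def __eq__(self, other):
        if not isinstance(other, FakeProduct):
            return NotImplemented
        return self._identity() == other._identity()

    def __hash__(self):
        # Must be consistent with __eq__ so products work in sets and dicts
        return hash(self._identity())


a = FakeProduct("peps", "S2_MSI_L1C", "2020-08-01T10:00:00Z", "POINT (1 44)")
b = FakeProduct("sobloo", "S2_MSI_L1C", "2020-08-01T10:00:00Z", "POINT (1 44)")

# dict.fromkeys keeps the first of several equal keys, in input order
unique = list(dict.fromkeys([a, b]))
```

With both methods defined, removing duplicates no longer needs a hand-built key, and the product from the provider listed first is the one that is kept.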
Note the reversed order of all_prods and almost_unique_prods in the dict comprehensions above. It ensures that, if there are duplicate products, the one that ends up in unique_prods is the one obtained from the first provider (the first one that offered this product). If we implement 2., we should make sure this priority is preserved, e.g. search_all(geom=..., start=..., end=..., providers=["peps", "sobloo"]) should return products from peps in priority.
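To make the effect of the reversal concrete, here is a toy example with plain dicts standing in for products (the ids and the provider order are made up for illustration). It only relies on the fact that, in a dict comprehension, a later entry with the same key overwrites the earlier value.

```python
# Toy data, listed in provider priority order (peps first, then sobloo)
products = [
    {"id": "B", "provider": "peps"},
    {"id": "A", "provider": "peps"},
    {"id": "A", "provider": "sobloo"},
]

# Without reversing, a later duplicate overwrites the earlier one
last_wins = list({p["id"]: p for p in products}.values())

# Reversing first means the earliest occurrence is written last, so it wins
first_wins = list({p["id"]: p for p in products[::-1]}.values())
```

Here `last_wins` keeps product A from sobloo, while `first_wins` keeps it from peps. Note that `first_wins` comes out in the order of the reversed input, so the result may need re-sorting if order matters.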
If we decide to implement 1. or 2., where should we add this?
- search_all: it seems like an obvious candidate => Yes?
- search: since it returns a given page of a search, I believe it assumes the search is done on a single provider => No?
- search_iter_page: I see it being used by advanced users, who may want to implement this (search all products from all providers) in a different way than what we'll do => No?
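If search_all is the entry point, the behaviour could look roughly like the sketch below. `search_all_providers` and its `key` argument are hypothetical names, not part of eodag. The helper only calls `set_preferred_provider` and `search_all` in the same way as the snippet above, and `StubGateway` is a stand-in so the example runs without eodag or network access.

```python
def search_all_providers(dag, providers, key, **criteria):
    """Hypothetical helper: search each provider in turn and keep, for each
    key, the product from the first provider that returned it."""
    kept = {}
    for provider in providers:
        dag.set_preferred_provider(provider=provider)
        try:
            results = dag.search_all(**criteria)
        except Exception:
            # Same lenient handling as the snippet above: skip failing providers
            continue
        for product in results:
            # setdefault stores the product only if its key is not there yet
            kept.setdefault(key(product), product)
    return list(kept.values())


class StubGateway:
    """Stand-in for EODataAccessGateway so the sketch runs offline."""

    catalog = {
        "peps": [{"id": "A", "provider": "peps"}],
        "sobloo": [
            {"id": "A", "provider": "sobloo"},
            {"id": "B", "provider": "sobloo"},
        ],
    }

    def set_preferred_provider(self, provider):
        self.current = provider

    def search_all(self, **criteria):
        return self.catalog[self.current]


found = search_all_providers(
    StubGateway(),
    ["peps", "sobloo"],
    key=lambda p: p["id"],
    productType="S2_MSI_L1C",
)
```

In this toy run, product A is kept from peps because peps comes first in the providers list, and product B, which only sobloo offers, is still returned.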
The fact that the same product can have differently formatted ids depending on the provider must be fixed.
But this is also related to the fact that some providers are not (yet) configured to return Sentinel products in SAFE format. See #216 and #171