Mars validation is performed on retrieve but not on archive
What happened?
I use a local FDB to store experimental ERA6 data, with a schema that contains a new key, "timespan".
The metkit version I was using did not yet contain this key in the language yaml. Archiving fields with this key worked, but retrieving a specific field did not, because expansion and validation of the request failed (the key was not known).
My questions are:
- Why is validation performed only on retrieval, but not when a field is being archived? The current behavior allows archiving "bad" messages but forbids retrieving them. It should be the other way round.
- What is the purpose of the schema if there is a separate validation mechanism? The MARS language describes the types and possible values for all the keys. Effectively, the schema duplicates information and only specifies the subset of keys that are used for indexing.
What are the steps to reproduce the bug?
Invent a new key, create a valid schema containing it, archive a field and try to retrieve it again.
Version
5.17.4
Platform (OS and architecture)
ATOS
Relevant log output
[mapg@ad6-198 mapg-20250721-CPP-20250811-125904-481e7474ac647ce014657e7c08618b22-woof]$ fdb where --config=/lus/h2resw01/scratch/mapg/5220/mapg-20250721-CPP-20250811-125904-481e7474ac647ce014657e7c08618b22-woof/fdbs/fdb/etc/fdb/config.yaml anoffset=3,class=e6,da
te=20220531,domain=g,expver=5220,levtype=o2d,month=5,param=262141,step=18,stream=lwda,time=1800,timespan=none,type=fc,year=2022
Exception: UserError: Cannot match [timespan] in [source,style,class,type,stream,product,section,range,use,expver,dataset,model,georef,repres,obsgroup,reportype,levtype,levelist,leve,level,levellist,param,date,year,month,hdate,offsetdate,fcmonth,fcperiod,time,offs
ettime,leadtime,opttime,step,anoffset,reference,number,quantile,domain,bcmodel,icmodel,country,grib,frequency,direction,diagnostic,iteration,channel,ident,instrument,method,origin,system,activity,experiment,generation,realization,resolution,obstype,latitude,longit
ude,accuracy,bitmap,format,frame,gaussian,area,grid,interpolation,packing,resol,rotation,intgrid,truncation,process,filter,target,fieldset,field]
Exception: UserError: UserError: Cannot match [timespan] in [source,style,class,type,stream,product,section,range,use,expver,dataset,model,georef,repres,obsgroup,reportype,levtype,levelist,leve,level,levellist,param,date,year,month,hdate,offsetdate,fcmonth,fcperio
d,time,offsettime,leadtime,opttime,step,anoffset,reference,number,quantile,domain,bcmodel,icmodel,country,grib,frequency,direction,diagnostic,iteration,channel,ident,instrument,method,origin,system,activity,experiment,generation,realization,resolution,obstype,lati
tude,longitude,accuracy,bitmap,format,frame,gaussian,area,grid,interpolation,packing,resol,rotation,intgrid,truncation,process,filter,target,fieldset,field] request=read,anoffset=3,class=e6,date=20220531,domain=g,expver=5220,levtype=o2d,month=5,param=262141,step=1
8,stream=lwda,time=1800,timespan=none,type=fc,year=2022, expanded=read,
** UserError: UserError: Cannot match [timespan] in [source,style,class,type,stream,product,section,range,use,expver,dataset,model,georef,repres,obsgroup,reportype,levtype,levelist,leve,level,levellist,param,date,year,month,hdate,offsetdate,fcmonth,fcperiod,time,o
ffsettime,leadtime,opttime,step,anoffset,reference,number,quantile,domain,bcmodel,icmodel,country,grib,frequency,direction,diagnostic,iteration,channel,ident,instrument,method,origin,system,activity,experiment,generation,realization,resolution,obstype,latitude,lon
gitude,accuracy,bitmap,format,frame,gaussian,area,grid,interpolation,packing,resol,rotation,intgrid,truncation,process,filter,target,fieldset,field] request=read,anoffset=3,class=e6,date=20220531,domain=g,expver=5220,levtype=o2d,month=5,param=262141,step=18,stream
=lwda,time=1800,timespan=none,type=fc,year=2022, expanded=read, Caught in (/hpcperm/deploy/metabuilder/builds/ecfg-deploy-mbm_3141/aa/GNU.85/mars-server/mars-server/eckit/src/eckit/runtime/Tool.cc:31 start)
** Exception terminates fdb-where
FDBException: Error in function fdb_expand_request: UserError: UserError: Cannot match [timespan] in [style,class,type,stream,product,section,range,use,expver,dataset,model,georef,repres,obsgroup,reportype,levtype,levelist,leve,level,levellist,param,date,year,mont
h,hdate,offsetdate,fcmonth,fcperiod,time,offsettime,leadtime,opttime,step,anoffset,reference,number,quantile,domain,bcmodel,icmodel,country,grib,frequency,direction,diagnostic,iteration,channel,ident,instrument,method,origin,system,activity,experiment,generation,r
ealization,resolution,obstype,latitude,longitude,accuracy,bitmap,format,frame,gaussian,area,grid,interpolation,packing,resol,rotation,intgrid,truncation,process,filter,target,source,expect,fieldset,field,database,dbase,optimise,duplicates,padding] request=retrieve
,anoffset=3,type=fc,timespan=fs,date=20220531,step=0,year=2022,class=e6,domain=g,time=1800,param=162104,stream=lwda,expver=5220,month=5,levtype=ml,levelist=81, expanded=retrieve
Accompanying data
No response
Organisation
ECMWF
Hey @pgeier
Why is validation performed only on retrieval, but not when a field is being archived?
I believe this check is carried out by MarsExpension.expand, which as far as I can tell is only called on retrieve and not on archival. Whether this is intentional is unclear. From a quick skim of the code, I don't think this metkit "validation" happens on the C++ API fdb.retrieve either, but it does on the C-API (pyfdb) and the command line tools, which wrap fdb.retrieve.
Obviously it's not good if it's possible to write data that is unretrievable by (some of) the APIs.
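To make the asymmetry concrete, here is a toy Python model of it. This is not FDB or metkit code: `KNOWN_KEYS`, `expand` and `ToyStore` are hypothetical names, and the key set is a stand-in for the language yaml. It only illustrates that a check applied on retrieve but not on archive lets unretrievable data in, and that running the same check on archive fails before anything is written.

```python
# Stand-in for the keys known to the language definition (assumption, not the real list).
KNOWN_KEYS = {"class", "expver", "stream", "date", "time", "param"}


def expand(request):
    """Reject any key the toy language does not know, as the retrieve path does."""
    for key in request:
        if key not in KNOWN_KEYS:
            raise ValueError(f"Cannot match [{key}] in {sorted(KNOWN_KEYS)}")
    return dict(request)


class ToyStore:
    def __init__(self, validate_on_archive=False):
        self.validate_on_archive = validate_on_archive
        self.fields = {}

    def archive(self, key, data):
        if self.validate_on_archive:
            expand(key)  # fail before anything is written
        self.fields[frozenset(key.items())] = data

    def retrieve(self, request):
        expanded = expand(request)  # always validated in this model
        return self.fields[frozenset(expanded.items())]
```

With `validate_on_archive=False`, a key containing `timespan` is stored without complaint and then cannot be read back through `retrieve`. With `validate_on_archive=True`, the same key is rejected at archive time and the store stays empty.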
What is the purpose of the schema if there is a separate validation mechanism?
The schema does more than just define a set of valid keys. The division of the keys into three levels affects the layout of data on disk, which has a big impact on FDB's performance.
Also, there are multiple valid mars keys (such as use, target, filter, etc.) which are used elsewhere in the mars-ecosystem (mars, pgen, ...) but are not keys that would ever be valid for indexing the data.
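As a rough illustration of what the three levels do, here is a small Python sketch that splits a flat key into three groups. The level contents in `SCHEMA` and the function `split_by_level` are hypothetical and not taken from any real schema file. The comments labelling the levels as database, index and datum reflect my understanding of how the levels map to storage, so treat them as an assumption.

```python
# Hypothetical three-level schema; real schemas are larger and conditional.
SCHEMA = (
    ("class", "expver", "stream", "date", "time"),  # level 1: selects the database
    ("type", "levtype"),                            # level 2: selects the index
    ("step", "param"),                              # level 3: selects the field
)


def split_by_level(key, schema=SCHEMA):
    """Split a flat key dict into one tuple of (name, value) pairs per schema level."""
    schema_names = {name for names in schema for name in names}
    extra = set(key) - schema_names
    if extra:
        raise KeyError(f"keys not in schema: {sorted(extra)}")
    levels = []
    for names in schema:
        missing = [name for name in names if name not in key]
        if missing:
            raise KeyError(f"missing keys for schema level: {missing}")
        levels.append(tuple((name, key[name]) for name in names))
    return tuple(levels)
```

Keys such as use, target or filter would never appear in `SCHEMA`, which is the point above: the schema is about where data lives, not about which request keywords exist.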
The question is rather conceptual.
Given the information that is put in the schema, the conceptual separation from the MARS language is unclear. I understand that FDB was initially promoted as language agnostic; in this case the schema describes keys, their types, and a conditional order of indexing (by condition I mean things like class=ci, stream=mmfs).
Now with the MARS validation on top, there is a duplication of a lot of information.
The MARS language already describes types of keys and expectations on values.
All the conditional logic that is reproduced in the schema to describe different sets of indexing keys is basically the same information as in the MARS language yaml that performs validation. Most of the time it's just about categorization and enabling or disabling specific keys for a subtree of the MARS language. Example: number is only used in ensemble forecasts, direction and frequency are used for wave fields, and so on.
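A minimal sketch of what I mean by enabling keys per category, in Python. All names here (`BASE_KEYS`, `CONDITIONAL_KEYS`, `allowed_keys`, `check`) are made up, and the stream values used as conditions are only illustrative; none of this is copied from the language yaml or from a schema.

```python
# Keys assumed to be valid everywhere (illustrative subset).
BASE_KEYS = {"class", "expver", "stream", "type", "date", "time", "param"}

# (condition on the request, extra keys enabled when the condition holds).
CONDITIONAL_KEYS = [
    ({"stream": "enfo"}, {"number"}),                  # ensemble forecasts
    ({"stream": "wave"}, {"direction", "frequency"}),  # wave fields
]


def allowed_keys(request):
    """Return the keys that are valid for this request's category."""
    keys = set(BASE_KEYS)
    for condition, extra in CONDITIONAL_KEYS:
        if all(request.get(name) == value for name, value in condition.items()):
            keys |= extra
    return keys


def check(request):
    """Raise if the request uses a key that its category does not enable."""
    unexpected = set(request) - allowed_keys(request)
    if unexpected:
        raise ValueError(f"keys not valid in this context: {sorted(unexpected)}")
```

The same table could drive both validation and the choice of indexing keys, which is why maintaining it twice looks like duplication to me.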
I support having schemas. However, looking at the era6 schema (which is quite big), most of the information is just a very detailed replication of the domain-specific categorization. This "view from above" is captured neither in any of the schemas nor in the MARS language itself; instead, everything relies on contextual conditions on single values, which look cryptic rather than expressive. That said, this is more a criticism of the MARS language and its description.
What I find really concerning is that this additional validation does not seem to be integrated very thoughtfully. The reason for validation is to maintain data consistency: to error out and notify before harm happens, not after invalid data has already been written.
As a user I should not need to understand in which cases FDB triggers MarsRequest.expand. I should just understand how FDB performs validation, and that is (in the first instance) schema validation.
Right now FDB is undecided about whether it is domain/MARS specific or not, and about where the domain information is primarily managed.
I broadly agree with your points.
As a user I should not need to understand in which cases FDB triggers MarsRequest.expand
Of course.
What I find really concerning is that this additional validation does not seem to be integrated very thoughtfully. The reason for validation is to maintain data consistency: to error out and notify before harm happens, not after invalid data has already been written.
Indeed. It is clearly flawed based on what you have identified.
Small related summary of #163: the eccodes representation of timespan was not the "canonical" form specified in metkit. The archive was performed with archive(eckit::message::Message), which uses MetadataGatherer and iterates over the mars namespace in eccodes. In this case the extracted keys do not get expanded.
An expansion/validation would have avoided the problem.
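A toy Python sketch of why expansion on archive matters here. The alias table `CANONICAL` and its values `raw-form` / `canonical-form` are placeholders, not the actual timespan representations from #163, and `canonicalise`, `archive` and `retrieve` are hypothetical helpers rather than FDB functions.

```python
# Placeholder alias table: maps a raw metadata value to its canonical form.
CANONICAL = {"timespan": {"raw-form": "canonical-form"}}


def canonicalise(metadata, aliases=CANONICAL):
    """Replace every value that has a known alias by its canonical form."""
    return {k: aliases.get(k, {}).get(v, v) for k, v in metadata.items()}


def archive(index, metadata, data, expand=True):
    """Store data under the metadata key, optionally canonicalising it first."""
    key = canonicalise(metadata) if expand else dict(metadata)
    index[tuple(sorted(key.items()))] = data


def retrieve(index, request):
    """Look up data; the request is always canonicalised, as on the retrieve path."""
    key = canonicalise(request)
    return index.get(tuple(sorted(key.items())))
```

With `expand=False` the field is indexed under the raw value, while the retrieve side looks it up under the canonical value and finds nothing. With `expand=True` both sides agree on the key.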