Required controlledVocabulary metadata marked as valid while empty
Bug description
In Dataverse (v6.8), when a previously optional metadata field becomes required for a dataset, or when a dataset is created via the API without all required metadata (`dataverse.api.allow-incomplete-metadata=true`), the system displays an "Incomplete metadata" label, and the DRAFT normally fails form validation and cannot be published.
However, if the user uploads a file, Dataverse inserts a record for the missing required controlledVocabulary metadata into the datasetfieldvalue table.
Consequence: required controlledVocabulary metadata fields (e.g. subject) are treated as valid in the dataset metadata form even though their value is empty.
Steps to reproduce
- Create a dataset via API with `dataverse.api.allow-incomplete-metadata=true` without including all required metadata.
- Upload a file to the dataset.
- Open the dataset metadata edit form.
- Notice that required `controlledVocabulary` fields appear valid although no value is set.
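The two API steps above can be sketched as follows. This is a minimal Python sketch using only the standard library; the server URL, collection alias, persistent ID, and API token are placeholders, and the endpoints follow the Dataverse native API (`doNotValidate=true` is the query parameter used later in this thread to skip validation on create):

```python
import json
import urllib.request

def create_dataset_request(server: str, alias: str, token: str,
                           payload: dict) -> urllib.request.Request:
    """Build the request that creates a dataset while skipping validation.

    The server must run with dataverse.api.allow-incomplete-metadata=true;
    doNotValidate=true lets the payload omit required fields.
    """
    return urllib.request.Request(
        f"{server}/api/dataverses/{alias}/datasets?doNotValidate=true",
        data=json.dumps(payload).encode(),
        headers={"X-Dataverse-key": token, "Content-Type": "application/json"},
        method="POST",
    )

def add_file_request(server: str, pid: str, token: str) -> urllib.request.Request:
    """Build the request that uploads a file to the dataset (the multipart
    body construction is omitted; e.g. curl -F file=@data.csv does this)."""
    return urllib.request.Request(
        f"{server}/api/datasets/:persistentId/add?persistentId={pid}",
        headers={"X-Dataverse-key": token},
        method="POST",
    )
```

After the upload, inspecting the `datasetfieldvalue` table shows the inserted placeholder row for the missing required field.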
Expected behavior
Required controlledVocabulary fields should remain invalid until a valid value is provided.
Affected version
Dataverse v6.8
Are you thinking about creating a pull request for this issue?
No fix is currently planned by the team.
@stevenferey thanks for the bug report. Out of curiosity, are you able to reproduce this in the SPA at https://beta.dataverse.org/spa/ ?
@ErykKul since you added the "incomplete metadata" concept, I'm just tagging you as a heads up. 😄
I assigned it to myself; I will investigate it and see if I can reproduce it (and do a PR). We have not noticed that behavior yet, but it is certainly relevant to KU Leuven. If someone wants to join, you are welcome to do so.
> @stevenferey thanks for the bug report. Out of curiosity, are you able to reproduce this in the SPA at https://beta.dataverse.org/spa/ ?
@pdurbin, no, because the application server must have `dataverse.api.allow-incomplete-metadata=true` enabled in order to create a dataset with incomplete mandatory metadata (and my institutional information is blocking account creation).
I made a PR that should fix it. I could reproduce the problem with a unit test; I could not reproduce it through a combination of API/UI edits. I think the reproduction would have to be very specific for that, e.g., where the only missing value is the controlled vocabulary value, or a specific citation block. To be sure, the code would have to be tested under the specific conditions where the bug is reproducible.
@stevenferey, what value exactly is inserted into the `datasetfieldvalue` table? The code already checked for blank and "N/A" values; what I added is a check that the specific value is in the list of possible values. Do you upload the file with the UI or the API to reproduce it? I tested both ways without any problem; however, I tested on v6.7.1. Is it only 6.8 that has this problem? The important part is also the reindex after any changes to the metadata block. Meanwhile, I will also try to test 6.8 specifically to see if I can reproduce the problem there (before the fix, and then with the fix, if I can reproduce the problem).
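The value check described above can be sketched as follows. This is a hypothetical Python illustration of the logic (non-blank, not the "N/A" placeholder, and present in the vocabulary list), not the actual Java code in Dataverse:

```python
# Placeholder Dataverse inserts for missing metadata (per the discussion above).
NA_VALUE = "N/A"

def is_valid_cvoc_value(value, allowed_values):
    """Return True only for a usable controlled vocabulary value."""
    if value is None or not value.strip():
        return False                 # blank value
    if value == NA_VALUE:
        return False                 # placeholder, not real metadata
    return value in allowed_values   # must be one of the vocabulary's values
```

The last line is the added check: a non-empty value that is not in the vocabulary list is rejected as well.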
I have now tested with 6.8: I copied a dataset from the demo Dataverse with only the subject left out and added files via the UI (JSF version) and the API; the subject is still missing and the dataset is invalid:
Only the subject is missing:
Setting the subject manually in UI makes the dataset valid.
Tested with v. 6.8 build 1994-92d1ec8 (the official released version)
There must be something very specific about this bug, or I am doing something wrong when trying to reproduce it. Let me know if it still persists.
I have discovered that I can save the dataset after removing the subject, making it incomplete. This should not be possible in the UI. It might be the cause of the problem. I will investigate it in more detail.
Removing the subject and saving was a bug; I improved the detection of validation issues in the JSF UI, so you can no longer save without setting all mandatory fields, including the controlled vocabulary values (this already worked for other types of values). The PR is updated now. I tested the fix in the UI.
I have been trying to QA this PR. I created the dataset by setting `dataverse.api.allow-incomplete-metadata=true` and adding the query parameter `doNotValidate=true` to `/api/dataverses/{alias}/datasets`.
The create worked. I then uploaded a file to the dataset, which added a row to the `datasetfieldvalue` table: `7 | 0 | "N/A" | 10`.
At this point I cannot publish and continue to be notified that there is a missing field.
Long story short, I cannot reproduce this bug. (FYI: I am using the current develop branch.)
I retested with the 11900-improved-cvoc-value-validation branch and here is what I observed:
Scenario 1: Dataset creation without required metadata via API
Steps to reproduce:
- Create a dataset via API with the following payload (without required metadata):
```json
{
  "datasetVersion": {
    "metadataBlocks": {
      "citation": {
        "fields": [
          {
            "value": [
              {
                "authorName": {
                  "value": "Finch, Fiona",
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "authorName"
                }
              }
            ],
            "typeClass": "compound",
            "multiple": true,
            "typeName": "author"
          }
        ],
        "displayName": "Citation Metadata"
      }
    }
  }
}
```
- Upload a file to this dataset
Observed results:
- The `subject` metadata has an `N/A` value in the database (`datasetfieldvalue` table).
- The metadata edit form does not display any error for the `subject` metadata.
- The handling of the "incomplete metadata" tag and the "publish" button is OK.
Scenario 2: Metadata made required after dataset creation
Steps to reproduce:
- Create a dataset from the UI, filling in the required metadata plus the `publicationCitation` metadata.
- Edit the `citation.tsv` file to make the `publicationRelationType` metadata required.
- Reindex.
- Verify that there is a validation error on `publicationRelationType` when editing the dataset.
- Upload a file.
Observed results:
- The `N/A` value appears in the UI for the Related Publication field.
- There is no longer an error on the `publicationRelationType` field when editing (only a global error).
- The handling of the "incomplete metadata" tag and the "publish" button is OK.
@stevenferey, my fix is what you have observed: the global error that does not let you save the dataset without fixing the problem (adding the missing required field). A better fix would be a validator smart enough to detect what is going wrong and point to the exact spot that needs fixing, but I think even that would not address the real bug here. When implementing the incomplete-metadata functionality, I noticed a long time ago that there are edge cases (especially after making previously optional fields required) that the regular validators occasionally miss, and I could not pinpoint exactly what was causing them. I implemented a more sophisticated global validator that reuses the regular validators but runs them on a copy of the dataset metadata, where the copy is first transformed by removing empty values, N/A values, etc. This fixed the incomplete-metadata labels, and now the global validator does not let you save the dataset, using the same logic (I reused it for the UI in this PR). As it stands, these are just workarounds.
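The global-validator workaround described above can be sketched roughly like this. It is a hypothetical Python illustration only (the real implementation is Java inside Dataverse, and the field representation here is invented): required-field checking runs on a cleaned copy of the metadata, so placeholder rows inserted by the file-upload path no longer mask missing required fields.

```python
# Placeholder value inserted for missing metadata, per the discussion above.
NA_VALUE = "N/A"

def cleaned_copy(fields):
    """Return a copy of the metadata with empty and N/A values removed.

    `fields` maps a field name to its list of values (invented shape).
    """
    return {
        name: [v for v in values if v and v.strip() and v != NA_VALUE]
        for name, values in fields.items()
    }

def missing_required(fields, required):
    """Run the required-field check on the cleaned copy, not the original."""
    clean = cleaned_copy(fields)
    return [name for name in required if not clean.get(name)]
```

The point of the copy is that the stored metadata is left untouched; only the validation sees the transformed version.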
I know that there is a bug in metadata that I have been chasing for some time now, and I could not figure out what is causing it. I think you might have found a way to reproduce it. My suspicion is it is the same bug as the one I was fighting with. I did not realize that the key to reproducing it is to add a file to a dataset after changing the citation block (I know, changing the citation block is never a good idea...). I did come up with a query that surfaces the inconsistent state that causes all those problems (some UI quirks, like the missing Access field in our specific case at KU Leuven):
SELECT d.id, d.value, o.identifier FROM public.datasetfieldvalue d,
public.datasetfield_controlledvocabularyvalue c,
public.datasetfield f,
public.datasetfield f2,
public.datasetfieldcompoundvalue cv,
public.datasetversion v,
public.dvobject o
WHERE d.datasetfield_id = c.datasetfield_id
and f.id = d.datasetfield_id
and cv.id = f.parentdatasetfieldcompoundvalue_id
and f2.id = cv.parentdatasetfield_id
and v.id = f2.datasetversion_id
and o.id = v.dataset_id;
Can you run that query and see if you have any matches? If so, that would help a lot in understanding what is really causing the issue; in a bug-free situation there should not be any matches. You can fix all affected datasets with the query below (after validating the results of the SELECT to make sure you are deleting only the problematic rows and nothing important):
delete from public.datasetfieldvalue where id in (SELECT d.id FROM public.datasetfieldvalue d,
public.datasetfield_controlledvocabularyvalue c,
public.datasetfield f,
public.datasetfield f2,
public.datasetfieldcompoundvalue cv,
public.datasetversion v,
public.dvobject o
WHERE d.datasetfield_id = c.datasetfield_id
and f.id = d.datasetfield_id
and cv.id = f.parentdatasetfieldcompoundvalue_id
and f2.id = cv.parentdatasetfield_id
and v.id = f2.datasetversion_id
and o.id = v.dataset_id);
The file‑addition step may be the key to reproducing the inconsistent state. If it’s the same issue, this should help me pinpoint the exact source - I haven’t been able to isolate it until now. I’ll also run some additional tests on my side when I can.