Required controlledVocabulary metadata marked as valid while empty

Open stevenferey opened this issue 3 months ago • 10 comments

Bug description

In Dataverse (v6.8), when a previously optional metadata field becomes required for a dataset, or when a dataset is created via the API without all required metadata (dataverse.api.allow-incomplete-metadata=true), the system displays an Incomplete metadata label and normally prevents DRAFT form validation or dataset publication.

However, if the user uploads a file, Dataverse inserts a record for the missing required controlledVocabulary metadata into the datasetfieldvalue table.

Consequence: required controlledVocabulary metadata fields (e.g. subject) are treated as valid in the dataset metadata form even though their value is empty.

Steps to reproduce

  1. Create a dataset via API with dataverse.api.allow-incomplete-metadata=true without including all required metadata.
  2. Upload a file to the dataset.
  3. Open the dataset metadata edit form.
  4. Notice that required controlledVocabulary fields appear valid although no value is set.
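
Step 1 can be sketched in Python as follows. This is a minimal sketch, not a tested reproduction script: it only builds the request (it does not send it), the helper name `build_create_request` is hypothetical, and it assumes the native API's `X-Dataverse-key` header together with the `doNotValidate=true` query parameter mentioned later in this thread. The file upload in step 2 is not covered.

```python
import json
import urllib.parse
import urllib.request


def build_create_request(base_url, dataverse_alias, api_token, dataset_json):
    """Build (but do not send) the POST that creates a dataset without validation.

    Hypothetical helper: base_url, dataverse_alias and api_token are
    placeholders supplied by the caller.
    """
    url = (
        f"{base_url.rstrip('/')}/api/dataverses/"
        f"{urllib.parse.quote(dataverse_alias)}/datasets?doNotValidate=true"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(dataset_json).encode("utf-8"),
        headers={
            "X-Dataverse-key": api_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The server must run with dataverse.api.allow-incomplete-metadata=true for the incomplete dataset to be accepted.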

Expected behavior

Required controlledVocabulary fields should remain invalid until a valid value is provided.

Affected version

Dataverse v6.8

Are you thinking about creating a pull request for this issue?

No fix is currently planned by the team.

stevenferey avatar Oct 16 '25 15:10 stevenferey

@stevenferey thanks for the bug report. Out of curiosity, are you able to reproduce this in the SPA at https://beta.dataverse.org/spa/ ?

@ErykKul since you added the "incomplete metadata" concept, I'm just tagging you as a heads up. 😄

pdurbin avatar Oct 16 '25 16:10 pdurbin

I assigned it to me; I will investigate it and see if I can reproduce it (and do a PR). We did not notice that behavior yet, but it is certainly something relevant to KU Leuven. If someone wants to join, you are welcome to do so.

ErykKul avatar Oct 16 '25 17:10 ErykKul

@stevenferey thanks for the bug report. Out of curiosity, are you able to reproduce this in the SPA at https://beta.dataverse.org/spa/ ?

@pdurbin, no, because the application server must have dataverse.api.allow-incomplete-metadata=true enabled in order to create a dataset with incomplete mandatory metadata (and my institutional information is blocking account creation).

stevenferey avatar Oct 17 '25 08:10 stevenferey

I made a PR that should fix it. I could reproduce the problem with a unit test, but not with the combination of API/UI edits. I think the test would have to be very specific for that, where the only missing value is the controlled vocabulary value? Or with a specific citation block? To be sure, the code would have to be tested in the specific conditions where the problem is reproducible.

@stevenferey, what value exactly is inserted into the datasetfieldvalue table? The code already checked for blank and "N/A" values. What I added is a check that the specific value is in the list of possible values. Do you upload the file with the UI or the API to reproduce it? (I tested both ways without any problem; however, I tested on v6.7.1.) Is it only 6.8 that has this problem? The reindex after any change to the metadata block is also important. Meanwhile, I will also try to test 6.8 specifically to see if I can reproduce the problem there (before the fix, and then with the fix, if I can reproduce it).

ErykKul avatar Nov 04 '25 11:11 ErykKul

I have now tested with 6.8. I copied a dataset from the demo Dataverse with only the subject left out and added files via the UI (JSF version) and the API. The subject is still missing and the dataset is invalid:

Image

Only the subject is missing:

Image

Setting the subject manually in UI makes the dataset valid.

Tested with v. 6.8 build 1994-92d1ec8 (the official released version)

There must be something very specific about this bug, or I am doing something wrong when trying to reproduce it. Let me know if it still persists.

ErykKul avatar Nov 04 '25 11:11 ErykKul

I have discovered that I can save the dataset after removing the subject, making it incomplete. This should not be possible in the UI. It might be the cause of the problem. I will investigate it in more detail.

ErykKul avatar Nov 04 '25 12:11 ErykKul

Removing the subject and saving was a bug. I improved the detection of validation issues in the JSF UI: you can no longer save without setting all mandatory fields, including controlled vocabulary values (it already worked for other types of values). The PR is now updated. I tested the fix in the UI.

ErykKul avatar Nov 04 '25 12:11 ErykKul

I have been trying to QA this PR. I created the dataset with dataverse.api.allow-incomplete-metadata=true set and the doNotValidate=true query parameter added: "/api/dataverses/" + dataverseAlias + "/datasets?doNotValidate=true"

The create worked. I then uploaded a file to the dataset, which added this row to the datasetfieldvalue table: 7 0 "N/A" 10

At this point I cannot publish and continue to be notified that there is a missing field.

Image Image

Long story short, I cannot reproduce this bug. (FYI, I am using the current develop branch to reproduce.)

stevenwinship avatar Dec 04 '25 21:12 stevenwinship

I retested with the 11900-improved-cvoc-value-validation branch and here is what I observed:

Scenario 1: Dataset creation without required metadata via API

Steps to reproduce:

  1. Create a dataset via API with the following payload (without required metadata):
{
  "datasetVersion": {
    "metadataBlocks": {
      "citation": {
        "fields": [
          {
            "value": [
              {
                "authorName": {
                  "value": "Finch, Fiona",
                  "typeClass": "primitive",
                  "multiple": false,
                  "typeName": "authorName"
                }
              }
            ],
            "typeClass": "compound",
            "multiple": true,
            "typeName": "author"
          }
        ],
        "displayName": "Citation Metadata"
      }
    }
  }
}
  2. Upload a file to this dataset

Observed results:

  • The subject metadata has an N/A value in the database (datasetfieldvalue table)
  • The metadata edit form does not display any error for the subject metadata
Image
  • The handling of the "incomplete metadata" tag and the "publish" button is OK.

Scenario 2: Metadata made required after dataset creation

Steps to reproduce:

  1. Create a dataset from the UI, filling in the required metadata + the publicationCitation metadata
  2. Edit the citation.tsv file to make the publicationRelationType metadata required
  3. Reindex
  4. Verify that there is a validation error on publicationRelationType when editing the dataset
  5. Upload a file

Observed results:

  • The N/A value appears in the UI for the Related Publication field
Image
  • There is no longer an error on the publicationRelationType field when editing (only a global error)
Image Image
  • The handling of the "incomplete metadata" tag and the "publish" button is OK.

stevenferey avatar Dec 05 '25 18:12 stevenferey

@stevenferey, what you observed is my fix: the global error does not let you save the dataset until the problem is fixed (by adding the missing required field). It would be better if the validator were smart enough to detect what is going wrong and point to the spot that needs fixing, but I think even that would not fix the real bug. I noticed a long time ago, when implementing the incomplete-metadata functionality for datasets, that there are edge cases (especially after making previously optional fields required) that the regular validators occasionally miss, and I could not pinpoint the cause. I implemented a more sophisticated global validator that reuses the regular validators but runs them on a copy of the dataset metadata, after first transforming the copy by removing empty values, N/A values, etc. This fixed the incomplete metadata labels, and the global validator now uses the same logic to stop you from saving the dataset (I reused it for the UI in this PR). As it turns out, these are just workarounds.
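
The idea of validating a cleaned copy can be illustrated with a small Python sketch. This is not the Dataverse code (which is Java): the data layout (a dict of field name to list of string values) and the names `strip_placeholders` and `find_violations` are assumptions made purely for illustration. It combines the two checks described in this thread: placeholders such as blank and "N/A" are removed from a copy first, and controlled vocabulary values must appear in the list of possible values.

```python
import copy

# Values treated as "no value" (assumption based on the thread: blank and N/A).
PLACEHOLDERS = {"", "N/A"}


def strip_placeholders(fields):
    """Return a cleaned deep copy of {field_name: [values]}; the input is untouched."""
    cleaned = {}
    for name, values in copy.deepcopy(fields).items():
        kept = [v for v in values if v is not None and v.strip() not in PLACEHOLDERS]
        if kept:
            cleaned[name] = kept
    return cleaned


def find_violations(fields, required, vocabularies):
    """Run the checks on the cleaned copy and return a list of problem messages."""
    cleaned = strip_placeholders(fields)
    problems = []
    # A required field with only placeholder values counts as missing.
    for name in sorted(required):
        if name not in cleaned:
            problems.append(f"{name}: required value is missing")
    # Controlled vocabulary values must be in the list of possible values.
    for name, values in cleaned.items():
        allowed = vocabularies.get(name)
        if allowed is None:
            continue  # not a controlled vocabulary field
        for value in values:
            if value not in allowed:
                problems.append(f"{name}: '{value}' is not an allowed value")
    return problems
```

With `{"title": ["My data"], "subject": ["N/A"]}` as input and `subject` required, the sketch reports the subject as missing instead of accepting the placeholder as a value.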

I know that there is a bug in metadata that I have been chasing for some time now, and I could not figure out what is causing it. I think you might have found a way to reproduce it. My suspicion is it is the same bug as the one I was fighting with. I did not realize that the key to reproducing it is to add a file to a dataset after changing the citation block (I know, changing the citation block is never a good idea...). I did come up with a query that surfaces the inconsistent state that causes all those problems (some UI quirks, like the missing Access field in our specific case at KU Leuven):

SELECT d.id, d.value, o.identifier FROM public.datasetfieldvalue d,
public.datasetfield_controlledvocabularyvalue c,
public.datasetfield f,
public.datasetfield f2,
public.datasetfieldcompoundvalue cv,
public.datasetversion v,
public.dvobject o
WHERE d.datasetfield_id = c.datasetfield_id
and f.id = d.datasetfield_id
and cv.id = f.parentdatasetfieldcompoundvalue_id
and f2.id = cv.parentdatasetfield_id
and v.id = f2.datasetversion_id
and o.id = v.dataset_id;

Can you run that query and see if you have any matches? If so, that would help a lot in understanding what is really causing the issue. In a bug-free situation there should not be any match. You can fix all affected datasets with the following statement (after validating the results of the query to make sure you are only fixing the issue and not deleting something important):

delete from public.datasetfieldvalue where id in (SELECT d.id FROM public.datasetfieldvalue d,
public.datasetfield_controlledvocabularyvalue c,
public.datasetfield f,
public.datasetfield f2,
public.datasetfieldcompoundvalue cv,
public.datasetversion v,
public.dvobject o
WHERE d.datasetfield_id = c.datasetfield_id
and f.id = d.datasetfield_id
and cv.id = f.parentdatasetfieldcompoundvalue_id
and f2.id = cv.parentdatasetfield_id
and v.id = f2.datasetversion_id
and o.id = v.dataset_id);
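
To see what the query matches before running anything against a production database, here is a hedged Python sketch using an in-memory SQLite database. The toy schema contains only the columns the query references (plus `controlledvocabularyvalues_id`, which is an assumption), the `public.` prefix is dropped because SQLite has no such schema, and all rows, including the identifier `FK2/TOY1`, are invented. The helper deletes by id after previewing the matches rather than using the subquery form above.

```python
import sqlite3

# Toy schema (assumption): only what the query needs, not the real Dataverse DDL.
SCHEMA = """
CREATE TABLE dvobject (id INTEGER PRIMARY KEY, identifier TEXT);
CREATE TABLE datasetversion (id INTEGER PRIMARY KEY, dataset_id INTEGER);
CREATE TABLE datasetfield (
    id INTEGER PRIMARY KEY,
    datasetversion_id INTEGER,
    parentdatasetfieldcompoundvalue_id INTEGER
);
CREATE TABLE datasetfieldcompoundvalue (id INTEGER PRIMARY KEY, parentdatasetfield_id INTEGER);
CREATE TABLE datasetfieldvalue (id INTEGER PRIMARY KEY, datasetfield_id INTEGER, value TEXT);
CREATE TABLE datasetfield_controlledvocabularyvalue (
    datasetfield_id INTEGER,
    controlledvocabularyvalues_id INTEGER
);
"""

# Same joins as the query in this thread, without the "public." prefix.
FIND = """
SELECT d.id, d.value, o.identifier
FROM datasetfieldvalue d, datasetfield_controlledvocabularyvalue c,
     datasetfield f, datasetfield f2, datasetfieldcompoundvalue cv,
     datasetversion v, dvobject o
WHERE d.datasetfield_id = c.datasetfield_id
  AND f.id = d.datasetfield_id
  AND cv.id = f.parentdatasetfieldcompoundvalue_id
  AND f2.id = cv.parentdatasetfield_id
  AND v.id = f2.datasetversion_id
  AND o.id = v.dataset_id
"""


def build_toy_db():
    """Create an in-memory database with one inconsistent and one consistent field."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO dvobject VALUES (1, 'FK2/TOY1')")
    conn.execute("INSERT INTO datasetversion VALUES (10, 1)")
    # Parent compound field (id 100) holding one compound value (id 200).
    conn.execute("INSERT INTO datasetfield VALUES (100, 10, NULL)")
    conn.execute("INSERT INTO datasetfieldcompoundvalue VALUES (200, 100)")
    # Child 101 has a vocabulary link AND a stray text row: the inconsistent state.
    conn.execute("INSERT INTO datasetfield VALUES (101, NULL, 200)")
    conn.execute("INSERT INTO datasetfield_controlledvocabularyvalue VALUES (101, 5)")
    conn.execute("INSERT INTO datasetfieldvalue VALUES (7, 101, 'N/A')")
    # Child 102 has a plain text value only: consistent, must not match.
    conn.execute("INSERT INTO datasetfield VALUES (102, NULL, 200)")
    conn.execute("INSERT INTO datasetfieldvalue VALUES (8, 102, 'some text')")
    conn.commit()
    return conn


def delete_stray_rows(conn):
    """Preview the matches, delete exactly those rows in one transaction, return them."""
    matches = conn.execute(FIND).fetchall()
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "DELETE FROM datasetfieldvalue WHERE id = ?",
            [(row[0],) for row in matches],
        )
    return matches
```

Selecting first and deleting by id keeps a record of exactly which rows were removed, which fits the advice above to validate the query results before deleting anything.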

The file‑addition step may be the key to reproducing the inconsistent state. If it’s the same issue, this should help me pinpoint the exact source - I haven’t been able to isolate it until now. I’ll also run some additional tests on my side when I can.

ErykKul avatar Dec 09 '25 10:12 ErykKul