[Bug]: TextIO.read().withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW) still fails if no file is found
What happened?
According to the documentation (https://beam.apache.org/releases/javadoc/2.11.0/org/apache/beam/sdk/io/TextIO.html), we can configure the read() transform to also allow empty or missing matches, so I configured a pipeline step like this:
PCollection<KV<String, Incident>> incidents = pipeline.apply("READ CSV", TextIO.read().withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW).from("gs://bucket/incidents.csv"))
.apply("Convert to bean and output KV Collection", ParDo.of(new IncidentCsvToBeanFunction()));
// The IncidentCsvToBeanFunction is just mapping the csv content to a java class.
However, the pipeline still fails to start with this configuration when incidents.csv is not present in the bucket. When I use a wildcard instead, for example incidents*.csv or *.csv, it works even if incidents.csv does not exist. According to the documentation, as I understand it, .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW) should also work with the exact name incidents.csv (no wildcards) even when the CSV is not present, so I consider this a bug unless I have misunderstood. The error I'm getting in the Google Cloud Dataflow logs is:
"[The preflight pipeline validation failed for job 2024-05-14_15_25_49-jobnumbers. To bypass the validation, use the Dataflow service option with the value enable_preflight_validation=false. Learn more at https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#validation] NOT_FOUND: Unable to find object incidents.csv in bucket bucket_name.
What I expect is the same behaviour as when I use the wildcard: the pipeline continues and returns an empty PCollection, even when the CSV does not exist in the bucket.
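For reference, here is a minimal self-contained sketch of the two read configurations I am comparing. It is only an illustration, not the exact job: the class name, bucket and object names are placeholders.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.EmptyMatchTreatment;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class EmptyMatchRepro {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Exact object name: Dataflow preflight validation rejects the job when the
    // object is missing, even though EmptyMatchTreatment.ALLOW is set.
    PCollection<String> exact =
        pipeline.apply(
            "Read exact name",
            TextIO.read()
                .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
                .from("gs://bucket_name/incidents.csv"));

    // Wildcard pattern: the same missing file is tolerated, the pipeline starts
    // and the read produces an empty PCollection.
    PCollection<String> wildcard =
        pipeline.apply(
            "Read wildcard",
            TextIO.read()
                .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW)
                .from("gs://bucket_name/incidents*.csv"));

    pipeline.run();
  }
}
```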
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
- [ ] Component: Python SDK
- [X] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam YAML
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [X] Component: Google Cloud Dataflow Runner
Sorry, I think the problem is the Dataflow preflight service. I tried what the link in the error message suggests: https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline?hl=es-419#validation
using: --dataflowServiceOptions=enable_preflight_validation=false
and it allowed the pipeline to start and work as expected. But I think this is a bad workaround: what if I still want the preflight service on? The preflight validation should also recognize .withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW) and allow the pipeline to start even if it separately detects that the CSV does not exist in the bucket.
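Concretely, this is a sketch of how I passed the flag when launching the job (project and region values are placeholders, and it assumes the Dataflow runner is on the classpath so the option is recognized by the normal args parsing):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PreflightWorkaround {
  public static void main(String[] args) {
    // Pass the Dataflow service option through the normal options parsing;
    // this disables the preflight validation that rejects the missing file.
    String[] withWorkaround = new String[] {
      "--runner=DataflowRunner",
      "--project=my-project",   // placeholder
      "--region=us-central1",   // placeholder
      "--dataflowServiceOptions=enable_preflight_validation=false"
    };
    PipelineOptions options = PipelineOptionsFactory.fromArgs(withWorkaround).create();
    Pipeline pipeline = Pipeline.create(options);
    // ... build the same "READ CSV" step as above, then:
    pipeline.run();
  }
}
```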
This looks like a bug in the service. Can you open a Google Cloud support case?
@liferoad Sure, I could do that. Can you please share the link for opening one? I have not opened one before, unless you mean the Google Cloud community forum?
Please check this: https://cloud.google.com/dataflow/docs/support/getting-support#file-bugs-or-feature-requests