
more flexible job manager end state

Open HansVRP opened this issue 9 months ago • 4 comments

With the new internal queue, jobs are automatically retried in case more jobs are created than the number of allowed parallel jobs.

Since the job manager runs until all jobs end in finalized, start_failed or error, it does not support the internal queueing.

Ideally we would build in some flexibility that allows the user to submit and track more parallel jobs than their standard account supports. Can we make the 'end condition' on start_failed more flexible without risking an endless loop?
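To make the idea concrete, here is a minimal sketch of a bounded end condition: a job whose start failed for a transient reason gets a limited retry budget before it is treated as terminal, so the job manager's run loop cannot spin forever. The function name, statuses other than those mentioned in this thread, and the retry budget are illustrative assumptions, not the actual client implementation.

```python
# Hypothetical sketch (not the openeo-python-client implementation):
# treat 'start_failed' as terminal only once a per-job retry budget
# is exhausted, so the job manager cannot loop endlessly.

MAX_START_RETRIES = 3  # assumption: how many restarts to allow per job


def is_terminal(status: str, start_attempts: int) -> bool:
    """Decide whether the job manager should stop tracking this job."""
    if status in {"finished", "error"}:
        # Unambiguously terminal states.
        return True
    if status == "start_failed":
        # Only terminal once the retry budget is used up;
        # otherwise the job stays eligible for another start attempt.
        return start_attempts > MAX_START_RETRIES
    return False
```

With such a condition, a job rejected because of a temporary parallel-job limit would simply be retried on a later loop iteration instead of ending the run.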

HansVRP avatar Apr 19 '25 07:04 HansVRP

Since the job manager runs until all jobs end in finalized, start failed or error, it does not support the internal queueing.

I'm not sure I understand what you mean. The "internal queuing" feature is just an internal backend thing by design, I don't think there is anything required client-side to support that.

Something that might be possible however, is to have a standard API to discover and leverage job submission limits as discussed at

  • https://github.com/Open-EO/openeo-api/issues/559

soxofaan avatar Apr 22 '25 07:04 soxofaan

Will create a minimal example to reproduce the issue

HansVRP avatar Apr 22 '25 07:04 HansVRP

Narrowed down the issue:

It comes from the try/except block introduced in this PR: https://github.com/Open-EO/openeo-python-client/pull/736

```python
def execute(self) -> _TaskResult:
    """
    Executes the job start process using the openEO connection.

    Authenticates if a bearer token is provided, retrieves the job by ID,
    and attempts to start it.

    :returns:
        A `_TaskResult` with status and statistics metadata, indicating
        success or failure of the job start.
    """
    try:
        conn = openeo.connect(self.root_url)
        if self.bearer_token:
            conn.authenticate_bearer_token(self.bearer_token)
        job = conn.job(self.job_id)
        job.start()
        _log.info(f"Job {self.job_id} started successfully")
        return _TaskResult(
            job_id=self.job_id,
            db_update={"status": "queued"},
            stats_update={"job start": 1},
        )
    except Exception as e:
        _log.error(f"Failed to start job {self.job_id}: {e}")
        return _TaskResult(
            job_id=self.job_id,
            db_update={"status": "start_failed"},
            stats_update={"start_job error": 1},
        )
```

Failed to start job j-2504220752104722b90406957695f315: [429] Too Many Requests

--> We need to avoid labeling "Too Many Requests" (HTTP 429) errors as start_failed and instead keep those jobs in 'created', so they get started again on a later iteration.
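One way to express that distinction is a small classification helper that maps the HTTP status of a failed `job.start()` to the status the job should get in the job database. This is only a sketch of the proposed fix, not the actual client code; the set of retryable status codes is an assumption, and `http_status_code` mirrors the attribute that openeo's `OpenEoApiError` exposes.

```python
# Hypothetical sketch: decide which database status a failed
# job.start() attempt should produce. Transient errors such as
# HTTP 429 ("Too Many Requests") keep the job in 'created' so it
# can be retried later, instead of the terminal 'start_failed'.

# Assumption: which HTTP status codes are worth retrying.
RETRYABLE_STATUS_CODES = {429, 500, 502, 503}


def classify_start_failure(http_status_code=None) -> str:
    """Map a failed start attempt to a job-manager status string."""
    if http_status_code in RETRYABLE_STATUS_CODES:
        return "created"  # transient failure: leave the job retryable
    return "start_failed"  # permanent failure: give up on this job
```

In the `except` branch of `execute()` above, the `db_update` could then use `{"status": classify_start_failure(...)}` instead of hard-coding `"start_failed"` for every exception.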

HansVRP avatar Apr 22 '25 09:04 HansVRP

related:

  • https://github.com/Open-EO/openeo-python-client/issues/764

soxofaan avatar Apr 22 '25 10:04 soxofaan