[Q] Depend on the past

Open kakoni opened this issue 8 years ago • 18 comments

Apache Airflow has a feature called depends_on_past, where "task instances will depend on the success of the preceding task instance".

I find this extremely useful in my use case, where I have daily recurring tasks: the task running on 20170806 depends on the success of the one for 20170805.

Is there a way to do something similar with digdag?

kakoni avatar Aug 06 '17 07:08 kakoni

@kakoni Have you ever tried require>? http://docs.digdag.io/operators/require.html#require-depends-on-another-workflow

I hope this is the operator you are looking for.

hiroyuki-sato avatar Aug 07 '17 05:08 hiroyuki-sato

Aah right, I could do something like

+require:
  require>: ..SELF..
  session_time: ${last_session_time}

Yes, that would work! One more question, though: how do I handle the initial run, which has no last session to depend on?

kakoni avatar Aug 07 '17 07:08 kakoni

Hmm... last_session_time is calculated just based on current timestamp... https://github.com/treasure-data/digdag/blob/master/digdag-standards/src/main/java/io/digdag/standards/scheduler/SecondsIntervalSchedulerFactory.java#L73

Indeed, you can't depend on it.

komamitsu avatar Aug 08 '17 04:08 komamitsu

Maybe you need to use external persistent data (e.g. local file) as a workaround like this.

+start:
  sh>: touch /tmp/${session_time}.lock

+check:
  sh>: if [ -f /tmp/${last_session_time}.lock ]; then exit 1; fi

+run:
  echo>: "Executing ${session_time}"

+end:
  sh>: rm /tmp/${session_time}.lock

There seems to be room to improve the robustness of the above workflow, though.

komamitsu avatar Aug 08 '17 05:08 komamitsu

@hiroyuki-sato Does digdag have an interface for getting the previous instance of a session (i.e. for getting its status)?

kakoni avatar Nov 22 '17 08:11 kakoni

Hello, @kakoni

Could you tell me more about your question? Are you looking for a CLI command like this? https://github.com/treasure-data/digdag/issues/603

Maybe there is no CLI interface yet.

hiroyuki-sato avatar Nov 22 '17 08:11 hiroyuki-sato

I was thinking about creating a new operator, or extending require> with a depends_on_past option (perhaps there is a better name, but I'm using this one for now).

To get that to work, I would need to access the previous instance of the current session. In pseudocode:

  • No previous session exists for this session => Good to go / OK
  • Previous session still running => Wait
  • Previous session completed unsuccessfully => Wait
  • Previous session completed successfully => Good to go / start current workflow.
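
The four rules above could be sketched in Python roughly as follows. This is only an illustration of the decision logic: SessionStatus, Decision, and decide are hypothetical names and are not part of digdag's API.

```python
from enum import Enum
from typing import Optional


class SessionStatus(Enum):
    """Hypothetical states of the previous session."""
    RUNNING = "running"
    FAILED = "failed"
    SUCCEEDED = "succeeded"


class Decision(Enum):
    RUN = "run"
    WAIT = "wait"


def decide(previous: Optional[SessionStatus]) -> Decision:
    """Map the previous session's status to a decision for the current one."""
    if previous is None:
        # No previous session exists, so this is the first run.
        return Decision.RUN
    if previous is SessionStatus.SUCCEEDED:
        return Decision.RUN
    # Still running, or completed unsuccessfully: keep waiting.
    return Decision.WAIT
```

An operator built on this would presumably call decide repeatedly until it returns Decision.RUN.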

kakoni avatar Nov 22 '17 08:11 kakoni

Hello, @kakoni

I have no idea yet. I'll let you know if I find a good solution. (Since I'm not a core developer, I have to read the source.) Let's hack digdag! :smile:

hiroyuki-sato avatar Nov 27 '17 10:11 hiroyuki-sato

@kakoni Did you ever find a solution to this problem? I'm dealing with the same thing. See #929.

jaymed avatar Jan 10 '19 19:01 jaymed

@jaymed Yes. I really wanted to use digdag for my use cases, but since depending on the past is so essential to my workflows, I had to go with Airflow.

kakoni avatar Jan 10 '19 21:01 kakoni

@kakoni OK makes sense. Thanks for getting back to me.

@hiroyuki-sato There's definitely a major need for this feature.

jaymed avatar Jan 10 '19 21:01 jaymed

Hello, @kakoni and @jaymed

Thank you for commenting on a new feature.

Compared with the Airflow project (677 contributors), digdag is still developed by a very small team (58 contributors).

I will consider those requests.

By the way, I'm not familiar with Apache Airflow. Do you know how to write depends_on_tasks for an initial state in Airflow? (I mean a run that can't depend on the last session.) https://github.com/treasure-data/digdag/issues/615#issuecomment-320591081

hiroyuki-sato avatar Jan 11 '19 00:01 hiroyuki-sato

@muga Please take a look at this issue when you get a chance: https://github.com/treasure-data/digdag/issues/929#issuecomment-454270266

hiroyuki-sato avatar Jan 15 '19 08:01 hiroyuki-sato

To solve #615 and #929, I would like to introduce two new scheduler options, wait_until_last_schedule and wait_until_last_schedule_succeed, as follows.

https://github.com/treasure-data/digdag/compare/master...yoyama:feature-wait_until_last_schedule?expand=1

  • If wait_until_last_schedule is true, the scheduler waits until the last session has finished.
  • If wait_until_last_schedule_succeed is true, the scheduler waits until the last session has finished successfully.
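
As a rough illustration of how these two options might behave, here is a small Python sketch. It is not taken from the linked branch: the function name, the status strings "running", "success", and "error", the precedence between the two flags, and the handling of a missing last session are all assumptions made for the example.

```python
from typing import Optional


def scheduler_may_start(
    last_status: Optional[str],
    wait_until_last_schedule: bool = False,
    wait_until_last_schedule_succeed: bool = False,
) -> bool:
    """Return True if the scheduler may start the next session now."""
    if last_status is None:
        # Assumption: with no earlier session there is nothing to wait for.
        return True
    if wait_until_last_schedule_succeed:
        # Stricter option: the last session must have finished successfully.
        return last_status == "success"
    if wait_until_last_schedule:
        # The last session only has to be finished, whatever the outcome.
        return last_status in ("success", "error")
    # Neither option set: start regardless of the last session.
    return True
```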

How about these options?

yoyama avatar Jan 17 '19 09:01 yoyama

@hiroyuki-sato

Do you know how to write depends_on_tasks for an initial state in Airflow? (I mean a run that can't depend on the last session.)

There's another configuration option called start_date. If your execution date is the same as start_date, the run doesn't depend on the last session, because it is the initial (first) state.
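
A minimal Python sketch of that rule, for illustration only. It is a simplification and not Airflow's actual code; may_run and its parameters are hypothetical names.

```python
from datetime import date


def may_run(execution_date: date, start_date: date, previous_succeeded: bool) -> bool:
    """Decide whether a run may start under a depends_on_past-style rule."""
    if execution_date == start_date:
        # First run: there is no earlier run to depend on.
        return True
    # Every later run requires the preceding run to have succeeded.
    return previous_succeeded


# The daily example from this thread: 20170806 depends on 20170805.
first_day = date(2017, 8, 5)
second_day = date(2017, 8, 6)
first_ok = may_run(first_day, first_day, previous_succeeded=False)
second_blocked = may_run(second_day, first_day, previous_succeeded=False)
second_ok = may_run(second_day, first_day, previous_succeeded=True)
```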

kakoni avatar Jan 25 '19 10:01 kakoni

Hello, @kakoni

Thank you for your reply!

@yoyama Do wait_until_last_schedule and wait_until_last_schedule_succeed support an equivalent of Airflow's start_date option?

hiroyuki-sato avatar Jan 25 '19 11:01 hiroyuki-sato

Here's the logic in Airflow, if you're interested: https://github.com/apache/airflow/blob/master/airflow/ti_deps/deps/prev_dagrun_dep.py#L47

kakoni avatar Jan 25 '19 12:01 kakoni

I want this feature too. It is necessary when backfilling multiple sessions that must proceed one by one. It is also needed for workflows that have to run as a single job at a time, such as memory-intensive workflows.

y-ken avatar Dec 17 '19 03:12 y-ken