Ax is not starting as many workers as I'd like; sometimes, get_next_trials returns 0 new trials
Hi,
I really like Ax for optimizing hyperparameters. Based on it, I have written a tool for hyperparameter optimization, but I have stumbled upon a problem.
We use Slurm and submitit on our cluster and it all works fine, except for one thing: the number of parallel "workers" (i.e. the number of jobs running in parallel) hardly ever reaches the maximum specified in my script.
The problem lies in the ax_client.get_next_trials function. I run a loop like this:
new_jobs_needed = min(args.num_parallel_jobs - len(jobs), max_eval - submitted_jobs)
for m in range(0, new_jobs_needed):
    trial_index_to_param, _ = ax_client.get_next_trials(max_trials=1)
I've tried max_trials=args.max_trials (coming from argparse) as well, but the behaviour is the same.
Sometimes, trial_index_to_param is empty; it comes back with 0 entries.
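For context, this is a simplified sketch of the surrounding submission loop; the submitit executor setup and the evaluate function are placeholders for what my actual script does:

import submitit
from ax.service.ax_client import AxClient

ax_client = AxClient()
# ... ax_client.create_experiment(...) as shown further below ...

executor = submitit.AutoExecutor(folder="submitit_logs")  # placeholder setup
jobs = []
submitted_jobs = 0

while submitted_jobs < max_eval:
    new_jobs_needed = min(args.num_parallel_jobs - len(jobs), max_eval - submitted_jobs)
    for m in range(0, new_jobs_needed):
        trial_index_to_param, _ = ax_client.get_next_trials(max_trials=1)
        for trial_index, parameters in trial_index_to_param.items():
            job = executor.submit(evaluate, parameters)  # evaluate is a placeholder
            jobs.append((job, trial_index))
            submitted_jobs += 1
    # ... wait for finished jobs and report their results back to ax_client ...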
I've tried the following:
experiment_args = {
    "name": experiment_name,
    "parameters": experiment_parameters,
    "objectives": {"result": ObjectiveProperties(minimize=minimize_or_maximize)},
    "choose_generation_strategy_kwargs": {
        "num_trials": max_eval,
        "num_initialization_trials": args.num_parallel_jobs,
        "use_batch_trials": True,
        "max_parallelism_override": args.num_parallel_jobs,
    },
}
experiment = ax_client.create_experiment(**experiment_args)
But still, sometimes the result coming from get_next_trials is empty and has 0 entries. Whether or not I set use_batch_trials doesn't make any difference there, as far as I can tell.
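As another data point, I also print what Ax itself reports as the allowed parallelism per generation step (ax_client.get_next_trials aside, ax_client.get_max_parallelism() is, as far as I understand, the intended way to query this):

# Each tuple is (number of trials, max parallelism) for one phase of the
# generation strategy; -1 means "no limit".
for num_trials, max_parallelism in ax_client.get_max_parallelism():
    print(f"parallelism: {max_parallelism} for {num_trials} trials")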
This is done in 10-minute slots; in the beginning there are many completed jobs, almost 90 per 10-minute slot, but later on there are fewer and fewer, each time because the length of trial_index_to_param is 0.
Is there anything more I can do about this? How can I make use of the full number of parallel evaluations I specified?
Thanks!
Edit: I tried adding enforce_sequential_optimization=False to the choose_generation_strategy_kwargs, but that doesn't change anything either.
https://github.com/NormanTUD/OmniOpt/tree/main/ax
Main script:
https://github.com/NormanTUD/OmniOpt/blob/main/ax/.omniopt.py
For anyone looking into the environment in which the problem appears: my general plan is to allow calls like this (the --run_program argument is the base64-encoded command to run, here echo "RESULT: %(param)"):
./omniopt --partition=alpha --experiment_name=example --mem_gb=1 --time=60 --worker_timeout=60 --max_eval=500 --num_parallel_jobs=500 --gpus=1 --follow --run_program=ZWNobyAiUkVTVUxUOiAlKHBhcmFtKSI= --parameter param range 0 1000 float
and to run that optimization on our clusters, using Ax/BoTorch internally for hyperparameter optimization. We have basically unlimited resources for free (university) and want to run as many workers in parallel as possible, so that we get the most out of the HPC system when finding good hyperparameters for every type of problem, or when just exploring those parameter spaces (depending on what your program does).
At the top of the code is a large comment showing some of the things I tried, though the list is anything but complete.
We would really appreciate any help with this.
Yours sincerely,
NormanTUD
Hi @NormanTUD! Thanks so much for engaging with our tool - happy to help. Could you provide the logs from AxClient for your experiment? These logs usually contain information about the trial generation and the generation strategy that will be helpful for us in debugging the issue.
Also, good catch on "use_batch_trials" not having an effect. This code hasn't been open-sourced yet (hopefully soon!), so it isn't doing anything at this time. Let me raise an error to make that clearer.
@NormanTUD -- added a PR that raises an error when use_batch_trials is set; it'll be live once we cut a new release :)
Let me know if you have the logs from AxClient for additional support. Thanks!
Hi,
thanks for your reply. I was on vacation and as such didn't code anything, but I am trying to collect all the logs now. Thanks for your patience; I will update this post when I have the logs.
Update #1:
First, a bit of my own debugging code:
trial_index_to_param, _ = ax_client.get_next_trials(
    max_trials=1
)

print_debug(f"Got {len(trial_index_to_param.items())} new items (m = {m}, in range(0, {calculated_max_trials})).")
These lines are only executed when there are new jobs to be generated. For further testing, max_trials= is set to 1 and the call is made in a for loop, once for each new job, instead of setting max_trials= to the number of new trials. But sometimes, I get this:
2024-03-26 11:14:13: Got 0 new items (m = 0, in range(0, 33)).
So it just returns 0 jobs.
These are the number of workers over time:
17
7
5
8
(No timestamps given there, though; it is one value per generation loop.)
It should be around ~20, so 17 is fine for a snapshot taken while the jobs are still starting, but over time it gets much lower.
The only message from Ax that seems relevant is this:
ax.models.torch.botorch_modular.acquisition:
Encountered Xs pending for some Surrogates but observed for others. Considering
these points to be pending.
I've seen the tag "fixready" and installed the latest version (via pip from GitHub). I cannot see any change in behaviour; it looks exactly like before. I am not entirely sure whether this tag implies that the fix is already in master, but if it is, it hasn't changed anything for me.
The problem seems to be that generation_node.generator_run_limit() returns 0, even though it shouldn't. I am not sure why yet, though.
Edit: I debugged it a bit more. With 30 workers in parallel, I get this, and so it returns 0:
generation_node.generator_run_limit: criterion = MaxTrials({'threshold': 30, 'only_in_statuses': None, 'not_in_statuses': [<TrialStatus.FAILED: 2>, <TrialStatus.ABANDONED: 5>], 'transition_to': 'GenerationStep_1', 'block_transition_if_unmet': True, 'block_gen_if_met': True}), this_threshold: 0
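To see which trials actually count towards that threshold, I dump the trial statuses with a small snippet like this (experiment and the trial status attribute are standard Ax objects as far as I can tell; print_debug is again my own helper):

from collections import Counter

# Count trials per status. Only FAILED and ABANDONED are excluded by the
# MaxTrials criterion above, so everything else (COMPLETED, RUNNING, ...)
# counts towards the threshold of 30.
status_counts = Counter(trial.status.name for trial in ax_client.experiment.trials.values())
print_debug(f"Trial statuses: {dict(status_counts)}")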
I changed the function to this in modelbridge/generation_node.py:
def generator_run_limit(self, supress_generation_errors: bool = True) -> int:
    """How many generator runs can this generation strategy generate right now,
    assuming each one of them becomes its own trial. Only considers
    `transition_criteria` that are TrialBasedCriterion.

    Returns:
        - the number of generator runs that can currently be produced, with -1
          meaning unlimited generator runs,
    """
    # TODO @mgarrard remove filter when legacy usecases are updated
    valid_criterion = []
    for criterion in self.transition_criteria:
        if criterion.criterion_class not in {
            "MinAsks",
            "RunIndefinitely",
        }:
            myprint(f"generator_run_limit: adding class {criterion.criterion_class} to criterion")
            valid_criterion.append(criterion)

    myprint(f"generator_run_limit: valid_criterion: {valid_criterion}")
    # TODO: @mgarrard Should we consider returning `None` if there is no limit?
    # TODO:@mgarrard Should we instead have `raise_generation_error`? The name
    # of this method doesn't suggest that it would raise errors by default, since
    # it's just finding out the limit according to the name. I know we want the
    # errors in some cases, so we could call the flag `raise_error_if_cannot_gen` or
    # something like that : )
    trial_based_gen_blocking_criteria = [
        criterion
        for criterion in valid_criterion
        if criterion.block_gen_if_met and isinstance(criterion, TrialBasedCriterion)
    ]
    """
    gen_blocking_criterion_delta_from_threshold = [
        criterion.num_till_threshold(
            experiment=self.experiment, trials_from_node=self.trials_from_node
        )
        for criterion in trial_based_gen_blocking_criteria
    ]
    """

    gen_blocking_criterion_delta_from_threshold = []

    for criterion in trial_based_gen_blocking_criteria:
        this_threshold = criterion.num_till_threshold(
            experiment=self.experiment, trials_from_node=self.trials_from_node
        )

        myprint(f"generator_run_limit: criterion = {criterion}, this_threshold: {this_threshold}")

        gen_blocking_criterion_delta_from_threshold.append(this_threshold)

    myprint(f"generator_run_limit: gen_blocking_criterion_delta_from_threshold: {gen_blocking_criterion_delta_from_threshold}")

    # Raise any necessary generation errors: for any met criterion,
    # call its `block_continued_generation_error` method The method might not
    # raise an error, depending on its implementation on given criterion, so the
    # error from the first met one that does block continued generation, will be
    # raised.
    if not supress_generation_errors:
        for criterion in trial_based_gen_blocking_criteria:
            # TODO[mgarrard]: Raise a group of all the errors, from each gen-
            # blocking transition criterion.
            if criterion.is_met(
                self.experiment, trials_from_node=self.trials_from_node
            ):
                criterion.block_continued_generation_error(
                    node_name=self.node_name,
                    model_name=self.model_to_gen_from_name,
                    experiment=self.experiment,
                    trials_from_node=self.trials_from_node,
                )
    if len(gen_blocking_criterion_delta_from_threshold) == 0:
        if not self.gen_unlimited_trials:
            logger.warning(
                "Even though this node is not flagged for generation of unlimited "
                "trials, there are no generation blocking criterion, therefore, "
                "unlimited trials will be generated."
            )
        myprint(f"generator_run_limit: returning -1 (no limit)")
        return -1
    res = min(gen_blocking_criterion_delta_from_threshold)
    myprint(f"generator_run_limit: returning res {res}")
    return res
myprint just prepends the filename to the message, so I can debug more easily.
I am not sure why some trials have failed, nor why some are abandoned, but in the end the this_threshold variable gives me 0, and that gets chosen as the number of new parameter sets to be created.
I also tried monkey patching it:
from unittest.mock import patch

def patched_generator_run_limit(*args, **kwargs):
    return 1

with patch('ax.modelbridge.generation_node.GenerationNode.generator_run_limit', new=patched_generator_run_limit):
    trial_index_to_param, _ = ax_client.get_next_trials(max_trials=1)
around the get_next_trials(max_trials=1) call, but then I get this exception:
All trials for current model have been generated, but not enough data has been
observed to fit next model. Try again when more data are available.
Adding min_trials_observed=1 to the model=Models.BOTORCH_MODULAR step in the GenerationStrategy didn't help; the error didn't go away.
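For completeness, the explicit GenerationStrategy I experimented with looks roughly like this (simplified; the Sobol initialization step and the concrete numbers are just illustrative, my actual script builds them from the command-line arguments):

from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models
from ax.service.ax_client import AxClient

generation_strategy = GenerationStrategy(
    steps=[
        # Quasi-random initialization; parallelism capped at the number of workers.
        GenerationStep(
            model=Models.SOBOL,
            num_trials=args.num_parallel_jobs,
            max_parallelism=args.num_parallel_jobs,
        ),
        # Bayesian optimization for all remaining trials.
        GenerationStep(
            model=Models.BOTORCH_MODULAR,
            num_trials=-1,
            min_trials_observed=1,
            max_parallelism=args.num_parallel_jobs,
        ),
    ]
)
ax_client = AxClient(generation_strategy=generation_strategy)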
Is there anything else I can provide?
Yours sincerely,
NormanTUD
I have made a breakthrough regarding the reason why I don't get as many workers as expected!
When a job failed, I needed to do:
_trial = ax_client.get_trial(trial_index)
_trial.mark_failed()
ax_client.log_trial_failure(trial_index=trial_index)
and when it succeeded, I needed to do:
_trial = ax_client.get_trial(trial_index)
_trial.mark_completed(unsafe=True)
This way, Ax knows that the jobs are finished (or failed), and it no longer blocks the generation of new points with regard to max_parallelism.
This was, admittedly, previously unclear to me.
Now it finally works pretty much as I like it :)
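For anyone who stumbles over the same thing: this is roughly how the status handling sits in my polling loop now (simplified; jobs holds (submitit_job, trial_index) pairs, job_failed is a placeholder for however you detect a failed Slurm job, and reporting the actual objective value via ax_client.complete_trial happens elsewhere in my script):

still_running = []
for job, trial_index in jobs:
    if not job.done():
        still_running.append((job, trial_index))
        continue
    _trial = ax_client.get_trial(trial_index)
    if job_failed(job):
        # Mark the trial as failed so it no longer blocks new generations
        # with respect to max_parallelism.
        _trial.mark_failed()
        ax_client.log_trial_failure(trial_index=trial_index)
    else:
        # Same for successful jobs: make sure the status really is COMPLETED.
        _trial.mark_completed(unsafe=True)
jobs = still_running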