Ax is not starting as many workers as I'd like; sometimes, get_next_trials returns 0 new trials
Hi,
I really like Ax for optimizing hyperparameters. Based on it, I have written a tool for hyperparameter optimization, but I have stumbled upon a problem.
We use Slurm and submitit on our cluster and it all works fine, except for one thing: the number of parallel "workers" (i.e. the number of jobs running in parallel) hardly ever reaches the maximum specified in my script.
The problem lies in the ax_client.get_next_trials function. I run a loop like this:
new_jobs_needed = min(args.num_parallel_jobs - len(jobs), max_eval - submitted_jobs)
for m in range(0, new_jobs_needed):
    trial_index_to_param, _ = ax_client.get_next_trials(max_trials=1)
I've tried max_trials=args.max_trials (coming from argparse) as well, but the behaviour is the same.
Sometimes, trial_index_to_param is empty; it comes back with 0 entries.
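For context, this is a simplified sketch of the surrounding submission loop; the submitit executor setup and the evaluate function are placeholders for what my actual script does:

import submitit
from ax.service.ax_client import AxClient

ax_client = AxClient()
# ... ax_client.create_experiment(...) as shown further below ...

executor = submitit.AutoExecutor(folder="submitit_logs")  # placeholder setup
jobs = []
submitted_jobs = 0

while submitted_jobs < max_eval:
    new_jobs_needed = min(args.num_parallel_jobs - len(jobs), max_eval - submitted_jobs)
    for m in range(0, new_jobs_needed):
        trial_index_to_param, _ = ax_client.get_next_trials(max_trials=1)
        for trial_index, parameters in trial_index_to_param.items():
            job = executor.submit(evaluate, parameters)  # evaluate is a placeholder
            jobs.append((job, trial_index))
            submitted_jobs += 1
    # ... wait for finished jobs and report their results back to ax_client ...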
I've tried the following:
experiment_args = {
    "name": experiment_name,
    "parameters": experiment_parameters,
    "objectives": {"result": ObjectiveProperties(minimize=minimize_or_maximize)},
    "choose_generation_strategy_kwargs": {
        "num_trials": max_eval,
        "num_initialization_trials": args.num_parallel_jobs,
        "use_batch_trials": True,
        "max_parallelism_override": args.num_parallel_jobs,
    },
}
experiment = ax_client.create_experiment(**experiment_args)
But still, sometimes the result coming from get_next_trials is empty and has 0 entries. Whether or not I set use_batch_trials doesn't make any difference there, as far as I can tell.
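As another data point, I also print what Ax itself reports as the allowed parallelism per generation step (ax_client.get_next_trials aside, ax_client.get_max_parallelism() is, as far as I understand, the intended way to query this):

# Each tuple is (number of trials, max parallelism) for one phase of the
# generation strategy; -1 means "no limit".
for num_trials, max_parallelism in ax_client.get_max_parallelism():
    print(f"parallelism: {max_parallelism} for {num_trials} trials")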
This is done in 10-minute slots; in the beginning there are many completed jobs, almost 90 per 10-minute slot, but later on there are fewer and fewer, each time because the length of trial_index_to_param is 0.
Is there anything more I can do about this? How can I make use of the full number of parallel evaluations I specified?
Thanks!
Edit: I tried adding enforce_sequential_optimization=False to the choose_generation_strategy_kwargs, but that doesn't change anything either.
https://github.com/NormanTUD/OmniOpt/tree/main/ax
Main script:
https://github.com/NormanTUD/OmniOpt/blob/main/ax/.omniopt.py
For anyone looking into the environment in which the problem appears: my general plan is to allow calls like this (the --run_program argument is the base64-encoded command to run, here echo "RESULT: %(param)"):
./omniopt --partition=alpha --experiment_name=example --mem_gb=1 --time=60 --worker_timeout=60 --max_eval=500 --num_parallel_jobs=500 --gpus=1 --follow --run_program=ZWNobyAiUkVTVUxUOiAlKHBhcmFtKSI= --parameter param range 0 1000 float
and to run that optimization on our clusters, using Ax/BoTorch internally for hyperparameter optimization. We have basically unlimited resources for free (university) and want to run as many workers in parallel as possible, so that we get the most out of the HPC system when finding good hyperparameters for every type of problem, or when just exploring those parameter spaces (depending on what your program does).
At the top of the code is a large comment showing some of the things I tried, though the list is anything but complete.
We would really appreciate any help with this.
Yours sincerely,
NormanTUD
Hi @NormanTUD! Thanks so much for engaging with our tool - happy to help. Could you provide the logs from AxClient for your experiment? These logs usually contain information about the trial generation and the generation strategy that will be helpful for us in debugging the issue.
Also, good catch on "use_batch_trials" not having an effect. This code hasn't been open-sourced yet (hopefully soon!), so it isn't doing anything at this time. Let me raise an error to make that clearer.
@NormanTUD -- added a PR that raises an error when use_batch_trials is set; it'll be live once we cut a new release :)
Let me know if you have the logs from AxClient for additional support. Thanks!
Hi,
thanks for your reply. I was on vacation and as such didn't code anything, but I am trying to collect all the logs now. Thanks for your patience; I will update this post when I have the logs.
Update #1:
First, a bit of my own debugging code:
trial_index_to_param, _ = ax_client.get_next_trials(
    max_trials=1
)

print_debug(f"Got {len(trial_index_to_param.items())} new items (m = {m}, in range(0, {calculated_max_trials})).")
These lines are only executed when there are new jobs to be generated. For further testing, max_trials= is set to 1 and the call is made in a for loop, once for each new job, instead of setting max_trials= to the number of new trials. But sometimes, I get this:
2024-03-26 11:14:13: Got 0 new items (m = 0, in range(0, 33)).
So it just returns 0 jobs.
These are the number of workers over time:
17
7
5
8
(No timestamps given there, though; it is one value per generation loop.)
It should be around ~20, so 17 is fine for a snapshot taken while the jobs are still starting, but over time it gets much lower.
The only message from Ax that seems relevant is this:
ax.models.torch.botorch_modular.acquisition:
Encountered Xs pending for some Surrogates but observed for others. Considering
these points to be pending.
I've seen the tag "fixready" and installed the latest version (via pip from GitHub). I cannot see any change in behaviour; it looks exactly like before. I am not entirely sure whether this tag implies that the fix is already in master, but if it is, it hasn't changed anything for me.
The problem seems to be that generation_node.generator_run_limit() returns 0, even though it shouldn't. I am not sure why yet, though.
Edit: I debugged it a bit more. With 30 workers in parallel, I get this, and so it returns 0:
generation_node.generator_run_limit: criterion = MaxTrials({'threshold': 30, 'only_in_statuses': None, 'not_in_statuses': [<TrialStatus.FAILED: 2>, <TrialStatus.ABANDONED: 5>], 'transition_to': 'GenerationStep_1', 'block_transition_if_unmet': True, 'block_gen_if_met': True}), this_threshold: 0
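To see which trials actually count towards that threshold, I dump the trial statuses with a small snippet like this (experiment and the trial status attribute are standard Ax objects as far as I can tell; print_debug is again my own helper):

from collections import Counter

# Count trials per status. Only FAILED and ABANDONED are excluded by the
# MaxTrials criterion above, so everything else (COMPLETED, RUNNING, ...)
# counts towards the threshold of 30.
status_counts = Counter(trial.status.name for trial in ax_client.experiment.trials.values())
print_debug(f"Trial statuses: {dict(status_counts)}")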
I changed the function to this in modelbridge/generation_node.py:
def generator_run_limit(self, supress_generation_errors: bool = True) -> int:
    """How many generator runs can this generation strategy generate right now,
    assuming each one of them becomes its own trial. Only considers
    `transition_criteria` that are TrialBasedCriterion.

    Returns:
        - the number of generator runs that can currently be produced, with -1
          meaning unlimited generator runs,
    """
    # TODO @mgarrard remove filter when legacy usecases are updated
    valid_criterion = []
    for criterion in self.transition_criteria:
        if criterion.criterion_class not in {
            "MinAsks",
            "RunIndefinitely",
        }:
            myprint(f"generator_run_limit: adding class {criterion.criterion_class} to criterion")
            valid_criterion.append(criterion)

    myprint(f"generator_run_limit: valid_criterion: {valid_criterion}")
    # TODO: @mgarrard Should we consider returning `None` if there is no limit?
    # TODO:@mgarrard Should we instead have `raise_generation_error`? The name
    # of this method doesn't suggest that it would raise errors by default, since
    # it's just finding out the limit according to the name. I know we want the
    # errors in some cases, so we could call the flag `raise_error_if_cannot_gen` or
    # something like that : )
    trial_based_gen_blocking_criteria = [
        criterion
        for criterion in valid_criterion
        if criterion.block_gen_if_met and isinstance(criterion, TrialBasedCriterion)
    ]
    """
    gen_blocking_criterion_delta_from_threshold = [
        criterion.num_till_threshold(
            experiment=self.experiment, trials_from_node=self.trials_from_node
        )
        for criterion in trial_based_gen_blocking_criteria
    ]
    """

    gen_blocking_criterion_delta_from_threshold = []

    for criterion in trial_based_gen_blocking_criteria:
        this_threshold = criterion.num_till_threshold(
            experiment=self.experiment, trials_from_node=self.trials_from_node
        )

        myprint(f"generator_run_limit: criterion = {criterion}, this_threshold: {this_threshold}")

        gen_blocking_criterion_delta_from_threshold.append(this_threshold)

    myprint(f"generator_run_limit: gen_blocking_criterion_delta_from_threshold: {gen_blocking_criterion_delta_from_threshold}")

    # Raise any necessary generation errors: for any met criterion,
    # call its `block_continued_generation_error` method The method might not
    # raise an error, depending on its implementation on given criterion, so the
    # error from the first met one that does block continued generation, will be
    # raised.
    if not supress_generation_errors:
        for criterion in trial_based_gen_blocking_criteria:
            # TODO[mgarrard]: Raise a group of all the errors, from each gen-
            # blocking transition criterion.
            if criterion.is_met(
                self.experiment, trials_from_node=self.trials_from_node
            ):
                criterion.block_continued_generation_error(
                    node_name=self.node_name,
                    model_name=self.model_to_gen_from_name,
                    experiment=self.experiment,
                    trials_from_node=self.trials_from_node,
                )
    if len(gen_blocking_criterion_delta_from_threshold) == 0:
        if not self.gen_unlimited_trials:
            logger.warning(
                "Even though this node is not flagged for generation of unlimited "
                "trials, there are no generation blocking criterion, therefore, "
                "unlimited trials will be generated."
            )
        myprint(f"generator_run_limit: returning -1 (no limit)")
        return -1
    res = min(gen_blocking_criterion_delta_from_threshold)
    myprint(f"generator_run_limit: returning res {res}")
    return res
myprint just prepends the filename to the message, so I can debug more easily.
I am not sure why some trials have failed, nor why some are abandoned, but in the end the this_threshold variable gives me 0, and that gets chosen as the number of new parameter sets to be created.
I also tried monkey patching it:
from unittest.mock import patch

def patched_generator_run_limit(*args, **kwargs):
    return 1

with patch('ax.modelbridge.generation_node.GenerationNode.generator_run_limit', new=patched_generator_run_limit):
    trial_index_to_param, _ = ax_client.get_next_trials(max_trials=1)
around the get_next_trials(max_trials=1) call, but then I get this exception:
All trials for current model have been generated, but not enough data has been
observed to fit next model. Try again when more data are available.
Adding min_trials_observed=1 to the model=Models.BOTORCH_MODULAR step in the GenerationStrategy didn't help; the error didn't go away.
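For completeness, the explicit GenerationStrategy I experimented with looks roughly like this (simplified; the Sobol initialization step and the concrete numbers are just illustrative, my actual script builds them from the command-line arguments):

from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
from ax.modelbridge.registry import Models
from ax.service.ax_client import AxClient

generation_strategy = GenerationStrategy(
    steps=[
        # Quasi-random initialization; parallelism capped at the number of workers.
        GenerationStep(
            model=Models.SOBOL,
            num_trials=args.num_parallel_jobs,
            max_parallelism=args.num_parallel_jobs,
        ),
        # Bayesian optimization for all remaining trials.
        GenerationStep(
            model=Models.BOTORCH_MODULAR,
            num_trials=-1,
            min_trials_observed=1,
            max_parallelism=args.num_parallel_jobs,
        ),
    ]
)
ax_client = AxClient(generation_strategy=generation_strategy)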
Is there anything else I can provide?
Yours sincerely,
NormanTUD
I have made a breakthrough regarding the reason why I don't get as many workers as expected!
When a job failed, I needed to do:
_trial = ax_client.get_trial(trial_index)
_trial.mark_failed()
ax_client.log_trial_failure(trial_index=trial_index)
and when it succeeded, I needed to do:
_trial = ax_client.get_trial(trial_index)
_trial.mark_completed(unsafe=True)
This way, Ax knows that the jobs are finished (or failed), and it no longer blocks the generation of new points with regard to max_parallelism.
This was, admittedly, previously unclear to me.
Now it finally works pretty much as I like it :)
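For anyone who stumbles over the same thing: this is roughly how the status handling sits in my polling loop now (simplified; jobs holds (submitit_job, trial_index) pairs, job_failed is a placeholder for however you detect a failed Slurm job, and reporting the actual objective value via ax_client.complete_trial happens elsewhere in my script):

still_running = []
for job, trial_index in jobs:
    if not job.done():
        still_running.append((job, trial_index))
        continue
    _trial = ax_client.get_trial(trial_index)
    if job_failed(job):
        # Mark the trial as failed so it no longer blocks new generations
        # with respect to max_parallelism.
        _trial.mark_failed()
        ax_client.log_trial_failure(trial_index=trial_index)
    else:
        # Same for successful jobs: make sure the status really is COMPLETED.
        _trial.mark_completed(unsafe=True)
jobs = still_running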