
Reusing `RunSettings` leads to unexpected `DBObject` placement

Open. MattToast opened this issue 2 years ago • 2 comments

Description

SmartSim stores the information needed to colocate a DB alongside a Model, including which DB objects are attached to that model, in a dictionary: run_settings.colocated_db_settings. When a user calls one of the model.colocate_db_* methods, shared references to model._db_models and model._db_scripts are stored in that dictionary, so that when the user later calls one of the model.add_db* methods, the new DB object also appears in the run_settings.colocated_db_settings dict.

This works perfectly well when a 1-to-1 mapping of models to run settings is assumed. However, that mapping is not enforced, and sharing one run settings object across several models can lead to bizarre side effects.
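The aliasing described above can be reproduced without SmartSim at all. The sketch below uses simplified, hypothetical stand-ins for `RunSettings` and `Model` (the method name `colocate_db` is an assumption standing in for the `colocate_db_*` family); it is not SmartSim's implementation, only a model of the shared-reference pattern. Because each `colocate_db()` call overwrites the shared dict entry with a reference to the calling model's own list, the last model to colocate "wins":

```python
# Hypothetical, simplified stand-ins for SmartSim's RunSettings and Model,
# written only to illustrate the shared-reference (aliasing) problem.


class RunSettings:
    def __init__(self):
        # Plays the role of run_settings.colocated_db_settings
        self.colocated_db_settings = {}


class Model:
    def __init__(self, name, run_settings):
        self.name = name
        self.run_settings = run_settings  # shared, not copied
        self._db_models = []

    def colocate_db(self):
        # Store a *reference* to this model's list, not a copy. If another
        # model shares this RunSettings and calls colocate_db() later, the
        # entry is overwritten with that model's list instead.
        self.run_settings.colocated_db_settings["db_models"] = self._db_models

    def add_ml_model(self, name):
        # Visible through the dict only while the dict still points at
        # *this* model's list.
        self._db_models.append(name)


rs = RunSettings()
model_a = Model("my-model-a", rs)
model_a.colocate_db()
model_b = Model("my-model-b", rs)
model_b.colocate_db()  # rs now aliases model_b._db_models

model_b.add_ml_model("ml-1")
print(model_a.run_settings.colocated_db_settings["db_models"], model_a._db_models)
# ['ml-1'] []
model_a.add_ml_model("ml-2")
print(model_a.run_settings.colocated_db_settings["db_models"], model_a._db_models)
# ['ml-1'] ['ml-2']
```

The printed values match the output reported for the real reproducer below.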

How to reproduce

Consider the following script:

```python
from smartsim import Experiment

exp = Experiment("my-exp", launcher='local')
rs = exp.create_run_settings('echo', ['hello', 'world'], path='b')

model_a = exp.create_model('my-model-a', run_settings=rs)
model_a.colocate_db_tcp()

model_b = exp.create_model('my-model-b', run_settings=rs)
model_b.colocate_db_tcp()

# place a DBModel on model_b with add_ml_model
model_b.add_ml_model("ml-1", "TORCH", model_path="/tmp/yo.txt")

# ...but it also shows up in model_a's run settings, which leaves
# model_a's state out of sync
print([m.name for m in model_a.run_settings.colocated_db_settings["db_models"]])
print([m.name for m in model_a._db_models])
print("=" * 40)

# Furthermore, model_a can no longer add its own ML models to its run settings
model_a.add_ml_model("ml-2", "TORCH", model_path="/tmp/yo.txt")
print([m.name for m in model_a.run_settings.colocated_db_settings["db_models"]])
print([m.name for m in model_a._db_models])
```

The produced output is:

```
['ml-1']
[]
========================================
['ml-1']
['ml-2']
```

Expected behavior

The state of model_a.run_settings.colocated_db_settings["db_models"] should never be out of sync with model_a._db_models. Furthermore, adding or removing ML models on a separate model (in this case model_b) should not affect the state of other models (model_a).

The expected output for the previously mentioned script should be:

```
[]
[]
========================================
['ml-2']
['ml-2']
```
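One possible way to get the expected behavior, sketched with the same hypothetical stand-in classes as before (not SmartSim's API, and not necessarily the fix the maintainers chose), is for each model to take a private deep copy of the run settings it is given, so no two models can ever share a `colocated_db_settings` dict:

```python
import copy


class RunSettings:
    def __init__(self):
        # Plays the role of run_settings.colocated_db_settings
        self.colocated_db_settings = {}


class Model:
    def __init__(self, name, run_settings):
        self.name = name
        # Hypothetical fix: each model owns a private copy of the run
        # settings, enforcing the 1-to-1 model/run-settings mapping.
        self.run_settings = copy.deepcopy(run_settings)
        self._db_models = []

    def colocate_db(self):
        # The reference now lives in this model's own copy only.
        self.run_settings.colocated_db_settings["db_models"] = self._db_models

    def add_ml_model(self, name):
        self._db_models.append(name)


rs = RunSettings()
model_a = Model("my-model-a", rs)
model_a.colocate_db()
model_b = Model("my-model-b", rs)
model_b.colocate_db()

model_b.add_ml_model("ml-1")
print(model_a.run_settings.colocated_db_settings["db_models"], model_a._db_models)
# [] []
model_a.add_ml_model("ml-2")
print(model_a.run_settings.colocated_db_settings["db_models"], model_a._db_models)
# ['ml-2'] ['ml-2']
```

The trade-off of copying is that later changes a user makes to the original `rs` object no longer propagate to models already created from it, which may or may not be what users expect.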

Justification for fix

The way that exp.create_ensemble("ens", run_settings=rs, replicas=123) works is actually not dissimilar to the example script provided. Luckily, most users colocating a DB across the models in an ensemble do want the ML model present in all of the members. That said, we do allow users to reach into an ensemble's models and set attributes directly on individual members.

System

All

MattToast, Oct 18 '23 17:10

@MattToast Per group discussion, please post an issue regarding moving the addition of ML models to the colocated DB and orchestrator classes.

mellis13, Nov 03 '23 17:11

An issue regarding moving the addition of ML models to the colocated DB and orchestrator classes has been posted at https://github.com/CrayLabs/SmartSim/issues/410

MattToast, Nov 07 '23 19:11