[Spike] Disambiguate DB Object Placement by Shifting API

Open MattToast opened this issue 2 years ago • 0 comments

Description

With the merge of multidatabase support for SmartSim (#342), users have the ability to create and launch multiple Redis instances within a single SmartSim driver script. This is important, as it allows users to utilize different instances for different purposes. For example, a user may want to launch a colocated instance to use as an ML inference engine, as well as a clustered instance to share information amongst models/members of an ensemble/etc.

Unfortunately, the syntax of SmartSim was designed with the assumption that there would only ever be a single DB instance. Removing this assumption has introduced ambiguity into the SmartSim API as to where specific DB objects will be placed at run time. For instance, a user wanting to achieve the aforementioned example may write a driver script along the lines of

from smartsim import Experiment
import smartsim.status

exp = Experiment("my-exp", launcher="slurm")

# Instance a model
rs = exp.create_run_settings("/path/to/my/app/that/uses/a/torch/ml/model")
model = exp.create_model("my-model", run_settings=rs)

# Instance a colocated DB
model.colocate_db_uds(db_identifier="COLO",
                      db_cpus=3)

# Instance a clustered DB
db = exp.create_database(port=1234,
                         db_nodes=3,
                         interface="some-interface")

# Add an ML-model to the previously instanced model
model.add_ml_model("my-torch-model",
                   "TORCH",
                   model=b"some-torch-model-byte-str",
                   device="GPU",
                   inputs=["input_key_a", "input_key_b"],
                   outputs=["output_key_c"])

# Start the experiment 
try:
    # If SmartSim only ever launched with a single DB it would be 
    # obvious where the user intended the ML model should be
    # placed. Unfortunately the user launched with 2 DBs: one
    # colo and one clustered. Therefore it is not immediately clear
    # on which DB the user would like the ML model to be placed. 
    exp.start(db, model, block=True)
finally:
    if exp.get_status(db)[0] == smartsim.status.STATUS_RUNNING:
        exp.stop(db)

Because of this, SmartSim needs to assume that the user wanted to place the model on both DBs. This is good in that it does not break existing SmartSim driver scripts, but is problematic as it (1) requires that a user suffer the overhead of a potentially costly "put" operation, especially for large, non-trivial ML models, and (2) requires that the standalone/clustered database also store any ML models that a user may only intend to put on a single colocated Redis instance, which can very quickly take up excess space and likely lead to unexpected "out of memory" errors.

Proposed API Change to Disambiguate DB Object Placement

Rather than attaching ML models to a SmartSimEntity/EntityList/Model/Ensemble/etc., users should instead specify on which database they would like the model to be available. After all, this is where the model "lives" after launch.

This will require that a user is able to hold some in-memory representation of a Redis instance that will be created post-launch. This is actually already available for standard and clustered Redis instances (it is the Orchestrator class). Something similar will need to be created for colocated DB instances.
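As a rough illustration of what such an in-memory representation could hold, here is a minimal standalone sketch. Everything in it is an assumption made for illustration: `MLModelSpec`, the `ColocatedDatabase` fields, and the duplicate-name check are hypothetical and are not existing SmartSim API. It only shows the idea that the database object, rather than the model, records which ML models should be loaded at launch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class MLModelSpec:
    """Hypothetical record describing one ML model to load at launch."""

    name: str
    backend: str
    model: bytes
    device: str = "CPU"
    inputs: Tuple[str, ...] = ()
    outputs: Tuple[str, ...] = ()


@dataclass
class ColocatedDatabase:
    """Hypothetical in-memory stand-in for a not-yet-launched colocated DB."""

    db_identifier: str
    db_cpus: int = 1
    ml_models: Dict[str, MLModelSpec] = field(default_factory=dict)

    def set_cpus(self, cpus: int) -> None:
        # The object stays mutable until launch.
        if cpus < 1:
            raise ValueError("db_cpus must be at least 1")
        self.db_cpus = cpus

    def add_ml_model(
        self,
        name: str,
        backend: str,
        model: bytes,
        device: str = "CPU",
        inputs: Optional[Sequence[str]] = None,
        outputs: Optional[Sequence[str]] = None,
    ) -> None:
        # Record the model on *this* database only; nothing is sent anywhere.
        if name in self.ml_models:
            raise ValueError(f"ML model {name!r} already added to {self.db_identifier!r}")
        self.ml_models[name] = MLModelSpec(
            name, backend, model, device, tuple(inputs or ()), tuple(outputs or ())
        )


colo_db = ColocatedDatabase(db_identifier="COLO", db_cpus=3)
colo_db.add_ml_model(
    "my-torch-model-a",
    "TORCH",
    model=b"some-torch-model-byte-str",
    device="GPU",
    inputs=["input_key_a", "input_key_b"],
    outputs=["output_key_c"],
)
```

A launcher could then iterate over `colo_db.ml_models` to decide what to load into that one instance, leaving any other database untouched.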

A roughly analogous driver script to the previously supplied one, using this hypothetical API, might look something like:

from smartsim import Experiment
import smartsim.status

exp = Experiment("my-exp", launcher="slurm")

# Instance a model
rs = exp.create_run_settings("/path/to/my/app/that/uses/a/torch/ml/model")
model = exp.create_model("my-model", run_settings=rs)

# Instance a colocated DB
# NOTE: that this now returns an object
colo_db = model.colocate_db_uds(db_identifier="COLO",
                                db_cpus=3)
print(type(colo_db))  # prints: <class 'smartsim.database.UDSColocatedDatabase'>
                      # This would presumably be a subclass of a `ColocatedDatabase` class
assert model._colocated_db is colo_db  # models would hold reference to the created database
assert model.colocated_db is colo_db  # users could reference the database through a
                                      # read-only ``property`` instance
colo_db.set_cpus(5)  # Similar to `Orchestrator`s and colo databases today,
                     # this class could remain mutable

# TODO: Need to decide what happens when a user tries to
#       colocate a model that has already been colocated.
#       The three obvious strategies are:
#         1) Previous colocated DB is overwritten and forgotten
#         2) An `SSModelAlreadyColocated` error is raised
#         3) `Model()._colocated_db` is actually a container
#            of `ColocatedDatabase` instances, all of which are launched
#            when the model is launched
model.colocate_db_uds(db_identifier="NEW_COLO")

if False:
    # In theory a user should be able to "un-colocate" a model
    # by simply deleting the colocated db
    del model.colocated_db 
    assert model.colocated_db is None  # <-- and the field should be nullable

# Instance a clustered DB
# NOTE: This is unchanged from the original API
db = exp.create_database(port=1234,
                         db_nodes=3,
                         interface="some-interface")

# Add an ML-model to the previously instanced model
# Note that it is now unambiguous which database a user wants to place
# the model on, as it is described directly in the driver script
model.colocated_db.add_ml_model("my-torch-model-a",
                                "TORCH",
                                model=b"some-torch-model-byte-str",
                                device="GPU",
                                inputs=["input_key_a", "input_key_b"],
                                outputs=["output_key_c"])
# This method should have a near-identical signature to the previous call,
# which would in theory be near identical to the current
# `Model.add_ml_model`.
db.add_ml_model("my-torch-model-b",
                "TORCH",
                model=b"some-other-torch-model-byte-str",
                device="GPU",
                inputs=["input_key_a", "input_key_b"],
                outputs=["output_key_c"])

assert not hasattr(model, "add_ml_model")  # This method would no longer
                                           # exist on the model itself

# Start the experiment 
try:
    # SmartSim now has no ambiguity and correctly knows to place
    # `my-torch-model-a` on the colo DB and `my-torch-model-b` on the
    # clustered DB
    exp.start(db, model, block=True)
finally:
    if exp.get_status(db)[0] == smartsim.status.STATUS_RUNNING:
        exp.stop(db)
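The read-only, deletable `colocated_db` attribute and the second TODO strategy (raising `SSModelAlreadyColocated`) can be sketched in plain Python. This is a toy stand-in under stated assumptions, not SmartSim code: the `Model` and `UDSColocatedDatabase` classes below are hypothetical placeholders that model only the colocation bookkeeping, and choosing strategy 2 here is arbitrary, since the TODO above leaves that decision open.

```python
from typing import Optional


class SSModelAlreadyColocated(Exception):
    """Hypothetical error raised when a model is colocated twice (strategy 2)."""


class UDSColocatedDatabase:
    """Hypothetical placeholder for the object returned by `colocate_db_uds`."""

    def __init__(self, db_identifier: str, db_cpus: int = 1) -> None:
        self.db_identifier = db_identifier
        self.db_cpus = db_cpus


class Model:
    """Toy stand-in for a SmartSim model; only colocation logic is shown."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._colocated_db: Optional[UDSColocatedDatabase] = None

    @property
    def colocated_db(self) -> Optional[UDSColocatedDatabase]:
        # No setter is defined, so direct assignment raises AttributeError.
        return self._colocated_db

    @colocated_db.deleter
    def colocated_db(self) -> None:
        # `del model.colocated_db` "un-colocates" the model.
        self._colocated_db = None

    def colocate_db_uds(self, db_identifier: str, db_cpus: int = 1) -> UDSColocatedDatabase:
        if self._colocated_db is not None:
            raise SSModelAlreadyColocated(
                f"{self.name!r} is already colocated with "
                f"{self._colocated_db.db_identifier!r}"
            )
        self._colocated_db = UDSColocatedDatabase(db_identifier, db_cpus)
        return self._colocated_db


model = Model("my-model")
colo_db = model.colocate_db_uds(db_identifier="COLO", db_cpus=3)
assert model.colocated_db is colo_db

del model.colocated_db             # un-colocate
assert model.colocated_db is None  # the field is nullable
model.colocate_db_uds(db_identifier="NEW_COLO")  # allowed again after deletion
```

Under strategy 1 the guard in `colocate_db_uds` would simply be dropped, and under strategy 3 `_colocated_db` would become a container of database objects instead of a single optional reference.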

Justification

Making this change eliminates the previously mentioned problems that arise from the ambiguity of the API as it stands today. It also makes it much easier for new users to grok exactly where their ML models will be placed as they begin writing their own driver scripts.

Acceptance Criteria

[ ] Further explore the idea of attaching DB Objects directly to DB instances in code. This should extend as well to DBScripts.
[ ] Figure out how to extend this idea to more complex SmartSim entities (e.g. SmartSim entity containers such as Ensembles).
[ ] Create a design doc highlighting what changes will need to be made throughout SmartSim in order to accommodate such an API shift.
[ ] Present the design doc to the SmartSim Team for refinement. Allow refinement until consensus is reached on a possible implementation strategy and/or the proposal is rejected.
[ ] Update this ticket with the decision of the SmartSim team regarding this proposal.
[ ] If accepted and an implementation strategy is agreed upon, create tickets for the completion of the implementation strategy.
[ ] If rejected, add an explanation to this ticket as to why.

Nov 03 '23 22:11 MattToast