# Add differentiation between available and total memory and allow model download if total memory is sufficient
## Motivation
Previously, the placement system only allowed models to be placed if they fit within the currently available memory across the cluster. This was overly restrictive: if a model fits in the total memory of the cluster but not in currently available memory, it should still be allowed to launch (with appropriate warnings), since the user or OS may free up memory by closing other applications or instances.
This change enables users to launch models that require more memory than is currently available, as long as the total cluster memory is sufficient, improving the user experience when working with memory-constrained clusters.
## Changes
### Backend Changes
- **Placement Logic** (`src/exo/master/placement.py`, `src/exo/master/placement_utils.py`):
  - Added an `allow_low_memory` parameter to the `PlaceInstance` command and the placement functions
  - Modified `filter_cycles_by_memory()` to accept a `use_total_memory` flag, allowing it to filter cycles based on total memory instead of available memory
  - Updated `get_shard_assignments_for_pipeline_parallel()` to use total memory for layer distribution calculations when `use_total_memory=True`
  - Modified cycle selection to prefer cycles with more total memory when `allow_low_memory=True`
- **API Layer** (`src/exo/master/api.py`):
  - Added an `allow_low_memory` parameter to the `get_placement()` method
  - Implemented automatic retry logic: if placement fails with `allow_low_memory=False`, it is automatically retried with `allow_low_memory=True`
  - Updated `get_placement_previews()` to try strict placement first, then fall back to low-memory placement, marking previews with `is_low_memory=True` when the fallback succeeds
  - Added an `is_low_memory` field to `PlacementPreview` responses
  - Added `storage_size_megabytes` to model card responses for frontend use
- **Type Definitions** (`src/exo/shared/types/`):
  - Added `allow_low_memory: bool = False` to the `PlaceInstance` command
  - Added `allow_low_memory: bool = False` to the `PlaceInstanceParams` API request model
  - Added `is_low_memory: bool = False` to the `PlacementPreview` response model
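Taken together, the filter and retry behavior described above can be sketched roughly as follows. This is a hedged illustration, not the actual exo code: the `Node` shape, the cycles-as-lists representation, and the simplified signatures are all assumptions, and the real code in `src/exo/master/` also handles shard assignment and richer error reporting.

```python
from dataclasses import dataclass


@dataclass
class Node:
    ram_available: int  # bytes currently free on the node
    ram_total: int      # bytes installed on the node


class PlacementError(Exception):
    pass


def filter_cycles_by_memory(cycles, required_bytes, use_total_memory=False):
    """Keep only cycles whose combined memory can hold the model.

    Strict mode (use_total_memory=False) checks currently available memory;
    low-memory mode (use_total_memory=True) checks total installed memory.
    """
    key = (lambda n: n.ram_total) if use_total_memory else (lambda n: n.ram_available)
    return [c for c in cycles if sum(key(n) for n in c) >= required_bytes]


def get_placement(cycles, required_bytes):
    """API-layer behavior: try strict placement, then automatically retry
    against total memory, tagging the result with an is_low_memory flag."""
    fits = filter_cycles_by_memory(cycles, required_bytes)
    if fits:
        return fits[0], False  # strict placement succeeded: is_low_memory=False
    fits = filter_cycles_by_memory(cycles, required_bytes, use_total_memory=True)
    if not fits:
        # Both tiers failed; the real API surfaces this as an HTTP error
        raise PlacementError("model does not fit in total cluster memory")
    # In low-memory mode, prefer the cycle with the most total memory
    best = max(fits, key=lambda c: sum(n.ram_total for n in c))
    return best, True  # is_low_memory=True
```

For example, a cycle of two 24 GB nodes with only 10 GB free each would reject a 30 GB model in strict mode but accept it in low-memory mode.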
### Frontend Changes
- **Dashboard** (`dashboard/src/routes/+page.svelte`):
  - Introduced a distinction between `modelWillFit()` (fits in available memory) and `modelCanFit()` (fits in total memory)
  - Updated model sorting to prioritize models that fit in total memory, then those that fit in available memory
  - Added a warning confirmation dialog when launching models that require more than available memory
  - Updated placement API calls to include the `allow_low_memory=true` parameter when the user confirms a low-memory launch
  - Improved error messages to suggest freeing memory when placement fails
  - Added fallback UI to show a `ModelCard` for models that can fit in total memory but not in available memory
- **Model Card Component** (`dashboard/src/lib/components/ModelCard.svelte`):
  - Changed the memory check from `availableGB` to `totalGB` for determining whether a model can fit
  - Added separate derived values for `totalClusterMemory` and `availableClusterMemory`
  - Updated the `canFit` logic to allow placement if total memory is sufficient, even when the API preview indicates an error
- **Type Definitions** (`dashboard/src/lib/stores/app.svelte.ts`):
  - Added an `is_low_memory: boolean` field to the `PlacementPreview` interface
## Testing
- **New Test Suite** (`src/exo/master/tests/test_api_placement.py`):
  - `test_get_placement_retries_with_allow_low_memory_true`: verifies the automatic retry with `allow_low_memory=True` when strict placement fails
  - `test_get_placement_raises_http_exception_when_both_strict_and_low_memory_fail`: ensures proper error handling when both placement attempts fail
  - `test_get_placement_previews_sets_is_low_memory_false_when_strict_succeeds`: verifies `is_low_memory=False` when strict placement succeeds
  - `test_get_placement_previews_sets_is_low_memory_true_when_only_low_memory_succeeds`: verifies `is_low_memory=True` when only low-memory placement succeeds
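As an illustration of how such a retry test can isolate the fallback behavior with monkeypatching, here is a hedged pytest-style sketch. The real test exercises the actual API class; `get_placement`, `PlacementError`, and `fake_compute` below are stand-ins, not the project's real names.

```python
class PlacementError(Exception):
    pass


def get_placement(model, compute_placement):
    """Stand-in for the API method: strict attempt first, then retry."""
    try:
        return compute_placement(model, allow_low_memory=False), False
    except PlacementError:
        return compute_placement(model, allow_low_memory=True), True


def test_get_placement_retries_with_allow_low_memory_true():
    calls = []

    def fake_compute(model, allow_low_memory):
        # Record each attempt; fail the strict one to force the retry.
        calls.append(allow_low_memory)
        if not allow_low_memory:
            raise PlacementError("insufficient available memory")
        return "placement"

    placement, is_low_memory = get_placement("big-model", fake_compute)
    assert calls == [False, True]  # strict attempt first, then the fallback
    assert placement == "placement" and is_low_memory
```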
## Why It Works
The solution works by introducing a two-tier memory check:

- **Strict Mode** (default): uses available memory (`ram_available`) to ensure models can be placed without causing memory pressure. This is the preferred mode for normal operation.
- **Low Memory Mode** (fallback): uses total memory (`ram_total`) to allow placement when strict mode fails. This enables users to launch models that fit in total memory but require freeing up memory from other processes.
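With illustrative numbers (a 48 GB cluster, 20 GB currently available, a 30 GB model), the two tiers play out like this:

```python
# Worked example of the two-tier memory check with illustrative numbers:
# a 48 GB cluster with 20 GB currently available and a 30 GB model.
GB = 1024 ** 3
ram_total, ram_available, model_size = 48 * GB, 20 * GB, 30 * GB

strict_ok = model_size <= ram_available  # strict mode: 30 GB > 20 GB, fails
low_memory_ok = model_size <= ram_total  # fallback:    30 GB <= 48 GB, passes

assert (strict_ok, low_memory_ok) == (False, True)
# The retry therefore succeeds with is_low_memory=True, and the dashboard
# asks the user to confirm before launching.
```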
The automatic retry mechanism in the API layer provides a seamless experience: if a user attempts to launch a model and strict placement fails, the system automatically tries low-memory placement. If that succeeds, the user gets a warning dialog explaining the situation. This approach:

- **Maintains safety**: users are always warned when launching in low-memory mode
- **Improves UX**: no need to manually retry or understand the distinction between available and total memory
- **Preserves correctness**: shard assignment calculations correctly use total memory when in low-memory mode, ensuring proper layer distribution
- **Provides transparency**: the `is_low_memory` flag in previews allows the UI to clearly indicate when a placement requires freeing memory
The frontend changes complement this by:
- Clearly distinguishing between models that will fit (available memory) vs. can fit (total memory)
- Providing visual feedback (yellow text for models that can fit but won't fit)
- Requiring explicit user confirmation before launching in low-memory mode
- Offering helpful error messages that guide users to free memory when needed
## Test Plan
### Manual Testing
**Hardware:** MacBook Pro M3 Max, 48 GB

**What you did:**
- Launched a large model that fits in total cluster memory (48GB) but not in available memory (e.g., 20GB available, model requires 30GB)
- Verified that the dashboard shows the model as "can fit" (yellow indicator) but not "will fit"
- Confirmed that clicking launch shows a warning dialog explaining the low-memory situation
- Verified that after confirming, the model launches successfully
- Tested that models that don't fit in total memory are still properly rejected
- Verified that the model dropdown correctly sorts models (total-fit > available-fit > won't-fit)
- Confirmed that placement previews correctly show `is_low_memory=true` for low-memory placements
- Tested error handling when both strict and low-memory placement fail
### Automated Testing
Changes to automated tests:
- Added a comprehensive test suite in `test_api_placement.py` with 4 new test cases covering:
  - Automatic retry behavior when strict placement fails
  - Error handling when both placement modes fail
  - Correct `is_low_memory` flag setting in placement previews for both success scenarios
- All existing tests continue to pass, ensuring backward compatibility
- Tests use monkeypatching to isolate placement logic and verify retry behavior without requiring actual cluster setup
---

Thanks for the contribution! Looks like there's a lot of effort put into it. Although I haven't gone through the PR in detail yet, it seems like a good start to #976, where we would like to use a different metric than available memory.
There is a memory pressure metric in macOS that might be more accurate for placement and safer than total memory available. Perhaps you'd be interested in looking into the issue.
e.g. The mode that uses as much of the memory as possible -> all the remaining memory according to memory pressure * total memory
I actually think a "just try anyway" mode is a good feature to have, outside of memory pressure. Forcing OS components into swap will not be good for the stability of your system, but who are we to say that you shouldn't try!
It should never be default behaviour, obviously. Maybe a setting.
Thanks for the expedient feedback! Let me know whether I should extend/change the current functionality to solve #976 and/or add some kind of setting to enable the behavior. If so, please concretize in what way these changes should be made so that they align with the overall vision.