# Add differentiation between available and total memory and allow model download if total memory is sufficient
## Motivation
Previously, the placement system only allowed models to be placed if they fit within the currently available memory across the cluster. This was overly restrictive: if a model fits in the total memory of the cluster but not in currently available memory, it should still be allowed to launch (with appropriate warnings), since the user or OS may free up memory by closing other applications or instances.
This change enables users to launch models that require more memory than is currently available, as long as the total cluster memory is sufficient, improving the user experience when working with memory-constrained clusters.
## Changes
### Backend Changes
- **Placement Logic** (`src/exo/master/placement.py`, `src/exo/master/placement_utils.py`):
  - Added an `allow_low_memory` parameter to the `PlaceInstance` command and the placement functions
  - Modified `filter_cycles_by_memory()` to accept a `use_total_memory` flag, allowing it to filter cycles based on total memory instead of available memory
  - Updated `get_shard_assignments_for_pipeline_parallel()` to use total memory for layer distribution calculations when `use_total_memory=True`
  - Modified cycle selection to prefer cycles with more total memory when `allow_low_memory=True`
- **API Layer** (`src/exo/master/api.py`):
  - Added an `allow_low_memory` parameter to the `get_placement()` method
  - Implemented automatic retry logic: if placement fails with `allow_low_memory=False`, it is automatically retried with `allow_low_memory=True`
  - Updated `get_placement_previews()` to try strict placement first, then fall back to low-memory placement, marking previews with `is_low_memory=True` when the fallback succeeds
  - Added an `is_low_memory` field to `PlacementPreview` responses
  - Added `storage_size_megabytes` to model card responses for frontend use
- **Type Definitions** (`src/exo/shared/types/`):
  - Added `allow_low_memory: bool = False` to the `PlaceInstance` command
  - Added `allow_low_memory: bool = False` to the `PlaceInstanceParams` API request model
  - Added `is_low_memory: bool = False` to the `PlacementPreview` response model
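Taken together, the filter and retry behavior described above can be sketched roughly as follows. This is a hedged illustration, not the actual exo code: the `Node` shape, the cycles-as-lists representation, and the simplified signatures are all assumptions, and the real code in `src/exo/master/` also handles shard assignment and richer error reporting.

```python
from dataclasses import dataclass


@dataclass
class Node:
    ram_available: int  # bytes currently free on the node
    ram_total: int      # bytes installed on the node


class PlacementError(Exception):
    pass


def filter_cycles_by_memory(cycles, required_bytes, use_total_memory=False):
    """Keep only cycles whose combined memory can hold the model.

    Strict mode (use_total_memory=False) checks currently available memory;
    low-memory mode (use_total_memory=True) checks total installed memory.
    """
    key = (lambda n: n.ram_total) if use_total_memory else (lambda n: n.ram_available)
    return [c for c in cycles if sum(key(n) for n in c) >= required_bytes]


def get_placement(cycles, required_bytes):
    """API-layer behavior: try strict placement, then automatically retry
    against total memory, tagging the result with an is_low_memory flag."""
    fits = filter_cycles_by_memory(cycles, required_bytes)
    if fits:
        return fits[0], False  # strict placement succeeded: is_low_memory=False
    fits = filter_cycles_by_memory(cycles, required_bytes, use_total_memory=True)
    if not fits:
        # Both tiers failed; the real API surfaces this as an HTTP error
        raise PlacementError("model does not fit in total cluster memory")
    # In low-memory mode, prefer the cycle with the most total memory
    best = max(fits, key=lambda c: sum(n.ram_total for n in c))
    return best, True  # is_low_memory=True
```

For example, a cycle of two 24 GB nodes with only 10 GB free each would reject a 30 GB model in strict mode but accept it in low-memory mode.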
### Frontend Changes
- **Dashboard** (`dashboard/src/routes/+page.svelte`):
  - Introduced a distinction between `modelWillFit()` (fits in available memory) and `modelCanFit()` (fits in total memory)
  - Updated model sorting to prioritize models that fit in total memory, then those that fit in available memory
  - Added a warning confirmation dialog when launching models that require more than available memory
  - Updated placement API calls to include the `allow_low_memory=true` parameter when the user confirms a low-memory launch
  - Improved error messages to suggest freeing memory when placement fails
  - Added fallback UI to show a `ModelCard` for models that can fit in total memory but not in available memory
- **Model Card Component** (`dashboard/src/lib/components/ModelCard.svelte`):
  - Changed the memory check from `availableGB` to `totalGB` for determining whether a model can fit
  - Added separate derived values for `totalClusterMemory` and `availableClusterMemory`
  - Updated the `canFit` logic to allow placement if total memory is sufficient, even when the API preview indicates an error
- **Type Definitions** (`dashboard/src/lib/stores/app.svelte.ts`):
  - Added an `is_low_memory: boolean` field to the `PlacementPreview` interface
## Testing
- **New Test Suite** (`src/exo/master/tests/test_api_placement.py`):
  - `test_get_placement_retries_with_allow_low_memory_true`: verifies the automatic retry with `allow_low_memory=True` when strict placement fails
  - `test_get_placement_raises_http_exception_when_both_strict_and_low_memory_fail`: ensures proper error handling when both placement attempts fail
  - `test_get_placement_previews_sets_is_low_memory_false_when_strict_succeeds`: verifies `is_low_memory=False` when strict placement succeeds
  - `test_get_placement_previews_sets_is_low_memory_true_when_only_low_memory_succeeds`: verifies `is_low_memory=True` when only low-memory placement succeeds
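As an illustration of how such a retry test can isolate the fallback behavior with monkeypatching, here is a hedged pytest-style sketch. The real test exercises the actual API class; `get_placement`, `PlacementError`, and `fake_compute` below are stand-ins, not the project's real names.

```python
class PlacementError(Exception):
    pass


def get_placement(model, compute_placement):
    """Stand-in for the API method: strict attempt first, then retry."""
    try:
        return compute_placement(model, allow_low_memory=False), False
    except PlacementError:
        return compute_placement(model, allow_low_memory=True), True


def test_get_placement_retries_with_allow_low_memory_true():
    calls = []

    def fake_compute(model, allow_low_memory):
        # Record each attempt; fail the strict one to force the retry.
        calls.append(allow_low_memory)
        if not allow_low_memory:
            raise PlacementError("insufficient available memory")
        return "placement"

    placement, is_low_memory = get_placement("big-model", fake_compute)
    assert calls == [False, True]  # strict attempt first, then the fallback
    assert placement == "placement" and is_low_memory
```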
## Why It Works
The solution works by introducing a two-tier memory check:

- **Strict Mode** (default): uses available memory (`ram_available`) to ensure models can be placed without causing memory pressure. This is the preferred mode for normal operation.
- **Low Memory Mode** (fallback): uses total memory (`ram_total`) to allow placement when strict mode fails. This enables users to launch models that fit in total memory but require freeing up memory from other processes.
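With illustrative numbers (a 48 GB cluster, 20 GB currently available, a 30 GB model), the two tiers play out like this:

```python
# Worked example of the two-tier memory check with illustrative numbers:
# a 48 GB cluster with 20 GB currently available and a 30 GB model.
GB = 1024 ** 3
ram_total, ram_available, model_size = 48 * GB, 20 * GB, 30 * GB

strict_ok = model_size <= ram_available  # strict mode: 30 GB > 20 GB, fails
low_memory_ok = model_size <= ram_total  # fallback:    30 GB <= 48 GB, passes

assert (strict_ok, low_memory_ok) == (False, True)
# The retry therefore succeeds with is_low_memory=True, and the dashboard
# asks the user to confirm before launching.
```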
The automatic retry mechanism in the API layer provides a seamless experience: if a user attempts to launch a model and strict placement fails, the system automatically tries low-memory placement. If that succeeds, the user gets a warning dialog explaining the situation. This approach:

- **Maintains safety**: users are always warned when launching in low-memory mode
- **Improves UX**: no need to manually retry or understand the distinction between available and total memory
- **Preserves correctness**: shard assignment calculations correctly use total memory when in low-memory mode, ensuring proper layer distribution
- **Provides transparency**: the `is_low_memory` flag in previews allows the UI to clearly indicate when a placement requires freeing memory
The frontend changes complement this by:
- Clearly distinguishing between models that will fit (available memory) vs. can fit (total memory)
- Providing visual feedback (yellow text for models that can fit but won't fit)
- Requiring explicit user confirmation before launching in low-memory mode
- Offering helpful error messages that guide users to free memory when needed
## Test Plan
### Manual Testing
**Hardware:** MacBook Pro M3 Max, 48 GB

**What you did:**
- Launched a large model that fits in total cluster memory (48GB) but not in available memory (e.g., 20GB available, model requires 30GB)
- Verified that the dashboard shows the model as "can fit" (yellow indicator) but not "will fit"
- Confirmed that clicking launch shows a warning dialog explaining the low-memory situation
- Verified that after confirming, the model launches successfully
- Tested that models that don't fit in total memory are still properly rejected
- Verified that the model dropdown correctly sorts models (total-fit > available-fit > won't-fit)
- Confirmed that placement previews correctly show `is_low_memory=true` for low-memory placements
- Tested error handling when both strict and low-memory placement fail
### Automated Testing
Changes to automated tests:
- Added a comprehensive test suite in `test_api_placement.py` with 4 new test cases covering:
  - Automatic retry behavior when strict placement fails
  - Error handling when both placement modes fail
  - Correct `is_low_memory` flag setting in placement previews for both success scenarios
- All existing tests continue to pass, ensuring backward compatibility
- Tests use monkeypatching to isolate placement logic and verify retry behavior without requiring actual cluster setup
---

Thanks for the contribution! Looks like there's a lot of effort put into it. Although I haven't gone through the PR in detail yet, it seems like a good start to #976, where we would like to use a different metric than available memory.
There is a memory pressure metric in macOS that might be more accurate for placement and safer than total memory available. Perhaps you'd be interested in looking into the issue.
e.g. The mode that uses as much of the memory as possible -> all the remaining memory according to memory pressure * total memory
I actually think a "just try anyway" mode is a good feature to have, outside of memory pressure. Forcing OS components into swap will not be good for the stability of your system, but who are we to say that you shouldn't try!
It should never be default behaviour, obviously. Maybe a setting.
Thanks for the expedient feedback! Let me know whether I should extend/change the current functionality to solve #976 and/or add some kind of setting to enable the behavior. If so, please concretize in what way these changes should be made so that they align with the overall vision.