exo
feat: Add Support for qwen3_moe Architecture, e.g. Qwen 3 Coder 30B
This commit adds support for the qwen3_moe architecture, exemplified by Qwen 3 Coder 30B, and includes a fix for handling bfloat16 data types in the MLX inference engine.
Changes made:
- Added Qwen 3 Coder 30B model configuration to models.py with proper repository mapping
- Added pretty name for Qwen 3 Coder 30B in the model naming system
- Fixed a RuntimeError related to bfloat16 data conversion in sharded_inference_engine.py by implementing a dedicated numpy conversion function that properly handles dtype casting
- Updated MLX dependencies to use version ranges instead of fixed versions for better compatibility
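The model-registration change can be sketched roughly as follows; the dictionary shape, key names, layer count, and repository path below are all assumptions for illustration, not the actual exo configuration:

```python
# Hypothetical sketch of the kind of entry added to models.py.
# The keys, layer count, and repo id are assumptions, not exo's real values.
model_cards = {
    "qwen3-coder-30b": {
        "layers": 48,  # assumed transformer layer count for sharding
        "repo": {
            # maps inference engine class name -> model repository (assumed path)
            "MLXDynamicShardInferenceEngine": "mlx-community/Qwen3-Coder-30B",
        },
    },
}

# A pretty-name table of the kind mentioned above (names assumed):
pretty_name = {
    "qwen3-coder-30b": "Qwen 3 Coder 30B",
}
```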
The bfloat16 fix addresses an issue where item-size mismatches between bfloat16 arrays and their numpy targets caused a RuntimeError during inference with Qwen3-Coder models. The fix ensures correct dtype handling while preserving zero-copy conversion for dtypes that numpy supports natively.
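The exact helper in sharded_inference_engine.py isn't shown here, but the underlying dtype problem can be sketched: numpy has no native bfloat16, so the 2-byte values must be widened to float32 before numpy can work with them. Since bfloat16 is just the upper 16 bits of an IEEE-754 float32, widening is a shift into the high half of a uint32 followed by a reinterpreting view (function and variable names below are illustrative, not the ones in the commit):

```python
import numpy as np

def bf16_bits_to_float32(bits: np.ndarray) -> np.ndarray:
    """Expand raw bfloat16 bit patterns (stored as uint16) to float32.

    bfloat16 keeps the sign, exponent, and top 7 mantissa bits of a
    float32, so placing the 16 bits in the high half of a uint32 and
    reinterpreting the buffer recovers the value exactly.
    """
    assert bits.dtype == np.uint16
    return (bits.astype(np.uint32) << 16).view(np.float32)

# Round-trip with values exactly representable in bfloat16:
# truncate float32 -> bf16 bit pattern -> widen back to float32.
x = np.array([1.0, -2.5, 3.140625], dtype=np.float32)
bf16_bits = (x.view(np.uint32) >> 16).astype(np.uint16)
y = bf16_bits_to_float32(bf16_bits)
```

For dtypes numpy supports directly (float16/32, int types), no such widening pass is needed and the conversion can stay zero-copy, which is the trade-off the fix preserves.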