
Reference implementations of MLPerf™ training benchmarks

128 training issues

Why do hardware configurations only consider the number of CPU cores and the number of accelerators, without taking the server's memory size into account?

# MLCube for Bert MLCube™ GitHub [repository](https://github.com/mlcommons/mlcube). MLCube™ [wiki](https://mlcommons.github.io/mlcube/). ## Project setup An important requirement is that you must have Docker installed. ```bash # Create Python environment and install MLCube...

1. The README points to using an eval.py, which is missing from the llama2 scripts folder. 2. Instructions to run this reference implementation on multiple nodes would be helpful for...

The guide link is [image_segmentation/pytorch](https://github.com/mlcommons/training/tree/master/image_segmentation/pytorch). When I try to run the container, I get the error below, saying the runtime nvidia does not exist. Could you please shed some light? ```...
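For what it's worth, the "runtime nvidia does not exist" class of errors usually means the NVIDIA Container Toolkit is not installed, or Docker has not been configured to register the `nvidia` runtime. A hedged sketch of a typical fix, assuming a Debian/Ubuntu host with NVIDIA's apt repository already configured (package names and the CUDA image tag are illustrative):

```shell
# Install the NVIDIA Container Toolkit.
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the "nvidia" runtime in /etc/docker/daemon.json and restart Docker.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Verify: this should list your GPUs from inside a container.
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```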

# Benchmark execution with MLCube ### Project setup ```bash # Create Python environment and install MLCube Docker runner virtualenv -p python3 ./env && source ./env/bin/activate && pip install pip==24.0 &&...
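For context, the truncated setup line above generally continues by installing the MLCube Docker runner into the fresh virtualenv. A minimal sketch, assuming the standard `mlcube` and `mlcube-docker` PyPI packages (the exact pinned versions are in the benchmark's README; check there before copying):

```shell
# Create and activate an isolated Python environment.
virtualenv -p python3 ./env && source ./env/bin/activate

# Pin pip as the snippet above does, then install the MLCube Docker runner.
pip install pip==24.0
pip install mlcube mlcube-docker   # assumed package names; verify against the README

# Sanity check: the mlcube CLI should now be on PATH.
mlcube --help
```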

object_detection
MLCube

Update the Docker Compose file to use the reference implementation in the training repo instead of gltorch.

![image](https://github.com/mlcommons/training/assets/50818159/f9fb28fb-4dc8-4fa7-b82b-756f763fc1f7)

After building the Docker image provided in stable_diffusion, the first data download command fails as follows: ``` root@0d839dc3dd25:/workspace# scripts/datasets/laion400m-filtered-download-moments.sh --output-dir /datasets/laion-400m/webdataset-moments-filtered scripts/datasets/laion400m-filtered-download-moments.sh: line 18: rclone: command not found scripts/datasets/laion400m-filtered-download-moments.sh: line...
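Likely the `rclone: command not found` failure just means rclone is not baked into that image; installing it inside the container before rerunning the script should unblock the download. A hedged sketch using rclone's official install script (assumes `curl` and `unzip` are available in the image and you are running as root):

```shell
# Install rclone via its official install script.
curl -fsSL https://rclone.org/install.sh | bash

# Confirm the binary is available, then rerun the failing download script.
rclone version
scripts/datasets/laion400m-filtered-download-moments.sh --output-dir /datasets/laion-400m/webdataset-moments-filtered
```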

After I build the container and enter it, preprocessing the data fails with a data attribute error. ``` root@ed1902ed9916:/workspace/rnnt# bash scripts/preprocess_librispeech.sh Traceback (most recent call last): File "./utils/convert_librispeech.py", line...

Hi team, I have run the **default training script** with the following changes based on the results table: **1. GLOBAL_BATCH_SIZE=16384 2. WORLD_SIZE=4 (4 A100 40GB GPUs)** P.S. I did...