training
Reference implementations of MLPerf™ training benchmarks
Why do hardware configurations only consider the number of CPU cores and the number of accelerators, without taking into account the size of the server's memory?
# MLCube for Bert

MLCube™ GitHub [repository](https://github.com/mlcommons/mlcube). MLCube™ [wiki](https://mlcommons.github.io/mlcube/).

## Project setup

An important requirement is that you must have Docker installed.

```bash
# Create Python environment and install MLCube...
```
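The preview cuts off the setup commands. A minimal sketch of the usual MLCube environment setup, assuming the standard `mlcube` and `mlcube-docker` packages (the same setup pattern appears in the MLCube execution issue further down this page):

```bash
# Create a Python environment and install the MLCube Docker runner
# (assumed package set; check the benchmark's README for exact versions)
virtualenv -p python3 ./env
source ./env/bin/activate
pip install mlcube mlcube-docker
```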
1. The README points to using an `eval.py`, which is missing from the llama2 scripts folder.
2. Instructions to run this reference implementation on multiple nodes would be helpful for...
The guide link is [image_segmentation/pytorch](https://github.com/mlcommons/training/tree/master/image_segmentation/pytorch). When I try to run the container, I get the error below saying the `nvidia` runtime does not exist. Could you please shed some light?

```
...
```
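This error usually means Docker does not have the NVIDIA runtime registered. A sketch of the common fix, assuming an Ubuntu host with the NVIDIA driver already installed and the NVIDIA Container Toolkit packages available:

```bash
# Register the NVIDIA runtime with Docker
# (assumes NVIDIA Container Toolkit is installable on this host)
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

On recent Docker versions, `docker run --gpus all` can also expose GPUs without relying on the named `nvidia` runtime.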
# Benchmark execution with MLCube

### Project setup

```bash
# Create Python environment and install MLCube Docker runner
virtualenv -p python3 ./env && source ./env/bin/activate && pip install pip==24.0 &&...
```
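Once the environment is set up, a benchmark task can be launched through the MLCube Docker runner. A sketch of typical usage; the task name here is a placeholder, since the available tasks vary per benchmark and are defined in its `mlcube.yaml`:

```bash
# Run a benchmark task via the Docker runner
# (--task value is a placeholder; see the benchmark's mlcube.yaml)
mlcube run --mlcube=. --platform=docker --task=download_data
```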
Update the Docker Compose file to use the reference implementation in the training repo instead of gltorch.

After building the Docker image provided in stable_diffusion, the first data download command fails as follows:

```
root@0d839dc3dd25:/workspace# scripts/datasets/laion400m-filtered-download-moments.sh --output-dir /datasets/laion-400m/webdataset-moments-filtered
scripts/datasets/laion400m-filtered-download-moments.sh: line 18: rclone: command not found
scripts/datasets/laion400m-filtered-download-moments.sh: line...
```
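`rclone: command not found` means the rclone binary is missing from the image. A sketch of one way to add it inside the running container, using rclone's official install script (assumes root and network access; rebuilding the image with rclone baked in is the more durable fix):

```bash
# Install rclone inside the container, then verify it is on PATH
curl https://rclone.org/install.sh | bash
rclone version
```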
After I build the container and enter it, preprocessing the data fails with a `data` attribute error:

```
root@ed1902ed9916:/workspace/rnnt# bash scripts/preprocess_librispeech.sh
Traceback (most recent call last):
  File "./utils/convert_librispeech.py", line...
```
Hi Teams, I have run the **default training script** with the following changes based on the results table:

1. **GLOBAL_BATCH_SIZE=16384**
2. **WORLD_SIZE=4** (4 A100 40GB GPUs)

P.S. **I did...
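For reference, a sketch of how these two overrides are typically applied when launching a reference run; the entry-point script name is an assumption, since the preview does not show which benchmark is being run:

```bash
# Hypothetical invocation reproducing the reported configuration
export GLOBAL_BATCH_SIZE=16384
export WORLD_SIZE=4        # 4x A100 40GB GPUs
bash run_and_time.sh       # assumed entry point; varies per benchmark
```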