training
Reference implementations of MLPerf™ training benchmarks
Why do hardware configurations only consider the number of CPU cores and the number of accelerators, without taking into account the size of the server's memory?
# MLCube for Bert

MLCube™ GitHub [repository](https://github.com/mlcommons/mlcube). MLCube™ [wiki](https://mlcommons.github.io/mlcube/).

## Project setup

An important requirement is that you must have Docker installed.

```bash
# Create Python environment and install MLCube...
```
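The preview cuts off the setup commands. A minimal sketch of the usual MLCube environment setup, assuming the standard `mlcube` and `mlcube-docker` packages (the same setup pattern appears in the MLCube execution issue further down this page):

```bash
# Create a Python environment and install the MLCube Docker runner
# (assumed package set; check the benchmark's README for exact versions)
virtualenv -p python3 ./env
source ./env/bin/activate
pip install mlcube mlcube-docker
```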
1. The README points to using an `eval.py`, which is missing from the llama2 scripts folder.
2. Instructions to run this reference implementation on multiple nodes would be helpful for...
The guide link is [image_segmentation/pytorch](https://github.com/mlcommons/training/tree/master/image_segmentation/pytorch). When I try to run the container, I get the error below saying the `nvidia` runtime does not exist. Could you please shed some light?

```
...
```
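This error usually means Docker does not have the NVIDIA runtime registered. A sketch of the common fix, assuming an Ubuntu host with the NVIDIA driver already installed and the NVIDIA Container Toolkit packages available:

```bash
# Register the NVIDIA runtime with Docker
# (assumes NVIDIA Container Toolkit is installable on this host)
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

On recent Docker versions, `docker run --gpus all` can also expose GPUs without relying on the named `nvidia` runtime.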
# Benchmark execution with MLCube

### Project setup

```bash
# Create Python environment and install MLCube Docker runner
virtualenv -p python3 ./env && source ./env/bin/activate && pip install pip==24.0 &&...
```
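Once the environment is set up, a benchmark task can be launched through the MLCube Docker runner. A sketch of typical usage; the task name here is a placeholder, since the available tasks vary per benchmark and are defined in its `mlcube.yaml`:

```bash
# Run a benchmark task via the Docker runner
# (--task value is a placeholder; see the benchmark's mlcube.yaml)
mlcube run --mlcube=. --platform=docker --task=download_data
```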
Update the Docker Compose file to use the reference implementation in the training repo instead of gltorch.

After building the Docker image provided in stable_diffusion, the first data download command fails as follows:

```
root@0d839dc3dd25:/workspace# scripts/datasets/laion400m-filtered-download-moments.sh --output-dir /datasets/laion-400m/webdataset-moments-filtered
scripts/datasets/laion400m-filtered-download-moments.sh: line 18: rclone: command not found
scripts/datasets/laion400m-filtered-download-moments.sh: line...
```
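`rclone: command not found` means the rclone binary is missing from the image. A sketch of one way to add it inside the running container, using rclone's official install script (assumes root and network access; rebuilding the image with rclone baked in is the more durable fix):

```bash
# Install rclone inside the container, then verify it is on PATH
curl https://rclone.org/install.sh | bash
rclone version
```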
After I build the container and enter it, preprocessing the data fails with a `data` attribute error:

```
root@ed1902ed9916:/workspace/rnnt# bash scripts/preprocess_librispeech.sh
Traceback (most recent call last):
  File "./utils/convert_librispeech.py", line...
```
Hi Teams, I have run the **default training script** with the following changes based on the results table:

1. **GLOBAL_BATCH_SIZE=16384**
2. **WORLD_SIZE=4** (4 A100 40GB GPUs)

P.S. **I did...
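For reference, a sketch of how these two overrides are typically applied when launching a reference run; the entry-point script name is an assumption, since the preview does not show which benchmark is being run:

```bash
# Hypothetical invocation reproducing the reported configuration
export GLOBAL_BATCH_SIZE=16384
export WORLD_SIZE=4        # 4x A100 40GB GPUs
bash run_and_time.sh       # assumed entry point; varies per benchmark
```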