Vincent Hellendoorn
Hi, that's a surprising error: it looks like the model is trying to predict a token (index 50,269) that is outside of its vocabulary (size 50,267). That is technically possible...
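For anyone debugging a similar crash, a quick sanity check is to compare the tokenizer's vocabulary size against the model's embedding matrix and the largest token index in a batch. This is only a sketch; the checkpoint path and the offending index are placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

ckpt = "path/to/checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)

# The embedding matrix determines which token indices the model can handle.
embedding_rows = model.get_input_embeddings().weight.shape[0]
print("tokenizer vocab size:", len(tokenizer))
print("model embedding rows:", embedding_rows)

# Any input or label index >= embedding_rows will trigger an out-of-range
# error (often surfacing as a CUDA device-side assert) in the forward/loss step.
batch = torch.tensor([[50_269]])  # example offending index from the report
print("max index:", batch.max().item(), "fits:", batch.max().item() < embedding_rows)
```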
Hi Aftab, Thanks for submitting this issue. It took a while to debug; the main thing I have found so far is that I can run this just fine on...
Hi, that's great to hear. The basic steps should be the following: 1. Download a checkpoint and convert it to the HuggingFace format. [This PR](https://github.com/EleutherAI/gpt-neox/pull/480) contains a file named [`convert_to_huggingface.py`](https://github.com/EleutherAI/gpt-neox/pull/480/files#diff-503107e2e8659542f2aca1df0f1ba8fbff76845eac37cc1c867c91f5b6d41d27)...
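Once the conversion has run, loading and sampling from the converted checkpoint should be standard HuggingFace usage. A minimal sketch (the checkpoint path and prompt are placeholders):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Directory produced by the conversion script (placeholder).
ckpt_dir = "path/to/converted_checkpoint"
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
model = AutoModelForCausalLM.from_pretrained(ckpt_dir)

prompt = "def binary_search(arr, target):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```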
Yes, thanks @NinedayWang! I'll try it out as soon as I have some time. In terms of next steps: if this just works with the HF classes, which it sounds...
Hi, a few others have had this error. It is typically either an out-of-memory issue or a matter of a mismatch between the CUDA version within and outside the container....
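To compare the two CUDA versions quickly, something like the following (run both inside and outside the container) usually narrows it down; the `nvidia-smi` parsing is just a convenience and assumes the standard header format:

```python
import subprocess
import torch

# CUDA runtime this PyTorch build was compiled against (container side).
print("torch.version.cuda:", torch.version.cuda)
print("torch.cuda.is_available():", torch.cuda.is_available())

# Driver-level CUDA version reported by nvidia-smi (host side); a large
# mismatch between the two is a common cause of this kind of failure.
smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
print([line for line in smi.splitlines() if "CUDA Version" in line])
```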
Sounds good! No problem; I am using it in my fork for now. I also just realized the initial PR version had a wrong condition that I'd fixed locally (hence...
Hi, the repository we used to parse Python code and generate program graphs has been open-sourced [here](https://github.com/google-research/python-graphs). This won't output samples in exactly the same format as in this dataset,...
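For a quick start with that library, a minimal sketch (assuming its `get_program_graph` entry point; as noted above, the output format will differ from this dataset's):

```python
from python_graphs import program_graph

def example(x):
    # Toy function to analyze.
    if x > 0:
        return x
    return -x

# Builds a program graph combining the AST with control- and data-flow edges.
graph = program_graph.get_program_graph(example)
print(graph)
```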
Sounds good. FWIW, I just noticed that this PR messes with the printed loss [here](https://github.com/karpathy/nanoGPT/blob/master/train.py#L249) because each loss term is normalized. One obvious fix is to scale that loss back...
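To illustrate with a self-contained toy (not nanoGPT's actual loop): dividing each micro-batch loss by the number of accumulation steps keeps the accumulated gradient equivalent to a full-batch one, but it also shrinks the value that gets printed, so scaling it back up for logging keeps the numbers comparable across configurations:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.MSELoss()
gradient_accumulation_steps = 4

opt.zero_grad()
for micro_step in range(gradient_accumulation_steps):
    x, y = torch.randn(8, 4), torch.randn(8, 4)
    # Normalizing keeps the summed gradient equal to a full-batch gradient...
    loss = criterion(model(x), y) / gradient_accumulation_steps
    loss.backward()
opt.step()

# ...but it also makes the raw loss value misleadingly small, so scale it
# back up purely for logging.
print("raw last micro-loss:  ", loss.item())
print("loss as it should log:", loss.item() * gradient_accumulation_steps)
```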
Great, glad I could help! Minor note: I realized on my own end that the number of eval steps is also affected by this, in that it now refers to...
This is useful info! I hadn't used DDP yet (training a sweep of smaller models instead), but it's nice that the sync overhead becomes negligible with more accumulation steps. I...
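In case it's useful to others reading along, the usual pattern for keeping that overhead down is to skip gradient synchronization on all but the last micro-step via DDP's `no_sync()`. A sketch, assuming the process group, model wrapping, and data loading are set up elsewhere (the function and argument names here are placeholders):

```python
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulation_step(ddp_model: DDP, optimizer, get_batch, criterion, accum_steps: int):
    """One optimizer step with gradient accumulation under DDP."""
    optimizer.zero_grad(set_to_none=True)
    for micro_step in range(accum_steps):
        x, y = get_batch()
        if micro_step < accum_steps - 1:
            # Defer the gradient all-reduce: gradients are only synchronized on
            # the last micro-step, which is why the relative sync cost shrinks
            # as accum_steps grows.
            with ddp_model.no_sync():
                loss = criterion(ddp_model(x), y) / accum_steps
                loss.backward()
        else:
            loss = criterion(ddp_model(x), y) / accum_steps
            loss.backward()
    optimizer.step()
```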