adding consistency calculations/checks at init time
Stella pointed out to how they do consistency calculations/checks with NeoX: https://github.com/EleutherAI/gpt-neox/blob/main/megatron/neox_arguments/arguments.py
It'd be good for someone to study what they did over the base Megatron-LM and replicate anything that can help our work, since some good checks can save days of running a model under a wrong setup thinking it's doing something else.
I haven't studied what they did, so I don't have any specific suggestions here.
Thank you.
Hello. I'd like to take this issue, please.
Are we specifically looking for an implementation of consistency checks to be applied directly to Megatron-DeepSpeed? Or is it just a study to better implement checks?
Are we specifically looking for an implementation of consistency checks to be applied directly to Megatron-DeepSpeed?
Yes.
I think this is mainly the code starting from:
https://github.com/EleutherAI/gpt-neox/blob/49e60fe7ad14f6991a7fa678d3a0c330d09b9ff4/megatron/neox_arguments/arguments.py#L641
Megatron-LM and thus Meg-DS has already all kinds of validations as well, so I suppose the above ones are additional checks.
Additionally, I have started compiling various constraints here: https://github.com/bigscience-workshop/bigscience/blob/master/train/sanity-checks.md so it'd be good to make sure to encompass this list as well - it's possible that it's already covered by the code in neox.
Hi @stas00, can I work on this issue?
Yes, of course, as I it doesn't look that @jtboing is working on it. I could be wrong of course.
As I mentioned earlier it was Stella's recommendation so I don't really know what can be done here.
So please have a look and if things looks interesting to work on then by all means.
It's hard to work without a concrete spec.
The only spec I have is the doc I created here: https://github.com/bigscience-workshop/bigscience/blob/master/train/sanity-checks.md to do manual checks so perhaps this could be a good starting point - so that we won't need to do that manually.
Sure, I'll look into the doc and add relevant checks for those parameters. Will update here!
I was thinking if we can add a function to validate args before this line: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/04c461ed786ed0e257690e136d3481957b0ef582/megatron/arguments.py#L319
- There’s a function by name of
_check_arg_is_not_nonewhich checks ifrequired_argsare notNone. Can I make this check in validator only and remove_check_arg_is_not_none? - I looked into the sanity check list and found that
PP_SIZE,DP_SIZEwere not available withargssince they're dependent onget_3d_dimensions. So should I use itget_3d_dimensionshere to get these vars? - How can I fetch path of
ds_config_cl.jsonfor (5) point in sanity check file?
Sorry for the lack of response. Please take over this issue for me. Thanks.
I was thinking if we can add a function to validate args before this line:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/04c461ed786ed0e257690e136d3481957b0ef582/megatron/arguments.py#L319
- There’s a function by name of
_check_arg_is_not_nonewhich checks ifrequired_argsare notNone. Can I make this check in validator only and remove_check_arg_is_not_none?
Whatever you propose we don't want to remove anything but extend improve and refactor if need be.
- I looked into the sanity check list and found that
PP_SIZE,DP_SIZEwere not available withargssince they're dependent onget_3d_dimensions. So should I use itget_3d_dimensionshere to get these vars?
why? those are available in args. e.g. args.tensor_model_parallel_size
- How can I fetch path of
ds_config_cl.jsonfor (5) point in sanity check file?
args.deepspeed_config? I think it gets parsed on the deepspeed side so perhaps it's not in args? the deepspeed engine object should definitely have it after deepspeed.initialize - not sure if it's too late for validation.
Exactly, my plan is to extend and provide most asserts into validate_args function instead of keeping them in parse_args
Can you give me write access for this repo (to raise the PR), @stas00? I made some basic changes and wanted your feedback for the same