Megatron-DeepSpeed adding consistency calculations/checks at init time

Stella pointed out to how they do consistency calculations/checks with NeoX: https://github.com/EleutherAI/gpt-neox/blob/main/megatron/neox_arguments/arguments.py

It'd be good for someone to study what they did over the base Megatron-LM and replicate anything that can help our work, since some good checks can save days of running a model under a wrong setup thinking it's doing something else.

I haven't studied what they did, so I don't have any specific suggestions here.

Thank you.

Oct 04 '21 00:10 stas00

Hello. I'd like to take this issue, please.

Nov 26 '21 10:11 jtboing

Are we specifically looking for an implementation of consistency checks to be applied directly to Megatron-DeepSpeed? Or is it just a study to better implement checks?

Nov 26 '21 11:11 jtboing

Are we specifically looking for an implementation of consistency checks to be applied directly to Megatron-DeepSpeed?

Yes.

I think this is mainly the code starting from:

https://github.com/EleutherAI/gpt-neox/blob/49e60fe7ad14f6991a7fa678d3a0c330d09b9ff4/megatron/neox_arguments/arguments.py#L641

Megatron-LM and thus Meg-DS has already all kinds of validations as well, so I suppose the above ones are additional checks.

Additionally, I have started compiling various constraints here: https://github.com/bigscience-workshop/bigscience/blob/master/train/sanity-checks.md so it'd be good to make sure to encompass this list as well - it's possible that it's already covered by the code in neox.

Nov 26 '21 17:11 stas00

Hi @stas00, can I work on this issue?

Jan 13 '22 18:01 bhavitvyamalik

Yes, of course, as I it doesn't look that @jtboing is working on it. I could be wrong of course.

As I mentioned earlier it was Stella's recommendation so I don't really know what can be done here.

So please have a look and if things looks interesting to work on then by all means.

It's hard to work without a concrete spec.

The only spec I have is the doc I created here: https://github.com/bigscience-workshop/bigscience/blob/master/train/sanity-checks.md to do manual checks so perhaps this could be a good starting point - so that we won't need to do that manually.

Jan 13 '22 18:01 stas00

Sure, I'll look into the doc and add relevant checks for those parameters. Will update here!

Jan 13 '22 18:01 bhavitvyamalik

I was thinking if we can add a function to validate args before this line: https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/04c461ed786ed0e257690e136d3481957b0ef582/megatron/arguments.py#L319

There’s a function by name of _check_arg_is_not_none which checks if required_args are not None. Can I make this check in validator only and remove _check_arg_is_not_none?
I looked into the sanity check list and found that PP_SIZE, DP_SIZE were not available with args since they're dependent on get_3d_dimensions. So should I use it get_3d_dimensionshere to get these vars?
How can I fetch path of ds_config_cl.json for (5) point in sanity check file?

Jan 15 '22 14:01 bhavitvyamalik

Sorry for the lack of response. Please take over this issue for me. Thanks.

Jan 15 '22 16:01 jtboing

I was thinking if we can add a function to validate args before this line:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/04c461ed786ed0e257690e136d3481957b0ef582/megatron/arguments.py#L319

There’s a function by name of _check_arg_is_not_none which checks if required_args are not None. Can I make this check in validator only and remove _check_arg_is_not_none?

Whatever you propose we don't want to remove anything but extend improve and refactor if need be.

I looked into the sanity check list and found that PP_SIZE, DP_SIZE were not available with args since they're dependent on get_3d_dimensions. So should I use it get_3d_dimensionshere to get these vars?

why? those are available in args. e.g. args.tensor_model_parallel_size

How can I fetch path of ds_config_cl.json for (5) point in sanity check file?

args.deepspeed_config? I think it gets parsed on the deepspeed side so perhaps it's not in args? the deepspeed engine object should definitely have it after deepspeed.initialize - not sure if it's too late for validation.

Jan 16 '22 07:01 stas00

Exactly, my plan is to extend and provide most asserts into validate_args function instead of keeping them in parse_args

Jan 16 '22 18:01 bhavitvyamalik

Can you give me write access for this repo (to raise the PR), @stas00? I made some basic changes and wanted your feedback for the same

Jan 18 '22 19:01 bhavitvyamalik