Xin (Simon) Dong
I have implemented ResNet-XNOR and reproduced the results of the XNOR-Net paper. I will release it soon, once I have time.
@tianyu-l @lessw2020 FYI, I am using this trick.

```python
hf_ds = HuggingFaceDataset(
    dataset_name, dataset_path, tokenizer, seq_len, world_size, rank, infinite
)
if shuffle:
    hf_ds._data = hf_ds._data.shuffle(seed=int(rank * 10007 + int(time.time())))
```
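For reference, here is a minimal standalone sketch of the same rank-dependent seeding applied directly to a `datasets` dataset; the function name and the shard-then-shuffle wiring are illustrative only, not torchtitan's actual implementation (the `HuggingFaceDataset` wrapper above already handles sharding internally):

```python
import time

from datasets import load_dataset


def build_rank_local_split(dataset_name: str, rank: int, world_size: int):
    # Illustrative only: load the raw dataset.
    ds = load_dataset(dataset_name, split="train")
    # Give each rank a disjoint shard of the data.
    ds = ds.shard(num_shards=world_size, index=rank)
    # Shuffle the shard with a rank-dependent seed (10007 is just a large
    # prime used to spread the per-rank seeds apart), mirroring the trick above.
    return ds.shuffle(seed=int(rank * 10007 + int(time.time())))
```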
Thanks for updating. @wanchaol Yes, I am talking about microbatching. https://github.com/pytorch/torchtitan/blob/58b11693507bc16e7df4618455ebe66e8094f71d/train.py#L291-L294

@awgu is it sufficient to change? Thanks.

From (current):

```python
with loss_parallel_ctx():
    pred = model(input_ids)
    loss = loss_fn(pred, ...
```
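For context, the question above is essentially about gradient accumulation. A minimal sketch of what a microbatched training step could look like, assuming a plain training loop; `num_microbatches` and the chunking rule are assumptions, and torchtitan's loss-parallel/pipeline specifics are omitted:

```python
def train_step(model, optimizer, loss_fn, input_ids, labels, num_microbatches):
    optimizer.zero_grad()
    # Split the global batch along the batch dimension into microbatches.
    micro_inputs = input_ids.chunk(num_microbatches)
    micro_labels = labels.chunk(num_microbatches)
    for mb_inputs, mb_labels in zip(micro_inputs, micro_labels):
        pred = model(mb_inputs)
        # Scale the loss so the accumulated gradient matches the
        # full-batch gradient.
        loss = loss_fn(pred, mb_labels) / num_microbatches
        loss.backward()
    optimizer.step()
```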
Could you please give some information on how to derive Eq. (3) in this paper?
Tested the branch:

```
  File "torchtitan/train.py", line 255, in main
    checkpoint.load()
  File "torchtitan/torchtitan/checkpoint.py", line 217, in load
    dcp.load(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/utils.py", line 427, in inner_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/checkpoint/state_dict_loader.py", line ...
```
I was wondering whether you have found the root cause of all ranks receiving the same dataloader state_dict? My guess is that it happens because the state_dict is not a DTensor?...
@tianyu-l @gokulavasan Thanks for the reply. One more note I want to mention here: the current implementation does not support `num_worker>1`. If we set `num_worker>1`, different workers will load...
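For illustration, the standard way to avoid workers loading duplicate data is to partition samples per worker via `torch.utils.data.get_worker_info()`; a minimal sketch follows (the class and its splitting rule are hypothetical, not torchtitan code):

```python
from torch.utils.data import IterableDataset, get_worker_info


class ShardedIterable(IterableDataset):
    """Illustrative iterable dataset that splits samples across DataLoader workers."""

    def __init__(self, samples):
        self.samples = samples

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: this worker yields everything.
            worker_id, num_workers = 0, 1
        else:
            worker_id, num_workers = info.id, info.num_workers
        # Each worker keeps every num_workers-th sample, offset by its id,
        # so no two workers load the same sample.
        for idx, sample in enumerate(self.samples):
            if idx % num_workers == worker_id:
                yield sample
```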
I agree. For very large models, that may be the case. Torchtitan currently does on-the-fly tokenization. I really like the idea of on-the-fly tokenization, which is great for SFT and...
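As an illustration of the on-the-fly idea, a minimal sketch that streams raw text and tokenizes lazily into fixed-length chunks; the dataset name, the `"text"` field, and the chunking rule are assumptions, not torchtitan's implementation:

```python
from datasets import load_dataset
from transformers import AutoTokenizer


def stream_token_chunks(dataset_name: str, tokenizer_name: str, seq_len: int):
    # Streaming avoids materializing or pre-tokenizing the whole corpus.
    ds = load_dataset(dataset_name, split="train", streaming=True)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    buffer = []
    for sample in ds:
        # Tokenize each document as it arrives (a "text" field is assumed).
        buffer.extend(tokenizer.encode(sample["text"]))
        # Emit fixed-length token chunks for the model.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
```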
Had the same issue here, and @chrisociepa's script was useful to me.