pythia icon indicating copy to clipboard operation
pythia copied to clipboard

Update batch_viewer docs to accurately reflect data indexing

Open jeffreygwang opened this issue 1 year ago • 1 comments

A change to address the issue below.

Issue: Documentation for Step Viewing

Currently, the documentation for utils/batch_viewer.py reads as follows:

parser.add_argument(
        "--start_iteration",
        type=int,
        default=0,
        help="What train step to start logging"
    )
    parser.add_argument(
        "--end_iteration",
        type=int,
        default=143000,
        help="Train step to end logging (inclusive)"
    )

In normal Python ranges (e.g. range(a,b), the first number is inclusive, so this would imply to me that a start of 1 and an end of 5 should include data from steps 1, 2, 3, 4, and 5.

However, a small experiment reveals this is not the case, as one cannot get a full batch of data by giving a start/end that are one apart (you would expect there to be two batches of data here in an inclusive/inclusive world):

python utils/batch_viewer.py --start_iteration 0 --end_iteration 1 --load_path final/document --save_path tester --conf_dir utils/dummy_config.yml
... [output] ...
(base) bash-4.4$ cd tester
(base) bash-4.4$ python
>>> import numpy as np
>>> zeroone = np.load("indicies.npy")
>>> zeroone.shape
(1025, 2049)

Similarly, when using the same number for start and end, you get a np array of shape (1, 2049).

This implies either the beginning is exclusive, or the ending is not inclusive as documented. I put in a PR with the former, but please let me know / change it if it's the latter!

jeffreygwang avatar Oct 08 '24 04:10 jeffreygwang

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Oct 08 '24 04:10 CLAassistant

I think there might be an off‑by‑one in the slice boundaries. Instead of returning (end_idx - start_idx +1) * 1024 the script returns (end_idx - start_idx) * 1024 +1

Error cases seen above:

--start_iteration 0 --end_iteration 0 # returns (   1, 2049) should return (1024, 2049)
--start_iteration 0 --end_iteration 1 # returns (1025, 2049) should return (2048, 2049)

I think this is the relevant code:

indicies = dataset[args.start_iteration * 1024 : args.end_iteration * 1024 + 1]

potential fix (for end_iteration inclusion):

indicies = dataset[args.start_iteration * 1024 : (args.end_iteration + 1) * 1024]

efittschen avatar May 06 '25 12:05 efittschen

I believe this is resolved by #186, and am closing this PR. Please reopen if I'm mistaken.

Quentin-Anthony avatar May 06 '25 16:05 Quentin-Anthony