LWM Please provide example uses of the scripts

Only got image with vision jax model to work, and even then had to remove the mesh_grid arg.

Everything else has failed.

E.g. needle fails like:

#! /bin/bash

export SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
export PROJECT_DIR="$( cd -- "$( dirname -- "$SCRIPT_DIR" )" &> /dev/null && pwd )"
cd $PROJECT_DIR
export PYTHONPATH="$PYTHONPATH:$PROJECT_DIR"
export LIBTPU_INIT_ARGS="--xla_tpu_megacore_fusion_allow_ags=false --xla_enable_async_collective_permute=true --xla_tpu_enable_ag_backward_pipelining=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true"

export llama_tokenizer_path="LWM-Chat-1M-Jax/tokenizer.model"
export lwm_text_checkpoint="LWM-Chat-1M-Jax/params"
# jsonl file containing text for haystack. Each line should be a json
# with a single key "text" containing the text.
export haystack_file="../ultrachat_qa_mix_128K/data.jsonl"
export output_file="output"

python3 -u scripts/eval_needle.py \
    --mesh_dim='!1,-1,4,1' \
    --dtype='fp32' \
    --load_llama_config='7b' \
    --update_llama_config="dict(theta=10000000,max_sequence_length=131072,use_flash_attention=False,scan_attention=True,scan_query_chunk_size=1024,scan_key_chunk_size=1024,scan_mlp=True,scan_mlp_chunk_size=1024,scan_layers=True)" \
    --load_checkpoint="params::$lwm_text_checkpoint" \
    --tokenizer.vocab_file="$llama_tokenizer_path" \
    --max_tokens_per_batch=5000 \
    --output_file="$output_file" \
    --haystack_file="$haystack_file" \
    --context_lengths_min=1000 \
    --context_lengths_max=10000 \
    --n_context_length_intervals=20 \
    --n_document_depth_intervals=20 \
    --n_rounds=3
read

(lwm) jon@gpu:~/LWM$ bash scripts/run_eval_needle.sh
I0216 10:25:24.068257 139879088207680 xla_bridge.py:660] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
I0216 10:25:24.070914 139879088207680 xla_bridge.py:660] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory


Starting Needle In A Haystack Testing...
- Context Lengths: 20, Min: 1000, Max: 10000
- Document Depths: 20, Min: 0%, Max: 100%
- Needle: The special magic {city} number is: {rnd_number}



W0216 10:26:39.398258 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
W0216 10:26:39.447406 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: [Errno 113] No route to host
W0216 10:26:42.451228 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
W0216 10:26:42.451697 139879088207680 _default.py:338] Authentication failed using Compute Engine authentication due to unavailable metadata server.
W0216 10:26:42.530295 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f4c0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.607035 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430efb0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.686556 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f130>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.767113 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 4 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f160>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.851304 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f7f0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
completed 0
Traceback (most recent call last):
  File "/home/jon/LWM/scripts/eval_needle.py", line 447, in <module>
    run(main)
  File "/home/jon/miniconda3/envs/lwm/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/jon/miniconda3/envs/lwm/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/jon/LWM/scripts/eval_needle.py", line 444, in main
    ht.start_test()
  File "/home/jon/LWM/scripts/eval_needle.py", line 306, in start_test
    self.run_test()
  File "/home/jon/LWM/scripts/eval_needle.py", line 230, in run_test
    full_contexts = self.read_context_files(FLAGS.n_rounds)
  File "/home/jon/LWM/scripts/eval_needle.py", line 129, in read_context_files
    text = json.loads(f.readline())['text']
KeyError: 'text'

i.e. some specific files are required that aren't shared, and some access to google is used, which isn't explained.

Feb 16 '24 18:02 pseudotensor

There seems to be no examples or clarity on how to run the torch version of models.

Feb 16 '24 18:02 pseudotensor

Sorry about that, I'll spend some time this coming weekend to write some more descriptions.

I can also include the dataset generation script. In general, it's just downloading pg19 and rewriting each entry into a jsonl file, with each row as {'text': <text>}

Feb 21 '24 20:02 wilson1yan

here in this commit is the info for the data formatting. hope this helps!

Mar 14 '24 15:03 angelacast135

Hi, I wonder if those warnings could be ignored when the inference seems fine.

W0216 10:26:39.398258 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out W0216 10:26:39.447406 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: [Errno 113] No route to host W0216 10:26:42.451228 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out W0216 10:26:42.451697 139879088207680 _default.py:338] Authentication failed using Compute Engine authentication due to unavailable metadata server. W0216 10:26:42.530295 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f4c0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)")) W0216 10:26:42.607035 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430efb0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)")) W0216 10:26:42.686556 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f130>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)")) W0216 10:26:42.767113 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 4 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f160>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)")) W0216 10:26:42.851304 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f7f0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)")) ...... {'context_length': 1000, 'depth_percent': 0.0, 'response': 'The special magic Jakarta number is 8394266.', 'answer': '8394266', 'correct': True, 'seed': 0} {'context_length': 1000, 'depth_percent': 0.0, 'response': 'The special magic Damascus number is 1125686.', 'answer': '1125686', 'correct': True, 'seed': 1} 3%|████ | 2/60 [00:35<17:05, 17.69s/it] {'context_length': 1000, 'depth_percent': 0.0, 'response': 'The special magic Belgrade number is 1585963.', 'answer': '1585963', 'correct': True, 'seed': 2} {'context_length': 1000, 'depth_percent': 5.0, 'response': 'The special magic Los Angeles number is 2408249.', 'answer': '2408249', 'correct': True, 'seed': 0 } 7%|████████▏ | 4/60 [00:56<12:36, 13.52s/it] ......

Apr 07 '24 10:04 Treemann