Please provide example uses of the scripts
I only got the image script with the vision JAX model to work, and even then I had to remove the mesh_grid arg.
Everything else has failed.
E.g. the needle eval fails with this script:
#! /bin/bash
export SCRIPT_DIR="$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )"
export PROJECT_DIR="$( cd -- "$( dirname -- "$SCRIPT_DIR" )" &> /dev/null && pwd )"
cd $PROJECT_DIR
export PYTHONPATH="$PYTHONPATH:$PROJECT_DIR"
export LIBTPU_INIT_ARGS="--xla_tpu_megacore_fusion_allow_ags=false --xla_enable_async_collective_permute=true --xla_tpu_enable_ag_backward_pipelining=true --xla_tpu_enable_data_parallel_all_reduce_opt=true --xla_tpu_data_parallel_opt_different_sized_ops=true --xla_tpu_enable_async_collective_fusion=true --xla_tpu_enable_async_collective_fusion_multiple_steps=true --xla_tpu_overlap_compute_collective_tc=true --xla_enable_async_all_gather=true"
export llama_tokenizer_path="LWM-Chat-1M-Jax/tokenizer.model"
export lwm_text_checkpoint="LWM-Chat-1M-Jax/params"
# jsonl file containing text for haystack. Each line should be a json
# with a single key "text" containing the text.
export haystack_file="../ultrachat_qa_mix_128K/data.jsonl"
export output_file="output"
python3 -u scripts/eval_needle.py \
--mesh_dim='!1,-1,4,1' \
--dtype='fp32' \
--load_llama_config='7b' \
--update_llama_config="dict(theta=10000000,max_sequence_length=131072,use_flash_attention=False,scan_attention=True,scan_query_chunk_size=1024,scan_key_chunk_size=1024,scan_mlp=True,scan_mlp_chunk_size=1024,scan_layers=True)" \
--load_checkpoint="params::$lwm_text_checkpoint" \
--tokenizer.vocab_file="$llama_tokenizer_path" \
--max_tokens_per_batch=5000 \
--output_file="$output_file" \
--haystack_file="$haystack_file" \
--context_lengths_min=1000 \
--context_lengths_max=10000 \
--n_context_length_intervals=20 \
--n_document_depth_intervals=20 \
--n_rounds=3
read
(lwm) jon@gpu:~/LWM$ bash scripts/run_eval_needle.sh
I0216 10:25:24.068257 139879088207680 xla_bridge.py:660] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
I0216 10:25:24.070914 139879088207680 xla_bridge.py:660] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
Starting Needle In A Haystack Testing...
- Context Lengths: 20, Min: 1000, Max: 10000
- Document Depths: 20, Min: 0%, Max: 100%
- Needle: The special magic {city} number is: {rnd_number}
W0216 10:26:39.398258 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
W0216 10:26:39.447406 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: [Errno 113] No route to host
W0216 10:26:42.451228 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
W0216 10:26:42.451697 139879088207680 _default.py:338] Authentication failed using Compute Engine authentication due to unavailable metadata server.
W0216 10:26:42.530295 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f4c0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.607035 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430efb0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.686556 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f130>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.767113 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 4 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f160>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.851304 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f7f0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
completed 0
Traceback (most recent call last):
File "/home/jon/LWM/scripts/eval_needle.py", line 447, in <module>
run(main)
File "/home/jon/miniconda3/envs/lwm/lib/python3.10/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/home/jon/miniconda3/envs/lwm/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "/home/jon/LWM/scripts/eval_needle.py", line 444, in main
ht.start_test()
File "/home/jon/LWM/scripts/eval_needle.py", line 306, in start_test
self.run_test()
File "/home/jon/LWM/scripts/eval_needle.py", line 230, in run_test
full_contexts = self.read_context_files(FLAGS.n_rounds)
File "/home/jon/LWM/scripts/eval_needle.py", line 129, in read_context_files
text = json.loads(f.readline())['text']
KeyError: 'text'
I.e. some specific files are required that aren't shared, and the script tries to reach Google services, which isn't explained.
There also seem to be no examples or explanation of how to run the torch version of the models.
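For anyone hitting the same KeyError: the traceback shows eval_needle.py calling json.loads(f.readline())['text'], so each line of the haystack file has to be a JSON object with a "text" key. Below is a minimal sketch for checking a file before launching the eval. The function name find_bad_haystack_lines and the max_lines parameter are hypothetical and not part of the LWM repo.

```python
import json


def find_bad_haystack_lines(path, max_lines=100):
    """Return (line_number, reason) pairs for lines without a string "text" field.

    Hypothetical helper, not part of LWM. Only the first `max_lines` lines
    of the file are inspected.
    """
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if lineno > max_lines:
                break
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "not a JSON object"))
            elif not isinstance(record.get("text"), str):
                # Listing the keys makes it obvious which field holds the text.
                problems.append(
                    (lineno, f"no string 'text' field; keys: {sorted(record)}")
                )
    return problems
```

Running it on the haystack_file path from the script above and printing the result should show which keys the file actually uses, if the first line is the one that fails.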
Sorry about that, I'll spend some time this coming weekend to write some more descriptions.
I can also include the dataset generation script. In general, it just downloads pg19 and rewrites each entry into a jsonl file, with each row as {'text': <text>}
The data formatting info is in this commit. Hope this helps!
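Until the generation script is posted, here is a minimal sketch of the format described above: one JSON object per line, each with a single "text" key. It is not the maintainers' script. The function name write_haystack_jsonl is hypothetical, and how the pg19 texts are downloaded is left out on purpose, so it takes any iterable of strings.

```python
import json


def write_haystack_jsonl(texts, path):
    """Write each non-empty text as one {"text": ...} JSON line; return the row count.

    Hypothetical helper, not part of LWM. `texts` can be any iterable of
    strings, e.g. the documents of a corpus you have already downloaded.
    """
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            if not text.strip():
                continue  # skip empty documents
            # json.dumps escapes newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    docs = ["First document.\nSecond paragraph.", "", "Another document."]
    print(write_haystack_jsonl(docs, "data.jsonl"))
```

The resulting data.jsonl can then be passed as haystack_file in the run script.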
Hi, can these warnings be ignored when inference otherwise seems fine?
W0216 10:26:39.398258 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
W0216 10:26:39.447406 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: [Errno 113] No route to host
W0216 10:26:42.451228 139879088207680 _metadata.py:139] Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
W0216 10:26:42.451697 139879088207680 _default.py:338] Authentication failed using Compute Engine authentication due to unavailable metadata server.
W0216 10:26:42.530295 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f4c0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.607035 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430efb0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.686556 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f130>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.767113 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 4 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f160>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
W0216 10:26:42.851304 139879088207680 _metadata.py:208] Compute Engine Metadata server unavailable on attempt 5 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7f372430f7f0>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
......
{'context_length': 1000, 'depth_percent': 0.0, 'response': 'The special magic Jakarta number is 8394266.', 'answer': '8394266', 'correct': True, 'seed': 0}
{'context_length': 1000, 'depth_percent': 0.0, 'response': 'The special magic Damascus number is 1125686.', 'answer': '1125686', 'correct': True, 'seed': 1}
3%|████ | 2/60 [00:35<17:05, 17.69s/it]
{'context_length': 1000, 'depth_percent': 0.0, 'response': 'The special magic Belgrade number is 1585963.', 'answer': '1585963', 'correct': True, 'seed': 2}
{'context_length': 1000, 'depth_percent': 5.0, 'response': 'The special magic Los Angeles number is 2408249.', 'answer': '2408249', 'correct': True, 'seed': 0}
7%|████████▏ | 4/60 [00:56<12:36, 13.52s/it]
......