Question: how do I fix this fine-tuning error?
Hardware: single machine, 8x 3090 GPUs
Config:
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'yes'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: 192.168.33.201
main_process_port: 21889
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
Launch script:
num_machines=1
num_processes=$((num_machines * 8))
machine_rank=0
accelerate launch \
--config_file ./configs/sft.yaml \
--num_processes $num_processes \
--num_machines $num_machines \
--machine_rank $machine_rank \
--deepspeed_multinode_launcher standard finetune_moss.py \
--model_name_or_path fnlp/moss-moon-003-sft-int4 \
--data_dir ./sft_data \
--output_dir ./ckpts/moss-moon-003-sft-int4 \
--log_dir ./train_logs/moss-moon-003-sft-int4 \
--n_epochs 2 \
--train_bsz_per_gpu 4 \
--eval_bsz_per_gpu 4 \
--learning_rate 0.000015 \
--eval_step 15 \
--save_step 35
Error output:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 3
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 5
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 7
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 2
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 6
INFO:torch.distributed.distributed_c10d:Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
INFO:torch.distributed.distributed_c10d:Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
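Traceback (most recent call last):
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 449, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 444, in _make_request
httplib_response = conn.getresponse()
File "/opt/anaconda3/lib/python3.10/http/client.py", line 1374, in getresponse
response.begin()
File "/opt/anaconda3/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/opt/anaconda3/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
train(args)
File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained
tokenizer_class = get_class_from_dynamic_module(
File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module
final_module = get_cached_module_file(
File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file
commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha
File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
return fn(*args, **kwargs)
File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
return f(*args, **kwargs)
File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info
r = requests.get(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.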
Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
conn.connect()
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
self.sock = ssl_wrap_socket(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
ssl_sock = _ssl_wrap_socket_impl(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
return self.sslsocket_class._create(
File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
self.do_handshake()
File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
self._sslobj.do_handshake()
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
conn.connect()
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
self.sock = ssl_wrap_socket(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
ssl_sock = _ssl_wrap_socket_impl(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket
return self.sslsocket_class._create(
File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create
self.do_handshake()
File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake
self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
resp = conn.urlopen(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
retries = retries.increment(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise
raise value.with_traceback(tb)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
self._validate_conn(conn)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
conn.connect()
File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 461, in connect
cert = self.sock.getpeercert()
File "/opt/anaconda3/lib/python3.10/ssl.py", line 1154, in getpeercert
self._check_connected()
File "/opt/anaconda3/lib/python3.10/ssl.py", line 1119, in _check_connected
self.getpeername()
urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in <module>
train(args)
File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1760, in from_pretrained
resolved_vocab_files[file_id] = cached_file(
File "/opt/anaconda3/lib/python3.10/site-packages/transformers/utils/hub.py", line 409, in cached_file
resolved_file = hf_hub_download(
File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1226, in hf_hub_download
http_get(
File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 470, in http_get
r = _request_wrapper(
File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 433, in _request_wrapper
return http_backoff(
File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 105, in http_backoff
response = requests.request(method=method, url=url, **kwargs)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
resp = self.send(prep, **send_kwargs)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
r = adapter.send(request, **kwargs)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))
Downloading: 44%|██████████████████████████▊ | 1.10M/2.50M [00:01<00:00, 1.50MB/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2556277 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2556278) of binary: /opt/anaconda3/bin/python
Traceback (most recent call last):
File "/opt/anaconda3/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 900, in launch_command
deepspeed_launcher(args)
File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
distrib_run.run(args)
File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune_moss.py FAILED
Failures:
[1]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 2556279)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 2556280)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 2556281)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 2556282)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 2556283)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 2556284)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
  time      : 2023-05-26_09:56:29
  host      : s012.ai.ldap
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2556278)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
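One possible workaround, offered as a sketch rather than a confirmed fix: every complete traceback above ends in AutoTokenizer.from_pretrained failing on a request to the Hugging Face Hub, and all 8 ranks issue those requests at the same time. Assuming the connection problem is transient or rate-related, the model files could be fetched once in a single process before launching, and the resulting local directory passed as --model_name_or_path. The helper name resolve_model_path and its download parameter are hypothetical and are not part of finetune_moss.py; snapshot_download comes from huggingface_hub.

```python
import os
from typing import Callable, Optional


def resolve_model_path(
    name_or_path: str,
    download: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a local directory that holds the model files.

    Hypothetical helper: if `name_or_path` is already a directory it is
    returned unchanged and no network request is made. Otherwise it is
    treated as a Hub repo id and downloaded once via `download`.
    """
    if os.path.isdir(name_or_path):
        return name_or_path

    if download is None:
        # Imported lazily so the helper can be exercised without
        # huggingface_hub installed (for example with a stub downloader).
        from huggingface_hub import snapshot_download

        download = snapshot_download

    # snapshot_download returns the path of the local snapshot folder.
    return download(name_or_path)
```

Run it once in a plain Python session (not under accelerate launch), for example `print(resolve_model_path("fnlp/moss-moon-003-sft-int4"))`, then pass the printed directory as --model_name_or_path. If the download itself keeps failing, the cause is more likely the network path to the Hub (proxy, firewall) than the training setup.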
aborted.', RemoteDisconnected('Remote end closed connection without response'))
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last): File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen httplib_response = self._make_request( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request self._validate_conn(conn) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn conn.connect() File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect self.sock = ssl_wrap_socket( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 449, in ssl_wrap_socket ssl_sock = ssl_wrap_socket_impl( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 493, in _ssl_wrap_socket_impl return ssl_context.wrap_socket(sock, server_hostname=server_hostname) File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket return self.sslsocket_class._create( File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create self.do_handshake() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake self._sslobj.do_handshake() ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send resp = conn.urlopen( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen retries = retries.increment( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment raise six.reraise(type(error), error, _stacktrace) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise raise value.with_traceback(tb) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen httplib_response = self._make_request( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request self._validate_conn(conn) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in validate_conn conn.connect() File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect self.sock = ssl_wrap_socket( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 449, in ssl_wrap_socket ssl_sock = ssl_wrap_socket_impl( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 493, in _ssl_wrap_socket_impl return ssl_context.wrap_socket(sock, server_hostname=server_hostname) File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket return self.sslsocket_class._create( File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create self.do_handshake() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake self._sslobj.do_handshake() urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in train(args) File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True) File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained tokenizer_class = get_class_from_dynamic_module( File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module final_module = get_cached_module_file( File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn return fn(*args, **kwargs) File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f return f(*args, **kwargs) File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info r = requests.get( File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get return request("get", url, params=params, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request return session.request(method=method, url=url, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request resp = self.send(prep, **send_kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send r = adapter.send(request, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send raise ConnectionError(err, request=request) requests.exceptions.ConnectionError: ('Connection 
aborted.', ConnectionResetError(104, 'Connection reset by peer')) Traceback (most recent call last): File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen httplib_response = self._make_request( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request self._validate_conn(conn) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in validate_conn conn.connect() File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect self.sock = ssl_wrap_socket( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 449, in ssl_wrap_socket ssl_sock = ssl_wrap_socket_impl( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 493, in _ssl_wrap_socket_impl return ssl_context.wrap_socket(sock, server_hostname=server_hostname) File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket return self.sslsocket_class._create( File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create self.do_handshake() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake self._sslobj.do_handshake() ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send resp = conn.urlopen( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen retries = retries.increment( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment raise six.reraise(type(error), error, _stacktrace) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise raise value.with_traceback(tb) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen httplib_response = self._make_request( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request self._validate_conn(conn) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in validate_conn conn.connect() File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect self.sock = ssl_wrap_socket( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 449, in ssl_wrap_socket ssl_sock = ssl_wrap_socket_impl( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 493, in _ssl_wrap_socket_impl return ssl_context.wrap_socket(sock, server_hostname=server_hostname) File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket return self.sslsocket_class._create( File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create self.do_handshake() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake self._sslobj.do_handshake() urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in train(args) File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True) File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 626, in from_pretrained tokenizer_class = get_class_from_dynamic_module( File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 363, in get_class_from_dynamic_module final_module = get_cached_module_file( File "/opt/anaconda3/lib/python3.10/site-packages/transformers/dynamic_module_utils.py", line 261, in get_cached_module_file commit_hash = model_info(pretrained_model_name_or_path, revision=revision, token=token).sha File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn return fn(*args, **kwargs) File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f return f(*args, **kwargs) File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1283, in model_info r = requests.get( File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 73, in get return request("get", url, params=params, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request return session.request(method=method, url=url, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request resp = self.send(prep, **send_kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send r = adapter.send(request, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send raise ConnectionError(err, request=request) requests.exceptions.ConnectionError: ('Connection 
aborted.', ConnectionResetError(104, 'Connection reset by peer')) Traceback (most recent call last): File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in train(args) File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True) File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1801, in from_pretrained return cls._from_pretrained( File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1956, in _from_pretrained tokenizer = cls(*init_inputs, **init_kwargs) File "/home/zhangzhong/.cache/huggingface/modules/transformers_modules/fnlp/moss-moon-003-sft-int4/e3f0d7e7fba3944d5932ca2608b816678220ed25/tokenization_moss.py", line 173, in init with open(vocab_file, encoding="utf-8") as vocab_handle: TypeError: expected str, bytes or os.PathLike object, not NoneType Traceback (most recent call last): File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen httplib_response = self._make_request( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request self._validate_conn(conn) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in validate_conn conn.connect() File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect self.sock = ssl_wrap_socket( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 449, in ssl_wrap_socket ssl_sock = ssl_wrap_socket_impl( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 493, in 
_ssl_wrap_socket_impl return ssl_context.wrap_socket(sock, server_hostname=server_hostname) File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket return self.sslsocket_class._create( File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create self.do_handshake() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake self._sslobj.do_handshake() ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send resp = conn.urlopen( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen retries = retries.increment( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment raise six.reraise(type(error), error, _stacktrace) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise raise value.with_traceback(tb) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen httplib_response = self._make_request( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request self._validate_conn(conn) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in validate_conn conn.connect() File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect self.sock = ssl_wrap_socket( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 449, in ssl_wrap_socket ssl_sock = ssl_wrap_socket_impl( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/ssl.py", line 493, in _ssl_wrap_socket_impl return ssl_context.wrap_socket(sock, server_hostname=server_hostname) File "/opt/anaconda3/lib/python3.10/ssl.py", line 513, in wrap_socket return self.sslsocket_class._create( File "/opt/anaconda3/lib/python3.10/ssl.py", line 1071, in _create self.do_handshake() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1342, in do_handshake self._sslobj.do_handshake() urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in train(args) File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True) File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1760, in from_pretrained resolved_vocab_files[file_id] = cached_file( File "/opt/anaconda3/lib/python3.10/site-packages/transformers/utils/hub.py", line 409, in cached_file resolved_file = hf_hub_download( File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1226, in hf_hub_download http_get( File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 470, in http_get r = _request_wrapper( File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 433, in _request_wrapper return http_backoff( File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 105, in http_backoff response = requests.request(method=method, url=url, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request return session.request(method=method, url=url, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request resp = self.send(prep, **send_kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send r = adapter.send(request, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send raise ConnectionError(err, request=request) requests.exceptions.ConnectionError: ('Connection aborted.', 
ConnectionResetError(104, 'Connection reset by peer')) Traceback (most recent call last): File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen httplib_response = self._make_request( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request self._validate_conn(conn) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn conn.connect() File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 461, in connect cert = self.sock.getpeercert() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1154, in getpeercert self._check_connected() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1119, in _check_connected self.getpeername() OSError: [Errno 107] Transport endpoint is not connected
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 486, in send resp = conn.urlopen( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen retries = retries.increment( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/util/retry.py", line 550, in increment raise six.reraise(type(error), error, _stacktrace) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/packages/six.py", line 769, in reraise raise value.with_traceback(tb) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen httplib_response = self._make_request( File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request self._validate_conn(conn) File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn conn.connect() File "/home/zhangzhong/.local/lib/python3.10/site-packages/urllib3/connection.py", line 461, in connect cert = self.sock.getpeercert() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1154, in getpeercert self._check_connected() File "/opt/anaconda3/lib/python3.10/ssl.py", line 1119, in _check_connected self.getpeername() urllib3.exceptions.ProtocolError: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/zhangzhong/MOSS/finetune_moss.py", line 305, in train(args) File "/home/zhangzhong/MOSS/finetune_moss.py", line 177, in train tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True) File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 641, in from_pretrained return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs) File "/opt/anaconda3/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1760, in from_pretrained resolved_vocab_files[file_id] = cached_file( File "/opt/anaconda3/lib/python3.10/site-packages/transformers/utils/hub.py", line 409, in cached_file resolved_file = hf_hub_download( File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1226, in hf_hub_download http_get( File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 470, in http_get r = _request_wrapper( File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 433, in _request_wrapper return http_backoff( File "/opt/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 105, in http_backoff response = requests.request(method=method, url=url, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/api.py", line 59, in request return session.request(method=method, url=url, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 587, in request resp = self.send(prep, **send_kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/sessions.py", line 701, in send r = adapter.send(request, **kwargs) File "/home/zhangzhong/.local/lib/python3.10/site-packages/requests/adapters.py", line 501, in send raise ConnectionError(err, request=request) requests.exceptions.ConnectionError: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected')) 
Downloading: 44%|██████████████████████████▊ | 1.10M/2.50M [00:01<00:00, 1.50MB/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2556277 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2556278) of binary: /opt/anaconda3/bin/python
Traceback (most recent call last):
  File "/opt/anaconda3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 900, in launch_command
    deepspeed_launcher(args)
  File "/opt/anaconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in deepspeed_launcher
    distrib_run.run(args)
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhangzhong/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
finetune_moss.py FAILED
Failures:
[1]: time : 2023-05-26_09:56:29 host : s012.ai.ldap rank : 2 (local_rank: 2) exitcode : 1 (pid: 2556279) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]: time : 2023-05-26_09:56:29 host : s012.ai.ldap rank : 3 (local_rank: 3) exitcode : 1 (pid: 2556280) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]: time : 2023-05-26_09:56:29 host : s012.ai.ldap rank : 4 (local_rank: 4) exitcode : 1 (pid: 2556281) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]: time : 2023-05-26_09:56:29 host : s012.ai.ldap rank : 5 (local_rank: 5) exitcode : 1 (pid: 2556282) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]: time : 2023-05-26_09:56:29 host : s012.ai.ldap rank : 6 (local_rank: 6) exitcode : 1 (pid: 2556283) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]: time : 2023-05-26_09:56:29 host : s012.ai.ldap rank : 7 (local_rank: 7) exitcode : 1 (pid: 2556284) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]: time : 2023-05-26_09:56:29 host : s012.ai.ldap rank : 1 (local_rank: 1) exitcode : 1 (pid: 2556278) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
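A note on the log above: every rank fails inside `AutoTokenizer.from_pretrained`, and the last error in each chain is a `requests.exceptions.ConnectionError` raised while contacting the Hugging Face Hub. One thing that might be worth trying, though nobody in this thread has confirmed it, is retrying the load with a backoff so that a transient connection reset does not kill the whole launch. Below is a minimal sketch under that assumption. The helper name `retry_on_network_error` and its parameters are hypothetical and not part of `finetune_moss.py` or any library. It catches `OSError` because `requests.exceptions.ConnectionError` derives from it. It would not help if the Hub is unreachable from the machine at all.

```python
import time


def retry_on_network_error(load_fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call load_fn() and retry with exponential backoff when it raises OSError.

    Hypothetical helper for illustration only. requests.exceptions.ConnectionError
    (the final error in the tracebacks above) is a subclass of OSError, and so is
    the built-in ConnectionResetError.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return load_fn()
        except OSError as err:
            last_error = err
            # Wait before the next try, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: surface the last network error unchanged.
    raise last_error


# Hypothetical usage at the tokenizer load in finetune_moss.py (not executed here):
# tokenizer = retry_on_network_error(
#     lambda: AutoTokenizer.from_pretrained(args.model_name_or_path, trust_remote_code=True)
# )
```

With 8 processes starting at once, each rank would retry independently, so this only smooths over short-lived resets. If the downloads keep failing, a retry loop will not change the outcome.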
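My configuration is the same as yours and I get the same error, but with the unquantized model on 8 A100s the error goes away. Very strange.
I switched to the base model and it still fails: RuntimeError: Socket Timeout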
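I'm on a single machine with 8 GPUs, so no socket communication should be involved.
I'm also on a single machine with 8 3090s. Which setting needs to be changed?
With a single GPU I did not run into your problem.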