
train error

Open liuxinyiwssy opened this issue 10 months ago • 0 comments

Fatal Python error: Segmentation fault

Current thread 0x000074aa85dae740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 1166 in create_module
  File "<frozen importlib._bootstrap>", line 556 in module_from_spec
  File "<frozen importlib._bootstrap>", line 657 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/ortools/graph/pywrapgraph.py", line 13 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 843 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1042 in _handle_fromlist
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/openlanev2/evaluation/f_score.py", line 40 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 843 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/openlanev2/evaluation/evaluate.py", line 26 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 843 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/openlanev2/evaluation/__init__.py", line 1 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 843 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "/home/bydpc/lxy_ws/map_topo/TopoNet/projects/toponet/datasets/openlanev2_subset_A_dataset.py", line 20 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 843 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "/home/bydpc/lxy_ws/map_topo/TopoNet/projects/toponet/datasets/__init__.py", line 2 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 843 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "/home/bydpc/lxy_ws/map_topo/TopoNet/projects/toponet/__init__.py", line 1 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 843 in exec_module
  File "<frozen importlib._bootstrap>", line 671 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 975 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 991 in _find_and_load
  File "<frozen importlib._bootstrap>", line 1014 in _gcd_import
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/importlib/__init__.py", line 127 in import_module
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/mmcv/utils/misc.py", line 73 in import_modules_from_strings
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/mmcv/utils/config.py", line 343 in fromfile
  File "tools/train.py", line 171 in main
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361 in wrapper
  File "tools/train.py", line 316 in <module>
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -11) local_rank: 0 (pid: 99244) of binary: /home/bydpc/anaconda3/envs/toponet/bin/python
/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:


           CHILD PROCESS FAILED WITH NO ERROR_FILE                

CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 99244 (local_rank 0) FAILED (exitcode -11)
Error msg: Signal 11 (SIGSEGV) received by PID 99244
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:

from torch.distributed.elastic.multiprocessing.errors import record

@record
def trainer_main(args):
    # do train


  warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in <module>
    main()
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 361, in wrapper
    return f(*args, **kwargs)
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/run.py", line 698, in main
    run(args)
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
    elastic_launch(
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/bydpc/anaconda3/envs/toponet/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:


          tools/train.py FAILED               

==================================================
Root Cause:
[0]:
  time: 2025-03-17_13:06:05
  rank: 0 (local_rank: 0)
  exitcode: -11 (pid: 99244)
  error_file: <N/A>
  msg: "Signal 11 (SIGSEGV) received by PID 99244"

Other Failures: <NO_OTHER_FAILURES>


I only have one GPU, so I ran the script ./tools/dist_train.sh 1, but it crashed with the segmentation fault shown above. Can anyone help me fix this?

liuxinyiwssy, Mar 17 '25 05:03
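
The fault handler dump shows the crash happening while ortools/graph/pywrapgraph.py loads an extension module, reached through openlanev2.evaluation.f_score, before any training code runs. The following is a minimal sketch for narrowing that down, not a fix: it imports each module from the dump in a fresh interpreter so that a segfault kills only the child process. The helper names try_import and describe are made up for illustration; the module names are taken from the file paths in the dump. It assumes it is run with the Python interpreter of the toponet environment.

```python
import signal
import subprocess
import sys

# Modules from the import chain in the dump, innermost first.
MODULES = [
    "ortools.graph.pywrapgraph",
    "openlanev2.evaluation.f_score",
    "openlanev2.evaluation",
]


def try_import(module, python=sys.executable):
    """Import `module` in a fresh interpreter; return (exit code, stderr)."""
    proc = subprocess.run(
        [python, "-X", "faulthandler", "-c", f"import {module}"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stderr


def describe(returncode):
    """Turn a subprocess exit code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a negative code -N means the child was killed by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {-returncode} ({name})"
    return f"failed with exit code {returncode}"


if __name__ == "__main__":
    for module in MODULES:
        code, stderr = try_import(module)
        print(f"{module}: {describe(code)}")
        if code != 0 and stderr:
            # Show only the last few lines of the child's error output.
            print("\n".join(stderr.strip().splitlines()[-5:]))
```

If the first entry alone is reported as killed by signal 11, the crash can be reproduced without TopoNet or torch.distributed, which points at the installed ortools build rather than at the training script or the single-GPU launch.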