
Simulations in Summit

YC-Liang opened this issue 2 years ago • 17 comments

Dear author(s),

I had some issues running the basic Summit simulator from the summit_drivehard folder. After launching Carla on port 23000 and running the command python simulator.py, I got the error: Pyro4.errors.CommunicationError: cannot connect to ('localhost', 23010): [Errno 111] Connection refused

So I started another Carla instance on port 23010 in a different terminal. The error disappeared, but the car is not moving, there are no other crowds in the environment, the other window (port 23010) is empty, and nothing is reported in the terminal.

Screenshot from 2023-02-20 17-31-41

Are any of the above steps wrong? If you could kindly give some steps for running the simulator, that would be awesome.

I am on Ubuntu 22.04 with an Nvidia RTX 4080 GPU.

Thanks.

PS: I posted another issue in the Summit repo as I could not get the tutorial to work; I'm not sure whether the issues are related.

YC-Liang avatar Feb 20 '23 06:02 YC-Liang

In principle everything should run with python3 simulator.py. It should internally launch an instance of the SUMMIT simulator, along with the gamma_crowd.py crowd controller, as well as an ego-agent that runs the handcrafted DESPOT strategy with macro-action length 1. In particular, you won't need to run the scripts in the SUMMIT repo separately. The exact details may be found in the script.

By default, the SUMMIT simulator is hidden in-memory and not shown on-screen. Could you try running python3 simulator.py --debug --visualize? This should enable the visualization of the spawned simulator, and print out a bunch of debug info to better pinpoint what went wrong.

LeeYiyuan avatar Feb 20 '23 14:02 LeeYiyuan

I suspect that simulator.py is not launching the SUMMIT simulator and the gamma_crowd.py script properly. These are on lines 344 and 475 respectively.

I had hardcoded the paths to the SUMMIT binary and the crowd script, which might have caused the error. Do you think you could debug near those lines and see if that is indeed the case?
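As a stopgap until those hardcoded paths are made configurable, one option is to resolve them from environment variables with a fallback. A minimal sketch — the variable names mirror the constants discussed in this thread, but the defaults here are assumptions, not the repo's actual configuration:

```python
import os

# Hypothetical sketch: let the user override the SUMMIT install location via
# an environment variable instead of editing hardcoded paths in simulator.py.
# The default locations below are assumptions for illustration only.
SUMMIT_PATH = os.environ.get("SUMMIT_PATH", os.path.expanduser("~/summit"))
SUMMIT_SCRIPTS_PATH = os.environ.get(
    "SUMMIT_SCRIPTS_PATH", os.path.join(SUMMIT_PATH, "PythonAPI", "examples"))
```

This way a user only needs to export SUMMIT_PATH once rather than patching the script.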

LeeYiyuan avatar Feb 20 '23 16:02 LeeYiyuan

I have changed the code to point to the summit path on my machine, and the program now gets stuck when it tries to launch the crowds. Specifically, the following errors are reported:

Launching environment process...
Launching simulator process...
    Delaying for simulator to start up...
    Creating client...
Resetting world...
    Spawning meshes...
    Spawning ego-agent...
        Launching controller...
    Launching GAMMA process...
        Delaying for crowd service to launch...
        Waiting for spawn target to be reached...
Traceback (most recent call last):
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/Pyro4/core.py", line 511, in connect_and_handshake
    sock = socketutil.createSocket(connect=connect_location,
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/Pyro4/socketutil.py", line 307, in createSocket
    sock.connect(connect)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "simulator.py", line 628, in <module>
    sim.reset_world()
  File "simulator.py", line 503, in reset_world
    while not self.crowd_service.spawn_target_reached:
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/Pyro4/core.py", line 275, in __getattr__
    self._pyroGetMetadata()
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/Pyro4/core.py", line 615, in _pyroGetMetadata
    self.__pyroCreateConnection()
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/Pyro4/core.py", line 596, in __pyroCreateConnection
    connect_and_handshake(conn)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/Pyro4/core.py", line 549, in connect_and_handshake
    raise ce
Pyro4.errors.CommunicationError: cannot connect to ('localhost', 23010): [Errno 111] Connection refused

Does using a conda environment cause some networking issue? It seems the problem is at the following lines of code:

self.crowd_service = Pyro4.Proxy('PYRO:crowdservice.warehouse@localhost:{}'.format(self.pyro_port))
debug_print('        Delaying for crowd service to launch...')
time.sleep(3)
debug_print('        Waiting for spawn target to be reached...')
while not self.crowd_service.spawn_target_reached:
    time.sleep(0.2)

In particular, the above errors appear as soon as crowd_service is used. I tried increasing the sleep time to 10 s, but that didn't help.
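As an aside, instead of tuning a fixed sleep, a more robust pattern is to retry the first call to the service until it succeeds or a deadline passes. A generic sketch, not tied to Pyro4's API — in practice the probe could be a first attribute access on the proxy, and the caught exception could be narrowed to Pyro4.errors.CommunicationError:

```python
import time

def retry_until_ready(probe, timeout=30.0, interval=0.5):
    """Call probe() until it stops raising, or re-raise after `timeout` seconds.

    Generic sketch: probe would touch the remote service once (e.g. a first
    attribute access on a Pyro4 proxy); catching Exception is deliberately
    broad here and could be narrowed to the relevant communication error.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return probe()
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)

# Illustration with a probe that fails twice before succeeding.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionRefusedError("not up yet")
    return "ready"
```

This keeps startup fast when the service comes up quickly, while still tolerating slow machines.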

YC-Liang avatar Feb 21 '23 03:02 YC-Liang

When using the --visualize flag, does the simulator show up on the screen? If so, the crowd should spawn once Delaying for crowd service to launch... is printed, and you will be able to see it visually...

...if it does not appear on-screen, I think you may need to modify SUMMIT_SCRIPTS_PATH too, defined on line 58 and used on line 478 to point to the gamma_crowd.py script. The error refers to the inability to access the agent states tracked by gamma_crowd.py, which likely means that gamma_crowd.py did not even launch successfully.

I believe there should be no issues with networking caused by conda -- SUMMIT/CARLA relies on TCP to receive instructions, with spawning the maps as one such instruction. As long as the maps appear, the networking should be good.
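As a quick sanity check independent of conda, one can probe whether a given port is actually accepting TCP connections. A hypothetical helper, not part of the repo:

```python
import socket

def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds.

    Handy for checking whether the SUMMIT/CARLA RPC port (e.g. 23000) or the
    Pyro4 crowd-service port (e.g. 23010) is accepting connections before
    the client tries to use it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If port_open("localhost", 23010) is False while the simulator is supposedly running, the crowd service never started listening.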

LeeYiyuan avatar Feb 21 '23 03:02 LeeYiyuan

The crowd service did not launch, as the picture I put at the very top of this thread shows. The agent is not moving either. I believe I have set the SUMMIT_SCRIPTS_PATH to the correct one as the images and meshes are loaded correctly, and gamma_crowd.py is in the same directory.

YC-Liang avatar Feb 21 '23 04:02 YC-Liang

Okay, I think I know the issue. Your SUMMIT_SCRIPTS_PATH is definitely correct, as you pointed out from the fact that the meshes load correctly.

I had used a custom version of gamma_crowd.py with flags that don't exist in the official repository but are assumed to exist by simulator.py.

In simulator.py, delete lines 483 (--no-respawn) and 488 (--aim-center). This should let gamma_crowd.py launch correctly.

But you need a small change to gamma_crowd.py, since simulator.py assumes that gamma_crowd.py doesn't go and delete out-of-bound agents. In gamma_crowd.py, delete the lines:

            (car_agents, bike_agents, pedestrian_agents, destroy_list, statistics) = \
                    do_death(c, car_agents, bike_agents, pedestrian_agents, destroy_list, statistics)

which should be at or near line 1463.

I forgot how the --aim-center flag was implemented. It isn't necessary for running, but it was meant to ensure that spawned vehicles faced the center of the red bounding box you see in the simulator, so that the problem is actually hard (otherwise cars would just move away from the ego-agent from the start, which wouldn't be very interesting). You can probably restore it in gamma_crowd.py's do_spawn method via rejection sampling: draw a vector from the vehicle's position to the center of the bounding box, and check that the dot product of the vehicle's heading with that vector is positive.
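The rejection-sampling acceptance test described above might be sketched as follows — a hypothetical helper whose names and types are illustrative, not the actual gamma_crowd.py API:

```python
def faces_center(spawn_pos, heading, box_center):
    """Accept a spawn only if the vehicle's heading points toward the
    bounding-box center.

    spawn_pos and box_center are (x, y) tuples; heading is the vehicle's
    forward vector. The spawn is accepted when the dot product of the
    heading with the vector from the vehicle to the center is positive,
    i.e. the vehicle is facing (roughly) toward the center.
    """
    to_center = (box_center[0] - spawn_pos[0], box_center[1] - spawn_pos[1])
    return heading[0] * to_center[0] + heading[1] * to_center[1] > 0
```

In do_spawn, one would keep re-sampling candidate spawn poses until faces_center returns True.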

LeeYiyuan avatar Feb 21 '23 04:02 LeeYiyuan

Yeah, I got the crowd to launch now; however, the next line throws another error:

File "simulator.py", line 629, in <module>
    sim.reset_world()
  File "simulator.py", line 503, in reset_world
    while not self.crowd_service.spawn_target_reached:
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/Pyro4/core.py", line 280, in __getattr__
    raise AttributeError("remote object '%s' has no exposed attribute or method '%s'" % (self._pyroUri, name))
AttributeError: remote object 'PYRO:crowdservice.warehouse@localhost:23010' has no exposed attribute or method 'spawn_target_reached'

Is the spawn_target_reached attribute also something from the modified gamma_crowd.py?

YC-Liang avatar Feb 21 '23 04:02 YC-Liang

Yup... 🤦

For the moment, it would be easiest to replace that line (while not self.crowd_service.spawn_target_reached:) with:

time.sleep(5)
while self.crowd_service.spawn_car:
    time.sleep(0.2)

LeeYiyuan avatar Feb 21 '23 04:02 LeeYiyuan

If it runs without error but crashes afterward with the error message ERROR: Incorrect number of exo agents!, try changing SPAWN_DESTROY_RATE_MAX to 3 (gamma_crowd.py:41) and SPAWN_DESTROY_REPETITIONS to 1 (gamma_crowd.py:45).

LeeYiyuan avatar Feb 21 '23 04:02 LeeYiyuan

Everything works more smoothly now, thanks. One more issue: when visualising, I get the warning message shown below.

Screenshot from 2023-02-21 15-58-19

I can click OK and the simulator starts fine, but when visualisation is turned off, the warning message prevents the simulator from launching. Also, sometimes even when I click OK, the simulator window still shuts down straight away.

YC-Liang avatar Feb 21 '23 05:02 YC-Liang

Also, sometimes even when I clicked ok, the simulator window still shuts down straight away.

In this case, I need to restart the system and then everything works; I'm not sure what causes it. Potentially some processes are not killed in the background?

YC-Liang avatar Feb 21 '23 05:02 YC-Liang

We bumped UE to 4.26, which seems to have deprecated -opengl, so you can remove that from simulator.py:349.

If you want to run it headless (i.e. no visualization), you would now need to add -RenderOffscreen.

I really need to update the code here, as well as the SUMMIT docs, when I get the time for it...

LeeYiyuan avatar Feb 21 '23 05:02 LeeYiyuan

Thank you so much for answering patiently, I believe the core issues have been solved!

YC-Liang avatar Feb 21 '23 06:02 YC-Liang

Thank you! We've uncovered many existing issues related to getting the code back up and running thanks to your help, really :)

Leaving this issue open until the docs/code have been updated...

LeeYiyuan avatar Feb 21 '23 06:02 LeeYiyuan

Hi, sorry to bother you again! When using the MAGIC model in the SUMMIT simulator, an error related to using CUDA in subprocesses is raised:

Traceback (most recent call last):
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "simulator.py", line 122, in environment_process
    gen_model = MAGICGenNet_DriveHard(MACRO_LENGTH, True, True).float().to(device)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 989, in to
    return self._apply(convert)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 641, in _apply
    module._apply(fn)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 664, in _apply
    param_applied = fn(param)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/torch/cuda/__init__.py", line 217, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I tried changing the start method to 'spawn', but that didn't solve the issue; instead I got the following error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/runpy.py", line 265, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/james/Documents/Uni/SCNC3021/magic-playground/python/summit_drivehard/simulator.py", line 16, in <module>
    from controller import Controller
  File "/home/james/Documents/Uni/SCNC3021/magic-playground/python/summit_drivehard/controller.py", line 12, in <module>
    import carla
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/home/james/summit/PythonAPI/carla/dist/carla-0.9.8-py3.8-linux-x86_64.egg/carla/__init__.py", line 8, in <module>
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 618, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/home/james/summit/PythonAPI/carla/dist/carla-0.9.8-py3.8-linux-x86_64.egg/carla/libcarla.py", line 7, in <module>
  File "/home/james/summit/PythonAPI/carla/dist/carla-0.9.8-py3.8-linux-x86_64.egg/carla/libcarla.py", line 3, in __bootstrap__
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3260, in <module>
    def _initialize_master_working_set():
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3234, in _call_aside
    f(*args, **kwargs)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3272, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 572, in _build_master
    ws = cls()
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 565, in __init__
    self.add_entry(entry)
  File "/home/james/miniconda3/envs/magic2/lib/python3.8/site-packages/pkg_resources/__init__.py", line 619, in add_entry
    self.entry_keys.setdefault(entry, [])
TypeError: unhashable type: 'list'

YC-Liang avatar Feb 22 '23 11:02 YC-Liang

I can't seem to reproduce the error. The error occurs because CUDA was used somewhere before environment_process started running. In principle, the only place anything from torch should be used is inside environment_process, precisely to prevent this error.
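For reference, the safe pattern is to keep all torch/CUDA work inside the child's target function and, when using the 'spawn' start method, put the launch code behind a __main__ guard. A minimal stdlib sketch with the torch model build replaced by a placeholder:

```python
import multiprocessing as mp

def environment_process(q):
    # In the real simulator.py this is where the torch model would be built
    # and moved to its device, so CUDA is first initialized here in the
    # child, never in the parent.
    q.put("model built in child")

def launch_environment():
    # 'spawn' starts the child from a fresh interpreter, so no CUDA state is
    # inherited from the parent. Note that spawn re-imports the main module
    # in the child -- which is why the carla egg import re-runs in the
    # traceback above -- so any launch code must sit behind an
    # `if __name__ == "__main__":` guard in the script.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=environment_process, args=(q,))
    p.start()
    result = q.get()
    p.join()
    return result
```

The second traceback in this thread is exactly the re-import effect: under 'spawn', module-level code in simulator.py (including the carla egg import) runs again in the child.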

LeeYiyuan avatar Feb 23 '23 04:02 LeeYiyuan

... though my SUMMIT hangs whenever I use the GPU to call the generator.

I suspect that since UE 4.26 dropped support for OpenGL, we now use Vulkan, which somehow shares some memory with Torch/CUDA. This could explain why SUMMIT hangs for me and why your machine complains instead about CUDA already being initialized.

I've pushed a commit that integrates the changes in this thread. To fix the CUDA issue, I've switched the generator to be invoked on the CPU (simulator.py:116: use torch.device("cpu") instead).

LeeYiyuan avatar Feb 23 '23 04:02 LeeYiyuan