
Issues running model on raspberrypi5 + edgetpu

Open han-so1omon opened this issue 1 year ago • 90 comments

I have some issues running the DeGirum models on my Raspberry Pi 5 + Edge TPU environment with Raspberry Pi OS 12 (Bookworm). Moving over from this issue: https://github.com/ultralytics/ultralytics/issues/1185. @shashichilappagari can you provide some assistance?

First step

# Download the models
degirum download-zoo --path /home/errc/v --device EDGETPU --runtime TFLITE --precision QUANT --token dg_4JRLnVvtfNdKLzj4oL816wNtL9gQBT5dfqmi3 --url https://cs.degirum.com/degirum/edgetpu

Try to run with degirum pysdk

import cv2
import degirum as dg

image = cv2.imread("./test-posenet.jpg")

zoo = dg.connect(dg.LOCAL, "/home/errc/v/yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1/yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1.json")
model = zoo.load_model("yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1")
print(model)

result = model.predict(image)
result_image = result.image_overlay
cv2.imwrite("./test-posenet-degirum.jpg", result_image)
# Result
> python pose-tracking-debug-degirum.py 
<degirum.model._ClientModel object at 0x7f73d62ed0>
terminate called without an active exception
terminate called without an active exception
Aborted

Try to run with base yolo ultralytics library

from ultralytics import YOLO

# Load model
model = YOLO('/home/errc/v/yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1/yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1.tflite')

# Track with the model
results = model.track(source="/home/errc/e/ai/test-infrared.mp4", save=True)
# Result
>  python pose-tracking-debug-yolo.py 
WARNING ⚠️ Unable to automatically guess model task, assuming 'task=detect'. Explicitly define task for your model, i.e. 'task=detect', 'segment', 'classify','pose' or 'obb'.
Loading /home/errc/v/yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1/yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1.tflite for TensorFlow Lite inference...
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Traceback (most recent call last):
  File "/home/errc/e/ai/pose-tracking-debug-yolo.py", line 8, in <module>
    results = model.track(source="/home/errc/e/ai/test-infrared.mp4", save=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/errc/e/ai/venv/lib/python3.11/site-packages/ultralytics/engine/model.py", line 492, in track
    return self.predict(source=source, stream=stream, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/errc/e/ai/venv/lib/python3.11/site-packages/ultralytics/engine/model.py", line 445, in predict
    self.predictor.setup_model(model=self.model, verbose=is_cli)
  File "/home/errc/e/ai/venv/lib/python3.11/site-packages/ultralytics/engine/predictor.py", line 297, in setup_model
    self.model = AutoBackend(
                 ^^^^^^^^^^^^
  File "/home/errc/e/ai/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/errc/e/ai/venv/lib/python3.11/site-packages/ultralytics/nn/autobackend.py", line 341, in __init__
    interpreter.allocate_tensors()  # allocate
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/errc/e/ai/venv/lib/python3.11/site-packages/tflite_runtime/interpreter.py", line 531, in allocate_tensors
    return self._interpreter.AllocateTensors()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Encountered unresolved custom op: edgetpu-custom-op.
See instructions: https://www.tensorflow.org/lite/guide/ops_custom Node number 0 (edgetpu-custom-op) failed to prepare.Encountered unresolved custom op: edgetpu-custom-op.
See instructions: https://www.tensorflow.org/lite/guide/ops_custom Node number 0 (edgetpu-custom-op) failed to prepare.

I can see that the Edge TPU is connected, although I am not sure that it is being used:

> lspci
0000:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries Device 2712 (rev 21)
0000:01:00.0 System peripheral: Global Unichip Corp. Coral Edge TPU
0001:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries Device 2712 (rev 21)
0001:01:00.0 Ethernet controller: Device 1de4:0001
> ls /dev/apex_0 
/dev/apex_0

han-so1omon avatar May 15 '24 19:05 han-so1omon

@kteodorovich or @boristeo Can you please help @han-so1omon. I suspect pcie driver is not installed properly. @han-so1omon can you please provide us the output of the following command

degirum sys-info

shashichilappagari avatar May 15 '24 20:05 shashichilappagari

Sure, here's the output @shashichilappagari @boristeo @kteodorovich

$ degirum sys-info
Devices:
  N2X/CPU:
  - '@Index': 0
  - '@Index': 1
  TFLITE/CPU:
  - '@Index': 0
  - '@Index': 1
  TFLITE/EDGETPU:
  - '@Index': 0
Software Version: 0.12.1

han-so1omon avatar May 16 '24 15:05 han-so1omon

@han-so1omon So, it appears that PySDK is able to recognize the Edge TPU. To ensure that the driver is properly installed and working, we made a small test script that does not depend on PySDK (this will let us determine whether the problem is with PySDK or with the basic setup). Please see if you can run the following code without errors:

import tflite_runtime.interpreter as tflite
from PIL import Image
import numpy as np
import os

print('Downloading test model and test image')
os.system('wget -nc https://raw.githubusercontent.com/google-coral/test_data/master/mobilenet_v1_1.0_224_quant_edgetpu.tflite')
os.system('wget -nc https://github.com/DeGirum/PySDKExamples/blob/main/images/Cat.jpg?raw=true -O Cat.jpg')

print('Running...')
m = tflite.Interpreter('mobilenet_v1_1.0_224_quant_edgetpu.tflite', experimental_delegates=[tflite.load_delegate('libedgetpu.so.1')])
img = Image.open('Cat.jpg')

m.allocate_tensors()
n, h, w, c = m.get_input_details()[0]['shape']
m.set_tensor(m.get_input_details()[0]['index'], np.array(img.resize((h, w)))[np.newaxis,...])
m.invoke()
out = m.get_tensor(m.get_output_details()[0]['index']).flatten()

assert np.argmax(out) == 288, 'Wrong output result'
assert out[np.argmax(out)] == 83, 'Wrong output probability'
print('OK')

shashichilappagari avatar May 16 '24 15:05 shashichilappagari

Here is the output:

$ python edge-tpu-debug-degirum.py 
Downloading test model and test image
--2024-05-16 10:45:04--  https://raw.githubusercontent.com/google-coral/test_data/master/mobilenet_v1_1.0_224_quant_edgetpu.tflite
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4749998 (4.5M) [application/octet-stream]
Saving to: ‘mobilenet_v1_1.0_224_quant_edgetpu.tflite’

mobilenet_v1_1.0_224_quant_edgetpu 100%[==============================================================>]   4.53M  13.5MB/s    in 0.3s    

2024-05-16 10:45:05 (13.5 MB/s) - ‘mobilenet_v1_1.0_224_quant_edgetpu.tflite’ saved [4749998/4749998]

--2024-05-16 10:45:05--  https://github.com/DeGirum/PySDKExamples/blob/main/images/Cat.jpg?raw=true
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/DeGirum/PySDKExamples/raw/main/images/Cat.jpg [following]
--2024-05-16 10:45:05--  https://github.com/DeGirum/PySDKExamples/raw/main/images/Cat.jpg
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/DeGirum/PySDKExamples/main/images/Cat.jpg [following]
--2024-05-16 10:45:06--  https://raw.githubusercontent.com/DeGirum/PySDKExamples/main/images/Cat.jpg
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8000::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 334467 (327K) [image/jpeg]
Saving to: ‘Cat.jpg’

Cat.jpg                            100%[==============================================================>] 326.63K  --.-KB/s    in 0.1s    

2024-05-16 10:45:06 (3.15 MB/s) - ‘Cat.jpg’ saved [334467/334467]

Running...
OK

han-so1omon avatar May 16 '24 15:05 han-so1omon

@han-so1omon Thanks for checking this. So, it does appear that edgetpu is properly functioning. Can you please try other models in the zoo to make sure that it is not a problem specific to the model?

shashichilappagari avatar May 16 '24 15:05 shashichilappagari

@shashichilappagari I have tried with another model and it works fine. Using the test image from Google's posenet project, the yolov8n_relu6_coco--640x640_quant_tflite_edgetpu_1 model is able to run detection on the image

import cv2
import degirum as dg

image = cv2.imread("./test-posenet.jpg")

#zoo = dg.connect(dg.LOCAL, "/home/errc/v/yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1/yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1.json")
zoo = dg.connect(dg.LOCAL, "/home/errc/v/yolov8n_relu6_coco--640x640_quant_tflite_edgetpu_1/yolov8n_relu6_coco--640x640_quant_tflite_edgetpu_1.json")
model = zoo.load_model("yolov8n_relu6_coco--640x640_quant_tflite_edgetpu_1")
print(model)

result = model.predict(image)
result_image = result.image_overlay
cv2.imwrite("./test-posenet-degirum.jpg", result_image)

han-so1omon avatar May 16 '24 15:05 han-so1omon

@han-so1omon Is there a typo in the above code? Did you load the pose model or the coco detection model? Assuming you loaded the detection model and that it is working, it is partially good news as this shows that pysdk is working with local edgetpu. It now seems that the problem is specific to the model. We will upload models compiled at lower resolution to see if they resolve the issue. Thanks for your patience and quick responses.

shashichilappagari avatar May 16 '24 16:05 shashichilappagari

Some more context: I'm using feranick's Edge TPU runtime and Python libraries, as recommended in the Ultralytics setup instructions, since those runtimes are kept up to date after Google abandoned the Coral project. I'm also using the default Python 3.11 on Raspberry Pi OS Bookworm. Should I perhaps try running from the Docker container? If so, do you have examples of that?

han-so1omon avatar May 16 '24 16:05 han-so1omon

Thank you, I am trying to get this issue resolved this week, and I appreciate your responsiveness quite a lot

han-so1omon avatar May 16 '24 16:05 han-so1omon

Yes, I corrected the typo. I loaded the coco detection model, and it seems to work fine. I commented out loading the coco pose model, as it was throwing the error

han-so1omon avatar May 16 '24 16:05 han-so1omon

@han-so1omon Since other models are working, the issue seems to be specific to the pose model. As you can see from our cloud platform, all models in the edgetpu model zoo are working properly on our cloud farm machines which have the google edge tpu pcie module. As I mentioned before, we will compile pose models at lower resolution and see if the problem goes away. Another option is to use google's mobilenet posenet model.

shashichilappagari avatar May 16 '24 16:05 shashichilappagari

@shashichilappagari Do you have instructions on how you've setup your google edge tpu pcie modules? Additionally, do you have the mobilenet posenet in your model zoos for edgetpu?

han-so1omon avatar May 16 '24 16:05 han-so1omon

@han-so1omon We will add it and let you know. Please give us a couple of hours to get lower resolution pose models. We will also share our setup guide with you.

shashichilappagari avatar May 16 '24 16:05 shashichilappagari

Ok, thank you

han-so1omon avatar May 16 '24 16:05 han-so1omon

@han-so1omon From the error message you are seeing, there could be some race condition in the code. We are unable to replicate it on our side but we have some ideas to test. Before I explain the ideas, I want to mention that you do not have to download the models to run locally. You can connect to cloud zoo and pysdk will automatically download the models. This will make your code simpler. Once you have finished debugging, you can of course switch to local zoo in case you want offline deployment. Your code should look like below:

import cv2
import degirum as dg

image = cv2.imread("./test-posenet.jpg")

zoo = dg.connect(dg.LOCAL, "https://cs.degirum.com/degirum/edgetpu", <your token>)
model = zoo.load_model("yolov8n_relu6_coco--640x640_quant_tflite_edgetpu_1")
print(model)

result = model.predict(image)
result_image = result.image_overlay
cv2.imwrite("./test-posenet-degirum.jpg", result_image)

With the above code you can just change model name every time you want to experiment with a different model.
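If you do stick with a local zoo, the path convention shown earlier in this thread (a directory per model containing a JSON of the same name) can be captured in a small helper, so switching models is just as easy. A sketch — the helper function is mine, not part of PySDK:

```python
from pathlib import Path

def local_model_json(zoo_root: str, model_name: str) -> str:
    """Build the path to a model's JSON in a local DeGirum zoo, following
    the <zoo>/<model>/<model>.json layout used earlier in this thread.
    (Helper name is illustrative, not part of PySDK.)"""
    return str(Path(zoo_root) / model_name / f"{model_name}.json")

# e.g. local_model_json("/home/errc/v",
#                       "yolov8n_relu6_coco--640x640_quant_tflite_edgetpu_1")
# yields the JSON path passed to dg.connect(dg.LOCAL, ...) above
print(local_model_json("/home/errc/v", "yolov8n_relu6_coco--640x640_quant_tflite_edgetpu_1"))
```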

Now to rule out the race condition that could be killing your python, we can try the following. PySDK supports three types of inference: cloud, ai_server, and local. We can try ai_server. In a terminal window, activate the python environment in which you installed pysdk. Then type

degirum server

You will see a message saying that degirum server started.

Then run the following code:

import cv2
import degirum as dg

image = cv2.imread("./test-posenet.jpg")

zoo = dg.connect('localhost', "https://cs.degirum.com/degirum/edgetpu", <your token>)
model = zoo.load_model("yolov8n_relu6_coco--640x640_quant_tflite_edgetpu_1")
print(model)

result = model.predict(image)
result_image = result.image_overlay
cv2.imwrite("./test-posenet-degirum.jpg", result_image)

Note that we changed dg.LOCAL to 'localhost' to switch to the AI server.

Please try this code and see if it works. We also added 512x512 pose model and 320x320 pose model. You can try those models also. We are in the process of adding mobilenet_posenet to the zoo and will let you know once it is added.
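Assuming the new 512x512 and 320x320 pose models follow the same naming scheme as the 640x640 model above (an assumption on my part — check the zoo for the exact names), the candidate model names can be generated like this:

```python
# Derive candidate names for the lower-resolution pose models.
# The naming scheme is inferred from the 640x640 model in this thread;
# the actual names in the cloud zoo may differ.
def pose_model_name(size: int) -> str:
    return f"yolov8n_relu6_coco_pose--{size}x{size}_quant_tflite_edgetpu_1"

for size in (640, 512, 320):
    print(pose_model_name(size))
```

Each printed name can then be passed to `zoo.load_model(...)` in the script above.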

Hope that this helps.

shashichilappagari avatar May 16 '24 18:05 shashichilappagari

Ok. What do you think the race condition is, and is there a way to perform a wait to prevent it?

han-so1omon avatar May 16 '24 19:05 han-so1omon

@han-so1omon At this point we are not sure as it could be system dependent. That is why we want you to try the localhost option. There will not be any performance impact on using this option. If localhost option works, we will at least know that the problem is localized to local inference case and we will investigate further. But if localhost also does not work on your side, we have to think of other ways to debug.

shashichilappagari avatar May 16 '24 19:05 shashichilappagari

@han-so1omon We also added the mobilenet_v1_posenet model to the edgetpu model zoo. Please see if it works on your side.

shashichilappagari avatar May 16 '24 19:05 shashichilappagari

@shashichilappagari I will try all later in the day. Can you share how you've setup the edgetpu modules?

han-so1omon avatar May 16 '24 20:05 han-so1omon

@kteodorovich can you share our user guide for edge tpu with @han-so1omon?

shashichilappagari avatar May 16 '24 20:05 shashichilappagari

@han-so1omon Hello! Our guide for USB Edge TPU is available here. You might be past all these steps already, given that you got the detection model to work.

By the way, the base Ultralytics library will only recognize a model for Edge TPU if the filename ends with _edgetpu.tflite. Also, our export modifies the structure of the model in a way that is incompatible with the Ultralytics built-in postprocessor.
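That filename convention can be checked up front. Note that the model file from the traceback earlier in this thread actually ends in `_edgetpu_1.tflite`, not `_edgetpu.tflite`, which would explain why Ultralytics fell back to the CPU XNNPACK delegate and then hit the unresolved `edgetpu-custom-op` error. A quick sketch (the helper function is mine):

```python
def ultralytics_sees_edgetpu(model_path: str) -> bool:
    """Ultralytics only selects the Edge TPU delegate when the filename
    ends with '_edgetpu.tflite'. (Helper name is illustrative.)"""
    return model_path.endswith("_edgetpu.tflite")

# The DeGirum-exported file from the traceback fails the name check:
print(ultralytics_sees_edgetpu(
    "yolov8n_relu6_coco_pose--640x640_quant_tflite_edgetpu_1.tflite"))  # False
print(ultralytics_sees_edgetpu("yolov8n_edgetpu.tflite"))  # True
```

Even with a passing filename, though, the DeGirum export would still be incompatible with the Ultralytics built-in postprocessor, as noted above.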

kteodorovich avatar May 16 '24 21:05 kteodorovich

@shashichilappagari @kteodorovich Thank you! It looks like it was indeed a race condition with the local DeGirum setup. All of the models appear to work correctly with the localhost-based AI server as recommended. Do you have recommendations on how to set up a pose tracking algorithm on top of the pose prediction model?

han-so1omon avatar May 17 '24 15:05 han-so1omon

@han-so1omon We are glad to hear that localhost option is working with the models you need. Just to give you some background information: the pose models have a python based postprocessor and it could be causing some issues with the python interpreter running the model. In case of localhost the python interpreter running the inference and postprocessor are two separate instances and hence do not have issues. We are currently investigating how to fix the issue for dg.LOCAL use case and will let you know when we release a pysdk version that fixes the issue. Until then you can use localhost as it does not have any real performance impact.

shashichilappagari avatar May 17 '24 15:05 shashichilappagari

@han-so1omon Here is an example of how you can add tracking on top of a model: https://github.com/DeGirum/PySDKExamples/blob/main/examples/specialized/multi_object_tracking_video_file.ipynb
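The linked notebook uses a proper multi-object tracker; purely as an illustration of the underlying idea (not DeGirum's implementation), a minimal greedy centroid tracker that assigns stable IDs across frames might look like:

```python
import math
from itertools import count

class CentroidTracker:
    """Toy greedy nearest-centroid tracker, for illustration only --
    the DeGirum example notebook uses a real multi-object tracker."""

    def __init__(self, max_dist=80.0):
        self.max_dist = max_dist   # max pixels a centroid may move per frame
        self.tracks = {}           # track id -> (x, y) of last centroid
        self._ids = count()

    def update(self, centroids):
        """Match this frame's centroids to existing tracks.
        Returns a list of (track_id, centroid) pairs."""
        assigned = []
        free = dict(self.tracks)   # tracks not yet claimed this frame
        for c in centroids:
            # pick the nearest unclaimed existing track, if close enough
            best = min(free, key=lambda i: math.dist(free[i], c), default=None)
            if best is not None and math.dist(free[best], c) <= self.max_dist:
                free.pop(best)
                tid = best
            else:
                tid = next(self._ids)  # new object enters the scene
            self.tracks[tid] = c
            assigned.append((tid, c))
        return assigned
```

For pose results, each person's centroid could be computed by averaging their keypoint coordinates from the PySDK result, then fed to `update()` each frame.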

shashichilappagari avatar May 17 '24 15:05 shashichilappagari

@vlad-nn Python post-processor indeed seems to have a race condition when using dg.LOCAL option as confirmed by @han-so1omon

shashichilappagari avatar May 17 '24 15:05 shashichilappagari

@shashichilappagari Does your pose algorithm from yolo_pose support landmarks?

han-so1omon avatar May 17 '24 16:05 han-so1omon

@han-so1omon Do you mean if the tracking algorithm tracks landmarks?

shashichilappagari avatar May 17 '24 17:05 shashichilappagari

@shashichilappagari Basically, is there a way to present it as a skeleton with each part of the body denoted, like 'right ear', 'right forearm', etc.?

han-so1omon avatar May 17 '24 17:05 han-so1omon

@han-so1omon so you want the output of prediction to have a label for each keypoint?

shashichilappagari avatar May 17 '24 17:05 shashichilappagari

Basically, yes. I would like to know what part of the body the keypoint comes from
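Assuming the model follows the standard COCO 17-keypoint convention (which YOLOv8 pose models trained on COCO normally do — though note COCO has wrists and elbows, not separate 'forearm' points), each keypoint index maps to a fixed body-part name:

```python
# Standard COCO-17 keypoint order, as used by YOLOv8 pose models
# trained on COCO. Assumes the model follows this convention.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def label_keypoints(keypoints):
    """Pair each (x, y) keypoint with its COCO body-part name.
    `keypoints` is a list of 17 (x, y) tuples in COCO order."""
    return {name: xy for name, xy in zip(COCO_KEYPOINTS, keypoints)}
```

With that mapping, the keypoints in each pose detection result can be labeled by position in the list, e.g. index 4 is the right ear.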

han-so1omon avatar May 17 '24 18:05 han-so1omon