Worker can't find iHD_drv_video.so
Describe the bug
When trying to play a transcoded video via a worker, the video fails to play. Worker logs indicate it cannot find iHD_drv_video.so. When I disable ClusterPlex and just use my "normal" PMS pod, HW transcoding works fine.
Intel GPU drivers are installed via Intel device plugins Helm chart: https://intel.github.io/helm-charts/
The same issue happens whether I use the standard Plex image with DOCKER_MODS or the ClusterPlex image.
Relevant log file for worker:
[AVHWDeviceContext @ 0x7fa6496df6c0] libva: VA-API version 1.18.0
[AVHWDeviceContext @ 0x7fa6496df6c0] libva: Trying to open /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/iHD_drv_video.so
[AVHWDeviceContext @ 0x7fa6496df6c0] libva: va_openDriver() returns -1
[AVHWDeviceContext @ 0x7fa6496df6c0] libva: Trying to open /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/i965_drv_video.so
[AVHWDeviceContext @ 0x7fa6496df6c0] libva: va_openDriver() returns -1
[AVHWDeviceContext @ 0x7fa6496df6c0] Failed to initialise VAAPI connection: -1 (unknown libva error).
Device creation failed: -5.
Failed to set value 'vaapi=vaapi:/dev/dri/renderD128' for option 'init_hw_device': I/O error
Error parsing global options: I/O error
Completed transcode
Removing process from taskMap
The /config/Library/Application Support/ folder is empty, which explains why it can't find the driver. I tried placing the driver I pulled off the Plex server into the codecs PV, but it made no difference.
Environment
K3s v1.26.5+k3s1. Nodes are Beelink U59s with an Intel N5105 processor.
Is that with the worker having the FFMPEG_HWACCEL environment variable set to "vaapi"?
Yes, it is. Here's the relevant ConfigMap:
```yaml
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: clusterplex-worker-config
  namespace: media-tools
  labels:
    app.kubernetes.io/name: clusterplex-worker-config
    app.kubernetes.io/part-of: plex
data:
  TZ: America/Toronto
  PGID: '1000'
  PUID: '1000'
  VERSION: docker
  DOCKER_MODS: 'ghcr.io/pabloromeo/clusterplex_worker_dockermod:latest'
  ORCHESTRATOR_URL: 'http://clusterplex-orchestrator:3500'
  LISTENING_PORT: '3501'
  STAT_CPU_INTERVAL: '10000'
  EAE_SUPPORT: '1'
  FFMPEG_HWACCEL: 'vaapi'
```
This issue is stale because it has been open for 30 days with no activity.
I'm having the same issue.
Logging into the container, it looks like Plex isn't "fully installed": there should be a cache with the extensions in those folders. See this Reddit discussion, which describes the same error: https://www.reddit.com/r/PleX/comments/12ikwup/plex_docker_hardware_transcoding_issue/
What's odd to me is that local transcoding works; it's only on the remote workers that it fails.
@kenlasko @pabloromeo OK, I got it working. The clue was the fact that Plex didn't have its config directory set up on the worker nodes. Plex needs its configuration; otherwise it's going to fail because Plex basically isn't set up. Here's how I fixed it:
- Change the `clusterplex-config-pvc` PVC to `ReadWriteMany`
- Add the `config` mount to the `clusterplex-worker` StatefulSet, just as is already done with the PMS deployment.
Here's what my two files look like, though yours will look different depending on storage.
Clusterplex-worker
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: clusterplex-worker
  labels:
    app.kubernetes.io/name: clusterplex-worker
    app.kubernetes.io/part-of: clusterplex
spec:
  serviceName: clusterplex-worker-service
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: clusterplex-worker
      app.kubernetes.io/part-of: clusterplex
  template:
    metadata:
      labels:
        app.kubernetes.io/name: clusterplex-worker
        app.kubernetes.io/part-of: clusterplex
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    name: clusterplex-worker
                topologyKey: kubernetes.io/hostname
              weight: 100
            - podAffinityTerm:
                labelSelector:
                  matchLabels:
                    name: clusterplex-pms
                topologyKey: kubernetes.io/hostname
              weight: 50
      containers:
        - name: plex-worker
          image: lscr.io/linuxserver/plex:latest
          startupProbe:
            httpGet:
              path: /health
              port: 3501
            failureThreshold: 40
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 3501
            initialDelaySeconds: 60
            timeoutSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3501
            initialDelaySeconds: 10
            timeoutSeconds: 10
          ports:
            - name: worker
              containerPort: 3501
          envFrom:
            - configMapRef:
                name: clusterplex-worker-config
          volumeMounts:
            - name: data
              mountPath: /data
            - name: codecs
              mountPath: /codecs
            - name: data
              mountPath: /transcode
            - name: config
              mountPath: /config
          resources: # adapt requests and limits to your needs
            requests:
              cpu: 500m
              memory: 200Mi
            limits:
              gpu.intel.com/i915: 1
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: "plex-media"
        - name: config
          persistentVolumeClaim:
            claimName: "clusterplex-config-pvc"
        # - name: transcode
        #   persistentVolumeClaim:
        #     claimName: "plex-media"
  volumeClaimTemplates:
    - metadata:
        name: codecs
        labels:
          app.kubernetes.io/name: clusterplex-codecs-pvc
          app.kubernetes.io/part-of: clusterplex
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 1Gi
        # specify your storage class
        storageClassName: longhorn
```
clusterplex-config-pvc
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clusterplex-config-pvc
  labels:
    app.kubernetes.io/name: clusterplex-config-pvc
    app.kubernetes.io/part-of: clusterplex
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: "10Gi"
  # specify your storage class
  storageClassName: longhorn
```
I see! Yeah, the fact that Plex is not set up on the workers is actually intentional. It shouldn't really be necessary, since the intention is to only use the Plex transcoder (their fork of FFmpeg) without interacting with the local Plex files. We use their base image to avoid redistributing their transcoder ourselves, but Plex doesn't actually run on the worker. It's odd that it wants to use drivers from Plex's cache instead of the ones you installed on the node.
The reason we don't recommend sharing Plex's config that way, via network shares, is that Plex uses SQLite as its database, which does not play well with network shares, and Longhorn's RWX is implemented with NFS behind the scenes. You might end up corrupting the database or seeing odd issues.
Maybe you can mount JUST the cache location to avoid any DB corruption, meaning just sharing /config/Library/Application Support/Plex Media Server/Cache/ or /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/.
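As a sketch of that idea (the volume and claim names here are hypothetical; the claim would need to be ReadWriteMany and also mounted by PMS), the worker spec could mount only the Cache directory:

```yaml
# Hypothetical worker volume setup: share only Plex's Cache directory,
# leaving the rest of /config (including the SQLite databases) local to PMS.
volumeMounts:
  - name: plex-cache
    mountPath: /config/Library/Application Support/Plex Media Server/Cache
volumes:
  - name: plex-cache
    persistentVolumeClaim:
      claimName: plex-cache-pvc   # hypothetical RWX claim, also mounted by PMS
```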
I'll see if I can set up a physical environment similar to yours to find a way around that. Maybe driver paths must be rewritten or something like that. I know others are running it with Intel drivers on k8s, but I'm not aware whether they needed this same workaround.
@pabloromeo excellent, I've been thinking about potential issues with my setup and what you've said makes sense. I'll try to see if I can do just the cache.
I mounted Plex config in a different directory, then exec'd into the container and copied just the cache. No go, it throws errors.
[AVHWDeviceContext @ 0x7fdfdb7b2980] libva: Trying to open /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/iHD_drv_video.so
[AVHWDeviceContext @ 0x7fdfdb7b2980] libva: va_openDriver() returns -1
[AVHWDeviceContext @ 0x7fdfdb7b2980] libva: Trying to open /config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64/i965_drv_video.so
[AVHWDeviceContext @ 0x7fdfdb7b2980] libva: va_openDriver() returns -1
[AVHWDeviceContext @ 0x7fdfdb7b2980] Failed to initialise VAAPI connection: -1 (unknown libva error).
Device creation failed: -5.
Failed to set value 'vaapi=vaapi:/dev/dri/renderD128' for option 'init_hw_device': I/O error
Error parsing global options: I/O error
Completed transcode
Removing process from taskMap
After that, I copied everything from the temp folder and hardware transcoding works fine.
We might actually be running into Plex requiring a premium subscription and a claim token to run HW transcoding.
Another variant I tried was adding the Plex config as read-only; unfortunately, the workers then can't start because they can't run the fix-permissions scripts that run on startup.
I'm doing a Helm chart deployment and ran into this issue. I had already customized the charts to pass the HW transcoding variable to the workers via env in the config, so I also customized them to include the config mount, and it no longer errors. I'm not too knowledgeable about editing Helm charts or Plex, but what if we made the directory or files containing the SQLite DBs mount read-only?
Hello, I just started using this and came across this issue while verifying settings for HW Transcode on my NUC cluster.
Thanks for finding this issue before I experienced it :)
@todaywasawesome, I noticed the iHD_drv_video.so you referenced wasn't actually in Plex Media Server/Cache, but symlinked there from Plex Media Server/Drivers/imd-74-linux-x86_64/dri/iHD_drv_video.so.
To share the Cache and Drivers folders with the workers as read-only, while excluding the rest of the config so as not to disturb the DB, I have:
- Left the existing config PVC as ReadWriteOnce and NOT mounted it to the workers
- Created additional tiny PVCs for Cache and Drivers, mounted on the PMS and worker containers in the appropriate locations, read-only on the worker nodes. 1Gi would be overkill, but I did 5Gi just in case.
Additional Cache and Driver PVC
```yaml
---
# cluster-plex_cache-pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clusterplex-cache-pvc
  namespace: plex-ns
  labels:
    app.kubernetes.io/name: clusterplex-cache-pvc
    app.kubernetes.io/part-of: clusterplex
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: longhorn
---
# cluster-plex_drivers-pvc.yml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: clusterplex-drivers-pvc
  namespace: plex-ns
  labels:
    app.kubernetes.io/name: clusterplex-drivers-pvc
    app.kubernetes.io/part-of: clusterplex
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: longhorn
```
Worker (PMS is the same, minus the `readOnly: true` on the spec.volumes):
```yaml
containers:
  - name: plex-worker
    image: lscr.io/linuxserver/plex:latest
    startupProbe:
      httpGet:
        path: /health
        port: 3501
      failureThreshold: 40
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /health
        port: 3501
      initialDelaySeconds: 60
      timeoutSeconds: 5
    livenessProbe:
      httpGet:
        path: /health
        port: 3501
      initialDelaySeconds: 10
      timeoutSeconds: 10
    ports:
      - name: worker
        containerPort: 3501
    envFrom:
      - configMapRef:
          name: clusterplex-worker-config
    volumeMounts:
      - name: media
        mountPath: /mnt/media
      - name: codecs
        mountPath: /codecs
      - name: transcode
        mountPath: /transcode
      - name: cache
        mountPath: /config/Library/Application Support/Plex Media Server/Cache
      - name: drivers
        mountPath: /config/Library/Application Support/Plex Media Server/Drivers
    resources: # adapt requests and limits to your needs
      requests:
        cpu: 500m
        memory: 200Mi
        gpu.intel.com/i915: "1"
      limits:
        cpu: 2000m
        memory: 2Gi
        gpu.intel.com/i915: "1"
volumes:
  - name: media
    nfs:
      path: /mediastuff
      server: myserver.example.local
  - name: transcode
    persistentVolumeClaim:
      claimName: "clusterplex-transcode-pvc"
  - name: codecs
    persistentVolumeClaim:
      claimName: "clusterplex-codec-pvc"
  - name: cache
    persistentVolumeClaim:
      claimName: "clusterplex-cache-pvc"
      readOnly: true
  - name: drivers
    persistentVolumeClaim:
      claimName: "clusterplex-drivers-pvc"
      readOnly: true
```
Folders mounted inside the worker, with a touch test to verify read-only:
root@clusterplex-worker-0:/# ls -al /config/Library/Application\ Support/Plex\ Media\ Server/
total 10
drwxr-xr-x 4 abc abc 4096 Sep 11 13:43 .
drwxr-xr-x 3 abc abc 4096 Sep 11 13:43 ..
drwxrwxrwx 8 abc abc 1024 Sep 11 13:54 Cache
drwxrwxrwx 3 abc abc 1024 Sep 11 13:43 Driver
root@clusterplex-worker-0:/# touch /config/Library/Application\ Support/Plex\ Media\ Server/Cache/test
touch: cannot touch '/config/Library/Application Support/Plex Media Server/Cache/test': Read-only file system
Remote VAAPI Transcode Success:
JobPoster connected, announcing
Orchestrator requesting pending work
Sending request to orchestrator on: http://clusterplex-orchestrator:3500
Remote Transcoding was successful
Calling external transcoder: /app/transcoder.js
ON_DEATH: debug mode enabled for pid [1977]
Local Relay enabled, traffic proxied through PMS local port 32499
Setting VERBOSE to ON
Sending request to orchestrator on: http://clusterplex-orchestrator:3500
cwd => "/transcode/Transcode/Sessions/plex-transcode-ba2f8489-11e0-4fab-b08d-31f4b42686ae-6c51bcab-01cf-4780-b61e-b99f21fb343a"
args =>
....BLAHBLAHBLAHBLAH...
"LIBVA_DRIVERS_PATH":"/config/Library/Application Support/Plex Media Server/Cache/va-dri-linux-x86_64"
...BLAHBLAHBLAHBLAH...
FFMPEG_HWACCEL":"vaapi"
...BLAHBLAHBLAH...
"FFMPEG_EXTERNAL_LIBS":"/config/Library/Application\\ Support/Plex\\ Media\\ Server/Codecs/8217c1c-4578-linux-x86_64/","TRANSCODER_VERBOSE":"1"}
Hope this helps
@audiophonicz that's an extremely clever approach, love it! :)
Now, I've finally set up a similar environment to test this, have been seeing the same issue, and have been trying to identify a few workarounds. I believe there may be a problem with this approach of depending on data from the main PMS: it would only work if the machine running PMS also has the same hardware, meaning an Intel iGPU as well. It seems that Plex creates the contents of its Drivers directory during initialization, based on the hardware available.
If that's the case, there may be one other alternative approach that doesn't depend on sharing Drivers and the Cache between PMS and the workers: initialize PMS on the workers at startup and then kill it once the local config has been created (I believe the linuxserver image does something along those lines too), so that the drivers for the worker's hardware are downloaded.
I've tried it manually and it appears to work; however, we have to be careful, as we can only do this if the config is NOT being shared with the main PMS. I'm guessing it could destroy or corrupt the real config, so this only applies to a standalone worker that is not sharing configs as shown above.
If this works out, I may add an optional parameter to force a PMS initialization on the workers, but the default will be to not do it, to avoid breaking working installations like the ones mentioned above.
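A rough sketch of what that init-then-kill step could look like as an init container (this is my own assumption, not the eventual ClusterPlex parameter; the PMS binary path follows the linuxserver.io image layout, and the readiness loop timing is a guess):

```yaml
# Hypothetical sketch only: warm up PMS once on a *standalone* worker whose
# config is NOT shared with the main PMS, so it downloads Drivers for this
# node's hardware, then exits.
initContainers:
  - name: pms-warmup
    image: lscr.io/linuxserver/plex:latest
    command:
      - /bin/sh
      - -c
      - |
        "/usr/lib/plexmediaserver/Plex Media Server" &
        pms_pid=$!
        # Wait (up to ~2 minutes, an assumed timeout) for the Drivers dir.
        for i in $(seq 1 60); do
          [ -d "/config/Library/Application Support/Plex Media Server/Drivers" ] && break
          sleep 2
        done
        kill "$pms_pid" 2>/dev/null || true
    volumeMounts:
      - name: config   # the worker-local config volume, not the shared PMS one
        mountPath: /config
```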
Now, a question for you @audiophonicz and @todaywasawesome: when HW transcoding on the worker with your working setups, does Plex show that it's being transcoded in HW, or is it oblivious to it? In my initial test it just says "Transcode", not "Transcode (hw)".
@pabloromeo It's been "Transcode (hw)" for me, making sure to mount the needed hardware of course.
I do have a concern that it might be limited by license. HW transcoding is a premium feature, so if Plex doesn't initialize with a premium account, it wouldn't enable HW transcoding. Might be able to use a claim token.
So, weird update: my method works, but ONLY if the worker container is on the same physical node as the PMS container. There's no difference in the logs until it actually connects and starts to stream; then the remote workers simply kill the child process. I can even see the tile flash up in the PMS dashboard for half a second before it disappears and tries another worker. When it finally gets to the worker on the same physical node, the logs pick up from "segment:chunk-00000" and it starts playing.
[tcp @ 0x7ff2039fd440] Successfully connected to 10.10.2.20 port 32499
[AVIOContext @ 0x7ff203887cc0] Statistics: 57 bytes written, 0 seeks, 1 writeouts
[segment @ 0x7ff20d4356c0] segment:'chunk-00000' starts with packet stream:0 pts:274024 pts_time:274.024 frame:0
Killing child processes for task 35326182-1edb-49d9-86a4-9079d2e90e3d
Removing process from taskMap
@todaywasawesome can you confirm you can transcode on a worker container on a different physical node than PMS when sharing the entire config? I'm thinking you're right about the Plex Pass thing, and mine is matching the IP or something and only allowing it on the same node.
@pabloromeo
Yes, I have 6 identical nodes, so I was counting on PMS downloading the driver for my workers. Your quick-init approach might be a better direction, but if server config and the existence of Plex Pass are indeed interfering with HW transcoding on the remote workers, then a driver download alone might not work.
Also, while transcoding on Worker-1
Can you check the logs on the workers? That might shed some light on what's going on.
Regarding Plex Pass, it's hard to say how they validate it. The X_PLEX_TOKEN should be reaching the worker, and I believe it gets validated by a callback to PMS (through the relay), unless something within that flow is broken. But without errors in the logs it's quite difficult to identify. Maybe enable debug logging in Plex itself and watch the messages in its UI console.
I'll share my logs soon. My cluster is down due to ISP issues at the moment.
TL;DR: I got remote HW transcoding working pretty reliably by flipping my original workaround: give the workers the entire /config PVC without readOnly (so far), but sub-mount the /Plugin Support/ dir (with the databases and whatnot) as a separate ReadWriteOnce PVC on only the PMS container. One thing I still have to work out is the pid file overwrite.
Long: OK, so some weird stuff happened after my last post. Sixty seconds after I commented, one of the workers (on another node) got stuck and was the only worker being used, but HW transcodes not only worked, they were damn near instant. Unfortunately, after restarting that pod all that went away, but it led me to the other issue I opened about transcode processes not stopping.
Anyway, I made some progress on my remote HW transcodes. Providing just the drivers for HW transcode doesn't seem to be enough, as it would only work on the same machine as my PMS pod. Seeing that it seemed to work for todaywasawesome by sharing the whole config dir, which happens to contain a token file and the Preferences.xml with the machine ID UUID, I tried his method, and was riddled with "SQLite db slow; waiting" or similar logs. So I flipped my original method and created a single additional PVC just for the databases in the /Plugin Support/ folder, essentially carving them out of the main /config folder, and it seems to have worked.
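A sketch of that carve-out (claim names here are hypothetical, and the overlay path uses Plex's "Plug-in Support" directory name): the whole config sits on the shared RWX claim, while a separate RWO claim holding the databases is overlaid on the PMS container only:

```yaml
# Hypothetical PMS-side mounts: the shared RWX config, with a dedicated RWO
# claim overlaid on the database folder so SQLite never lives on the network
# share the workers read from.
volumeMounts:
  - name: config
    mountPath: /config
  - name: plugin-support
    mountPath: /config/Library/Application Support/Plex Media Server/Plug-in Support
volumes:
  - name: config
    persistentVolumeClaim:
      claimName: clusterplex-config-pvc   # ReadWriteMany, also mounted by workers
  - name: plugin-support
    persistentVolumeClaim:
      claimName: clusterplex-db-pvc       # hypothetical ReadWriteOnce claim, PMS only
```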
I am currently running 7 plays simultaneously across 3 workers: 3x direct play HEVC10, 3x HEVC10 SW decode > H264 HW encode, 1x HEVC8 HW decode > H264 HW encode.
I apparently have a bunch of devices that can't HW decode HEVC10, and it really pushes my little i3-6100U nodes, so they take a good 30-45 seconds to start playing, but it does work. Every now and then one play will freeze or fail and need to retry (pretty sure it's HEVC10 wreaking havoc), but for the most part auto-play-next and seeks are working as well. 99% of my stuff is H264; I only found one title each with HEVC8 and HEVC10, so I should be good with this setup.
I do still want to try to separate out the pid file so the workers aren't constantly deleting and overwriting each other's pid file. It doesn't seem to hurt right now, but it's not optimal.
The workaround I tried is copying the folder over manually from a temp config directory to the config directory. That way the worker can do whatever it wants with the local DB; it's trashed anyway.
Still not great.
OK guys, I need some insight here. I still, for the life of me, can't get a worker to play if it's not on the same node as PMS. It's driving me mad.
The weird thing is, if both PMS and Worker-0 are on NODE1, direct plays will direct play and transcodes will transcode, HW or SW; life is good.
If I simply move PMS to NODE2 while Worker-0 is on NODE1, all plays break. Direct plays try to transcode, and all transcodes fail. It's not the /config dir. It's not the /transcode or /codecs RWX speeds. It's purely whether they're on the same host or not, and I can't figure out what it's using.
My only idea left is that the transcode job is using http://127.0.0.1 for the video transcode sessions and it's not translating across pods/nodes:
[Req#745a/Transcode/JobRunner] Job running: FFMPEG_EXTERNAL_LIBS='/config/Library/Application\ Support/Plex\ Media\ Server/Codecs/8217c1c-4578-linux-x86_64/' X_PLEX_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx "/usr/lib/plexmediaserver/Plex Transcoder" -codec:0 mp3 -analyzeduration 20000000 -probesize 20000000 -i "/config/Library/Application Support/Plex Media Server/Metadata/TV Shows/3/d5dad9b0d635ffd439712c5dfd135b86a523101.bundle/Contents/_combined/themes/tv.plex.agents.series_e6fccc112eb130590ea2d245d869fedce8d276e9" -filter_complex "[0:0] aresample=async=1:ochl='stereo':rematrix_maxval=0.000000dB:osr=48000:rematrix_volume=-25.000000dB[0]" -map "[0]" -codec:0 libmp3lame -q:0 0 -f segment -segment_format mp3 -segment_time 1 -segment_header_filename header -segment_start_number 0 -segment_list "http://127.0.0.1:32400/video/:/transcode/session/3718081b-027b-4a0f-b1a1-fb99766945bf-64/cce15ce6-6af7-46fd-abbf-dd42e5e4609b/manifest?X-Plex-Http-Pipeline=infinite" -segment_list_type csv -segment_list_unfinished 1 -segment_list_size 5 -segment_list_separate_stream_times 1 -map_metadata -1 -map_chapters -1 "chunk-%05d" -y -nostats -loglevel quiet -loglevel_plex error -progressurl http://127.0.0.1:32400/video/:/transcode/session/3718081b-027b-4a0f-b1a1-fb99766945bf-64/cce15ce6-6af7-46fd-abbf-dd42e5e4609b/progress
> My only idea left is that the transcode job is using http://127.0.0.1 for the video transcode sessions and it's not translating across pods/nodes:
Plex definitely uses a loopback network for transcodes. On my FreeBSD Plex jail, if I don't give it a loopback address, direct plays are fine but transcodes fail (regardless of whether it needs to transcode audio or video). The address I give it is not 127.0.0.1, but it finds it okay.
If direct plays aren't working for you, I'm not sure this is the same problem, but it very well might be. Also, maybe the direct play you tested was transcoding audio?
> OK guys, I need some insight here. I still, for the life of me, can't get a worker to play if it's not on the same node as PMS. It's driving me mad.
Honestly, I think this is probably a different issue, perhaps Plex network configuration. This issue is just about hardware transcoding failing; if you're not getting workers to transcode at all, that's a more fundamental problem.
> Plex definitely uses a loopback network for transcodes. On my FreeBSD Plex jail, if I don't give it a loopback address, direct plays are fine but transcodes fail (regardless of whether it needs to transcode audio or video). The address I give it is not 127.0.0.1, but it finds it okay.
Thank you for your reply, but my question is specifically about HW transcoding across physically separate Kubernetes nodes, and I'm not sure how a FreeBSD jail pertains. I don't see anywhere in this chart for transcode network settings, so I'm not sure what this "it" is that you're giving a loopback address.
I'm still looking for someone who has HW transcoding working across two physically separate nodes, and what their Plex network settings are for subnets and URL.
Sorry for the confusion; the long and short of it is yes, that's where Plex communicates with the transcoder. The transcoder stub here remaps that to a different container, and the nginx proxy passes it back in.
If direct play and SW transcoding are also failing, your issue isn't really about HW transcoding; something else is broken in the orchestration of the transcoder requests.
Same issue here (Dockermod on unprivileged LXC on Proxmox).
Mounting /config/Library/Application Support/Plex Media Server/Cache and /config/Library/Application Support/Plex Media Server/Drivers inside the workers did the trick.
Thanks!
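For a plain Docker setup, the same trick might look like this compose sketch (host paths are assumptions, and the worker here is the linuxserver image with the ClusterPlex worker docker mod applied, as in the ConfigMap above):

```yaml
# Hypothetical docker-compose excerpt: expose PMS's Cache and Drivers folders
# to the worker read-only, without sharing the rest of /config.
services:
  plex:
    image: lscr.io/linuxserver/plex:latest
    volumes:
      - ./plex-config:/config
  plex-worker:
    image: lscr.io/linuxserver/plex:latest
    environment:
      - DOCKER_MODS=ghcr.io/pabloromeo/clusterplex_worker_dockermod:latest
    volumes:
      - "./plex-config/Library/Application Support/Plex Media Server/Cache:/config/Library/Application Support/Plex Media Server/Cache:ro"
      - "./plex-config/Library/Application Support/Plex Media Server/Drivers:/config/Library/Application Support/Plex Media Server/Drivers:ro"
```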
Remapping just Drivers and Cache as RWX across PMS and the workers fixed this issue for me.
Here to report a different setup that suffers from the same issue:
- NAS host with transcode and media shares exposed over NFS
- Separate host running a docker-compose stack of one PMS instance, one worker, and one orchestrator (no Swarm)
- Transcode and media directories mounted over NFS as instructed (read and write)
- Worker HW transcode fails (Intel iGPU), while "local" HW transcode succeeds (same physical Intel iGPU)
This issue was closed because it has been inactive for 14 days since being marked as stale.