
Server-side RPC not working in v3.6.x CE MSA

Open • maghibus opened this issue • 6 comments

Describe the bug
RPC messages are not queued or sent to devices.

Your Server Environment

  • own setup
    • Deployment: microservices (tb-core x2, tb-rule-engine x2, tb-mqtt-transport x2)
    • Deployment type: k8s
    • ThingsBoard Version: 3.6.2 - 3.6.4
    • Community

Your Device

  • Connectivity
    • Gateway/Devices

To Reproduce

  1. Send a persistent RPC command to /api/rpc/oneway/{DEVICE-ID} (the example below uses the two-way variant, /api/rpc/twoway/{DEVICE-ID}):

curl -X 'POST' \
  'https://tb-localhost:443/api/rpc/twoway/c3150ff0-2a5c-11ef-802d-4faa3852181b' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H 'X-Authorization: Bearer eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiJhZG1pbkB0ZW5hbnQtMS5pdCIsInVzZXJJZCI6IjVjZDhiYWMwLTJhNWMtMTFlZi04MDJkLTRmYWEzODUyMTgxYiIsInNjb3BlcyI6WyJURU5BTlRfQURNSU4iXSwic2Vzc2lvbklkIjoiNDFmYjgyOWUtNjQ1ZC00ZWE5LTllYzQtMjZlYjYxZmZlYmU2IiwiaXNzIjoidGhpbmdzYm9hcmQuaW8iLCJpYXQiOjE3MTgzNzYxODMsImV4cCI6MTcxODM4NTE4MywiZW5hYmxlZCI6dHJ1ZSwiaXNQdWJsaWMiOmZhbHNlLCJ0ZW5hbnRJZCI6IjRjMjljYjYwLTJhNWMtMTFlZi04MDJkLTRmYWEzODUyMTgxYiIsImN1c3RvbWVySWQiOiIxMzgxNDAwMC0xZGQyLTExYjItODA4MC04MDgwODA4MDgwODAifQ.8lEqCnA8DO4FxMmjvYszlxLHcVkR1TKXC08nJhF6CGn_KoEEGJn91S_6lpHQud8Y1GpcsfXKz4Nps-e83n8BxQ' \
  -d '{
    "method": "setGpio",
    "params": { "pin": 7, "value": 1 },
    "persistent": true,
    "retries": 1,
    "expirationTime": 1718402400000
  }'

  2. Check the RPC status via /api/rpc/persistent/device/{DEVICE-ID}?pageSize=10&page=0&sortProperty=createdTime&sortOrder=DESC (see the sketch below).
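
For reference, step 2 as a concrete request (a minimal sketch, assuming the same host, device ID, and token as in step 1; <JWT> stands in for the bearer token):

curl -X 'GET' \
  'https://tb-localhost:443/api/rpc/persistent/device/c3150ff0-2a5c-11ef-802d-4faa3852181b?pageSize=10&page=0&sortProperty=createdTime&sortOrder=DESC' \
  -H 'accept: application/json' \
  -H 'X-Authorization: Bearer <JWT>'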

What happens
The dashboard shows an HTTP 500 error instead of 200; the message is not queued and the RPC is not stored in the database. The API response is:

{ "status": 500, "message": "Request timeout", "errorCode": 2, "timestamp": 1718381143859 }

Expected behavior
RPC messages are queued, and the API response should be of the following form:

{
  "rpcId": "ff6c2058-6fcf-4be1-ab29-8fd172875ef6"
}

The correct expected result was obtained in single-node mode (1x tb-node). [screenshot omitted]

Additional context
In summary, it seems that RPC commands are processed correctly only when the deployment is a single tb-node: no separate rule engine, and tb-node scaled to 1 replica.

maghibus commented Jun 13, 2024

Same problem with version 3.6.4.

maghibus commented Jun 14, 2024

After investigating a bit, I hypothesized it could depend on the Kafka topic partitions, so I set TB_QUEUE_CORE_PARTITIONS=1; with that, the deployment without a rule engine also works with tb-core x2.
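
As a sanity check, the effective topic layout can be inspected with the standard Kafka CLI (a sketch; it assumes ThingsBoard's default core topic name tb_core, uses the broker address from the manifests below, and <one-of-the-listed-topics> is a placeholder):

# list the core topics ThingsBoard created
kafka-topics.sh --bootstrap-server my-kafka.middleware:9092 --list | grep tb_core
# describe one of them to see its partition count and replication
kafka-topics.sh --bootstrap-server my-kafka.middleware:9092 --describe --topic <one-of-the-listed-topics>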

maghibus commented Jun 16, 2024

Hi!

{ "status": 500, "message": "Request timeout", "errorCode": 2, "timestamp": 1718381143859 }

HTTP error 500 means there is an issue in the backend services; if you see such an error in the UI, you should check the service/container logs.
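
For example (a sketch; the namespace and pod names are taken from the StatefulSets shared later in this thread):

kubectl logs -n tb tb-node-0 --tail=500 | grep -iE 'error|timeout'
kubectl logs -n tb tb-rule-engine-0 --tail=500 | grep -iE 'error|timeout'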

> in summary, it seems that RPC commands are processed correctly if the deployment is of a single tb-node: no rule engine and the tb-node must be scaled to 1

If you downscale your services to 1 and the issue resolves, you may not have set up your services properly: you need to ensure that all tb-core and tb-rule-engine instances share the same Zookeeper and Redis.
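
One way to verify this on the ZooKeeper side (a hypothetical check, assuming the default ThingsBoard ZooKeeper directory /thingsboard and ZooKeeper 3.5+ for ls -R): every tb-core and tb-rule-engine instance should appear under the same tree.

# list the ThingsBoard coordination tree; all nodes should register here
zkCli.sh -server my-zookeeper.middleware:2181 ls -R /thingsboard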

> TB_QUEUE_CORE_PARTITIONS=1 and the deployment without rule engine also works with tb-core x2

You have scaled the queue partitions to 1, which means only one node (core/rule-engine) can process the messages; see my advice above.

trikimiki commented Jun 20, 2024

Hi, I confirm that Zookeeper, Redis, and Kafka are shared.

I know; in fact, to make it work with n tb-cores and no rule engine, I also set partitions:10 in the Kafka topic properties:

TB_QUEUE_KAFKA_CORE_TOPIC_PROPERTIES = retention.ms:604800000;segment.bytes:26214400;retention.bytes:1048576000;partitions:10;min.insync.replicas:1
TB_QUEUE_CORE_PARTITIONS = 1

Sorry for the question, but does it work for you? Have you tried it?

maghibus commented Jun 20, 2024

Could you share your deployment files for when it works and when it doesn't?

RPC most definitely works with MSA; still, there are countless ways of deploying MSA, so this might be a corner case of some sort.

trikimiki commented Jun 24, 2024


| tb-core              | tb-rule-engine | Kafka topic configuration | result |
|----------------------|----------------|---------------------------|--------|
| >= 2 pods            | >= 2 pods      | (default)                 | KO     |
| >= 2 pods (monolith) | 0 pods         | (default)                 | KO     |
| 1 pod (monolith)     | 0 pods         | (default)                 | OK     |
| >= 2 pods (monolith) | 0 pods         | TB_QUEUE_CORE_PARTITIONS=1, TB_QUEUE_KAFKA_CORE_TOPIC_PROPERTIES=retention.ms:604800000;segment.bytes:26214400;retention.bytes:1048576000;partitions:10;min.insync.replicas:1 | OK |

Below are the configurations that do not work (in all cases, Redis, Kafka, and Zookeeper are shared):

  • one or more tb-core, one or more tb-rule-engine
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: tb
    meta.helm.sh/release-namespace: tb
  labels:
    app.kubernetes.io/instance: tb
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: iot-platform
    app.kubernetes.io/version: 3.6.2
    helm.sh/chart: iot-platform-1.0.0
  name: tb-node
  namespace: tb
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: tb-node
      app.kubernetes.io/component: Core_Microservices
      app.kubernetes.io/instance: tb-node
      app.kubernetes.io/name: iot-platform-node
      app.kubernetes.io/part-of: Thingsboard
      app.kubernetes.io/version: 3.6.2
  serviceName: tb-node-headless
  template:
    metadata:
      annotations:
        cattle.io/timestamp: "2024-06-14T16:01:25Z"
      creationTimestamp: null
      labels:
        app: tb-node
        app.kubernetes.io/component: Core_Microservices
        app.kubernetes.io/instance: tb-node
        app.kubernetes.io/name: iot-platform-node
        app.kubernetes.io/part-of: Thingsboard
        app.kubernetes.io/version: 3.6.2
    spec:
      affinity:
        podAffinity: {}
        podAntiAffinity: {}
      containers:
      - env:
        - name: HTTP_LOG_CONTROLLER_ERROR_STACK_TRACE
          value: "false"
        - name: TB_SERVICE_TYPE
          value: tb-core
        - name: HTTP_ENABLED
          value: "false"
        - name: MQTT_ENABLED
          value: "true"
        - name: COAP_ENABLED
          value: "false"
        - name: SNMP_ENABLED
          value: "false"
        - name: LWM2M_ENABLED
          value: "false"
        - name: TB_QUEUE_TYPE
          value: kafka
        - name: TB_KAFKA_SERVERS
          value: my-kafka.middleware:9092
        - name: TB_QUEUE_KAFKA_REPLICATION_FACTOR
          value: "3"
        - name: TB_KAFKA_BATCH_SIZE
          value: "65536"
        - name: TB_KAFKA_LINGER_MS
          value: "5"
        - name: TB_KAFKA_COMPRESSION_TYPE
          value: gzip
        - name: TB_QUEUE_KAFKA_MAX_POLL_RECORDS
          value: "4096"
        - name: ZOOKEEPER_ENABLED
          value: "true"
        - name: ZOOKEEPER_URL
          value: my-zookeeper.middleware:2181
        - name: SPRING_DATASOURCE_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: tb-postgre-secret
        - name: SPRING_DATASOURCE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: tb-postgre-secret
        - name: TB_QUEUE_KAFKA_CONFLUENT_SECURITY_PROTOCOL
          value: SASL_PLAINTEXT
        - name: TB_QUEUE_KAFKA_USE_CONFLUENT_CLOUD
          value: "true"
        - name: TB_QUEUE_KAFKA_CONFLUENT_SASL_JAAS_CONFIG
          value: org.apache.kafka.common.security.plain.PlainLoginModule required
            username="user1" password="ni0vHNQEjk";
        - name: REDIS_PASSWORD
          value: zJFsDycY1p
        - name: CACHE_TYPE
          value: redis
        - name: REDIS_HOST
          value: my-redis-master.middleware
        envFrom:
        - configMapRef:
            name: tb-node-db-config
            optional: false
        image: docker.io/thingsboard/tb-node:3.6.2
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /login
            port: http
            scheme: HTTP
          initialDelaySeconds: 360
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 10
        name: iot-platform
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        - containerPort: 7070
          name: rpc
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /login
            port: http
            scheme: HTTP
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources:
          limits:
            cpu: "1"
            memory: 1000Mi
          requests:
            cpu: "1"
            memory: 1000Mi
        securityContext: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /config
          name: tb-node-config
        - mountPath: /var/log/thingsboard
          name: tb-node-logs
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sh
        - -c
        - echo "Waiting for PostgreSQL to launch..."; until nc -w 1 -z my-postgresql.db
          5432; do echo "Waiting for PostgreSQL"; sleep 1; done; echo "PostgreSQL
          launched";
        image: busybox:1.36
        imagePullPolicy: IfNotPresent
        name: wait-for-postgre
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - command:
        - sh
        - -c
        - echo "Waiting for zk to launch..."; until nc -w 1 -z my-zookeeper.middleware
          2181; do echo "Waiting for zk"; sleep 1; done; echo "Zookeeper launched";
        image: busybox:1.36
        imagePullPolicy: IfNotPresent
        name: wait-for-zookeeper
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - command:
        - sh
        - -c
        - echo "Waiting for Kafka to launch..."; until nc -w 1 -z my-kafka.middleware
          9092; do echo "Waiting for kafka"; sleep 1; done; echo "Kafka launched";
        image: busybox:1.36
        imagePullPolicy: IfNotPresent
        name: wait-for-kafka
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: tb-iot-platform
      serviceAccountName: tb-iot-platform
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: conf
            path: thingsboard.conf
          - key: logback
            path: logback.xml
          name: tb-node-config
        name: tb-node-config
      - emptyDir: {}
        name: tb-node-logs
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate

apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: tb
    meta.helm.sh/release-namespace: tb
  generation: 41
  labels:
    app.kubernetes.io/instance: tb
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: iot-platform
    app.kubernetes.io/version: 3.6.2
    helm.sh/chart: iot-platform-1.0.0
  name: tb-rule-engine
  namespace: tb
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: tb-rule-engine
      app.kubernetes.io/component: Rule_Engine_Microservices
      app.kubernetes.io/instance: tb-rule-engine
      app.kubernetes.io/name: iot-platform-rule-engine
      app.kubernetes.io/part-of: Thingsboard
      app.kubernetes.io/version: 3.6.2
  serviceName: tb-rule-engine-headless
  template:
    metadata:
      annotations:
        cattle.io/timestamp: "2024-06-14T13:51:20Z"
      creationTimestamp: null
      labels:
        app: tb-rule-engine
        app.kubernetes.io/component: Rule_Engine_Microservices
        app.kubernetes.io/instance: tb-rule-engine
        app.kubernetes.io/name: iot-platform-rule-engine
        app.kubernetes.io/part-of: Thingsboard
        app.kubernetes.io/version: 3.6.2
    spec:
      affinity:
        podAffinity: {}
        podAntiAffinity: {}
      containers:
      - env:
        - name: TB_SERVICE_TYPE
          value: tb-rule-engine
        - name: HTTP_LOG_CONTROLLER_ERROR_STACK_TRACE
          value: "false"
        - name: HTTP_ENABLED
          value: "false"
        - name: MQTT_ENABLED
          value: "false"
        - name: COAP_ENABLED
          value: "false"
        - name: SNMP_ENABLED
          value: "false"
        - name: LWM2M_ENABLED
          value: "false"
        - name: TB_QUEUE_TYPE
          value: kafka
        - name: TB_KAFKA_SERVERS
          value: my-kafka.middleware:9092
        - name: TB_QUEUE_KAFKA_REPLICATION_FACTOR
          value: "3"
        - name: TB_KAFKA_BATCH_SIZE
          value: "65536"
        - name: TB_KAFKA_LINGER_MS
          value: "5"
        - name: TB_KAFKA_COMPRESSION_TYPE
          value: gzip
        - name: TB_QUEUE_KAFKA_MAX_POLL_RECORDS
          value: "4096"
        - name: ZOOKEEPER_ENABLED
          value: "true"
        - name: ZOOKEEPER_URL
          value: my-zookeeper.middleware:2181
        - name: SPRING_DATASOURCE_USERNAME
          valueFrom:
            secretKeyRef:
              key: username
              name: tb-postgre-secret
        - name: SPRING_DATASOURCE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: password
              name: tb-postgre-secret
        - name: TB_QUEUE_KAFKA_CONFLUENT_SASL_JAAS_CONFIG
          value: org.apache.kafka.common.security.plain.PlainLoginModule required
            username="user1" password="ni0vHNQEjk";
        - name: TB_QUEUE_KAFKA_USE_CONFLUENT_CLOUD
          value: "true"
        - name: TB_QUEUE_KAFKA_CONFLUENT_SECURITY_PROTOCOL
          value: SASL_PLAINTEXT
        - name: CACHE_TYPE
          value: redis
        - name: REDIS_HOST
          value: my-redis-master.middleware
        - name: REDIS_PASSWORD
          value: zJFsDycY1p
        envFrom:
        - configMapRef:
            name: tb-node-db-config
        image: docker.io/thingsboard/tb-node:3.6.2
        imagePullPolicy: IfNotPresent
        name: iot-platform
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        - containerPort: 7070
          name: rpc
          protocol: TCP
        resources:
          limits:
            cpu: "1"
            memory: 1000Mi
          requests:
            cpu: "1"
            memory: 1000Mi
        securityContext: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /config
          name: tb-node-config
        - mountPath: /var/log/thingsboard
          name: tb-node-logs
      dnsPolicy: ClusterFirst
      initContainers:
      - command:
        - sh
        - -c
        - echo "Waiting for PostgreSQL to launch..."; until nc -w 1 -z my-postgresql.db
          5432; do echo "Waiting for PostgreSQL"; sleep 1; done; echo "PostgreSQL
          launched";
        image: busybox:1.36
        imagePullPolicy: IfNotPresent
        name: wait-for-postgre
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - command:
        - sh
        - -c
        - echo "Waiting for zk to launch..."; until nc -w 1 -z my-zookeeper.middleware
          2181; do echo "Waiting for zk"; sleep 1; done; echo "Zookeeper launched";
        image: busybox:1.36
        imagePullPolicy: IfNotPresent
        name: wait-for-zookeeper
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      - command:
        - sh
        - -c
        - echo "Waiting for Kafka to launch..."; until nc -w 1 -z my-kafka.middleware
          9092; do echo "Waiting for kafka"; sleep 1; done; echo "Kafka launched";
        image: busybox:1.36
        imagePullPolicy: IfNotPresent
        name: wait-for-kafka
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: tb-iot-platform
      serviceAccountName: tb-iot-platform
      terminationGracePeriodSeconds: 30
      volumes:
      - configMap:
          defaultMode: 420
          items:
          - key: conf
            path: thingsboard.conf
          - key: logback
            path: logback.xml
          name: tb-node-config
        name: tb-node-config
      - emptyDir: {}
        name: tb-node-logs
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  • as above, except with two or more tb-core and no tb-rule-engine:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: tb
    meta.helm.sh/release-namespace: tb
  labels:
    app.kubernetes.io/instance: tb
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: iot-platform
    app.kubernetes.io/version: 3.6.2
    helm.sh/chart: iot-platform-1.0.0
  name: tb-node
  namespace: tb
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 2
...
    spec:
      containers:
      - env:
        - name: TB_SERVICE_TYPE
          value: monolith
...

Below are the configurations that work (in all cases, Redis, Kafka, and Zookeeper are shared):

  • as above (no tb-rule-engine), except with the monolith tb-node scaled to 1
  • as above (no tb-rule-engine), except for the Kafka topic configuration shown below, with the monolith tb-node scaled as desired:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    meta.helm.sh/release-name: tb
    meta.helm.sh/release-namespace: tb
  labels:
    app.kubernetes.io/instance: tb
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: iot-platform
    app.kubernetes.io/version: 3.6.2
    helm.sh/chart: iot-platform-1.0.0
  name: tb-node
  namespace: tb
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: Parallel
  replicas: 2
...
    spec:
      containers:
      - env:
        - name: TB_SERVICE_TYPE
          value: monolith
        - name: TB_QUEUE_CORE_PARTITIONS
          value: '1'
        - name: TB_QUEUE_KAFKA_CORE_TOPIC_PROPERTIES
          value: >-
            retention.ms:604800000;segment.bytes:26214400;retention.bytes:1048576000;partitions:10;min.insync.replicas:1
...

maghibus commented Jun 24, 2024

The problem might be with the Kafka service: the ThingsBoard consumer cannot properly fetch from Kafka, which is why there is a timeout error. Could you try using Bitnami Kafka (on-prem) instead? E.g.:
https://github.com/thingsboard/thingsboard-ce-k8s/blob/7d259f173f47768f13ce19bba62684e155118b19/azure/microservices/thirdparty.yml
If you are already using on-prem Kafka, could you share your deployment?
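
If the consumer really cannot fetch, consumer-group lag should make that visible (a sketch with the standard Kafka CLI; <group> is a placeholder for one of the group names returned by --list):

kafka-consumer-groups.sh --bootstrap-server my-kafka.middleware:9092 --list
kafka-consumer-groups.sh --bootstrap-server my-kafka.middleware:9092 --describe --group <group>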

trikimiki commented Jul 19, 2024

Generally, in my use cases I run Kafka on-prem (using the Bitnami Helm chart) or on Amazon MSK. The problem appears in both cases. In both, Kafka and Zookeeper run with 3 replicas, and everything works except the RPC flow.

I'll try your thirdparty.yml today and let you know, both this one: https://github.com/thingsboard/thingsboard-ce-k8s/blob/7d259f173f47768f13ce19bba62684e155118b19/azure/microservices/thirdparty.yml
and this one: https://github.com/thingsboard/thingsboard-ce-k8s/blob/release-3.7.0/azure/microservices/thirdparty.yml

maghibus commented Jul 22, 2024

I tried https://github.com/thingsboard/thingsboard-ce-k8s/blob/release-3.7.0/azure/microservices/thirdparty.yml and I think it's pointless to also try the previous version (https://github.com/thingsboard/thingsboard-ce-k8s/blob/7d259f173f47768f13ce19bba62684e155118b19/azure/microservices/thirdparty.yml).

[screenshots omitted]

Attached log: tb-rule-engine-0_iot-platform.log

maghibus commented Jul 22, 2024

Hi! Apologies for the long delay. We found out it is actually a bug; see
https://github.com/thingsboard/thingsboard/pull/11686/commits/9553a6958b313ae85a463609e25d88c6232e963c
The official fix will be deployed as part of the next release.

trikimiki commented Oct 1, 2024