Server-side RPC not working in v3.6.x CE MSA
Describe the bug
RPC messages are not queued or sent to devices.
Your Server Environment
- own setup
- Deployment: microservices (tb-core x2, tb-rule-engine x2, tb-mqtt-transport x2)
- Deployment type: k8s
- ThingsBoard Version: 3.6.2 - 3.6.4
- Community
Your Device
- Connectivity
- Gateway/Devices
To Reproduce
- Send a persistent RPC command to /api/rpc/oneway/DEVICE-ID (the example below uses the twoway endpoint), for example:
curl -X 'POST' 'https://tb-localhost:443/api/rpc/twoway/c3150ff0-2a5c-11ef-802d-4faa3852181b' -H 'accept: application/json' -H 'Content-Type: application/json' -H 'X-Authorization: Bearer eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiJhZG1pbkB0ZW5hbnQtMS5pdCIsInVzZXJJZCI6IjVjZDhiYWMwLTJhNWMtMTFlZi04MDJkLTRmYWEzODUyMTgxYiIsInNjb3BlcyI6WyJURU5BTlRfQURNSU4iXSwic2Vzc2lvbklkIjoiNDFmYjgyOWUtNjQ1ZC00ZWE5LTllYzQtMjZlYjYxZmZlYmU2IiwiaXNzIjoidGhpbmdzYm9hcmQuaW8iLCJpYXQiOjE3MTgzNzYxODMsImV4cCI6MTcxODM4NTE4MywiZW5hYmxlZCI6dHJ1ZSwiaXNQdWJsaWMiOmZhbHNlLCJ0ZW5hbnRJZCI6IjRjMjljYjYwLTJhNWMtMTFlZi04MDJkLTRmYWEzODUyMTgxYiIsImN1c3RvbWVySWQiOiIxMzgxNDAwMC0xZGQyLTExYjItODA4MC04MDgwODA4MDgwODAifQ.8lEqCnA8DO4FxMmjvYszlxLHcVkR1TKXC08nJhF6CGn_KoEEGJn91S_6lpHQud8Y1GpcsfXKz4Nps-e83n8BxQ' -d '{ "method": "setGpio", "params": { "pin": 7, "value": 1 }, "persistent": true, "retries": 1, "expirationTime": 1718402400000 }'
- Check the RPC status: /api/rpc/persistent/device/DEVICE-ID?pageSize=10&page=0&sortProperty=createdTime&sortOrder=DESC (a sample call is shown below)
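For completeness, a sketch of the status-check call (hypothetical example values: same host, device ID and Bearer token as in the curl command above):

```bash
# Query the persistent RPC entries for the device used in the example above.
curl -X 'GET' \
  'https://tb-localhost:443/api/rpc/persistent/device/c3150ff0-2a5c-11ef-802d-4faa3852181b?pageSize=10&page=0&sortProperty=createdTime&sortOrder=DESC' \
  -H 'accept: application/json' \
  -H 'X-Authorization: Bearer <tenant admin JWT, as above>'
```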
What happens
You receive HTTP 500 on the dashboard instead of 200, the message is not queued, and the RPC is not stored in the database. The API response is:
{ "status": 500, "message": "Request timeout", "errorCode": 2, "timestamp": 1718381143859 }
Expected behavior
RPC messages are queued. The API response should be of the following form:
{
"rpcId": "ff6c2058-6fcf-4be1-ab29-8fd172875ef6"
}
The image below highlights the correct expected result, obtained in single-node mode (1x tb-node).
Additional context
In summary, RPC commands seem to be processed correctly only if the deployment is a single tb-node: no rule engine, and the tb-node scaled to 1.
The same problem occurs with version 3.6.4.
After investigating a bit, I hypothesized it could depend on the Kafka topic partitions, so I set TB_QUEUE_CORE_PARTITIONS=1; with that, the deployment without rule engine also works with tb-core x2.
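The partition count of the core topic can also be checked directly on the broker; a sketch, assuming the default core topic name tb_core and the Kafka bootstrap address used in the manifests shared below:

```bash
# Describe the ThingsBoard core topic to see how many partitions it actually has.
# "tb_core" is the default core topic name; adjust topic and bootstrap server to your setup.
kafka-topics.sh --bootstrap-server my-kafka.middleware:9092 \
  --describe --topic tb_core
```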
hi!
> { "status": 500, "message": "Request timeout", "errorCode": 2, "timestamp": 1718381143859 }
HTTP error 500 means there are issues with the backend services; if you see such an error in the UI, you should check the service/container logs.
> in summary, it seems that RPC commands are processed correctly if the deployment is of a single tb-node: no rule engine and the tb-node must be scaled to 1
If downscaling your services to 1 resolves the issue, you may not have set up your services properly - you need to ensure that all tb-core and tb-rule-engine instances share Zookeeper and Redis (see the check below).
TB_QUEUE_CORE_PARTITIONS=1 and the deployment without rule engine also works with tb-core x2
You have scaled the queue partitions down to 1, so only one consumer (core/rule-engine) processes messages; see my advice above.
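A quick way to verify this from the cluster is to compare the endpoints every ThingsBoard pod is configured with; a sketch, assuming namespace tb and the app labels tb-node / tb-rule-engine, adjust to your deployment:

```bash
# Print the Zookeeper/Redis/Kafka endpoints of every tb-core and tb-rule-engine pod;
# all of them should report the same values.
for pod in $(kubectl -n tb get pods -l 'app in (tb-node, tb-rule-engine)' -o name); do
  echo "== $pod"
  kubectl -n tb exec "$pod" -- printenv ZOOKEEPER_URL REDIS_HOST TB_KAFKA_SERVERS
done
```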
Hi, I confirm that Zookeeper, Redis and Kafka are shared.
I know; in fact, to make it work with n tb-cores and no rule-engine, I also set partitions:10 via
TB_QUEUE_KAFKA_CORE_TOPIC_PROPERTIES = retention.ms:604800000;segment.bytes:26214400;retention.bytes:1048576000;partitions:10;min.insync.replicas:1
TB_QUEUE_CORE_PARTITIONS = 1
Sorry for the question, but does it work for you? Have you tried it?
Could you share your deployment files for when it is working and when it isn't?
RPC most definitely works with MSA; still, there are countless ways of deploying MSA, so it might be a corner case of some sort.
| tb-core | tb-rule-engine | kafka topic | result |
|---|---|---|---|
| >= 2 pods | >= 2 pods | default | KO |
| >= 2 pods (monolith) | 0 pods | default | KO |
| 1 pod (monolith) | 0 pods | default | OK |
| >= 2 pods (monolith) | 0 pods | TB_QUEUE_CORE_PARTITIONS=1 TB_QUEUE_KAFKA_CORE_TOPIC_PROPERTIES=retention.ms:604800000;segment.bytes:26214400;retention.bytes:1048576000;partitions:10;min.insync.replicas:1 | OK |
Below are the configurations that don't work (in all cases we have shared Redis, Kafka and Zookeeper)
- one or more tb-core, one or more tb-rule-engine
apiVersion: apps/v1
kind: StatefulSet
metadata:
annotations:
meta.helm.sh/release-name: tb
meta.helm.sh/release-namespace: tb
labels:
app.kubernetes.io/instance: tb
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: iot-platform
app.kubernetes.io/version: 3.6.2
helm.sh/chart: iot-platform-1.0.0
name: tb-node
namespace: tb
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain
whenScaled: Retain
podManagementPolicy: Parallel
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: tb-node
app.kubernetes.io/component: Core_Microservices
app.kubernetes.io/instance: tb-node
app.kubernetes.io/name: iot-platform-node
app.kubernetes.io/part-of: Thingsboard
app.kubernetes.io/version: 3.6.2
serviceName: tb-node-headless
template:
metadata:
annotations:
cattle.io/timestamp: "2024-06-14T16:01:25Z"
creationTimestamp: null
labels:
app: tb-node
app.kubernetes.io/component: Core_Microservices
app.kubernetes.io/instance: tb-node
app.kubernetes.io/name: iot-platform-node
app.kubernetes.io/part-of: Thingsboard
app.kubernetes.io/version: 3.6.2
spec:
affinity:
podAffinity: {}
podAntiAffinity: {}
containers:
- env:
- name: HTTP_LOG_CONTROLLER_ERROR_STACK_TRACE
value: "false"
- name: TB_SERVICE_TYPE
value: tb-core
- name: HTTP_ENABLED
value: "false"
- name: MQTT_ENABLED
value: "true"
- name: COAP_ENABLED
value: "false"
- name: SNMP_ENABLED
value: "false"
- name: LWM2M_ENABLED
value: "false"
- name: TB_QUEUE_TYPE
value: kafka
- name: TB_KAFKA_SERVERS
value: my-kafka.middleware:9092
- name: TB_QUEUE_KAFKA_REPLICATION_FACTOR
value: "3"
- name: TB_KAFKA_BATCH_SIZE
value: "65536"
- name: TB_KAFKA_LINGER_MS
value: "5"
- name: TB_KAFKA_COMPRESSION_TYPE
value: gzip
- name: TB_QUEUE_KAFKA_MAX_POLL_RECORDS
value: "4096"
- name: ZOOKEEPER_ENABLED
value: "true"
- name: ZOOKEEPER_URL
value: my-zookeeper.middleware:2181
- name: SPRING_DATASOURCE_USERNAME
valueFrom:
secretKeyRef:
key: username
name: tb-postgre-secret
- name: SPRING_DATASOURCE_PASSWORD
valueFrom:
secretKeyRef:
key: password
name: tb-postgre-secret
- name: TB_QUEUE_KAFKA_CONFLUENT_SECURITY_PROTOCOL
value: SASL_PLAINTEXT
- name: TB_QUEUE_KAFKA_USE_CONFLUENT_CLOUD
value: "true"
- name: TB_QUEUE_KAFKA_CONFLUENT_SASL_JAAS_CONFIG
value: org.apache.kafka.common.security.plain.PlainLoginModule required
username="user1" password="ni0vHNQEjk";
- name: REDIS_PASSWORD
value: zJFsDycY1p
- name: CACHE_TYPE
value: redis
- name: REDIS_HOST
value: my-redis-master.middleware
envFrom:
- configMapRef:
name: tb-node-db-config
optional: false
image: docker.io/thingsboard/tb-node:3.6.2
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 3
httpGet:
path: /login
port: http
scheme: HTTP
initialDelaySeconds: 360
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 10
name: iot-platform
ports:
- containerPort: 8080
name: http
protocol: TCP
- containerPort: 7070
name: rpc
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /login
port: http
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources:
limits:
cpu: "1"
memory: 1000Mi
requests:
cpu: "1"
memory: 1000Mi
securityContext: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /config
name: tb-node-config
- mountPath: /var/log/thingsboard
name: tb-node-logs
dnsPolicy: ClusterFirst
initContainers:
- command:
- sh
- -c
- echo "Waiting for PostgreSQL to launch..."; until nc -w 1 -z my-postgresql.db
5432; do echo "Waiting for PostgreSQL"; sleep 1; done; echo "PostgreSQL
launched";
image: busybox:1.36
imagePullPolicy: IfNotPresent
name: wait-for-postgre
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
- command:
- sh
- -c
- echo "Waiting for zk to launch..."; until nc -w 1 -z my-zookeeper.middleware
2181; do echo "Waiting for zk"; sleep 1; done; echo "Zookeeper launched";
image: busybox:1.36
imagePullPolicy: IfNotPresent
name: wait-for-zookeeper
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
- command:
- sh
- -c
- echo "Waiting for Kafka to launch..."; until nc -w 1 -z my-kafka.middleware
9092; do echo "Waiting for kafka"; sleep 1; done; echo "Kafka launched";
image: busybox:1.36
imagePullPolicy: IfNotPresent
name: wait-for-kafka
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: tb-iot-platform
serviceAccountName: tb-iot-platform
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 420
items:
- key: conf
path: thingsboard.conf
- key: logback
path: logback.xml
name: tb-node-config
name: tb-node-config
- emptyDir: {}
name: tb-node-logs
updateStrategy:
rollingUpdate:
partition: 0
type: RollingUpdate
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
annotations:
meta.helm.sh/release-name: tb
meta.helm.sh/release-namespace: tb
generation: 41
labels:
app.kubernetes.io/instance: tb
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: iot-platform
app.kubernetes.io/version: 3.6.2
helm.sh/chart: iot-platform-1.0.0
name: tb-rule-engine
namespace: tb
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain
whenScaled: Retain
podManagementPolicy: Parallel
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: tb-rule-engine
app.kubernetes.io/component: Rule_Engine_Microservices
app.kubernetes.io/instance: tb-rule-engine
app.kubernetes.io/name: iot-platform-rule-engine
app.kubernetes.io/part-of: Thingsboard
app.kubernetes.io/version: 3.6.2
serviceName: tb-rule-engine-headless
template:
metadata:
annotations:
cattle.io/timestamp: "2024-06-14T13:51:20Z"
creationTimestamp: null
labels:
app: tb-rule-engine
app.kubernetes.io/component: Rule_Engine_Microservices
app.kubernetes.io/instance: tb-rule-engine
app.kubernetes.io/name: iot-platform-rule-engine
app.kubernetes.io/part-of: Thingsboard
app.kubernetes.io/version: 3.6.2
spec:
affinity:
podAffinity: {}
podAntiAffinity: {}
containers:
- env:
- name: TB_SERVICE_TYPE
value: tb-rule-engine
- name: HTTP_LOG_CONTROLLER_ERROR_STACK_TRACE
value: "false"
- name: HTTP_ENABLED
value: "false"
- name: MQTT_ENABLED
value: "false"
- name: COAP_ENABLED
value: "false"
- name: SNMP_ENABLED
value: "false"
- name: LWM2M_ENABLED
value: "false"
- name: TB_QUEUE_TYPE
value: kafka
- name: TB_KAFKA_SERVERS
value: my-kafka.middleware:9092
- name: TB_QUEUE_KAFKA_REPLICATION_FACTOR
value: "3"
- name: TB_KAFKA_BATCH_SIZE
value: "65536"
- name: TB_KAFKA_LINGER_MS
value: "5"
- name: TB_KAFKA_COMPRESSION_TYPE
value: gzip
- name: TB_QUEUE_KAFKA_MAX_POLL_RECORDS
value: "4096"
- name: ZOOKEEPER_ENABLED
value: "true"
- name: ZOOKEEPER_URL
value: my-zookeeper.middleware:2181
- name: SPRING_DATASOURCE_USERNAME
valueFrom:
secretKeyRef:
key: username
name: tb-postgre-secret
- name: SPRING_DATASOURCE_PASSWORD
valueFrom:
secretKeyRef:
key: password
name: tb-postgre-secret
- name: TB_QUEUE_KAFKA_CONFLUENT_SASL_JAAS_CONFIG
value: org.apache.kafka.common.security.plain.PlainLoginModule required
username="user1" password="ni0vHNQEjk";
- name: TB_QUEUE_KAFKA_USE_CONFLUENT_CLOUD
value: "true"
- name: TB_QUEUE_KAFKA_CONFLUENT_SECURITY_PROTOCOL
value: SASL_PLAINTEXT
- name: CACHE_TYPE
value: redis
- name: REDIS_HOST
value: my-redis-master.middleware
- name: REDIS_PASSWORD
value: zJFsDycY1p
envFrom:
- configMapRef:
name: tb-node-db-config
image: docker.io/thingsboard/tb-node:3.6.2
imagePullPolicy: IfNotPresent
name: iot-platform
ports:
- containerPort: 8080
name: http
protocol: TCP
- containerPort: 7070
name: rpc
protocol: TCP
resources:
limits:
cpu: "1"
memory: 1000Mi
requests:
cpu: "1"
memory: 1000Mi
securityContext: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /config
name: tb-node-config
- mountPath: /var/log/thingsboard
name: tb-node-logs
dnsPolicy: ClusterFirst
initContainers:
- command:
- sh
- -c
- echo "Waiting for PostgreSQL to launch..."; until nc -w 1 -z my-postgresql.db
5432; do echo "Waiting for PostgreSQL"; sleep 1; done; echo "PostgreSQL
launched";
image: busybox:1.36
imagePullPolicy: IfNotPresent
name: wait-for-postgre
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
- command:
- sh
- -c
- echo "Waiting for zk to launch..."; until nc -w 1 -z my-zookeeper.middleware
2181; do echo "Waiting for zk"; sleep 1; done; echo "Zookeeper launched";
image: busybox:1.36
imagePullPolicy: IfNotPresent
name: wait-for-zookeeper
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
- command:
- sh
- -c
- echo "Waiting for Kafka to launch..."; until nc -w 1 -z my-kafka.middleware
9092; do echo "Waiting for kafka"; sleep 1; done; echo "Kafka launched";
image: busybox:1.36
imagePullPolicy: IfNotPresent
name: wait-for-kafka
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: tb-iot-platform
serviceAccountName: tb-iot-platform
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 420
items:
- key: conf
path: thingsboard.conf
- key: logback
path: logback.xml
name: tb-node-config
name: tb-node-config
- emptyDir: {}
name: tb-node-logs
updateStrategy:
rollingUpdate:
partition: 0
type: RollingUpdate
- as above, except with two or more tb-core and no tb-rule-engine
apiVersion: apps/v1
kind: StatefulSet
metadata:
annotations:
meta.helm.sh/release-name: tb
meta.helm.sh/release-namespace: tb
labels:
app.kubernetes.io/instance: tb
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: iot-platform
app.kubernetes.io/version: 3.6.2
helm.sh/chart: iot-platform-1.0.0
name: tb-node
namespace: tb
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain
whenScaled: Retain
podManagementPolicy: Parallel
replicas: 2
...
spec:
containers:
- env:
- name: TB_SERVICE_TYPE
value: monolith
...
Below are the configurations that work (in all cases we have shared Redis, Kafka and Zookeeper)
- as above (no tb-rule-engine) except the monolith tb-node scaled to 1
- as above (no tb-rule-engine) except for the kafka topic configuration, with the monolith tb-node scaled as desired
apiVersion: apps/v1
kind: StatefulSet
metadata:
annotations:
meta.helm.sh/release-name: tb
meta.helm.sh/release-namespace: tb
labels:
app.kubernetes.io/instance: tb
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: iot-platform
app.kubernetes.io/version: 3.6.2
helm.sh/chart: iot-platform-1.0.0
name: tb-node
namespace: tb
spec:
persistentVolumeClaimRetentionPolicy:
whenDeleted: Retain
whenScaled: Retain
podManagementPolicy: Parallel
replicas: 2
...
spec:
containers:
- env:
- name: TB_SERVICE_TYPE
value: monolith
- name: TB_QUEUE_CORE_PARTITIONS
value: '1'
- name: TB_QUEUE_KAFKA_CORE_TOPIC_PROPERTIES
value: >-
retention.ms:604800000;segment.bytes:26214400;retention.bytes:1048576000;partitions:10;min.insync.replicas:1
...
The problem might be with the Kafka service: the ThingsBoard consumer cannot fetch from Kafka properly, which is why there is a timeout error. Could you try using Bitnami Kafka (on-prem) instead? E.g. https://github.com/thingsboard/thingsboard-ce-k8s/blob/7d259f173f47768f13ce19bba62684e155118b19/azure/microservices/thirdparty.yml. If you are already using on-prem Kafka, could you share your deployment?
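You can also check whether the core consumers are actually assigned partitions and are fetching; a sketch, with the bootstrap address taken from the manifests above and the group name to be picked from the --list output:

```bash
# List the consumer groups ThingsBoard created on the broker, then describe the one
# consuming the core topic to check partition assignment and lag.
kafka-consumer-groups.sh --bootstrap-server my-kafka.middleware:9092 --list
kafka-consumer-groups.sh --bootstrap-server my-kafka.middleware:9092 \
  --describe --group <core-consumer-group-from-the-list-above>
```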
Generally, in my use cases I use Kafka on-prem (via the Bitnami Helm chart) or Amazon MSK. The problem appears in both cases; in both, Kafka and Zookeeper run with 3 replicas. Everything works except the RPC flow.
I'll try your thirdparty.yml today and let you know.
This https://github.com/thingsboard/thingsboard-ce-k8s/blob/7d259f173f47768f13ce19bba62684e155118b19/azure/microservices/thirdparty.yml
and that https://github.com/thingsboard/thingsboard-ce-k8s/blob/release-3.7.0/azure/microservices/thirdparty.yml
I tried https://github.com/thingsboard/thingsboard-ce-k8s/blob/release-3.7.0/azure/microservices/thirdparty.yml; I think it is pointless to also try the previous version https://github.com/thingsboard/thingsboard-ce-k8s/blob/7d259f173f47768f13ce19bba62684e155118b19/azure/microservices/thirdparty.yml
Hi! Apologies for the long delay. We found out it is actually a bug, see https://github.com/thingsboard/thingsboard/pull/11686/commits/9553a6958b313ae85a463609e25d88c6232e963c. The official fix will be delivered as part of the next release.