performance-tests

Config guidelines to get the best perf

Open AndreMaz opened this issue 2 years ago • 5 comments

Hi @smatvienko-tb! First of all, thank you for the great tool. It's super useful.

I've checked the tests you've made for different perf scenarios (e.g., messages per second, number of connected devices, etc.). It's amazing to see how much performance you manage to squeeze out of TB.

Would you mind providing some pointers about the configs you've used for these tests? For example, the values of REMOTE_JS_MAX_PENDING_REQUESTS, REMOTE_JS_MAX_REQUEST_TIMEOUT, the Kafka queue configs, etc.

Overall, is there any rule of thumb that, given the number of messages per second, gives the TB+Kafka config required to handle the data?

AndreMaz avatar Jun 10 '23 15:06 AndreMaz

Hi, @AndreMaz !

Thank you for your feedback! The parameter set is mostly about how many requests in flight the system can handle simultaneously, and the Kafka parameters are about sending data in large batches. In some cases there is a trade-off between maximum latency and throughput: when you push the system to its limit, it is a good idea to tolerate latency that can be reasonably high for a few messages. To keep latency as low as possible, supply a reasonable number of messages per batch to the rule engine. You can find all the hints for tuning performance for your environment and workload in the logs and in the JMX beans.
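For example, the producer-side batching knobs look roughly like this (these names are the ones discussed further down this thread; the values here are purely illustrative and need to be tuned against your own latency and throughput measurements):

  TB_KAFKA_BATCH_SIZE: "65536"       # bigger batches improve throughput
  TB_KAFKA_LINGER_MS: "5"            # wait a few ms to fill a batch, at the cost of a little latency
  TB_KAFKA_COMPRESSION_TYPE: "gzip"  # trades some CPU for network and disk savings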

You can find the configs for each workload in the section "How to reproduce the test." For high load and availability, it is a good idea to deploy a ThingsBoard cluster.

Best regards!

smatvienko-tb avatar Jun 12 '23 08:06 smatvienko-tb

@smatvienko-tb thank you for the reply

You can find the configs for each workload in the section "How to reproduce the test."

Facepalm, I completely missed that 😅 Maybe you could mark it in bold to make it more visible.

One thing I've noticed is that in all the scenarios you've created, TB is running as a monolith. Do you have any example of a microservices deployment? For example, what would the configs be for Scenario B (Kafka + PostgreSQL) with 5000 msg/sec over MQTT? How many JS executors would be required for that? And what would the Rule Engine timeouts be in this case?

AndreMaz avatar Jun 14 '23 20:06 AndreMaz

Hi, @AndreMaz ! For a microservices architecture, you can find examples at https://github.com/thingsboard/thingsboard-ce-k8s . You need at least three nodes to reach a quorum if you want to deploy the whole available stack (Kafka, Redis, Postgres, Cassandra...) in cluster mode.

Regarding the performance tests, we concentrate on the capabilities; your particular use case may require some tuning and resource planning. Take a look at the daily disk usage for 5k msg/sec with Postgres: https://thingsboard.io/docs/reference/performance-aws-instances/#disk-usage . In production, you must be able to maintain the databases in a reasonable amount of time (backup, restore, repair, compaction), so keeping terabytes as a single piece may not be the best idea.

The JS executors' capacity depends on the complexity of your rule chains and processing algorithms; I recommend keeping them all O(1). Just note that every remote JS execution call adds latency: the request-response round trip plus the execution time. You can use the TBEL script engine, which looks almost like JS, to reduce that latency.

For a significant data flow, it might be helpful to do short-lived in-memory aggregations and persist the state to the database every minute instead of every second (a x60 reduction; see the rough arithmetic below). Try to persist the latest telemetry values to Postgres only if you need them in SQL filters on your dashboards (you can always get the latest value from the time series to show the number). Also, a great custom rule node project can help you create super-fast rule nodes to process the data. For super-low latency, you could do more in memory (for example, with Redis or Java collections) and apply a skipping processing strategy for the rule engine in case some slow responses appear.
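To make the "x60 reduce" concrete, here is the back-of-the-envelope arithmetic, assuming (purely for illustration) 5000 devices, each posting one telemetry message per second:

  persist every message:                   5000 msg/s × 86,400 s/day ≈ 432,000,000 rows/day
  flush aggregate once per device/minute:  5000 devices × 1,440 min/day = 7,200,000 rows/day (60× fewer writes)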

If you have a high-load use case with particular requirements, I recommend reaching the ThingsBoard project team and getting help with architecture and resource planning.

smatvienko-tb avatar Jun 15 '23 11:06 smatvienko-tb

Thank you for a very detailed response @smatvienko-tb

The main problem that I'm facing is related to Rule Engine timeouts. What I'm seeing is the following:

  • SLOW PROCESSING warning @ the js-execution nodes
  • Timeout to process [<insert-number-here>] messages @ core node

Here's an example of what I'm seeing at the core node:

2023-05-31 09:23:29,131 [tb-rule-engine-consumer-46-thread-5 | QK(Main,TB_RULE_ENGINE,system)-4] INFO o.t.s.s.q.DefaultTbRuleEngineConsumerService - Timeout to process [105] messages

2023-05-31 09:23:29,131 [tb-rule-engine-consumer-46-thread-5 | QK(Main,TB_RULE_ENGINE,system)-4] INFO o.t.s.s.q.DefaultTbRuleEngineConsumerService - [092b7270-9092-11ec-8ffb-b557a03f5c3b] Timeout to process message: TbMsg(queueName=Main, id=114401b0-643e-446d-bb2f-b38b7722ecd0, ts=1685525000076, type=POST_TELEMETRY_REQUEST, originator=18ef1cd0-f035-11ed-a040-1522e4561122, customerId=null, metaData=TbMsgMetaData(data={deviceType=SMART_TRACKER, deviceName=DW00000162, ts=1685525000047}), dataType=JSON, data={"latitude":39.874901,"longitude":27.293319,"speed":51.3,"fuel":17,"batteryLevel":65}, ruleChainId=fb1fffd0-18bc-11ed-9ea3-0b3471748b6a, ruleNodeId=1dc094e0-f310-11ed-ba2f-df1cb21d363b, ctx=org.thingsboard.server.common.msg.TbMsgProcessingCtx@5e707946, callback=org.thingsboard.server.common.msg.queue.TbMsgCallback$1@29321c61), Last Rule Node: [RuleChain: Push To Kafka|RuleNode: Add device IDs(5d01cd50-a67f-11ed-b24c-3d99fd0c45ac)]

Here's a screenshot of the Push To Kafka rule chain: [image]

Note: these timeouts happen randomly for any of the rule nodes in the rule chain above; it's not just the Add device IDs rule node that produces the timeout.

My initial idea was to increase the timeouts and thus leave more time to process the messages. However, when I increase the timeouts, the performance degrades.

Here are the config values that I have at the moment:

  METRICS_ENABLED: "true"
  METRICS_ENDPOINTS_EXPOSE: prometheus
  REMOTE_JS_RESPONSE_POLL_INTERVAL_MS: "200"
  TB_KAFKA_SERVERS: kafka-broker-headless.local:9092
  TB_QUEUE_CORE_PARTITIONS: "2"
  TB_QUEUE_CORE_POLL_INTERVAL_MS: "100"
  TB_QUEUE_KAFKA_CORE_TOPIC_PROPERTIES: retention.ms:604800000;segment.bytes:52428800;retention.bytes:1048576000
  TB_QUEUE_KAFKA_JE_TOPIC_PROPERTIES: retention.ms:604800000;segment.bytes:52428800;retention.bytes:1048576000;partitions:100
  TB_QUEUE_KAFKA_NOTIFICATIONS_TOPIC_PROPERTIES: retention.ms:604800000;segment.bytes:52428800;retention.bytes:1048576000
  TB_QUEUE_KAFKA_OTA_TOPIC_PROPERTIES: retention.ms:604800000;segment.bytes:52428800;retention.bytes:104857600
  TB_QUEUE_KAFKA_RE_TOPIC_PROPERTIES: retention.ms:604800000;segment.bytes:52428800;retention.bytes:1048576000
  TB_QUEUE_KAFKA_REPLICATION_FACTOR: "3"
  TB_QUEUE_KAFKA_TA_TOPIC_PROPERTIES: retention.ms:604800000;segment.bytes:52428800;retention.bytes:1048576000
  TB_QUEUE_KAFKA_VC_TOPIC_PROPERTIES: retention.ms:604800000;segment.bytes:52428800;retention.bytes:104857600
  TB_QUEUE_RE_HP_PARTITIONS: "1"
  TB_QUEUE_RE_HP_POLL_INTERVAL_MS: "100"
  TB_QUEUE_RE_MAIN_PARTITIONS: "2"
  TB_QUEUE_RE_MAIN_POLL_INTERVAL_MS: "100"
  TB_QUEUE_RE_SQ_PARTITIONS: "1"
  TB_QUEUE_RE_SQ_POLL_INTERVAL_MS: "100"
  TB_QUEUE_RULE_ENGINE_POLL_INTERVAL_MS: "100"
  TB_QUEUE_TRANSPORT_NOTIFICATIONS_POLL_INTERVAL_MS: "100"
  TB_QUEUE_TRANSPORT_REQUEST_POLL_INTERVAL_MS: "100"
  TB_QUEUE_TRANSPORT_RESPONSE_POLL_INTERVAL_MS: "100"
  TB_QUEUE_TYPE: kafka
  TB_TRANSPORT_RATE_LIMITS_DEVICE: 10:1,300:60

I'm not sure what I should change and in which direction. The main problem is that the docs for the ENV vars are not very clear. For example:

  • REMOTE_JS_MAX_EVAL_REQUEST_TIMEOUT - JS Eval max request timeout. Q: Is this a timeout to process a message?
  • REMOTE_JS_MAX_REQUEST_TIMEOUT - JS max request timeout. Q: Who makes the request?
  • REMOTE_JS_RESPONSE_POLL_INTERVAL_MS - JS response poll interval. Q: Who does the polling in this case? Core node or JS execution node?
  • TB_QUEUE_RULE_ENGINE_PACK_PROCESSING_TIMEOUT_MS - Timeout for processing a message pack. Q: What's a message pack? What's the size of the pack?

There are many more vars that are critical for getting the best perf but, unfortunately, they are not very well documented.

I don't know if it helps, but here's a screenshot of my pods: [image]

Overall, what I'm looking for is some pointers on where I should be heading to solve the timeouts. Should I decrease the poll interval? Increase the timeouts?

More context about my setup:

  • number of evaluator nodes: 24
  • number of kafka brokers: 3

AndreMaz avatar Jun 15 '23 17:06 AndreMaz

Hi @AndreMaz !

I think your question is better suited for the ThingsBoard project; this repository is for the performance test tool. Let's discuss performance topics in the ThingsBoard project next time.

As you can see, the last rule node is not always the JS execution, which means the bottleneck can be somewhere else.

Could you start with a log analysis? Please enable debug logging for TbMsgPackProcessingContext (during the investigation only). See an example at https://github.com/thingsboard/thingsboard/blob/master/application/src/main/resources/logback.xml

<!-- Top Rule Nodes by max execution time -->
<logger name="org.thingsboard.server.service.queue.TbMsgPackProcessingContext" level="DEBUG" />

The message in your logs suggests that reading attributes from the originator is slow: Last Rule Node: [RuleChain: Push To Kafka|RuleNode: Add device IDs(5d01cd50-a67f-11ed-b24c-3d99fd0c45ac)]

In cluster mode, you have to use the Redis cache. Redis clients have only 8 concurrent connections by default. To identify the problem, use a JMX connection and check the MBean named like the Apache pool. If you see the pool capacity exhausted at times, you need to adjust the Redis pool settings. Here is an example for redis-cluster (stand-alone Redis has a different config):

  # Redis
  CACHE_TYPE: "redis"
  REDIS_CONNECTION_TYPE: "cluster"
  REDIS_NODES: "redis-headless:6379"
  REDIS_USE_DEFAULT_POOL_CONFIG: "false" # the key parameter to enable your own pool config
  REDIS_POOL_CONFIG_BLOCK_WHEN_EXHAUSTED: "false"
  REDIS_POOL_CONFIG_TEST_ON_BORROW: "false"
  REDIS_POOL_CONFIG_TEST_ON_RETURN: "false"
  CACHE_MAXIMUM_POOL_SIZE: "50"

Please read each parameter carefully and set exact values instead of an unlimited (non-blocking) Redis pool.

To increase the performance of the js-executor, you can do something like this:

env:
  - name: KAFKA_CLIENT_ID
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
  - name: SLOW_QUERY_LOG_MS
    value: "100.000"
  - name: SCRIPT_STAT_PRINT_FREQUENCY
    value: "10000"
  - name: SCRIPT_BODY_TRACE_FREQUENCY
    value: "1000000"
  - name: SLOW_QUERY_LOG_BODY
    value: "false"
  - name: TB_KAFKA_LINGER_MS
    value: "10"
  - name: TB_KAFKA_BATCH_SIZE
    value: "250"
  - name: TB_KAFKA_COMPRESSION_TYPE
    value: "gzip"
  - name: SCRIPT_USE_SANDBOX
    value: "false"

Disabling the js-executor sandbox feature increases performance by about 10 times, but it is only suitable when the rule engine is not exposed to customers.

To handle more with Kafka, try parameters like this:

  TB_KAFKA_ACKS: "1"
  TB_KAFKA_BATCH_SIZE: "65536" # default is 16384 - it helps to produce messages much more efficiently
  TB_KAFKA_LINGER_MS: "5" # default is 1
  TB_KAFKA_COMPRESSION_TYPE: "gzip" # none or gzip
  TB_QUEUE_KAFKA_MAX_POLL_RECORDS: "4096" # default is 8192

The critical point is to send messages in batches, not one by one. The default for max in-flight requests is only five, and increasing it is not recommended; if you send one message per batch, it becomes a bottleneck because the round trip and overhead exceed the useful payload. Acknowledging a single Kafka broker (acks=1) is reasonable for most IoT use cases. In most cases, compressing is faster than transferring the extra bytes over the network, and it also saves disk space and throughput.

For the rule engine, you can find the queue settings under the sysadmin account. You can increase TB_QUEUE_RE_MAIN_PACK_PROCESSING_TIMEOUT_MS from the UI to avoid the timeouts if your pipeline is too heavy to complete within the default 2 seconds. The processing will still go as fast as possible, but it will wait longer for slow responses.
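For example (the value here is hypothetical, just to show the direction of the change; you can set the same thing from the sysadmin UI as mentioned above):

  TB_QUEUE_RE_MAIN_PACK_PROCESSING_TIMEOUT_MS: "10000" # default is 2000 ms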

Example of enabling JMX on port 9999 (with memory settings as well):

  - name: JAVA_OPTS
    value: "-Xmx2048M -Xms2048M -Xss384k -XX:+AlwaysPreTouch -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.rmi.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=127.0.0.1"

For the JS executors, the timeout has to be less than or equal to the rule engine timeout.

  # JS executors
  JS_EVALUATOR: "remote"
  REMOTE_JS_MAX_PENDING_REQUESTS: "100000" # max requests in flight, waiting for a response from js-executor
  REMOTE_JS_MAX_EVAL_REQUEST_TIMEOUT: "60000" # parameter is not in use; pending removal
  REMOTE_JS_MAX_REQUEST_TIMEOUT: "60000" # how long we will wait for a response from the js-executor

The js-executor is a Node.js application; it can utilize about one vCPU. How many js-executor instances you need is up to you.
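For instance, a rough Kubernetes sketch for scaling the js-executor could look like the following (the deployment name, image tag, and replica count are illustrative and not taken from the official thingsboard-ce-k8s manifests):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: tb-js-executor
  spec:
    replicas: 24                     # roughly one vCPU of JS throughput per replica
    selector:
      matchLabels:
        app: tb-js-executor
    template:
      metadata:
        labels:
          app: tb-js-executor
      spec:
        containers:
          - name: tb-js-executor
            image: thingsboard/tb-js-executor:3.5.1   # pick the tag that matches your ThingsBoard version
            resources:
              requests:
                cpu: "1"
              limits:
                cpu: "1"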

Try to use TBEL instead of js-executors. You will have a much better experience.

Those are the main hints for setting up a high-performance environment. There are more settings; with microservices, it is all about orchestration according to the load.

Please take advantage of the open source: clone the ThingsBoard project to your computer and search for any parameter you are concerned about if the docs come up short.

Reach out to the community for help or support from the ThingsBoard team!

Have a good day!

smatvienko-tb avatar Jun 19 '23 09:06 smatvienko-tb