Setting the number of threads per executor
I'm using com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc2 on a k8s cluster, specifically the LightGBM classifier with a binary response. I'm setting the number of executors to 10 and allocating spark.executor.cores=80 and spark.task.cpus=80, so that each machine runs exactly one task with 80 cores available to it. I was expecting to see full utilisation of the 80 cores; instead only ~8 cores are utilised. My best guess is that this is related to the num_threads parameter, which was exposed as numThreads a long time ago but is currently not present in the param set. I have a dataset of 30,000,000 samples with a large number of features; each tree takes around 10 seconds to generate, and rebagging also takes a lot of time and uses a single core. Please advise.
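For reference, a minimal sketch of the configuration described above (values are taken from this issue; using spark.executor.instances to set the 10 executors is my assumption about how the job is launched):

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the Spark configuration from the question: 10 executors, each
// running exactly one task with all 80 cores allocated to that task.
val spark = SparkSession.builder()
  .appName("lightgbm-binary-training")
  .config("spark.executor.instances", "10") // assumed way of setting 10 executors
  .config("spark.executor.cores", "80")     // cores available per executor
  .config("spark.task.cpus", "80")          // one task per executor
  .getOrCreate()
```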
Same problem as in https://github.com/Azure/mmlspark/issues/292. Not fixed yet.
Try playing with the number of partitions and the partition key(s). In my case this resulted in 20% utilisation instead of 10%, so it's still not there.
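For illustration, a hedged sketch of that kind of repartitioning (the partition count and the "id" key column are placeholders, not values from this thread):

```scala
import org.apache.spark.sql.DataFrame

// Repartition the training data before fitting so partition layout, and
// hence LightGBM worker layout, matches the cluster shape.
def repartitionForTraining(df: DataFrame): DataFrame = {
  val numPartitions = 10 // e.g. one partition per executor; an assumption
  df.repartition(numPartitions, df("id")) // "id" is a placeholder key column
}
```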
Same problem +1
@JWenBin, have you tried the new single dataset mode parameter on the latest master (https://github.com/Azure/mmlspark/pull/1066)? In our benchmarking it resolved the low CPU utilization issue.
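A sketch of enabling it (the setUseSingleDatasetMode setter name is my assumption based on MMLSpark's usual param conventions; column names are placeholders):

```scala
import com.microsoft.ml.spark.lightgbm.LightGBMClassifier

// Single dataset mode (PR #1066) builds one shared native LightGBM dataset
// per executor instead of one per partition, which should let LightGBM's
// threads use the executor's cores.
val classifier = new LightGBMClassifier()
  .setLabelCol("label")           // placeholder column names
  .setFeaturesCol("features")
  .setObjective("binary")         // binary response, as in the question
  .setUseSingleDatasetMode(true)  // assumed setter for the new parameter
```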