
Add MMLSPARK_PYSPARK_CORES to specify CPU core count for PySpark

Open ghost opened this issue 6 years ago • 5 comments

This is just a POC to get early feedback.

I run mmlspark locally on my notebook and noticed that only 2 of my 6 CPU cores were used when calculating pi with PySpark (code below). I couldn't find an easy out-of-the-box mechanism to tweak this behavior, so I thought it would be nice to make it configurable through environment variables, letting users adjust it at container creation. Hence this pull request.

Please let me know whether you like this feature. If you do, I'll extend the documentation accordingly.
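
To illustrate the general idea, here is a minimal sketch of the mechanism I have in mind: read the core count from an environment variable and pass it to Spark's local[N] master setting when the session is created. The helper names and the defaulting behavior are just for illustration, not necessarily the exact change in this pull request:

import os
from pyspark.sql import SparkSession

# Read the desired core count from the environment; fall back to all cores ("*").
cores = os.environ.get("MMLSPARK_PYSPARK_CORES", "*")

# local[N] runs Spark locally with N worker threads; local[*] uses all available cores.
spark = (SparkSession.builder
         .master("local[{}]".format(cores))
         .appName("pi-core-count-demo")
         .getOrCreate())
sc = spark.sparkContext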

For reference, here's the code I ran:

import random

# Monte Carlo estimate of pi: sample random points in the unit square
# and count how many land inside the quarter circle.
num_samples = 100000000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

# `sc` is the SparkContext provided by the notebook environment.
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4.0 * count / num_samples
print(pi)

ghost avatar Jun 01 '19 08:06 ghost

Looks good @zulli73! If you add a line in the docs I'll merge.

mhamilton723 avatar Jun 11 '19 19:06 mhamilton723

CLA assistant check
All CLA requirements met.

msftclas avatar Jun 16 '19 09:06 msftclas

I've added documentation.

I added a whole new section covering all environment variables because I felt it didn't fit into any of the existing parts of the documentation. I also considered adding it to the example docker run command, but I didn't want to make that example more complicated than necessary. Hence the new section; a sample invocation is shown below.
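
For reference, this is roughly how the variable would be passed when the container is created (the image name is a placeholder, not necessarily the published one):

docker run -it -e MMLSPARK_PYSPARK_CORES=6 -p 8888:8888 <mmlspark-image>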

ghost avatar Jun 16 '19 09:06 ghost

Hello @zulli73, if you don't mind, please resolve the conflict and I'll trigger the merge. Thank you for your contribution!

drdarshan avatar Jul 17 '19 01:07 drdarshan

Hi @drdarshan.

Short story: has this pull request become obsolete? Searching the repository for "local[", all matches use "local[*]", which suggests that the latest version on master may already use all CPU cores.

Long story: I'd happily fix the merge conflicts, but I have trouble understanding the change that caused them (d34f9d1): the file I modified was removed, and it's not obvious to me why it became obsolete. It also seems that no new Docker image has been pushed since that change, so I can't easily check whether Spark utilizes all available CPU cores as of that commit. Finally, I couldn't find docs for building the Docker image myself/locally.
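
In case it helps, a quick way to check this from a running notebook (assuming `sc` is the active SparkContext) is to inspect the master URL and default parallelism:

print(sc.master)              # e.g. "local[2]" or "local[*]"
print(sc.defaultParallelism)  # number of partitions Spark uses by default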

ghost avatar Aug 12 '19 05:08 ghost