
Error in to_data_frame() when feeding numpy matrix as label

Open EvanZ opened this issue 7 years ago • 2 comments

I am trying to reproduce the basic autoencoder example from the Keras blog:

https://blog.keras.io/building-autoencoders-in-keras.html

# Imports (omitted in the original snippet, added for completeness)
from keras.layers import Input, Dense
from keras.models import Model
from keras.datasets import mnist
from random import randint
import matplotlib.pyplot as plt
from pyspark import SparkConf, SparkContext
from elephas.ml.adapter import to_data_frame

# Define basic parameters
encoding_dim = 32
batch_size = 32
epochs = 1

# Build model
input_img = Input(shape=(784,))
encoded = Dense(encoding_dim, activation='relu')(input_img)
decoded = Dense(784, activation='sigmoid')(encoded)
autoencoder = Model(input_img, decoded)
encoder = Model(input_img, encoded)
encoded_input = Input(shape=(encoding_dim,))
decoded_layer = autoencoder.layers[-1]
decoder = Model(encoded_input, decoded_layer(encoded_input))
print(autoencoder.summary())


# Load data
(x_train, _), (x_test, _) = mnist.load_data()

x_train = x_train.reshape(60000, 784).astype('float32') / 255.
x_test = x_test.reshape(10000, 784).astype('float32') / 255.
plt.imshow(x_train[randint(0,60000-1),:].reshape(28,28))
plt.gray()
plt.show()

print(x_train.shape, 'train samples')
print(x_test.shape, 'test samples')

# Create Spark context
conf = SparkConf().setAppName('Mnist_Spark_MLP').setMaster('local[8]')
sc = SparkContext(conf=conf)

# Build RDD from numpy features and labels
test_df = to_data_frame(sc, x_test, x_test, categorical=False)

This generates an error, which I believe happens because the label is only accepted if it is a scalar, not a matrix of vectors:

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "/Users/***/ams/px-seed-model/scripts/elephas_ae_mnist.py", line 79, in <module>
    test_df = to_data_frame(sc, x_test, x_test, categorical=False)
  File "/Users/***/ams/px-seed-model/pinpoint/src/elephas/elephas/ml/adapter.py", line 11, in to_data_frame
    lp_rdd = to_labeled_point(sc, features, labels, categorical)
  File "/Users/***/ams/px-seed-model/pinpoint/src/elephas/elephas/utils/rdd_utils.py", line 38, in to_labeled_point
    lp = LabeledPoint(y, to_vector(x))
  File "/Users/***/ams/px-seed-model/pinpoint/lib/python3.6/site-packages/pyspark/mllib/regression.py", line 53, in __init__
    self.label = float(label)
TypeError: only size-1 arrays can be converted to Python scalars
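The root cause of that last frame is easy to reproduce outside Spark; a minimal sketch (plain NumPy, no pyspark needed) of what `LabeledPoint.__init__` runs into:

```python
import numpy as np

# pyspark's LabeledPoint.__init__ does `self.label = float(label)`;
# float() only works on scalars, so a vector label fails the same way:
label = np.array([0.1, 0.9])
try:
    float(label)
except TypeError as err:
    print(err)  # only size-1 arrays can be converted to Python scalars
```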

EvanZ avatar Sep 13 '18 16:09 EvanZ

@maxpumperla
Do you think this issue is fixable? I can try to contribute a pull request if you point me in the right direction.

David-Taub avatar Mar 25 '19 11:03 David-Taub

@DavidAriel this is a difficult one. If you look at the implementation, @EvanZ is right about the root cause:

https://github.com/maxpumperla/elephas/blob/db8147c501d9ff5dda63931b7c773db999b9743f/elephas/ml/adapter.py#L9-L23

We go through LabeledPoint, whose label by definition can only be a scalar. Having said that, nothing forces us to do that; there are other ways of creating a DataFrame. But here's the catch: if I want to implement the Estimator interface, I need to build on the provided traits, namely

HasCategoricalLabels, HasLabelCol, HasOutputCol

which also assume that there is a single column corresponding to labels. This puts us in a tough spot for using the Spark ML interfaces. OK, let's look at something simpler, namely to_simple_rdd. No assumptions from Spark go in there, see here:

https://github.com/maxpumperla/elephas/blob/db8147c501d9ff5dda63931b7c773db999b9743f/elephas/utils/rdd_utils.py#L38

Labels could be a matrix there as well. And here you see how this is reflected in Spark workers:

https://github.com/maxpumperla/elephas/blob/master/elephas/worker.py#L37
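For illustration, the path through to_simple_rdd and the worker can be sketched without Spark at all; the zip/unzip below mimics what those two linked snippets do (variable names here are hypothetical, not elephas API):

```python
import numpy as np

# Mimic to_simple_rdd: pair up features and labels, much like
# sc.parallelize(list(zip(features, labels))) would.
x = np.random.rand(8, 784).astype('float32')
y = x.copy()  # autoencoder setup: the label matrix equals the input matrix
pairs = list(zip(x, y))  # each element is a (features, label) tuple

# Mimic the worker side: collect the pairs back into arrays for model.fit.
x_train = np.array([features for features, _ in pairs])
y_train = np.array([label for _, label in pairs])

# Nothing above forced the label to be a scalar:
print(x_train.shape, y_train.shape)  # (8, 784) (8, 784)
```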

Something much more feasible right now is supporting multiple inputs/outputs for the basic case, and that's something I'd really appreciate some help with. See here: https://github.com/maxpumperla/elephas/issues/16

maxpumperla avatar Mar 25 '19 13:03 maxpumperla