substra [Feature Request] Passing list of data hashes instead of sets to traintuples

One use case that Substra does not support today is using a static algo.py to run either a single training epoch or multiple epochs over a dataset, without any modification to the algo or the Dockerfile.

The limitation comes from a uniqueness check on the data hashes passed to a traintuple: they must be unique, i.e. casting them to a set must not drop any element. If the hashes given to the traintuple are not unique, Substra raises the following error and the traintuple cannot be processed:

Key: 'inputTraintuple.DataSampleKeys' Error:Field validation for 'DataSampleKeys' failed on the 'unique' tag

For instance, consider the case where you want your traintuple to operate on the following hashes: ["A","B","C","D"].

One might want to use a single algorithm similar to the following algo.py:

import json
import substratools as tools
from networks import MyNetwork

class ComputeUpdates(tools.Algo):
    def train(self, X, y, models, rank):
        # Assuming the opener created a numpy array X of shape [n_hashes, d]
        # and an array y of shape [n_hashes, dtarget]
        my_model = MyNetwork()
        for i in range(X.shape[0]):
            my_model.update_weights(X[i], y[i])
        return my_model

    def predict(self, X, model):
        predictions = 0
        return predictions

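    def load_model(self, path):
        # json.load expects an open file object, not a path
        with open(path) as f:
            return json.load(f)

    def save_model(self, model, path):
        # json.dump expects an open file object, not a path
        with open(path, 'w') as f:
            json.dump(model, f)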


if __name__ == '__main__':
    tools.algo.execute(ComputeUpdates())

So, to do one epoch, one would register a traintuple using this algorithm with the set of data sample hashes s=["A","B","C","D"], and it would work.

Now, to do multiple epochs (N) instead of one without modifying the algo, the obvious solution would be to pass s*N instead of s when registering the traintuple.

However, this is not possible: each hash would be present N times, so the hashes are no longer unique and Substra raises the error above.
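
To make the problem concrete, here is a minimal sketch in plain Python. The `has_unique_keys` helper is hypothetical and only mimics the effect of the 'unique' validation; it is not Substra's actual check.

```python
def has_unique_keys(keys):
    """Illustrative stand-in for the 'unique' validation (not Substra code)."""
    # A list has duplicates exactly when casting it to a set drops elements.
    return len(keys) == len(set(keys))


s = ["A", "B", "C", "D"]
n_epochs = 3

# List repetition: every hash now appears n_epochs times.
repeated = s * n_epochs

print(has_unique_keys(s))         # one epoch: passes the check
print(has_unique_keys(repeated))  # several epochs: fails the check
```

With s the check passes, while with s*N it fails because every hash appears N times.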

This feature would be very valuable for my workflow!

It would also make it possible to support more complicated plans with a fractional number of epochs, by passing only the hashes of the samples we intend to see (some samples multiple times, some samples just once).
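
As a rough illustration of what such a list could look like, here is a sketch with a hypothetical helper, `keys_for_epochs`, that is not part of Substra. It assumes the fractional part of the epoch count is served by a prefix of the hash list, which is only one possible choice.

```python
import math


def keys_for_epochs(keys, n_epochs):
    """Hypothetical helper: repeat `keys` for a possibly fractional number of epochs."""
    full_passes = math.floor(n_epochs)
    # Number of extra samples needed for the fractional part of the last epoch.
    remainder = round((n_epochs - full_passes) * len(keys))
    return keys * full_passes + keys[:remainder]


s = ["A", "B", "C", "D"]

# 2.5 epochs: two full passes, then the first half of the samples once more.
plan = keys_for_epochs(s, 2.5)
print(plan)
```

Here "A" and "B" are seen three times while "C" and "D" are seen twice, which is exactly the kind of list the uniqueness check rejects today.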

Thanks again!

Mar 18 '20 12:03 jeandut

My question is very naive, but wouldn't it be equivalent to creating N traintuples each taking as input the full set of data samples s and the trained model from the previous step?

Mar 18 '20 13:03 jmorel

Yes, it would be mathematically equivalent in this case, but it would induce lag because a Docker container is spawned for each traintuple, and the logic would be a bit more complicated.

Mar 18 '20 13:03 jeandut

To reduce the lag to very little, you could use a compute plan to create all the traintuples at once. The Docker images created for the compute plan won't be removed until the end of the compute plan, so spawning a new container is very fast (no need to rebuild between traintuples).
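
The chaining idea can be sketched in plain Python as follows. The dictionary fields (`id`, `data_sample_keys`, `in_model_ids`) and the `build_chained_traintuples` helper are hypothetical and are not the Substra compute plan schema; the sketch only shows one traintuple per epoch, each taking the full set s and the model from the previous step.

```python
def build_chained_traintuples(sample_keys, n_epochs):
    """Hypothetical helper: one traintuple spec per epoch, chained by model.

    Field names are illustrative only, not the Substra schema.
    """
    specs = []
    for epoch in range(n_epochs):
        specs.append({
            "id": f"epoch-{epoch}",
            # Each traintuple sees every hash exactly once, so keys stay unique.
            "data_sample_keys": list(sample_keys),
            # The first traintuple starts from scratch; later ones reuse the previous model.
            "in_model_ids": [f"epoch-{epoch - 1}"] if epoch > 0 else [],
        })
    return specs


plan = build_chained_traintuples(["A", "B", "C", "D"], 3)
for spec in plan:
    print(spec["id"], spec["in_model_ids"])
```

Each spec keeps its hashes unique, so the uniqueness check is satisfied, at the cost of N traintuples instead of one.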

Mar 18 '20 13:03 jmorel

Closing as stale

Sep 01 '22 07:09 Esadruhn