Feature request: Convert standard dataset into a federated dataset

Open Saipraneet opened this issue 4 years ago • 5 comments

Synthetic federated datasets can be constructed from standard centralized ones by artificially splitting them among clients. This is usually done using a Dirichlet distribution (e.g. Hsu et al. 2019). Such synthetic datasets are very useful since we can explicitly control the total number of users as well as the heterogeneity.

It would be great to have primitives that can automatically convert a standard numpy dataset into a FedJax dataset.
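
For illustration, here is a minimal sketch of the kind of Dirichlet split described above. It uses a common per-class variant (for each label, draw client proportions from a symmetric Dirichlet and cut that label's examples accordingly), which is not necessarily the exact procedure of Hsu et al. 2019. The function name `dirichlet_partition` and its parameters are hypothetical and not part of fedjax.

```python
import numpy as np


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Hypothetical helper: split example indices among clients by label."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        # Shuffle the indices of this class so the split is random.
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        # Fraction of this class that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Turn cumulative fractions into cut points within cls_idx.
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client, chunk in enumerate(np.split(cls_idx, cuts)):
            client_indices[client].extend(chunk.tolist())
    return [np.array(idx, dtype=np.int64) for idx in client_indices]


# Toy labels: 10 classes with 100 examples each.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_partition(labels, num_clients=5, alpha=0.5)
```

In this sketch, `num_clients` sets the number of users and `alpha` sets the heterogeneity: small values concentrate each class on a few clients, while large values approach an even split.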

Saipraneet avatar Sep 14 '21 15:09 Saipraneet

Thanks for filing this! I also think that this will be very useful.

A couple of clarifying questions:

  • What exactly constitutes a "standard numpy dataset"? An iterator of numpy arrays? A tf.data.Dataset? A single numpy array encapsulating the entire dataset (assuming it fits in memory)?

  • When you say "FedJax dataset", does this refer to fedjax.FederatedData?

jaehunro avatar Sep 14 '21 18:09 jaehunro

I think supporting an iterator of numpy arrays would be the most general option. A tf.data.Dataset can be converted using as_numpy_iterator.

> does this refer to fedjax.FederatedData

Yes. The goal would be to be able to use this dataset with the rest of the fedjax framework.
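
As a rough sketch of the iterator-based input mentioned above, the following collects an iterator of `(features, labels)` numpy batches into two arrays, assuming the whole dataset fits in memory. The helper `collect_numpy_iterator` and the generator `toy_batches` are hypothetical names used only for illustration; `toy_batches` stands in for something like a batched tf.data.Dataset passed through as_numpy_iterator.

```python
import numpy as np


def collect_numpy_iterator(batches):
    """Hypothetical helper: concatenate (features, labels) numpy batches."""
    xs, ys = [], []
    for x, y in batches:
        xs.append(np.asarray(x))
        ys.append(np.asarray(y))
    return np.concatenate(xs, axis=0), np.concatenate(ys, axis=0)


def toy_batches():
    """Stand-in for an iterator such as dataset.batch(4).as_numpy_iterator()."""
    for start in range(0, 10, 4):
        x = np.arange(start, min(start + 4, 10), dtype=np.float32).reshape(-1, 1)
        # Label is the parity of the single feature value.
        yield x, (x[:, 0] % 2).astype(np.int32)


features, labels = collect_numpy_iterator(toy_batches())
```

Once the data is in plain arrays like this, any index-based client split can be applied to it before handing the result to fedjax.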

Saipraneet avatar Sep 14 '21 18:09 Saipraneet

Hi, has any work been done for this issue? Is there still a need for it?

More generally, what is the state of this repo? Is it still active? Is there work that needs some contribution? I am more than happy to help.

BaselOmari avatar Oct 25 '22 19:10 BaselOmari

Hi there. Not much work has been done toward checking in a general implementation of this, but it would be nice to have. We still actively use and maintain this repo and would be more than happy to have you contribute!

jaehunro avatar Oct 26 '22 15:10 jaehunro

> Hi, has any work been done for this issue? Is there still a need for it?
>
> More generally, what is the state of this repo? Is it still active? Is there work that needs some contribution? I am more than happy to help.

Have you checked out InMemoryFederatedData? It should be sufficient for creating synthetic datasets in most cases.
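
A minimal sketch of how that might look, assuming `fedjax.InMemoryFederatedData` accepts a mapping from bytes client ids to dicts of numpy arrays. The helper names, the client id format, the `'x'`/`'y'` feature keys, and the toy split are assumptions made for illustration; check the fedjax documentation for the exact constructor arguments.

```python
import numpy as np


def build_client_mapping(features, labels, client_indices):
    """Hypothetical helper: map bytes client ids to per-client numpy arrays."""
    return {
        f'client_{i:04d}'.encode(): {'x': features[idx], 'y': labels[idx]}
        for i, idx in enumerate(client_indices)
    }


def to_federated_data(client_to_examples):
    """Wrap the mapping in fedjax (assumed constructor usage)."""
    # Imported lazily so the helper above works without fedjax installed.
    import fedjax
    return fedjax.InMemoryFederatedData(client_to_examples)


features = np.arange(12, dtype=np.float32).reshape(6, 2)
labels = np.array([0, 1, 0, 1, 0, 1])
# Toy split: even-indexed rows go to one client, odd-indexed rows to the other.
client_indices = [np.array([0, 2, 4]), np.array([1, 3, 5])]
client_to_examples = build_client_mapping(features, labels, client_indices)
```

Calling `to_federated_data(client_to_examples)` would then produce the in-memory federated dataset; the per-client index lists could come from any splitting scheme, such as a Dirichlet-based one.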

kho avatar Oct 26 '22 16:10 kho