Import and Export data from Agents and Neurons
Made it so that one can call the following to export/import data to and from parquet files:
# Export data into a dataframe
ag_his = Ag.export_history(save_to_file=True)
pcs_his = PCs.export_history(save_to_file=True)
Ag.import_history(filename='agent_agent_0_history.parquet')
PCs.import_history(filename='neuron_PlaceCells_history.parquet')
I chose parquet as we need to be able to save lists in a DataFrame, and parquet makes that easy. I have also added an option in the functions so that users can export as a CSV, but the functionality might be limited (import might cause an issue; I'm still testing this).
Solves #120
Hi Mehul, this is a great start! I have a few points I'd like to discuss and potentially resolve:
1. Streamlining Data Export Logic
The most significant aspect of this PR (and not your fault at all) is the current necessity to manually loop over and prepare each history entry. While I understand this is done to ensure each variable is saved as a single column in a CSV, I'm concerned about the repeated logic for Neurons and Agent history and the hardcoding for specific keys.
Proposal: Generic _dict_to_dataframe() utility
I suggest creating a generic utility function, perhaps named _dict_to_dataframe(), that takes any dictionary (like our history dictionaries) and transforms it into a Pandas DataFrame suitable for export. This function would:
- Iterate over the dictionary's keys.
- Dynamically inspect the type and dimensionality of each value (e.g., list, list of lists, NumPy array).
- Handle multi-dimensional variables (like `pos` or `firing_rate`) by flattening them into multiple columns. For example, `pos` could become `pos_0`, `pos_1`, and `firing_rate` could become `firing_rate_0`, ..., `firing_rate_{N-1}`. All other 1D keys would retain their original names (see the sketch after the sidenote below).
Benefits of this approach:
- Maintainability: One centralized function instead of duplicated logic across different parts of the code.
- Flexibility: It won't be limited to the currently saved keys. If users add new variables to history, they might be automatically handled.
- Readability: Decouples the export preparation from the history-saving mechanism.
Sidenote: Take a look at Agent.get_history_arrays(). It converts the history dictionary into a strict dictionary of arrays, which could be a useful first step for _dict_to_dataframe().
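To make the idea concrete, here is a minimal sketch of what such a utility could look like. The name, and the exact flattening rule, are assumptions rather than a final implementation:

```python
import numpy as np
import pandas as pd


def _dict_to_dataframe(history: dict) -> pd.DataFrame:
    """Rough sketch: flatten a history dictionary into a 2D DataFrame.

    1D keys keep their name; multi-dimensional keys (e.g. pos, firing_rate)
    are split into one column per dimension: pos_0, pos_1, ...
    """
    columns = {}
    for key, value in history.items():
        arr = np.asarray(value)  # list / list of lists / ndarray -> ndarray
        if arr.ndim <= 1:
            columns[key] = arr
        else:
            flat = arr.reshape(arr.shape[0], -1)  # keep time on axis 0
            for i in range(flat.shape[1]):
                columns[f"{key}_{i}"] = flat[:, i]
    return pd.DataFrame(columns)
```

Usage would then just be something like `df = _dict_to_dataframe(Ag.history)`.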
2. Importing Exported Data
Following the above, we would ideally need a corresponding import utility that can invert this logic, reconstructing the original data structures from the exported DataFrame. This might be tricky, but I believe it's possible; you've kind of already been writing this anyway. Specifically, if any of the keys have matching prefixes followed by `_0`, `_1`, etc., these could be stacked into an array.
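Something along these lines could work for the inverse. Again, just a hedged sketch; edge cases such as a genuine 1D key that happens to end in `_0` would need more thought:

```python
import re

import numpy as np
import pandas as pd


def _dataframe_to_dict(df: pd.DataFrame) -> dict:
    """Rough sketch of the inverse: re-stack <key>_0, <key>_1, ... columns into arrays."""
    history, grouped = {}, {}
    for col in df.columns:
        match = re.fullmatch(r"(.+)_(\d+)", col)
        if match:
            key, idx = match.group(1), int(match.group(2))
            grouped.setdefault(key, {})[idx] = df[col].to_numpy()
        else:
            history[col] = df[col].to_numpy()
    for key, cols in grouped.items():
        # stack columns in index order -> shape (T, N)
        history[key] = np.stack([cols[i] for i in sorted(cols)], axis=1)
    return history
```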
3. Dependencies and File Format
It looks like new dependencies might be introduced. My preference is to stick primarily to pandas for the DataFrame operations (as you currently do). I also lean towards CSVs as the primary export format, as it aligns with common practices in neuroscience tools (e.g., DeepLabCut) and is generally more novice-friendly. Could the use of parquet somehow be allowed but neither encouraged nor required as a dependency?
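One hedged way to keep parquet optional without adding it to the core dependencies (function name and kwargs here are placeholders, not a proposed API):

```python
import pandas as pd


def _save_dataframe(df: pd.DataFrame, filename: str, fmt: str = "csv"):
    """Sketch: CSV by default, parquet only if an optional engine is installed."""
    if fmt == "parquet":
        try:
            df.to_parquet(filename)  # pandas defers to pyarrow/fastparquet if present
        except ImportError as err:
            raise ImportError(
                "Parquet export needs an optional engine, e.g. `pip install pyarrow`; "
                "otherwise use fmt='csv'."
            ) from err
    else:
        df.to_csv(filename, index=False)
```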
ALTERNATIVE APPROACH: Dual History Maintenance
I'm also open to an alternative idea: modifying Agent.save_to_history() to simultaneously maintain both the existing history dictionary (which is good for plotting, etc.) and a history_dataframe. This history_dataframe would contain the same information but exclusively as 1D variables, added directly to a Pandas DataFrame.
This approach would eliminate the need to retrospectively convert the dictionary to a DataFrame, as both structures would always exist in parallel. However, my current preference still lies with the _dict_to_dataframe() utility, but I wanted to put this alternative out there for discussion.
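For completeness, a toy illustration of what the dual-history idea would look like. This is purely hypothetical scaffolding (class and attribute names are mine, not RiaB code):

```python
import pandas as pd


class AgentSketch:
    """Toy fragment: keep the dict history and a 1D-only DataFrame in parallel."""

    def __init__(self):
        self.history = {"t": [], "pos": []}
        self.history_dataframe = pd.DataFrame(columns=["t", "pos_0", "pos_1"])

    def save_to_history(self, t, pos):
        self.history["t"].append(t)
        self.history["pos"].append(list(pos))
        # same information, but flattened to 1D columns for direct export
        self.history_dataframe.loc[len(self.history_dataframe)] = [t, pos[0], pos[1]]
```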
On a further read and with a better understanding, I think this might be better organised as follows:
In utils, a single function convert_dictionary_to_dataframe() maps any dictionary to a pandas DataFrame. Agents and Neurons then each have a user-facing .convert_history_to_dataframe() and .export_history_to_file() method. The first simply calls the util on its own history; the second calls the first and saves the result (exposing some kwargs for CSV, parquet, or whatever).
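In code, the organisation I have in mind is roughly the following (method bodies are only sketched, and the flattening logic is the one outlined earlier):

```python
import pandas as pd


# In utils:
def convert_dictionary_to_dataframe(history: dict) -> pd.DataFrame:
    """Flattens any history-like dict into a DataFrame (see the earlier sketch)."""
    ...  # flattening logic as sketched above


# On Agent (and analogously on Neurons):
class Agent:
    def convert_history_to_dataframe(self) -> pd.DataFrame:
        return convert_dictionary_to_dataframe(self.history)

    def export_history_to_file(self, filename, fmt="csv", **kwargs):
        df = self.convert_history_to_dataframe()
        if fmt == "parquet":
            df.to_parquet(filename, **kwargs)  # optional engine required
        else:
            df.to_csv(filename, index=False, **kwargs)
```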
Once these are built we can think about an Environment (and Agent) level API which saves all sub-agent (and sub-neuron) level data into a single CSV.
How does this sound? Very open to push back!
1. Streamlining Data Export Logic
The most significant aspect of this PR (and not your fault at all) is the current necessity to manually loop over and prepare each history entry. While I understand this is done to ensure each variable is saved as a single column in a CSV, I'm concerned about the repeated logic for Neurons and Agent history and the hardcoding for specific keys.
Proposal: Generic _dict_to_dataframe() utility
I suggest creating a generic utility function, perhaps named _dict_to_dataframe(), that takes any dictionary (like our history dictionaries) and transforms it into a Pandas DataFrame suitable for export. This function would:
- Iterate over the dictionary's keys.
- Dynamically inspect the type and dimensionality of each value (e.g., list, list of lists, NumPy array).
- Handle multi-dimensional variables (like `pos` or `firing_rate`) by flattening them into multiple columns. For example, `pos` could become `pos_0`, `pos_1`, and `firing_rate` could become `firing_rate_0`, ..., `firing_rate_{N-1}`. All other 1D keys would retain their original names.
Benefits of this approach:
- Maintainability: One centralized function instead of duplicated logic across different parts of the code.
- Flexibility: It won't be limited to the currently saved keys. If users add new variables to history, they might be automatically handled.
- Readability: Decouples the export preparation from the history-saving mechanism.
Sidenote: Take a look at Agent.get_history_arrays(). It converts the history dictionary into a strict dictionary of arrays, which could be a useful first step for _dict_to_dataframe().
I completely agree with this approach. Basically I had two choices:
- either we make an overhaul of the way we save everything in the history (standardising it as far as possible). If we go this way, maybe we should be able to import entire Agents in a nice way (including the params)? Or we limit ourselves to only things related to "poses", in which case we should only export the position and the heading (no velocity), but then a hardcoded solution limiting what users can export would be much more useful.
- or each object is responsible for handling the data structures in its history. In the long term this can be limiting, as users can add a history parameter which won't be automatically exported (a discussion needs to be had on whether we should allow that out of the box anyway). The pro is that, without changing a lot of code, users can import/export the histories (which is the main reason I went for it).
2. Importing Exported Data
Following the above, we would ideally need a corresponding import utility that can invert this logic, reconstructing the original data structures from the exported DataFrame. This might be tricky, but I believe it's possible; you've kind of already been writing this anyway. Specifically, if any of the keys have matching prefixes followed by `_0`, `_1`, etc., these could be stacked into an array.
Yes! I was aiming for a method where users can import/export at will, but maybe we need to limit this too. For example, we only allow exporting positions, and the import functionality automatically populates all the others (using a mini simulation?). This would allow users to import data that does not have velocities, etc.
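If we went down that route, the derived quantities could be back-filled from the imported positions alone, along the lines of the sketch below (illustrative only; RiaB might instead re-run its own kinematics internally):

```python
import numpy as np


def derive_velocities(times, positions):
    """Sketch: estimate velocities by finite differences from imported positions."""
    times = np.asarray(times)          # shape (T,)
    positions = np.asarray(positions)  # shape (T, D)
    return np.gradient(positions, times, axis=0)  # shape (T, D)
```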
3. Dependencies and File Format
It looks like new dependencies might be introduced. My preference is to stick primarily to pandas for the DataFrame operations (as you currently do). I also lean towards CSVs as the primary export format, as it aligns with common practices in neuroscience tools (e.g., DeepLabCut) and is generally more novice-friendly. Could the use of parquet somehow be allowed but neither encouraged nor required as a dependency?
This is a hard one to solve using just CSVs, for multiple reasons, the main one being that CSVs do not support arrays (we need that for the firing rates in Neurons; CSV stores them as strings!), and I do believe that even positions should be an array. Loading and writing arrays in CSV is super slow and the files tend to be very large.
SLEAP/DLC use the HDF5 file format to write data as a matrix.
Also, if we are able to use parquet, it's much easier to make a general utility function to "solve it all"!
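A quick illustration of the round-trip problem being described (assuming pyarrow is installed for the parquet part):

```python
import pandas as pd

# one firing-rate vector (a list) per timestep
df = pd.DataFrame({"t": [0.0, 0.1], "firing_rate": [[0.1, 0.9], [0.2, 0.8]]})

df.to_csv("history.csv", index=False)
print(type(pd.read_csv("history.csv")["firing_rate"][0]))  # <class 'str'> -> needs re-parsing

df.to_parquet("history.parquet")  # requires pyarrow/fastparquet
print(type(pd.read_parquet("history.parquet")["firing_rate"][0]))  # array-like, not a string
```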
ALTERNATIVE APPROACH: Dual History Maintenance
I'm also open to an alternative idea: modifying Agent.save_to_history() to simultaneously maintain both the existing history dictionary (which is good for plotting, etc.) and a history_dataframe. This history_dataframe would contain the same information but exclusively as 1D variables, added directly to a Pandas DataFrame.
I don't think this is necessary as it just eats up RAM. Converting takes very little compute, and once we decide on a format this can be avoided.
I completely agree with this approach. Basically I had two choices:
- either we make an overhaul of the way we save everything in the history (standardising it as far as possible). If we go this way, maybe we should be able to import entire Agents in a nice way (including the params)? Or we limit ourselves to only things related to "poses", in which case we should only export the position and the heading (no velocity), but then a hardcoded solution limiting what users can export would be much more useful.
- or each object is responsible for handling the data structures in its history. In the long term this can be limiting, as users can add a history parameter which won't be automatically exported (a discussion needs to be had on whether we should allow that out of the box anyway). The pro is that, without changing a lot of code, users can import/export the histories (which is the main reason I went for it).
So I'd say that exporting here is a priority over importing. That's definitely the case for most RiaB users who generate data with the package and then study it elsewhere, but not sure about your needs @niksirbi. If that's the case then I think I greatly prefer the centralised method of having one function which converts any dictionary to a dataframe, living outside the specific classes.
I don't think this necessarily precludes importing, which, again, should be done with a centralised utility that loads a CSV, converts it to pandas, then tries to group any columns with the same prefix and passes the result to the Agent / Neurons, who can check that the right variables are present. Anyway, this is unlikely to be used by many people imo, and it overlaps with the existing Agent.import_trajectory() API, which is much simpler but has done the job.
My suggestion would be to prioritise exporting over importing and do this as cleanly as possible in a way which doesn't limit us further down the line.
Yes! I was aiming for a method where users can import/export at will, but maybe we need to limit this too. For example, we only allow exporting positions, and the import functionality automatically populates all the others (using a mini simulation?). This would allow users to import data that does not have velocities, etc.
This is sort of what is achieved by Agent.import_trajectory(). We could therefore just slightly adapt this function to allow importing from a CSV/parquet.
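i.e. something as thin as the following wrapper might be enough (the column names are assumptions about what the export would produce, not existing RiaB conventions):

```python
import pandas as pd


def import_history_file(agent, filename):
    """Sketch: load an exported file and hand it to the existing import_trajectory API."""
    read = pd.read_parquet if filename.endswith(".parquet") else pd.read_csv
    df = read(filename)
    times = df["t"].to_numpy()                     # assumed time column name
    positions = df[["pos_0", "pos_1"]].to_numpy()  # assumed flattened position columns
    agent.import_trajectory(times=times, positions=positions)
```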
This is a hard one to solve using just CSVs, for multiple reasons, the main one being that CSVs do not support arrays (we need that for the firing rates in Neurons; CSV stores them as strings!), and I do believe that even positions should be an array. Loading and writing arrays in CSV is super slow and the files tend to be very large.
SLEAP/DLC use the HDF5 file format to write data as a matrix.
Also, if we are able to use parquet, it's much easier to make a general utility function to "solve it all"!
Tbh I'm a little out of my depth here regarding what the community would be satisfied with and what is best practice (different things, in my experience). However, I am quite happy that RiaB still only has 4 dependencies, so moving to 6 is a big jump. Not against it, just want to know it's worth it. Dumb question: could we just save as .npz?
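For reference, the .npz route would look roughly like this (numpy only, no new dependencies, though it yields arrays rather than a dataframe, and ragged histories would need object arrays/pickle):

```python
import numpy as np
from ratinabox import Agent, Environment

# assuming the usual RiaB setup so that Ag.history gets populated
Env = Environment()
Ag = Agent(Env)
for _ in range(100):
    Ag.update()

history_arrays = {key: np.asarray(val) for key, val in Ag.history.items()}
np.savez_compressed("agent_history.npz", **history_arrays)

loaded = np.load("agent_history.npz")
pos = loaded["pos"]  # arrays come back with their original shapes
```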
So I'd say that exporting here is a priority over importing. That's definitely the case for most RiaB users who generate data with the package and then study it elsewhere, but not sure about your needs @niksirbi. If that's the case then I think I greatly prefer the centralised method of having one function which converts any dictionary to a dataframe, living outside the specific classes.
Regarding our use-case, i.e. loading RiaB-generated Agent trajectories into movement, the only thing we absolutely need is the x, (y, z) positions of the Agent(s) over time. If you export heading as well, I'm confident we can also load that into our data structures, but it's not absolutely necessary. So basically, as long as we can have Agent positions in any file format that can be read into a dataframe, we are good (parquet is fine as well, and may even be preferable for the reasons @mehulrastogi mentioned). I'd only ask that the contents of that file are clearly documented somewhere, so we and others can figure out how to parse it.
As to the wider discussion, the needs of RiaB's users are of primary importance here, so feel free to implement export in any way that's maximally useful to them. We in movement will find a way to extract the info we need from the file, as long as the above criterion is met.
@mehulrastogi does the above sound good to you? shall we go ahead and structure it a little in this direction?