Event Logging API / Support
For research projects we should support logging events that happen in Snap!.
This is a mix of Snap! support and a back-end API that accepts log messages and places them somewhere.
Cloud Responsibilities
Some things I think we need on the back end, but this all needs to be scoped.
- a way to define whether a session should be logged. What ways might we want to log?
  - Possible scenarios:
    - Everything a particular user (or set of users) does
    - Everything done by anyone who edits/remixes a particular project
    - Specific users and specific projects
    - Client IP address? (e.g. anyone on a campus subnet)
- a way to define a "research project"
  - this should have an owner, or possibly co-owners
  - probably the ability to define parameters of logging
  - notes and any other metadata
- A single endpoint which takes a log entry and forwards it somewhere
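To make the scoping questions above more concrete, here is a minimal sketch of what a "research project" record and a "should this session be logged?" check might look like. Everything in it is an assumption for discussion: the `ResearchProject` class, its field names, and the `require_user_and_project` switch are hypothetical, not an existing SnapCloud schema.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class ResearchProject:
    """Hypothetical research project record; all field names are assumptions."""
    name: str
    owners: list                      # owner plus any co-owners
    usernames: set = field(default_factory=set)      # users to log
    project_ids: set = field(default_factory=set)    # projects to log
    subnets: list = field(default_factory=list)      # e.g. a campus subnet
    require_user_and_project: bool = False           # "specific users AND specific projects"
    notes: str = ""

    def should_log(self, username, project_id, client_ip):
        """Return True if an event from this session falls inside the study."""
        user_match = username in self.usernames
        project_match = project_id in self.project_ids
        if self.require_user_and_project:
            return user_match and project_match
        ip_match = any(
            ip_address(client_ip) in ip_network(net) for net in self.subnets
        )
        return user_match or project_match or ip_match


study = ResearchProject(
    name="demo study",
    owners=["alice"],
    usernames={"student1"},
    project_ids={42},
    subnets=["10.0.0.0/8"],
)
print(study.should_log("student1", 1, "192.168.1.5"))
```

Whether the scenarios should combine with OR or AND is exactly the kind of thing that still needs scoping; the flag here is just one way to express both.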
Architecture
I think it's important this be separate from the main DB, and maybe the particular endpoint could bypass Lapis for perf reasons, if necessary.
I think 1 endpoint that we host that's a forwarding endpoint is probably best, since it minimizes the total number of sites that data could be logged to. It would offer the most protection if the backend were to actually validate whether an event should be logged before it's forwarded, though that's probably not necessary.
- Should there be 1 centralized SnapCloud store?
  - This kinda puts cost on us, but makes a stronger case for Snap! as a research platform, and enables us to potentially do cross-project analysis / re-use the data.
  - What would this datastore be? General object store?
  - If so, how do researchers export only their data?
  - If not, how do we define where data goes? Is there a single protocol we could use?
    - Probably we can just agree on a GET/POST format, probably with an auth bearer.
I suspect we have low enough volume that we have dozens of options for log storage... A separate pg instance is probably a decent option.
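As a rough illustration of the "agree on a POST format with an auth bearer" idea, the sketch below builds (but does not send) such a request using only the Python standard library. The URL, the token, and the entry fields (`user`, `project_id`, `event`, `time`) are all placeholders I made up; the real format would need to be agreed on.

```python
import json
import urllib.request


def build_log_request(endpoint, token, entry):
    """Build (but do not send) a POST request carrying one log entry as JSON."""
    body = json.dumps(entry).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


entry = {
    # Hypothetical fields; not an agreed-upon schema.
    "user": "student1",
    "project_id": 42,
    "event": "block.drop",
    "time": "2020-01-01T00:00:00Z",
}
request = build_log_request(
    "https://logs.example.org/events",  # placeholder URL
    "example-token",                    # placeholder bearer token
    entry,
)
# urllib.request.urlopen(request) would actually send it.
```

A forwarding endpoint on our side could receive the same body from Snap! and re-issue it to whichever destination a research project has configured.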
@thomaswp @brollb
Since you two have done some logging before -- I'm wondering if you could give some background on how much logging you have typically done/seen. How many events/sec would a typical student generate and how large would those events be?
Can be total ballpark figures... I'm just trying to think about if we can actually support half a dozen research projects all at once. 😄
In iSnap, a class of ~50 students working for ~5 weeks (6 assignments + a project) will generate ~500K logs, which amounts to 1GB uncompressed. As for frequency, that's configurable. I have it set to log at most once per 3s. It's a tradeoff between traffic and avoiding data loss if the browser is abruptly closed. Regardless, the number of actual rows is the same, and that's determined by how many edits students make per second. I think ~1/s on average is pretty normal.
A few things to keep in mind there:
- Because of my small N, I'm logging in the easiest way possible, which is to log everything, including full XML snapshots (not diffs) after each edit.
- I do strip out all media files before sending XMLs to the server.
- I believe NetsBlox logs edits/diffs, rather than full snapshots, which probably saves some space, but may have some downsides for analysis.
- This data is highly redundant: that same 1GB takes up only 50MB compressed. That's probably an upper bound on the data savings you'd get from doing diff/edit-based logging.
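For a quick sense of scale, the figures above work out as follows. This is back-of-the-envelope arithmetic only: it treats 1GB as 10^9 bytes, and the "six classes at once" scenario is an assumption that every research project looks like this one class.

```python
# Figures quoted above for one iSnap class.
LOGS = 500_000
RAW_BYTES = 1_000_000_000        # ~1GB uncompressed (assuming decimal GB)
COMPRESSED_BYTES = 50_000_000    # ~50MB compressed
STUDENTS = 50


def estimate(classes):
    """Scale the single-class figures to `classes` similar classes."""
    return {
        "bytes_per_log": RAW_BYTES / LOGS,
        "compression_ratio": RAW_BYTES / COMPRESSED_BYTES,
        "logs_per_student": LOGS / STUDENTS,
        "raw_gb_total": classes * RAW_BYTES / 1e9,
        "compressed_mb_total": classes * COMPRESSED_BYTES / 1e6,
    }


# Assumption: half a dozen projects, each the size of this class.
print(estimate(6))
```

So each full-snapshot log is roughly 2KB, compression buys about 20x, and six such classes would be around 6GB raw or 300MB compressed.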
P.S. I believe the Blackbox dataset, which has been logging way more data than Snap likely will for the past 5 years, is only at ~2TB. So if you're just logging source (no media), that's pretty cheap, relatively speaking.
One more note: If you're curious about how to log your data (what to include, how to export it, etc.), I've been part of a sizeable group of researchers working to develop a standard called ProgSnap2. See the:
- Paper describing the spec
- Specification (http://bit.ly/ProgSnap2)
Thank you! For a single class that’s manageable. 1/s even scaled is manageable load for us, though it’s data storage that gets tricky (especially since there’s essentially 0 funding right now.)
But, I think we can do something pretty easy with a dead simple API and S3. I’m definitely looking into ProgSnap2. Certainly it would be nice to have a common format.
Do you ever need to check whether to enable tracking on a per-user basis? Or if people opt out, do you just not use their data?
We usually just remove user data after the fact if they opt out (in part because users can withdraw consent later if they want). I imagine it would be pretty easy though - just have a client- (and probably server-) side check to make sure the user has consented before logging.
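Both approaches are simple to sketch. The snippet below assumes each logged event is a dict with a `user` key and that consent is tracked as a plain set of usernames; both are hypothetical simplifications, not how iSnap stores things.

```python
def filter_consented(events, consented_users):
    """Check before logging: keep only events from users who have consented."""
    return [event for event in events if event["user"] in consented_users]


def purge_user(events, username):
    """Remove after the fact: drop every event from a user who opted out."""
    return [event for event in events if event["user"] != username]


events = [
    {"user": "student1", "event": "block.drop"},
    {"user": "student2", "event": "green.flag"},
    {"user": "student1", "event": "block.delete"},
]

# Up-front check (the same function could run on the client and the server).
kept = filter_consented(events, consented_users={"student1"})

# Later withdrawal of consent.
remaining = purge_user(events, "student1")
```

The after-the-fact purge is needed either way, since consent can be withdrawn later.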
@emansishah
Random idea: If logging is a flag on a project, remixing the project should propagate the flag. (Could also be in the XML...)
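A minimal sketch of that propagation, assuming a hypothetical `Project` record with a `logging_enabled` flag (none of these names come from the actual Snap! cloud schema):

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Project:
    """Hypothetical project record; field names are assumptions."""
    id: int
    owner: str
    name: str
    logging_enabled: bool = False
    remixed_from: Optional[int] = None


def remix(original, new_id, new_owner):
    """Copy a project for a new owner; logging_enabled is carried over unchanged."""
    return replace(original, id=new_id, owner=new_owner, remixed_from=original.id)


starter = Project(id=1, owner="teacher", name="Assignment 1", logging_enabled=True)
student_copy = remix(starter, new_id=2, new_owner="student1")
print(student_copy.logging_enabled)
```

Storing the flag in the project XML instead would give the same effect without any server-side bookkeeping, since the remix copies the XML anyway.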
Yeah, as @thomaswp said, we log edits in NetsBlox so they are less redundant than full snapshots but would need to be reconstructed if analyzing arbitrary "snapshots". Another perk is that edits are saved when they occur (rather than on a standard interval). Replay data is actually saved in the project xml as well as on the server (required for collaborative editing). Saving the replay data in the project can actually be disabled in the project settings. It is nice to have the creation data stored within the project as it doesn't add any complexity for project submission, etc. I could imagine making a teacher dashboard which takes the submitted student project and enables them to easily inspect previous versions or whatever aspect of the project creation they care about.
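To illustrate what "reconstructing" a snapshot from edits involves, here is a toy replay. The edit format (`add`/`remove` of blocks keyed by id) is invented purely for illustration and is much simpler than what NetsBlox actually records.

```python
def apply_edit(state, edit):
    """Return a new state with one edit applied (toy add/remove edit model)."""
    new_state = dict(state)
    if edit["type"] == "add":
        new_state[edit["block_id"]] = edit["block"]
    elif edit["type"] == "remove":
        new_state.pop(edit["block_id"], None)
    else:
        raise ValueError(f"unknown edit type: {edit['type']}")
    return new_state


def snapshot_at(edits, index):
    """Rebuild the project state after the first `index` edits by replaying them."""
    state = {}
    for edit in edits[:index]:
        state = apply_edit(state, edit)
    return state


edits = [
    {"type": "add", "block_id": "b1", "block": "move 10 steps"},
    {"type": "add", "block_id": "b2", "block": "turn 15 degrees"},
    {"type": "remove", "block_id": "b1"},
]
print(snapshot_at(edits, 2))
print(snapshot_at(edits, 3))
```

Getting an arbitrary snapshot costs a replay from the start (or from a cached checkpoint), which is the analysis overhead compared with storing full snapshots.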
That said, it is also worth thinking about what actions you would like to log. About a year ago, we added a number of other user actions to our logging including green flag clicks, executing individual blocks, etc (https://github.com/NetsBlox/Snap--Build-Your-Own-Blocks/issues/429).
I still have to look at ProgSnap2 in more detail but I am generally a fan of developing a common spec :)
Another perk of saving the edits in the project is that a flaky internet connection won't result in missing edits on the server (if some of the edits occur while the connection cuts out); the edits will be saved as long as the project is successfully saved.