Process help channel messages
In https://github.com/python-discord/organisation/issues/341 we discussed wanting to store and process messages from help channels, so that we can better investigate how our help channels are used and improve them effectively.
Implementation suggestions (from @Akarys42) are:
- Store content of messages posted in help channels in metricity
- The whole infra is built around data coming from metricity, I think it is the most obvious choice. It won't require too much code change either.
- Both claimant and non-claimant messages should be stored
- Context can be interesting, and I don't see a reason not to. Besides, if we use metricity it has no way of knowing who claimed the channel.
- We should offer an opt-out
- The dataset is made available on request to the admins/staff/whoever is handling it. We reserve the right to deny access to anyone who is not planning on doing anything actually useful with the data and/or doesn't show enough knowledge to be able to do anything with it.
- Messages in help channels are public content. I think it is worth making it public, we may have some good data back from the community. One concern would be does that fit our Privacy policy? (cc @jb3)
Dropped this internally, but here is my list of recommendations from a privacy standpoint:
- opt-out system which erases on opt-out
- controlled access to the dataset (admins + select few others)
- holding for no longer than a month or two
- no user data above a discord user ID
- on message delete on the server we remove from our dataset, even if the user is opted in
- data is held and collected by a separate service
I'm very against putting this information in Metricity since we made promises when we started Metricity that it would never collect message content, and I think that's a good rule to abide by (as well as the fact that other communities use metricity).
My recommendation would be we create a new service which tracks help sessions into a database similar to (I'm sure I'm omitting several things here, but you get the gist):

We should definitely keep track of who claimed the channel and sessions individually since we want to analyse help sessions, not all help channel content bundled together.
I don't think making the dataset public is something we should pursue. We obviously can't erase content from people's local copies once a dataset is made public, and republishing users' content in that way is something we would likely need to make opt-in rather than opt-out, which makes this data somewhat useless.
A couple comments/questions I had:
- Will the preexisting metricity opt out users (if the list still exists) be opted out of this automatically?
- If we made promises not to store messages in metricity, wouldn’t it be a bit scummy to store them in what is effectively metricity, but under a different name?
- Completely agree on not making the dataset public (side note: not even admins need access to be honest. I imagine you’ll want to analyze those messages and extract some statistics which is all anyone would want to see outside of very limited contexts).
-
No, that list is gone as of Sunday 12UTC. The list was fairly small, and in fact the majority of users on the list had left by the time that it was discontinued on Sunday.
-
I don't think it's scummy, this is just clarity. We've said before that Metricity won't store message contents, and I think later altering it so that it does collect them is a bad idea. It's not just that it doesn't meet the goals of the project, but that it's odd to integrate message collection into a service where we have said many times there is none. If we proceed with this we'll be fully clear that this is happening; I'd advise we put out a changelog, and we could even add it to the embeds in the help channels.
-
Yeah, I was thinking Admins would have access by default since they have Metabase access, but I just figured I can scope that database tighter.
This is my idea for now:
- FastAPI microservice
- Bot posts every message in a help channel to this service
- This service manages storing to a database
- opt-out command in the bot, which posts to an opt out endpoint in the api, which erases all data on that user
- users opted out still have their message posted to the API by the bot, the API manages whether to store them in the DB or not
- This removes the need for any logic in the bot itself
- controlled access to the dataset (admins + select few others)
- Admin endpoint to clean data before a set date
- This can be called in a cronjob to auto delete each month
- No user data above a discord user ID
- Deleting a message in the server deletes it from this service too, even if the user is opted in
This data will be used to produce summaries on what topics are asked about, which topics go most unanswered etc.
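A minimal sketch of the storage logic such a service might wrap, covering the points listed above: the API deciding whether to store, opt-out with erasure, deletion on server-side message delete, and the date-based cleanup. It assumes SQLite in place of whatever database is actually chosen and leaves out the FastAPI routing entirely. Every name here (`HelpMessageStore`, the table layout, the method names) is hypothetical and not something agreed in this discussion.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


class HelpMessageStore:
    """Hypothetical storage layer; table and method names are illustrative only."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "message_id INTEGER PRIMARY KEY, "
            "channel_id INTEGER NOT NULL, "
            "author_id INTEGER NOT NULL, "  # Discord user ID only, no other user data
            "content TEXT NOT NULL, "
            "created_at REAL NOT NULL)"  # POSIX timestamp
        )
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS opt_outs (user_id INTEGER PRIMARY KEY)"
        )

    def is_opted_out(self, user_id: int) -> bool:
        row = self.db.execute(
            "SELECT 1 FROM opt_outs WHERE user_id = ?", (user_id,)
        ).fetchone()
        return row is not None

    def store_message(self, message_id: int, channel_id: int, author_id: int,
                      content: str, created_at: datetime) -> bool:
        """Store a message unless its author opted out. Returns True if stored."""
        # The bot posts everything; the opt-out check lives here, not in the bot.
        if self.is_opted_out(author_id):
            return False
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO messages VALUES (?, ?, ?, ?, ?)",
                (message_id, channel_id, author_id, content, created_at.timestamp()),
            )
        return True

    def opt_out(self, user_id: int) -> int:
        """Record the opt-out and erase the user's stored messages."""
        with self.db:
            self.db.execute("INSERT OR IGNORE INTO opt_outs VALUES (?)", (user_id,))
            cur = self.db.execute(
                "DELETE FROM messages WHERE author_id = ?", (user_id,)
            )
        return cur.rowcount

    def delete_message(self, message_id: int) -> None:
        """Mirror a server-side delete, regardless of opt-in status."""
        with self.db:
            self.db.execute("DELETE FROM messages WHERE message_id = ?", (message_id,))

    def purge_before(self, cutoff: datetime) -> int:
        """Delete everything older than the cutoff, e.g. from a monthly cronjob."""
        with self.db:
            cur = self.db.execute(
                "DELETE FROM messages WHERE created_at < ?", (cutoff.timestamp(),)
            )
        return cur.rowcount

    def count(self) -> int:
        return self.db.execute("SELECT COUNT(*) FROM messages").fetchone()[0]


store = HelpMessageStore()
now = datetime.now(timezone.utc)
store.store_message(1, 100, 42, "How do I reverse a list?", now - timedelta(days=45))
store.store_message(2, 100, 43, "Use reversed() or slicing.", now)
store.purge_before(now - timedelta(days=30))  # drops the 45-day-old message
store.opt_out(43)  # erases user 43's message and blocks future storage
```

In this shape the bot only needs to forward the message ID, channel ID, author ID, content and timestamp; the opt-out check and the retention window stay in one place.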
> users opted out still have their message posted to the API by the bot, the API manages whether to store them in the DB or not
> * This removes the need for any logic in the bot itself

TBH I personally am not a fan of this. Would it be that hard to have the bot skip processing entirely if the user has opted out?
I think I'm a fan of the approach where the API checks for opt-out. It's not a major problem.
Both services run in the cluster, and all traffic between services on different nodes is encrypted. I'm satisfied that the data is safe, so I would rather do it the proposed way; it's nice not having to worry about this on the bot.
Gotcha, that makes sense.