rename user_id file
Telemetry/analytics are anonymised and can be disabled.
However for unification across Iterative's stack (MLEM, TPI, CML etc.) it's a good idea to put the config in one place. Basically move the existing $CONFDIR/dvc/user_id to $CONFDIR/iterative/telemetry. More explicitly:
import json
from shutil import copy
from pathlib import Path
from appdirs import user_config_dir
# DVC backwards-compatibility
old = Path(user_config_dir("dvc/user_id", "iterative"))
# cross-product path
new = Path(user_config_dir("iterative/telemetry", False))
uid = generate_id() # see above
if uid.lower() == "do-not-track":
return
if new.exists():
uid = new.read_text().strip()
else:
if old.exits():
uid = json.load(old.open())["user_id"]
new.write_text(uid)
@efiop @skshetry - is anyone looking at this? Looks like a small fix, needed to put order in telemetry across products, but no real response here for a long time Can this be assigned and prioritized please? (again, looks really small) If there are outstanding issues please say so
I thought this would come with the new telemetry package instead of us renaming stuff manually. The project hasn't been updated in a long time and not released properly yet https://pypi.org/project/iterative-telemetry/ , so I'm not sure if there is anything actionable that we really need to do here CC @casperdcl
@efiop unifying telemetry into a package reusable by other products seems great ... 😄 BUT Currently DVC uses internal logic - Why would this functional change be blocked by someone picking up iterative-telemetry project? There's also a necessary BC piece here which is specific to the transition to convert from current dvc specific dir to iterative dir
@omesser We can provide backward compatibility when we migrate to the new package. I really can't see any reason to do this intermediate change right now. And it doesn't seem like it is blocking anyone.
- currently, all "new" projects need to create "legacy"
$CONFDIR/dvc/user_idjust in case DVC is subsequently installed - DVC should ergo start writing to the "new"
$CONFDIR/iterative/telemetryasap (reducing the amount of code bloat everywhere) - DVC "properly" migrating to using https://pypi.org/project/iterative-telemetry/ is the ideal goal, I'm fine with skipping (2) above if we can quickly do the proper migration (seems unlikely?)
@casperdcl Makes sense. So telemetry is far away enough that we are doing this manual renaming step in the meantime across producs. I was hoping telemetry would be ready at this point or we would wait for it, but could rename for now, sure...
Any existing projects already using it or older user_id, btw?
I guess we'll look for a new file location and if we don't find it - we'll fall back to the old one and will copy it or create a new one.
Is $CONFDIR/iterative/telemetry format settled now, btw? Also, we follow conventions already and specify author as iterative https://github.com/iterative/dvc/blob/0d71194741f9423d431dfee873005bd1894c0b37/dvc/config.py#L114 , but it is not always used in the resulting path. I see that you've hardcoded iterative/telemetry, but that seems to be breaking conventions. Are we sure we want that specifically?
Ah, alright, I see that you do tweak the user_config_dir already https://github.com/iterative/iterative-telemetry/blob/main/src/iterative_telemetry/init.py#L199
Also the package seems to work, so I would rather partially use it instead of modifying every project to use a new location by hand.
@mike0sv @casperdcl So what's the status of telemetry package? Can we publish it and use it at least partically to retrieve the file location, etc? Seems better than having multiple projects do their own thing by hand. I can also contribute backward-compatibility code there to use old user_id as noted above (this will be useful for years, because some people stick to older dvc version for years).
EDIT: as noted by @omesser there is an issue for it already https://github.com/iterative/iterative-telemetry/issues/13
I would feel comfortable pointing users to a file that is within the repository or bundled in source dist or git-archives. It should be a simple client, do we really need an external project?
@skshetry We'll be using it in multiple projects, so it is easy to point everyone to that one place and handle all feedback/bugs/testing/etc there and not duplicate the code.
@skshetry We'll be using it in multiple projects, so it is easy to point everyone to that one place and handle all feedback/bugs/testing/etc there and not duplicate the code.
Most of the issues that I see with analytics have been with integration, eg: weird interaction with daemon, some issues with pyinstaller and Windows, etc.
It's quite easy to test the other parts of the client. So I don't think it really helps us reduce bugs (i remember only 3-4 minor bugs in last 2.5 years, #4262, #7545, #6407, #3292, #3405).
I find the duplication worth it, given that it makes it easier to audit the package itself and see when and with what it calls home.
@skshetry Auditing is as easy as it was: point to a package with 1 file instead of 1 file in dvc, the difference is very marginal. But the chances of forgetting to modify analytics url or user_id location or something else in one of 3-4-5+ packages is much more error prone. We clearly have some special things like pyinstaller right now (but other packages will probably use it at some point too), but overall it is better to keep everything in one place and maintain it there.
But the chances of forgetting to modify analytics url or user_id location or something else in one of 3-4-5+ packages is much more error prone.
Unittests should easily catch that.
@skshetry Copying unit tests from project to project or creating them from scratch everywhere is suboptimal though, as people will miss things. The whole idea is to consolidate effort and not waste it in every team and suffer from inconsistencies.
The decision to create common telemetry package with common user_id and stuff was joint, it would be really weird if we decide to have our in-house one in dvc just for the fun of it 🙂
@skshetry Copying unit tests from project to project or creating them from scratch everywhere is suboptimal though, as people will miss things. The whole idea is to consolidate effort and not waste it in every team and suffer from inconsistencies.
Inconsistencies can be removed (or will eventually be consistent). I'd say it's worth the (minimal) effort, compared to the things we lose after this, especially control over a sensitive part of DVC.
it would be really weird if we decide to have our in-house one in dvc just for the fun of it 🙂
Not for fun, but for all the reasons that I have given above (sorry don't want to type again, I'm on my phone).
Also I think that having the source in-tree establishes more trust with the users and increases confidence, which is more important for analytics.
@skshetry To your concerns about transparency and user trust - we should make https://github.com/iterative/iterative-telemetry public (see: https://github.com/iterative/iterative-telemetry/issues/30), I don't think that it warrants the duplication personally
Today this package doesn't do much, but the idea is to unify the more important parts of telemetry in the future (not just user_id resolution and config file location). And we see that for every small decision we have 20 opinions, so it makes sense from a process stand point as well.
I believe that our current divergence in various projects (MLEM, GTO, TPI, DVC, VSCODE) showed us that we need this centralized as part of a healthier process around telemetry that situates it as a cross-team effort