
PING Privacy Review: Transparency / Auditing

Open kdeqc opened this issue 4 years ago • 5 comments

There is a lot of concern about the black-box nature of a browser controlling the data and determining user grouping. Personally, I think an approach of "trust us, we have people's best interests at heart" really isn't going to fly here. If this is truly about building more privacy protections for end users, then there need to be transparency mechanisms built in for consumers, regulators, and the ecosystem. I don't think every transaction has to be reviewable in the moment that it happens - but the system itself should be auditable.

There is, of course, a valid argument that companies develop browsers to meet their own goals - and that's fair. What I would say here, though, is that if Chrome doesn't add some sort of transparency/auditing capability (or any other browser, if they follow suit), then the feature is just a commercial interest of the browser in question, and should be presented as such.

kdeqc avatar Mar 18 '21 19:03 kdeqc

Hopefully some of these concerns can be alleviated by looking at exactly what the server side can do. The grouping here is done in two phases. First, we use a straightforward SimHash algorithm based on the list of recently visited domains. This is deterministic, takes no special parameters as input, and is done client side with open-source code. To ensure that each group contains a sufficient number of users, we merge neighboring cohorts (those with the same prefix) until they are large enough. The demarcations of the merging boundaries are provided by a server, which keeps track of cohort sizes, and are periodically distributed to the browser. The browser interprets the boundaries and applies them. So while it's true that on the server side there is some black-box work that must be trusted, the most nefarious thing that it could do is set the boundaries for where groups get merged.
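To make the two phases concrete, here is a minimal sketch in Python. It assumes a classic per-bit-vote SimHash over SHA-256 hashes of domain names and a sorted list of server-provided merge boundaries; the function names, the 64-bit width, and the hashing choices are illustrative assumptions, not Chromium's actual implementation:

```python
import bisect
import hashlib

def simhash(domains, bits=64):
    # Classic SimHash: each domain's hash votes +1/-1 on every bit position;
    # the final hash keeps the bits with a positive vote total.
    # (bits is assumed to be a multiple of 8.)
    counts = [0] * bits
    for domain in domains:
        h = int.from_bytes(hashlib.sha256(domain.encode()).digest()[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)

def cohort_for(sim_hash, boundaries):
    # `boundaries` stands in for the sorted merge boundaries the server
    # distributes; the cohort ID is just the interval the hash falls into,
    # so nearby hashes end up in the same merged cohort.
    return bisect.bisect_right(boundaries, sim_hash)
```

The split is visible in the sketch: everything that touches browsing history is deterministic and client side, while the server only supplies the `boundaries` list.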

Finally, the server side will choose to omit certain cohorts that have been flagged as being too strongly correlated to sensitive sites. We'll publish how this analysis is done. This too is a black-box operation, but the worst that it can do is omit cohorts.
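As a hedged illustration of what that constraint looks like, here is a sketch assuming the server flags any cohort whose rate of visits to sensitive sites deviates too far from the population-wide rate; the rate inputs, the threshold, and the function names are assumptions for illustration, not the published analysis:

```python
def blocked_cohorts(cohort_sensitive_rates, population_rate, threshold=0.1):
    # Server side (assumed): flag cohorts whose sensitive-site visit rate
    # is far from the overall population rate.
    return {
        cohort
        for cohort, rate in cohort_sensitive_rates.items()
        if abs(rate - population_rate) > threshold
    }

def reported_cohort(cohort, blocked):
    # Client side: a blocked cohort is simply not reported, which is why
    # the worst the server can do via this channel is omit cohorts.
    return None if cohort in blocked else cohort
```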

Since these server-side capabilities are constrained by their interpretation in the client code, which is specified and open source, we feel that the clustering is transparent.

jkarlin avatar Mar 18 '21 19:03 jkarlin

Hi @jkarlin, could you expand a little more on what the inputs to the simhash algorithm are, or point me to where I can read more about it? From FLoC's whitepaper, my guess is that every website (URL, or URL plus some content?) would be categorized via the Content Categories API, and each user would be assigned a feature vector that is the mean of each visited website's weighted category vector. Is that right? How many features will this feature vector have? The Content Categories API can return up to 620 different categories, so is it safe to assume the vector will be of length 620? And when, or how frequently, will each website's categorization be updated?
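For concreteness, the construction hypothesized in this question would look something like the sketch below; the 620-dimensional vectors and the `category_vector` lookup are taken from the question's own guess, not from anything FLoC is confirmed to do:

```python
import numpy as np

NUM_CATEGORIES = 620  # assumed size, per the Content Categories API guess above

def user_feature_vector(visited_sites, category_vector):
    # Hypothesized construction: average each visited site's weighted
    # category vector into one per-user feature vector. `category_vector`
    # is a hypothetical lookup returning a length-620 weight vector.
    vectors = [category_vector(site) for site in visited_sites]
    return np.mean(vectors, axis=0) if vectors else np.zeros(NUM_CATEGORIES)
```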

millengustavo avatar Mar 23 '21 12:03 millengustavo

Right now the inputs are just the domains of the sites visited in the last several days. See FlocId::SimHashHistory(). No categories, though we're considering changing from domains to categories in the future.
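Reusing the hedged `simhash`/`cohort_for` sketch from earlier in the thread, the domain-only input described here would amount to something like this (the domain list and boundary values are made-up placeholders):

```python
# Domain-only input: no categories, just the domains visited in the last
# several days (example values; boundaries are illustrative placeholders).
recent_domains = ["example.com", "news.example.org", "shop.example.net"]
cohort_id = cohort_for(simhash(recent_domains), boundaries=[2**60, 2**62, 2**63])
```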

jkarlin avatar Mar 23 '21 13:03 jkarlin

the most nefarious thing that it could do is

but the worst that it can do is

^ Famous last words

we feel that the clustering is transparent

To me, and I'm sure many others, these statements do nothing but reinforce the hollow Google motto of "we know what we're doing, we have your best interests in mind, trust us".

dczysz avatar Apr 14 '21 17:04 dczysz

Finally, the server side will choose to omit certain cohorts that have been flagged as being too strongly correlated to sensitive sites. We'll publish how this analysis is done. This too is a black-box operation, but the worst that it can do is omit cohorts.

Since these server-side capabilities are constrained by their interpretation in the client code, which is specified and open source, we feel that the clustering is transparent.

@jkarlin

If user data is being sent to a server and sensitivities are analyzed there, then that server needs to be open source as well, even if the data is encrypted, and even if it is only cohort data. Publishing a whitepaper on how the analysis is done is still very much 'trust us', because we'd have no way to validate it. After all, Google is not trusting other groups to run the server, or even to look at the server, so by the same logic it would not be sensible to trust Google's say-so on a whitepaper.

There are censorship concerns, discrimination concerns, and other issues in the sensitivity calculation alone, and that is definitely something that needs to be public, not a black box. Even setting sensitivity aside, the server is processing data in a way that is antithetical to FLoC's ideal of cross-domain information never touching a server, so it should be heavily audited.

Do you disagree, or have I misunderstood anything?

TheMaskMaker avatar Apr 23 '21 19:04 TheMaskMaker