Privacy 2022 queries
Progress on https://github.com/HTTPArchive/almanac.httparchive.org/issues/2891 based on the chapter outline
How websites track (profile) you online
Third-party tracking using WhoTracks.me
- [x] number of websites that have a third-party tracker (any/of a certain category)
- [x] most popular third-party trackers
- [x] number of trackers per website
Cookies
- [x] most common cookies set across websites (including whether they are set by known trackers)
- [x] domains that set cookies most often (including whether these are known trackers)
- [x] most common Max-Age/Expires timings for cookies → see Security chapter
CNAME tracking
- [x] most common CNAME tracking services
- [x] share of pages with CNAME tracking per page rank group
- [x] share of CNAME tracking per TLD
Cookie syncing
- [ ] most common pairs of websites that engage in cookie syncing → too complex to discover
Privacy Sandbox experiments
FLoC
- [x] number of websites accessing the `document.interestCohort` property
  - through string search
- [x] number of websites participating in the FLoC origin trial
- [x] number of websites opting out of FLoC computation
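The FLoC string search and opt-out check can be illustrated with a minimal Python sketch. It is not the BigQuery query from this PR. It assumes opt-out is signalled via a `Permissions-Policy: interest-cohort=()` response header, and the function names are hypothetical.

```python
import re

# An `interest-cohort` directive with an empty allowlist, e.g. `interest-cohort=()`.
OPT_OUT = re.compile(r"(?:^|,)\s*interest-cohort\s*=\s*\(\s*\)", re.IGNORECASE)
# Plain string search for the property name in a script or HTML body.
ACCESS = re.compile(r"document\.interestCohort\b")


def opts_out_of_floc(headers: dict) -> bool:
    """Return True if a Permissions-Policy header disables interest-cohort."""
    for name, value in headers.items():
        if name.lower() == "permissions-policy" and OPT_OUT.search(value):
            return True
    return False


def accesses_interest_cohort(body: str) -> bool:
    """Return True if the body mentions document.interestCohort (string search only)."""
    return bool(ACCESS.search(body))
```

A string search like this also matches mentions in comments or dead code, so it gives an upper bound rather than a count of actual calls.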
FLEDGE
- [ ] number of websites using `navigator.joinAdInterestGroup`, `navigator.leaveAdInterestGroup`, `navigator.updateAdInterestGroups`, `navigator.runAdAuction`
Attribution Reporting
- [x] number of websites participating in the `ConversionMeasurement` origin trial
- number of websites using `attribution*` attributes on `<a>` elements – DOM elements should have been measured as part of a custom metric to catch all sites
- [ ] number of websites having `.well-known/attribution-reporting/report-attribution` or `.well-known/attribution-reporting/trigger-attribution` – was not collected; search references in response bodies instead?
- [ ] number of websites setting an `attribution-reporting` feature policy on an iframe → retrieve from generic Feature-Policy analysis below
Private Click Measurement
- [ ] number of websites using `attribution*` attributes on `<a>` elements – DOM elements should have been measured as part of a custom metric to catch all sites
- [ ] number of websites having `.well-known/private-click-measurement/report-attribution` or `.well-known/private-click-measurement/trigger-attribution` – was not collected; search references in response bodies instead?
Trust Tokens
- [x] number of websites participating in the `TrustTokens` origin trial
- number of websites having `.well-known/trust-token`
- [ ] number of websites accessing `document.hasTrustToken`
SameSite cookie
- [x] number of `SameSite`-unspecified cookies, which will now default to the more private `Lax` on browsers supporting `SameSite`-by-default → see Security chapter
- [x] number of `SameSite=None` cookies, which are likely to be set this way explicitly to allow for state sharing across sites, which could (but may not) be used for tracking → see Security chapter
- [x] possibly mention and link to analysis in Security chapter (not an SQL query)
Fingerprinting
- [x] number of websites that use a fingerprinting library
- [x] most popular fingerprinting libraries
- Wappalyzer category 83 (Browser fingerprinting)
- [x] number of websites accessing device sensors
  - creating event listeners + Blink features usage
  - see The Web's Sixth Sense: A Study of Scripts Accessing Smartphone Sensors
- [ ] most used Web API features and properties commonly related to fingerprinting
Retargeting
Wappalyzer category 77
- [x] number of websites that use a retargeting library
- [x] most popular retargeting libraries
How websites give you a privacy choice
Consent Management Platforms
IAB Transparency Consent Frameworks
- [x] number of websites using any of the IAB frameworks
- [x] number of websites using a specific IAB framework (TCF v1, TCF v2, USP)
- [x] number of websites using a compliant setup for TCF v1/v2
- [x] most popular CMPs for TCF v2
- [x] most commonly disclosed purposes/legitimate interests for TCF v2
- [x] most common publisher countries for TCF v2
- [x] most common consent strings for USP
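To make the USP consent strings in the last item easier to interpret, here is a minimal Python sketch of a parser. It assumes the four-character IAB US Privacy layout (version, notice given, opt-out of sale, LSPA coverage, each flag being `Y`, `N`, or `-`). The names `UspString` and `parse_usp` are hypothetical and not part of this PR.

```python
import re
from typing import NamedTuple, Optional

# One version digit followed by three flags, each Y, N, or "-" (not applicable).
USP_PATTERN = re.compile(r"(\d)([YN-])([YN-])([YN-])")


class UspString(NamedTuple):
    version: int
    notice_given: Optional[bool]
    opted_out_of_sale: Optional[bool]
    lspa_covered: Optional[bool]


def _flag(ch: str) -> Optional[bool]:
    """Map Y/N to True/False; "-" (not applicable) becomes None."""
    return {"Y": True, "N": False}.get(ch)


def parse_usp(value: str) -> Optional[UspString]:
    """Parse a US Privacy string such as "1YNN"; return None if malformed."""
    # Normalizing whitespace and case is a leniency of this sketch.
    match = USP_PATTERN.fullmatch(value.strip().upper())
    if match is None:
        return None
    version, notice, opt_out, lspa = match.groups()
    return UspString(int(version), _flag(notice), _flag(opt_out), _flag(lspa))
```

For example, `parse_usp("1YNN")` describes a user who was given notice, has not opted out of sale, and is not covered by the LSPA, while `1---` yields `None` for all three flags.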
Popular third-party consent management libraries
Wappalyzer category 67
- [x] number of websites using a third-party consent management library
- [x] most popular third-party consent management libraries
Privacy policies
- [x] most common privacy policy link texts
- [x] most common keywords for privacy policies
Transparency & controls for targeting
- [ ] number of websites providing Ads Transparency Spotlight metadata
Do Not Track / Global Privacy Control
- [x] number of websites accessing `doNotTrack` and/or `globalPrivacyControl` properties
  - through string search + `.well-known/gpc.json` + headers
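As an illustration of the string search and the `.well-known/gpc.json` check, here is a minimal Python sketch (not the query used in this PR). It assumes the GPC support resource is a JSON object with a boolean `gpc` member. The function names are hypothetical.

```python
import json
import re

# Property names searched for in script and HTML bodies.
PROPERTY_ACCESS = re.compile(r"\b(doNotTrack|globalPrivacyControl)\b")


def accessed_properties(body: str) -> set:
    """Return which of the two property names appear in the body."""
    return set(PROPERTY_ACCESS.findall(body))


def declares_gpc_support(gpc_json_text: str) -> bool:
    """Return True if a fetched gpc.json body parses and sets "gpc" to true."""
    try:
        doc = json.loads(gpc_json_text)
    except ValueError:
        # Covers HTML error pages served with a 200 status, among other things.
        return False
    return isinstance(doc, dict) and doc.get("gpc") is True
```

Checking the parsed value rather than only the HTTP status avoids counting sites that answer every `.well-known` path with a generic page.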
Keeping your data private (or not)
‘Sensitive’ resources (camera, microphone, geolocation)
- [x] number of websites accessing `mediaDevices` (camera, microphone)/`geolocation` properties or requesting `Permissions-Policy`/`Feature-Policy` status
  - through string search + Blink features usage
- [x] number of websites controlling access to sensitive resources through `Permissions-Policy`/`Feature-Policy` headers
- [x] number of websites controlling access to sensitive resources through iframe `allow` attributes → see Security chapter
- [x] most common directives (i.e., sensitive resources) listed in `Permissions-Policy`/`Feature-Policy` headers
- [x] most common directives (i.e., sensitive resources) listed in iframe `allow` attributes → see Security chapter
- [x] most common directive values listed in `Permissions-Policy`/`Feature-Policy` headers
- [x] most common directive values listed in iframe `allow` attributes → see Security chapter
Hiding browser-related data
Referrer-Policy
- [x] number of websites explicitly setting `Referrer-Policy` on the whole document
  - via header and/or meta tag
- [x] number of websites that use a certain value for `Referrer-Policy`
  - in particular, less privacy-preserving values (`no-referrer-when-downgrade` and `unsafe-url`)
- [ ] number of websites that rely on the browser's default for `Referrer-Policy`
  - note various browser defaults in 2022
- [x] number of websites setting `referrerpolicy` attribute for individual requests
- [x] number of websites setting `rel="noreferrer"` attribute for link relations
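Grouping sites by `Referrer-Policy` value requires first reducing each header to a single effective policy. The Python sketch below shows one way to do that, assuming the header is a comma-separated list in which the last recognized token wins and unknown tokens are ignored. The function names are hypothetical and this is not the query from this PR.

```python
# Policy tokens defined for the Referrer-Policy header.
VALID_POLICIES = frozenset({
    "no-referrer",
    "no-referrer-when-downgrade",
    "same-origin",
    "origin",
    "strict-origin",
    "origin-when-cross-origin",
    "strict-origin-when-cross-origin",
    "unsafe-url",
})

# The values this checklist singles out as less privacy-preserving.
LESS_PRIVATE = frozenset({"no-referrer-when-downgrade", "unsafe-url"})


def effective_referrer_policy(header_value: str):
    """Return the last recognized policy token, or None if none is valid."""
    policy = None
    for token in header_value.split(","):
        token = token.strip().lower()  # lowercasing is a leniency of this sketch
        if token in VALID_POLICIES:
            policy = token
    return policy


def is_less_private(header_value: str) -> bool:
    """Return True if the effective policy is one of the less private values."""
    return effective_referrer_policy(header_value) in LESS_PRIVATE
```

A header such as `no-referrer, strict-origin-when-cross-origin` is a common fallback pattern, so counting only the first token would misclassify it.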
User-Agent Client Hints
- [x] number of websites using User-Agent Client Hints
- [x] most frequently used `Accept-CH` headers
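Ranking `Accept-CH` usage means splitting each header into individual hints before counting. This is a minimal Python sketch of that step, assuming a comma-separated header value and counting each hint at most once per response. The function names are hypothetical.

```python
from collections import Counter


def parse_accept_ch(header_value: str) -> list:
    """Split an Accept-CH value into lowercased hint names, dropping empty tokens."""
    return [token.strip().lower() for token in header_value.split(",") if token.strip()]


def count_hints(header_values) -> Counter:
    """Count how many header values request each hint (once per header)."""
    counts = Counter()
    for value in header_values:
        counts.update(set(parse_accept_ch(value)))
    return counts
```

Lowercasing before counting keeps `DPR` and `dpr` from being reported as separate hints.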
Other types of tracking such as geolocation-as-a-service
Wappalyzer category 79
- [x] number of websites that use a geolocation service/library
- [x] most popular geolocation services/libraries
Data breaches
- [x] number of accounts with PII leaked in data breaches (over time)
  - parse data from Pwned websites (not an SQL query)
HTTPS
- mention and link to analysis in Security chapter (not an SQL query)
@max-ostapenko How are the queries coming along?
@max-ostapenko friendly ping
@bazzadp @rviscomi could you please help append to the `almanac.whotracksme` table and overwrite the `almanac.breaches` table with 2022 data from trackers.csv and breaches.json, respectively?
Done. Steps to reproduce for next time:
For almanac.breaches:
- Download `breaches.json`
- Create a new table `almanac.breaches_2022` via upload, with autodetected schema
- Append the output of this query to `almanac.breaches`:
SELECT
DATE('2022-06-01') AS date,
Name,
Title,
Domain,
BreachDate,
AddedDate,
ModifiedDate,
PwnCount,
Description,
LogoPath,
IsVerified,
IsFabricated,
IsSensitive,
IsRetired,
IsSpamList,
TO_JSON_STRING(DataClasses) AS DataClasses
FROM
`httparchive.almanac.breaches_2022`
Similarly for almanac.whotracksme:
- Download `trackers.csv`
- Upload to `almanac.trackers_2022` temp table
- Append to `almanac.whotracksme` with this query:
SELECT
*
FROM
`httparchive.almanac.trackers_2022`
Finally, clean up temp tables.
@tomvangoethem could you please have a look at the queries and/or the data in the sheets? I'll finish FLoC, Attribution Reporting, and Trust Tokens by the end of the week.
@max-ostapenko @ydimova How possible is it to finish these queries by the end of the week?
@foxdavidj planning to mark as 'Ready for review' today
@max-ostapenko is this ready for review? Never mind, just saw it was.
Are the checkboxes above up to date with what's been implemented?
@foxdavidj yes, the checkboxes are up to date. The queries cover everything for which data was available.
Over the next few weeks I'll test and, where possible, fix the missing data points in the custom metrics repo for future periods.