
Privacy 2022 queries

Open max-ostapenko opened this issue 3 years ago • 7 comments

Progress on https://github.com/HTTPArchive/almanac.httparchive.org/issues/2891 based on the chapter outline

How websites track (profile) you online

Third-party tracking using WhoTracks.me

  • [x] number of websites that have a third-party tracker (any/of a certain category)
  • [x] most popular third-party trackers
  • [x] number of trackers per website

Cookies

  • [x] most common cookies set across websites (including whether they are set by known trackers)
  • [x] domains that set cookies most often (including whether these are known trackers)
  • [x] most common Max-Age/Expires timings for cookies → see Security chapter

CNAME tracking

  • [x] most common CNAME tracking services
  • [x] share of pages with CNAME tracking per page rank group
  • [x] share of CNAME tracking per TLD

Cookie syncing

  • [ ] most common pairs of websites that engage in cookie syncing → too complex to discover

Privacy Sandbox experiments

FLoC

  • [x] number of websites accessing the document.interestCohort property
    • through string search
  • [x] number of websites participating in the FLoC origin trial
  • [x] number of websites opting out of FLoC computation

FLEDGE

  • [ ] number of websites using navigator.joinAdInterestGroup, navigator.leaveAdInterestGroup, navigator.updateAdInterestGroups, navigator.runAdAuction

Attribution Reporting

  • [x] number of websites participating in the ConversionMeasurement origin trial
  • number of websites using attribution* attributes on <a> elements – DOM elements should have been measured as part of a custom metric to catch all sites
  • [ ] number of websites having .well-known/attribution-reporting/report-attribution or .well-known/attribution-reporting/trigger-attribution → was not collected; search for references in response bodies instead?
  • [ ] number of websites setting an attribution-reporting feature policy on an iframe → retrieve from generic Feature-Policy analysis below

Private Click Measurement

  • [ ] number of websites using attribution* attributes on <a> elements – DOM elements should have been measured as part of a custom metric to catch all sites
  • [ ] number of websites having .well-known/private-click-measurement/report-attribution or .well-known/private-click-measurement/trigger-attribution → was not collected; search for references in response bodies instead?

Trust Tokens

  • [x] number of websites participating in the TrustTokens origin trial
  • number of websites having .well-known/trust-token
  • [ ] number of websites accessing document.hasTrustToken

SameSite cookie

  • [x] number of SameSite-unspecified cookies, which will now default to the more private Lax on browsers supporting SameSite-by-default → see Security chapter
  • [x] number of SameSite=None cookies, which are likely set this way explicitly to allow state sharing across sites, which could be (but is not necessarily) used for tracking → see Security chapter
  • [x] possibly mention and link to analysis in Security chapter (not an SQL query)

Fingerprinting

  • [x] number of websites that use a fingerprinting library
  • [x] most popular fingerprinting libraries
    • Wappalyzer category 83 (Browser fingerprinting)
  • [x] number of websites accessing device sensors

Retargeting

Wappalyzer category 77

  • [x] number of websites that use a retargeting library
  • [x] most popular retargeting libraries

How websites give you a privacy choice

Consent Management Platforms

IAB Transparency & Consent Frameworks

  • [x] number of websites using any of the IAB frameworks
  • [x] number of websites using a specific IAB framework (TCF v1, TCF v2, USP)
  • [x] number of websites using a compliant setup for TCF v1/v2
  • [x] most popular CMPs for TCF v2
  • [x] most commonly disclosed purposes/legitimate interests for TCF v2
  • [x] most common publisher countries for TCF v2
  • [x] most common consent strings for USP

Popular third-party consent management libraries

Wappalyzer category 67

  • [x] number of websites using a third-party consent management library
  • [x] most popular third-party consent management libraries

Privacy policies

  • [x] most common privacy policy link texts
  • [x] most common keywords for privacy policies

Transparency & controls for targeting

  • [ ] number of websites providing Ads Transparency Spotlight metadata

Do Not Track / Global Privacy Control

  • [x] number of websites accessing doNotTrack and/or globalPrivacyControl properties
    • through string search + .well-known/gpc.json + headers

Keeping your data private (or not)

‘Sensitive’ resources (camera, microphone, geolocation)

  • [x] number of websites accessing mediaDevices (camera, microphone)/geolocation properties or requesting Permissions-Policy/Feature-Policy status
    • through string search + Blink features usage
  • [x] number of websites controlling access to sensitive resources through Permissions-Policy/Feature-Policy headers
  • [x] number of websites controlling access to sensitive resources through iframe allow attributes → see Security chapter
  • [x] most common directives (i.e., sensitive resources) listed in Permissions-Policy/Feature-Policy headers
  • [x] most common directives (i.e., sensitive resources) listed in iframe allow attributes → see Security chapter
  • [x] most common directive values listed in Permissions-Policy/Feature-Policy headers
  • [x] most common directive values listed in iframe allow attributes → see Security chapter

Hiding browser-related data

Referrer-Policy

  • [x] number of websites explicitly setting Referrer-Policy on the whole document
    • via header and/or meta tag
  • [x] number of websites that use a certain value for Referrer-Policy
    • in particular, less privacy-preserving values (no-referrer-when-downgrade and unsafe-url)
  • [ ] number of websites that rely on the browser's default for Referrer-Policy
    • note various browser defaults in 2022
  • [x] number of websites setting referrerpolicy attribute for individual requests
  • [x] number of websites setting rel="noreferrer" attribute for link relations

User-Agent Client Hints

  • [x] number of websites using User-Agent Client Hints
  • [x] most frequently used Accept-CH headers

Other types of tracking such as geolocation-as-a-service

Wappalyzer category 79

  • [x] number of websites that use a geolocation service/library
  • [x] most popular geolocation services/libraries

Data breaches

HTTPS

  • mention and link to analysis in Security chapter (not an SQL query)

max-ostapenko, Jun 14 '22 18:06

@max-ostapenko How are the queries coming along?

foxdavidj, Jul 15 '22 17:07

@max-ostapenko friendly ping

foxdavidj, Jul 29 '22 17:07

@bazzadp @rviscomi could you please help append to the almanac.whotracksme table and overwrite the almanac.breaches table with 2022 data from trackers.csv and breaches.json, respectively?

max-ostapenko, Jul 30 '22 18:07

Done. Steps to reproduce for next time:

For almanac.breaches:

  • Download breaches.json
  • Create a new table almanac.breaches_2022 via upload, with autodetected schema
  • Append the output of this query to almanac.breaches:
SELECT
  DATE('2022-06-01') AS date,
  Name,
  Title,
  Domain,
  BreachDate,
  AddedDate,
  ModifiedDate,
  PwnCount,
  Description,
  LogoPath,
  IsVerified,
  IsFabricated,
  IsSensitive,
  IsRetired,
  IsSpamList,
  TO_JSON_STRING(DataClasses) AS DataClasses
FROM
  `httparchive.almanac.breaches_2022`

Similarly for almanac.whotracksme:

  • Download trackers.csv
  • Upload to almanac.trackers_2022 temp table
  • Append to almanac.whotracksme with this query:
SELECT
  *
FROM
  `httparchive.almanac.trackers_2022`

Finally, clean up temp tables.
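
The append and cleanup steps above can also be expressed as SQL statements rather than done through the console. This is a minimal sketch, not the exact commands that were run. It assumes the temp table names used above, and that almanac.whotracksme has the same columns in the same order as the uploaded trackers_2022 table. The breaches append would follow the same INSERT pattern, wrapped around the SELECT shown earlier.

```sql
-- Append the uploaded 2022 trackers to the shared table.
-- Assumption: both tables have the same columns in the same order,
-- because INSERT ... SELECT * matches columns by position, not by name.
INSERT INTO `httparchive.almanac.whotracksme`
SELECT
  *
FROM
  `httparchive.almanac.trackers_2022`;

-- Clean up the temp tables once the appended rows have been verified.
DROP TABLE IF EXISTS `httparchive.almanac.breaches_2022`;
DROP TABLE IF EXISTS `httparchive.almanac.trackers_2022`;
```

Running the DROP statements only after checking the row counts in the destination tables keeps the uploads available in case an append needs to be redone.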

rviscomi, Jul 31 '22 16:07

@tomvangoethem could you please have a look at the queries and/or the data in the sheets? I'll finish FLoC, attribution and Trust Tokens by the end of the week.

max-ostapenko, Aug 03 '22 04:08

@max-ostapenko @ydimova How feasible is it to finish these queries by the end of the week?

foxdavidj, Aug 09 '22 14:08

@foxdavidj planning to mark as 'Ready for review' today

max-ostapenko, Aug 09 '22 20:08

@max-ostapenko is this ready for review? Never mind, I just saw that it is.

Are the checkboxes above up to date with what's been implemented?

foxdavidj, Aug 12 '22 17:08

@foxdavidj yes, the checkboxes are up to date. The queries cover everything for which data was available.

Over the next few weeks, I'll test and, where possible, try to fix the missing data points in the custom metrics repo so that they are available for future periods.

max-ostapenko, Aug 13 '22 09:08