Verify content hashing is deterministic

Open englehardt opened this issue 5 years ago • 2 comments

See #711 and the related PR #715. We should verify that this has fixed the issue after running a large crawl with the new fix.

englehardt avatar Aug 03 '20 22:08 englehardt

Hi! I am running a large crawl for research purposes and I still see the issue: in the DB I find two different hashes for the same URL. I am interested in comparing JavaScript files across different runs and possibly across different top_level_url values. The hash is a useful data point for identifying the same content hosted at a completely different URL.

As a test, I replaced the code inside digestMessage() in Extension/src/lib/sha256.ts with a different implementation, and the problem persists.

Calling the (custom) hash function on the responseBody returned by ResponseBodyListener in logWithResponseBody() in Extension/src/background/http-instrument.ts resolves the issue for .js files.

I still have other URLs that end up with more than one hash, though. I reproduced and tested this easily on bing and youtube.
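For anyone who wants to check their own crawl for this symptom, here is a minimal sketch of a query that lists URLs recorded with more than one distinct content hash. It assumes the crawl's SQLite database has an `http_responses` table with `url` and `content_hash` columns; the helper name `urls_with_multiple_hashes` and the demo rows are made up for illustration.

```python
import sqlite3


def urls_with_multiple_hashes(conn: sqlite3.Connection) -> list[tuple[str, int]]:
    """Return (url, distinct_hash_count) for URLs seen with more than one hash.

    Assumes an `http_responses` table with `url` and `content_hash` columns.
    """
    query = """
        SELECT url, COUNT(DISTINCT content_hash) AS n_hashes
        FROM http_responses
        WHERE content_hash IS NOT NULL AND content_hash != ''
        GROUP BY url
        HAVING COUNT(DISTINCT content_hash) > 1
        ORDER BY n_hashes DESC, url
    """
    return conn.execute(query).fetchall()


# Demo on an in-memory database with made-up rows (not real crawl data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE http_responses (url TEXT, content_hash TEXT)")
conn.executemany(
    "INSERT INTO http_responses VALUES (?, ?)",
    [
        ("https://example.test/a.js", "hash_1"),
        ("https://example.test/a.js", "hash_2"),
        ("https://example.test/b.js", "hash_3"),
        ("https://example.test/b.js", "hash_3"),
    ],
)
duplicates = urls_with_multiple_hashes(conn)
print(duplicates)
```

Note that a URL can legitimately serve different content between visits, so a hit from this query is only a candidate for the hashing problem, not proof of it.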

Giblin91 avatar Apr 22 '23 15:04 Giblin91

Here is an update after analysing the data from the large crawl I ran:

I was hoping to identify the same script (content) across many websites. To work around the content hash not being reliable, once the data was collected in unstructured storage, I wrote a script to rehash everything and store the result as key-value pairs OWPM_Hash - MY_Hash.

At first this appeared to be a good enough solution, since I could associate each script with MY_Hash and use those values.
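The rehashing step described above can be sketched roughly as follows. This is not the original script: a plain dict stands in for the unstructured store, SHA-256 is an assumed choice for MY_Hash, and the function names `rehash` and `group_by_content` are hypothetical.

```python
import hashlib
from collections import defaultdict


def rehash(stored: dict[str, bytes]) -> dict[str, str]:
    """Map each stored key (OWPM_Hash) to a fresh SHA-256 of the stored bytes (MY_Hash)."""
    return {
        owpm_hash: hashlib.sha256(body).hexdigest()
        for owpm_hash, body in stored.items()
    }


def group_by_content(mapping: dict[str, str]) -> dict[str, list[str]]:
    """Group OWPM_Hash keys that share the same MY_Hash, i.e. identical stored bytes."""
    groups: dict[str, list[str]] = defaultdict(list)
    for owpm_hash, my_hash in mapping.items():
        groups[my_hash].append(owpm_hash)
    return dict(groups)


# Demo with made-up entries; a dict stands in for the real key-value store.
stored = {
    "owpm_hash_1": b"console.log('a');",
    "owpm_hash_2": b"console.log('a');",
    "owpm_hash_3": b"console.log('b');",
}
mapping = rehash(stored)
groups = group_by_content(mapping)
for my_hash, owpm_hashes in groups.items():
    print(my_hash[:12], owpm_hashes)
```

Any group with more than one OWPM_Hash points at byte-identical content that was stored under different keys.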

Recent data analysis showed that even with this approach, the actual content stored in LevelDB differed in some cases. I found two scripts with different hashes whose content, when I navigated to their URLs to review it, was identical, yet both the content hash and "my" content hash were different.
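If the stored bodies were available again, a byte-level comparison could show where two supposedly identical scripts diverge. The sketch below is only a diagnostic idea: `describe_difference` is a hypothetical helper, and the line-ending difference in the demo is purely illustrative, not a claim about the actual cause.

```python
def describe_difference(a: bytes, b: bytes, context: int = 16) -> dict:
    """Report the lengths of two bodies and the first byte offset where they differ.

    `first_diff` is None when the bodies are byte-identical.
    """
    first_diff = next(
        (i for i, (x, y) in enumerate(zip(a, b)) if x != y), None
    )
    # If one body is a strict prefix of the other, they differ where the shorter ends.
    if first_diff is None and len(a) != len(b):
        first_diff = min(len(a), len(b))

    report = {"len_a": len(a), "len_b": len(b), "first_diff": first_diff}
    if first_diff is not None:
        start = max(0, first_diff - context)
        end = first_diff + context
        # Show the raw bytes around the divergence so invisible differences stand out.
        report["context_a"] = a[start:end]
        report["context_b"] = b[start:end]
    return report


# Demo with made-up bodies that look the same when viewed but differ in bytes.
body_a = b"var x = 1;\n"
body_b = b"var x = 1;\r\n"
result = describe_difference(body_a, body_b)
print(result)
```

Printing the raw byte context makes differences visible that a browser view of the script would hide.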

I no longer have the LevelDB to investigate further; I may do an additional crawl targeting this known scenario.

Giblin91 avatar Jul 26 '23 21:07 Giblin91