Verify content hashing is deterministic
See #711 and the related PR #715. We should verify that this has fixed the issue after running a large crawl with the new fix.
Hi! I am running a big crawl for research purposes and I still see the issue: in the DB I find two different hashes for the same URL. I am interested in comparing JavaScript files across different runs and possibly across different top_level_url values. The hash is a useful data point for identifying the same content hosted at a totally different URL.
As a test, I tried replacing the code inside digestMessage() in Extension/src/lib/sha256.ts with a different implementation and still had the problem.
Calling the (custom) hash function on the responseBody returned by ResponseBodyListener in logWithResponseBody() in Extension/src/background/http-instrument.ts resolves the issue as far as .js files are concerned.
I still have other URLs that show the duplicate-hash problem, though... I reproduced and tested it easily on bing and youtube.
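For anyone who wants to check their own crawl for the same symptom, here is a minimal sketch of how affected URLs could be listed. It assumes the crawl database is SQLite with an http_responses table that has url and content_hash columns; adjust the names if your schema differs. The demo_connection() helper and its rows are made up purely for illustration. Note that a URL can legitimately serve different content over time, so the result is a list of candidates, not proof of a hashing bug.

```python
import sqlite3


def urls_with_multiple_hashes(conn):
    """Return {url: sorted distinct hashes} for URLs recorded with more than one hash.

    Assumes a table http_responses(url, content_hash); these names are an
    assumption about the crawl schema.
    """
    rows = conn.execute(
        """
        SELECT url, content_hash
        FROM http_responses
        WHERE content_hash IS NOT NULL AND content_hash != ''
        GROUP BY url, content_hash
        """
    ).fetchall()

    hashes_by_url = {}
    for url, content_hash in rows:
        hashes_by_url.setdefault(url, set()).add(content_hash)

    # Keep only the URLs that were stored with at least two distinct hashes.
    return {
        url: sorted(hashes)
        for url, hashes in hashes_by_url.items()
        if len(hashes) > 1
    }


def demo_connection():
    """Build a tiny in-memory database with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE http_responses (url TEXT, content_hash TEXT)")
    conn.executemany(
        "INSERT INTO http_responses (url, content_hash) VALUES (?, ?)",
        [
            ("https://example.test/a.js", "h1"),
            ("https://example.test/a.js", "h2"),  # same URL, different hash
            ("https://example.test/b.js", "h3"),
            ("https://example.test/b.js", "h3"),  # same URL, same hash
            ("https://example.test/c.html", None),  # body not saved
        ],
    )
    return conn


if __name__ == "__main__":
    for url, hashes in urls_with_multiple_hashes(demo_connection()).items():
        print(url, hashes)
```

Against a real crawl, replace demo_connection() with sqlite3.connect() on the path to your own database file.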
Here is an update from analysing the data of the big crawl I did:
I was hoping to identify the same script (content) across many websites. To work around the content hash not being reliable, once the data was collected in unstructured storage, I wrote a script to rehash everything and store the results as key-value pairs OWPM_Hash - MY_Hash.
At first this appeared to be a good enough solution: I could correlate a script with MY_Hash and use these values.
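The rehash step can be sketched roughly as follows. This is not the exact script I used, just an illustration of the idea. It assumes the stored content has already been read into a dict keyed by OWPM_Hash (reading from LevelDB is left out), and it uses SHA-256 for MY_Hash, which is also an assumption. The function names are made up.

```python
import hashlib
from collections import defaultdict


def rehash(stored):
    """Map each OWPM_Hash key to a freshly computed hash of its stored bytes.

    stored: dict of OWPM_Hash (str) -> content (bytes), assumed to have been
    read from the unstructured storage beforehand.
    """
    return {
        owpm_hash: hashlib.sha256(content).hexdigest()
        for owpm_hash, content in stored.items()
    }


def group_by_my_hash(mapping):
    """Invert OWPM_Hash -> MY_Hash so identical stored content ends up together."""
    groups = defaultdict(list)
    for owpm_hash, my_hash in mapping.items():
        groups[my_hash].append(owpm_hash)
    return {my_hash: sorted(keys) for my_hash, keys in groups.items()}


if __name__ == "__main__":
    # Made-up example: two OWPM keys pointing at byte-identical content.
    stored = {
        "owpm_a": b"console.log('hi');",
        "owpm_b": b"console.log('hi');",
        "owpm_c": b"console.log('bye');",
    }
    mapping = rehash(stored)
    for my_hash, keys in group_by_my_hash(mapping).items():
        print(my_hash[:12], keys)
```

This only merges entries whose stored bytes are identical. If the stored bytes themselves differ, the recomputed hashes differ too.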
Recent data analysis showed that even with this approach, the actual content stored in LevelDB was different in some cases. I found two scripts with different hashes whose URLs, when I navigated to them to review the content, served identical content, yet both the content hash and "my" content hash were different.
I do not have the LevelDB anymore to investigate further; maybe I will do an additional crawl aimed at the known scenario.