spider: do not add seeds to the found list
Overview
Fix zaproxy/zaproxy#7737
Checklist
- [N/A] Update help
- [x] Update changelog
- [x] Run
./gradlew spotlessApplyfor code formatting - [x] Write tests
- [x] Check code coverage
- [x] Sign-off commits
- [x] Squash commits
- [x] Use a descriptive title
For more details, please refer to the developer rules and guidelines.
Please see https://github.com/zaproxy/zap-extensions/pull/4963/checks?check_run_id=17379176664
The changelog should be updated. Better to have some tests.
Note that the robots.txt and sitemap.xml URLs still appear in the SiteMap.
This makes sense if they really exist, but maybe we should not include them if they are 404s and requested by the spider? I've wondered about this for a while.
That doesnt have to be part of this PR, just a comment 😁
Better to have some tests.
maybe I could add a test like
- spiderController.addSeed(uri, ...)
- assert if the uri does not exist in foundURIs.
Does that makes sense?
BTW, since I'm yet to learn more about the ZAP code, I'm not sure at the moment how I can access 'foundURIs' in SpiderControllerUnitTest.Java? @thc202 Could you give me an example?
maybe we should not include them if they are 404s and requested by the spider?
+1 from a user perspective :)
The changelog should be updated.
Could you let me know where the changelog is? Maybe it'd be better if the information could be found in https://github.com/zaproxy/zaproxy/blob/main/CONTRIBUTING.md#guidelines-for-pull-request-pr-submission-and-processing
https://github.com/zaproxy/zap-extensions/blob/main/addOns/spider/CHANGELOG.md
For the record that's in https://github.com/zaproxy/zap-extensions/blob/main/CONTRIBUTING.md#changelog
maybe I could add a test like
- spiderController.addSeed(uri, ...)
- assert if the uri does not exist in foundURIs.
Does that makes sense?
BTW, since I'm yet to learn more about the ZAP code, I'm not sure at the moment how I can access 'foundURIs' in SpiderControllerUnitTest.Java? @thc202 Could you give me an example?
@thc202 or anyone, could I get any info which will be helpful?
In SpiderControllerUnitTest something like:
@Test
void shouldNotNotifySeedAsFoundUri() {
// Given
URI seed = …
// When
spiderController.addSeed(seed, …);
// Then
verify(spider, times(0)).notifyListenersFoundURI(eq(seed), …);
}
although this is not correct (IMO) we should still notify about the seeds used, the referenced issue was about the count of found URIs so the fix should be to ignore seeds when counting.
@jeremychoi do you still plan to finish this?
@kingthorin will try to find time to finish this week.
@kingthorin @thc202 please review.
@thc202 we should still notify about the seeds used => In that case, IMO it should be implemented in a separate way, instead of calling notifyListenersFoundURI() whose name doesn't fit here, because it's not that seeds are found.
Won't this mean that none of the seeds are marked found, though we only want to skip the "artificial" seeds (robots, sitemap, ds_store)?
@kingthorin you're right. would this work if we checked the found-URI belongs to the aritificial seeds?
I'd be good with that.
CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅
I have read the CLA Document and I hereby sign the CLA
recheck
Could anyone help with solving this error?
A failure occurred while executing me.champeau.gradle.japicmp.JApiCmpWorkAction 937 actionable tasks: 99 executed, 311 from cache, 527 up-to-date Detected binary changes. - current: spider-0.11.0.jar - baseline: spider-0.6.0.jar.
@jeremychoi are you able to keep going with this?
@kingthorin Thanks for the heads up. Yes. Sorry for the delay. I'll find time this weekend.
Can we have a clarification of what the final behaviour is going to be?
@thc202
Can we have a clarification of what the final behaviour is going to be?
the file seeds(e.g. robots, sitemap) will not be notified as 'foundURI'.
We are back to the remark in https://github.com/zaproxy/zap-extensions/pull/4963#issuecomment-1784616816.
@thc202 we should still notify about the seeds used, the referenced issue was about the count of found URIs so the fix should be to ignore seeds when counting. => do you mean file seeds should be notified as found URIs?
Please feel free to close this if my last contribution is not good enough. Unfortunately I'm afraid I have no bandwidth to further work on this recently.
We are working on it.
@jeremychoi thanks for getting it started and thanks for letting us know.