Prevent multiple collecting of the same file due to recursive symlinks
Fixes: #624.
The following directory structure will now be collected correctly.
.
├── link -> .
└── test_recursive.py
Hi @bluetech, thanks for the hints. Fair point about the symlink to symlink case, I'll work on it later.
Hi @bluetech, I've looked at the FSCollector._collectfile, but it doesn't seem to be a good fit for this task, as it is responsible for collecting files, not directories. I can't see any other place suitable to filter out the seen directories.
The symlink to symlink case has been fixed and covered by tests.
Given this directory tree
rec/
├── link -> .
├── sub
│ └── test_foo.py
└── test_recursive.py
I tried 3 versions:
Current master:
rec/test_recursive.py .[ 1%]
rec/link/test_recursive.py .[ 2%]
rec/link/link/test_recursive.py .[ 3%]
rec/link/link/link/test_recursive.py . [ 4%]
rec/link/link/link/link/test_recursive.py . [ 6%]
... [snipped] ...
i.e. the catastrophic behavior.
This PR (commit " Remove the hasattr() check "):
rec/test_recursive.py . [ 25%]
rec/link/test_recursive.py . [ 50%]
rec/link/sub/test_foo.py . [ 75%]
rec/sub/test_foo.py . [100%]
Looks like it still collects the files more than once (twice in this scenario).
This patch:
diff --git a/src/_pytest/config/__init__.py b/src/_pytest/config/__init__.py
index 4ff6ce707..62213f27a 100644
--- a/src/_pytest/config/__init__.py
+++ b/src/_pytest/config/__init__.py
@@ -322,7 +322,7 @@ class PytestPluginManager(PluginManager):
self._conftestpath2mod = {} # type: Dict[Any, object]
self._confcutdir = None # type: Optional[py.path.local]
self._noconftest = False
- self._duplicatepaths = set() # type: Set[py.path.local]
+ self._duplicatepaths = set() # type: Set[str]
self.add_hookspecs(_pytest.hookspec)
self.register(self)
diff --git a/src/_pytest/nodes.py b/src/_pytest/nodes.py
index 4c7aa1bcd..b085fca67 100644
--- a/src/_pytest/nodes.py
+++ b/src/_pytest/nodes.py
@@ -595,10 +595,11 @@ class FSCollector(Collector):
keepduplicates = self.config.getoption("keepduplicates")
if not keepduplicates:
duplicate_paths = self.config.pluginmanager._duplicatepaths
- if path in duplicate_paths:
+ resolved_path = str(Path(str(path)).resolve())
+ if resolved_path in duplicate_paths:
return ()
else:
- duplicate_paths.add(path)
+ duplicate_paths.add(resolved_path)
return ihook.pytest_collect_file(path=path, parent=self) # type: ignore[no-any-return] # noqa: F723
result:
rec/test_recursive.py . [ 50%]
rec/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/link/sub/test_foo.py . [100%]
This one only collects once, but has a recursion issue.
So it seems that the _collectfile does work, but needs some fix, if it's possible.
Thanks @bluetech. I've noticed that the double collection was gone from my tests after I handled symlink to symlink case, but it seems that my test case isn't as good as I thought since the issue is still there.
Hi @bluetech, I've changed the _collectfile() method to remember the absolute path with resolved symlinks. That does the trick with collecting a test multiple times but doesn't solve the performance problem.
Here's the output with my changes in _recurse() method commented out:
collected 1 item
test_foo.py . [100%]
=================================== 1 passed in 92.61s (0:01:32) ====================================
With changes in _recurse() in place:
collected 1 item
test_foo.py . [100%]
========================================= 1 passed in 0.19s =========================================
As you see, the difference in execution time is huge. I've tried to profile it - it seems like too many collected directories are causing the slowdown (cProfile snapshots: snapshots.zip).
Basically, because 40 directories were collected, the isfile() method has been called 2024 times which took 67% of the execution time. The _getconftestmodules() method could use some optimization (maybe os.walk() could help here?), but I'm not sure if that's the scope of this PR.
Hi @aklajnert, thanks for the update, I think I see why you changed _recurse now. I'll clear some time to understand the collection code more so I can evaluate the PR better.
Is there a chance you'll be interested to do a little refactoring in this area as well? In particular what you said "maybe os.walk() could help here?" -- I think we should really try to move away from py.path.local's visit methods which are quite hard to follow with all the callbacks. And we can also speed it up by switching to os.scandir instead of the os.listdir which py is using internally.
I imagine we can't use os.walk itself but can do something like this:
- First commit: Import
py'svisitimplementation as-is to_pytest.pathlibas a standalone function, switch all code to use that, make sure all tests pass. - Second commit: simplify the implementation as much as possible -- dropping Python 2 support, dropping unused arguments, stuff like that.
- Third commit: switch it to use
os.scandirand avoid extra syscalls as much as possible (use fields from theos.DirEntryinstead). - Profit!
While I definitely agree with @bluetech's suggestion of moving away from os.path.local.visit, I think this is better done in a separate PR, so it is easier to review the changes which are independent from the original PR.
Hi @bluetech - I'll be happy to work on the refactor, but I agree with @nicoddemus that this should be done in a separate PR. I'll work on it when I find some time (probably within a week or two).
I still haven't dug into the collection code but it occurs to me that the strategy of completely resolving all symlinks and deduplicating them can be problematic. Consider for example the following test tree:
$ tree top/
top/
├── a
│ ├── conftest.py
│ ├── __init__.py
│ └── test_it.py
├── b
│ ├── conftest.py
│ ├── __init__.py
│ └── test_it.py -> ../a/test_it.py
└── __init__.py
With the following file contents:
# top/a/conftest.py
import pytest
@pytest.fixture
def conftest_fixture():
return "a"
# top/b/conftest.py
import pytest
@pytest.fixture
def conftest_fixture():
return "b"
# top/a/test_it.py
def test_it(conftest_fixture):
print(conftest_fixture)
This symlinks the same file to multiple modules, but the outcome is different because module-level fixtures.
Before this PR the output is
$ pytest -s -q top/
a
.b
.
2 passed in 0.02s
After:
$ pytest -s -q top/
a
.
1 passed in 0.02s
This example is a little convoluted but I can imagine real use cases for this and it's seems legitimate.
Hi @bluetech - fair point. The example indeed seems a bit convoluted, but I see it could be useful in some cases.
Maybe the change should just limit to directory symlinks? Or detect recursive symlink? Not sure how to do that without hurting the performance.
Btw, your comment seems a bit broken here, as if you have pasted something twice.
Oops, I fixed the comment now.
Hi @bluetech - I'm sorry that I couldn't work on it earlier. I've tried to rebase and work on it today but I've noticed that you did the refactor mentioned here in #7541.
I also noticed that the recursive symlink problem is now gone in master (the latest release still has it). Not sure it was intentional or accidental, my guess is for the second option. As proof, I've removed all my changes but tests, and they are still passing. Interestingly - the performance problem mentioned in #624 is resolved in the released version but the recursive symlinks is fixed in master.
Now, I'm not sure if the PR should be removed completely, or should we address your observations from the comment above.
I'm sorry that I couldn't work on it earlier. I've tried to rebase and work on it today but I've noticed that you did the refactor mentioned here in #7541.
Oh right I should have mentioned it back here. I did it as part of the effort to remove py.path.local.
I also noticed that the recursive symlink problem is now gone in master (the latest release still has it). Not sure it was intentional or accidental, my guess is for the second option.
Yes I don't remember any change that intentionally changes things here, but presumably it is #7541. I will check this when I get the chance.
Interestingly - the performance problem mentioned in #624 is resolved in the released version but the recursive symlinks is fixed in master.
Maybe related to #5965 / #6523? I procrastinated on this PR over checking what changed in that PR.
As proof, I've removed all my changes but tests, and they are still passing. ... Now, I'm not sure if the PR should be removed completely, or should we address your observations from the comment above.
In light of your comments I will also check what happens in the scenario I described in the comment in master. Depending on that we can merge your tests to solidify that they won't regress.
Hi @bluetech - I've just noticed that this is still open. Should I close it, or you'd like to merge these tests?
Hi @aklajnert, I obviously dropped the ball here and I don't even remember the exact details anymore, sorry about that.
Looking at the tests, they do look good to me (if they do indeed pass in main...). So if you could:
- Rebase them on latest main
- Convert
testdirtopytester(testdiris now soft-deprecated) - See that PR is green
Then I think we can merge this.
@bluetech - seems like the bug is back according to my tests, or I've messed something up. I don't have time to investigate it now, but I'll look into that later.