`add`: `.gitignore` is modified when directory-only pattern is already present
Bug Report
Description
dvc add modifies .gitignore when run on a directory, even when .gitignore already contains a directory-only pattern matching said directory.
Reproduce
-
mkdir bug && cd bug -
git init && dvc init -
mkdir data -
echo '/data/' 1>.gitignore -
dvc add data/ -
cat .gitignore
Output is:
/data/
/data
Expected
If .gitignore already contains a pattern ignoring the directory, dvc add should not modify the file. This works fine for files, but not for directory-only patterns in .gitignore.
Environment information
Installed via the deb file on Ubuntu 18.04.
$ dvc doctor
DVC version: 2.10.2 (deb)
---------------------------------
Platform: Python 3.8.3 on Linux-5.4.0-110-generic-x86_64-with-glibc2.14
Supports:
azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
gdrive (pydrive2 = 1.10.1),
gs (gcsfs = 2022.3.0),
hdfs (fsspec = 2022.3.0, pyarrow = 7.0.0),
webhdfs (fsspec = 2022.3.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.3.0, boto3 = 1.21.21),
ssh (sshfs = 2022.3.1),
oss (ossfs = 2021.8.0),
webdav (webdav4 = 0.9.7),
webdavs (webdav4 = 0.9.7)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git
$ dvc add --verbose data/
2022-05-26 11:58:19,640 WARNING: 'data' is empty.
2022-05-26 11:58:19,641 DEBUG: Added '/home/user/bug/data' to gitignore file.
2022-05-26 11:58:19,644 DEBUG: staged tree 'object md5: d751713988987e9331980363e24189ce.dir'
2022-05-26 11:58:19,645 DEBUG: state save (1082445, 99914b932bd37a50b983c5e7c90ae93b, 0) d751713988987e9331980363e24189ce.dir
2022-05-26 11:58:19,648 DEBUG: Computed stage: 'data.dvc' md5: 'None'
2022-05-26 11:58:19,650 DEBUG: Preparing to transfer data from 'memory://dvc-staging/1630ef59ac6c2487dd30e7be446f9be77f70435117f7c2da7b0995baff079d5d' to '/home/user/bug/.dvc/cache'
2022-05-26 11:58:19,650 DEBUG: Preparing to collect status from '/home/user/bug/.dvc/cache'
2022-05-26 11:58:19,651 DEBUG: Collecting status from '/home/user/bug/.dvc/cache'
2022-05-26 11:58:19,651 DEBUG: Preparing to collect status from 'memory://dvc-staging/1630ef59ac6c2487dd30e7be446f9be77f70435117f7c2da7b0995baff079d5d'
2022-05-26 11:58:19,654 DEBUG: state save (1080120, 1653577099652015360, 2) d751713988987e9331980363e24189ce.dir
2022-05-26 11:58:19,655 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 95] Operation not supported
------------------------------------------------------------
Traceback (most recent call last):
File "dvc/fs/utils.py", line 28, in _link
File "dvc/fs/base.py", line 263, in reflink
File "dvc/fs/local.py", line 156, in reflink
File "dvc/system.py", line 112, in reflink
File "dvc/system.py", line 96, in _reflink_linux
OSError: [Errno 95] Operation not supported
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "dvc/fs/utils.py", line 69, in _try_links
File "dvc/fs/utils.py", line 32, in _link
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "dvc/fs/utils.py", line 124, in _test_link
File "dvc/fs/utils.py", line 77, in _try_links
OSError: [Errno 95] no more link types left to try out
------------------------------------------------------------
2022-05-26 11:58:19,656 DEBUG: Removing '/home/user/bug/.fEAYfjTwkgEVFTNGw7Y5iD.tmp'
2022-05-26 11:58:19,656 DEBUG: Uploading '/home/user/bug/.dvc/cache/.DTYbyq7XpBk7uzKbcTakmS.tmp' to '/home/user/bug/.fEAYfjTwkgEVFTNGw7Y5iD.tmp'
2022-05-26 11:58:19,657 DEBUG: Removing '/home/user/bug/.fEAYfjTwkgEVFTNGw7Y5iD.tmp'
2022-05-26 11:58:19,657 DEBUG: Removing '/home/user/bug/.dvc/cache/.DTYbyq7XpBk7uzKbcTakmS.tmp'
2022-05-26 11:58:19,664 DEBUG: state save (1082445, 99914b932bd37a50b983c5e7c90ae93b, 0) d751713988987e9331980363e24189ce.dir
2022-05-26 11:58:19,665 DEBUG: Saving information to 'data.dvc'.
100% Adding...|████████████████████████████████████████████████████████████████████████████████|1/1 [00:00, 28.56file/s]
To track the changes with git, run:
git add data.dvc .gitignore
To enable auto staging, run:
dvc config core.autostage true
2022-05-26 11:58:19,680 DEBUG: Analytics is enabled.
2022-05-26 11:58:19,680 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp354ecba1']'
2022-05-26 11:58:19,681 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp354ecba1']'
Additional Information (if any):
By reading the source of this function, I believe this is either a bug in the git backend, or a problem in the way the path argument is formatted.
I am able to reproduce. Indeed appears to be something related with dulwich ignore:
$ cat .gitignore
/data/
$ git check-ignore data -v
.gitignore:1:/data/ data
$ python
>>> from dulwich import ignore
>>> a = ignore.IgnoreFilter((ignore.read_ignore_patterns(open(".gitignore", "rb"))))
>>> a.is_ignored("data")
>>> a.is_ignored("data/")
True
I can reproduce with the following code:
import os
from tempfile import TemporaryDirectory
from dulwich.ignore import IgnoreFilter
with TemporaryDirectory() as tmp_dir:
os.chdir(tmp_dir)
for name in ["/absolute", "/absolutetrailing/", "relative", "relativetrailing/"]:
with open(".gitignore", "w") as fh:
fh.write(name + "\n")
print(".gitignore contents:")
with open(".gitignore") as fh:
print(fh.read(), end="")
ignore_filter = IgnoreFilter.from_path(".gitignore")
name = name.removeprefix("/").removesuffix("/")
is_ignored = ignore_filter.is_ignored(name.encode("UTF-8"))
print(f"{name=}, {is_ignored=}\n")
.gitignore contents:
/absolute
name='absolute', is_ignored=True
.gitignore contents:
/absolutetrailing/
name='absolutetrailing', is_ignored=None
.gitignore contents:
relative
name='relative', is_ignored=True
.gitignore contents:
relativetrailing/
name='relativetrailing', is_ignored=None
@daavoo Do you have any sense of the fix in dulwich?
@thinety How does it impact you to have multiple entries in .gitignore?
@daavoo Do you have any sense of the fix in dulwich?
Haven't looked in enough detail but most likely has to be fixed here:
https://github.com/jelmer/dulwich/blob/4756814d6c49fc735cb9cc401fd6443995e38f57/dulwich/ignore.py#L43-L78
@dberenbaum Usually, data/ is inside another directory in the project (dir/ for example), and is ignored by the root .gitignore. When running dvc add dir/data/, another .gitignore is created inside dir/, because the root .gitignore contains /dir/data/ and not /dir/data. I need to delete this new .gitignore on every run of dvc add, which is more of an incovenience than a problem. The root file could also be modified, but in general I prefer the explicitness of the pattern /dir/data/ matching only directories.