dvc icon indicating copy to clipboard operation
dvc copied to clipboard

`add`: `.gitignore` is modified when directory-only pattern is already present

Open thinety opened this issue 3 years ago • 5 comments

Bug Report

Description

dvc add modifies .gitignore when run on a directory, even when .gitignore already contains a directory-only pattern matching said directory.

Reproduce

  1. mkdir bug && cd bug
  2. git init && dvc init
  3. mkdir data
  4. echo '/data/' 1>.gitignore
  5. dvc add data/
  6. cat .gitignore

Output is:

/data/
/data

Expected

If .gitignore already contains a pattern ignoring the directory, dvc add should not modify the file. This works fine for files, but not for directory-only patterns in .gitignore.

Environment information

Installed via the deb file on Ubuntu 18.04.

$ dvc doctor
DVC version: 2.10.2 (deb)
---------------------------------
Platform: Python 3.8.3 on Linux-5.4.0-110-generic-x86_64-with-glibc2.14
Supports:
        azure (adlfs = 2022.4.0, knack = 0.9.0, azure-identity = 1.10.0),
        gdrive (pydrive2 = 1.10.1),
        gs (gcsfs = 2022.3.0),
        hdfs (fsspec = 2022.3.0, pyarrow = 7.0.0),
        webhdfs (fsspec = 2022.3.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.3.0, boto3 = 1.21.21),
        ssh (sshfs = 2022.3.1),
        oss (ossfs = 2021.8.0),
        webdav (webdav4 = 0.9.7),
        webdavs (webdav4 = 0.9.7)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda2
Caches: local
Remotes: None
Workspace directory: ext4 on /dev/sda2
Repo: dvc, git
$ dvc add --verbose data/
2022-05-26 11:58:19,640 WARNING: 'data' is empty.
2022-05-26 11:58:19,641 DEBUG: Added '/home/user/bug/data' to gitignore file.
2022-05-26 11:58:19,644 DEBUG: staged tree 'object md5: d751713988987e9331980363e24189ce.dir'
2022-05-26 11:58:19,645 DEBUG: state save (1082445, 99914b932bd37a50b983c5e7c90ae93b, 0) d751713988987e9331980363e24189ce.dir
2022-05-26 11:58:19,648 DEBUG: Computed stage: 'data.dvc' md5: 'None'
2022-05-26 11:58:19,650 DEBUG: Preparing to transfer data from 'memory://dvc-staging/1630ef59ac6c2487dd30e7be446f9be77f70435117f7c2da7b0995baff079d5d' to '/home/user/bug/.dvc/cache'
2022-05-26 11:58:19,650 DEBUG: Preparing to collect status from '/home/user/bug/.dvc/cache'
2022-05-26 11:58:19,651 DEBUG: Collecting status from '/home/user/bug/.dvc/cache'
2022-05-26 11:58:19,651 DEBUG: Preparing to collect status from 'memory://dvc-staging/1630ef59ac6c2487dd30e7be446f9be77f70435117f7c2da7b0995baff079d5d'
2022-05-26 11:58:19,654 DEBUG: state save (1080120, 1653577099652015360, 2) d751713988987e9331980363e24189ce.dir
2022-05-26 11:58:19,655 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 95] Operation not supported
------------------------------------------------------------
Traceback (most recent call last):
  File "dvc/fs/utils.py", line 28, in _link
  File "dvc/fs/base.py", line 263, in reflink
  File "dvc/fs/local.py", line 156, in reflink
  File "dvc/system.py", line 112, in reflink
  File "dvc/system.py", line 96, in _reflink_linux
OSError: [Errno 95] Operation not supported

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc/fs/utils.py", line 69, in _try_links
  File "dvc/fs/utils.py", line 32, in _link
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "dvc/fs/utils.py", line 124, in _test_link
  File "dvc/fs/utils.py", line 77, in _try_links
OSError: [Errno 95] no more link types left to try out
------------------------------------------------------------
2022-05-26 11:58:19,656 DEBUG: Removing '/home/user/bug/.fEAYfjTwkgEVFTNGw7Y5iD.tmp'
2022-05-26 11:58:19,656 DEBUG: Uploading '/home/user/bug/.dvc/cache/.DTYbyq7XpBk7uzKbcTakmS.tmp' to '/home/user/bug/.fEAYfjTwkgEVFTNGw7Y5iD.tmp'
2022-05-26 11:58:19,657 DEBUG: Removing '/home/user/bug/.fEAYfjTwkgEVFTNGw7Y5iD.tmp'
2022-05-26 11:58:19,657 DEBUG: Removing '/home/user/bug/.dvc/cache/.DTYbyq7XpBk7uzKbcTakmS.tmp'
2022-05-26 11:58:19,664 DEBUG: state save (1082445, 99914b932bd37a50b983c5e7c90ae93b, 0) d751713988987e9331980363e24189ce.dir
2022-05-26 11:58:19,665 DEBUG: Saving information to 'data.dvc'.
100% Adding...|████████████████████████████████████████████████████████████████████████████████|1/1 [00:00, 28.56file/s]

To track the changes with git, run:

    git add data.dvc .gitignore

To enable auto staging, run:

        dvc config core.autostage true
2022-05-26 11:58:19,680 DEBUG: Analytics is enabled.
2022-05-26 11:58:19,680 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmp354ecba1']'
2022-05-26 11:58:19,681 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmp354ecba1']'

Additional Information (if any):

By reading the source of this function, I believe this is either a bug in the git backend, or a problem in the way the path argument is formatted.

thinety avatar May 26 '22 15:05 thinety

I am able to reproduce. Indeed appears to be something related with dulwich ignore:

$ cat .gitignore 
/data/
$ git check-ignore data -v
.gitignore:1:/data/	data
$ python
>>> from dulwich import ignore
>>> a = ignore.IgnoreFilter((ignore.read_ignore_patterns(open(".gitignore", "rb"))))
>>> a.is_ignored("data")
>>> a.is_ignored("data/")
True

daavoo avatar May 27 '22 09:05 daavoo

I can reproduce with the following code:

import os
from tempfile import TemporaryDirectory

from dulwich.ignore import IgnoreFilter

with TemporaryDirectory() as tmp_dir:
    os.chdir(tmp_dir)

    for name in ["/absolute", "/absolutetrailing/", "relative", "relativetrailing/"]:

        with open(".gitignore", "w") as fh:
            fh.write(name + "\n")

        print(".gitignore contents:")
        with open(".gitignore") as fh:
            print(fh.read(), end="")

        ignore_filter = IgnoreFilter.from_path(".gitignore")
        name = name.removeprefix("/").removesuffix("/")
        is_ignored = ignore_filter.is_ignored(name.encode("UTF-8"))
        print(f"{name=}, {is_ignored=}\n")


.gitignore contents:
/absolute
name='absolute', is_ignored=True

 .gitignore contents:
/absolutetrailing/
name='absolutetrailing', is_ignored=None

 .gitignore contents:
relative
name='relative', is_ignored=True

 .gitignore contents:
relativetrailing/
name='relativetrailing', is_ignored=None

dtrifiro avatar May 27 '22 09:05 dtrifiro

@daavoo Do you have any sense of the fix in dulwich?

@thinety How does it impact you to have multiple entries in .gitignore?

dberenbaum avatar May 27 '22 15:05 dberenbaum

@daavoo Do you have any sense of the fix in dulwich?

Haven't looked in enough detail but most likely has to be fixed here:

https://github.com/jelmer/dulwich/blob/4756814d6c49fc735cb9cc401fd6443995e38f57/dulwich/ignore.py#L43-L78

daavoo avatar May 27 '22 15:05 daavoo

@dberenbaum Usually, data/ is inside another directory in the project (dir/ for example), and is ignored by the root .gitignore. When running dvc add dir/data/, another .gitignore is created inside dir/, because the root .gitignore contains /dir/data/ and not /dir/data. I need to delete this new .gitignore on every run of dvc add, which is more of an incovenience than a problem. The root file could also be modified, but in general I prefer the explicitness of the pattern /dir/data/ matching only directories.

thinety avatar May 27 '22 16:05 thinety