backintime icon indicating copy to clipboard operation
backintime copied to clipboard

fix: Sort rsync include/exclude options according to specificity (#561, #1420)

Open DerekVeit opened this issue 1 year ago • 13 comments

This branch is to address issues #561 and #1420 and possibly other circumstances that might arise where the existing strategy for ordering the --include= and --exclude= options to rsync don't quite work.

The existing code has ordered the options in groups. There has been some trial-and-error trying to get that ordering correct where fixing it for one case can easily break another. This proposed fix is to solve it with a more general strategy.

The general principle is that the user would expect more-specific configuration to supersede less-specific. For example, if one rule (include or exclude) applies to "/home" and another rule contradicts that for "/home/user", it is straightforward to assume that the user wants the second rule to apply to "/home/user" and the first rule to apply to anything else in "/home". The rsync command follows its include/exclude options as first-match-wins, so adding the options in order from most to least specific, irrespective of whether they are including or excluding, should give us the correct result. This is accomplished by reverse sorting the paths plus some special treatment for glob patterns and relative paths.

The only change is in Snapshots.rsyncSuffix, and the sorting is done in a new method, Snapshots.pathSelection. (It looks like I should rename this new method in snake case for the project's movement toward PEP 8.)

There are integration-level tests (test_rsync_selections.py) that check various cases, including for #561 and #1420. They call the Snapshots.backup method. For now, the revised rsyncSuffix still uses the original strategy unless a SELECTIONS_MODE attribute of the Config object exists and is set to "sorted". The tests use this to test cases with both strategies, original and sorted.

To make it easy to test various cases of existing file structures and include/exclude configurations, there is a test/filetree.py module for describing file structures with a string, e.g.:

"""
    var/
        log/
            dmesg
            syslog
"""

And the test cases are specified in a text format that uses those strings, e.g. in test/selection_cases. This is to make it easy to add more cases and also to make it very clear what each case is testing.

:exclude-name
    includes
        /var
    excludes
        log
    files_tree
        var/
            lib/
                foo
                log
            log/
                dmesg
                syslog
    expected_tree
        var/
            lib/
                foo

The tests use pytest for parametrize and fixtures, and there are fixtures defined in test/conftest.py for the Config object and the Snapshots object.

For some verbose logging from the test functions there is a simple log function in test/logging.py.

The new code does not yet include handling the EncFS feature. This might be as simple as adding encode = self.config.ENCODE and applying that in the right places as in rsyncInclude and rsyncExclude, but I haven't looked at that closely enough to see what I need for testing it, so I have left it alone so far. In this respect, the new code is definitely not ready. (I know the consideration of removing EncFS is only a maybe-sometime idea for now.)

I'm interested in BiT maintainers' thoughts, questions, and opinions.

SELECTION_MODE is probably only for temporary development testing purposes and can be eliminated, and rsyncSuffix might be simplified. The note I put in the docstring might be addressed or removed.

Based on re-reading CONTRIBUTING.md, I will convert the new tests from pytest to unittest. Let me know if you would prefer that I not add the ddt library for the parameterizing.

The pathSelections method is more sprawling than I would like.

The log function is not essential to anything. I find it useful to have easy, verbose logging from my tests, especially while working on them, but it can be removed.

I have a note in pathSelections suggesting that the GUI somehow alert the user if they include and exclude the exact same path. I think the GUI should refuse to accept the configuration update until the conflict is resolved too.

I thought I had read the CONTRIBUTING guidelines already, but I see some items in those for me to address.

Ironic that I have always used mostly single quotes and just earlier this year forced myself to start using mostly double quotes to be more aligned with what I started to think was a bit more generally favored, e.g. with black. Easy fix, though. I do use black on my new code, interactively, not blindly. It does have a setting to not force the double quotes.

DerekVeit avatar Oct 04 '24 18:10 DerekVeit

Thank you very much Derek.

I see you are using pytest-style tests. This is hard to maintain, hard to understand, to much implicit stuff. If you see some resources it would help a lot if you could rewrite them in vanilla "unittest" style. But it isn't a problem if not. In this case I will rewrite them myself.

The tests on CI are failing because your Python is to "modern". We are still at Python 3.9.

Just to inform the others: Derek opened this PR on my request after offering an approach to solve the underlying problem in the Issue comments.

Myself I am not able to review it in the near feature. But keeping this as a PR will make sure that the work is not lost.

buhtz avatar Oct 04 '24 19:10 buhtz

Yes, as noted somewhere in my long comment above I will rewrite the tests in unittest style. Similarly to my ironically having used the wrong quote style (double, should have been single) in this code, I have mostly used unittest in the past too.

DerekVeit avatar Oct 04 '24 19:10 DerekVeit

Updates since 52c7478e:

  • converted the integration tests from pytest to unittest.
  • added ddt for those tests.
  • added pyyaml to replace my made-up format for the test data.

I still need to convert my other pytest tests and address some other things.

DerekVeit avatar Oct 18 '24 14:10 DerekVeit

I see I have a lot to learn with this PR. Very nice.

buhtz avatar Oct 18 '24 15:10 buhtz

Hello Derek, for your info. Before merging your PR and PR #1897 I will do release 1.5.3. This will take some time because I will do a release candidate first hope to find some testers for it.

After 1.5.3 I can work and test yours and PR #1897. When they are merged I will soon target a 1.6.0 release.

Regards, Christian

buhtz avatar Oct 18 '24 18:10 buhtz

OK. Although the tests pass, I'm still working on this pretty much. I'm pushing it here mainly for visibility of what I'm doing. I'm also probably going to revise my tests to use pyfakefs. And it is still not taking EncFS into account. So I don't see it as ready for merging yet.

DerekVeit avatar Oct 18 '24 18:10 DerekVeit

Hello Derek, Just for your information. I do plan the next release (1.6.0) before Christmas.

How is your schedule? Can I plan your PR for this release (1.6.0) or for the release after it (1.6.1)? No need to rush. I just need to know for the overall planing.

Regards, Christian

buhtz avatar Nov 27 '24 15:11 buhtz

No, unfortunately I have been away from it for all of this month, so it still needs the handling of EncFS, etc. I will be getting back to it soon, though.

DerekVeit avatar Nov 27 '24 18:11 DerekVeit

No, problem. Thank you for reaching out.

buhtz avatar Nov 27 '24 20:11 buhtz

Please note that I updated your branch to the latest dev. So you have to update your local branch.

buhtz avatar Dec 29 '24 09:12 buhtz

Just an idea? Is there a good reason against using rsync --dry-run in the unit tests? OK, the files and dirs need to be created in the real file system. But then we can parse rsync's dry-run output without wasting resources on copy something.

buhtz avatar Apr 30 '25 12:04 buhtz

That's worth considering, but I think that given

  • the need to write real files to begin with
  • small number of files and minimal amount of data
  • efficiency of Linux filesystem cache
  • speed of contemporary drives, i.e. SSD

that the test time saved is probably not worth complicating the tests with the dry-run output handling. I don't mean the difficulty or work of parsing it, which would be trivial, but that that output would be a proxy for correct behavior rather than real thing, which adds another layer of consideration (opportunity for error) when making sure that the tests are testing what we think they're testing, especially around special cases. Really copying the files has the simplicity that the resulting files should be the same as the source files, so that there is an apples-to-apples comparison. And no special knowledge about dry-run behavior is introduced. Also, the dry-run strategy would increase how much the tested code is being run in a "test mode," which I think is best to minimize.

I want to address your comments of a few days ago but will have to get back to you maybe later this week. I do want to finish up this PR too.

DerekVeit avatar Apr 30 '25 13:04 DerekVeit

Agreed. 😄

buhtz avatar Apr 30 '25 14:04 buhtz