Audit an append-only mode repo to make sure the client was well behaved
This idea is an attempt to detect/mitigate the following attack scenario involving append-only mode:
Say you're using append-only mode to protect your repo from an untrusted and hacked client deleting / changing your old archives. In addition, you periodically connect to the repo from a special trusted client in normal mode, and issue some prune commands.
Currently, if you do this, the trusted client will happily compact the repo as soon as it tries to write, permanently applying all the untrusted client's bad changes. So the administrator would have to vet the repo contents before compaction happens.
How could an automated qualification work?
The repo itself is a log-like structure, i.e. a sequence of operations like:
put(1, A), commit, put(1, B), commit
The current value of object 1 is now B, but we can still see it was A before, until compaction happens and removes the superseded first put().
Untrusted clients are forced into append-only mode by the server, so we can see their log of operations. The trusted admin client is not append-only, and its log is removed as soon as compaction runs.
PUT to id 0: these are manifest updates; we can check that archives are only appended and that previously present archives are still the same (name, date, id). Special case: overwriting a .checkpoint archive of the same name.
PUT to other ids:
- id is new: likely legitimate, done by "borg create"
- id is not new: this means we overwrite already present data with new (different?) data. "borg recreate" legitimately does this when recompressing chunks, but otherwise it should not happen. Overwriting an existing id could perhaps also be denied by the server in append-only mode.
DEL: should not happen; the only exceptions are .checkpoint archives (? CHECK THIS). A well-behaved borg client should reject high-level "borg delete" and "borg prune" commands in append-only mode.
This stuff could be a "borg qualify" ("borg audit"?) command; it would exit with success if all looks OK, or emit warnings/errors and exit with a warning/error status if not. In the latter case, it should also give the last known-good transaction number.
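Borg exposes no such operation log today, but if it were ever exported in some textual form, the rules above could be checked roughly like this (the log format and the audit_oplog helper are purely illustrative, not a real interface):

```shell
# Sketch: apply the PUT/DEL rules to a hypothetical textual op log with
# lines like "PUT <id>" / "DEL <id>". Borg has no such export; this only
# illustrates the checks described above (manifest special cases omitted).
audit_oplog() {
    seen=""
    while read -r op id; do
        case $op in
        PUT)
            case " $seen " in
            # overwriting an existing id: only "borg recreate" should do this
            *" $id "*) echo "suspicious: overwrite of existing id $id" ;;
            *) seen="$seen $id" ;;
            esac ;;
        # DELs should not appear at all in an append-only log
        DEL) echo "suspicious: DEL of id $id" ;;
        esac
    done
}
```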
Additionally, the trusted client likely wants to run "borg check" to assert the integrity of the archive (like discovering bad sectors or other corruption / inconsistency).
Optionally, the trusted client could make added archives pass a set of tests (like checking for expected content: is it there, can it be extracted, can it be used?).
This way, a user who worries about their untrusted clients being hacked can feel safe running a prune command automatically on their trusted client with a cron job: they'll still have old backups from before the hack available, so long as they detect it in time.
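On the trusted client, such a cron job could gate the prune on the audit passing. Since the proposed "borg qualify"/"borg audit" command does not exist, plain borg check stands in for it in this sketch (the retention options are just example values):

```shell
#!/bin/sh
# Sketch: only prune when the repo passed an integrity check first.
# "borg audit" is hypothetical, so borg check is used as a stand-in.
safe_prune() {
    repo=$1
    borg check "$repo" || { echo "audit failed, not pruning" >&2; return 1; }
    borg prune --keep-daily 7 --keep-weekly 4 "$repo"
}
```

A crontab entry like `0 3 * * * /usr/local/bin/safe_prune /path/to/repo` would then prune automatically, but only after the check succeeded.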
(by tw based on nadelle's post)
Dealing with .checkpoint archives (in the manifest, and the DELs related to them) makes this quite a bit harder than it would be if:
- there were no checkpoint archives at all (not great if your connection is unstable), or
- we did not delete checkpoint archives (? CHECK THIS) but only replaced them in the manifest (borg prune can get rid of them on the trusted client).
#1772 discusses alternative approaches for a safe backup mode.
- Diff borg list against previous copy, should only have added lines (lines include archive ID)
- Run check to verify integrity
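The diff step might look like this; audit_lists and the snapshot file names are made up for illustration, and the borg list format string is just one plausible way to include the archive IDs:

```shell
# Sketch: succeed only if the current listing merely ADDS lines compared
# to the previous one, i.e. no archive line disappeared or changed.
audit_lists() {
    sort "$1" > "$1.sorted"
    sort "$2" > "$2.sorted"
    # comm -23 prints lines only present in the first (previous) listing:
    # exactly the archives that were deleted or modified since last time.
    missing=$(comm -23 "$1.sorted" "$2.sorted")
    rm -f "$1.sorted" "$2.sorted"
    [ -z "$missing" ]
}

# Possible usage (the format string is an assumption, adjust as needed):
#   borg list --format '{archive} {id}{NL}' repo > list.cur
#   audit_lists list.prev list.cur && mv list.cur list.prev
```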
What am I missing?
Edit: I dislike furthering reliance on Repository internals in other parts of the code base.
@enkore I guess you mean a diff between the current and a previous borg list repo output, plus borg check --verify-data repo.
yeah, I guess that would do it too (because the archive id authenticates the metadata chunks list, which authenticates the metadata chunks, which authenticate the metadata and the data chunks).
borg check --verify-data is very expensive, so this approach does not really scale well as is. At least some way of only checking what changed would be great for scalability. (Denying puts to already existing ids and having the server keep a list of added ids? Chunk verification could then also be done on only a fraction of the chunks.)
@textshell yes, there could be some checkpoint/bookmark feature for long running stuff like borg check [--verify-data] for scalability.
For security against attackers, this would only be needed to detect chunk content that does not correspond to its chunk_id (overwriting existing chunks in the repo with bad data could be used to corrupt old backups). We can get that much cheaper by just having the server deny such overwrites. Then we need --verify-data only to find fs or hw problems on the repo server.
OK, so I guess we reduced this ticket to "introduce a do-not-overwrite-existing-chunks repo server mode".
Basically protected (non-modifiable in AO) mode for named repository objects. I think we can work this ("do not overwrite existing chunks") into AO in 1.1 as a behaviour change that would largely retain backwards compat (e.g. borg 1.0 create would still work, IIRC we don't even have anything that wouldn't work anymore).
Protected named objects -> is where the incremental check information goes.
Or something like that.
I'm wondering the same, how to automate the prune.
If I use a second client that has write access to actually make the prune, the hacked client could have done:
- some bad delete/prune
- sending wrong new backups (while keeping the good underlying data) over a long period of time without me noticing
To mitigate the first, I'd create a new append-only mode that allows neither prune nor delete. About the second, I don't know if mitigation is possible (monitoring the hash of the binary? I guess that is outside the scope of borg).
For the first attack, is it what you have in mind? Do you have an idea if it is doable? And how hard would it be to implement such a mode?
Basically protected (non-modifiable in AO) mode for named repository objects. I think we can work this ("do not overwrite existing chunks") into AO in 1.1 as a behaviour change that would largely retain backwards compat
Are there any current plans to implement this feature?
@mist there is no current work on this, AFAIK. Not sure whether the idea is detailed enough / proof-checked enough yet.
:(
Another idea for preventing the deletion of data tagged as deleted by an append-only client, while pruning with a read/write client, could be a --check-pruning-pattern option that:
- would look into the transaction log to find the date of the first transaction,
- would check whether the archives present in the repo from that date until today match the pruning pattern,
- would attempt to restore any missing ones.
Without being a protection against low-level wizardry, this would protect someone doing automatic pruning against the most obvious attacks/mistakes.
As you noticed already: not really effective against attacks.
How should one currently do such an append-only audit manually? As in, how can the transaction log be inspected by a trusted client?
One idea behind backups is to restore data from a compromised machine. Before doing such a restore, one must first verify that the compromised machine did not tamper with those backups.
Take this simple scenario. There is only one client, the server is in append-only mode, and nothing is ever pruned, so no information can be lost. The client becomes compromised and silently turns malicious, e.g. deleting and recreating archives in order to make recovery harder. Then, an unknown and possibly long time later, the compromise is detected. How can one discover which transactions might have been malicious, if any, e.g. which transactions contain a deletion? (Assuming either a new trusted client and/or server access.)
The only procedure I can see now is to iteratively delete the last transaction, do a list command, and see whether an archive appears that wasn't there in the previous iteration. If so, the last transaction must have deleted that archive. This seems like a cumbersome process; is there a simpler way to see what a transaction does? (Would this approach work at all?)
(I'm quite impressed with borgbackup, and some of its features require adapting my mental model. I'm trying to prepare for the inevitable situation where restoring from backup is necessary due to a compromised machine.)
Actually, the time between compromise and detection does not really matter. Scenario: the client becomes compromised and immediately deletes archives on the server and creates new ones with the same names (or at least, it is suspected that the compromised client did that). The compromise is detected quickly, and the time of compromise is accurately determined. Even then it is necessary to inspect the transaction log to determine whether "BackupOfLastTuesday" is indeed from last Tuesday and was not overwritten by the compromised machine on Thursday.
E.g. this scenario:
export BORG_PASSPHRASE='abcde'
borg init --encryption repokey --append-only testappend
# Everything is well, daily backups
echo "good1" > mydata.txt
borg create testappend::backup1 mydata.txt
echo "good2" > mydata.txt
borg create testappend::backup2 mydata.txt
# Machine gets compromised, all backups replaced
borg delete testappend::backup1
echo "bad1" > mydata.txt
borg create testappend::backup1 mydata.txt
borg delete testappend::backup2
echo "bad2" > mydata.txt
borg create testappend::backup2 mydata.txt
# Daily bad backups
echo "bad3" > mydata.txt
borg create testappend::backup3 mydata.txt
# Compromise is detected
Now the question is: how to figure out which transaction corresponds to this first delete command? (Spoiler: transaction 7.) Both iterating backwards (removing data files as described above) and iterating forwards (starting with an empty repository and putting data files back) work. Iterating forwards seems easier, as iterating backwards requires deleting the cache etc. each time.
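Assuming a borg 1.x repo layout, where each transaction appends segment files under data/, the forward replay could be scripted roughly like this (it pokes at repo internals, so treat it purely as a sketch, not a supported interface):

```shell
# Sketch: replay a borg 1.x repo forward one segment file at a time and
# list the archives after each step, to see what each transaction did.
# Relies on repo internals (data/0/<segment> layout, index/hints files);
# larger repos have more subdirectories under data/ than just data/0.
replay_forward() {
    src=$1 dst=$2
    rm -rf "$dst"
    cp -r "$src" "$dst"
    rm -f "$dst"/data/0/*                  # start with no segments at all
    rm -f "$dst"/index.* "$dst"/hints.*    # let borg rebuild the index
    for seg in "$src"/data/0/*; do
        cp "$seg" "$dst/data/0/"
        echo "--- after segment $(basename "$seg") ---"
        # the client-side cache may also need clearing between steps
        borg list "$dst" || echo "(repo not consistent at this point)"
    done
}
```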
This is good enough for me for now, as I've verified that I'll be able to recover from a corrupted append-only repository. But if there is an easier way to determine the contents of a transaction other than replaying it, then I'd love to learn about it.
Some other notes. There are also timestamps for the archives, but I'm assuming these are from the client side and therefore cannot be trusted. It could be that there are other bad transactions, like the compromised machine in the scenario above first creating bad backup3 before deleting backups 1 and 2. And I ignored overwriting blocks.
A lot has changed in the borg master branch ("borg2"), so quite a lot of the stuff that referred to borg 1.x no longer applies there.
About auditing: giving --debug will now cause borgstore to output a kind of "access log", so one can easily see what's happening in the repository storage.
#8837 implemented BORG_REPO_PERMISSIONS=read-only env var - most of the repo will be r/o then, except the locks/ directory.
It also implements BORG_REPO_PERMISSIONS=no-delete - with that, destructive operations (like delete and overwrite) will be denied for the archives/ and data/ directories.
This is only implemented for the posixfs backend of borgstore (which is used by borg for file: and ssh: repos).
If one uses some other kind of repo (e.g. cloud), this needs to be implemented by cloud/server-side permissions configuration.
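A minimal sketch of how those modes might be combined, assuming file: or ssh: repos served by the posixfs backend (for ssh: repos the variable would presumably be set in the server-side environment that runs borg serve):

```shell
# Untrusted backup clients: may add archives/data, but never delete or
# overwrite them (archives/ and data/ become effectively append-only).
export BORG_REPO_PERMISSIONS=no-delete

# Trusted auditing client: the whole repo is read-only except locks/.
export BORG_REPO_PERMISSIONS=read-only
```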