Nebula icon indicating copy to clipboard operation
Nebula copied to clipboard

Adds a per-level persistence system.

Open MistakeNot4892 opened this issue 4 months ago • 13 comments

Description of changes

  • Adds /atom/Serialize() to return an assoc list of relevant values suitable for saving out.
  • Adds /atom/Preload() and /atom/PreloadData() and hooks in SSatoms flush to deserialize atoms during flush.
  • Adds /datum/level_data/load_persistent_data() and /datum/level_data/save_persistent_data() as entrypoints for per-level persistence.
  • Adds a fire() override to SSpersistence to do a periodic level data save.
  • Converts the innards of the previous persistence system to use the new save/load handlers and procs.

General flow of level persistence:

  • Saving:
    • SSpersistence periodically iterates the z-level list, finds levels that want to serde, and calls save_persistent_data()
    • Levels return a list of instances to get_persistent_instances(), instances have Serialize() called and return a list of modified fields.
    • Fields are serialized (to JSON with the default handler) and written to disk.
  • Loading:
    • SSmapping initializes and calls preload_persistent_data() and load_persistent_data() on relevant /datum/level_data z-level objects.
    • load_persistent_data() creates the base instances and (for atoms) sets __init_deserialisation_payload with the data loaded from tile.
    • SSatoms flush calls Preload() on all deserialized atoms which pre-populates vars on the atom.
    • Ssatoms proceeds to Initialize() atoms as normal.

TODO

  • ~~Implement area serde.~~
  • [X] Implement datum serde.
  • [x] Implement non-turf loc for serialized atoms.
  • [x] Test with implementation on tradeship.
  • [x] Test that wall construction serializes.
  • [x] Test that wall removal serializes.
  • [x] Test that removing flooring serializes.
  • [x] Test that adding flooring serializes.
  • [x] Test that roofing turf serializes.
  • [X] Test that changing turf height serializes.
  • [x] Get simplemob serde to work.
  • [x] Test legacy persistence:
    • [x] Graffiti
    • [x] Filth
    • [x] Trash
    • [x] Paper
    • [x] Books
  • [x] Implement some kind of grandfather check to convert old persistence format to new.
  • [x] Test legacy migration.
    • [x] Graffiti
    • [x] Filth
    • [x] Trash
    • [x] Paper
    • [x] Books
  • [ ] Implement first-run level generation (suspend level gen if we initialized a level via serde)
  • [ ] Test level generator reordering (split into 2nd PR?)

Future work

  • Completely integrate /decl/persistence_handler into this system instead of the bespoke serde on SSpersistence.
  • Add a bespoke system for serializing area changes (blueprints etc) on a z-level.

Why and what will this PR improve

Adds a framework for handling persistence in the future.

Authorship

Myself.

Changelog

Nothing player-facing.

MistakeNot4892 avatar Oct 20 '25 14:10 MistakeNot4892

Depends on #5175

MistakeNot4892 avatar Oct 20 '25 14:10 MistakeNot4892

I'm going to call ~~datum and~~ area serde out of scope for this PR, I don't have any clue how to handle them in a clean way.

The atom side of this appears to be working fine, I've tested with a debug item and it all seems pretty good.

EDIT: Went ahead and just let it instance datums with the serde data as a list arg, in case it's needed in the future.

MistakeNot4892 avatar Oct 28 '25 03:10 MistakeNot4892

Slipped and accidentally added turf serde, limited testing on Shaded Hills but needs proper testing before I sign off on it.

MistakeNot4892 avatar Oct 28 '25 09:10 MistakeNot4892

Depends on #5182

MistakeNot4892 avatar Oct 28 '25 09:10 MistakeNot4892

Testing notes:

  • ~~shutter state needs to be serialized on walls~~
  • fluids need to be serialized on floors
  • flooring is busted

MistakeNot4892 avatar Oct 28 '25 10:10 MistakeNot4892

Alright so, first, a few things. Mainly about some of the naive mistakes we've made before that I can spot in here.

I really really really recommend making sure whatever your approach is to make sure it scales well with a lot of things to save all of a sudden. Because, you can easily end up with hundred of thousands of atoms to save. Especially if you're saving in real time (for example, someone blows up the whole station, or everything vents to space). Querying over and over would get more and more costly. It gets surprisingly slow quickly. Especially if you're using byond's IO to disk. So, definitely stress test this early on.

Next, error handling. I see you're just kind of erroring out on any exceptions. And it'll come to bite you in the ass later on. Make sure early on to have a mechanism for tolerating some deserialization errors, and even some serialization errors. Ideally on a per atom basis. Strictly terminating the whole process on a single error is going to make you lose saves over really dumb minor things when running on a real server. Like, if for some reasons a single atom on the map that was edited by admins, had it's init code changed, or just bugged out during the game and caused an exception while serializing. And you lose the whole map if your admins aren't coders themselves and realize they can just delete that one thing breaking the save.

Another thing you probably want to do early is allow for runtime generated levels (for example, mining maps or ruins) to possibly be regenerated and not have their state loaded from save. But that will require handling things from the map template level rather than the level data level.

PsyCommando avatar Oct 29 '25 21:10 PsyCommando

Thanks for your thoughts - will have a proper read later, but just offhand:

Next, error handling. I see you're just kind of erroring out on any exceptions.

The exception handling specifically does not terminate the whole process on an error, that's why I've stuffed in so many try/catch blocks. An individual atom will be skipped but the overall process should complete regardless. Only issue I don't have a good solution for is an atom failing to deserialize and then being effectively removed from the save when the serialization pass runs, but that might be desirable.

Querying over and over would get more and more costly.

I don't really follow what this is referring to. It queries once on roundstart, and then once each save run (which is once an hour on current code, but I'll probably make it a config).

It gets surprisingly slow quickly. Especially if you're using byond's IO to disk.

Likewise, it's only doing I/O once per level per load/save run.

MistakeNot4892 avatar Oct 29 '25 21:10 MistakeNot4892

Depends on #5191, #5190, and #5189.

MistakeNot4892 avatar Oct 31 '25 10:10 MistakeNot4892

Thanks for your thoughts - will have a proper read later, but just offhand:

No problems! I don't really have that much to say so far. I'll have to see what you do with it later to get a better idea and make some more helpful comments.

Next, error handling. I see you're just kind of erroring out on any exceptions.

The exception handling specifically does not terminate the whole process on an error, that's why I've stuffed in so many try/catch blocks. An individual atom will be skipped but the overall process should complete regardless. Only issue I don't have a good solution for is an atom failing to deserialize and then being effectively removed from the save when the serialization pass runs, but that might be desirable.

Yeah, exceptions don't terminate it, but IIRC, PRINT_STACK_TRACE terminates the current proc stack to unwind it and write a stack dump? It calls CRASH. Shouldn't it just use error() or rethrow/throw an exception upwards instead? I remember it being a reason we didn't use PRINT_STACK_TRACE unless it was a completely unrecoverable problem on persistence cause it prematurely ended the save loop.

Skipping atoms that fail to deserialize is fine tbh. A partially loaded atom might cause issues anyways. As long as it doesn't prevents everything else being loaded it should be good enough.

Querying over and over would get more and more costly.

I don't really follow what this is referring to. It queries once on roundstart, and then once each save run (which is once an hour on current code, but I'll probably make it a config).

Yeah, you're right about the querying. I didn't notice the subsystem would fire once every hour. My bad.

It gets surprisingly slow quickly. Especially if you're using byond's IO to disk.

Likewise, it's only doing I/O once per level per load/save run.

About this, what I meant here is just, make sure writing to json isn't going to take 40 minutes to save several z-levels with stuff in them. Cause, we ran benchmarks on our old save system, and the bulk of the saving time was writing to file and string operations. Since we began using SQL we bypassed that bottleneck and It can take anywhere from 2-3 minutes to 5 minutes to save a really big map that would take 30+ minutes to save if we had used byond's file IO.

PsyCommando avatar Oct 31 '25 21:10 PsyCommando

Yeah, exceptions don't terminate it, but IIRC, PRINT_STACK_TRACE terminates the current proc stack to unwind it and write a stack dump? It calls CRASH.

PRINT_STACK_TRACE is wrapped in a proc specifically so the CRASH doesn't end whatever context it's called it. If it was prematurely ending runs on Persistence then something else was wrong.

MistakeNot4892 avatar Oct 31 '25 23:10 MistakeNot4892

Yeah, exceptions don't terminate it, but IIRC, PRINT_STACK_TRACE terminates the current proc stack to unwind it and write a stack dump? It calls CRASH.

PRINT_STACK_TRACE is wrapped in a proc specifically so the CRASH doesn't end whatever context it's called it. If it was prematurely ending runs on Persistence then something else was wrong.

Hmm, never mind that then. I'd have to check in a live context again. But I remember it causing issues before, and not giving a stack trace most of the time.

PsyCommando avatar Nov 02 '25 21:11 PsyCommando

I'm not really sure what to do about randomly generated levels. I might pull the reordering logic out of this PR and just prevent placing more mobs if we deserialized. Otherwise we end up with mask turfs, etc. left on the map.

MistakeNot4892 avatar Nov 06 '25 12:11 MistakeNot4892

I'm not really sure what to do about randomly generated levels. I might pull the reordering logic out of this PR and just prevent placing more mobs if we deserialized. Otherwise we end up with mask turfs, etc. left on the map.

should we not just automatically mark those turfs as serialized?

out-of-phaze avatar Nov 06 '25 16:11 out-of-phaze