hxcpp icon indicating copy to clipboard operation
hxcpp copied to clipboard

Missing Ptrs / Bounty US 1000$ to fix the Garbage Collector

Open sibest opened this issue 4 years ago • 45 comments

Hi,

I've already worked on the GC before. Basically, some pointers are missing probably in MarkConservative or somewhere else, and this is leading my game (http://playruneverse.com) to random crashes.

I offer a 1000$ Bounty to help me fix the GC. mark_crash_2 mark_crash

sibest avatar Feb 11 '21 21:02 sibest

Hi! This looks familiar. Can you try merging https://github.com/proletariatgames/hxcpp/pull/10 into your project and seeing if that fixes it?

waneck avatar Feb 12 '21 00:02 waneck

No luck : (

sibest avatar Feb 12 '21 00:02 sibest

There are a number of reasons why this could be happening - it is either:

  1. a bug in the GC system
  2. the compiler optimizing something in an unexpected way
  3. a missed reference due to an oversight in the haxe generated code
  4. an issue with native integrations
    • threads
    • storing haxe objects.
  5. something new

And the diagnostic tests:

  1. The crash you see here is in the multi-thread marking. The problem is (very probably) not there, but earlier when either a live object overwrites some random memory, or more likely a reference got missed somewhere and this memory is allocated to a different object. To debug this, uncomment the line: #define HXCPP_GC_DEBUG_LEVEL 1 in hx/gc/Immix.cpp This will run a bit slower, but not so bad that you can't leave it on while testing. This will generate a much more informative callstack in the case of a crash.

  2. Does the problem happen in debug mode? If not, it might be an optmization issue.

  3. Are you using generational garbage collection? If so, adding the haxe defines -D HXCPP_CHECK_POINTER -D HXCPP_GC_CHECK_POINTER could help identify the problem early. This may slow things down though.

  4. Do you run many threads or native integrations? One thing to watch out for is callbacks from native code, such as audio sub-systems, or async responses.

  5. We will see

Is it easy in general is it to get it to crash? If the bug happens early, defining HX_GC_FIXED_BLOCKS in Immix.cpp can help with getting consistent crashes at the same memory location, which can be handy for debugging.

The first thing to try is HXCPP_GC_DEBUG_LEVEL and this might point right to the problematic object, and go from there.

hughsando avatar Feb 12 '21 02:02 hughsando

Hi @hughsando

  1. Yes it happens in debug mode too
  2. It's happening both in non-GC and GC almost
  3. It's crashing very early, so not using threads or callbacks

Even with HX_GC_FIXED_BLOCKS it's crashing randomly (maybe in the stack there are previous values even after the program run?)

Anyway using recursive marking (inPtr->__Mark(__inCtx);) the program does not crash. But i've checked the code of MarkContext and MarkChunk and seems correct. Maybe it's the marking order?

Do you have telegram? I will pay your time. Thank you much

sibest avatar Feb 12 '21 14:02 sibest

No, it's crashing with recursive marking too. Just later. <-- NO Sorry, it was crashing because I had FIXED BLOCK enabled and i limited it to 100mb only. With recursive calls it does not crash.

sibest avatar Feb 12 '21 14:02 sibest

I've one question:

In MarkConservative: You manage sgCheckInternalOffset in GetEnclosingAllocType, but this way the sgCheckInternalOffset could be outside the boundary of the vptr block and miss the allocation.

Shouldn't be more appropriate to do something like this: for(int *ptr = inBottom ; ptr<inTop; ptr++) { for(int dx=0;dx<=sgCheckInternalOffset;dx+=4) { void *sptr = *(void **)ptr; void *vptr = (char*)sptr - dx; in MarkConservative?

sibest avatar Feb 12 '21 22:02 sibest

I think these are equivalent because if it is truly a pointer to the middle of an object, then the start and the middle should both be in the same vptr block since real objects so not span blocks. Does this make sense?

It is possible there is a bug in the MarkChunk code, but this has been reasonably well battle tested. Perhaps in some unusual case like a massive linked-list or similar, or maybe a low-memory situation (which it does not look like here).

The ability to reproduce bugs consistently can be influenced by things such as an interactive startup, where a mouse-event can introduce some randomness. It looks like the first bug is coming from an Xml object - sometimes these things can be reproduced more reliably by, for example, putting the xml parsing in a loop and running until it crashes. If this is the case, then you might be able to isolate a small amount of code the reproduces the bug. If you send this to me, I should be able to get to the bottom of it.

But to summarize:

  • it happens in debug mode. This rules out most optimization problems and probably sgCheckInternalOffset too, which usually happens when windows optimizer keeps a pointer to the data pointer in an array, rather than the beginning of the array.
  • It happens in non-generational mode? This rules out quite a few possibilities
  • debug_level 1 fixes things so could be a MarkChunk issue. There are a lot of checks for MAX_MARK_THREADS > 1 in Immix.cpp. If these are changed to >0 then the Chunk code will get run with a single thread. If this reintroduces the problem, then there could be an issue with the MarkChunk logic, it not, then it could be a race condition in the MarkChunk code. But this could also be a side-effect of something else, like memory overwrite.
  • Threading and native code is pretty much ruled out

My best guess now is a bug in the xml objects internal implementations , which include maps and arrays and dynamic arrays and pretty much all the internal types. Next would be a MarkChunk race condition.

If you can get a program to crash in an xml parsing loop, that would be super helpful.

hughsando avatar Feb 13 '21 02:02 hughsando

Can you get it to crash on Linux? We've found that it's very useful to use https://rr-project.org/ to debug GC crashes, since you can trace back what happened to that pointer before

waneck avatar Feb 13 '21 18:02 waneck

Unfortunately my project doesn't compile on linux. Is it possible that Map<ObjectKey,Something> ObjectKey are not marked ? Maybe only Values are marked ?

sibest avatar Feb 13 '21 18:02 sibest

Ok no, I've tested the Hash and it's working properly, plus i've tried forcing the mark of weak pointers too and it's crashing the same way. I've tried to change the MarkChunk thing with a simple list and it's crashing the same way, so there is no fail in MarkChunk. About Xml, sometime it crashes there, sometime a little later, sometime before the Xml.

sibest avatar Feb 13 '21 18:02 sibest

It's crashing also in Mac and iPhone.

sibest avatar Feb 13 '21 22:02 sibest

I think I got it. Random stack data could fool GetAllocTypeChecked: it must check that the pointer is not inside an hole. Moreover allocStart flags must be set to 0 in reclaim when the rowMarker is empty, otherwise GetAllocType can fail with the nursery check. Tomorrow I will issue a pull.

sibest avatar Feb 14 '21 04:02 sibest

https://github.com/HaxeFoundation/hxcpp/pull/946

sibest avatar Feb 14 '21 16:02 sibest

Even if this pull reduces the problem by a lot, the only true solution is to update allocStart everytime (full collect). It's possible that with an outdated allocStart flag, an invalid pointer can be found on the stack pointing to a valid allocated space inside the blocks with a fake valid header leading to a crash. Very very rare, but can happen. Other solutions are appreciated. Maybe the zero thread can update the allocStart flag while not collecting? Or, we can scan the nursery outside holes too and skip the allocStart flag.

sibest avatar Feb 14 '21 20:02 sibest

Checking again:

  • this only happens for generational
  • it also happens in debug mode

In debug mode sgCheckInternalOffset can be set to 0 IIRC. You will also note that this is the case for non MSVC so if the bug happens in mac/ios the internal offset is probably not the cause.

Also, register capture is not normally an issue in debug mode, and again if it happens on Arm64, it is also unlikely to be the issue.

If it is just an generational issue, then there is the HX_GC_VERIFY define that performs a second pass to check for missed object, although this might not help if it is a conservative marking issues.

The general principle is that in nursery holes, allocStart should be zero, and the only way to find an object start is to scan the size links. When a nursery object is marked, it should set the allocStart and then normal logic applies. Maybe this setting should be delayed until the conservative marking is over so it does not affect other objects on the same row, or every object on that row should be promoted, or use an explicit check for inHole like you suggest. I also note that there are some thread primitives in this code and a bug would fit with the symptom of not breaking in a single-threaded recursive mark situation.

hughsando avatar Feb 15 '21 02:02 hughsando

Actually it's simple to understand:

0                             Row 0                             127
+---------------------------------------------------------------+
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 
+---------------------------------------------------------------+
                        32 allocStart flags

In the first time two allocations are made in Row 0:

0                             Row 0                             127
+---------------------------------------------------------------+
|1| | | | | | | | | |1| | | | | | | | | | | | | | | | | | | | | | 
+---------------------------------------------------------------+
| Allocation 1 = 40 bytes
| Allocation 2 = 88 bytes - full row
+---------------------------------------------------------------+

2x allocStart flag

  • Then a NON-full collect goes on ad frees the entire row (but not the allocStart flags)
  • A new allocation takes place:
0                             Row 0                             127
+---------------------------------------------------------------+
|1| | | | | | | | | |1| | | | | | | | | | | | | | | | | | | | | | 
+---------------------------------------------------------------+
| Allocation 1 = 96 bytes 
+---------------------------------------------------------------+

As you can see the second allocStart flag is still set to 1

MarkConservative then kicks in, in the stack a valid pointer that is pointing to the old second allocation is found AND for for a coincidence the 96 bytes allocation (which is for example a small ByteArray) contains data at the place of the second old allocation WHICH is considered a valid header <-- For some reason the markId is considered valid. But it's not a real allocation, so it tries to MarkAllocObject and then CRASH.

VisitBlock is also flawed as it only relies on allocStart flag to check if an allocation is starting there, BUT with previously outdated set allocStart the same thing goes on.

sibest avatar Feb 15 '21 06:02 sibest

All reclaims (full or not) should ZERO_MEM all the allocStarts for the hole, so it should not be possible to allocate an object that covers an existing allocStart mark. https://github.com/HaxeFoundation/hxcpp/blob/master/src/hx/gc/Immix.cpp#L1081

Because it is expensive to update the allocStarts, their meaning is "There was an object allocated here within the last N non-generational collects" rather then "there is certainly an object here".

The starts are only for conservative object detection. The row marks are for free row and hole detection.

Event with an allocStart, an object could be stale (the header is good, but the contents are bad) if it is not from the current allocation cycle. So an additional check is needed for the allocation cycle, time==gByteMarkID. But only a few bits are used for storing this time, so the allocStarts are updated before the time can "clock over". This just removes stale object marks, but does not free the row. At the time that this is run we know rowMarked[r] is true - ie, so at least something on the row is alive.

The nursery objects are different because they do not get a allocStart mark until we are certain we are keeping them. Instead we must follow the linked-list of sizes to see if it lands on an object pointer.

So I'm thinking either marking one conservative object messes with the size linked-list somehow and prevents other objects from being retained, or the HxAtomicExchangeIf is not working as I intended or there is something wrong at a higher level.

hughsando avatar Feb 15 '21 08:02 hughsando

You are right, it's something else. It's the size linked-list which is broken. Maybe a write overflow somewhere. I will see..

sibest avatar Feb 15 '21 11:02 sibest

Yes, I think I see the problem - it is with allocStart as you suggest, when a previous row has an object at the end, and the next row is the start of a hole. The fix is pretty simple because we know the trailing object is invalid.

hughsando avatar Feb 16 '21 08:02 hughsando

I see but did you change only the code when sgCheckInternalOffset is active? Because it crashes on mac too

sibest avatar Feb 16 '21 11:02 sibest

Yes indeed. May need to look for a second bug because this is specific to offsets that can span rows. I have added a define HX_GC_ZERO_EARLY which can help with some reproducibility issues. But back to the diagnostics - does it happen in debug mode, is it only generational? If debug fixes it, it could be that mac now needs sgCheckInternalOffset too.

hughsando avatar Feb 16 '21 12:02 hughsando

Unfortunately it's crashing in debug mode too on Mac

sibest avatar Feb 16 '21 12:02 sibest

Step 1: Write objects into HOLE 0 Inside the nursery Step 2: Collect. HOLE0 is now free again Step 3: Collect, MarkConservative scan for the nursery and find some valid header inside Hole0 of older objects (the hole was not zeroed). Can this give a problem ?

sibest avatar Feb 16 '21 12:02 sibest

So debug mode mostly rules out sgCheckInternalOffset and register capture. Next would be to put on HX_GC_ZERO_EARLY and HXCPP_GC_DEBUG_LEVEL 1. I will see if I could reproduce something on mac.

hughsando avatar Feb 16 '21 12:02 hughsando

Maybe the allocStart flag should be set for nursery object too, and the header checked to verify if it is nursery or marked object?

sibest avatar Feb 16 '21 12:02 sibest

It should not be the problem because the MarkConservative happens after the writes are finished and before they are zeroed. The blocks have a flag to say whether they are zeroed - could check this at runtime. They should be zeroed before they are used unless there is a bug. The HX_GC_ZERO_EARLY should make sure this is true, so if this fixes things, there may be a logic bug.

hughsando avatar Feb 16 '21 12:02 hughsando

It's fixed in my test build now, i tell you what changes I made in a second.

sibest avatar Feb 16 '21 12:02 sibest

Once it gets to a point where I am out out of ideas, the next step is to add code to verify the assumptions. For example, in some kind of debug mode, create a separate "nurseryAllocStart" array and set this for all objects. Then in the collection code, loop over everything and verify that these flags exactly match the GetAllocType logic - or at least for every conservative object. This should pinpoint the problem if it is there.

hughsando avatar Feb 16 '21 12:02 hughsando

About the allocStart for nursery object i was thinking at something like this: Step 0: Hole 0 is used to write an object that refers to another object in Hole 1. Step 1: Hole 0 and Hole 1 are reclaimed - but the nursery data is still there. Step 2: Hole 1 is used and zeroed by another block of some kind. Step 3: Some pointer found on stack goes on Hole 0 and revives the old object, and marks it, it's true to say that that object never stopped living, but the othe objects refered to him did. So crash.

I think that a nurseryAllocStart flag is necessary at this point, what do you think ?

sibest avatar Feb 16 '21 12:02 sibest

This really should not happen because after Step 1, the data might be there for a little while, but by the time Step 3 happens one of these things should be true:

  1. The row with the data on is kept alive by another object, in which case it will not be in an hole range so should not show up as a nursery object.
  2. The row was not kept alive, so it should be zero and/or resused with valid objects
  3. The row was reused and the stack accidentally points to new object - this is ok, the valid object will stay alive longer.

Point 1 seems the most likely to be wrong. It would need to have a valid allocStart flag and a valid time code but have been missed on the stack at least once so its contents have been collected.

hughsando avatar Feb 16 '21 13:02 hughsando