core icon indicating copy to clipboard operation
core copied to clipboard

Bugs fix and improvements to Node.Start() and Node.Stop()

Open hackintoshrao opened this issue 3 years ago • 17 comments

Context: I love the project and what it stands for. It's very important for an impactful and complex project like this to have great coverage of tests. I started working on writing unit and integration tests for the entire codebase. While I work on tests I'm fixing things as I find them. Here is the summary of the PR

  • Fixes https://github.com/deso-protocol/core/issues/413
  • Making Node.IsRunning flag atomic.
  • Renaming test utility functions for better readability.
  • Utility functions for changing the status of the node.
  • Unit tests for the utility functions.

hackintoshrao avatar Nov 06 '22 12:11 hackintoshrao

Wow great work! We're reviewing and should get back shortly.

diamondhands0 avatar Nov 07 '22 20:11 diamondhands0

@AeonSw4n @lazynina : I made the changes as per your suggestions, thank you for the great inputs and the timely reviews. Here's the summary of my changes.

I find "get/set" more intuitive than "load/change" (though it's just my personal preference). Since this getter isn't holding a mutex, maybe we rename it to GetStatusWithoutLock

Changed LoadStatus->GetStatus and ChangeStatus-> SetStatus

also maybe we make this method private and wrap it with some other GetStatus() that does actually hold the status mutex? would be best if all the mutex logic lives within these functions instead of handled by the callers.

created private functions with suffix withoutLock, and wrapper GetStatus and SetStatus which uses locks inside them. Moved the locks which were earlier held at multiple places to a few set location at the beginning of function definition.

It feels like you're using statusMutex very similarly to how runningMutex is used. How about we just get rid of runningMutex and replace it with the statusMutex? In this case, we would also get rid of node.statusMutex.Lock()/Unlock() in here, since the mutex will be held inside of node.Stop()

Got rid of runningMutex and replaced it with statusMutex. Now, statusMutex is held inside of node.Stop() and node.Start(), this eliminates the use of locks at multiple places.

let's move the node.statusMutex.Lock()/Unlock() and the if !node.IsRunning(){...} condition inside the node.Stop(). Since Stop() should never be called when node is already stopped it feels reasonable to isolate everything inside Stop. Same goes for the NodeErase case.

Node.Stop() automatically takes care of validating the current status of the node before stopping. Hence, removed the !node.IsRunning check.

This is the setter for the node status, similarly to my comment about the getter LoadStatus(), maybe we rename it to setNodeStatusWithoutLock()?

Renamed changeNodeStatus to changeStatusWithoutLock, and created a wrapper ChangeStatus which holds the lock internally.

big piece from my review is that we don't want to have these mutexes flying around all the calling functions, but should instead handle the lock/unlock in the helper functions as much as we can.

Mutexes don't fly anymore :) 🚀

hackintoshrao avatar Nov 15 '22 12:11 hackintoshrao

@hackintoshrao thanks for making the updates! Just wanted to make a quick update on my end and let you know reviewing and testing all your code is a high priority for me, and will get to it once I finish being tunnel-visioned on access groups & group chats.

AeonSw4n avatar Nov 24 '22 02:11 AeonSw4n

Thanks @AeonSw4n . Just that this PR will enable me to do tons of contributions. I'm sweeping through the codebase writing tests, but I can't move forward till this is merged.

May be I can help with access groups and chats by writing tests and manually testing them.

hackintoshrao avatar Nov 24 '22 02:11 hackintoshrao

Oh, that's so awesome @hackintoshrao I'm so excited to see your other contributions! Okay, I'll look into your code in the evening. Speaking of tests, I've actually been working on a new transaction unit testing framework lately. I'll PR it out of my current code so it's easier to read lol -- figured that might be interesting to you.

So cool to see your readiness to help with access groups! If you feel like checking out the code the latest branch is in this PR #412. Let's wrap this PR, and I'll think of what could be the best use of your time. :)

AeonSw4n avatar Nov 24 '22 02:11 AeonSw4n

Thanks @AeonSw4n . Getting this merged will also help me fix tests for the #412 PR. Eager to check out the transaction test framework.

(On a different note, so cool to see Badger! I used to work at Dgraph, the company behind Badger.)

I like the idea of the transaction test suite. I've been thinking of a framework with parallel tests to benchmark and stress test the core's performance. But, many things need to clean up before getting there. Hence, I'm running across the codebase with tests.

I'm curious to monitor the Go routine count and graceful exits while stress-testing the transactions, I have some early suspicions of Go-routine leaks. PS: I have OCD for an unusually excessive amount of testing😉

Overall this new transaction test framework can be used for testing a lot of things around leaks, performance, and correctness. Keen on checking it out and contributing to it.

hackintoshrao avatar Nov 24 '22 09:11 hackintoshrao

Just reviewed your latest code, and I've got some minor change requests, but giving you an approve because your code is really awesome! One more comment I have is, do we need these .DS_Store files? We didn't have them before and they feel a bit redundant.

Thank you @AeonSw4n :) I'll remove the .DS files, they are redundant.

Once again thank you for this contribution and your flexibility when faced with our feedback 😃 @lazynina @diamondhands0 can you guys also take a look so we can merge?

I learned quite a bit from the reviews :)

hackintoshrao avatar Nov 25 '22 00:11 hackintoshrao

@AeonSw4n Found the transaction test framework inside the block_view_test.go. Working on some refactoring on the suite to improve its usability, for instance, to try parallel transaction scenarios with multiple accounts and verify their sanity.

Is there a better channel to communicate while I work on the codebase than Github comments😅

hackintoshrao avatar Nov 25 '22 04:11 hackintoshrao

@AeonSw4n Found the transaction test framework inside the block_view_test.go. Working on some refactoring on the suite to improve its usability, for instance, to try parallel transaction scenarios with multiple accounts and verify their sanity.

Is there a better channel to communicate while I work on the codebase than Github comments😅

Prallelizing tests/transactions could be fun @hackintoshrao :) Though keep in mind they will always be processed sequentially in the mempool. We are planning to build actual prallelization for transaction processing a little further on the horizon -- probably after/concurrently with PoS. I've also refactored the testing framework into another PR #422. Before it's merged I'm planning on adding a state verification for before & after connect-disconnect to Postgres tests (currently this only works for Badger tests) and just documenting everything a little better.

Re communication channel: certainly! I think you were chatting with Dylan over Discord, is that a good channel? I'll reach out to you soon (almost done with access groups 🤞).

AeonSw4n avatar Nov 30 '22 22:11 AeonSw4n

@AeonSw4n Discord is fine, I just need to map the Github accounts to the Discord user names!

I'll look into #422 and look into opportunities to contribute.

hackintoshrao avatar Dec 01 '22 01:12 hackintoshrao

@lazynina The PR is ready as per the suggestions, do let me know if there's anything more needs to be done for this to be merged :)

hackintoshrao avatar Dec 11 '22 15:12 hackintoshrao

@lazynina The PR is ready as per the suggestions, do let me know if there's anything more needs to be done for this to be merged :)

@hackintoshrao I have been testing this out by running and stopping a node, but ran into a case where one of the status change validations returns an error and the node will not switch to running. Trying to figure out exactly what's going on and to test out more stop + restart situations to make sure we don't have nodes that get stuck. I have only reproduced it once but it's a little unclear what is happening. As I have more details, I'll share them

lazynina avatar Dec 11 '22 15:12 lazynina

Thank you @lazynina . Please do report the edge scenarios when you have them.

hackintoshrao avatar Dec 12 '22 11:12 hackintoshrao

Thank you @lazynina . Please do report the edge scenarios when you have them.

I can't reproduce it - maybe it happened in a dream! @diamondhands0 for final review.

lazynina avatar Dec 15 '22 16:12 lazynina

Thanks @lazynina . Will wait for the review by @diamondhands0 .

hackintoshrao avatar Dec 16 '22 13:12 hackintoshrao

@hackintoshrao - can you provide more details on how to reproduce the bug ( #413 ) and what specifically in this PR is fixing that bug? @diamondhands0

lazynina avatar Dec 22 '22 16:12 lazynina

The bug occurs when the shouldRestart flag is true during the first run of the server.

In this scenario, nodeMessageChan receives a lib.NodeErase message, but at this point, node.IsRunning is still set to false.

Hence, instead of safely recovering, it panics here.

Here is a quick screen recording explainer https://www.loom.com/share/410232d7a6ab4d1e8610c9259de1cbf6 .

hackintoshrao avatar Dec 22 '22 17:12 hackintoshrao