Popular-RL-Algorithms
About RL+LSTM
Hello. As I understand it, the Markov property means that the agent's current action depends only on the current state s_t. However, in your open-source RL+LSTM code, the policy network takes both s_t and a_{t-1} as input. Does this type of algorithm still converge during training? Thank you again for your open-source code.
I see that this question is already addressed in the references you cite.
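For anyone who lands on this issue later, here is a rough sketch of what "the policy input is s_t and a_{t-1}" looks like in practice. It is only an illustration in NumPy, not the repository's actual network: the class `TinyLSTMPolicy`, the function `rollout`, the layer sizes, and the choice of zeros for the first "previous action" are all assumptions made for this example. It shows the input wiring only and says nothing about convergence.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyLSTMPolicy:
    """Hypothetical minimal LSTM policy; NOT the repository's network."""

    def __init__(self, state_dim, action_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = state_dim + action_dim  # input is concat(s_t, a_{t-1})
        # One stacked weight matrix for the input, forget, cell and output gates.
        self.W = rng.normal(0.0, 0.1, (4 * hidden_dim, in_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.W_out = rng.normal(0.0, 0.1, (action_dim, hidden_dim))
        self.hidden_dim = hidden_dim
        self.action_dim = action_dim

    def initial_memory(self):
        # (h, c): hidden state and cell state, both zero at episode start.
        return np.zeros(self.hidden_dim), np.zeros(self.hidden_dim)

    def step(self, state, last_action, memory):
        h, c = memory
        x = np.concatenate([state, last_action, h])
        i, f, g, o = np.split(self.W @ x + self.b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        action = np.tanh(self.W_out @ h)  # deterministic action in [-1, 1]
        return action, (h, c)


def rollout(policy, states):
    """Feed a sequence of states, passing a_{t-1} and the LSTM memory forward."""
    memory = policy.initial_memory()
    last_action = np.zeros(policy.action_dim)  # assumption: a_{-1} = 0
    actions = []
    for s in states:
        last_action, memory = policy.step(s, last_action, memory)
        actions.append(last_action)
    return np.array(actions)
```

Calling `rollout(TinyLSTMPolicy(3, 2, 8), np.ones((5, 3)))` returns one action per step. Even though every state in that sequence is identical, the actions differ from step to step, because the policy also sees a_{t-1} and its own memory. In other words, the policy conditions on history rather than on s_t alone. The Markov property itself is a statement about the environment's transitions, so feeding extra history to the policy does not contradict it.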