Policy-Gradient-Methods
Query on SAC2018.py file
Could you point me to the paper that motivates using two soft Q-networks? As far as I can tell, the two networks are trained independently, and you take the minimum of their outputs when computing the value loss.
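For concreteness, here is a minimal sketch of the computation I am asking about. It is not the code from SAC2018.py: the function names, the `alpha` temperature parameter, and the plain mean-squared-error loss are my own assumptions, and it uses plain Python lists instead of tensors.

```python
def soft_value_target(q1, q2, log_prob, alpha=1.0):
    """Per-sample value target: min(Q1, Q2) - alpha * log pi(a|s).

    q1, q2   -- outputs of the two soft Q-networks for the same (s, a) batch
    log_prob -- log-probability of each sampled action under the policy
    alpha    -- entropy temperature (hypothetical parameter for this sketch)
    """
    return [min(a, b) - alpha * lp for a, b, lp in zip(q1, q2, log_prob)]


def value_loss(v_pred, v_target):
    """Mean squared error between predicted and target state values."""
    return sum((v - t) ** 2 for v, t in zip(v_pred, v_target)) / len(v_pred)


# Toy batch of two samples (made-up numbers, for illustration only).
q1 = [1.0, 2.0]
q2 = [1.5, 0.5]
log_prob = [-1.0, -2.0]

target = soft_value_target(q1, q2, log_prob)
loss = value_loss([2.0, 2.0], target)
```

In this sketch the smaller of the two Q estimates is used for every sample, so whichever network is more pessimistic on a given sample determines the target.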