Training mode and Q-policies
Q-policies have no way of knowing whether the Q-network underlying them should be called with training=True or training=False. By default they implicitly always call the network in inference mode, and it is not clear to me why one would want that as the default, since layers such as dropout and batch normalization behave differently in the two modes (illustrated in the sketch further below). This means that in a standard scenario like
from tf_agents.agents.dqn.dqn_agent import DqnAgent
from tf_agents.drivers.dynamic_episode_driver import DynamicEpisodeDriver

tf_env = ...  # some TFEnvironment
replay_buffer = ...  # e.g. a TFUniformReplayBuffer
tf_agent = DqnAgent(...)
collect_policy = tf_agent.collect_policy
replay_observer = [replay_buffer.add_batch]
collection_step = DynamicEpisodeDriver(
    tf_env, collect_policy, observers=replay_observer).run(num_episodes=100)
one would end up with a replay buffer filled with transitions that were sampled in inference mode, which may be undesirable given that the agent will most likely be trained on them.
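
To make the distinction concrete, here is a minimal Keras sketch (plain TensorFlow, not TF-Agents specific) showing that the same network produces different outputs depending on the training flag; any Q-network containing such layers is affected the same way:

import tensorflow as tf

# A toy network with a layer whose behavior depends on the training flag.
net = tf.keras.Sequential([tf.keras.layers.Dropout(0.5)])
x = tf.ones((1, 4))

print(net(x, training=False))  # inference mode: dropout is a no-op, output equals x
print(net(x, training=True))   # training mode: ~half the units zeroed, the rest scaled by 2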
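
For contrast, here is roughly how those transitions would then be consumed, following the standard DQN tutorial pattern (batch size and num_steps are placeholder values): the agent's train step runs the network in training mode, so the data is collected under one mode and learned from under the other.

# Sample the collected transitions and train on them.
dataset = replay_buffer.as_dataset(
    sample_batch_size=64, num_steps=2, num_parallel_calls=3).prefetch(3)
iterator = iter(dataset)
experience, _ = next(iterator)
train_loss = tf_agent.train(experience).loss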